[Haskell-cafe] Re: Space Efficiency When Sorting a List of Many
Lists
Heinrich Apfelmus
apfelmus at quantentunnel.de
Sat Jan 9 12:48:01 EST 2010
Peter Green wrote:
> Heinrich, thanks for some great help and food for thought!
My pleasure. :)
>> Thanks for describing the background of this problem in detail! I was
>> mainly asking because I'm always looking for interesting Haskell topics
>> that can be turned into a tutorial of sorts, and this problem makes a
>> great example.
>
> Another interesting problem is starting from a file of single wagers and
> trying to compress them (i.e. inverse of 'explosion'. I believe this
> problem is analogous to Set Cover and therefore NP-Complete. Heuristics
> are the order of the day here.
Interesting indeed! What kind of heuristics do you employ there? It
seems quite difficult to me, somehow.
>> [...]
>> main = interact (unlines . toCSV . sortExplode . fromCSV . lines)
>>
> Thank you *very* much for this code! I will try to get my head around
> it. I understand the broad outline you posted elsewhere, but will take
> me a while to fully grasp the above as I'm only up to ~p200 in Real
> World Haskell :).
>
> As for performance of your code above on my file of compressed wagers
> which expands to 3.7M single wagers:
>
> (My original version posted here and using vanilla sort)
> 541 MB total memory in use (5 MB lost due to fragmentation)
> INIT time 0.00s ( 0.01s elapsed)
> MUT time 13.98s ( 15.82s elapsed)
> GC time 8.69s ( 9.64s elapsed)
> EXIT time 0.00s ( 0.00s elapsed)
> Total time 22.67s ( 25.47s elapsed)
>
> (Your much improved version)
> 10 MB total memory in use (1 MB lost due to fragmentation)
> INIT time 0.00s ( 0.00s elapsed)
> MUT time 7.61s ( 9.38s elapsed)
> GC time 3.48s ( 3.58s elapsed)
> EXIT time 0.00s ( 0.00s elapsed)
> Total time 11.08s ( 12.95s elapsed)
>
> Very impressive and thanks again!
Category theory saves the day. ;)
The code is based on three observations:
1) Sorting and exploding can be interleaved to create version that can
stream results in an on-line fashion.
2) There is a standard formulation of this pattern in terms of folds and
unfolds (= catamorphisms and anamorphisms).
3) Lazy evaluation can be used to make this look like an off-line algorithm.
To my surprise, I don't have good references for these, but maybe the
following can help a bit. Something similar to 1) can be found in
Graham Hutton. The countdown problem.
http://www.cs.nott.ac.uk/~gmh/bib.html#countdown
and
Richard Bird. A program to solve Sudoku.
http://www.cs.tufts.edu/~nr/comp150fp/archive/richard-bird/sudoku.pdf
The standard reference for 2) is
Meijer et al. Functional programming with bananas,
lenses, envelopes and barbed wire.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.125
and 3) is folklore, maybe
John Hughes. Why functional programming matters.
http://www.cs.chalmers.se/~rjmh/Papers/whyfp.html
could serve as an introduction to lazy evaluation.
>> Note that there's also the possibility of not expanding the compressed
>> wagers at all, and perform set operations directly. For instance, it is
>> straightforward to intersect two such sets of size n in O(n^2) time.
>> Since n ~ 5000 , n^2 is about the same ballpark as the exploded single
>> wager combinations.
>
> Not sure I quite understand you here. In my mind, set elements *are*
> single combinations. It is possible for two quite different-looking
> files of compressed wagers to contain exactly the same single wager
> elements. So I'm not sure how to compare without some notion of
> explosion to single combinations.
What I mean is that there are two ways to represent sets of wagers:
* A list of single combinations -> [Row a]
* A list of compressed wagers -> [Row [a]]
To perform set operations, you first convert the latter into the former.
But the idea is that there might be a way of performing set operations
without doing any conversion.
Namely, the compressed wagers are cartesian products which behave well
under intersection: the intersection of two compressed wagers is again a
compressed wager
intersectC :: Eq a => Row [a] -> Row [a] -> Row [a]
intersectC [] [] = []
intersectC (xs:xrow) (ys:yrow) =
intersect xs ys : intersectC xrow yrow
In formulas:
intersect (cartesian x) (cartesian y)
= cartesian (intersectC x y)
for compressed wagers x, y :: Row [a] with
intersect = list intersection, for instance Data.List.intersect
cartesian = converts a compressed wager to a list of
single combinations
This allows you to intersect two lists of compressed wagers directly
intersect' :: [Row [a]] -> [Row [a]] -> [Row [a]]
intersect' xs ys = filter (not . null) $
[intersectC x y | x<-xs, y<-ys]
Unfortunately, the result will have quadratic size; but if you have
significantly less compressed wagers than single combinations, this may
be worth a try. Not to mention that you might be able to compress the
result again.
Mathematically, this exploits the equations
(A × B) ∩ (C × D) = (A ∩ C) × (B ∩ D)
"The intersection of two rectangles is another rectangle"
and
(A ∪ B) ∩ (C ∪ D) = (A ∩ C) ∪ (A ∩ D)
∪ (B ∩ C) ∪ (B ∩ D)
where A,B,C,D are sets.
Regards,
Heinrich Apfelmus
--
http://apfelmus.nfshost.com
More information about the Haskell-Cafe
mailing list