[Haskell-cafe] Re: Space Efficiency When Sorting a List of Many Lists

Sat Jan 9 12:48:01 EST 2010

Peter Green wrote:
> Heinrich, thanks for some great help and food for thought!

My pleasure. :)

>> Thanks for describing the background of this problem in detail! I was
>> mainly asking because I'm always looking for interesting Haskell topics
>> that can be turned into a tutorial of sorts, and this problem makes a
>> great example.
> 
> Another interesting problem is starting from a file of single wagers and
> trying to compress them  (i.e. inverse of 'explosion'. I believe this
> problem is analogous to Set Cover and therefore NP-Complete. Heuristics
> are the order of the day here.

Interesting indeed! What kind of heuristics do you employ there? It
seems quite difficult to me, somehow.

>>    [...]
>>    main = interact (unlines . toCSV . sortExplode . fromCSV . lines)
>>
> Thank you *very* much for this code! I will try to get my head around
> it. I understand the broad outline you posted elsewhere, but will take
> me a while to fully grasp the above as I'm only up to ~p200 in Real
> World Haskell :).
> 
> As for performance of your code above on my file of compressed wagers
> which expands to 3.7M single wagers:
> 
> (My original version posted here and using vanilla sort)
> 541 MB total memory in use (5 MB lost due to fragmentation)
> INIT  time    0.00s  (  0.01s elapsed)
>   MUT   time   13.98s  ( 15.82s elapsed)
>   GC    time    8.69s  (  9.64s elapsed)
>   EXIT  time    0.00s  (  0.00s elapsed)
>   Total time   22.67s  ( 25.47s elapsed)
> 
> (Your much improved version)
> 10 MB total memory in use (1 MB lost due to fragmentation)
>   INIT  time    0.00s  (  0.00s elapsed)
>   MUT   time    7.61s  (  9.38s elapsed)
>   GC    time    3.48s  (  3.58s elapsed)
>   EXIT  time    0.00s  (  0.00s elapsed)
>   Total time   11.08s  ( 12.95s elapsed)
> 
> Very impressive and thanks again!

Category theory saves the day. ;)

The code is based on three observations:
1) Sorting and exploding can be interleaved to create version that can
stream results in an on-line fashion.
2) There is a standard formulation of this pattern in terms of folds and
unfolds (= catamorphisms and anamorphisms).
3) Lazy evaluation can be used to make this look like an off-line algorithm.

To my surprise, I don't have good references for these, but maybe the
following can help a bit. Something similar to 1) can be found in

  Graham Hutton. The countdown problem.
  http://www.cs.nott.ac.uk/~gmh/bib.html#countdown

and

  Richard Bird. A program to solve Sudoku.
  http://www.cs.tufts.edu/~nr/comp150fp/archive/richard-bird/sudoku.pdf

The standard reference for 2) is

  Meijer et al. Functional programming with bananas,
    lenses, envelopes and barbed wire.
  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.125

and 3) is folklore, maybe

  John Hughes. Why functional programming matters.
  http://www.cs.chalmers.se/~rjmh/Papers/whyfp.html

could serve as an introduction to lazy evaluation.

>> Note that there's also the possibility of not expanding the compressed
>> wagers at all, and perform set operations directly. For instance, it is
>> straightforward to intersect two such sets of size n in O(n^2) time.
>> Since n ~ 5000 , n^2 is about the same ballpark as the exploded single
>> wager combinations.
> 
> Not sure I quite understand you here. In my mind, set elements *are*
> single combinations. It is possible for two quite different-looking
> files of compressed wagers to contain exactly the same single wager
> elements. So I'm not sure how to compare without some notion of
> explosion to single combinations.

What I mean is that there are two ways to represent sets of wagers:

* A list of single combinations ->  [Row a]
* A list of compressed wagers   ->  [Row [a]]

To perform set operations, you first convert the latter into the former.
But the idea is that there might be a way of performing set operations
without doing any conversion.

Namely, the compressed wagers are cartesian products which behave well
under intersection: the intersection of two compressed wagers is again a
compressed wager

   intersectC :: Eq a => Row [a] -> Row [a] -> Row [a]
   intersectC []        []        = []
   intersectC (xs:xrow) (ys:yrow) =
       intersect xs ys : intersectC xrow yrow

In formulas:

     intersect (cartesian x) (cartesian y)
   = cartesian (intersectC x y)

for compressed wagers  x, y :: Row [a]  with

   intersect = list intersection, for instance Data.List.intersect
   cartesian = converts a compressed wager to a list of
               single combinations

This allows you to intersect two lists of compressed wagers directly

   intersect' :: [Row [a]] -> [Row [a]] -> [Row [a]]
   intersect' xs ys = filter (not . null) $
                      [intersectC x y | x<-xs, y<-ys]

Unfortunately, the result will have quadratic size; but if you have
significantly less compressed wagers than single combinations, this may
be worth a try. Not to mention that you might be able to compress the
result again.

Mathematically, this exploits the equations

   (A × B) ∩ (C × D) = (A ∩ C) × (B ∩ D)

   "The intersection of two rectangles is another rectangle"

and

   (A ∪ B) ∩ (C ∪ D) = (A ∩ C) ∪ (A ∩ D)
                       ∪ (B ∩ C) ∪ (B ∩ D)

where A,B,C,D are sets.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com