[Haskell-cafe] Space Efficiency When Sorting a List of Many Lists

Wed Dec 30 23:57:32 EST 2009

On Wed, Dec 30, 2009 at 8:39 PM, Peter Green <kinch1967 at me.com> wrote:
> I'm a Haskell neophyte, so may be missing something obvious in the problem
> outlined below. I'm fairly proficient in Python + have some limited
> experience in OCaml and F#, so know just enough to be be dangerous, but not
> nearly enough to really know what I'm doing here.
>
> OK, I have text files containing 'compressed lists' Compressed lists look
> like this:
>
> 8+12,11,7+13,10
> 1+2+3,1+9,3+6,4
> .
> .
>
> Sublists are comma-delimited, and sublist elements are separated by '+'
> character.
>
> I parse these to look like so:
>
> [[[8,12],[11],[7,13],[10]],
>  [[1,2,3],[1,9],[3,6],[4]],
> .
> .
> ]
>
> I need to explode these and produce a lex-sorted list of exploded lists:
>
> [[1,1,3,4],[1,1,6,4],[1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4],[3,1,3,4],[3,1,6,4],[3,9,3,4],[3,9,6,4],
>  [8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10]]
>
> I then output this data as comma-delimited lists:
>
> 1,1,3,4
> 1,1,6,4
> .
> .
> 12,11,13,10
>
> I can do all this in my (no doubt fairly naive) program shown below. In real
> life, one of my test data files contains compressed lists which all contain
> 7 sublists. This data (correctly) explodes to a list containing ~3,700,000
> exploded lists. All good as far as correctly transforming input to output
> goes. However, sorting the data uses a lot of memory:
>
> Partial output from ./explode +RTS -sstderr:
>
> 540 MB total memory in use (4 MB lost due to fragmentation)
>
> If I do not do any sorting on the exploded lists, i.e. modify the main
> function to be
>        main = interact (unlines . toCSV . explode . fromCSV . lines)
>
> I then see this partial output from ./explode +RTS --stderr:
>
> 2 MB total memory in use (0 MB lost due to fragmentation)
>
> I can guess that there might be be less laziness and more instantiation when
>  sorting is introduced, but my questions are:
>        (1) Am I doing anything terribly stupid/naive here?
>        (2) If so, what can I do to improve space efficiency?
>
> TIA!
>
>
> import Data.List (sort)
> import Data.List.Split (splitOn)
>
> -- Cartesian Product over a List of Lists
> -- cp [[1,2],[3],[4,5,6]] -->
> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]]
> cp          :: [[a]] -> [[a]]
> cp []       =  [[]]
> cp (xs:xss) =  [y:ys | y <- xs, ys <- cp xss]
>
> -- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
> -- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
> fromCSV :: [String] -> [[[Int]]]
> fromCSV = map parseOneLine
>    where parseOneLine = map parseGroup . splitOn ","
>              where parseGroup = map read . splitOn "+"
>
> -- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] -->
> [[1,3,4],[1,3,5],
> -- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16],[2,15,16]]
> explode :: [[[a]]] -> [[a]]
> explode =  concatMap cp
>
> -- toCSV [[8,11,7,10,12],[8,11,7,10,12],[8,11,7,10,12]] -->
> -- ["8,11,7,10,12","8,11,7,10,12","8,11,7,10,12"]
> toCSV :: (Show a) => [[a]] -> [String]
> toCSV = map $ tail . init . show
> --toCSV = map (intercalate "," . map show)
>
> main = interact (unlines . toCSV . sort . explode . fromCSV . lines)
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>

I'm not sure about your problem specifically but the external-sort
package on Hackage may be of interest.

Alex