[Haskell-beginners] Unexpected Space Leak (despite using external-sort)

Stephen Blackheath [to Haskell-Beginners] mutilating.cauliflowers.stephen at blacksapphire.com
Sun Jan 3 21:43:13 EST 2010


Peter,

I can't see anything in your code that stops it being lazy, and I also
read the source for external-sort.  It all looks OK except for one bit.
 There's even a comment that says "It would be better if I changed Ord
for blocks to only check the first element."  Looks like it's comparing
the whole list instead of the first element when merging the lists, and
so it's a bit difficult to guess how much it is evaluating of each list.
 If there are many repeated elements, it could be a lot.

You could try fixing that.  You could hack kMerge using a newtype and an
Ord instance (since Splay takes Ord instances).


Steve

Peter Green wrote:
> I am using the external-sort package to sort my output in the program
> below. I made this choice because my output dataset [[Int]] can be large
> (e.g. >3M items, each [Int]).
> 
> What my program does:
> 
> (1) Reads in a file containing 'compressed lists' which look like so:
> 
> 8+12,11,7+13,10
> 1+2+3,1+9,3+6,4
> .
> .
> 
> One compressed list per line. These compressed lists are parsed to
> become [[[Int]]]
> 
> [[[8,12],[11],[7,13],[10]],
>  [[1,2,3],[1,9],[3,6],[4]],
> .
> .
> ]
> 
> Generally files of compressed lists have lengths of ~10,000 lines.
> 
> (2) Compressed lists are exploded to [[int]] via concatMap Cartesian
> Product over [[[Int]]], so we end up with [[Int]]
> 
> [[8, 11, 7, 10],
>  [8, 11, 13, 10],
>  [12, 11, 7, 10],
>  [12, 11, 13, 10],
>  [1, 1, 3, 4],
>  .
>  .
>  [3, 9, 6, 4]]
> 
> These 'exploded lists are' *much* longer than the input lists. It's
> common for them to have >100K [Int] members, or even >1M [Int] members.
> 
> (3) This [[Int]] data must be sorted in to lexicographic order and
> output as CSV data: i.e. output should be:
> 
> 1,1,3,4
> 3,9,6,4
> 8,11,7,10
> .
> .
> 12,11,13,10
> 
> There is no way to avoid the necessity of sorting the final product. I
> can *not* take advantage of any structure in the input data to avoid
> this sort stage. e.g. sorting the (smaller) input compressed lists *will
> not* obviate the need to lex sort the (larger) final output.
> 
> The program listing below works correctly, but *is not* space-efficient.
> In fact it is *less* space-efficient than the equivalent  program using
> Prelude sort!
> 
> e.g. with one data set which explodes to ~3.7M I am seeing 639MB of
> total memory use.
> 
> I am sure externalSort is not at fault because:
> (a) I'm the Newbie.
> (b) ExternalSort.bin is 'only' 228MB in size. (Expected to be not small
> because sorting a [[Int]] with 3.7M [Int] members.)
> 
> I'm sure it must be something I'm doing wrong elsewhere which is causing
> the entire output data list to be read into memory.
> 
> What I really want to happen is:
> 
> (1) Data lazily read in from stdin and lazily parsed
> (2) Data lazily sucked out to disk and strictly sorted by externalSort
> (but no big deal since this happening on disk)
> (3) Data lazily sucked back out of externalSort, lazily formatted for
> output, and lazily written to stdout.
> 
> So memory usage should hopefully never go over (say 10MB). But in fact
> I'm using 638MB. So somewhere in the code below I must be doing
> something very wrong.
> 
> I hope someone can tell me what I am doing wrong here.
> 
> TIA!
> 
> 
> import Algorithms.ExternalSort
> import Data.List.Split (splitOn)
> import System.IO
> 
> -- Cartesian Product over a List of Lists
> -- Http://www.cs.nott.ac.uk/~gmh/sudoku.lhs
> -- cp [[1,2],[3],[4,5,6]] -->
> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]]
> cp          :: [[a]] -> [[a]]
> cp []       =  [[]]
> cp (xs:xss) =  [y:ys | y <- xs, ys <- cp xss]
> 
> -- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
> -- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
> fromCSV :: [String] -> [[[Int]]]
> fromCSV = map parseOneLine
>     where parseOneLine = map parseGroup . splitOn ","
>               where parseGroup = map read . splitOn "+"
> 
> -- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] -->
> [[1,3,4],[1,3,5],
> -- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16],[2,15,16]]
> explode :: [[[a]]] -> [[a]]
> explode =  concatMap cp
> 
> -- toSingles "8+12,11,7+13,10\n1+2+3,1+9,3+6,4" -->
> --
> [[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10],[1,1,3,4],[1,1,6,4],
> --  [1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4],[3,1,3,4],
> --  [3,1,6,4],[3,9,3,4],[3,9,6,4]]
> toSingles :: String -> [[Int]]
> toSingles = explode . fromCSV . lines
> 
> -- toCSV [8,11,7,10,12] --> "8,11,7,10,12"
> toCSV :: (Show a) => [a] -> String
> toCSV = tail . init . show
> 
> main = do
>   getContents >>= externalSort . toSingles >>= mapM_ (putStrLn . toCSV)
> 
> _______________________________________________
> Beginners mailing list
> Beginners at haskell.org
> http://www.haskell.org/mailman/listinfo/beginners
> 


More information about the Beginners mailing list