[Haskell-beginners] Unexpected Space Leak (despite using external-sort)

Mon Jan 4 00:19:27 EST 2010

On 04/01/2010, at 10:43 AM, Stephen Blackheath [to Haskell-Beginners]  
wrote:

> Peter,
>
> I can't see anything in your code that stops it being lazy, and I also
> read the source for external-sort.  It all looks OK except for one  
> bit.
> There's even a comment that says "It would be better if I changed Ord
> for blocks to only check the first element."  Looks like it's  
> comparing
> the whole list instead of the first element when merging the lists,  
> and
> so it's a bit difficult to guess how much it is evaluating of each  
> list.
> If there are many repeated elements, it could be a lot.
>
> You could try fixing that.  You could hack kMerge using a newtype  
> and an
> Ord instance (since Splay takes Ord instances).
>
>
> Steve

Thanks Steve. I think I understand your point. It seems like external- 
sort needs patching to make block comparison more efficient for  
everyone and that my case is just particularly extreme/degenerate.  
Just right now, I'm not yet confident enough about my Haskell  
abilities to go and create a new type for 'block' with more efficient  
Ord and then modify the rest of the external-sort code accordingly.  
Posted program was the 2nd Haskell prog I've every written. I'll get  
there in time, though :).

I wonder if there is another way of making my code work better with  
the existing externalSort function:

I am sorting [[Int]], an example of which might be:

[[8, 11, 7, 10],
[8, 11, 13, 10],
[12, 11, 7, 10],
[12, 11, 13, 10],
[1, 1, 3, 4],
.
.
[3, 9, 6, 4]]

I need numerical lex sorting, so I can't just map directly to  
ByteString as this would result in "12" collating before "1", etc.

I know that each sublist I am sorting never contains an integer  
greater than 20. I could encode each sublist as a ByteString where 1  
maps to a, 2 maps to B, 3, maps to C, etc.

So would have [1,1,3,4] -> "aacd" (ByteString), etc.

Hopefully Ord comparisons between ByteStrings are much less expensive  
than comparisons between instances of  [Int] and I will see nicer  
behaviour since external-sort does work as advertised on lists of  
millions of Ints?

So I could sort these, then map back to underlying Int data when I  
output. Schwartzian transform type stuff.

Anyway, I'll give it a go.

> Peter Green wrote:
>> I am using the external-sort package to sort my output in the program
>> below. I made this choice because my output dataset [[Int]] can be  
>> large
>> (e.g. >3M items, each [Int]).
>>
>> What my program does:
>>
>> (1) Reads in a file containing 'compressed lists' which look like so:
>>
>> 8+12,11,7+13,10
>> 1+2+3,1+9,3+6,4
>> .
>> .
>>
>> One compressed list per line. These compressed lists are parsed to
>> become [[[Int]]]
>>
>> [[[8,12],[11],[7,13],[10]],
>> [[1,2,3],[1,9],[3,6],[4]],
>> .
>> .
>> ]
>>
>> Generally files of compressed lists have lengths of ~10,000 lines.
>>
>> (2) Compressed lists are exploded to [[int]] via concatMap Cartesian
>> Product over [[[Int]]], so we end up with [[Int]]
>>
>> [[8, 11, 7, 10],
>> [8, 11, 13, 10],
>> [12, 11, 7, 10],
>> [12, 11, 13, 10],
>> [1, 1, 3, 4],
>> .
>> .
>> [3, 9, 6, 4]]
>>
>> These 'exploded lists are' *much* longer than the input lists. It's
>> common for them to have >100K [Int] members, or even >1M [Int]  
>> members.
>>
>> (3) This [[Int]] data must be sorted in to lexicographic order and
>> output as CSV data: i.e. output should be:
>>
>> 1,1,3,4
>> 3,9,6,4
>> 8,11,7,10
>> .
>> .
>> 12,11,13,10
>>
>> There is no way to avoid the necessity of sorting the final  
>> product. I
>> can *not* take advantage of any structure in the input data to avoid
>> this sort stage. e.g. sorting the (smaller) input compressed lists  
>> *will
>> not* obviate the need to lex sort the (larger) final output.
>>
>> The program listing below works correctly, but *is not* space- 
>> efficient.
>> In fact it is *less* space-efficient than the equivalent  program  
>> using
>> Prelude sort!
>>
>> e.g. with one data set which explodes to ~3.7M I am seeing 639MB of
>> total memory use.
>>
>> I am sure externalSort is not at fault because:
>> (a) I'm the Newbie.
>> (b) ExternalSort.bin is 'only' 228MB in size. (Expected to be not  
>> small
>> because sorting a [[Int]] with 3.7M [Int] members.)
>>
>> I'm sure it must be something I'm doing wrong elsewhere which is  
>> causing
>> the entire output data list to be read into memory.
>>
>> What I really want to happen is:
>>
>> (1) Data lazily read in from stdin and lazily parsed
>> (2) Data lazily sucked out to disk and strictly sorted by  
>> externalSort
>> (but no big deal since this happening on disk)
>> (3) Data lazily sucked back out of externalSort, lazily formatted for
>> output, and lazily written to stdout.
>>
>> So memory usage should hopefully never go over (say 10MB). But in  
>> fact
>> I'm using 638MB. So somewhere in the code below I must be doing
>> something very wrong.
>>
>> I hope someone can tell me what I am doing wrong here.
>>
>> TIA!
>>
>>
>> import Algorithms.ExternalSort
>> import Data.List.Split (splitOn)
>> import System.IO
>>
>> -- Cartesian Product over a List of Lists
>> -- Http://www.cs.nott.ac.uk/~gmh/sudoku.lhs
>> -- cp [[1,2],[3],[4,5,6]] -->
>> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]]
>> cp          :: [[a]] -> [[a]]
>> cp []       =  [[]]
>> cp (xs:xss) =  [y:ys | y <- xs, ys <- cp xss]
>>
>> -- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
>> -- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
>> fromCSV :: [String] -> [[[Int]]]
>> fromCSV = map parseOneLine
>>    where parseOneLine = map parseGroup . splitOn ","
>>              where parseGroup = map read . splitOn "+"
>>
>> -- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] -->
>> [[1,3,4],[1,3,5],
>> -- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16], 
>> [2,15,16]]
>> explode :: [[[a]]] -> [[a]]
>> explode =  concatMap cp
>>
>> -- toSingles "8+12,11,7+13,10\n1+2+3,1+9,3+6,4" -->
>> --
>> [[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10],[1,1,3,4], 
>> [1,1,6,4],
>> --  [1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4], 
>> [3,1,3,4],
>> --  [3,1,6,4],[3,9,3,4],[3,9,6,4]]
>> toSingles :: String -> [[Int]]
>> toSingles = explode . fromCSV . lines
>>
>> -- toCSV [8,11,7,10,12] --> "8,11,7,10,12"
>> toCSV :: (Show a) => [a] -> String
>> toCSV = tail . init . show
>>
>> main = do
>>  getContents >>= externalSort . toSingles >>= mapM_ (putStrLn .  
>> toCSV)
>>
>> _______________________________________________
>> Beginners mailing list
>> Beginners at haskell.org
>> http://www.haskell.org/mailman/listinfo/beginners
>>