[Haskell-beginners] Unexpected Space Leak (despite using
external-sort)
Peter Green
kinch1967 at me.com
Mon Jan 4 00:19:27 EST 2010
On 04/01/2010, at 10:43 AM, Stephen Blackheath [to Haskell-Beginners]
wrote:
> Peter,
>
> I can't see anything in your code that stops it being lazy, and I also
> read the source for external-sort. It all looks OK except for one
> bit.
> There's even a comment that says "It would be better if I changed Ord
> for blocks to only check the first element." Looks like it's
> comparing
> the whole list instead of the first element when merging the lists,
> and
> so it's a bit difficult to guess how much it is evaluating of each
> list.
> If there are many repeated elements, it could be a lot.
>
> You could try fixing that. You could hack kMerge using a newtype
> and an
> Ord instance (since Splay takes Ord instances).
>
>
> Steve
Thanks Steve. I think I understand your point. It seems like external-
sort needs patching to make block comparison more efficient for
everyone and that my case is just particularly extreme/degenerate.
Just right now, I'm not yet confident enough about my Haskell
abilities to go and create a new type for 'block' with more efficient
Ord and then modify the rest of the external-sort code accordingly.
Posted program was the 2nd Haskell prog I've every written. I'll get
there in time, though :).
I wonder if there is another way of making my code work better with
the existing externalSort function:
I am sorting [[Int]], an example of which might be:
[[8, 11, 7, 10],
[8, 11, 13, 10],
[12, 11, 7, 10],
[12, 11, 13, 10],
[1, 1, 3, 4],
.
.
[3, 9, 6, 4]]
I need numerical lex sorting, so I can't just map directly to
ByteString as this would result in "12" collating before "1", etc.
I know that each sublist I am sorting never contains an integer
greater than 20. I could encode each sublist as a ByteString where 1
maps to a, 2 maps to B, 3, maps to C, etc.
So would have [1,1,3,4] -> "aacd" (ByteString), etc.
Hopefully Ord comparisons between ByteStrings are much less expensive
than comparisons between instances of [Int] and I will see nicer
behaviour since external-sort does work as advertised on lists of
millions of Ints?
So I could sort these, then map back to underlying Int data when I
output. Schwartzian transform type stuff.
Anyway, I'll give it a go.
> Peter Green wrote:
>> I am using the external-sort package to sort my output in the program
>> below. I made this choice because my output dataset [[Int]] can be
>> large
>> (e.g. >3M items, each [Int]).
>>
>> What my program does:
>>
>> (1) Reads in a file containing 'compressed lists' which look like so:
>>
>> 8+12,11,7+13,10
>> 1+2+3,1+9,3+6,4
>> .
>> .
>>
>> One compressed list per line. These compressed lists are parsed to
>> become [[[Int]]]
>>
>> [[[8,12],[11],[7,13],[10]],
>> [[1,2,3],[1,9],[3,6],[4]],
>> .
>> .
>> ]
>>
>> Generally files of compressed lists have lengths of ~10,000 lines.
>>
>> (2) Compressed lists are exploded to [[int]] via concatMap Cartesian
>> Product over [[[Int]]], so we end up with [[Int]]
>>
>> [[8, 11, 7, 10],
>> [8, 11, 13, 10],
>> [12, 11, 7, 10],
>> [12, 11, 13, 10],
>> [1, 1, 3, 4],
>> .
>> .
>> [3, 9, 6, 4]]
>>
>> These 'exploded lists are' *much* longer than the input lists. It's
>> common for them to have >100K [Int] members, or even >1M [Int]
>> members.
>>
>> (3) This [[Int]] data must be sorted in to lexicographic order and
>> output as CSV data: i.e. output should be:
>>
>> 1,1,3,4
>> 3,9,6,4
>> 8,11,7,10
>> .
>> .
>> 12,11,13,10
>>
>> There is no way to avoid the necessity of sorting the final
>> product. I
>> can *not* take advantage of any structure in the input data to avoid
>> this sort stage. e.g. sorting the (smaller) input compressed lists
>> *will
>> not* obviate the need to lex sort the (larger) final output.
>>
>> The program listing below works correctly, but *is not* space-
>> efficient.
>> In fact it is *less* space-efficient than the equivalent program
>> using
>> Prelude sort!
>>
>> e.g. with one data set which explodes to ~3.7M I am seeing 639MB of
>> total memory use.
>>
>> I am sure externalSort is not at fault because:
>> (a) I'm the Newbie.
>> (b) ExternalSort.bin is 'only' 228MB in size. (Expected to be not
>> small
>> because sorting a [[Int]] with 3.7M [Int] members.)
>>
>> I'm sure it must be something I'm doing wrong elsewhere which is
>> causing
>> the entire output data list to be read into memory.
>>
>> What I really want to happen is:
>>
>> (1) Data lazily read in from stdin and lazily parsed
>> (2) Data lazily sucked out to disk and strictly sorted by
>> externalSort
>> (but no big deal since this happening on disk)
>> (3) Data lazily sucked back out of externalSort, lazily formatted for
>> output, and lazily written to stdout.
>>
>> So memory usage should hopefully never go over (say 10MB). But in
>> fact
>> I'm using 638MB. So somewhere in the code below I must be doing
>> something very wrong.
>>
>> I hope someone can tell me what I am doing wrong here.
>>
>> TIA!
>>
>>
>> import Algorithms.ExternalSort
>> import Data.List.Split (splitOn)
>> import System.IO
>>
>> -- Cartesian Product over a List of Lists
>> -- Http://www.cs.nott.ac.uk/~gmh/sudoku.lhs
>> -- cp [[1,2],[3],[4,5,6]] -->
>> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]]
>> cp :: [[a]] -> [[a]]
>> cp [] = [[]]
>> cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss]
>>
>> -- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
>> -- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
>> fromCSV :: [String] -> [[[Int]]]
>> fromCSV = map parseOneLine
>> where parseOneLine = map parseGroup . splitOn ","
>> where parseGroup = map read . splitOn "+"
>>
>> -- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] -->
>> [[1,3,4],[1,3,5],
>> -- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16],
>> [2,15,16]]
>> explode :: [[[a]]] -> [[a]]
>> explode = concatMap cp
>>
>> -- toSingles "8+12,11,7+13,10\n1+2+3,1+9,3+6,4" -->
>> --
>> [[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10],[1,1,3,4],
>> [1,1,6,4],
>> -- [1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4],
>> [3,1,3,4],
>> -- [3,1,6,4],[3,9,3,4],[3,9,6,4]]
>> toSingles :: String -> [[Int]]
>> toSingles = explode . fromCSV . lines
>>
>> -- toCSV [8,11,7,10,12] --> "8,11,7,10,12"
>> toCSV :: (Show a) => [a] -> String
>> toCSV = tail . init . show
>>
>> main = do
>> getContents >>= externalSort . toSingles >>= mapM_ (putStrLn .
>> toCSV)
>>
>> _______________________________________________
>> Beginners mailing list
>> Beginners at haskell.org
>> http://www.haskell.org/mailman/listinfo/beginners
>>
More information about the Beginners
mailing list