[Haskell-beginners] Unexpected Space Leak (despite
Stephen Blackheath [to Haskell-Beginners]
mutilating.cauliflowers.stephen at blacksapphire.com
Sun Jan 3 21:43:13 EST 2010
I can't see anything in your code that stops it being lazy, and I also
read the source for external-sort. It all looks OK except for one bit.
There's even a comment that says "It would be better if I changed Ord
for blocks to only check the first element." Looks like it's comparing
the whole list instead of the first element when merging the lists, and
so it's a bit difficult to guess how much it is evaluating of each list.
If there are many repeated elements, it could be a lot.
You could try fixing that. You could hack kMerge using a newtype and an
Ord instance (since Splay takes Ord instances).
Peter Green wrote:
> I am using the external-sort package to sort my output in the program
> below. I made this choice because my output dataset [[Int]] can be large
> (e.g. >3M items, each [Int]).
> What my program does:
> (1) Reads in a file containing 'compressed lists' which look like so:
> One compressed list per line. These compressed lists are parsed to
> become [[[Int]]]
> Generally files of compressed lists have lengths of ~10,000 lines.
> (2) Compressed lists are exploded to [[int]] via concatMap Cartesian
> Product over [[[Int]]], so we end up with [[Int]]
> [[8, 11, 7, 10],
> [8, 11, 13, 10],
> [12, 11, 7, 10],
> [12, 11, 13, 10],
> [1, 1, 3, 4],
> [3, 9, 6, 4]]
> These 'exploded lists are' *much* longer than the input lists. It's
> common for them to have >100K [Int] members, or even >1M [Int] members.
> (3) This [[Int]] data must be sorted in to lexicographic order and
> output as CSV data: i.e. output should be:
> There is no way to avoid the necessity of sorting the final product. I
> can *not* take advantage of any structure in the input data to avoid
> this sort stage. e.g. sorting the (smaller) input compressed lists *will
> not* obviate the need to lex sort the (larger) final output.
> The program listing below works correctly, but *is not* space-efficient.
> In fact it is *less* space-efficient than the equivalent program using
> Prelude sort!
> e.g. with one data set which explodes to ~3.7M I am seeing 639MB of
> total memory use.
> I am sure externalSort is not at fault because:
> (a) I'm the Newbie.
> (b) ExternalSort.bin is 'only' 228MB in size. (Expected to be not small
> because sorting a [[Int]] with 3.7M [Int] members.)
> I'm sure it must be something I'm doing wrong elsewhere which is causing
> the entire output data list to be read into memory.
> What I really want to happen is:
> (1) Data lazily read in from stdin and lazily parsed
> (2) Data lazily sucked out to disk and strictly sorted by externalSort
> (but no big deal since this happening on disk)
> (3) Data lazily sucked back out of externalSort, lazily formatted for
> output, and lazily written to stdout.
> So memory usage should hopefully never go over (say 10MB). But in fact
> I'm using 638MB. So somewhere in the code below I must be doing
> something very wrong.
> I hope someone can tell me what I am doing wrong here.
> import Algorithms.ExternalSort
> import Data.List.Split (splitOn)
> import System.IO
> -- Cartesian Product over a List of Lists
> -- Http://www.cs.nott.ac.uk/~gmh/sudoku.lhs
> -- cp [[1,2],,[4,5,6]] -->
> cp :: [[a]] -> [[a]]
> cp  = []
> cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss]
> -- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
> -- [[[8,12],,[7,13],],[[1,2,3],[1,9],[3,6],]]
> fromCSV :: [String] -> [[[Int]]]
> fromCSV = map parseOneLine
> where parseOneLine = map parseGroup . splitOn ","
> where parseGroup = map read . splitOn "+"
> -- explode [[[1,2],,[4,5,6]], [[1, 2], [14,15], ]] -->
> -- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16],[2,15,16]]
> explode :: [[[a]]] -> [[a]]
> explode = concatMap cp
> -- toSingles "8+12,11,7+13,10\n1+2+3,1+9,3+6,4" -->
> -- [1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4],[3,1,3,4],
> -- [3,1,6,4],[3,9,3,4],[3,9,6,4]]
> toSingles :: String -> [[Int]]
> toSingles = explode . fromCSV . lines
> -- toCSV [8,11,7,10,12] --> "8,11,7,10,12"
> toCSV :: (Show a) => [a] -> String
> toCSV = tail . init . show
> main = do
> getContents >>= externalSort . toSingles >>= mapM_ (putStrLn . toCSV)
> Beginners mailing list
> Beginners at haskell.org
More information about the Beginners