[Haskell-beginners] Unexpected Space Leak (despite using
external-sort)
Peter Green
kinch1967 at me.com
Sun Jan 3 20:35:04 EST 2010
I am using the external-sort package to sort my output in the program
below. I made this choice because my output dataset [[Int]] can be
large (e.g. >3M items, each [Int]).
What my program does:
(1) Reads in a file containing 'compressed lists' which look like so:
8+12,11,7+13,10
1+2+3,1+9,3+6,4
.
.
One compressed list per line. These compressed lists are parsed to
become [[[Int]]]
[[[8,12],[11],[7,13],[10]],
[[1,2,3],[1,9],[3,6],[4]],
.
.
]
Generally files of compressed lists have lengths of ~10,000 lines.
(2) Compressed lists are exploded to [[int]] via concatMap Cartesian
Product over [[[Int]]], so we end up with [[Int]]
[[8, 11, 7, 10],
[8, 11, 13, 10],
[12, 11, 7, 10],
[12, 11, 13, 10],
[1, 1, 3, 4],
.
.
[3, 9, 6, 4]]
These 'exploded lists are' *much* longer than the input lists. It's
common for them to have >100K [Int] members, or even >1M [Int] members.
(3) This [[Int]] data must be sorted in to lexicographic order and
output as CSV data: i.e. output should be:
1,1,3,4
3,9,6,4
8,11,7,10
.
.
12,11,13,10
There is no way to avoid the necessity of sorting the final product. I
can *not* take advantage of any structure in the input data to avoid
this sort stage. e.g. sorting the (smaller) input compressed lists
*will not* obviate the need to lex sort the (larger) final output.
The program listing below works correctly, but *is not* space-
efficient. In fact it is *less* space-efficient than the equivalent
program using Prelude sort!
e.g. with one data set which explodes to ~3.7M I am seeing 639MB of
total memory use.
I am sure externalSort is not at fault because:
(a) I'm the Newbie.
(b) ExternalSort.bin is 'only' 228MB in size. (Expected to be not
small because sorting a [[Int]] with 3.7M [Int] members.)
I'm sure it must be something I'm doing wrong elsewhere which is
causing the entire output data list to be read into memory.
What I really want to happen is:
(1) Data lazily read in from stdin and lazily parsed
(2) Data lazily sucked out to disk and strictly sorted by externalSort
(but no big deal since this happening on disk)
(3) Data lazily sucked back out of externalSort, lazily formatted for
output, and lazily written to stdout.
So memory usage should hopefully never go over (say 10MB). But in fact
I'm using 638MB. So somewhere in the code below I must be doing
something very wrong.
I hope someone can tell me what I am doing wrong here.
TIA!
import Algorithms.ExternalSort
import Data.List.Split (splitOn)
import System.IO
-- Cartesian Product over a List of Lists
-- Http://www.cs.nott.ac.uk/~gmh/sudoku.lhs
-- cp [[1,2],[3],[4,5,6]] --> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],
[2,3,6]]
cp :: [[a]] -> [[a]]
cp [] = [[]]
cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss]
-- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
-- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
fromCSV :: [String] -> [[[Int]]]
fromCSV = map parseOneLine
where parseOneLine = map parseGroup . splitOn ","
where parseGroup = map read . splitOn "+"
-- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] --> [[1,3,4],
[1,3,5],
-- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16],
[2,15,16]]
explode :: [[[a]]] -> [[a]]
explode = concatMap cp
-- toSingles "8+12,11,7+13,10\n1+2+3,1+9,3+6,4" -->
-- [[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10],[1,1,3,4],
[1,1,6,4],
-- [1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4],
[3,1,3,4],
-- [3,1,6,4],[3,9,3,4],[3,9,6,4]]
toSingles :: String -> [[Int]]
toSingles = explode . fromCSV . lines
-- toCSV [8,11,7,10,12] --> "8,11,7,10,12"
toCSV :: (Show a) => [a] -> String
toCSV = tail . init . show
main = do
getContents >>= externalSort . toSingles >>= mapM_ (putStrLn . toCSV)
More information about the Beginners
mailing list