[Haskell-beginners] Unexpected Space Leak (despite using external-sort)

Peter Green kinch1967 at me.com
Sun Jan 3 20:35:04 EST 2010


I am using the external-sort package to sort my output in the program  
below. I made this choice because my output dataset [[Int]] can be  
large (e.g. >3M items, each [Int]).

What my program does:

(1) Reads in a file containing 'compressed lists' which look like so:

8+12,11,7+13,10
1+2+3,1+9,3+6,4
.
.

One compressed list per line. These compressed lists are parsed to  
become [[[Int]]]

[[[8,12],[11],[7,13],[10]],
  [[1,2,3],[1,9],[3,6],[4]],
.
.
]

Generally files of compressed lists have lengths of ~10,000 lines.

(2) Compressed lists are exploded to [[int]] via concatMap Cartesian  
Product over [[[Int]]], so we end up with [[Int]]

[[8, 11, 7, 10],
  [8, 11, 13, 10],
  [12, 11, 7, 10],
  [12, 11, 13, 10],
  [1, 1, 3, 4],
  .
  .
  [3, 9, 6, 4]]

These 'exploded lists are' *much* longer than the input lists. It's  
common for them to have >100K [Int] members, or even >1M [Int] members.

(3) This [[Int]] data must be sorted in to lexicographic order and  
output as CSV data: i.e. output should be:

1,1,3,4
3,9,6,4
8,11,7,10
.
.
12,11,13,10

There is no way to avoid the necessity of sorting the final product. I  
can *not* take advantage of any structure in the input data to avoid  
this sort stage. e.g. sorting the (smaller) input compressed lists  
*will not* obviate the need to lex sort the (larger) final output.

The program listing below works correctly, but *is not* space- 
efficient. In fact it is *less* space-efficient than the equivalent   
program using Prelude sort!

e.g. with one data set which explodes to ~3.7M I am seeing 639MB of  
total memory use.

I am sure externalSort is not at fault because:
(a) I'm the Newbie.
(b) ExternalSort.bin is 'only' 228MB in size. (Expected to be not  
small because sorting a [[Int]] with 3.7M [Int] members.)

I'm sure it must be something I'm doing wrong elsewhere which is  
causing the entire output data list to be read into memory.

What I really want to happen is:

(1) Data lazily read in from stdin and lazily parsed
(2) Data lazily sucked out to disk and strictly sorted by externalSort  
(but no big deal since this happening on disk)
(3) Data lazily sucked back out of externalSort, lazily formatted for  
output, and lazily written to stdout.

So memory usage should hopefully never go over (say 10MB). But in fact  
I'm using 638MB. So somewhere in the code below I must be doing  
something very wrong.

I hope someone can tell me what I am doing wrong here.

TIA!


import Algorithms.ExternalSort
import Data.List.Split (splitOn)
import System.IO

-- Cartesian Product over a List of Lists
-- Http://www.cs.nott.ac.uk/~gmh/sudoku.lhs
-- cp [[1,2],[3],[4,5,6]] --> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5], 
[2,3,6]]
cp          :: [[a]] -> [[a]]
cp []       =  [[]]
cp (xs:xss) =  [y:ys | y <- xs, ys <- cp xss]

-- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
-- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
fromCSV :: [String] -> [[[Int]]]
fromCSV = map parseOneLine
     where parseOneLine = map parseGroup . splitOn ","
               where parseGroup = map read . splitOn "+"

-- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] --> [[1,3,4], 
[1,3,5],
-- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16], 
[2,15,16]]
explode :: [[[a]]] -> [[a]]
explode =  concatMap cp

-- toSingles "8+12,11,7+13,10\n1+2+3,1+9,3+6,4" -->
-- [[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10],[1,1,3,4], 
[1,1,6,4],
--  [1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4], 
[3,1,3,4],
--  [3,1,6,4],[3,9,3,4],[3,9,6,4]]
toSingles :: String -> [[Int]]
toSingles = explode . fromCSV . lines

-- toCSV [8,11,7,10,12] --> "8,11,7,10,12"
toCSV :: (Show a) => [a] -> String
toCSV = tail . init . show

main = do
   getContents >>= externalSort . toSingles >>= mapM_ (putStrLn . toCSV)



More information about the Beginners mailing list