Rogan Creswick creswick at gmail.com
Fri Feb 27 17:18:47 EST 2009

First off, my apologies for breaking etiquette, if/when I do -- I've

I have recently been trying to process a large data set (the 2.8tb
wikipedia data dump), and also replace my scripting needs with haskell
(needs that have previously been filled with bash, perl, and bits of
Java).  Last week I needed to do some quick scanning of the (7zipped)
wikipedia dump to get a feel for the size of articles, and from that
determine the best way to process the whole enchilada... cutting to
the chase, I ended up with a file consisting of byte offsets and lines
matched by a grep pattern (a 250mb file).  Specifically, 11m lines of:

1405:  <page>
14062:  <page>
15979:  <page>
18665:  <page>
920680797:  <page>
......
2807444041476:  <page>
2807444043623:  <page>

I needed to know how large the lagest <page> elements were, so I'd
know if they would fit in memory, and some idea of how many would
cause swapping, etc. So, I wrote a simple app in haskell (below) to
find the sizes of each <page> and sort them.  Unfortunately, it
consumes an absurd amount of memory (3+gb) and dies with an
out-of-memory error.  Given the input size, and what it is doing, this
seems ridiculously high -- can anyone help me understand what is going
on, and how I can prevent this sort of rampant memory use?

I can provide a link to the input file if anyone wants it, but it
doesn't seem particularly useful, given the simplicity and size.
Since I needed to get results fairly quickly, I've re-implemented this
in java, so that reference implementation is also available should
anyone want it (the approach that is most similar to the haskell
requires a 1.4gb heap, but by streaming the string->long parsing, that
requirement drops to ~600mb, which seems pretty reasonable, since the
*output* is 215mb.)

Thanks!
Rogan

\begin{code}
-- Compiled with:
-- $ghc --make offsetSorter.hs -- -- (ghc v. 6.8.2) -- -- Run with: --$ time ./offsetSorter data/byteOffsets.txt > haskOffsets.txt
-- offsetSorter: out of memory (requested 1048576 bytes)
--
-- real	4m12.130s
-- user	3m4.812s
-- sys	0m5.660s
--(OOM happened after consuming just over 3000mb of Virt, 2.6gb Res,
according to top.)
--

import System (getArgs)
import Data.Maybe
import Text.Printf (printf)
import Data.Function (on)
import Data.List (sort)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C8
import qualified Data.ByteString as B

-- get the lines
-- parse each line to get the offset.
-- scan the list of offsets

-- | The full file size:
maxSize :: Integer
maxSize = 2807444044080

-- | Block is a contiguous chunk of data.
-- The first entry is the offset, the second is the length.
data Block = Block {
offset::Integer
, size::Integer
} deriving (Eq)

-- | Ordering of Blocks is based entirely on the block size.
instance Ord Block where
compare = compare on size

instance Show Block where
show (Block o s) = (show o) ++ "  " ++ (show s)

-- turn the file into a list of offsets:
getOffsets :: ByteString -> [Integer]
getOffsets = catMaybes . map parseOffset . C8.lines

-- | Pull out the offsets frome a line of the file.
parseOffset :: ByteString -> Maybe Integer
parseOffset s = do
(i, _) <- C8.readInteger (C8.filter (/=':') s)
Just i

-- | Get the offsets between entries in a list
getSizes :: [Integer]  -> [Integer]
getSizes (x:y:[]) = [y - x]
getSizes (x:y:ys) = (y - x):(getSizes (y:ys))

-- | creates and returns a list of Blocks, given a file's content.
blocks :: ByteString -> [Block]
blocks s = zipWith (Block) offsets sizes
where offsets = getOffsets s
sizes   = getSizes (offsets ++ [maxSize])

main :: IO ()
main = do
args <- getArgs
printf "%s" $unlines$ map (show) (sort \$! blocks content)