Don Stewart dons at galois.com
Fri Feb 27 17:20:06 EST 2009

creswick:
> First off, my apologies for breaking etiquette, if/when I do -- I've
>
> I have recently been trying to process a large data set (the 2.8tb
> wikipedia data dump), and also replace my scripting needs with haskell
> (needs that have previously been filled with bash, perl, and bits of
> Java).  Last week I needed to do some quick scanning of the (7zipped)
> wikipedia dump to get a feel for the size of articles, and from that
> determine the best way to process the whole enchilada... cutting to
> the chase, I ended up with a file consisting of byte offsets and lines
> matched by a grep pattern (a 250mb file).  Specifically, 11m lines of:
>
> 1405:  <page>
> 14062:  <page>
> 15979:  <page>
> 18665:  <page>
> 920680797:  <page>
> ......
> 2807444041476:  <page>
> 2807444043623:  <page>
>
> I needed to know how large the lagest <page> elements were, so I'd
> know if they would fit in memory, and some idea of how many would
> cause swapping, etc. So, I wrote a simple app in haskell (below) to
> find the sizes of each <page> and sort them.  Unfortunately, it
> consumes an absurd amount of memory (3+gb) and dies with an
> out-of-memory error.  Given the input size, and what it is doing, this
> seems ridiculously high -- can anyone help me understand what is going
> on, and how I can prevent this sort of rampant memory use?
>
> I can provide a link to the input file if anyone wants it, but it
> doesn't seem particularly useful, given the simplicity and size.
> Since I needed to get results fairly quickly, I've re-implemented this
> in java, so that reference implementation is also available should
> anyone want it (the approach that is most similar to the haskell
> requires a 1.4gb heap, but by streaming the string->long parsing, that
> requirement drops to ~600mb, which seems pretty reasonable, since the
> *output* is 215mb.)
>
> Thanks!
> Rogan
>
> \begin{code}
> -- Compiled with:
> -- \$ ghc --make offsetSorter.hs

YIKES!! Use the optimizer!

ghc -O2 --make

-- Don