[Haskell-cafe] [newbie] processing large logs

Sat May 13 22:54:28 EDT 2006

dons:
> martine:
> > On 5/14/06, Eugene Crosser <crosser at average.org> wrote:
> > >main = printMax . (foldr processLine empty) . lines =<< getContents
> > >[snip]
> > >The thing kinda works on small data sets, but if you feed it with
> > >250,000 lines (1000 distinct), the process size grows to 200 Mb, and on
> > >500,000 lines I get "*** Exception: stack overflow" (using runhaskell
> > >from ghc 6.2.4).
> > 
> > To elaborate on Udo's point:
> > If you look at the definition of foldr you'll see where the stack
> > overflow is coming from:  foldr recurses all the way down to the end
> > of the list, so your stack gets 250k (or attempts 500k) entries deep
> > so it can process the last line in the file first, then unwinds.
> 
> Also, don't use runhaskell! Compile the code with -O :)

Not sure what processLine does, but just trying out Data.ByteString on
this as a test:

> import qualified Data.ByteString.Char8 as B
> import Data.List
> 
> main = print . foldl' processLine 0 . B.lines =<< B.getContents
>     where processLine acc l = if B.length l > 10 then acc+1 else acc

Just count the long lines. Probably you do something fancier.

Anyway, 32M runs through this in:

    $ time ./a.out < /home/dons/fps/tests/32M
    470400
    ./a.out < /home/dons/fps/tests/32M  0.31s user 0.28s system 28% cpu
    2.082 total

with 32M heap (these are strict byte arrays).

Using Data.ByteString.Lazy:

> import qualified Data.ByteString.Lazy as B
> import Data.List
> 
> main = print . foldl' processLine 0 . B.split 10  =<< B.getContents
>     where processLine acc l = if B.length l > 10 then acc+1 else acc

    $ time ./a.out < /home/dons/fps/tests/32M
    470400
    ./a.out < /home/dons/fps/tests/32M  0.32s user 0.11s system 26% cpu
    1.592 total

With only 3M heap used.

-- Don