[Haskell-cafe] [newbie] processing large logs

Udo Stenzel u.stenzel at web.de
Sat May 13 18:25:06 EDT 2006


Eugene Crosser wrote:
> This is my program:
> ========
> module Main where
> import Data.Map
> main = printMax . (foldr processLine empty) . lines =<< getContents
> processLine line map = insertWith (\new old -> new + old) line 1 map
> printMax map = putStrLn $ show $ foldWithKey
>    (\key val accum -> if val > (snd accum) then (key,val) else accum)
>        ("",0) map
> ========
> The thing kinda works on small data sets, but if you feed it with
> 250,000 lines (1000 distinct), the process size grows to 200 Mb, and on
> 500,000 lines I get "*** Exception: stack overflow"

Your program isn't strict enough.  While you expect it to keep a
"running total" in the map which is updated with each new line, it
really only creates lots of thunks that are only evaluated when the
result is demanded.  These thunks are as large as the input plus
overhead.

You have to force the evaluation of intermediate results.  To do so, you
have to replace foldr by foldl (foldr is just recursion, foldl is
accumulator recursion), then use the strict variant of that, and then
evaluate all values before putting them into the map.  In summary, this
should work (untested code, note the use of foldl'):

main = printMax . (foldl' processLine empty) . lines =<< getContents
processLine map line =
    let total = findWithDefault 0 line map + 1
    in total `seq` insert line total map 
    

Yes, this is all terribly non-obvious.  It takes time until you see
where lazyness is going to hurt you, and you'll easily overlook some
such situations.  I also think, it's an unfortunate oversight that
insertWith is lazy and that there's no way to make it strict as a mere
user of Data.Map.


Udo.
-- 
"Guy Steele leads a small team of researchers in Burlington,
Massachusetts, who are taking on an _enormous_challenge_ -- create a
programming language better than Java."
	-- Sun.Com (emphasis by Paul Graham)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://www.haskell.org//pipermail/haskell-cafe/attachments/20060514/4157f97c/attachment.bin


More information about the Haskell-Cafe mailing list