[Haskell-beginners] Lazy file IO & Space leaks/waste
Aleksandar Dimitrov
aleks.dimitrov at googlemail.com
Thu Nov 5 06:01:11 EST 2009
Hello list,
I'm currently writing a small linguistic corpus analyzer. My input file is only
25MB, but profiling shows that the overall amount of allocation over the
program's runtime is several GB. That's a little too much - adding to that is
the fact that the program is abysmally slow, so I'm suspecting a space leak
somewhere.
I'm using the Data.ByteString.Lazy.Char8 module in order to work efficiently with
lazy IO, and I must admit that I'm very inexperienced with predicting the runtime
and space behaviour of lazy IO :-( It worked well in the past, but I'm stuck now.
The program can be found here:
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=11863#a11863
The important bits are as follows:
> mf :: [C.ByteString] -> StdWord
> mf [] = Word [] C.empty
> mf s = Word (tail s) (head s)
>
> f' = mf . reverse . C.words
> main :: IO ()
> main = do
>     corpus_name <- liftM head getArgs
>     corpus <- liftM (Corpus . map f' . C.lines) $ C.readFile corpus_name
>     print $ length (content corpus)
>     let interesting = filterForInterestingTags interestingTags corpus
>     print $ show (freqMap interesting)
freqMap also uses foldr' to fold the corpus it is given into a Data.Map, but
according to the profiling output, that's not where the bad stuff happens.
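For context, here is a sketch of the kind of strict fold I mean (this is not my
actual freqMap, which folds StdWords; the module name Data.Map.Strict is from
newer versions of containers, older ones have insertWith' instead):

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- Sketch of a strict frequency count: foldl' forces the accumulator
-- at every step, and Data.Map.Strict's insertWith forces the updated
-- count, so no chain of (+) thunks builds up inside the map.
freqMap :: Ord a => [a] -> M.Map a Int
freqMap = foldl' (\m k -> M.insertWith (+) k 1 m) M.empty

main :: IO ()
main = print (freqMap (words "the cat saw the dog"))
```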
Here's the interesting parts of the profiling output:
> analyse +RTS -p -RTS corpus/bnc-samp.tt
>
> total time = 30.46 secs (1523 ticks @ 20 ms)
> total alloc = 14,871,619,904 bytes (excludes profiling overheads)
>
> COST CENTRE                 MODULE  no.    entries  %time %alloc   %time %alloc
>
> MAIN                        MAIN      1          0    0.0    0.0   100.0  100.0
>  main                       Main    221          0    0.0    0.0     0.0    0.0
>  CAF                        Main    206         14    0.1    0.1   100.0  100.0
>   showsPrec_aSF             Main    223          1    0.0    0.0     0.0    0.0
>   interestingTags           Main    218          1    0.0    0.0     0.0    0.0
>   f'                        Main    216          1    0.0    0.0     0.0    0.0
>    mf                       Main    217          1    0.0    0.0     0.0    0.0
>   main                      Main    212          1    2.4    4.4    99.9   99.9
>    f'                       Main    219          0   89.0   91.5    90.2   92.7
>     mf                      Main    220    2427450    1.2    1.2     1.2    1.2
>    filterForInterestingTags Main    215          1    5.4    0.1     5.4    0.1
>    freqMap                  Main    214          1    0.6    0.5     2.0    2.7
>     compare_aRL             Main    222    1141996    1.4    2.2     1.4    2.2
The main cost centre is clearly f' (which I've factored out in order to view it
separately as a cost centre). It's not the call to reverse that makes it slow
(removing it doesn't affect the run time). Interestingly, the program prints the
first statement (the length) very quickly, and only then takes a long time. When
I removed the call to freqMap, GHC seemed smart enough to drop the call to f'
completely, because only the length ever gets evaluated. But freqMap also needs
the processing done by f', and that's where the bad stuff starts happening.
Should I look into DeepSeq? I've tried adding strictness by hand, both to the
functions and to the data structures, but that hasn't helped so far. So I'm
looking for ways to make this faster. I'm guessing that a well-placed `seq`
might help, but I don't know where :-\
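To illustrate, here is a sketch (not my actual code) of the kind of strictness
annotations I've been experimenting with: strict fields on StdWord, and a
pattern-matching mf instead of head/tail:

```haskell
import qualified Data.ByteString.Lazy.Char8 as C

-- Strict fields force both components to WHNF as soon as the
-- constructor itself is evaluated (fully forcing the tag list would
-- still need something like deepseq).
data StdWord = Word ![C.ByteString] !C.ByteString
  deriving (Show, Eq)

mf :: [C.ByteString] -> StdWord
mf []     = Word [] C.empty
mf (w:ws) = Word ws w

main :: IO ()
main = print (mf (reverse (C.words (C.pack "dog NN"))))
```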
As a side note: how can I find out what the run-time data structures look like
in memory? I.e. I want to know *what* exactly stays around as thunks, so that I
can focus on where to add strictness or redirect the program flow. Ideally,
only a very small part of the file should ever be in memory, with processing
happening incrementally!
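One thing I've been meaning to try is GHC's heap profiler, which as I understand
it can break residency down by cost centre or by closure type, roughly like this
(flag names are from the GHC manual; the binary must be built with profiling,
e.g. ghc -prof -auto-all):

```shell
./analyse +RTS -hc -RTS corpus/bnc-samp.tt   # residency by cost centre
hp2ps -c analyse.hp                          # render analyse.hp as analyse.ps
# -hd instead of -hc breaks residency down by closure description, and
# -hy by type, which should show how much of the heap is unevaluated thunks.
```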
Thanks for any pointers!
Aleks