[Haskell-cafe] Program using 500MB RAM to process 5MB file

Fri Apr 3 00:48:00 EDT 2009

2009/4/2  <lucas at die.net.au>:
> On Thu, Apr 02, 2009 at 07:55:07PM -0400, Rick R wrote:
>> You could profile your app for memory usage. Then you could figure out just
>> what function is blowing up the mem usage and figure out how to optimize it.
>>
>>
>> http://book.realworldhaskell.org/read/profiling-and-optimization.html
>>
>>
>> 2009/4/2 <lucas at die.net.au>
>>
>> > I'm relatively new to haskell so as one does, I am rewriting an
>> > existing program in haskell to help learn the language.
>> >
>> > However, it eats up all my RAM whenever I run the program.
>> >
>> > http://hpaste.org/fastcgi/hpaste.fcgi/view?id=3175#a3175
>> >
>> > Obviously I'm doing something wrong, but without my magical FP pants I
>> > don't know what that might be.
>> >
>
> I ran some profiling as suggested,
>
> [SNIP]
>
> total time  =        8.36 secs   (418 ticks @ 20 ms)
> total alloc = 3,882,593,720 bytes  (excludes profiling overheads)
>
> COST CENTRE                    MODULE               %time %alloc
>
> line                           PkgDb                 89.7   93.5
>
> COST CENTRE MODULE no. entries %time %alloc %time %alloc
> line        PkgDb  305 109771  89.7  93.3   89.7  93.3
>
> [SNIP]
>
> The line function is part of the file parser
>
> line :: Parser String
> line = anyChar `manyTill` newline
>
> files' :: Parser Files
> files' = line `manyTill` newline
>
> Perhaps I should also explain the structure of the file. It's for a
> simple package manager called pkgutils, used for CRUX[1]. The file
> contains information for all the packages installed and is structured
> as follows
>
> <package name>
> <package version>
> <file>
> <file>
> ...
> <file>
>
> <package name>
> ...
>
> From profiling it shows that the memory is simple consumed by reading
> in all the lines, the graph from using -p -hd shows an almost Ologn2
> growth of the heap as the collection of lines grows.
>
> Is there a better way to do this?

In this case the syntax of your file seems pretty simple.  Someone
else suggested a streaming approach.  Putting those ideas together, I
defined testInput as follows:

testInput = "<package name>\n"
          ++"<package version>\n"
          ++"<file>\n"
          ++"<file>\n"
          ++"...\n"
          ++"<file>\n"
          ++"\n"
          ++"<package name>\n"
          ++"...\n"

Here is an interactive experiment I tried:
GHCi> :t lines
lines :: String -> [String]
GHCi> lines testInput
["<package name>","<package
version>","<file>","<file>","...","<file>","","<package name>","..."]

Okay, looks we can use 'lines' from the Prelude to split the input into lines.

GHCi> :t cycle
cycle :: [a] -> [a]
GHCi> :t cycle testInput
cycle testInput :: [Char]

Using cycle on the testInput like this will give us an infinite input
that we can use to see if we have a streaming approach.

GHCi> take 10 $ lines $ cycle testInput
["<package name>","<package
version>","<file>","<file>","...","<file>","","<package
name>","...","<package name>"]

Success.  So if you like, you could use something like hGetContents
that reads a file lazily, run lines over the file and get a lazy
stream of the file lines.  Then you can use something like takeWhile
to take lines until you hit an empty line to build up your parsing
functions.

You could experiment with bytestrings, but in this particular
application I wouldn't expect to see a huge benefit.  Here I think you
can run in space proportional to the longest list of files that you
encounter in the input.

Hope that helps,
Jason