[Haskell-cafe] How on Earth Do You Reason about Space?

John Lato jwlato at gmail.com
Wed Jun 1 12:28:28 CEST 2011

On Wed, Jun 1, 2011 at 12:55 AM, Aleksandar Dimitrov <
aleks.dimitrov at googlemail.com> wrote:

> On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:
> > None of these leak space for me (all compiled with ghc-7.0.3 -O2).
> > Performance was pretty comparable for every version, although
> Aleksander's
> > original did seem to have a very small edge.
> How big were your input corpora?

Today I was using multiple copies of War & Peace, as Brandon specified.
Total size is about 90M.

> So it seems that I can't get rid of a factor of around 3x the input file
> size.
> Luckily, the dependency seems to be linear. Here's some profiling:
> <<ghc: 30478883712 bytes, 57638 GCs, 41925397/143688744 avg/max bytes
> residency (189 samples), 322M in use, 0.00 INIT (0.00 elapsed), 23.73 MUT
> (24.94 elapsed), 26.71 GC (27.10 elapsed) :ghc>>
> ../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr  50.44s user 1.50s
> system 99% cpu 52.064 total
> ghc itself reports 38MB avg (can live with that), and 140MB max (too much).
> Redirecting the program's output to a file will yield a mere 2.2M for the
> data gathered by the above script. Since those 2.2M of data are all I care
> about, why do I need so much more RAM to compute them?
> Are my demands unreasonable?

I think the issue is data sharing, as Brandon mentioned.  A bytestring
consists of an offset, a length, and a pointer to the underlying buffer.
You're using a chunk size of 64k, so every bytestring generated from a chunk
points into that same 64k of data.  Suppose there's one new word in that
chunk, near its beginning.  Even though the word may be only a few characters
long, it references the entire chunk and keeps it from being GC'd.
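The sharing is easy to see directly.  A rough sketch (the 64k figure just
mirrors the chunk size above):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

main :: IO ()
main = do
  -- Simulate one 64k input chunk.
  let chunk = BC.replicate 65536 'x'
      -- A "word" sliced from the start of the chunk.
      word  = B.take 5 chunk
  -- 'word' is only 5 bytes long, but under the hood it is an
  -- (offset, length) view into chunk's 64k buffer.  As long as
  -- 'word' is reachable, the whole 64k buffer cannot be GC'd.
  print (B.length word)
```

No copying happens in B.take; that's exactly why it's cheap, and exactly why
it retains the chunk.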

There are a few solutions to this.  The first is to copy the bytestring so
that only the required data is retained.  In my experiments this wasn't
helpful, but it will depend on your corpus.  The second is to start with
smaller chunks: a chunk size of 1024 worked fairly well for me.  If your
corpus is similar to natural language, it will probably work better for you
as well.
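The copying approach looks something like this (a sketch; 'countWord' is a
hypothetical helper, not from Aleksandar's code):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as M

-- Insert a word into a frequency map, copying the key first.  B.copy
-- allocates a fresh buffer holding just the word's own bytes, so the
-- map no longer references the 64k chunk the word was sliced from and
-- the chunk can be GC'd.
countWord :: B.ByteString -> M.Map B.ByteString Int -> M.Map B.ByteString Int
countWord w = M.insertWith (+) (B.copy w) 1
```

The trade-off is an extra allocation per word, which is why it may not pay
off on every corpus; with many distinct words per chunk the chunks get
retained anyway, and the copies are pure overhead.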

Note that Johan's Ngram code also keeps only the minimum required data,
giving it a good memory profile.  I didn't notice this last night because I
was testing with different data, and unfortunately the peculiar distribution
of that data masked the problem.

John L
