[Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

Wed Jun 1 16:24:23 CEST 2011

Hello Johan,

On Wed, Jun 01, 2011 at 08:52:04AM +0200, Johan Tibell wrote:
> I thought it'd be educational to do some back-of-the-envelope
> calculations to see how much memory we'd expect to use to store words
> in a HashMap ByteString Int.

Thank you for your writeup, which is very informative! You should really think
about writing that tutorial you mentioned earlier, it would probably be a boon
for many.

One additional thought: it might be interesting to provide this outside of this
mailing list, perhaps as a documentation addendum to unordered-containers, since
it really explains the size needs for HashMaps of ByteStrings to folks without
knowledge of higher arcane wizardry.

> For example, the average English word length is 5 characters, if you
> include stop words. We'd expect to use
> 
>     (9 + 9 + 2) * 8 + 5 = 165
> 
> bytes per unique word in our input corpus, on a 64-bit machine. Plus
> any GC overhead. This is probably more than one would expect.

Using an sqlite database, I committed 1/16 of my whole corpus (which is in
German) in ten separate steps and counted the amount of unigrams contained
therein: 2,207,369; Assuming German affection for longwords, and the fact that
the data contained umlauts in UTF-8, let's up the average word length a bit:
say, 15 (I pulled that out of my ass, obviously.) This brings us to 2207369 *
180 bytes, or ~400MB hash table.

> You could try using the lower overhead SmallString type from the
> smallstring package [6]. It has an overhead of 4 words per string
> (instead of 9 like ByteString). There's some runtime overhead involved
> in converting a value (i.e. Text) to a SmallString. I don't know if
> this overhead will be noticeable in your program.

I ended up using Data.Text over SmallString, also because I need to do other
operations on the words (case folding, mass matching, and possibly more) and
Text seemed more attractive for these tasks, but I'll try using SmallString,
too.

I think, overall, Text will probably use more memory than ByteString, but in my
particular case, the problem wasn't the size of the data structure, but the fact
that it seemed to retain chunks of the original input file.

> P.S. If your program indeed has a space leak this won't help you, but
> it's a good way to figure out if your program uses a reasonable amount
> of memory.

Yes, it is indeed! It turned out I was leaking space, since ByteStrings were
holding references to iteratee-read chunks (I have no idea what happened with
the program you gave me, that did not depend on iteratee. I can recall it having
a worse performance on memory, which right now seems counterintuitive. I'll try
to check it again later this week.)

But even without leaking space, this will provide me (and hopefully others) with
a good estimate on how well they can expect unordered-containers to perform.

To me, strict byte strings were a high-performance black box I didn't think
about. I thought, if I store stuff as a bytestring, and a strict one, no less,
there's *no* way I could expect to perform any better by switching the string
type. Turns out that was an unjustified assumption.

The wealth of string types in Haskell land seems kind of confusing to the newbie
:-(

Thanks again for your hard work on unordered-containers and for explaining this
stuff!

Regards and best wishes,
Aleks
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 490 bytes
Desc: Digital signature
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20110601/907adb26/attachment.pgp>