[Haskell-cafe] Computing the memory footprint of a HashMap ByteString Int (Was: How on Earth Do You Reason about Space?)

Wed Jun 1 17:33:16 CEST 2011

On Wed, Jun 1, 2011 at 4:24 PM, Aleksandar Dimitrov
<aleks.dimitrov at googlemail.com> wrote:
> One additional thought: it might be interesting to provide this outside of this
> mailing list, perhaps as a documentation addendum to unordered-containers, since
> it really explains the size needs for HashMaps of ByteStrings to folks without
> knowledge of higher arcane wizardry.

I think it would be a good answer on StackOverflow, but no one asked
me this question there. I could list the size overhead of a HashMap in
the docs, but I'm about to change it so I won't do so until after the
change. I don't know how big guarantees I want to make in the docs
either, as it might constrain future improvements to the
implementation. Perhaps in an addendum, like you said.

> I ended up using Data.Text over SmallString, also because I need to do other
> operations on the words (case folding, mass matching, and possibly more) and
> Text seemed more attractive for these tasks, but I'll try using SmallString,
> too.

Text uses 6 words per value, plus the size of the UTF-16 encoded
content. There's a Google Summer of Code project this year to convert
Text to UTF-8, which should decrease the space usage. In addition,
Text values aren't pinned on the heap, unlike ByteStrings, so they
should be nicer to the GC. The lowest overhead string type you could
imagine (given how GHC implements ByteArray#) would have a 4 word
overhead. Text trades 2 extra words to support efficient slicing (just
like ByteString).

When Text uses UTF-8 internally it should be possible to convert Text
to/from SmallString in O(1) time as the underlying ByteArray# could
just be wrapped in a SmallString constructor instead of a Text
constructor. This means that you could freely convert HashMap keys to
SmallString to save some space.

> I think, overall, Text will probably use more memory than ByteString, but in my
> particular case, the problem wasn't the size of the data structure, but the fact
> that it seemed to retain chunks of the original input file.

Given that ByteString uses 3 words more than Text you'll probably use
about the same (or slightly less) amount of space, given your string
lengths.

> To me, strict byte strings were a high-performance black box I didn't think
> about. I thought, if I store stuff as a bytestring, and a strict one, no less,
> there's *no* way I could expect to perform any better by switching the string
> type. Turns out that was an unjustified assumption.

You only have to know very little to use ByteString. Knowing that it
does slicing by sharing the underlying buffer (just like Java's
Strings) is unfortunately one of them. You can use the 'copy' function
to make sure the underlying storage matches the ByteString's length.

> The wealth of string types in Haskell land seems kind of confusing to the newbie

I sympathize. My rule of thumb is: use Text for Unicode data and
ByteString for (large) byte blobs.

Cheers,
Johan