[Haskell-cafe] How on Earth Do You Reason about Space?
Daniel Fischer
daniel.is.fischer at googlemail.com
Wed Jun 1 12:42:54 CEST 2011
On Wednesday 01 June 2011 12:13:54, John Lato wrote:
> > From: Brandon Moore <brandon_m_moore at yahoo.com>
> >
> >
> > I was worried data sharing might mean your keys
> > retain entire 64K chunks of the input. However, it
> > seems enumLines depends on the StringLike ByteString
> > instance, which just converts to and from String.
> > That can't be efficient, but I suppose it avoids excessive sharing.
>
> That's true for 'enumLines', however the OP is using 'enumLinesBS',
> which operates on bytestrings directly.
>
> Data sharing certainly could be an issue here. I tried performing
> Data.ByteString.copy before inserting the key into the map, but that
> used more memory. I don't have an explanation for this; it's not what
> I would expect.
If you don't copy, the small ByteStrings (the words) point into the large
chunk, which stays alive in memory for as long as any of them does. A few
ripples come from the chunk boundaries: more often than not a word straddles
two chunks, and a new ByteString is allocated for those.
A bit of space is wasted on the constructor of the large chunk, on the
word-separators ('\n' here), and at the boundaries. The newlines are not a
big deal, costing one byte per text-word, but the constructors cost five or
so machine words per text-word.
If you copy each small ByteString out of the chunk, a) the chunk stays in
memory until it has been completely copied [which doesn't matter much with
chunk sizes of a few KB] and b) you get memory fragmentation because of
alignment requirements.
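To make the trade-off concrete, here is a minimal sketch (not the OP's
program; `countWords` and its newline-separated input are assumptions for
illustration) of inserting words from a chunk into a Map, with a flag
selecting between the shared-slice and the copied-key behaviour described
above:

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Map.Strict as M

-- Insert each newline-separated word of a chunk as a key.
-- If copyKeys is False, each key is a slice sharing the chunk's
-- storage, so the whole chunk is retained while any key lives.
-- If True, B.copy allocates a fresh buffer per key instead.
countWords :: Bool -> B.ByteString -> M.Map B.ByteString Int
countWords copyKeys chunk = foldr step M.empty (B.lines chunk)
  where
    step w m =
      let key = if copyKeys then B.copy w else w
      in M.insertWith (+) key 1 m

main :: IO ()
main = print (countWords True (B.pack "foo\nbar\nfoo"))
```

The two modes produce the same Map; only the retention behaviour differs,
which is why it shows up in the heap profile rather than in the results.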
>
> The other parameter which affects sharing is the chunk size. I got a
> much better memory profile when using a chunksize of 1024 instead of
> 65536.
>
> Oddly enough, when using the large chunksize I saw lower memory usage
> from Data.Map, but with the small chunksize Data.HashMap has a
> significant advantage.
>
> John Lato
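For reference, the chunk-size knob amounts to something like the following
sketch (hypothetical, not the OP's iteratee code): read the file in
fixed-size ByteString chunks, so that any key sliced out of a chunk without
copying retains at most one chunk's worth of input.

```haskell
import qualified Data.ByteString as B
import System.IO (Handle, IOMode (ReadMode), withFile)

-- Read a handle to exhaustion in chunks of at most n bytes.
-- A smaller n (e.g. 1024 instead of 65536) bounds how much input
-- each retained slice can pin in memory.
readChunks :: Int -> Handle -> IO [B.ByteString]
readChunks n h = do
  chunk <- B.hGet h n
  if B.null chunk
    then return []
    else (chunk :) <$> readChunks n h

main :: IO ()
main = do
  chunks <- withFile "input.txt" ReadMode (readChunks 1024)
  print (length chunks)
```

("input.txt" is a placeholder file name.) With smaller chunks, a shared
slice pins less dead data, which is consistent with the better profile seen
at a chunk size of 1024.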