[Haskell-cafe] Re: Constructing Data.Map from ByteString

Tue Mar 11 20:46:09 EDT 2008

On Tue, 2008-03-11 at 11:13 +0000, Dave Tapley wrote:
> Just a few updates on this one:
> 
> I've upgraded to bytestring-0.9.0.5 from Darcs, no improvement.
> Also this morning I tried using Data.HashMap with Bytestring's readInt
> and HashMap's hashInt.. The result was a Stack space overflow :(
> 
> God damn it I don't want to have to go back to C++ but soon will have
> no choice with time constraints :(

Just to keep haskel-cafe readers up-to-date with the discussion of this
on #haskell...

So if we take a step back and look at the size and form of the data. The
input is a huge file of the form:

465	foo
1236	bar
594	baz

And we have hundreds of megabytes of this. So more precisely it's lines
with an integer (apparently guaranteed to fit into a 32bit unsigned
word) a tab and a shortish string. The overall line length is around
10-15 characters in general.

So let's look at some sizes:

Let's assume the lines are 13 characters long (including \n terminator)
so that's just 13 bytes.

data Map k a  = Tip 
              | Bin {-# UNPACK #-} !Size !k a !(Map k a) !(Map k a) 

So an Data.Map node is 6 words.

data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8)
                     {-# UNPACK #-} !Int                -- offset
                     {-# UNPACK #-} !Int                -- length

This is also 5 words (because ForeignPtr unpacks to 2 words).

So if we make a Map ByteString ByteString and we have an entry per line
then we need 5+5+6 words. On a 32bit machine that's (5+5+6)*4 = 64bytes.

So 64bytes storage for each 13 byte line. So that's approximately a
factor of 5. That explains the original observation.

So what is a more memory efficient representation?

Well we can eliminate the ByteString key by taking advantage of the fact
that the key is always guaranteed to fit into a Word. So we can use an
IntMap. An IntMap is 3-5 words per node (3 for leaf and 5 for internal
for an average of 4). Then the ByteString map value has redundancy, we
know that all of them are substrings of the same big ByteString so we
could factor that out and use a:

data SubStr = SubStr {-# UNPACK #-} !Int  -- offset
                     {-# UNPACK #-} !Int  -- length

That's 3 words. So that's still (4+3)*4 = 28 bytes per 13 byte line.

To get something really compact we could use an index composed of three
unboxed Int arrays. One for the key, one for the offset and one for the
length. We'd keep the key array sorted and the other two arrays
synchronised. We'd use a binary search on the key array and then look up
the corresponding entries in the offset and length arrays. That
representation uses 3 words = 12 bytes per line. We could reduce that to
10 or 9 bytes if we assume the key lengths fit in a Word16 or Word8.

So that gets us down to just double the original input file size. We
could further reduce this by making a new value ByteString consisting of
just all the values concatenated together without the keys in between.
That'd temporarily use another copy but that might be acceptable given
that this index is going to be used heavily later with more data.

Duncan