[Haskell-cafe] Data.Binary suboptimal instance
Andrew Coppin
andrewcoppin at btinternet.com
Fri May 22 14:42:51 EDT 2009
So lately I've been working on a little program to generate trippy
graphics. (Indeed, some of you may remember it from a few years back...)
Anyway, getting to the point, I just restructured my program. As part of
the restructuring, I got rid of all the jiggery-pokery with
Data.Array.Storable and so forth and decided to use Data.Binary.
My first problem was quite simple. I want to write a program that reads
an item from the input file, processes it, writes it to the output file,
and repeats until the input file is empty. Data.Binary doesn't appear to
provide any obvious way to do this. You can't be in the Get and Put
monads simultaneously, and I can't figure out how to interleave them. In
the end, I ended up with something like:

    xs <- decodeFile f1
    encodeFile f2 (map f xs)
Unfortunately, it looks like this doesn't do what I want. Serialised
lists have their length prepended to them, which means that encodeFile
is going to have to generate the *entire* list in memory so it can count
how big it is, before it can output a single byte to disk. (!) This is
particularly annoying since the output list will be exactly the same
size as the input one, and the very first thing decodeFile is doing is
reading in that size figure.
Maybe there's a way around this glitch, and I just haven't thought of it
yet. But you'd think that wanting to lazily process data from an input
file and write it into an output file would be an extremely common
use-case. Given that, I'm not seeing an obvious way to handle it.
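For what it's worth, here's a sketch of one workaround I can imagine (the
encodeStream/decodeStream names are mine, and it assumes a binary version
that provides runGetOrFail): skip the list instance entirely, serialise the
elements back to back with no length prefix, and decode lazily until the
input runs out.

```haskell
import Data.Binary (Binary, encode, get)
import Data.Binary.Get (runGetOrFail)
import qualified Data.ByteString.Lazy as L
import Data.Word (Word32)

-- Serialise elements back to back, with no length prefix.
encodeStream :: Binary a => [a] -> L.ByteString
encodeStream = L.concat . map encode

-- Lazily decode one element at a time until the input is exhausted.
decodeStream :: Binary a => L.ByteString -> [a]
decodeStream bs
  | L.null bs = []
  | otherwise = case runGetOrFail get bs of
      Right (rest, _, x) -> x : decodeStream rest
      Left (_, _, err)   -> error err
```

So, for example, decodeStream (encodeStream [1, 2, 3 :: Word32]) gives back
[1, 2, 3], and since decodeStream is lazy in the tail, the whole file never
needs to be in memory at once.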
Anyway, that's one small problem, but I can live with it. There's
another, much bigger problem though, and that's really what I wanted to
talk about...
As we all know, Data.Binary is *built* for speed. So imagine my shock
when my newly adjusted program now based around this library turned out
to be massively slower, and the files it produced were drastically
larger. It didn't take me that long to dig up the problem.
The problem seems to boil down to this: The Binary instance for Double
(and Float, by the way) is... well I guess you could argue it's very
portable, but efficient it isn't. As we all know, an IEEE-754
double-precision floating-point number occupies 64 bits; 1 sign bit, 11
exponent bits, and 52 mantissa bits (implicitly 53). I had assumed that
the Binary instance for Double would simply write these bits to disk,
requiring approximately 0 computational power, and exactly 64 bits of
disk space. I was wrong.
...what it *actually* does is convert the 64-bit Double into a 32-bit
exponent and an arbitrary-precision integer mantissa. (!) It appears to do
this by floating-point arithmetic rather than just low-level bit
shuffling. Looking up how arbitrary-precision integers are serialised, I
see that there's an 8-bit "flag" indicating whether the integer fits
into 32 bits, and if it doesn't, there's an 8-bit sign value, followed
by a serialised list of bytes. So, in other words, a 32-bit length
value, followed by the bytes themselves.
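For reference, this is the standard decodeFloat route (from the RealFloat
class in the Prelude), which splits a Double into an arbitrary-precision
mantissa and an exponent using floating-point operations rather than bit
twiddling:

```haskell
-- decodeFloat :: Double -> (Integer, Int)
-- 1.0 is represented as 2^52 * 2^(-52), so the mantissa is 2^52:
example :: (Integer, Int)
example = decodeFloat (1.0 :: Double)
-- (4503599627370496, -52)
```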
To summarise, when I ask for a 64-bit Double to be serialised, I get this:
32-bit exponent.
8-bit flag (probably always "1")
8-bit sign (either "0" or "1")
32-bit size (probably always "8")
64-bit mantissa
That's 144 bits in total, i.e., 2.25 times the size of the original 64-bit Double.
(Let's not even go into how much computer power it takes to construct
all this data before it's written to disk...)
So, that's why my files ballooned in size, and why my program slowed to a
crawl. But how do I fix this? Well, my solution was simple, brutal, and
probably *highly* non-portable. It begins with Unsafe.Coerce (that's
never a good thing!).
Put simply, if you take your 64-bit Double and unsafeCoerce it into a
Word64 and then ask Data.Binary to serialise that, it writes it straight
to disk without screwing around with it first. The result ought to be
portable to any architecture that uses IEEE-754 arithmetic natively.
(Anybody know of an arch this *doesn't* apply to?) But sure, if you
wanted to do this on some other architecture with a different native
float format, you'd have to write some tricky code to handle it. (And
then add conditional compilation for it.)
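For completeness, here's a sketch of the bit-for-bit cast (the
doubleToWord64/word64ToDouble names are mine; I've written it with
Foreign.Storable, which avoids relying on unsafeCoerce between two boxed
types but carries exactly the same portability caveats):

```haskell
import Data.Word (Word64)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (castPtr)
import Foreign.Storable (peek, poke)
import System.IO.Unsafe (unsafePerformIO)

-- Reinterpret the bits of a Double as a Word64 by writing it into a
-- temporary buffer and reading the same 8 bytes back at the other type.
doubleToWord64 :: Double -> Word64
doubleToWord64 d = unsafePerformIO $ alloca $ \p -> do
  poke p d
  peek (castPtr p)

-- The inverse: reinterpret 64 raw bits as a Double.
word64ToDouble :: Word64 -> Double
word64ToDouble w = unsafePerformIO $ alloca $ \p -> do
  poke p w
  peek (castPtr p)
```

With these, put (doubleToWord64 d) writes exactly 64 bits (for instance,
doubleToWord64 1.0 is 0x3FF0000000000000 on an IEEE-754 machine), and
word64ToDouble on the decoding side undoes it.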
Is there any danger that there might be some kind of improvement to the
Double instance in the next version of Data.Binary?