Converting things to and from binary

Hal Daume III hdaume@ISI.EDU
Tue, 20 May 2003 07:40:59 -0700 (PDT)

I'll respond to this post specifically now,

> > library, but that it seems to be addressed towards a different
> > class of problems (specifically data-compression) to those
> > I am interested in (blasting data rapidly in and out).  The main
> > differences are

I don't think that's necessarily on...

> > (1) the framework I proposed in the original message:
> >
> > is byte-based, while the York Binary framework is bit-based.
> > I would imagine that this means the York Binary framework would be
> > very much less efficient at handling long sequences of bytes,
> > since they will presumably have to be shifted before
> > being written to the destination.

Indeed, the new "general" version of the binary module supports both byte
and bit operations.  Several tests by SimonM and me have shown that the bit
style is about 20% slower than the byte style.
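
To make the shifting cost concrete, here is a toy sketch of bit-level
packing.  It is not the binary module's code; packBits is a made-up name,
and it works on lists only to keep the example short.  Every bit costs a
shift and an or, which the byte style avoids entirely:

```haskell
import Data.Bits (shiftL, (.|.))
import Data.List (foldl')
import Data.Word (Word8)

-- Hypothetical helper: pack bits into bytes, most significant bit first.
-- The final byte is padded with zero bits if the input is not a multiple
-- of eight bits long.
packBits :: [Bool] -> [Word8]
packBits [] = []
packBits bs = toByte (take 8 (chunk ++ repeat False)) : packBits rest
  where
    (chunk, rest) = splitAt 8 bs
    -- One shift and one or per bit written.
    toByte = foldl' (\acc b -> (acc `shiftL` 1) .|. (if b then 1 else 0)) 0
```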

> > (2) the York Binary library uses the IO monad, and presumably
> > various variables within a BinHandle, to keep track of state.
> > I think this is unnecessary, for example I don't think the
> > process of converting a value to a byte array should really
> > have to go through IO.  We are supposed to be functional
> > programmers after all.

What's wrong with the IO monad?  :)

More seriously, I think the idea of going to and from lists of Word8 is
going to kill performance.  Especially if you are going to write to a file
in the end.  You're going to have to go:

  data structure -> [Word8] -> File

and likely the middle won't be deforested by any compiler.  Whether you
use functional arrays or lists is somewhat irrelevant -- functional arrays
are also exceedingly slow.
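
Roughly, the pipeline I have in mind looks like the sketch below.  Point,
word32Bytes, pointBytes and writePoints are made-up names for illustration,
not anything from either library; the point is that the whole [Word8] list
gets built and then copied into a buffer before the Handle ever sees it:

```haskell
import Data.Bits (shiftR)
import Data.Word (Word8, Word32)
import Foreign.Marshal.Array (withArrayLen)
import System.IO (Handle, hPutBuf)

-- Hypothetical record standing in for "data structure".
data Point = Point Word32 Word32

-- Step 1: flatten a value into an intermediate list of bytes (big-endian).
word32Bytes :: Word32 -> [Word8]
word32Bytes w = [fromIntegral (w `shiftR` s) | s <- [24, 16, 8, 0]]

pointBytes :: Point -> [Word8]
pointBytes (Point x y) = word32Bytes x ++ word32Bytes y

-- Step 2: copy the list into a temporary buffer and write that buffer
-- to the Handle.  The list itself is never written directly.
writePoints :: Handle -> [Point] -> IO ()
writePoints h ps =
  withArrayLen (concatMap pointBytes ps) $ \n buf -> hPutBuf h buf n
```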

> > (3) the York binary library provides two things you can
> > write bits to (a Handle, and a fixed area in memory) and a
> > large set of operations (seek and co), but it would be
> > difficult for a normal programmer to extend this.  (For
> > example, what about someone in GHC wanting to write to
> > a Posix Fd?)  On the other hand the framework I propose
> > has only two basic operations for writing, and two for
> > reading, which means it should be much easier to define
> > alternative consumers and sources of binary data.

This is true and I would say that this is the primary drawback of the
design.  Of course, (SimonM, please chime in here), it's probably possible
to extend the library to support writing to Fds (I'm not sure though, due
to the stateful stuff) but I agree that this is a fairly obtuse solution.

That said, my understanding of what GHC uses the binary module for
internally is to create large BinMems (things in memory, essentially large
arrays) and then write those to Handles at the end.  There's no reason
we couldn't provide a function of type 'BinMem -> IO (IOUArray Int Word8)',
or something similar, that would allow you to peek at the data and write
it to an Fd or do something else you wanted.  I think this solves the problem of
giving it to different consumers.  A similar function could provide access
for arbitrary producers.
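
A minimal sketch of what I mean, assuming a simplified BinMem (the real one
carries more state, such as the current position and size).  All of the
names below are hypothetical and none of them exist in the library today:

```haskell
import Data.Array.IO (IOUArray, getBounds, getElems, newListArray, readArray)
import Data.Word (Word8)

-- Hypothetical stand-in for the library's BinMem.
newtype BinMem = BinMem (IOUArray Int Word8)

-- The proposed accessor: expose the underlying byte array.
binMemArray :: BinMem -> IO (IOUArray Int Word8)
binMemArray (BinMem arr) = return arr

-- Feed every byte, in order, to an arbitrary consumer action
-- (something that writes to an Fd, a socket, and so on).
drainTo :: (Word8 -> IO ()) -> BinMem -> IO ()
drainTo consume bm = do
  arr <- binMemArray bm
  (lo, hi) <- getBounds arr
  mapM_ (\i -> readArray arr i >>= consume) [lo .. hi]

-- Read the contents back out as a list, mainly useful for testing.
binMemBytes :: BinMem -> IO [Word8]
binMemBytes bm = binMemArray bm >>= getElems

-- The mirror image for producers: build a BinMem from existing bytes.
binMemFromBytes :: [Word8] -> IO BinMem
binMemFromBytes bs = BinMem <$> newListArray (0, length bs - 1) bs
```

With something like drainTo, the consumer is just an IO action per byte, so
supporting a new destination doesn't require touching the library itself.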

 - Hal