[Haskell-cafe] Using Data.Binary for compression

Thu Nov 15 01:03:52 EST 2007

Hi,

I'd like to be able to use Data.Binary (or similar) for compression.
Say I have an abstract type Symbol, and for each value of Symbol I
have a representation in terms of some number of bits. For compression
to be efficient, commonly-used Symbols should have very short
representations, while less common ones can be longer.

Since an encoding like [Bool] would be really inefficient for this (at
least I think it would, though some fancy fusion tricks might be able
to help), I was thinking a reasonable approach might be to use Word8
(for example), and then specify a number of bits n, indicating that
only the first n bits are to be written to the compressed
representation.

I was looking at the internals of Data.Binary, and saw it seems that
PutM could be used for this purpose (was something like this its
original purpose?). Today, I put this together:
------------------------------------
type BitRep = Word8
type NBits = Int

type MyBits = (BitRep, NBits)

(#) :: MyBits -> MyBits -> PutM MyBits
(a, m) # (b, n) = case (a .|. (b `shiftR` m), m + n) of
  ans@(ab, s) -> if s < 8 then return ans
    else putWord8 ab >> return (b `shiftL` (8 - m), s - 8)
------------------------------------
Then, it would be easy enough to map [Symbol] -> [MyBits], and then
use something like foldM (#) to get into the PutM monad.

A couple of questions:

(1) Am I reinventing the wheel? I haven't seen anything like this, but
it would be nice to be a bit more certain.

(2) This seems like it will work ok, but the feel is not as clean as
the current Data.Binary interface. Is there something I'm missing that
might make it easier to integrate this?

(3) Right now this is just proof of concept, but eventually I'd like
to do some performance tuning, and it would be nice to have a
representation that's amenable to this. Any thoughts on speeding this
up while keeping the interface reasonably clean would be much
appreciated.

Thanks!

-Chad