[Haskell-cafe] Using Data.Binary for compression

David Roundy droundy at darcs.net
Thu Nov 15 14:21:43 EST 2007


On Thu, Nov 15, 2007 at 11:10:01AM -0800, Chad Scherrer wrote:
> > > Almost all 'real users' just use Codec.Compression.GZip.  It's very
> > > fast, very compositional, and (perhaps surprisingly) almost as
> > > effective as application-specific schemes.
> >
> > I was about to say the same thing. So, so much simpler to use Duncan's
> > carefully written zlib binding,
> 
> I have several types of symbols, and for each type the probabilities
> are very predictable - to the point where they could even be
> hard-coded. And upon completion I can be sure the first two questions
> will be "Can we make it smaller?" and "Can we make it faster?". GZip
> (while very cool) is adaptive and general-purpose, so it's building
> frequency tables as it goes and ignoring the structure of the data I
> should be able to take advantage of.
> 
> With an awful lot of trouble, it must be possible to write something
> in C to go faster and yield better compression than gzip for this
> particular data. With the probability structure known in advance,
> there are just a lot of steps taken by gzip that are no longer needed.
> Besides this, gzip assumes only an arbitrary sequence of bytes, but my
> data are much more structured than that.
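
For what it's worth, a static code along those lines isn't much work in
Haskell.  Here is a minimal sketch of the idea -- the Symbol type, the
hard-coded prefix code, and the bit-packing helper are all invented for
illustration, not taken from your data:

    import qualified Data.ByteString.Lazy as BL
    import Data.Bits (setBit)
    import Data.List (foldl')

    -- A made-up symbol type standing in for the real data.
    data Symbol = A | B | C | D

    -- Hard-coded prefix code chosen from the known probabilities
    -- (A most frequent, D least), instead of gzip's adaptive tables.
    code :: Symbol -> [Bool]
    code A = [False]                -- 0
    code B = [True, False]          -- 10
    code C = [True, True, False]    -- 110
    code D = [True, True, True]     -- 111

    -- Pack the bit stream into bytes, zero-padding the last byte.
    packBits :: [Bool] -> BL.ByteString
    packBits = BL.pack . go
      where
        go [] = []
        go bs = let (byte, rest) = splitAt 8 bs
                in foldl' set 0 (zip [7,6..0] byte) : go rest
        set w (i, True)  = setBit w i
        set w (_, False) = w

    encode :: [Symbol] -> BL.ByteString
    encode = packBits . concatMap code

    main :: IO ()
    main = print (BL.unpack (encode [A,A,B,A,C,A,A,D]))

Decoding just walks the bit stream with the same table.  If the known
probabilities aren't close to powers of two, a range/arithmetic coder
driven by the same fixed model would get closer to the entropy, at the
cost of more bookkeeping.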

The catch is that gzip can beat even ideal arithmetic compression if there
happen to be correlations between symbols.  So what you claim is correct
only if there are no correlations other than those taken into account in
your known probability structure.  Any chance you can tell us what this
mysterious data is?
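
To make that concrete, here is a rough sketch using Duncan's zlib binding
(the input data is invented): on input with strong inter-symbol
correlations, gzip's string matching wins by a wide margin over anything
a fixed per-symbol model can achieve.

    import qualified Codec.Compression.GZip as GZip
    import qualified Data.ByteString.Lazy.Char8 as BL

    main :: IO ()
    main = do
      -- Highly correlated input: a memoryless per-symbol code can't do
      -- better than 3 bits per symbol on these 8 equiprobable bytes,
      -- but gzip collapses the repeated pattern almost entirely.
      let correlated = BL.concat (replicate 1000 (BL.pack "abcdefgh"))
      print (BL.length correlated)
      print (BL.length (GZip.compress correlated))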

But bit stream operations (and data compression) are seriously cool in any
case, so I hope you'll go ahead with this!
-- 
David Roundy
Department of Physics
Oregon State University

