[Haskell-cafe] Toy compression algorithms [was: A very edgy language]

Sun Jul 8 04:42:37 EDT 2007

Andrew Coppin wrote:
> Dave Bayer wrote:
>> I was beginning to accept that I might die before clearing my 
>> pipeline of research projects I want to code up. Haskell has given me 
>> new hope.
>
> Indeed. ;-)
>
> Today I hve implemented encoders and decoders for RLE, MTF, Fibonacci 
> codes, and LZW. Next on my list is BWT, Huffman codes and arithmetic 
> coding. (That last is *very* hard though...)

In case anybody cares...

   darcs get http://www.orphi.me.uk/darcs/ToyCompression
   ghc -O2 --make Encode
   ghc -O2 --make Decode

You can now do

  Encode LZW foo

Will look for a file named "foo", and generate a file called "foo-LZW" 
containing the LZW-compressed version of it. Also "foo-LZW.log.txt", 
which contains some statistics.

Similarly,

  Decode LZW foo-LZW

will make a file "foo-LZW-unLZW", which should *hopefully* be identical 
to the original one. (!)

It didn't occur to me, but I should probably go add some way to discover 
a list of available algorithms! LOL. Anyway, currently working:

   RLE (Run Length Encoding). Counts runs of symbols. Considers 8 bits 
to be one "symbol". The maximum run length is 255.
   MTF (Move To Front). Takes an input file and produces an output file 
of exactly the same size, but containing mostly *low* numbers. Works 
well with RLE or Fib.
   BWT (Burners-Wheeler Transform). Like MTF, does no actual 
compression, but makes the file more "compressible".
   Fib (Fibonacci codes). Stores numbers using Fibonacci codes instead 
of unsigned binary. This makes low numbers smaller and high numbers 
bigger. (Use it with MTF...)
   LZW (Lempel-Ziv-Welch). The daddy. Looks for repeated input strings 
and compresses them.

Caution: Encoding or decoding BWT currently requires absurd amounts of 
time and space. (Like, > 2 GB RAM and 8 minutes wall time to process 12 
KB of text.) I hope to fix that soon...

Currently it's unclear whether LZW or BWT+MTF+Fib gives the best 
compression. (Mainly because BWT is so hard to run!) I hope to implement 
a few other algorithms soonly.

(Note: LZW is typically used with a variable number of bits per output 
symbol. I haven't done this yet. It simply uses 16 bits in all cases. 
Once I fix this, compression will go up. Also, once the symbol 
dictionary is full, the encoder just resets itself.)