[Haskell-cafe] Toy compression algorithms [was: A very edgy language]

Donald Bruce Stewart dons at cse.unsw.edu.au
Sun Jul 8 04:50:36 EDT 2007


andrewcoppin:
> Andrew Coppin wrote:
> >Dave Bayer wrote:
> >>I was beginning to accept that I might die before clearing my 
> >>pipeline of research projects I want to code up. Haskell has given me 
> >>new hope.
> >
> >Indeed. ;-)
> >
> >Today I have implemented encoders and decoders for RLE, MTF, Fibonacci 
> >codes, and LZW. Next on my list are BWT, Huffman codes and arithmetic 
> >coding. (That last is *very* hard though...)
> 
> In case anybody cares...
> 
>   darcs get http://www.orphi.me.uk/darcs/ToyCompression
>   ghc -O2 --make Encode
>   ghc -O2 --make Decode
> 
> You can now do
> 
>  Encode LZW foo
> 
> This will look for a file named "foo" and generate a file called 
> "foo-LZW" containing the LZW-compressed version of it, plus 
> "foo-LZW.log.txt", which contains some statistics.
> 
> Similarly,
> 
>  Decode LZW foo-LZW
> 
> will make a file "foo-LZW-unLZW", which should *hopefully* be identical 
> to the original one. (!)
> 
> It didn't occur to me, but I should probably go add some way to discover 
> a list of available algorithms! LOL. Anyway, currently working 
> (illustrative sketches of each follow the list):
> 
>   RLE (Run Length Encoding). Counts runs of symbols. Considers 8 bits 
> to be one "symbol". The maximum run length is 255.
>   MTF (Move To Front). Takes an input file and produces an output file 
> of exactly the same size, but containing mostly *low* numbers. Works 
> well with RLE or Fib.
>   BWT (Burrows-Wheeler Transform). Like MTF, does no actual 
> compression, but makes the file more "compressible".
>   Fib (Fibonacci codes). Stores numbers using Fibonacci codes instead 
> of unsigned binary. This makes low numbers smaller and high numbers 
> bigger. (Use it with MTF...)
>   LZW (Lempel-Ziv-Welch). The daddy. Looks for repeated input strings 
> and compresses them.
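> 
> To give a flavour of RLE, here is a minimal sketch (illustrative code 
> only, not the actual ToyCompression implementation; the names are my 
> own):
> 
>   import Data.Word (Word8)
>   import Data.List (group)
> 
>   -- Encode runs as (count, symbol) pairs, splitting any run longer
>   -- than 255 so the count always fits in one byte.
>   rleEncode :: [Word8] -> [(Word8, Word8)]
>   rleEncode = concatMap cap . group
>     where
>       cap run = [ (fromIntegral (length r), head r)
>                 | r <- chunksOf 255 run ]
>       chunksOf _ [] = []
>       chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b
> 
>   rleDecode :: [(Word8, Word8)] -> [Word8]
>   rleDecode = concatMap (\(n, s) -> replicate (fromIntegral n) s)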
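> 
> MTF in the same illustrative style:
> 
>   import Data.Word (Word8)
>   import Data.List (delete, elemIndex)
>   import Data.Maybe (fromJust)
> 
>   -- Emit each symbol's current position in a table of all 256 byte
>   -- values, then move that symbol to the front.  Recently-seen
>   -- symbols therefore come out as low numbers.
>   mtfEncode :: [Word8] -> [Word8]
>   mtfEncode = go [0..255]
>     where
>       go _     []     = []
>       go table (x:xs) =
>         let i = fromJust (elemIndex x table)
>         in fromIntegral i : go (x : delete x table) xs
> 
>   mtfDecode :: [Word8] -> [Word8]
>   mtfDecode = go [0..255]
>     where
>       go _     []     = []
>       go table (i:is) =
>         let x = table !! fromIntegral i
>         in x : go (x : delete x table) is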
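> 
> Fibonacci coding is just the Zeckendorf representation plus a closing 
> 1 bit (sketch, for n >= 1):
> 
>   -- Fibonacci numbers starting 1, 2, 3, 5, 8, ...
>   fibs :: [Integer]
>   fibs = 1 : 2 : zipWith (+) fibs (tail fibs)
> 
>   -- One bit per Fibonacci number, least significant first, then an
>   -- extra True: "11" marks the end of a code word, since Zeckendorf
>   -- never uses two consecutive Fibonacci numbers.
>   fibEncode :: Integer -> [Bool]
>   fibEncode n = reverse (go n (reverse (takeWhile (<= n) fibs))) ++ [True]
>     where
>       go _ []       = []
>       go m (f:rest)
>         | f <= m    = True  : go (m - f) rest
>         | otherwise = False : go m rest
> 
>   -- Decode by summing the Fibonacci numbers at the True positions,
>   -- dropping the terminator.
>   fibDecode :: [Bool] -> Integer
>   fibDecode bs = sum [ f | (f, True) <- zip fibs (init bs) ]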
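> 
> And an LZW encoder sketch (codes as plain Ints; the 16-bit packing 
> and the dictionary reset mentioned below are left out):
> 
>   import qualified Data.Map as M
>   import Data.Word (Word8)
> 
>   -- Grow a dictionary of byte strings; emit the code for the longest
>   -- known prefix whenever it can no longer be extended.
>   lzwEncode :: [Word8] -> [Int]
>   lzwEncode = go initial []
>     where
>       initial = M.fromList [ ([w], fromIntegral w)
>                            | w <- [0..255 :: Word8] ]
>       go dict cur []     = [ dict M.! cur | not (null cur) ]
>       go dict cur (x:xs) =
>         let cur' = cur ++ [x]
>         in case M.lookup cur' dict of
>              Just _  -> go dict cur' xs
>              Nothing -> dict M.! cur
>                       : go (M.insert cur' (M.size dict) dict) [x] xs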
> 
> Caution: Encoding or decoding BWT currently requires absurd amounts of 
> time and space. (Like, > 2 GB RAM and 8 minutes wall time to process 12 
> KB of text.) I hope to fix that soon...
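> 
> For reference, the textbook formulation of BWT sorts all n rotations 
> of the input, so a naive list-based version holds O(n^2) data. A 
> sketch of that naive version (illustrative, not necessarily what the 
> repository does):
> 
>   import Data.List (sort)
> 
>   -- Sort every rotation and take the last column, plus the index of
>   -- the original string among the sorted rotations, which is needed
>   -- to invert the transform.  Real implementations sort suffixes or
>   -- use suffix arrays instead of materialising every rotation.
>   bwt :: Ord a => [a] -> ([a], Int)
>   bwt xs = (map last rots, idx)
>     where
>       n    = length xs
>       rots = sort [ take n (drop k (cycle xs)) | k <- [0 .. n-1] ]
>       idx  = length (takeWhile (/= xs) rots)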
> 
> Currently it's unclear whether LZW or BWT+MTF+Fib gives the best 
> compression. (Mainly because BWT is so hard to run!) I hope to implement 
> a few other algorithms soon.
> 
> (Note: LZW is typically used with a variable number of bits per output 
> symbol. I haven't done this yet. It simply uses 16 bits in all cases. 
> Once I fix this, compression will go up. Also, once the symbol 
> dictionary is full, the encoder just resets itself.)
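> 
> The variable-width trick is just that a dictionary of s entries needs 
> only ceiling(log2 s) bits per code, so output can start at 9 bits and 
> grow as the dictionary fills. A hypothetical helper (not in the repo):
> 
>   -- Bits per code when the dictionary holds s entries,
>   -- e.g. codeWidth 512 == 9, codeWidth 65536 == 16.
>   codeWidth :: Int -> Int
>   codeWidth s = ceiling (logBase 2 (fromIntegral s :: Double))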
> 

Good work. Probably worth benchmarking against the other compression
libraries on hackage, just to get a sense for how well your
implementations are optimised (and maybe how best to interact with lazy
bytestrings?). See for example:

    zlib   http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib-0.3
    bzlib  http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bzlib-0.3

and also:
    lzf    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Codec-Compression-LZF-0.2
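
For instance, a minimal zlib baseline might look like this (just a
sketch; "foo" is a placeholder):

    import qualified Codec.Compression.GZip as GZip
    import qualified Data.ByteString.Lazy as L

    -- Compress a file with zlib's gzip format and report the sizes,
    -- as a baseline for the toy encoders.
    main :: IO ()
    main = do
        input <- L.readFile "foo"
        let compressed = GZip.compress input
        L.writeFile "foo.gz" compressed
        putStrLn $ "original:   " ++ show (L.length input)
        putStrLn $ "compressed: " ++ show (L.length compressed)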

-- Don

