[Haskell-cafe] Toy compression algorithms [was: A very edgy
language]
Donald Bruce Stewart
dons at cse.unsw.edu.au
Sun Jul 8 04:50:36 EDT 2007
andrewcoppin:
> Andrew Coppin wrote:
> >Dave Bayer wrote:
> >>I was beginning to accept that I might die before clearing my
> >>pipeline of research projects I want to code up. Haskell has given me
> >>new hope.
> >
> >Indeed. ;-)
> >
> >Today I have implemented encoders and decoders for RLE, MTF, Fibonacci
> >codes, and LZW. Next on my list is BWT, Huffman codes and arithmetic
> >coding. (That last is *very* hard though...)
>
> In case anybody cares...
>
> darcs get http://www.orphi.me.uk/darcs/ToyCompression
> ghc -O2 --make Encode
> ghc -O2 --make Decode
>
> You can now do
>
> Encode LZW foo
>
> Will look for a file named "foo", and generate a file called "foo-LZW"
> containing the LZW-compressed version of it. Also "foo-LZW.log.txt",
> which contains some statistics.
>
> Similarly,
>
> Decode LZW foo-LZW
>
> will make a file "foo-LZW-unLZW", which should *hopefully* be identical
> to the original one. (!)
>
> It didn't occur to me, but I should probably go add some way to discover
> a list of available algorithms! LOL. Anyway, currently working:
>
> RLE (Run Length Encoding). Counts runs of symbols. Considers 8 bits
> to be one "symbol". The maximum run length is 255.
> MTF (Move To Front). Takes an input file and produces an output file
> of exactly the same size, but containing mostly *low* numbers. Works
> well with RLE or Fib.
> BWT (Burrows-Wheeler Transform). Like MTF, does no actual
> compression, but makes the file more "compressible".
> Fib (Fibonacci codes). Stores numbers using Fibonacci codes instead
> of unsigned binary. This makes low numbers smaller and high numbers
> bigger. (Use it with MTF...)
> LZW (Lempel-Ziv-Welch). The daddy. Looks for repeated input strings
> and compresses them.
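[Editor's note: the first two transforms in the list can be sketched directly from their descriptions. This is a hypothetical minimal version, not the ToyCompression source; in particular the (count, symbol) pairing for RLE and the 0-255 symbol table are assumptions based on the text above.]

```haskell
import Data.List (group)
import Data.Word (Word8)

-- RLE: each run becomes a (length, symbol) pair. Runs are split at
-- 255 so the count fits in one byte, matching the stated maximum.
rleEncode :: [Word8] -> [(Word8, Word8)]
rleEncode = concatMap cap . group
  where
    cap []          = []
    cap run@(x:_)   = [ (fromIntegral (length r), x)
                      | r <- chunksOf 255 run ]
    chunksOf _ []   = []
    chunksOf n xs   = let (a, b) = splitAt n xs in a : chunksOf n b

rleDecode :: [(Word8, Word8)] -> [Word8]
rleDecode = concatMap (\(n, x) -> replicate (fromIntegral n) x)

-- MTF: emit each symbol's index in a self-organising table, then move
-- that symbol to the front. Recently seen symbols sit near the front,
-- so the output is mostly small numbers -- which is why it pairs well
-- with Fibonacci codes.
mtfEncode :: [Word8] -> [Int]
mtfEncode = go [0 .. 255]
  where
    go _   []     = []
    go tbl (x:xs) = case break (== x) tbl of
      (front, _:back) -> length front : go (x : front ++ back) xs
      _               -> error "symbol not in table"

mtfDecode :: [Int] -> [Word8]
mtfDecode = go [0 .. 255]
  where
    go _   []     = []
    go tbl (i:is) = let (front, x:back) = splitAt i tbl
                    in x : go (x : front ++ back) is
```

Both are round-trip inverses, and mtfEncode produces output of exactly the same length as its input, as described.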
>
> Caution: Encoding or decoding BWT currently requires absurd amounts of
> time and space. (Like, > 2 GB RAM and 8 minutes wall time to process 12
> KB of text.) I hope to fix that soon...
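[Editor's note: a direct transcription of the BWT definition shows where the cost comes from. This naive sketch (an assumption, not Andrew's code) sorts all n rotations of an n-byte input, comparing O(n)-long strings at each step; practical implementations sort suffixes instead.]

```haskell
import Data.List (elemIndex, sort)

-- Sort every rotation of the input and take the last column; the
-- index of the original string among the sorted rotations is what the
-- inverse transform needs.
bwt :: String -> (Int, String)
bwt s = (idx, map last rots)
  where
    n             = length s
    rots          = sort (take n (iterate rotate s))
    rotate (c:cs) = cs ++ [c]
    rotate []     = []
    Just idx      = elemIndex s rots

-- Inverse: repeatedly prepend the last column and re-sort; after n
-- rounds the table is the sorted rotations, and row idx is the input.
unbwt :: (Int, String) -> String
unbwt (idx, lastCol) =
  (iterate (sort . zipWith (:) lastCol) blanks !! length lastCol) !! idx
  where blanks = map (const "") lastCol
```

On "banana" this gives (3, "nnbaaa") -- note how the transform groups equal characters together, which is what makes the output more compressible.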
>
> Currently it's unclear whether LZW or BWT+MTF+Fib gives the best
> compression. (Mainly because BWT is so hard to run!) I hope to implement
> a few other algorithms soon.
>
> (Note: LZW is typically used with a variable number of bits per output
> symbol. I haven't done this yet. It simply uses 16 bits in all cases.
> Once I fix this, compression will go up. Also, once the symbol
> dictionary is full, the encoder just resets itself.)
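[Editor's note: the fixed-width scheme with a dictionary reset can be sketched like this. This is a hypothetical minimal encoder written from the description above, not the ToyCompression source; `maxCodes` would be 2^16 for the 16-bit codes mentioned.]

```haskell
import qualified Data.Map as M

-- LZW: grow the current phrase while it is in the dictionary; on a
-- miss, emit the code for the known prefix and add the new phrase.
-- When the dictionary reaches maxCodes entries, reset it, as the
-- post describes.
lzwEncode :: Int -> String -> [Int]
lzwEncode maxCodes = go initial ""
  where
    initial = M.fromList [ ([c], fromEnum c) | c <- ['\0' .. '\255'] ]
    go dict cur []     = [dict M.! cur | not (null cur)]
    go dict cur (c:cs)
      | M.member next dict = go dict next cs
      | otherwise =
          dict M.! cur :
            if M.size dict >= maxCodes
              then go initial [c] cs                           -- reset
              else go (M.insert next (M.size dict) dict) [c] cs
      where next = cur ++ [c]
```

For example, `lzwEncode 65536 "ababab"` emits four codes for six input characters, the last two being the dictionary entry learned for "ab"; with fixed 16-bit codes that is only a win once phrases get long enough, which is why variable-width output codes help.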
>
Good work. Probably worth benchmarking against the other compression
libraries on hackage, just to get a sense for how well your
implementations are optimised (and maybe how to best interact with lazy
bytestrings?). See for example:
zlib http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib-0.3
bzlib http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bzlib-0.3
and also:
lzf http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Codec-Compression-LZF-0.2
-- Don