[Haskell-cafe] Toy compression algorithms [was: A very edgy
language]
Andrew Coppin
andrewcoppin at btinternet.com
Sun Jul 8 04:42:37 EDT 2007
Andrew Coppin wrote:
> Dave Bayer wrote:
>> I was beginning to accept that I might die before clearing my
>> pipeline of research projects I want to code up. Haskell has given me
>> new hope.
>
> Indeed. ;-)
>
> Today I hve implemented encoders and decoders for RLE, MTF, Fibonacci
> codes, and LZW. Next on my list is BWT, Huffman codes and arithmetic
> coding. (That last is *very* hard though...)
In case anybody cares...
darcs get http://www.orphi.me.uk/darcs/ToyCompression
ghc -O2 --make Encode
ghc -O2 --make Decode
You can now do
Encode LZW foo
Will look for a file named "foo", and generate a file called "foo-LZW"
containing the LZW-compressed version of it. Also "foo-LZW.log.txt",
which contains some statistics.
Similarly,
Decode LZW foo-LZW
will make a file "foo-LZW-unLZW", which should *hopefully* be identical
to the original one. (!)
It didn't occur to me, but I should probably go add some way to discover
a list of available algorithms! LOL. Anyway, currently working:
RLE (Run Length Encoding). Counts runs of symbols. Considers 8 bits
to be one "symbol". The maximum run length is 255.
MTF (Move To Front). Takes an input file and produces an output file
of exactly the same size, but containing mostly *low* numbers. Works
well with RLE or Fib.
BWT (Burners-Wheeler Transform). Like MTF, does no actual
compression, but makes the file more "compressible".
Fib (Fibonacci codes). Stores numbers using Fibonacci codes instead
of unsigned binary. This makes low numbers smaller and high numbers
bigger. (Use it with MTF...)
LZW (Lempel-Ziv-Welch). The daddy. Looks for repeated input strings
and compresses them.
Caution: Encoding or decoding BWT currently requires absurd amounts of
time and space. (Like, > 2 GB RAM and 8 minutes wall time to process 12
KB of text.) I hope to fix that soon...
Currently it's unclear whether LZW or BWT+MTF+Fib gives the best
compression. (Mainly because BWT is so hard to run!) I hope to implement
a few other algorithms soonly.
(Note: LZW is typically used with a variable number of bits per output
symbol. I haven't done this yet. It simply uses 16 bits in all cases.
Once I fix this, compression will go up. Also, once the symbol
dictionary is full, the encoder just resets itself.)
More information about the Haskell-Cafe
mailing list