[Haskell-cafe] Toy compression algorithms [was: A very edgy language]

Thu Jul 12 04:16:22 EDT 2007

G'day.

Quoting Bulat Ziganshin <bulat.ziganshin at gmail.com>:

> well, ppm differs from simple order-0 coder in that it uses previous
> symbols to calculate probability. lzw differs from order-0 coder with
> flat static probabilities in that it encodes a whole word each time.

LZW doesn't have the concept of a "word" at all.

Both PPM and LZW build up contexts as they go by starting with one
context for each symbol in the input alphabet, and by extending
contexts with single symbols as they are found.  (Conceptually; the
algorithms are slightly different, but they do the same job.)

The difference has to do, almost completely, with coding.  LZW uses a
degenerate model to decide what the next input symbol is.  It assumes
that all "seen" contexts are equally likely, and hence uses a binary
code to encode them.  (A "binary" code represents the numbers 0 to k-1
using k codewords each of which is (log k) bits in length.)

PPM uses a different coding scheme, using the observed frequency of each
context to drive an arithmetic coder.

> there is nothing common between ppm and lzw except for this basic
> order-0 model, so i don't see any reasons to call lzw a sort of ppm.

There are a lot of differences between PPM and LZW, but they're based
around the same principle: Build contexts based on what you've seen so
far, and use that to predict what the next symbol should be.

> it will be more fair to say that lzw is order-0 coder with static flat
> probabilities, you agree?

That's the _coding_ phase, sure.

All known text compression algorithms basically work by building a
model of the text, to try to predict the next symbol, and then using
that to drive a coder which represents what was actually found using some
bit representation.

PPM and LZW use very different coders, but their models are quite similar.

> you said about algorithm that efficiently encodes english words using
> huffman codes. is such algorithm really exists?

Ask Google about "word-based huffman".

Cheers,
Andrew Bromage