[Haskell-cafe] Toy compression algorithms [was: A very edgy language]

Andrew Coppin andrewcoppin at btinternet.com
Sun Jul 8 11:07:59 EDT 2007


Bulat Ziganshin wrote:
> Hello Andrew,
>
> Sunday, July 8, 2007, 3:10:04 PM, you wrote:
>   
>> Indeed. I'm more interested in which algorithms produce the best
>> compression ratios than how to implement them fast.
>>     
>
> algorithms you have tried so far (except for bwt+mtf) are the standard
> student's set :)  of those, lzw is the most advanced, but even this
> algorithm is 20 years old
>   

Yeah, well, naturally I'm implementing the simple ones first. ;-)

Actually, LZW works surprisingly well for such a trivial little 
algorithm... When you compare it to the complexity of an arithmetic 
coder driven by a high-order adaptive PPM model (theoretically the best 
general-purpose algorithm that exists), it's amazing that anything so 
simple as LZW should work at all!

>> Currently the code
>> uses very general, flexible, polymorphic types all over the place, 
>> everything is normal lists, I'm using monadic parsing rather than 
>> bit-twiddling, etc etc etc. It goes alright with 100 KB files on my
>>     
>
> testing today's algorithms on such small datasets doesn't make much
> sense because they need much more input to gather enough statistics
> about the data being compressed. i prefer hundreds-of-megabytes tests
> and don't know anyone with test files smaller than 1 mb
>   

OTOH, I'd like it to finish this month...

(LZW is faster now I'm using Data.Map. But even so, 15 minutes to 
compress 640 KB of zeros? That's pretty slow!)
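
For the curious, the general shape of the thing is roughly this - just a
sketch along the lines of what I'm doing, not my actual code, with the
output left as plain Ints and no variable-width bit packing:

    import qualified Data.Map as M
    import Data.Word (Word8)

    -- Sketch of LZW over bytes: the dictionary maps each phrase seen so
    -- far to an Int code; new phrases get the next free code.
    lzw :: [Word8] -> [Int]
    lzw = go initial (M.size initial) []
      where
        initial = M.fromList [([b], fromIntegral b) | b <- [0 .. 255]]

        go dict _    phrase []     =
          case M.lookup phrase dict of
            Just code -> [code]
            Nothing   -> []          -- only happens for empty input
        go dict next phrase (b:bs) =
          let phrase' = phrase ++ [b]
          in case M.lookup phrase' dict of
               Just _  -> go dict next phrase' bs
               Nothing ->
                 -- phrase is always in the dictionary at this point
                 let Just code = M.lookup phrase dict
                 in code : go (M.insert phrase' next dict) (next + 1) [b] bs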

> btw, in most cases, a C algorithm with today's C compilers will work
> faster than an asm one - because you need very good knowledge of
> CPU internals to write an efficient program. i dream about the day when
> the same will be true for Haskell
>   

We all do... *sigh*

(Or rather, most of us dream. A tiny few people seem to be trying to 
actually do stuff about it...)

> i've never heard of lzw+huf compressors :)  there are two lz families
> - lz78 (lzw), which was popular in the 80's and used variable-size words,
> and lz77 (lzss), which together with huffman was the most popular in the
> 90's. compress (.Z) and pkzip 1 were lzw, while zip is lzh (i.e.
> lz77+huffman)
>
> there is not much point in using huffman with lzw because many words
> will have only one or two occurrences
>   

Hmm... that's interesting. I was *sure* it said that LHA used 
LZW+Huffman... oh well!

You say that there would be "no meaning" to using a Huffman code to 
encode the output from an LZW stage. But consider this: The standard 
approach is for the LZW encoder to spit out 9 bits/symbol initially, and 
then 10 bits/symbol once the table fills up a bit more, and then 11 
bits/symbol after that, and so on. (Usually until some hard-coded limit 
is reached. Typically the encoder just resets itself at that point or 
something.) So we're allocating more and more bits per symbol, but this 
is based only on how full the table is, *regardless* of whether any of 
these new symbols are even in fact *used*. ;-) Now, by using a Huffman 
code (scanning the actual produced output rather than working 
adaptively) we can avoid assigning bit combinations to unused symbols.
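
Something along these lines is what I have in mind - just a sketch with
made-up names, computing Huffman codeword lengths for only those codes the
LZW stage actually emitted:

    import qualified Data.Map as M
    import Data.List (insertBy, sortBy)
    import Data.Ord (comparing)

    data HTree = Leaf Int | Node HTree HTree

    -- Sketch: Huffman codeword lengths for whichever LZW codes actually
    -- occur in the output.  (Assumes a non-empty code stream.)
    huffmanLengths :: [Int] -> M.Map Int Int    -- LZW code -> bits
    huffmanLengths codes = lengths 0 (build queue) M.empty
      where
        freqs = M.toList (M.fromListWith (+) [(c, 1 :: Int) | c <- codes])
        queue = sortBy (comparing fst) [(f, Leaf c) | (c, f) <- freqs]

        -- repeatedly merge the two least-frequent trees
        build [(_, t)]                   = t
        build ((f1, t1):(f2, t2):rest)   =
          build (insertBy (comparing fst) (f1 + f2, Node t1 t2) rest)
        build []                         = error "huffmanLengths: no codes"

        lengths d (Leaf c)   acc = M.insert c (max 1 d) acc
        lengths d (Node l r) acc = lengths (d + 1) l (lengths (d + 1) r acc)

From those lengths you can derive a canonical code, which also keeps the
table you have to ship reasonably small.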

(The downside of course is that now we need a Huffman table in the 
output - and any rare symbols might end up with rather long codewords. 
But remember: relative to a fixed-width code, Huffman *guarantees* 0% 
compression or better on the payload itself. A Huffman-coded payload can 
*never* be bigger than the fixed-width version, only the same size or 
smaller. So long as you can encode the Huffman table 
efficiently, you should be fine...)

> btw, i invite you to the http://encode.ru/forums where you will find
> many compressor authors. you will be the second one there who uses
> haskell and the first one who writes compression algorithms in it :)
>   

.ru = Russia?

>> OTOH, if the Gods behind GHC want to take a look and figure out whether
>> there's any magic that can be added to the optimiser, I wouldn't 
>> complain... ;-)
>>     
>> (Realistically though. My program takes a [Word8] and turns it into a 
>> [Bool] before running a parser over it. The GHC optimiser doesn't really
>> stand a hope in hell of optimising that into a program that reads a 
>> machine word into a CPU register and starts playing with bit flips on it...)
>>     
>
> as time goes by, those terrible compilers become smarter and smarter :)
>   

Oh hey, I think GHC is already pretty smart. But no optimiser can ever 
hope to cover *every* possible case. And transforming [Bool] -> [Bool] 
into UArray Word8 -> UArray Word8 just seems a liiiiittle bit 
optimistic, methinks. ;-)
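
For context, the [Word8] -> [Bool] step I mentioned above is essentially
this - a sketch, assuming most-significant-bit-first order:

    import Data.Bits (testBit)
    import Data.Word (Word8)

    -- Sketch: explode each byte into eight Bools, most significant bit
    -- first, ready to be fed to the monadic bit parser.
    toBits :: [Word8] -> [Bool]
    toBits = concatMap (\w -> [testBit w i | i <- [7, 6 .. 0]])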

>
>> PPS. Does GHC make use of MMX, SSE, et al?
>>     
>
> it can do that with via-C compilation, and there are plans for the
> native code generator too.

The way I heard it is that it *does* native code generation, but it's 
"not as good" as the via-C version. (Can't imagine why... You would have 
thought directly implementing functional primitives would be much easier 
than trying to mangle them to fit C's multitude of limitations. Still, 
what do I know?)

> but it seems that you don't know asm and believe the
> advertising hype. i'm pretty sure that mmx can't help with the algorithms
> you are implementing, and you are very far from the level of performance
> where using mmx would give any improvement
>   

MMX is probably useless. Some of the streaming stuff (with all the CPU 
cache twiddling and so on) might be useful for list processing. Frankly, 
it *all* seems so low-level that I can't begin to imagine how you could 
ever take advantage of this stuff without writing raw assembly by hand...

(I read the IA32 manual one time. It was a long time ago though. I don't 
remember lots of detail. As for whether I "know assembly"... I've programmed 
the 6502 and the M68000, although not extensively. My point is that I 
know my way round the inside of a von Neumann computer, that's all... ;-)


