[Haskell-cafe] Toy compression algorithms [was: A very edgy language]

Bulat Ziganshin bulat.ziganshin at gmail.com
Mon Jul 9 03:30:34 EDT 2007

Hello Andrew,

Sunday, July 8, 2007, 7:07:59 PM, you wrote:

> Actually, LZW works surprisingly well for such a trivial little
> algorithm... When you compare it to the complexity of an arithmetic 
> coder driven by a high-order adaptive PPM model (theoretically the best
> general-purpose algorithm that exists), it's amazing that anything so 
> simple as LZW should work at all!

I don't think that PPM is so complex - it's just the probability of a
symbol in some context. It's just too slow in a naive implementation.
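A minimal sketch of that idea - an order-1 model where the "context" is just the previous symbol. The +1 smoothing and the fixed 256-symbol alphabet are my simplifying assumptions, standing in for PPM's real escape mechanism:

```haskell
import qualified Data.Map.Strict as Map

-- How often each symbol follows each context symbol.
type Model = Map.Map (Char, Char) Int

-- Build the model from a sample (order-1: context = previous symbol).
train :: String -> Model
train s = Map.fromListWith (+) [ (p, 1) | p <- zip s (drop 1 s) ]

-- Estimated probability of symbol x in context c. The +1 and the
-- 256-symbol alphabet replace PPM's escape handling for simplicity.
prob :: Model -> Char -> Char -> Double
prob m c x = fromIntegral (count + 1) / fromIntegral (total + 256)
  where
    count = Map.findWithDefault 0 (c, x) m
    total = sum [ n | ((c', _), n) <- Map.toList m, c' == c ]
```

The slowness of a naive implementation shows up right here: `total` rescans the whole model for every prediction, which a real PPM coder avoids with incremental counts.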

>> there is no much meaning to use huffman with lzw because many words
>> will have only one or two occurrences

> (The downside of course is that now we need a Huffman table in the 
> output - and any rare symbols might end up with rather long codewords.
> But remember: Huffman *guarantees* 0% compression or higher on the 
> payload itself. A Huffman-compressed payload can *never* be bigger, only
> smaller or same size. So long as you can encode the Huffman table 
> efficiently, you should be fine...)

The devil is in the details. Just imagine the size of a Huffman table with
64k entries :)  Huffman encoding is inappropriate for LZW output simply
because most words will have only a few occurrences, and the savings from
encoding them optimally doesn't justify the price of their entries in the table.
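To put rough numbers on that argument (the 5-bit code-length field per entry is an illustrative assumption for a canonical Huffman table, not a measurement):

```haskell
-- Cost of shipping the table: one code-length field per entry.
tableCostBits :: Int -> Int -> Int
tableCostBits entries lenBits = entries * lenBits

-- Net bits gained by giving one word an optimal code: bits saved per
-- occurrence, times occurrence count, minus that word's table entry.
netGainBits :: Int -> Int -> Int -> Int
netGainBits occurrences savedPerUse entryBits =
  occurrences * savedPerUse - entryBits
```

With 64k entries at 5 bits each, the table alone costs `tableCostBits 65536 5` = 327,680 bits (40 KiB), and a word seen only once that saves 2 bits per use is a net loss: `netGainBits 1 2 5` = -3.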

> .ru = Russia?

Of course.

>>> (Realistically though. My program takes a [Word8] and turns it into a
>>> [Bool] before running a parser over it. The GHC optimiser doesn't really
>>> stand a hope in hell of optimising that into a program that reads a 
>>> machine word into a CPU register and starts playing with bit flips on it...)
>> as time goes, those terrible compilers becomes smarter and smarter :)

> Oh hey, I think GHC is already pretty smart. But no optimiser can ever
> hope to cover *every* possible case. And transforming [Bool] -> [Bool]
> into UArray Word8 -> UArray Word8 just seems a liiiiittle bit
> optimistic, methinks. ;-)
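For reference, the naive [Word8] -> [Bool] expansion under discussion looks roughly like this (the names are mine, not Andrew's actual code):

```haskell
import Data.Word (Word8)
import Data.Bits (testBit)

-- Expand one byte into its eight bits, most significant first.
toBits :: Word8 -> [Bool]
toBits w = [ testBit w i | i <- [7,6..0] ]

-- The whole-input version: every byte becomes eight boxed Bools in a
-- lazy list - exactly the representation the optimiser would have to
-- undo to get back to register-sized bit operations.
bitstream :: [Word8] -> [Bool]
bitstream = concatMap toBits
```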

15 years ago I wrote a very smart asm program (btw, it was an ARJ
unpacker) with handmade function inlining, loop unrolling, register
allocation, CPU recognition and so on. Now most of these tricks are
standard for C compilers. Times change, and it's hard to imagine which
optimizations will be available 10 years from now.

> The way I heard it is that it *does* native code generation, but it's
> "not as good" as the via-C version. (Can't imagine why... You would have
> thought directly implementing functional primitives would be much easier
> than trying to mangle them to fit C's multitude of limitations. Still,
> what do I know?)

GHC's native and via-C modes are blind vs lame. In native mode, its
code generator is comparable to 20-year-old C code generators - see
above how much modern C compilers have changed in those years. In via-C
mode it generates unnatural C code which is hard for any C compiler to
optimize. JHC is a very different story - it generates natural C code,
so for simple, "low-level" programs its speed is the same as C's. But
it lacks GHC's huge base of high-level optimizations. As you may guess,
in order to have C-like speed, a compiler should implement both good
high-level and good low-level optimizations. There are (low-priority)
plans to provide an LLVM backend for GHC, which may solve this problem.
But actually, the speed of generated code isn't the number one priority
for GHC users.

Best regards,
 Bulat                            mailto:Bulat.Ziganshin at gmail.com
