[Haskell-cafe] Toy compression algorithms [was: A very edgy language]
Andrew Coppin
andrewcoppin at btinternet.com
Mon Jul 9 16:15:07 EDT 2007
Bulat Ziganshin wrote:
> Hello Andrew,
>
> Sunday, July 8, 2007, 7:07:59 PM, you wrote:
>
> I don't think that PPM is so complex - it's just the probability of a
> symbol in some context. It's just too slow in a naive implementation
>
>
Oh, sure, the *idea* is simple enough. Trying to actually *implement* it
correctly is something else... ;-)
(The same statements go for arithmetic coding, really.)
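Just to make the idea concrete, here's a rough sketch (purely illustrative,
and very much the "naive and slow" version being talked about) of an order-1
context model: count how often each byte follows each other byte, and read
P(symbol | context) straight off the counts. A real PPM coder blends several
context orders and handles escape symbols, which is where the fiddly bits live.

import qualified Data.Map as M
import Data.Word (Word8)

type Context = Word8                       -- the single preceding symbol
type Model   = M.Map (Context, Word8) Int  -- joint (context, symbol) counts

-- Build the count table from a byte stream.
buildModel :: [Word8] -> Model
buildModel xs = foldr bump M.empty (zip xs (drop 1 xs))
  where bump pair m = M.insertWith (+) pair 1 m

-- P(symbol | context), with crude +1 smoothing so unseen pairs stay non-zero.
probability :: Model -> Context -> Word8 -> Double
probability m ctx sym =
    fromIntegral (M.findWithDefault 0 (ctx, sym) m + 1)
  / fromIntegral (total + 256)
  where total = sum [ n | ((c, _), n) <- M.toList m, c == ctx ]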
>> (The downside of course is that now we need a Huffman table in the
>> output - and any rare symbols might end up with rather long codewords.
>> But remember: Huffman *guarantees* 0% compression or higher on the
>> payload itself. A Huffman-compressed payload can *never* be bigger, only
>> smaller or same size. So long as you can encode the Huffman table
>> efficiently, you should be fine...)
>>
>
> The devil is in the details. Just imagine the size of a Huffman table with
> 64k entries :) Huffman encoding is inappropriate for LZW output simply
> because most words will have only a few occurrences, and the economy of
> their optimal encoding doesn't justify the price of their entries in the table
>
...which is why you need to "encode the Huffman table efficiently", to
quote myself. ;-)
Using canonical Huffman, you only actually need to know how many bits
were assigned to each symbol. This information is probably very
amenable to RLE. (Which, incidentally, is why I started this whole
"parser on top of a phaser" craziness in the first place.) So, yeah,
there may be 64k symbols - but if only 1k of them are ever *used*... ;-)
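To put some code behind that claim: with canonical Huffman the decoder can
rebuild every codeword from nothing but the per-symbol bit lengths, and the
length table itself (mostly zeros, if only 1k of the 64k symbols ever occur)
is a natural target for RLE. Here's a quick sketch of both halves, assuming
the usual canonical-code convention - my own illustration, not code from any
actual compressor being discussed:

import Data.List (group, sortBy)
import Data.Ord  (comparing)

-- Rebuild canonical codes from (symbol, codeLength) pairs, skipping unused
-- symbols (length 0).  Result: (symbol, length, code value).
canonicalCodes :: [(Int, Int)] -> [(Int, Int, Int)]
canonicalCodes lens = go 0 0 sorted
  where
    sorted = sortBy (comparing (\(s, l) -> (l, s)))
                    [ (s, l) | (s, l) <- lens, l > 0 ]
    go _    _       []            = []
    go code prevLen ((s, l) : xs) =
      let code' = code * 2 ^ (l - prevLen)   -- pad with zero bits on the right
      in  (s, l, code') : go (code' + 1) l xs

-- Run-length encode the raw length table as (value, run length) pairs;
-- a 63k-long run of zeros costs almost nothing to transmit.
rleLengths :: [Int] -> [(Int, Int)]
rleLengths = map (\g -> (head g, length g)) . group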
>> .ru = Russia?
>>
>
> of course
>
My Russian is very rusty. ;-)
>> Oh hey, I think GHC is already pretty smart. But no optimiser can ever
>> hope to cover *every* possible case. And transforming [Bool] -> [Bool]
>> into UArray Word8 -> UArray Word8 just seems a liiiiittle bit
>> optimistic, methinks. ;-)
>>
>
> 15 years ago I wrote a very smart asm program (btw, it was an ARJ
> unpacker) with hand-made function inlining, loop unrolling, register
> allocation, CPU recognition and so on. Now most of these tricks are
> standard for C compilers. Times change, and now it's hard to imagine which
> optimizations will be available 10 years later
>
Yes, but there are limits to what an optimiser can hope to accomplish.
For example, you wouldn't implement a bubble sort and seriously expect
the compiler to be able to "optimise" that into a merge sort, would you? ;-)
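To illustrate how big an ask that [Bool]-to-UArray transformation quoted
above really is, here's the same trivial operation (flip every bit of a
compressed stream) written over both representations - my own sketch,
nothing to do with GHC's actual optimiser. Going from the first to the
second automatically means inventing a whole new data representation, not
just shuffling instructions around:

import Data.Array.IArray (amap)
import Data.Array.Unboxed (UArray)
import Data.Bits (complement)
import Data.Word (Word8)

-- List version: (at least) one heap cell per bit.
invertBits :: [Bool] -> [Bool]
invertBits = map not

-- Packed version: eight bits per Word8, inverted a byte at a time.
-- (Any padding bits in the final byte get flipped too, of course.)
invertPacked :: UArray Int Word8 -> UArray Int Word8
invertPacked = amap complement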
> GHC's native and via-C modes are a choice between blind and lame. In native
> mode, its code generator is comparable with 20-year-old C code generators.
> See above how much modern C compilers have changed in those years. In via-C
> mode it generates unnatural C code which is hard for any C compiler to
> optimize.
I'll take your word for it. ;-)
(I have made cursory attempts to comprehend the inner workings of GHC -
but this is apparently drastically beyond my powers of comprehension.)
> jhc is a very different story
Yes - last I heard, it's an experimental research project rather than a
production-ready compiler...