[Haskell-cafe] Abstraction leak

Fri Jun 29 14:39:28 EDT 2007

...and again today I found myself trying to do something that would be 
very easy in an imperative language, but I cannot think of a single good 
way of doing it in Haskell. Hopfully somebody can give me some hints.

I'm writing a whole bunch of data compression programs. In particular, 
one of them was a Huffman encoder. The encoder scans the input file, 
computes symbol frequencies, and builds a canonical Huffman table, which 
it then dumps to file. The actual compressed data goes after that.

The encoder consists of two functions. One takes the Huffman table and 
serializes it. The other does the actual encoding. The (++) operator 
joins the two together before going to file. Getting the data back is 
mildly more tricky. There's a decoder function that takes the complete 
data stream, parses out the Huffman table, and returns a pair containing 
the Huffman table and the remaining data. Then a second decoder function 
actually does the Huffman decoding on this data.

So far, it's all pretty easy. The size of the Huffman table determins 
how many bytes the decoder needs to read, so it's easy to figure out 
where the table ends.

And then I had a brainwave. I altered the encoder so that the output 
from the Huffman table serialize function codes through a simple RLE 
encoder before going to file. Then a went and altered the decoder to... 
uh... oops.

Now I have a problem. It's easy enough to pass the entire data stream 
through an RLE decoder and feed that to the Huffman table deserialize 
function, and it will give be back the table. But I now have *no clue* 
where the table ends in the original stream!

If this was Java, you would write all these compression and 
decompression stages as stream wrappers. So you wrap the raw input 
stream with an RLE decoder, have the function read the Huffman table, 
and then take off the RLE decoder and process the rest of the stream.

However, in Haskell, the reading and deserializing of the Huffman table 
doesn't so anything detectable to the original data stream, so you can't 
tell how much of it the RLE decoder actually processed. Um... help!

(What I did in the end was to write the size of the RLE encoded Huffman 
table to the file and read it back in the decoder. But this shouldn't 
actually be necessary...)