[Haskell-cafe] ANNOUNCE: zlib and bzlib 0.5 releases

Sun Nov 2 10:46:00 EST 2008

I'm pleased to announce updates to the zlib and bzlib packages.

The releases are on Hackage:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bzlib

What's new
==========

What's new in these releases is that the extended API is slightly nicer.
The simple API that most packages use is unchanged.

In particular, these functions have different types:
compressWith   :: CompressParams   -> ByteString -> ByteString
decompressWith :: DecompressParams -> ByteString -> ByteString

The CompressParams and DecompressParams types are records of
compression/decompression parameters. The functions are used like so:

compressWith   defaultCompressParams { ... }
decompressWith defaultDecompressParams { ... }
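
As a minimal sketch of the record-update style, the snippet below changes
one field and leaves the rest at their defaults. The field name
compressLevel and the value bestCompression are taken from the zlib
package documentation for later versions and are an assumption here; the
exact names may differ in this release. compressBest is a hypothetical
helper name.

```haskell
import qualified Data.ByteString.Lazy as Lazy
import qualified Codec.Compression.GZip as GZip

-- Hypothetical helper: compress at the highest compression level,
-- leaving every other parameter at its default value.
-- Assumes the compressLevel field and bestCompression value exist
-- under these names in the installed zlib version.
compressBest :: Lazy.ByteString -> Lazy.ByteString
compressBest =
  GZip.compressWith GZip.defaultCompressParams
    { GZip.compressLevel = GZip.bestCompression }
```

Because only the named field is overridden, code written this way should
keep compiling if further parameters are added to the record later.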

There is also a new parameter to control the size of the first output
buffer. This lets applications save memory when they happen to have a
good estimate of the output size (some apps like darcs know this
exactly). With a good estimate, the whole output fits in a single-chunk
lazy bytestring, which apps can then convert to a strict bytestring
with no extra copying cost.
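
A sketch of how an application might use this, assuming the parameter is
the decompressBufferSize field (the name used in later zlib
documentation; treat it as an assumption for this release).
decompressStrict is a hypothetical helper, and the size estimate is
expected to be a positive number of bytes supplied by the caller.

```haskell
import qualified Data.ByteString as Strict
import qualified Data.ByteString.Lazy as Lazy
import qualified Codec.Compression.GZip as GZip

-- Hypothetical helper: decompress gzip data into a strict ByteString,
-- given an estimate (in bytes) of the decompressed size.
-- Assumes the buffer-size parameter is called decompressBufferSize.
decompressStrict :: Int -> Lazy.ByteString -> Strict.ByteString
decompressStrict sizeEstimate input =
  -- If the output arrived as a single chunk, this concat has only one
  -- element to deal with; otherwise it copies the chunks together.
  Strict.concat (Lazy.toChunks output)
  where
    output =
      GZip.decompressWith
        GZip.defaultDecompressParams
          { GZip.decompressBufferSize = sizeEstimate }
        input
```

If the estimate is too small the result is still correct; the output is
just spread over more than one chunk and the final conversion has to copy.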

Future directions
=================

The simple API is very unlikely to change.

The current error handling for decompression is not ideal. It just
throws exceptions for failures like bad format or unexpected end of
stream. This is a tricky area because streaming behaviour does not
mix easily with error handling.

One option, which I use in the iconv library, is to have a data type
describe the real error conditions, something like:

data DataStream = Chunk Strict.ByteString Checksum DataStream
                | Error Error -- for some suitable error type
                | End Checksum

This would come with suitable fold functions and functions to convert
to a lazy ByteString. Then people who care about error handling and
streaming behaviour can use that type directly. For example it should be
trivial to convert to an iterator style.
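
To make the idea concrete, here is a rough sketch of what such a fold
and the conversions could look like. None of this is in the released
packages: the Checksum and Error definitions and the names
foldDataStream, toLazyByteString and toEither are all placeholders
chosen for illustration.

```haskell
import qualified Data.ByteString as Strict
import qualified Data.ByteString.Lazy as Lazy
import Data.Word (Word32)

-- Placeholder types, assumed only for this sketch.
type Checksum = Word32
data Error = TruncatedInput | BadFormat String
  deriving (Eq, Show)

data DataStream = Chunk Strict.ByteString Checksum DataStream
                | Error Error
                | End Checksum

-- Right fold over the stream, one handler per constructor.
foldDataStream :: (Strict.ByteString -> Checksum -> a -> a)
               -> (Error -> a)
               -> (Checksum -> a)
               -> DataStream -> a
foldDataStream chunk err end = go
  where
    go (Chunk bs c rest) = chunk bs c (go rest)
    go (Error e)         = err e
    go (End c)           = end c

-- Streaming conversion: chunks are produced lazily, and an error in
-- the stream surfaces as an exception when that point is demanded.
toLazyByteString :: DataStream -> Lazy.ByteString
toLazyByteString =
  Lazy.fromChunks
    . foldDataStream (\bs _ rest -> bs : rest)
                     (\e -> error ("decompression failed: " ++ show e))
                     (const [])

-- Non-streaming conversion: reports errors as values, but has to walk
-- the whole stream before it can return anything.
toEither :: DataStream -> Either Error Lazy.ByteString
toEither =
  fmap Lazy.fromChunks
    . foldDataStream (\bs _ rest -> fmap (bs :) rest)
                     Left
                     (const (Right []))
```

The two conversions show the trade-off described above: the first keeps
streaming and falls back to exceptions, the second gives proper error
values and gives up streaming.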

People have also asked for a continuation-style API to give more control
over dynamic behaviour like flushing the compression state (e.g. in an
HTTP server). Unfortunately this does not look easy. The zlib state is
mutable and while this can be hidden in a lazy list, it cannot be hidden
if we provide access to intermediate continuations. That is because
those continuations can be re-run whereas a lazy list evaluates each
element at most once (and with suitable internal locking this is even
true for SMP).

Background
==========

The zlib and bzlib packages provide functions for compression and
decompression in the gzip, zlib and bzip2 formats. Both provide pure
functions on streams of data represented by lazy ByteStrings:

compress, decompress :: ByteString -> ByteString

This makes it easy to use either in memory or with disk or network IO.
For example a simple gzip compression program is just:

> import qualified Data.ByteString.Lazy as ByteString
> import qualified Codec.Compression.GZip as GZip
>
> main = ByteString.interact GZip.compress

Or you could lazily read in and decompress a .gz file using:

> content <- GZip.decompress <$> ByteString.readFile file

General
=======

Both packages are bindings to the corresponding C libs, so they depend
on those external C libraries (except on Windows where we build a
bundled copy of the C lib source code). The compression speed is as you
would expect since it's the C lib that is doing all the work.

The zlib package is used in cabal-install to work with .tar.gz files. So
it has actually been tested on Windows. It works with all versions of
ghc since 6.4 (though it requires Cabal-1.2).

The darcs repos for the development versions live on code.haskell.org.

I'm very happy to get feedback on the API, the documentation or of
course any bug reports.

Duncan