[Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

Wed May 18 19:32:26 CEST 2011

Hello Haskell-Cafe,

my main question is whether requiring FlexibleInstances is a problem
for code that aims to become part of the Haskell platform. The
following explanation gives the context for this question.

As some of you may know the blaze-builder library is now used in quite
a few places. That's nice, but it doesn't mean that blaze-builder is a
finished solution to the problem of providing an API for
high-performance buffered output (creation of chunked representations)
of sequences of bytes.

In fact, one of my current goals with this work is to polish it such
that it can be integrated into the 'bytestring' library. This has the
benefit that functions that create lazy bytestrings (e.g., pack, map,
unfoldr, filter) can be implemented such that they create well-sized
chunks even if the argument bytestring is hugely fragmented. Moreover,
this integration also establishes a single builder type as the output
representation. Therefore, other creators of bytestrings (e.g.,
'text', 'base16-bytestring', 'zlib') can provide results of type
Builder, which enables O(1) appends of their results and the
preservation of well-sizedness of the created chunks.

As part of this goal, I'm currently working on the first of the
following three points that are paramount to achieving great encoding
performance:

  1. Ensure that individual Haskell values are encoded with minimal overhead.
  2. Ensure that concatenation of sequences of bytes is efficient.
  3. Ensure that the average chunk size is large.

The core principle used to tackle (1) is avoiding intermediate data
structures.  The core abstraction used is the one of a Write (see [1]
for the corresponding library.)

  data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8))

A value `Write bound io :: Write a` denotes an encoding scheme for
values of type `a` that uses at most `bound` bytes space. Given a
values `x :: a` and a pointer `po` to the next free byte `io x po`
encodes `x` to memory starting from `po` and returns the pointer to
the next free byte after the encoding of `x`.

In most cases Writes are used as an abstract datatype. They serve as
an interface between implementors of the low-level bit-twiddling
required to efficiently implement encodings like UTF-8 or Base16 and
the providers of efficient traversal functions through streams of
Haskell values. Hence, typical users of Writes are functions like

  fromWrite          :: Write a -> a -> Builder
  fromWriteList      :: Write a -> [a] -> Builder
  fromWriteUnfoldr   :: Write b -> (a -> Maybe (b, a)) -> a -> Builder
  mapWriteByteString :: Write Word8 -> S.ByteString -> Builder

They consume the given datastructure efficiently and wrap the Writes
in the bounds checking code to detect when a buffer is full and
request a new one.

There are many providers of Writes. Each bounded-length-encoding of a
standard Haskell value is likely to have a corresponding Write. For
example, encoding an Int32 as a big-endian, little-endian, and
host-endian byte-sequence is currently achieved with the following
three functions.

  writeInt32BE :: Write Int32
  writeInt32LE :: Write Int32
  writeInt32HE :: Write Int32

I would like to avoid naming all these encodings individually.
Especially, as the situation becomes worse for more elaborate
encodings like hexadecimal encodings. There, we encounter encodings
like the utf8-encoding of the hexadecimal-encoding with lower-case
letters of an Int32.

  writeInt32HexLowerUtf8 :: Write Int32

I really don't like that. Therefore, I'm thinking about the following
solution based on type-classes. We introduce a single typeclass

  class Writable a where
      write :: Write a

and use a bunch of newtypes to denote our encodings.

  newtype Ascii7   a = Ascii7   { unAscii7   :: a }
  newtype Utf8     a = Utf8     { unUtf8     :: a }
  newtype HexUpper a = HexUpper { unHexUpper :: a }
  newtype HexLower a = HexLower { unHexLower :: a }
  ...

Assuming FlexibleInstnaces, we can write encodings like the above
hex-encoding as instances

  instance Write (Utf8 (HexLower Int32)) where
    write = ...

This composes rather nicely and allows the implementations to exploit
special properties of the involved data. For example, if we also had a
HTML escaping marker

  newtype Html     a = Html     { unHtml     :: a }

Then, the instance

  instance Write (Utf8 (HTML (HexLower Int32))) where
    write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i))

exploits that no HTML escaping is required for a hex-number.  Assuming
FlexibleContexts, the user can also build abbreviations for builders
using fixed encodings.

  utf8 :: Writable (Utf8 a) => a -> Builder
  utf8 = fromWrite write . Utf8

Note that, on the Builder level, a probably better way would be to
have an analogous 'ToBuilder' typeclass to abstract the various
encodings. Part of these instances then reuse the corresponding
instances from Writable.

I think this type-class based interface to select the correct
efficient implementation of an encoding is rather nice. However, I
don't know if 'FlexibleInstances' and 'FlexibleContexts' are fine to
use in code that aims to become part of the Haskell platform.
Moreover, I might well overlook some drawbacks of this design.

I'm looking forward to your feedback.

best regards,
Simon

[1] https://github.com/meiersi/system-io-write