[Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

Thu May 19 10:53:08 CEST 2011

Hi Simon,

On Wed, May 18, 2011 at 7:32 PM, Simon Meier <iridcode at gmail.com> wrote:
> In fact, one of my current goals with this work is to polish it such
> that it can be integrated into the 'bytestring' library.

We should definitely add a builder monoid in the bytestring package.

Since Write mentions IO, I thought I should point out that we need to
separate any code that mentions IO from the the code that doesn't
(i.e. the pure Builder API). The use of IO is an implementation detail
in bytestring. We should follow the existing bytestring pattern and
put any code that mentions IO in e.g.
Data.ByteString.Lazy.Builder.Internal. This allows the few people who
need to access the internals to do so while making it clear that these
are in fact internals. Long term we'd like to switch bytestring over
from ForeignPtr to ByteArray#, if possible. There are currently some
technical obstacles to such a switch, but factoring out the IO code at
least makes it somewhat easier if we ever get around to switching.

Avoiding IO in the main API means that the main builder type must not
mention IO (or things related to IO, such as Storable).

> The core principle used to tackle (1) is avoiding intermediate data
> structures.  The core abstraction used is the one of a Write (see [1]
> for the corresponding library.)
>
>  data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8))
>
> A value `Write bound io :: Write a` denotes an encoding scheme for
> values of type `a` that uses at most `bound` bytes space. Given a
> values `x :: a` and a pointer `po` to the next free byte `io x po`
> encodes `x` to memory starting from `po` and returns the pointer to
> the next free byte after the encoding of `x`.
>
> In most cases Writes are used as an abstract datatype. They serve as
> an interface between implementors of the low-level bit-twiddling
> required to efficiently implement encodings like UTF-8 or Base16 and
> the providers of efficient traversal functions through streams of
> Haskell values. Hence, typical users of Writes are functions like
>
>  fromWrite          :: Write a -> a -> Builder
>  fromWriteList      :: Write a -> [a] -> Builder
>  fromWriteUnfoldr   :: Write b -> (a -> Maybe (b, a)) -> a -> Builder
>  mapWriteByteString :: Write Word8 -> S.ByteString -> Builder

We want to allow users to efficiently create new builders, for their
own data type. This is crucial as the bytestring package cannot
provide efficient builders for every possible type, as it would have
to depend on most of Hackage (i.e. on all packages that define types
that we want efficient builders for) to do so. Allowing the user to
get hold of the underlying buffer in a controlled way makes the
builder extensible. This is good.

Write achieves this separation, but it has some costs which I'm not
entirely comfortable with.

First, it leads to lots of API duplication. For every type (e.g. Word)
we want to be able serialize we have two morally identical functions

    writeWordhost :: Word -> Write
    fromWordhost :: Word -> Builder

in the API, where the latter simply calls the former and does some
additional "wrapping".

See http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html
for examples.

Simon, is the reason for this duplication this comment on top of
Blaze.ByteString.Builder.Word?

    "Note that for serializing a three tuple (x,y,z) of bytes (or other
    word values) you should use the expression

        fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z

    instead of

        fromWord8 x `mappend` fromWord8 y `mappend` fromWord z

    The first expression will result in a single atomic write of three
    bytes, while the second expression will check for each byte, if
    there is free space left in the output buffer. Coalescing these
    checks can improve performance quite a bit, as long as you use it
    sensibly."

Coalescing of buffer space checks can be achieved without separating
writes into Write and Builder. I've done so in the binary package [1]
using rewrite rules. The rewrite rules fire reliable so that any
"syntactic" series of puts i.e.

f = do
    putWord8 1
    putWord8 2
    putWord8 3

result in one bounds check, followed by three pokes into the buffer.
To do so all that is needed is to define all builders in terms of

    writeAtMost :: Int -> (Ptr Word8 -> IO Int) -> Builder

and create a rewrite rule for append/writeAtMost. writeAtMost is
essentially the same as your Write [2], except it never leads to any
constructors getting allocated.

At the moment, the addition of Write means that

    import Blaze.ByteString.Builder

    f :: [Word8] -> Builder
    f xs = fromWriteList writeWord8 xs

is faster than the Data.Binary equivalent

    import Data.Binary.Builder

    g :: [Word8] -> Builder
    g [] = mempty
    g (x:xs) = singleton x `mappend` g xs

Fortunately this was due to a bug in GHC [3]. After this bug has been
fixed I expect Data.Binary to perform on par with
Blaze.ByteString.Builder, at least for code that involves loops
(blaze-builder might still do a better job with fragmentation).

Cheers,
Johan

1. https://github.com/kolmodin/binary/
2. I stole your idea of taking the maximum number of bytes to write
instead of the actual number of bytes to write. This allows many more
bounds checks to be merged (e.g. when serializing values that use
variable length encodings, such as UTF-8).
3. http://hackage.haskell.org/trac/ghc/ticket/4978