[Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

Thu May 19 22:46:36 CEST 2011

Hi Johan,

thanks for the extensive and motivating feedback.

2011/5/19 Johan Tibell <johan.tibell at gmail.com>:
> On Wed, May 18, 2011 at 7:32 PM, Simon Meier <iridcode at gmail.com> wrote:
>> In fact, one of my current goals with this work is to polish it such
>> that it can be integrated into the 'bytestring' library.
>
> We should definitely add a builder monoid in the bytestring package.
>
> Since Write mentions IO, I thought I should point out that we need to
> separate any code that mentions IO from the the code that doesn't
> (i.e. the pure Builder API). The use of IO is an implementation detail
> in bytestring. We should follow the existing bytestring pattern and
> put any code that mentions IO in e.g.
> Data.ByteString.Lazy.Builder.Internal. This allows the few people who
> need to access the internals to do so while making it clear that these
> are in fact internals. Long term we'd like to switch bytestring over
> from ForeignPtr to ByteArray#, if possible. There are currently some
> technical obstacles to such a switch, but factoring out the IO code at
> least makes it somewhat easier if we ever get around to switching.
>
> Avoiding IO in the main API means that the main builder type must not
> mention IO (or things related to IO, such as Storable).

I completely agree with you. The system-io-write library [1] and the
bytestring fork [2] I'm working on provide separate interfaces for
standard and expert users. The naming of the system-io-write library
is tentative and can be adapted once it's place is clear.

>> The core principle used to tackle (1) is avoiding intermediate data
>> structures.  The core abstraction used is the one of a Write (see [1]
>> for the corresponding library.)
>>
>>  data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8))
>>
>> A value `Write bound io :: Write a` denotes an encoding scheme for
>> values of type `a` that uses at most `bound` bytes space. Given a
>> values `x :: a` and a pointer `po` to the next free byte `io x po`
>> encodes `x` to memory starting from `po` and returns the pointer to
>> the next free byte after the encoding of `x`.
>>
>> In most cases Writes are used as an abstract datatype. They serve as
>> an interface between implementors of the low-level bit-twiddling
>> required to efficiently implement encodings like UTF-8 or Base16 and
>> the providers of efficient traversal functions through streams of
>> Haskell values. Hence, typical users of Writes are functions like
>>
>>  fromWrite          :: Write a -> a -> Builder
>>  fromWriteList      :: Write a -> [a] -> Builder
>>  fromWriteUnfoldr   :: Write b -> (a -> Maybe (b, a)) -> a -> Builder
>>  mapWriteByteString :: Write Word8 -> S.ByteString -> Builder
>
> We want to allow users to efficiently create new builders, for their
> own data type. This is crucial as the bytestring package cannot
> provide efficient builders for every possible type, as it would have
> to depend on most of Hackage (i.e. on all packages that define types
> that we want efficient builders for) to do so. Allowing the user to
> get hold of the underlying buffer in a controlled way makes the
> builder extensible. This is good.
>
> Write achieves this separation, but it has some costs which I'm not
> entirely comfortable with.
>
> First, it leads to lots of API duplication. For every type (e.g. Word)
> we want to be able serialize we have two morally identical functions
>
>    writeWordhost :: Word -> Write
>    fromWordhost :: Word -> Builder
>
> in the API, where the latter simply calls the former and does some
> additional "wrapping".

Yes, I agree with this duplication. I'll explain below what we gain
from it. Note that I factored out the whole Write stuff into its own
library (system-io-write) for the bytestring integration. Therefore,
an end-user of bytestring will only see the Builder versions except
he's doing more low-level stuff to gain some extra performance.

> See http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html
> for examples.
>
> Simon, is the reason for this duplication this comment on top of
> Blaze.ByteString.Builder.Word?
>
>    "Note that for serializing a three tuple (x,y,z) of bytes (or other
>    word values) you should use the expression
>
>        fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z
>
>    instead of
>
>        fromWord8 x `mappend` fromWord8 y `mappend` fromWord z
>
>    The first expression will result in a single atomic write of three
>    bytes, while the second expression will check for each byte, if
>    there is free space left in the output buffer. Coalescing these
>    checks can improve performance quite a bit, as long as you use it
>    sensibly."

That's one of the reasons, but not the main one. The core reason is
that Write's provide
an interface between implementors of the low-level bit-twiddling
required to efficiently implement encodings like UTF-8 or Base16 and
the providers of efficient traversal functions through (streams of)
Haskell values. For simple traversals like

  fromWrite          :: Write a -> a -> Builder
  fromWriteList      :: Write a -> [a] -> Builder
  fromWriteUnfoldr   :: Write b -> (a -> Maybe (b, a)) -> a -> Builder

there might be the option that GHC is clever enough and can find the
efficient loop. However, for more complicated functions like

  mapWriteByteString :: Write Word8 -> S.ByteString -> Builder

That certainly isn't the case. I'm using quite a few tricks there [3]
to enable a tight inner loop with few live variables.

> Coalescing of buffer space checks can be achieved without separating
> writes into Write and Builder. I've done so in the binary package [1]
> using rewrite rules. The rewrite rules fire reliable so that any
> "syntactic" series of puts i.e.
>
> f = do
>    putWord8 1
>    putWord8 2
>    putWord8 3
>
> result in one bounds check, followed by three pokes into the buffer.
> To do so all that is needed is to define all builders in terms of
>
>    writeAtMost :: Int -> (Ptr Word8 -> IO Int) -> Builder
>
> and create a rewrite rule for append/writeAtMost. writeAtMost is
> essentially the same as your Write [2], except it never leads to any
> constructors getting allocated.

Hmm, as far as I see it `writeAtMost` corresponds in blaze-builder to
implementing every Builder with `fromWrite <someWrite>`. I don't think
that is a good idea because (1) there are many more use-cases for
Writes than just sequencing and (2) I don't think GHC is clever enough
to optimize all of these use cases well enough. For a hard use-case,
have a look at the implementation of `mapWriteByteString` [3]. There,
I'm using the static bound on the size of the written data to upper
bound the number of input bytes I can process without checking the
output bound.

In my opinion, Writes and Builders have different use-cases and
different semantics. Providing a type modeling Writes makes therefore
sense to me. Moreover, note that Writes are built as a compile time
abstraction. All their definitions are intended to be completely
inlined and care is taken that the inliner also does so. Therefore,
they incur no runtime cost.

> At the moment, the addition of Write means that
>
>    import Blaze.ByteString.Builder
>
>    f :: [Word8] -> Builder
>    f xs = fromWriteList writeWord8 xs
>
> is faster than the Data.Binary equivalent
>
>    import Data.Binary.Builder
>
>    g :: [Word8] -> Builder
>    g [] = mempty
>    g (x:xs) = singleton x `mappend` g xs
>
> Fortunately this was due to a bug in GHC [3]. After this bug has been
> fixed I expect Data.Binary to perform on par with
> Blaze.ByteString.Builder, at least for code that involves loops
> (blaze-builder might still do a better job with fragmentation).

OK. That sounds interesting. Adding rewriting rules that coalesce
adjacent calls of `fromWrite` might be worth investigating, as they'd
improve the user experience. However, as explained above Writes have
uses cases beyond this optimization. Their main use case being an
abstraction of bounded-size encodings.

> 1. https://github.com/kolmodin/binary/
> 2. I stole your idea of taking the maximum number of bytes to write
> instead of the actual number of bytes to write. This allows many more
> bounds checks to be merged (e.g. when serializing values that use
> variable length encodings, such as UTF-8).
> 3. http://hackage.haskell.org/trac/ghc/ticket/4978

best regards,
Simon

[1] https://github.com/meiersi/system-io-write
[2] https://github.com/meiersi/bytestring
[3] https://github.com/meiersi/bytestring/blob/master/Data/ByteString/Builder/Write.hs#L112