[Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform
iridcode at gmail.com
Thu May 19 22:46:36 CEST 2011
thanks for the extensive and motivating feedback.
2011/5/19 Johan Tibell <johan.tibell at gmail.com>:
> On Wed, May 18, 2011 at 7:32 PM, Simon Meier <iridcode at gmail.com> wrote:
>> In fact, one of my current goals with this work is to polish it such
>> that it can be integrated into the 'bytestring' library.
> We should definitely add a builder monoid in the bytestring package.
> Since Write mentions IO, I thought I should point out that we need to
> separate any code that mentions IO from the the code that doesn't
> (i.e. the pure Builder API). The use of IO is an implementation detail
> in bytestring. We should follow the existing bytestring pattern and
> put any code that mentions IO in e.g.
> Data.ByteString.Lazy.Builder.Internal. This allows the few people who
> need to access the internals to do so while making it clear that these
> are in fact internals. Long term we'd like to switch bytestring over
> from ForeignPtr to ByteArray#, if possible. There are currently some
> technical obstacles to such a switch, but factoring out the IO code at
> least makes it somewhat easier if we ever get around to switching.
> Avoiding IO in the main API means that the main builder type must not
> mention IO (or things related to IO, such as Storable).
I completely agree with you. The system-io-write library  and the
bytestring fork  I'm working on provide separate interfaces for
standard and expert users. The naming of the system-io-write library
is tentative and can be adapted once it's place is clear.
>> The core principle used to tackle (1) is avoiding intermediate data
>> structures. The core abstraction used is the one of a Write (see 
>> for the corresponding library.)
>> data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8))
>> A value `Write bound io :: Write a` denotes an encoding scheme for
>> values of type `a` that uses at most `bound` bytes space. Given a
>> values `x :: a` and a pointer `po` to the next free byte `io x po`
>> encodes `x` to memory starting from `po` and returns the pointer to
>> the next free byte after the encoding of `x`.
>> In most cases Writes are used as an abstract datatype. They serve as
>> an interface between implementors of the low-level bit-twiddling
>> required to efficiently implement encodings like UTF-8 or Base16 and
>> the providers of efficient traversal functions through streams of
>> Haskell values. Hence, typical users of Writes are functions like
>> fromWrite :: Write a -> a -> Builder
>> fromWriteList :: Write a -> [a] -> Builder
>> fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder
>> mapWriteByteString :: Write Word8 -> S.ByteString -> Builder
> We want to allow users to efficiently create new builders, for their
> own data type. This is crucial as the bytestring package cannot
> provide efficient builders for every possible type, as it would have
> to depend on most of Hackage (i.e. on all packages that define types
> that we want efficient builders for) to do so. Allowing the user to
> get hold of the underlying buffer in a controlled way makes the
> builder extensible. This is good.
> Write achieves this separation, but it has some costs which I'm not
> entirely comfortable with.
> First, it leads to lots of API duplication. For every type (e.g. Word)
> we want to be able serialize we have two morally identical functions
> writeWordhost :: Word -> Write
> fromWordhost :: Word -> Builder
> in the API, where the latter simply calls the former and does some
> additional "wrapping".
Yes, I agree with this duplication. I'll explain below what we gain
from it. Note that I factored out the whole Write stuff into its own
library (system-io-write) for the bytestring integration. Therefore,
an end-user of bytestring will only see the Builder versions except
he's doing more low-level stuff to gain some extra performance.
> See http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html
> for examples.
> Simon, is the reason for this duplication this comment on top of
> "Note that for serializing a three tuple (x,y,z) of bytes (or other
> word values) you should use the expression
> fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z
> instead of
> fromWord8 x `mappend` fromWord8 y `mappend` fromWord z
> The first expression will result in a single atomic write of three
> bytes, while the second expression will check for each byte, if
> there is free space left in the output buffer. Coalescing these
> checks can improve performance quite a bit, as long as you use it
That's one of the reasons, but not the main one. The core reason is
that Write's provide
an interface between implementors of the low-level bit-twiddling
required to efficiently implement encodings like UTF-8 or Base16 and
the providers of efficient traversal functions through (streams of)
Haskell values. For simple traversals like
fromWrite :: Write a -> a -> Builder
fromWriteList :: Write a -> [a] -> Builder
fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder
there might be the option that GHC is clever enough and can find the
efficient loop. However, for more complicated functions like
mapWriteByteString :: Write Word8 -> S.ByteString -> Builder
That certainly isn't the case. I'm using quite a few tricks there 
to enable a tight inner loop with few live variables.
> Coalescing of buffer space checks can be achieved without separating
> writes into Write and Builder. I've done so in the binary package 
> using rewrite rules. The rewrite rules fire reliable so that any
> "syntactic" series of puts i.e.
> f = do
> putWord8 1
> putWord8 2
> putWord8 3
> result in one bounds check, followed by three pokes into the buffer.
> To do so all that is needed is to define all builders in terms of
> writeAtMost :: Int -> (Ptr Word8 -> IO Int) -> Builder
> and create a rewrite rule for append/writeAtMost. writeAtMost is
> essentially the same as your Write , except it never leads to any
> constructors getting allocated.
Hmm, as far as I see it `writeAtMost` corresponds in blaze-builder to
implementing every Builder with `fromWrite <someWrite>`. I don't think
that is a good idea because (1) there are many more use-cases for
Writes than just sequencing and (2) I don't think GHC is clever enough
to optimize all of these use cases well enough. For a hard use-case,
have a look at the implementation of `mapWriteByteString` . There,
I'm using the static bound on the size of the written data to upper
bound the number of input bytes I can process without checking the
In my opinion, Writes and Builders have different use-cases and
different semantics. Providing a type modeling Writes makes therefore
sense to me. Moreover, note that Writes are built as a compile time
abstraction. All their definitions are intended to be completely
inlined and care is taken that the inliner also does so. Therefore,
they incur no runtime cost.
> At the moment, the addition of Write means that
> import Blaze.ByteString.Builder
> f :: [Word8] -> Builder
> f xs = fromWriteList writeWord8 xs
> is faster than the Data.Binary equivalent
> import Data.Binary.Builder
> g :: [Word8] -> Builder
> g  = mempty
> g (x:xs) = singleton x `mappend` g xs
> Fortunately this was due to a bug in GHC . After this bug has been
> fixed I expect Data.Binary to perform on par with
> Blaze.ByteString.Builder, at least for code that involves loops
> (blaze-builder might still do a better job with fragmentation).
OK. That sounds interesting. Adding rewriting rules that coalesce
adjacent calls of `fromWrite` might be worth investigating, as they'd
improve the user experience. However, as explained above Writes have
uses cases beyond this optimization. Their main use case being an
abstraction of bounded-size encodings.
> 1. https://github.com/kolmodin/binary/
> 2. I stole your idea of taking the maximum number of bytes to write
> instead of the actual number of bytes to write. This allows many more
> bounds checks to be merged (e.g. when serializing values that use
> variable length encodings, such as UTF-8).
> 3. http://hackage.haskell.org/trac/ghc/ticket/4978
More information about the Haskell-Cafe