[Haskell-cafe] ByteString and ByteString.Builder questions

Wed Nov 29 17:24:17 UTC 2023

On Wed, Nov 29, 2023 at 11:49:06AM +0000, Zoran Bošnjak wrote:

> if I understand correctly, the ByteString.Builder is used to
> efficiently construct sequence of bytes from smaller parts.

It is best used in continuation-passing style (right-associatively), where
all the subsequent builders are lazily added as part of constructing the
"head" builder.

    builder = chunk1 <> (chunk2 <> (chunk3 <> (... <> chunkN)...))
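
As a minimal sketch of that right-nested shape (the `countdown` function is
a made-up example, not something from your code), a recursive producer gets
it for free, because `<>` is right-associative and the recursive call sits
in the tail position of each append:

```haskell
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as BLC

-- | Hypothetical producer: each step emits its own chunk and leaves the
-- rest of the output as a lazy tail, so the nesting is
-- chunk1 <> (chunk2 <> (chunk3 <> ...)).
countdown :: Int -> Builder
countdown n
  | n <= 0    = mempty
  | otherwise = intDec n <> char7 '\n' <> countdown (n - 1)

-- | Run the builder only once, at the very end.
rendered :: BLC.ByteString
rendered = toLazyByteString (countdown 3)
```

Here `rendered` is the lazy bytestring for the three lines "3", "2", "1".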

Repeatedly appending tail chunks (effectively left-associatively) is
noticeably less efficient (similar to lists).  A workaround is to
instead append (Builder -> Builder) endomorphisms.

    b1 = Endo (mappend chunk1)
    b2 = b1 <> Endo (mappend chunk2)
    b3 = b2 <> Endo (mappend chunk3)
    ...
    bN = ...

and then extract the final builder via: `appEndo bN mempty`.
Endomorphism append will be more efficient once there are many parts to
combine.
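
A self-contained sketch of the endomorphism approach might look like the
following.  The names `BuilderAcc`, `chunk`, `snoc` and `runAcc` are my own
for illustration, they are not part of the bytestring API; only `Endo`,
`appEndo` and the Builder functions are real.

```haskell
import Data.ByteString.Builder (Builder, string7, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Monoid (Endo (..))

-- | Difference-list style accumulator (hypothetical name): a function that
-- prepends its chunks to whatever builder eventually follows.
type BuilderAcc = Endo Builder

-- | Wrap a single chunk as an endomorphism.
chunk :: Builder -> BuilderAcc
chunk b = Endo (b <>)

-- | Append a chunk at the tail.  Endo's (<>) is function composition, so
-- this stays cheap however long the accumulator already is.
snoc :: BuilderAcc -> Builder -> BuilderAcc
snoc acc b = acc <> chunk b

-- | Extract the final builder by applying the accumulated function to
-- mempty, which yields the right-nested form.
runAcc :: BuilderAcc -> Builder
runAcc acc = appEndo acc mempty

-- | Chunks arrive one at a time and are appended on the right.
example :: BLC.ByteString
example = toLazyByteString (runAcc (foldl snoc mempty parts))
  where
    parts = map string7 ["one", ",", "two", ",", "three"]
```

Applying the composed function to `mempty` produces
`chunk1 <> (chunk2 <> (... <> mempty))`, so `example` holds the bytes of
"one,two,three".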

> However, for inspecting data (take, head, index...), a plain
> ByteString is required.

For efficient processing of network streams, you'd perhaps use a
streaming API that exposes the input as a monadic stream of chunks,
and perhaps a corresponding parser layered on top that supports
consuming chunks monadically.  The `streaming` ecosystem for
example has support for this model.

> What if the byte sequence manipulation task requires both, for example:
> - receive ByteString from the network (e.g: Network.Socket.ByteString.recv :: ... -> IO ByteString)
> - inspect and manipulate data (pure function)
> - resend to the network (e.g: Network.Socket.ByteString.sendMany :: ... -> [ByteString] -> IO ())

The input packet will be a `ByteString`, and the output packet should be
a builder that is converted at the last moment to a (possibly lazy)
bytestring for transmission.  You shouldn't need to read your
output, so a single representation is sufficient.
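
To make the receive/inspect/resend flow concrete, here is a rough sketch
of the pure middle step.  The wire format (one tag byte, then a payload)
and the names `reframe` and `toWire` are assumptions invented for the
example; the socket calls are left out so that it has no dependency on the
network package.

```haskell
import qualified Data.ByteString as BS
import Data.ByteString.Builder (Builder, byteString, toLazyByteString, word16BE, word8)
import qualified Data.ByteString.Lazy as BL

-- | Hypothetical pure transformation: inspect the strict input, then
-- describe the output as a Builder.  Returns Nothing on empty input.
reframe :: BS.ByteString -> Maybe Builder
reframe input = do
  (tag, payload) <- BS.uncons input   -- inspection happens on the ByteString
  pure $ word8 tag
      -- 16-bit big-endian length; fromIntegral wraps for payloads over 65535
      <> word16BE (fromIntegral (BS.length payload))
      <> byteString payload           -- the input slice goes into the output

-- | Convert at the last moment.  The result is a list of strict chunks,
-- which is the shape that sendMany takes.
toWire :: Builder -> [BS.ByteString]
toWire = BL.toChunks . toLazyByteString
```

In the IO layer you would then do something along the lines of
`recv sock n >>= \pkt -> mapM_ (sendMany sock . toWire) (reframe pkt)`.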

> It is somewhat inconvenient to use 2 different types for the task,
> namely the ByteString and the Builder... where both represent a
> sequence of bytes.

A builder is not a sequence of bytes as such, it is a CPS-style
generator for a slice of a future sequence of bytes that can
incrementally build the entire sequence without reallocation
or copying (at least when the output is a lazy bytestring).

> I have tried to define a Bytes type where both representations are available:
> 
> import qualified Data.ByteString as BS
> import qualified Data.ByteString.Lazy as Bsl
> import qualified Data.ByteString.Builder as Bld
> 
> data Bytes = Bytes
>     { toByteString :: ByteString
>     , toBuilder    :: Builder
>     , length       :: Int
>     }

This is not a productive direction to explore.  Instead your *output*
should be a Builder, either constructed lazily in one go (with the tail
parts already lazily appended), or constructed by concatenation of
(Builder->Builder) endomorphisms.  The inputs that individual builder
chunks will consume can be bytestring slices mixed with various other
data (e.g. builders for binary length fields that convert ints to
big-endian wire-form, ...).
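
For instance, a message made of length-prefixed fields could be sketched
as below.  The layout (a version byte followed by fields with 32-bit
big-endian lengths) and the names `field` and `message` are purely
illustrative assumptions, not any particular protocol.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Data.ByteString.Builder (Builder, byteString, toLazyByteString, word32BE, word8)
import qualified Data.ByteString.Lazy as BL

-- | One length-prefixed field: a 32-bit big-endian length, then the bytes.
field :: BS.ByteString -> Builder
field bs = word32BE (fromIntegral (BS.length bs)) <> byteString bs

-- | Hypothetical message: a version byte followed by the fields.  foldr
-- keeps the appends right-nested, with the tail built lazily.
message :: [BS.ByteString] -> Builder
message fields = word8 1 <> foldr (\f rest -> field f <> rest) mempty fields

-- | The fields can be literals or slices of an input packet.
encoded :: BL.ByteString
encoded = toLazyByteString (message [BC.pack "ab", BS.take 1 (BC.pack "xyz")])
```

No field is inspected after it has been handed to the builder; the only
consumer of `encoded` is the code that writes it to the socket.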

-- 
    Viktor.