[Haskell-cafe] RFC: demanding lazy instances of Data.Binary
nicolas.frisby at gmail.com
Mon Nov 19 21:06:31 EST 2007
In light of this discussion, I think the "fully spine-strict list instance
does more good than bad" argument is starting to sound like a premature
optimization. Consequently, using a newtype to treat the necessarily lazy
instances as special cases is an inappropriate band-aid.
My current opinion: if Data.Binary makes both a fully strict list instance
(not the default) and a fully lazy list instance (this would be the default
for [a]) available, then that will also make available all of the other intermediate
strictness. I'll elaborate that a bit. If the user defines a function
appSpecificSplit :: MyDataType -> [StrictList a], then the user can control
the compactness and laziness of the serialisation by tuning that splitting
function. Neil's 255 scheme fits as one particular case, the split255 :: [a]
function. Niel's 255 schema fits as one particular case, the split255 :: [a]
-> [StrictList a] function. I would hesitate to hard code a number of
elements, since it certainly depends on the application and only exposing it
as a parameter maximizes the reusability of the code.
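A minimal sketch of that splitting idea (the names StrictList, splitN and
split255 are all illustrative here, none of them is part of Data.Binary):

```haskell
-- Illustrative sketch: a spine- and element-strict list, plus a tunable
-- splitter that turns an ordinary (possibly infinite) lazy list into a
-- lazy list of strict chunks.

data StrictList a = Nil | Cons !a !(StrictList a)
  deriving (Eq, Show)

fromList :: [a] -> StrictList a
fromList = foldr Cons Nil

toList :: StrictList a -> [a]
toList Nil         = []
toList (Cons x xs) = x : toList xs

-- Split into strict chunks of at most n elements; the outer list stays
-- lazy, so laziness granularity is n elements.
splitN :: Int -> [a] -> [StrictList a]
splitN n xs = case splitAt n xs of
  ([], _)       -> []
  (chunk, rest) -> fromList chunk : splitN n rest

-- Neil's 255 scheme as one particular instantiation.
split255 :: [a] -> [StrictList a]
split255 = splitN 255
```

Tuning the compactness/laziness trade-off is then just a matter of passing a
different n to splitN.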
Related action: Thus I think that instead of a standard newtype to denote
laziness expectations, there ought to be a standard strict list data type
and Binary instance AND some documentation popularizing this as a possible
optimization whenever the generated bytestrings from plain lists are too big. Is
Data.Sequence suitable for use as StrictList? Or perhaps something from the
strict library? I'm not fully savvy to the head-strictness/unpacking
differences and trade-offs.
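For concreteness, a hand-rolled strict list with a naive length-prefixed
Binary instance might look like this (StrictList and sLength are hypothetical
names; only Binary, Get, encode and decode come from the binary package):

```haskell
{-# LANGUAGE BangPatterns #-}
-- Sketch only: a strict list with a straightforward length-prefixed
-- Binary instance. Not part of the binary package.
import Data.Binary (Binary (..), Get, encode, decode)

data StrictList a = Nil | Cons !a !(StrictList a)
  deriving (Eq, Show)

sLength :: StrictList a -> Int
sLength = go 0
  where
    go !acc Nil         = acc
    go !acc (Cons _ xs) = go (acc + 1) xs

instance Binary a => Binary (StrictList a) where
  -- Write the length, then the elements, in order.
  put xs = put (sLength xs) >> putElems xs
    where
      putElems Nil           = pure ()
      putElems (Cons x rest) = put x >> putElems rest
  -- Read the length, then rebuild the strict spine element by element.
  get = do
    n <- get :: Get Int
    let go 0 = pure Nil
        go k = Cons <$> get <*> go (k - 1)
    go n
```

Because the constructor fields are strict, the put side pays no extra
forcing cost: the spine was already forced when the StrictList was built.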
"Reaching for the sky" idea: Does the Put "monad" offer enough information
for an instance to be able to recognize when it has filled a lazy
bytestring's first chunk? It could tailor its strictness (i.e. vary how much
of the spine is forced before any output is generated) in order to best line
up with the chunks of lazy bytestring it is producing. This might be trying
to fit too much into the interface. And it might even make Put an actual
monad.
On Nov 19, 2007 4:16 PM, Duncan Coutts <duncan.coutts at worc.ox.ac.uk> wrote:
> On Mon, 2007-11-19 at 13:39 -0800, Don Stewart wrote:
> > nicolas.frisby:
> > > I've got a first draft with the newtype and just an instance for
> > > lists.
> > > If you'd prefer fewer questions, please let me know ;)
> > >
> > > 0) I've cabalised it ("lazy-binary"), but I don't have anywhere to host
> > > it. Would it be appropriate to host on darcs.haskell.org or
> > > (yet?). Suggestions?
> > You can host it on code.haskell.org, ask for an account here:
> I think we should consider whether a lazier serialisation of lists
> shouldn't be the default before thinking about forking the whole library.
> It depends on how much laziness you want. We could certainly make it so
> that this is true:
> (decode . encode) [1..] = [1..]
> rather than giving _|_. However the approach of Data.Binary is lazy
> serialisation but in chunks, big chunks. So while the above may be true,
> this would not be:
> (head . decode . encode) [1, _|_] = 1
> because we generate 32k chunks of data when serialising. But do you
> really need it to be this lazy? Would it be enough for it to be lazy in
> chunks?
> There is a good argument I think that the current fully strict
> serialisation is bad just from a performance perspective, and that
> instead we should serialise lists semi-lazily, using a chunked
> representation. For example Neil's serialisation library uses length
> prefixed chunks with a maximum chunk size of 255. The end of a list is
> denoted by a 0 length final chunk. This has the advantage that we only
> have to force a limited number of elements (to find the length) before
> producing any output.
> If you want it really lazy then it would have to flush after each
> element to create a new lazy bytestring chunk. Note that flushing this
> often loses many of the performance advantages of the Data.Binary
> approach.
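The length-prefixed chunk scheme described above can be sketched using the
library's real Get/Put primitives; putChunked and getChunked are made-up
names for illustration, not Data.Binary's actual list instance:

```haskell
-- Sketch of Neil-style chunked list serialisation: chunks of at most 255
-- elements, each prefixed by its length, terminated by a 0-length chunk.
import Control.Monad (replicateM)
import Data.Binary (Binary (..))
import Data.Binary.Get (Get, getWord8, runGet)
import Data.Binary.Put (Put, putWord8, runPut)

putChunked :: Binary a => [a] -> Put
putChunked xs = case splitAt 255 xs of
  ([], _) -> putWord8 0  -- 0-length chunk marks end of list
  (chunk, rest) -> do
    -- Taking the length forces the spine of at most 255 elements
    -- before any of this chunk's output is emitted.
    putWord8 (fromIntegral (length chunk))
    mapM_ put chunk
    putChunked rest

getChunked :: Binary a => Get [a]
getChunked = do
  n <- getWord8
  if n == 0
    then pure []
    else do
      chunk <- replicateM (fromIntegral n) get
      (chunk ++) <$> getChunked
```

A 255-element maximum keeps the length in a single byte; any other bound
would work the same way with a wider length field.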
> > >
> > > 1) The fact that serialisation is fully strict for 32760 bytes but
> > > not for 32761 makes the direct application of strictCheck intractable.
> > > Do you have any ideas how to circumvent that?
> Test using a much smaller chunk size. I'd test sizes from 1 to something
> like one more than the machine word size.
> > > 2) Also, any suggestions for other datatypes to provide default
> > > instances for? Tree type structures immediately raise the question of
> > > which traversal should be the default. I'm leaning towards providing
> > > none, since the goal of constant space usage actually depends on the
> > > serialisation order matching how the deserialised tree will be
> > > traversed.
> > Lazy Arrays?
> > > 3) I don't think it is applicable in any way whatsoever to strict types
> > > like Int, Data.Set, and Data.Sequence? Counter-arguments?
> > Well, atomic types like Int, I don't think it makes sense, but Sets and
> > Sequence are lazy, aren't they?
> Sequences are like spine-strict lists. Sets are strict inasmuch as the
> element type's (==) function is strict.
> > > 4) Perhaps the tight correspondence between serialisation and
> > > deserialisation necessary for constant space usage does indeed mean
> > > that the instance for the lazy list is the only appropriate one to
> > > provide. Perhaps the Chunks data type and a function
> > > splitN :: Int -> [a] -> Chunks [a] would also be helpful.
> > Yes, it is probably the only lazy instance anyone cares about, anyway.
> Yes. So I think we should be clear about what we want and see if we
> can't just fix the default.