[Haskell-cafe] RFC: demanding lazy instances of Data.Binary
Duncan Coutts
duncan.coutts at worc.ox.ac.uk
Mon Nov 19 17:16:21 EST 2007
On Mon, 2007-11-19 at 13:39 -0800, Don Stewart wrote:
> nicolas.frisby:
> > I've got a first draft with the newtype and just an instance for list.
> >
> > If you'd prefer fewer questions, please let me know ;)
> >
> > 0) I've cabalised it ("lazy-binary"), but I don't have anywhere to host
> > it. Would it be appropriate to host on darcs.haskell.org or HackageDB
> > (yet?). Suggestions?
>
> You can host it on code.haskell.org, ask for an account here:
I think we should first consider whether a lazier serialisation of lists
should be the default, before thinking about forking the whole library.
It depends on how much laziness you want. We could certainly make it so
that this is true:
(decode . encode) [1..] = [1..]
rather than giving _|_. However the approach of Data.Binary is lazy
serialisation but in chunks, big chunks. So while the above may be true,
this would not be:
(head . decode . encode) [1, _|_] = 1
because we generate 32k chunks of data when serialising. But do you
really need it to be this lazy? Would it be enough for it to be lazy in
chunks?
I think there is a good argument that the current fully strict
serialisation is bad just from a performance perspective, and that
instead we should serialise lists semi-lazily, using a chunked
representation. For example Neil's serialisation library uses length
prefixed chunks with a maximum chunk size of 255. The end of a list is
denoted by a 0 length final chunk. This has the advantage that we only
have to force a limited number of elements (to find the length) before
serialising.
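As a rough illustration of this chunked representation, here is a
hypothetical sketch over plain lists of Word8 (the names encodeChunked
and decodeChunked are made up for illustration; this is not Neil's code
or the Data.Binary implementation):

```haskell
import Data.Word (Word8)
import Data.List (unfoldr)

-- Split a (possibly infinite) list into chunks of at most 255 elements.
chunksOf255 :: [a] -> [[a]]
chunksOf255 = unfoldr step
  where
    step [] = Nothing
    step xs = Just (splitAt 255 xs)

-- Encode each chunk as its length followed by its elements; a final
-- 0-length chunk marks the end of the list. Only one chunk's worth of
-- elements has to be forced at a time to find each length.
encodeChunked :: [Word8] -> [Word8]
encodeChunked xs = concatMap enc (chunksOf255 xs) ++ [0]
  where enc c = fromIntegral (length c) : c

-- Decode: read a length byte, take that many elements, and repeat
-- until the 0-length terminator. Lazy in chunks, not per element.
decodeChunked :: [Word8] -> [Word8]
decodeChunked (0 : _)  = []
decodeChunked (n : bs) = front ++ decodeChunked rest
  where (front, rest) = splitAt (fromIntegral n) bs
decodeChunked []       = error "truncated input"
```

With this scheme, decoding an encoded infinite list still produces a
usable lazy result, but only chunk by chunk: forcing the head forces
the whole first chunk.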
If you want it really lazy then it would have to flush after each
element to create a new lazy bytestring chunk. Note that flushing this
often loses many of the performance advantages of the Data.Binary
stuff.
> >
> > 1) The fact that serialisation is fully strict for 32760 bytes but not for
> > 32761 makes the direct application of strictCheck intractable. Do you have
> > any ideas how to circumvent that?
Test using a much smaller chunk size. I'd test sizes from 1 to something
like one more than the machine word size.
> > 2) Also, any suggestions for other datatypes to provide default instances
> > for? Tree type structures immediately raise the question of which
> > traversal should be the default. I'm leaning towards providing none since
> > the goal of constant space usage actually depends on the serialisation
> > order matching how the deserialised tree will be traversed.
>
> Lazy Arrays?
>
> > 3) I don't think it is applicable in anyway whatsoever to strict types
> > like Int, Data.Set, and Data.Sequence? Counter-arguments?
>
> Well, atomic types like Int, I don't think it makes sense, but Sets and
> Sequence are lazy, aren't they?
Sequences are like spine-strict lists. Sets are strict insofar as the
element type's comparison function is strict.
> > 4) Perhaps the tight correspondence between serialisation and traversal
> > necessary for constant space usage does indeed mean that the instance for
> > the lazy list is the only appropriate one to provide. Perhaps the Chunks
> > data type and a function splitN :: Int -> [a] -> Chunks [a] would also be
> > helpful.
>
> Yes, it is probably the only lazy instance anyone cares about, anyway.
Yes. So I think we should be clear about what we want and see if we
can't just fix the default.
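One possible shape for the Chunks type and splitN function suggested in
the quoted message above (a hypothetical sketch; these names come from
the suggestion, not from Data.Binary):

```haskell
-- A list broken into explicit pieces, with an explicit terminator,
-- mirroring the length-prefixed chunk encoding on the wire.
data Chunks a = Chunk a (Chunks a) | End
  deriving (Show, Eq)

-- Group a (possibly infinite) list into chunks of at most n elements.
-- Assumes n > 0; lazy in the tail, so it works on infinite lists.
splitN :: Int -> [a] -> Chunks [a]
splitN _ [] = End
splitN n xs = Chunk front (splitN n rest)
  where (front, rest) = splitAt n xs

-- Flatten back to a list, chunk by chunk.
unchunks :: Chunks [a] -> [a]
unchunks End          = []
unchunks (Chunk c cs) = c ++ unchunks cs
```

Here splitN 2 [1,2,3] gives Chunk [1,2] (Chunk [3] End), and forcing a
prefix of unchunks (splitN n xs) only forces whole chunks at a time,
which matches the chunk-wise laziness discussed above.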
Duncan