[Haskell-cafe] Interesting folds over bytestring lists?

Fri Sep 21 07:46:41 EDT 2007

In message <a45dff840709200853i4b0418a7p4c1090343bd73501 at mail.gmail.com> "Justin
Bailey" <jgbailey at gmail.com> writes:
> On 9/20/07, Duncan Coutts <duncan.coutts at worc.ox.ac.uk> wrote:
> > A lazy bytestring is a list of strict bytestring which externally looks like one
> > big string. Could you not just use a lazy bytestring and it's take and drop
> > functions? Perhaps you can help me understand what it is you're trying to do?
> 
> I'm working on the ICFP contest from this year, and the algorithm
> frequently prepends long strings to the front of the "DNA" string
> being processed. I originally worked only with a lazy bytestring but
> it 'append' wasn't fast enough, so I'm trying this representation.

But you do realise it's exactly the same representation. Append for a lazy
bytestring is O(n) in the number of chunks n, this will also be true for your
'new' representation.

> Your email makes me think I should work directly with a list of strict
> bytestrings,

That's exactly what a lazy bytestring is. You'll not get any performance
improvements without changing the data representation. A list is not good enough
for what you want to do because so many operations are O(n) in the number of chunks.

> but in that case what happens when I want to take a large
> chunk of strings out of the middle of the list? Would that be an O(n)
> operation?

Yes. That's exactly the problem.

What you want rather than a list of strict bytestrings is a tree of strict
bytestrings. You want a fingertree of strict bytestrings:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/fingertree

newtype ByteSequence = BS (FingerTree (Sum Int) Strict.ByteString)

instance Measured (Sum Int) Strict.ByteString where
  measure = Sum . Strict.length

You'll have to wrap the operations you need, (like split, take, drop and append)
to make the ByteSequence look like a single sequence of bytes rather than a
sequence of chunks. You probably want to enforce an invariant that no chunk is
ever empty (as we do for lazy bytestrings). For best performance over a large
number of inserts and deletes you might need to implement merging adjacent small
blocks so that the block size does not degrade too much.

An alternative structure if you tend to do lots of inserts and deletes at near
the same spot is a zipper structure with a cursor. I'm not so sure what the best
structure for that might be, perhaps just a pair of finger trees giving the
parts of the sequence before and after the insertion point (since finger trees
give fast access to the ends but slower O(log n) access to points n chunks from
the closer end).

Have fun :-)

I should point out that other people who did this year's ICFP contest have also
looked at structures like this (though mostly after the contest finished), so
you might want to talk or collaborate with them.

Duncan