[Haskell-cafe] Efficient way to break up a lazy bytestring

Matthew Brecknell haskell at brecknell.org
Fri Dec 29 21:32:26 EST 2006


> breakUp s
> 	| L.null s = []
> 	| otherwise = h:(breakUp r) where
> 		(h,r) = L.splitAt 72 s
> 
> Running this on the 2G file blows up the stack pretty quickly, taking  
> the first 1 million records (there are 20M of them) with a big stack  
> parameter gives about 25% productivity, with GC taking the other 75%.
> 
> My understanding is that while this looks tail recursive, it isn't  
> really because of laziness.  I've tried throwing seq operators  
> around, but they don't seem to do much to help the efficiency.

This function by itself doesn't determine your stack and heap usage,
since it is just a lazy, linear mapping from one sequence to another.
To understand the stack blowout, you'll need to look at what you are
doing with the result of this function.
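To make that concrete, here is a sketch of the breakUp function from the
quoted post together with one well-behaved consumer. The countRecords
helper is my own illustration, not something from the original message:
it folds strictly over the records, so the accumulator is forced at each
step and no long thunk chain builds up.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.List (foldl')

-- The breakUp function from the original post: split a lazy
-- ByteString into 72-byte records.
breakUp :: L.ByteString -> [L.ByteString]
breakUp s
    | L.null s  = []
    | otherwise = h : breakUp r
    where (h, r) = L.splitAt 72 s

-- A strict consumer (hypothetical example): count the records
-- with foldl', forcing the count at every step.
countRecords :: L.ByteString -> Int
countRecords = foldl' (\n _ -> n + 1) 0 . breakUp
```

Because breakUp is lazy and foldl' discards each record after counting
it, this traversal should run in roughly constant space regardless of
the input size.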

For example, a foldl over the result might be building a big thunk on
the heap, which could blow the stack when evaluated*. If this is the
case, you might avoid building the big thunk by using foldl', a version
of foldl which evaluates as it folds.
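A small, self-contained experiment shows the difference. With the sum
below, foldl would build a ten-million-deep chain of (+) thunks and
typically blow the stack when the result is demanded, while foldl'
forces the accumulator at each step and runs in constant stack space:

```haskell
import Data.List (foldl')

main :: IO ()
main =
    -- foldl (+) 0 [1 .. 10^7] would defer every addition, building
    -- one huge thunk; foldl' evaluates the running total eagerly.
    print (foldl' (+) 0 [1 .. 10 ^ 7 :: Int])
    -- prints 50000005000000
```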

Invoking GHC with optimisation might also help, since the strictness
analyser can often do the equivalent of converting foldl to foldl' in
the appropriate places.

Have you looked at the Performance pages on the Wiki, in particular, the
Strictness and Laziness pages?

http://www.haskell.org/haskellwiki/Performance

In my (limited) Haskell experience, I was continually being surprised by
inexplicable stack blowouts until I spent a little time doing some
focussed experiments, mainly involving folds over large lists. If you
haven't done that, I would recommend it.

* - Note that foldl is tail recursive, so clearly tail recursion doesn't
necessarily result in a well-behaved loop in a lazy implementation!


