[Haskell-cafe] Patterns for processing large but finite streams

Fri Jul 1 12:01:47 CEST 2011

>
> From: Eugene Kirpichov <ekirpichov at gmail.com>
> Subject: [Haskell-cafe] Patterns for processing large but finite
>        streams
> To: Haskell Cafe <haskell-cafe at haskell.org>
> Message-ID: <BANLkTikdsvQ2Wv4WJR+QMuvksoav0kTqXw at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi,
>
> I'm rewriting timeplot to avoid holding the whole input in memory, and
> naturally a problem arises:
>
> How to represent large but finite streams and functions that process
> them, returning other streams or some kinds of aggregate values?
>
> Examples:
> * Adjacent differences of a stream of numbers
> * Given a stream of numbers with times, split it into buckets by time
> of given width and produce a stream of (bucket, 50%,75% and 90%
> quantiles in this bucket)
> * Sum a stream of numbers
>
> Is this, perhaps, what comonads are for? Or iteratees?
>

This is exactly what iteratees are for.  Specifically, enumeratees are
stream transformers, iteratees are stream consumers, and enumerators are
stream producers.  Consider adjacent differences:

Given the stream [a, b, c, d, e ....], you want to produce [b-a, c-b, d-c,
e-d, ...]

This is a stream transformer, so you need an enumeratee.  Using iteratee,
there are at least two obvious ways to produce it:

1)  High-level, but probably not as good performance. Use
Data.Iteratee.ListLike.roll

> import Data.Iteratee as I
> import Control.Applicative
>
> diff [x,y] = y-x
> diff [x]    = 0
> diffs = convStream (map diff <$> roll 2 1)
>

2) somewhat explicit, probably better performance

> import qualified Data.ListLike as LL
> e' iter = do
>   h <- I.head
>   unfoldConvStream f h iter
> where
>  f lastEl = do
>    c <- getChunk
>    if LL.null c
>      then return (lastEl, LL.empty)
>      else do
>        let h = LL.head c
>             t = LL.tail c
>        return (LL.last c, LL.cons (h-lastEl) (LL.zipWith (-) t (LL.init
c)))

either of these can be run by using an enumerator:

*Main> enumPure1Chunk [1..10] (joinI $ e stream2list) >>= run
[1,1,1,1,1,1,1,1,1,0]
*Main> let e'2 = e' :: Enumeratee [Int] [Int] IO a
*Main> enumPure1Chunk [1..10] (joinI $ e'2 stream2list) >>= run
[1,1,1,1,1,1,1,1,1]

I should optimize 'roll', it wouldn't be hard.

Summing is easy; iteratee has "Data.Iteratee.ListLike.sum" built-in, but you
could also use a fold.

enumPure1Chunk is only useful for small amounts of data, but iteratee
packages provide enumerators over files, handles, etc., as well as
mechanisms by which you can create your own enumerators.

The e' enumeratee is really just a model; I'd probably write one specific to
whichever type of stream I wanted to work with.  This one assumes a cheap
'cons', for example.

For producing a stream of buckets, if the times are ordered it would be
simple to do with Data.Iteratee.ListLike.breakE.  If the times aren't
ordered, I would probably use 'group' instead to collect a set number of
samples.

In my view, the biggest difference between iteratee and enumerator is the
stream abstraction.  Iteratee provides

> I.Iteratee s m a

where 's' is the stream type, e.g. [Int], String, ByteString, Vector Word8,
etc.  Although the library only processes a chunk at a time (where a chunk
is a subsection of the stream), the type is that of the whole stream.

Enumerator instead provides

> E.Iteratee s m a

here, 's' is the type of the chunk.  Enumerator treats the stream as having
type [s].

The implementations are different too, but ideally that would be hidden from
most users.  Although the iteratee implementation allows you to use >>= and
$, whereas enumerator sometimes requires you to use >>== and $$.

I think IterIO mostly follows the same type as Enumerator, although the
implementation is again quite different.

John
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20110701/24af43b8/attachment.htm>