[Haskell-cafe] Stream fusion and span/break/group/init/tails
benl at ouroborus.net
Fri Apr 26 04:46:42 CEST 2013
On 25/04/2013, at 3:47 AM, Duncan Coutts wrote:
> It looks like fold and unfold fusion systems have dual limitations:
> fold-based fusion cannot handle zip style functions, while unfold-based
> fusion cannot handle unzip style functions. That is fold-based cannot
> consume multiple inputs, while unfold-based cannot produce multiple
Yes. This is a general property of data-flow programs; it is not specific to compilation via Data.Vector-style co-inductive stream fusion, nor to fold/unfold/hylo fusion.
Consider these general definitions of streams and costreams.
-- A stream must always produce an element.
type Stream a = IO a
-- A costream must always consume an element.
type CoStream a = a -> IO ()
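To make the encoding concrete, here is one minimal realisation of these two types in plain IO; the names `nats` and `sumSink` are illustrative examples, not from the original discussion:

```haskell
import Data.IORef

-- A stream must always produce an element when pulled.
type Stream a = IO a

-- A costream must always consume an element when pushed.
type CoStream a = a -> IO ()

-- An example stream: successive pulls yield 0, 1, 2, ...
nats :: IO (Stream Int)
nats = do
  ref <- newIORef 0
  return $ do
    n <- readIORef ref
    writeIORef ref (n + 1)
    return n

-- An example costream: each push adds to a running sum,
-- which can be read back with the returned action.
sumSink :: IO (CoStream Int, IO Int)
sumSink = do
  acc <- newIORef 0
  return (\x -> modifyIORef' acc (+ x), readIORef acc)
```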
And operators on them (writing S for Stream and C for CoStream).
-- Versions of map.
map :: (a -> b) -> S a -> S b (ok)
comap :: (a -> b) -> C b -> C a (ok)
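Under the IO encoding above both versions of map are one-liners: pulling from the mapped stream pulls from the source, and pushing into the mapped costream pushes onward. (The `S` suffix below is only to avoid clashing with the Prelude.)

```haskell
type Stream a = IO a
type CoStream a = a -> IO ()

-- map: pull an 'a' from the source, apply f, hand back a 'b'.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f s = fmap f s

-- comap: accept an 'a', apply f, push the resulting 'b' into the sink.
comapS :: (a -> b) -> CoStream b -> CoStream a
comapS f c = c . f
```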
-- Versions of unzip.
unzip :: S (a, b) -> (S a, S b) (bad)
counzip :: C a -> C b -> C (a, b) (ok)
unzipc :: S (a, b) -> C b -> S a (ok)
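Of these, counzip is immediate under the IO encoding: pushing a pair pushes each component into its own sink in the same step, so nothing is buffered. (The name `counzipS` is illustrative.)

```haskell
type CoStream a = a -> IO ()

-- counzip: feed the components of the pair to the two sinks.
-- Both consumers are served immediately, so no buffer is needed.
counzipS :: CoStream a -> CoStream b -> CoStream (a, b)
counzipS ca cb (a, b) = ca a >> cb b
```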
-- Versions of zip.
zip :: S a -> S b -> S (a, b) (ok)
cozip :: C (a, b) -> (C a, C b) (bad)
zipc :: C (a, b) -> S a -> C b (ok)
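zip is equally direct: pulling from the zipped stream pulls exactly once from each source, so the two producers advance in lock-step:

```haskell
type Stream a = IO a

-- zip: pull one element from each source and pair the results.
zipS :: Stream a -> Stream b -> Stream (a, b)
zipS sa sb = (,) <$> sa <*> sb
```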
The operators marked (ok) can be implemented without buffering data, while those marked (bad) may need an arbitrarily large buffer.
Starting with 'unzip', suppose we pull elements from the first component of the result (the (S a)) but not the second component (the (S b)). To provide these 'a' elements, 'unzip' must pull tuples from its source stream (S (a, b)) and buffer the 'b' part until someone pulls from the (S b).
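The buffering is visible if we sketch 'unzip' directly in the IO encoding: each output stream needs a FIFO for elements that were pulled from the source on behalf of the other side. (`unzipS` and the list-as-queue representation are purely illustrative.)

```haskell
import Data.IORef

type Stream a = IO a

-- A sketch of the buffering 'unzip'. Pulling from one side may
-- force tuples out of the source, whose other components pile up
-- in the opposite side's buffer.
unzipS :: Stream (a, b) -> IO (Stream a, Stream b)
unzipS src = do
  bufA <- newIORef []   -- 'a's produced but not yet demanded
  bufB <- newIORef []   -- 'b's produced but not yet demanded
  let pullA = do
        as <- readIORef bufA
        case as of
          a : rest -> writeIORef bufA rest >> return a
          []       -> do
            (a, b) <- src
            modifyIORef bufB (++ [b])   -- buffer can grow without bound
            return a
      pullB = do
        bs <- readIORef bufB
        case bs of
          b : rest -> writeIORef bufB rest >> return b
          []       -> do
            (a, b) <- src
            modifyIORef bufA (++ [a])
            return b
  return (pullA, pullB)
```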
Dually, with 'cozip', suppose we push elements into the first component of the result (the (C a)). The implementation must buffer them until someone pushes the corresponding element into the (C b); only then can it push the whole tuple into the source (C (a, b)) costream.
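A sketch of 'cozip' shows the dual buffering: an element pushed into one side waits in a FIFO until its partner arrives on the other side. (Again, `cozipS` and the representation are illustrative only.)

```haskell
import Data.IORef

type CoStream a = a -> IO ()

-- A sketch of the buffering 'cozip'. A pushed element is either
-- matched with a waiting partner, or queued until one arrives.
cozipS :: CoStream (a, b) -> IO (CoStream a, CoStream b)
cozipS sink = do
  bufA <- newIORef []   -- 'a's still waiting for a partner
  bufB <- newIORef []   -- 'b's still waiting for a partner
  let pushA a = do
        bs <- readIORef bufB
        case bs of
          b : rest -> writeIORef bufB rest >> sink (a, b)
          []       -> modifyIORef bufA (++ [a])   -- buffer can grow without bound
      pushB b = do
        as <- readIORef bufA
        case as of
          a : rest -> writeIORef bufA rest >> sink (a, b)
          []       -> modifyIORef bufB (++ [b])
  return (pushA, pushB)
```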
The two combinators unzipc and zipc are hybrids:
For 'unzipc', if we pull an element from the (S a), then the implementation can pull a whole (a, b) tuple from the source (S (a, b)) and then get rid of the 'b' part by pushing it into the (C b). The fact that it can get rid of the 'b' part means it doesn't need a buffer.
Similarly, for 'zipc', if we push a 'b' into the (C b) then the implementation can pull the corresponding 'a' part from the (S a) and then push the whole (a, b) tuple into the C (a, b). The fact that it can get the corresponding 'a' means it doesn't need a buffer.
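Both hybrids can be written without any buffer, exactly as described: the unwanted component is disposed of (or the missing partner fetched) immediately, within the same step. A sketch under the IO encoding:

```haskell
type Stream a = IO a
type CoStream a = a -> IO ()

-- unzipc: pulling an 'a' pulls a whole pair from the source and
-- immediately pushes the 'b' part into the costream; no buffer.
unzipc :: Stream (a, b) -> CoStream b -> Stream a
unzipc s cb = do
  (a, b) <- s
  cb b
  return a

-- zipc: pushing a 'b' immediately pulls its partner 'a' and
-- pushes the completed pair onward; no buffer.
zipc :: CoStream (a, b) -> Stream a -> CoStream b
zipc cab sa b = do
  a <- sa
  cab (a, b)
```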
I've got some hand drawn diagrams of this if anyone wants them (mail me), but none of it helps implement 'unzip' for streams or 'cozip' for costreams.
> I'll be interested to see in more detail the approach that Ben is
> talking about. As Ben says, intuitively the problem is that when you've
> got multiple outputs so you need to make sure that someone is consuming
> them and that that consumption is appropriately synchronised so that you
> don't have to buffer (buffering would almost certainly eliminate the
> gains from fusion). That might be possible if ultimately the multiple
> outputs are combined again in some way, so that overall you still have a
> single consumer, that can be turned into a single lazy or eager loop.
At least for high-performance applications, I think we've reached the limit of what short-cut fusion approaches can provide. By "short-cut fusion", I mean crafting a special source program so that the inliner + simplifier + constructor specialisation transform can crunch down the intermediate code into a nice loop. Geoff Mainland's recent paper extended stream fusion with support for SIMD operations, but I don't think stream fusion can ever be made to fuse programs with unzip/cozip-like operators properly. This is a serious problem for DPH, because the DPH vectoriser naturally produces code that contains these operators.
I'm currently working on Repa 4, which will include a GHC plugin that hijacks the intermediate GHC core code and performs the transformation described in Richard Waters's paper "Automatic transformation of series expressions into loops". The plugin will apply to stream programs, but not affect the existing fusion mechanism via delayed arrays. I'm using a cut-down 'clock calculus' from work on synchronous data-flow languages to guarantee that all outputs from an unzip operation are consumed in lock-step. Programs that don't do this won't be well-typed. Forcing synchronicity guarantees that Waters's transform will apply to the program.
The Repa plugin will also do proper SIMD vectorisation for stream programs, producing the SIMD primops that Geoff recently added. Along the way it will brutally convert all operations on boxed/lifted numeric data to their unboxed equivalents, because I am sick of adding bang patterns to every single function parameter in Repa programs.