# Custom reducing functions for DPH

Roman Leshchinskiy rl at cse.unsw.edu.au
Mon May 10 23:32:15 EDT 2010

```
On 11/05/2010, at 05:29, Edward Z. Yang wrote:

> Some of the important primitives offered by Data Parallel Haskell are
> reduction primitives such as sumP and prodP, which take a data parallel
> array and reduce it to a single value.  I was wondering what the current
> capabilities for end-users interested in implementing their own
> reduction primitives were.  In particular, if I want to run data parallel
> computations on strings, I will generally want a more exotic set of
> combining operators.
>
> thoughtpolice informed me that GHC 6.10 seemed to have sumP/prodP hard
> coded into the vectorisation and optimisation stages, so this didn't
> really seem possible in userspace.  I'm interested to know if this situation
> has changed.  No hard feelings if it hasn't; I'm really just playing around
> with DPH and seeing what it can do. :-)

Short answer: we will eventually allow arbitrary operators in folds and scans, but we need some time to figure out how to do this efficiently, and it's not really at the top of our todo list at the moment.

That said, there are no theoretical problems with allowing user-defined operators and we could, with some work, provide support for them now. The only real difficulty is efficiency. Consider that in (foldP f xs) the function f might itself do something in parallel. In that case, we have to eliminate the nesting. This is quite straightforward to do by splitting xs into two halves, ys and zs, and then computing (zipWithP f ys zs). Then, we split the result, zip again and so on until we are left with a one-element array. This works because zipWithP "knows" how to eliminate nested parallelism. In fact, we have just directly encoded the parallel tree reduction algorithm.

We can do *much* better if we know that f is purely sequential, though (e.g., if f is Int addition). Now, we just split xs into chunks, one per thread, compute the local sums (a tight, sequential loop on each thread) and then combine them. This is a huge performance win.

For now, we have just provided specialised folds which we know are fast and which are easy to implement. When the rest of the system works reliably, we'll think about how to implement the general fold combinator with good performance for purely sequential combining operators.

Roman

```
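The tree-reduction scheme Roman describes can be sketched in ordinary Haskell, using plain lists in place of DPH's parallel arrays. The name `treeFold` is hypothetical (it is not a DPH combinator); each round of `zipWith` over the two halves plays the role of the `zipWithP` step, and `f` is assumed to be associative so that the tree-shaped grouping gives the same answer as a left-to-right fold.

```haskell
-- Sketch only: a sequential simulation of parallel tree reduction.
-- In DPH, each round would be (zipWithP f ys zs), which is what
-- eliminates the nested parallelism when f is itself parallel.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold f z = go
  where
    go []  = z          -- empty input: return the seed
    go [y] = y          -- one element left: done
    go ys  =
      let n        = length ys `div` 2
          (as, bs) = splitAt n ys
          -- Combine the two halves pointwise; if the length is odd,
          -- the leftover element of bs is carried to the next round.
      in  go (zipWith f as bs ++ drop n bs)
```

Each round halves the array, so an input of length n takes about log2 n rounds, which is the depth of the parallel reduction tree.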
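The chunked strategy for a purely sequential operator can likewise be sketched with lists; the parallelism is only simulated here, and `chunkedFold` is a hypothetical name, not part of DPH. Each chunk stands for the work of one thread (a tight sequential loop), and the per-chunk results are then combined. Note the sketch assumes z is an identity for f (as 0 is for the Int addition in sumP), since each chunk starts its fold from z.

```haskell
import Data.List (foldl', unfoldr)

-- Sketch only: split the input into one chunk per "thread", fold each
-- chunk sequentially, then combine the partial results.  In DPH the
-- map over chunks would run in parallel, one chunk per thread.
chunkedFold :: Int -> (a -> a -> a) -> a -> [a] -> a
chunkedFold nThreads f z xs = foldl' f z partials
  where
    -- Ceiling division so every element lands in some chunk.
    chunkSize = max 1 ((length xs + nThreads - 1) `div` max 1 nThreads)
    chunks    = unfoldr (\ys -> if null ys
                                then Nothing
                                else Just (splitAt chunkSize ys)) xs
    -- Each of these is the tight, sequential per-thread loop.
    partials  = map (foldl' f z) chunks
```

This is why knowing that f is sequential is such a performance win: the inner loops need no splitting or re-zipping at all, and only nThreads partial results are combined at the end.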