[Haskell-cafe] Re: Parallel foldl doesn't work correctly

Sun Dec 13 13:09:34 EST 2009

On Sat, 2009-12-12 at 13:46 +0000, Ben Millwood wrote:
> On Sat, Dec 12, 2009 at 10:08 AM, Maciej Piechotka
> <uzytkownik2 at gmail.com> wrote:
> > If operation is associative it can be done using divide et impera
> > spliting list in half and operating on it pararerlly then split in half
> > etc.
> 

Thank you very much for the replies.

I've come to the conclusion that, yep, you can't (directly) parallelise
of fold operation, as fold guarantees order of processing.

With something like map the runtime is free to continue sparking
function application to each element without waiting for the result.
So we spark f x, force evaluation of the remainder of the xs and
recurse.
I'm *guessing* at a detailed level when we are creating the output list,
haskell can concat each result element before f x returns due to
laziness - that is, haskell doesn't need to wait for evaluation of f x,
before continuing?

With fold, and specifically with foldl (+), this isn't the case as (+)
is strict on both arguments and thus it cannot continue until each
sparked evaluation has completed and combined with the accumulator.  If
(+) was not strict on both arguments, I'm not sure if could solider
on... assuming I've understood map correctly!?

Writing it out long hand (sorry if this is tedious!), we have:

using :: a -> Strategy a -> a
using x s = s x `seq` x

rwhnf :: Strategy a 
rwhnf x = x `seq` ()  

parList :: Strategy a -> Strategy [a]
parList strat []     = ()
parList strat (x:xs) = strat x `par` (parList strat xs)

parMap :: Strategy b -> (a -> b) -> [a] -> [b]
parMap strat f = (`using` parList strat) . map f 

'using' applies a strategy to an item, and then returns the item.
'parList' is a (combinator) strategy which applies an atomic strategy to
each element in the list *in parallel* (for example forcing each element
to WHNF).

So for parMap we have xs passed into 'map f' - the result is then passed
to 'using' which will force application of 'f' on each element in
parallel by way of 'parList'.  No forced evaluation is dependant on a
previous evaluation.

Now for parFoldl - a crude and wrong representation for my purposes
could be:

parFoldl :: Num b => Strategy b -> (a -> b) -> [a] -> b
parFoldl strat f = sum . (`using` parList strat) . map f

This isn't really a fold of course, but it is doing roughly the same
thing, it's summing the results of applying function 'f' to each element
in a list.

The problem here is that sum will only allow one spark at a time,
because

sum []     = 0
sum (x:xs) = x + sum xs

So we get something like:
0 + (x4 + (x3 + (x2 + (x1))))

For example the result for (x4 + previous) can only be evaluated after
x3, x2 and x1 have been evaluated.  This means it won't spark evaluation
on x4 until (x3 + ....) has been evaluated, thus only one core is ever
used.

I believe fold is just the general case of sum and the same logic
applies.

I suppose my questions are:

Have I got this right, if not very succinct!?  

Is it purely the strictness of (+) that causes this situation?

Ignoring DPH, is it possible to write a parallel fold avoiding something
like the technique below?

Anyhow, a workaround similar to those suggested I came up with is to
divide the folds up across the cores and then sum the sub-folds - this
produces approximately double the performance across two cores:

import Control.Parallel.Strategies (parMap,rwhnf)
import Data.List (foldl')
import Data.List.Split (chunk)
import GHC.Conc (numCapabilities)

-- Prepare to share work to be 
-- done across available cores
chunkOnCpu :: [a] -> [[a]]
chunkOnCpu xs = chunk (length xs `div` numCapabilities) xs

-- Spark a fold of each chunk and
-- sum the results. Only works because
-- for associative folds.
foldChunks :: ([a] -> a) -> (a -> b -> a) -> a -> [[b]] -> a
foldChunks combineFunc foldFunc acc = 
  combineFunc . (parMap rwhnf $ foldl' foldFunc acc)

-- Some pointless work to keep thread busy
workFunc :: Int -> Int
workFunc 1 = 1
workFunc x = workFunc $ x - 1

-- Do some work on element x and append
foldFunc :: Int -> Int -> Int
foldFunc acc x = acc + workFunc x 

testList = repeat 1000000000
answer =  foldChunks sum foldFunc 0 $ chunkOnCpu (take 50 testList)

main :: IO()
main = print answer