Having trouble with parallel Haskell

Thu Jun 5 05:44:15 EDT 2008

Bryan O'Sullivan wrote:
> I intended to write a half chapter or so about parallel programming in
> Haskell for the book I'm working on, but I've been coming to the
> conclusion that the time is not yet ripe for this.  I hope it will be
> helpful if I share my experiences here.
> 
> Specifically, I was planning to write about parallel programming in pure
> code: use of "par" and programming with strategies.
> 
> The most substantial problem is that the threaded RTS in GHC 6.8.2 is
> very crashy in the face of "par": about 90% of my runs fail with a
> segfault or an assertion failure.  Simon Marlow mentioned that this bug
> is fixed, but I've been unsuccessful in building a GHC 6.8.3 release
> candidate snapshot so far, so I can't verify this.
> 
> When a run does go through, I have found it surprisingly difficult to
> actually get both cores of a dual-core system to show activity.  As
> there are no tools I can use to see what is happening, I am stumped.  It
> is of course quite likely that I am doing something wrong that results
> in unexpected dependencies, but I cannot find a means to gaining insight
> into the problem.
> 
> As an example, I have a simple parallel quicksort that is very
> conservative in its sparking, and I get no joy out of it on a dual-core
> machine: it mostly sticks to a single core.  This wouldn't be a problem
> if I had some way to spot the presumed data dependency that is
> serialising the code, but no joy.

I don't have a great deal of time to look at this right now, but I had a 
bit more luck with this version:

parSort :: (NFData a, Ord a) => Int -> [a] -> [a]
parSort d list@(x:xs)
   | d <= 0    = sort list
   | otherwise = rnf below `pseq` (
                 rnf above `pseq` (
                 rnf lesser `par` (
                 rnf greater `pseq` (
                 (lesser ++ x:greater)))))
       where below = [y | y <- xs, y <  x]
             above = [y | y <- xs, y >= x]
             lesser      = parSort d' below
             greater     = parSort d' above
             d' = d - 1
parSort _ _              = []

I imagine the problem is mainly to do with nested pars: in the original 
code, the left-hand par evaluates immediately into another par, and so on: 
there's no actual *work* being done in each spark.  Which is why I tried to 
do some evaluation before each par above.  It still doesn't achieve much 
speedup though.

Incedentally, this kind of recursive parallelism is typically much harder 
to get any benefit from than, for example, a simple top-level parMap.  If 
you're just trying to introduce parallelism, it might be worthwhile 
sticking to the easy cases: simple strategies, and speeding up concurrent 
programs.

Also you might notice that the GC is doing a lot of work in this example 
(first port of call is always +RTS -sstderr).  I tried the parallel GC, but 
it didn't help at all, I guess because the heap is full of lists and the 
parallel GC likes trees.

The biggest problem, as you note, is the lack of tools for investigating 
parallel performance.  This is high on our priority list, we have a couple 
of projects planned over the summer to improve the situation.

Cheers,
	Simon