[Haskell-beginners] Performance of parallel mergesort

Jon Harrop jon at ffconsultancy.com
Thu Dec 24 17:50:13 EST 2009

On Thursday 24 December 2009 09:09:41 you wrote:
> The mergesort example given here has a rather nasty problem – it creates
> sparks for mergeing the two lists at the same time as it creates the lists,
> so there's always at least one task blocking waiting for it's producers.

I see. So the sparks either contend for locks on the thunks to force that list 
or they repeat the same work.

> This variation on a theme goes roughly twice as fast for me with -N1, and
> *does* get a speedup from -N<somethinghigher>:
> mergesort [] = []
> mergesort [x] = [x]
> mergesort [x,y] = if x < y then [x,y] else [y,x]
> mergesort xs = (forceList sleft) `par`
>                (forceList sright) `pseq`
>                merge sleft sright
>                where
>                  (left,right) = split (length xs `div` 2) xs
>                  sleft = mergesort left
>                  sright = mergesort right
> Note the `pseq` not `par` before the merge.

Can you explain why that helps?

> I also added one more short-circuit case – we want to avoid spawning threads
> for tiny tasks like sorting 1 element lists.

This is something that concerns me. Lots of discussions of parallelism, 
including the Haskell literature I have read, neglect this critical problem 
of making sure that more time is spent doing useful work than spawning tasks 
(sparks). How is this solved in Haskell? Do you just put magic numbers in 
that work on the machine you're currently using?

Also, how much work does a Haskell thread usually have to do to make it worth 

> Here's my results on a dual core running OS X:
> LappyBob:~ tatd2$ time ./test +RTS -N1
> -0.117109518058233
> real 0m4.608s
> user 0m4.400s
> sys 0m0.189s
> LappyBob:~ tatd2$ time ./test +RTS -N2
> -0.117109518058233
> real 0m3.648s
> user 0m6.360s
> sys 0m0.220s
> LappyBob:~ tatd2$ time ./test +RTS -N3
> -0.117109518058233
> real 0m50.679s
> user 1m24.235s
> sys 0m0.620s
> The last result is still off the wall, but it's a lot better.

That's another question: why did you get such awful performance for N=3 and I 
get similar results for N=8 on this 8 core?

Dr Jon Harrop, Flying Frog Consultancy Ltd.

More information about the Beginners mailing list