[Haskell-beginners] Performance of parallel mergesort

Jon Harrop jon at ffconsultancy.com
Sun Dec 27 22:19:58 EST 2009


On Sunday 27 December 2009 20:56:51 you wrote:
> Jon Harrop wrote:
> > This is something that concerns me. Lots of discussions of parallelism,
> > including the Haskell literature I have read, neglect this critical
> > problem
> > of making sure that more time is spent doing useful work than spawning
> > tasks
> > (sparks). How is this solved in Haskell? Do you just put magic numbers in
> > that work on the machine you're currently using?
>
> It is simply not true that Haskell literature neglects the question of
> spark granularity - this is very basic and is always covered. Read 
> "Real World Haskell" (available free online).  There's no 'magic
> number'. You must explicitly write your code to give the right granule 
> size.

There is no "right granule" size. That's the whole point: the optimum is a 
function of the machine. If you hardcode the granularity then your code isn't 
future proof and isn't portable.

From chapter 24 of Real World Haskell on sorting:

  "At this fine granularity, the cost of using par outweighs any possible 
usefulness. To reduce this effect, we switch to our non-parallel sort after 
passing some threshold."

From the Sorting.hs file, parSort2 accepts a threshold "d":

  parSort2 :: (Ord a) => Int -> [a] -> [a]
  parSort2 d list@(x:xs)
    | d <= 0     = sort list
    | otherwise = force greater `par` (force lesser `pseq`
                                      (lesser ++ x:greater))
        where lesser      = parSort2 d' [y | y <- xs, y <  x]
              greater     = parSort2 d' [y | y <- xs, y >= x]
              d' = d - 1
  parSort2 _ _              = []

From the SortMain.hs file, it is always invoked with the magic number "2":

  testFunction = parSort2 2

Moreover, their approach of subdividing to a fixed depth is suboptimal: the 
number of sparks is fixed in the source, independent of input size and core 
count, which inhibits load balancing (a size-based alternative is sketched 
below).
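
As a point of comparison, here is a minimal sketch (mine, not from the book) 
that derives the cutoff from the input size and numCapabilities instead of a 
hard-coded depth, so the spark count scales with the machine. The factor of 8 
granules per capability is an arbitrary assumption, and force is a 
spine-forcing helper along the lines of the book's Sorting.hs:

  import Control.Parallel (par, pseq)
  import Data.List (sort)
  import GHC.Conc (numCapabilities)

  -- Spark work only while a sublist is larger than a machine-derived
  -- threshold, rather than counting subdivision depth.
  parSortAdaptive :: (Ord a) => [a] -> [a]
  parSortAdaptive xs = go xs
    where
      -- aim for several granules per capability; the factor 8 is a guess
      threshold = max 1 (length xs `div` (numCapabilities * 8))
      go list@(y:ys)
        | length list <= threshold = sort list
        | otherwise = force greater `par` (force lesser `pseq`
                                          (lesser ++ y:greater))
        where lesser  = go [z | z <- ys, z <  y]
              greater = go [z | z <- ys, z >= y]
      go [] = []

  -- force the spine of a list, as in the book's Sorting.hs
  force :: [a] -> ()
  force list = go' list `pseq` ()
    where go' (_:t) = go' t
          go' []    = 1 :: Int

This only addresses spark granularity, of course; it does nothing about the GC 
behaviour discussed further down.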

Later, about parallelized IO, they give the code:

  chunkedReadWith :: (NFData a) =>
                    ([LB.ByteString] -> a) -> FilePath -> IO a
  chunkedReadWith func path =
      withChunks (lineChunks (numCapabilities * 4)) func path

where "4" is one magic number that gets multiplied by the magic number the 
user supplied via the +RTS -N<n> command-line option.
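
A minimal way to make at least the hard-coded factor tunable would be to take 
it as a parameter, reusing the book's withChunks and lineChunks (a hypothetical 
variant, not from the book). It still leaves the choice to a human rather than 
the machine, which is the real problem:

  -- hypothetical variant: caller chooses the chunks-per-capability factor
  chunkedReadWithN :: (NFData a) =>
                      Int -> ([LB.ByteString] -> a) -> FilePath -> IO a
  chunkedReadWithN perCap func path =
      withChunks (lineChunks (numCapabilities * perCap)) func path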

They make no attempt to adapt the granularity to the machine at all and rely 
entirely upon magic numbers. Consequently, their parallel sort that got a 25% 
speedup on two cores achieves a 30% slowdown on my 8-core machine.

> I don't know the exact cost of sparking, but in my experience it 
> is quite small - so - as long as your granule is doing *some* real work,
> it should speed up.

Can you quantify it, e.g. how many FLOPS?
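
For what it's worth, one crude way to get a number (my own sketch, not 
something from this thread) is to spark a large number of trivial computations, 
time the loop, and divide by the spark count; comparing runs under +RTS -N1 and 
-N8 brackets the per-spark overhead that any useful granule has to amortise. It 
needs the time package and only gives a ballpark figure:

  import Control.Parallel (par, pseq)
  import Data.Time.Clock (diffUTCTime, getCurrentTime)

  -- create n sparks whose "work" is a single addition
  sparkTrivial :: Int -> ()
  sparkTrivial 0 = ()
  sparkTrivial n = x `par` (x `pseq` sparkTrivial (n - 1))
    where x = n + 1

  main :: IO ()
  main = do
    let n = 1000000 :: Int
    t0 <- getCurrentTime
    sparkTrivial n `pseq` return ()
    t1 <- getCurrentTime
    -- rough nanoseconds per spark; compare runs with +RTS -N1 and -N8 -s
    print (realToFrac (diffUTCTime t1 t0) * 1e9 / fromIntegral n :: Double)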

> Jon Harrop wrote:
> > The parallelism is obviously not obtaining any real speedup so something
> > is wrong. But can anyone fix it?
>
> I've spent a bit of time measuring parallel speedup on real commercial
> projects, and this is what I've found:
>
> 1. ghc-6.12 performs significantly better than ghc-6.10, and has now
> been released, therefore don't use ghc-6.10.

Ok.

> 2. The defaults generally work better than giving huge heap sizes.  Your
> -K100000000 - maximum heap size per thread - will either do nothing or
> cause an artificial slowdown (I have observed this with the minimum heap
> size option).  Don't use it, unless experimentation proves it makes
> things better.

On the contrary, the Haskell program dies without it:

$ time ./mergesort +RTS -N8
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.
[1]+  Done                    kwrite mergesort.hs

real    0m33.320s
user    3m29.397s
sys     0m0.592s

I had to add that -K command line option just to get the program to run to 
completion.
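
For reference, the run only completes with the per-thread stack limit raised 
(which is what -K controls, as the overflow message above says), e.g.:

$ time ./mergesort +RTS -N8 -K100000000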

> 3. +RTS -s is well worth using. It breaks the time down into MUT
> (mutator) and GC (garbage collector).

It says 78.5% GC time (with GHC 6.10).
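
That breakdown comes from appending -s to the same RTS options, e.g.:

$ ./mergesort +RTS -N8 -K100000000 -s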

> 4. MUT parallelization is excellent, but the parallel GC is not so good.
>  If your code is GC-heavy it can spend around half of its time garbage
> collecting, which doesn't parallelize so well, and this eats into the
> speedup.

Ok.

> 5. There seems to be a scalability problem with the parallel gc for
> larger numbers of cores (namely 8).  I am guessing somewhat, but my
> experiments tend to confirm the issue raised in Simon Marlow's (the
> implementor of GHC parallelization) recent paper that it's to do with
> "stopping the world" for gc.

Do you mean this bug:

  http://hackage.haskell.org/trac/ghc/ticket/3553

> If GHC's lack of perfection at this point in time makes Haskell "look
> bad" I don't mind.  I am not selling anything, so the reader at least
> knows they're getting the truth.  I see this as one of the great
> advantages of open source.

I'm sure we'd all rather see speedups. :-)

> Progress on GHC has been very rapid in the last couple of years, and so
> I know we'll continue to see the speed of GHC's parallelism improving in
> leaps and bounds.  It's actually still quite a new area, considering the
> difficulty of some of the technical issues and how recent it is that
> multicores are widely available on consumer hardware.

My original intent was to test a claim someone made: that mergesort in Haskell 
is competitively performant and trivially parallelized.

> I know you OCaml/F# guys are making great progress too.

F# is production-ready, but OCaml is dead in the water when it comes to 
multicore. I'm working on an OCaml replacement called HLVM, but it is early 
days yet.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

