[Haskell-beginners] Performance of parallel mergesort
Jon Harrop
jon at ffconsultancy.com
Sun Dec 27 22:19:58 EST 2009
On Sunday 27 December 2009 20:56:51 you wrote:
> Jon Harrop wrote:
> > This is something that concerns me. Lots of discussions of parallelism,
> > including the Haskell literature I have read, neglect this critical
> > problem of making sure that more time is spent doing useful work than
> > spawning tasks (sparks). How is this solved in Haskell? Do you just put
> > magic numbers in that work on the machine you're currently using?
>
> It is simply not true that Haskell literature neglects the question of
> spark granularity - this is very basic and is always covered. Read
> "Real World Haskell" (available free online). There's no 'magic
> number'. You must explicitly write your code to give the right granule
> size.
There is no "right granule" size. That's the whole point: the optimum is a
function of the machine. If you hardcode the granularity then your code isn't
future-proof and isn't portable.
From chapter 24 of Real World Haskell on sorting:
"At this fine granularity, the cost of using par outweighs any possible
usefulness. To reduce this effect, we switch to our non-parallel sort after
passing some threshold."
From the Sorting.hs file, parSort2 accepts a threshold "d":
parSort2 :: (Ord a) => Int -> [a] -> [a]
parSort2 d list@(x:xs)
  | d <= 0    = sort list
  | otherwise = force greater `par` (force lesser `pseq`
                                     (lesser ++ x:greater))
  where lesser  = parSort2 d' [y | y <- xs, y < x]
        greater = parSort2 d' [y | y <- xs, y >= x]
        d'      = d - 1
parSort2 _ _ = []
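(For reference, the snippet assumes imports and a spine-forcing helper along
the following lines; the book's Sorting.hs has its own definitions, so treat
this only as a sketch of the assumed context:)

import Control.Parallel (par, pseq)
import Data.List (sort)

-- Walk the spine of the list to force evaluation of its structure.
force :: [a] -> ()
force xs = go xs `pseq` ()
  where go (_:ys) = go ys
        go []     = ()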
From the SortMain.hs file, it is always invoked with the magic number "2":
testFunction = parSort2 2
Moreover, their approach of subdividing a fixed number of times is suboptimal
because it inhibits load balancing.
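To make the point concrete, here is a minimal sketch (mine, not the book's) of
deriving the subdivision depth from the machine instead of hardcoding 2; the
factor of 4 is itself an assumption, chosen only to leave the spark pool some
slack for load balancing:

import GHC.Conc (numCapabilities)

-- Pick a depth so that roughly numCapabilities * 4 sparks are created,
-- i.e. 2^adaptiveDepth >= numCapabilities * 4.
adaptiveDepth :: Int
adaptiveDepth =
  ceiling (logBase 2 (fromIntegral (numCapabilities * 4) :: Double))

The magic "2" in testFunction would then become "parSort2 adaptiveDepth", so
the spark count at least tracks the core count at run time.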
Later, about parallelized IO, they give the code:
chunkedReadWith :: (NFData a) =>
                   ([LB.ByteString] -> a) -> FilePath -> IO a
chunkedReadWith func path =
    withChunks (lineChunks (numCapabilities * 4)) func path
where "4" is one magic number that gets multiplied by the magic number the
user supplied via the +RTS -N<n> command-line option.
They make no attempt to adapt the granularity to the machine at all and rely
entirely upon magic numbers. Consequently, their parallel sort that got a 25%
speedup on two cores achieves a 30% slowdown on my 8-core machine.
> I don't know the exact cost of sparking, but in my experience it
> is quite small - so - as long as your granule is doing *some* real work,
> it should speed up.
Can you quantify it, e.g. how many FLOPs?
> Jon Harrop wrote:
> > The parallelism is obviously not obtaining any real speedup so something
> > is wrong. But can anyone fix it?
>
> I've spent a bit of time measuring parallel speedup on real commercial
> projects, and this is what I've found:
>
> 1. ghc-6.12 performs significantly better than ghc-6.10, and has now
> been released, therefore don't use ghc-6.10.
Ok.
> 2. The defaults generally work better than giving huge heap sizes. Your
> -K100000000 - maximum stack size per thread - will either do nothing or
> cause an artificial slowdown (I have observed this with the minimum heap
> size option). Don't use it, unless experimentation proves it makes
> things better.
On the contrary, the Haskell program dies without it:
$ time ./mergesort +RTS -N8
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.
real 0m33.320s
user 3m29.397s
sys 0m0.592s
I had to add that -K command-line option just to get the program to run to
completion.
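For reference, the invocation that does run to completion looks like this
(with the stack limit quoted above):

$ time ./mergesort +RTS -N8 -K100000000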
> 3. +RTS -s is well worth using. It breaks the time down into MUT
> (mutator) and GC (garbage collector).
It says 78.5% GC time (with GHC 6.10).
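That figure comes from appending the statistics flag to the same run, along
the lines of:

$ ./mergesort +RTS -N8 -K100000000 -s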
> 4. MUT parallelization is excellent, but the parallel GC is not so good.
> If your code is GC-heavy it can spend around half of its time garbage
> collecting, which doesn't parallelize so well, and this eats into the
> speedup.
Ok.
> 5. There seems to be a scalability problem with the parallel gc for
> larger numbers of cores (namely 8). I am guessing somewhat, but my
> experiments tend to confirm the issue raised in Simon Marlow's (the
> implementor of GHC parallelization) recent paper that it's to do with
> "stopping the world" for gc.
Do you mean this bug:
http://hackage.haskell.org/trac/ghc/ticket/3553
> If GHC's lack of perfection at this point in time makes Haskell "look
> bad" I don't mind. I am not selling anything, so the reader at least
> knows they're getting the truth. I see this as one of the great
> advantages of open source.
I'm sure we'd all rather see speedups. :-)
> Progress on GHC has been very rapid in the last couple of years, and so
> I know we'll continue to see the speed of GHC's parallelism improving in
> leaps and bounds. It's actually still quite a new area, considering the
> difficulty of some of the technical issues and how recent it is that
> multicores are widely available on consumer hardware.
My original intent was to test a claim someone made: that mergesort in Haskell
is competitive in performance and trivially parallelizable.
> I know you OCaml/F# guys are making great progress too.
F# is production-ready, but OCaml is dead in the water when it comes to
multicore. I'm working on an OCaml replacement called HLVM, but it is early
days yet.
--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e