[Haskell-beginners] Performance of parallel mergesort
gale at sefer.org
Mon Dec 28 07:56:17 EST 2009
This discussion definitely does not belong on the
Haskell-Beginners list. Besides not being a topic
for beginners, keeping it there leaves it under the
radar of most of the people who work on
these things in Haskell.
I am moving the discussion to the GHC users list,
as suggested by Antoine. For those joining us in
progress, the first part of the thread starts here:
On Mon, Dec 28, 2009 at 5:19 AM, Jon Harrop wrote:
> On Sunday 27 December 2009 20:56:51 Stephen Blackheath wrote:
>> Jon Harrop wrote:
>> > This is something that concerns me. Lots of discussions of parallelism,
>> > including the Haskell literature I have read, neglect this critical
>> > problem
>> > of making sure that more time is spent doing useful work than spawning
>> > tasks
>> > (sparks). How is this solved in Haskell? Do you just put magic numbers in
>> > that work on the machine you're currently using?
>> It is simply not true that Haskell literature neglects the question of
>> spark granularity - this is very basic and is always covered. Read
>> "Real World Haskell" (available free online). There's no 'magic
>> number'. You must explicitly write your code to give the right granule
>> size.
> There is no "right granule" size. That's the whole point: the optimum is a
> function of the machine. If you hardcode the granularity then your code isn't
> future-proof and isn't portable.
> From chapter 24 of Real World Haskell on sorting:
> "At this fine granularity, the cost of using par outweighs any possible
> usefulness. To reduce this effect, we switch to our non-parallel sort after
> passing some threshold."
> From the Sorting.hs file, parSort2 accepts a threshold "d":
> parSort2 :: (Ord a) => Int -> [a] -> [a]
> parSort2 d list@(x:xs)
>   | d <= 0    = sort list
>   | otherwise = force greater `par` (force lesser `pseq`
>                                      (lesser ++ x:greater))
>   where lesser  = parSort2 d' [y | y <- xs, y < x]
>         greater = parSort2 d' [y | y <- xs, y >= x]
>         d'      = d - 1
> parSort2 _ _ = []
> From the SortMain.hs file, it is always invoked with the magic number "2":
> testFunction = parSort2 2
> Moreover, their approach of subdividing a fixed number of times is suboptimal
> because it inhibits load balancing.
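To make the load-balancing point concrete, here is a minimal sketch of
a cutoff that is decoupled from the recursion depth: switch to the
sequential sort below a *length* threshold rather than a depth
threshold. This is my own sketch, not code from Real World Haskell;
parSortLen and forceList are names I made up, and the threshold is
still a machine-dependent tunable, which is exactly the objection
above. It needs the parallel package and -threaded:

    import Control.Parallel (par, pseq)
    import Data.List (sort)

    -- Walk the spine of a list, analogous to the force function
    -- that RWH's Sorting.hs defines for the same purpose.
    forceList :: [a] -> ()
    forceList []       = ()
    forceList (_ : xs) = forceList xs

    -- Fall back to the sequential sort once the sublist is shorter
    -- than the threshold, so every spark gets a minimum amount of
    -- work no matter how unbalanced the recursion tree is.
    parSortLen :: Ord a => Int -> [a] -> [a]
    parSortLen threshold list@(x:xs)
      | length list < threshold = sort list
      | otherwise =
          forceList greater `par`
            (forceList lesser `pseq` (lesser ++ x : greater))
      where
        lesser  = parSortLen threshold [y | y <- xs, y < x]
        greater = parSortLen threshold [y | y <- xs, y >= x]
    parSortLen _ [] = []

A call like parSortLen 1000 still hardcodes a granule size; it just
stops tying the granule to the number of subdivisions.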
> Later, about parallelized IO, they give the code:
> chunkedReadWith :: (NFData a) =>
>                    ([LB.ByteString] -> a) -> FilePath -> IO a
> chunkedReadWith func path =
>     withChunks (lineChunks (numCapabilities * 4)) func path
> where "4" is one magic number that gets multiplied by the magic number the
> user supplied via the +RTS -N<n> command-line option.
> They make no attempt to adapt the granularity to the machine at all and rely
> entirely upon magic numbers. Consequently, their parallel sort that got a 25%
> speedup on two cores achieves a 30% slowdown on my 8-core machine.
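As an aside, this kind of claim is cheap to re-check on any given
machine by sweeping the core count. Assuming the RWH source files and
a hypothetical binary name:

    $ ghc --make -O2 -threaded SortMain.hs -o sortbench
    $ for n in 1 2 4 8; do ./sortbench +RTS -N$n; done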
>> I don't know the exact cost of sparking, but in my experience it
>> is quite small, so as long as your granule is doing *some* real work,
>> it should speed up.
> Can you quantify it, e.g. how many FLOPs?
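In the absence of published numbers, the per-spark overhead can at
least be estimated with a micro-benchmark. A rough sketch, assuming
the criterion package; sparked and plain are toy functions of my own,
and the gap between their timings approximates the cost of the `par`
itself:

    import Criterion.Main
    import Control.Parallel (par, pseq)

    -- Sum two halves, sparking the evaluation of the first half.
    sparked :: Int -> Int
    sparked n = a `par` (b `pseq` (a + b))
      where a = sum [1 .. n]
            b = sum [n + 1 .. 2 * n]

    -- The same work with no spark, as a baseline.
    plain :: Int -> Int
    plain n = sum [1 .. n] + sum [n + 1 .. 2 * n]

    -- Compile with -O2 -threaded; run with +RTS -N2 and again
    -- without -N to see how the overhead varies.
    main :: IO ()
    main = defaultMain
      [ bench "sparked" (whnf sparked 10000)
      , bench "plain"   (whnf plain 10000)
      ]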
>> Jon Harrop wrote:
>> > The parallelism is obviously not obtaining any real speedup so something
>> > is wrong. But can anyone fix it?
>> I've spent a bit of time measuring parallel speedup on real commercial
>> projects, and this is what I've found:
>> 1. ghc-6.12 performs significantly better than ghc-6.10, and has now
>> been released, therefore don't use ghc-6.10.
>> 2. The defaults generally work better than giving huge heap sizes. Your
>> -K100000000 - maximum stack size per thread - will either do nothing or
>> cause an artificial slowdown (I have observed this with the minimum heap
>> size option). Don't use it, unless experimentation proves it makes
>> things better.
> On the contrary, the Haskell program dies without it:
> $ time ./mergesort +RTS -N8
> Stack space overflow: current size 8388608 bytes.
> Use `+RTS -Ksize' to increase it.
> real 0m33.320s
> user 3m29.397s
> sys 0m0.592s
> I had to add that -K command-line option just to get the program to run to
> completion.
>> 3. +RTS -s is well worth using. It breaks the time down into MUT
>> (mutator) and GC (garbage collector).
> It says 78.5% GC time (with GHC 6.10).
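For anyone reproducing this, the invocation that yields that breakdown,
combining the flags already mentioned in this thread, would be along
the lines of:

    $ ./mergesort +RTS -N8 -K100000000 -s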
>> 4. MUT parallelization is excellent, but the parallel GC is not so good.
>> If your code is GC-heavy it can spend around half of its time garbage
>> collecting, which doesn't parallelize so well, and this eats into the
>> overall speedup.
>> 5. There seems to be a scalability problem with the parallel GC for
>> larger numbers of cores (namely 8). I am guessing somewhat, but my
>> experiments tend to confirm the issue raised in the recent paper by
>> Simon Marlow (the implementor of GHC's parallelism): that it's to do
>> with "stopping the world" for GC.
> Do you mean this bug:
>> If GHC's lack of perfection at this point in time makes Haskell "look
>> bad" I don't mind. I am not selling anything, so the reader at least
>> knows they're getting the truth. I see this as one of the great
>> advantages of open source.
> I'm sure we'd all rather see speedups. :-)
>> Progress on GHC has been very rapid in the last couple of years, and so
>> I know we'll continue to see the speed of GHC's parallelism improving in
>> leaps and bounds. It's actually still quite a new area, considering the
>> difficulty of some of the technical issues and how recent it is that
>> multicores are widely available on consumer hardware.
> My original intent was to test a claim someone made: that mergesort in Haskell
> is competitively performant and trivially parallelized.
>> I know you OCaml/F# guys are making great progress too.
> F# is production ready but OCaml is dead in the water when it comes to
> multicore. I'm working on an OCaml replacement called HLVM but it is early
> days yet.
> Dr Jon Harrop, Flying Frog Consultancy Ltd.