[Haskell-beginners] Performance of parallel mergesort
gale at sefer.org
Mon Dec 28 07:56:17 EST 2009
This discussion definitely does not belong on the
Haskell-Beginners list. Besides not being a topic
for beginners, keeping it there leaves it under the
radar of most of the people who work on
these things in Haskell.
I am moving the discussion to the GHC users list,
as suggested by Antoine. For those joining us in
progress, the first part of the thread starts here:
On Mon, Dec 28, 2009 at 5:19 AM, Jon Harrop wrote:
> On Sunday 27 December 2009 20:56:51 Stephen Blackheath wrote:
>> Jon Harrop wrote:
>> > This is something that concerns me. Lots of discussions of parallelism,
>> > including the Haskell literature I have read, neglect this critical
>> > problem
>> > of making sure that more time is spent doing useful work than spawning
>> > tasks
>> > (sparks). How is this solved in Haskell? Do you just put magic numbers in
>> > that work on the machine you're currently using?
>> It is simply not true that Haskell literature neglects the question of
>> spark granularity - this is very basic and is always covered. Read
>> "Real World Haskell" (available free online). There's no 'magic
>> number'. You must explicitly write your code to give the right granule
>> size.
> There is no "right granule" size. That's the whole point: the optimum is a
> function of the machine. If you hardcode the granularity then your code isn't
> future-proof and isn't portable.
> From chapter 24 of Real World Haskell on sorting:
> "At this fine granularity, the cost of using par outweighs any possible
> usefulness. To reduce this effect, we switch to our non-parallel sort after
> passing some threshold."
> From the Sorting.hs file, parSort2 accepts a threshold "d":
> parSort2 :: (Ord a) => Int -> [a] -> [a]
> parSort2 d list@(x:xs)
>   | d <= 0    = sort list
>   | otherwise = force greater `par` (force lesser `pseq`
>                                      (lesser ++ x:greater))
>   where lesser  = parSort2 d' [y | y <- xs, y < x]
>         greater = parSort2 d' [y | y <- xs, y >= x]
>         d'      = d - 1
> parSort2 _ _ = []
> From the SortMain.hs file, it is always invoked with the magic number "2":
> testFunction = parSort2 2
> Moreover, their approach of subdividing a fixed number of times is suboptimal
> because it inhibits load balancing.
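To make the load-balancing point concrete, here is a minimal sketch of
a cutoff that is decoupled from the recursion depth: switch to the
sequential sort below a *length* threshold rather than a depth
threshold. This is my own sketch, not code from Real World Haskell;
parSortLen and forceList are names I made up, and the threshold is
still a machine-dependent tunable, which is exactly the objection
above. It needs the parallel package and -threaded:

    import Control.Parallel (par, pseq)
    import Data.List (sort)

    -- Walk the spine of a list, analogous to the force function
    -- that RWH's Sorting.hs defines for the same purpose.
    forceList :: [a] -> ()
    forceList []       = ()
    forceList (_ : xs) = forceList xs

    -- Fall back to the sequential sort once the sublist is shorter
    -- than the threshold, so every spark gets a minimum amount of
    -- work no matter how unbalanced the recursion tree is.
    parSortLen :: Ord a => Int -> [a] -> [a]
    parSortLen threshold list@(x:xs)
      | length list < threshold = sort list
      | otherwise =
          forceList greater `par`
            (forceList lesser `pseq` (lesser ++ x : greater))
      where
        lesser  = parSortLen threshold [y | y <- xs, y < x]
        greater = parSortLen threshold [y | y <- xs, y >= x]
    parSortLen _ [] = []

A call like parSortLen 1000 still hardcodes a granule size; it just
stops tying the granule to the number of subdivisions.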
> Later, about parallelized IO, they give the code:
> chunkedReadWith :: (NFData a) =>
>                    ([LB.ByteString] -> a) -> FilePath -> IO a
> chunkedReadWith func path =
>     withChunks (lineChunks (numCapabilities * 4)) func path
> where "4" is one magic number that gets multiplied by the magic number the
> user supplied via the +RTS -N<n> command-line option.
> They make no attempt to adapt the granularity to the machine at all and rely
> entirely upon magic numbers. Consequently, their parallel sort that got a 25%
> speedup on two cores achieves a 30% slowdown on my 8-core machine.
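As an aside, this kind of claim is cheap to re-check on any given
machine by sweeping the core count. Assuming the RWH source files and
a hypothetical binary name:

    $ ghc --make -O2 -threaded SortMain.hs -o sortbench
    $ for n in 1 2 4 8; do ./sortbench +RTS -N$n; done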
>> I don't know the exact cost of sparking, but in my experience it
>> is quite small, so as long as your granule is doing *some* real work,
>> it should speed up.
> Can you quantify it, e.g. how many FLOPs?
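In the absence of published numbers, the per-spark overhead can at
least be estimated with a micro-benchmark. A rough sketch, assuming
the criterion package; sparked and plain are toy functions of my own,
and the gap between their timings approximates the cost of the `par`
itself:

    import Criterion.Main
    import Control.Parallel (par, pseq)

    -- Sum two halves, sparking the evaluation of the first half.
    sparked :: Int -> Int
    sparked n = a `par` (b `pseq` (a + b))
      where a = sum [1 .. n]
            b = sum [n + 1 .. 2 * n]

    -- The same work with no spark, as a baseline.
    plain :: Int -> Int
    plain n = sum [1 .. n] + sum [n + 1 .. 2 * n]

    -- Compile with -O2 -threaded; run with +RTS -N2 and again
    -- without -N to see how the overhead varies.
    main :: IO ()
    main = defaultMain
      [ bench "sparked" (whnf sparked 10000)
      , bench "plain"   (whnf plain 10000)
      ]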
>> Jon Harrop wrote:
>> > The parallelism is obviously not obtaining any real speedup so something
>> > is wrong. But can anyone fix it?
>> I've spent a bit of time measuring parallel speedup on real commercial
>> projects, and this is what I've found:
>> 1. ghc-6.12 performs significantly better than ghc-6.10, and has now
>> been released, therefore don't use ghc-6.10.
>> 2. The defaults generally work better than giving huge heap sizes. Your
>> -K100000000 - maximum stack size per thread - will either do nothing or
>> cause an artificial slowdown (I have observed this with the minimum heap
>> size option). Don't use it, unless experimentation proves it makes
>> things better.
> On the contrary, the Haskell program dies without it:
> $ time ./mergesort +RTS -N8
> Stack space overflow: current size 8388608 bytes.
> Use `+RTS -Ksize' to increase it.
> real 0m33.320s
> user 3m29.397s
> sys 0m0.592s
> I had to add that -K command-line option just to get the program to run to
> completion.
>> 3. +RTS -s is well worth using. It breaks the time down into MUT
>> (mutator) and GC (garbage collector).
> It says 78.5% GC time (with GHC 6.10).
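For anyone reproducing this, the invocation that yields that breakdown,
combining the flags already mentioned in this thread, would be along
the lines of:

    $ ./mergesort +RTS -N8 -K100000000 -s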
>> 4. MUT parallelization is excellent, but the parallel GC is not so good.
>> If your code is GC-heavy it can spend around half of its time garbage
>> collecting, which doesn't parallelize so well, and this eats into the
>> overall speedup.
>> 5. There seems to be a scalability problem with the parallel GC for
>> larger numbers of cores (namely 8). I am guessing somewhat, but my
>> experiments tend to confirm the issue raised in the recent paper by
>> Simon Marlow (the implementor of GHC's parallelism): that it's to do
>> with "stopping the world" for GC.
> Do you mean this bug:
>> If GHC's lack of perfection at this point in time makes Haskell "look
>> bad" I don't mind. I am not selling anything, so the reader at least
>> knows they're getting the truth. I see this as one of the great
>> advantages of open source.
> I'm sure we'd all rather see speedups. :-)
>> Progress on GHC has been very rapid in the last couple of years, and so
>> I know we'll continue to see the speed of GHC's parallelism improving in
>> leaps and bounds. It's actually still quite a new area, considering the
>> difficulty of some of the technical issues and how recent it is that
>> multicores are widely available on consumer hardware.
> My original intent was to test a claim someone made: that mergesort in Haskell
> is competitively performant and trivially parallelized.
>> I know you OCaml/F# guys are making great progress too.
> F# is production ready but OCaml is dead in the water when it comes to
> multicore. I'm working on an OCaml replacement called HLVM but it is early
> days yet.
> Dr Jon Harrop, Flying Frog Consultancy Ltd.