[Haskell-beginners] Performance of parallel mergesort
Simon Marlow
marlowsd at gmail.com
Wed Dec 30 06:34:08 EST 2009
On 28/12/09 12:56, Yitzchak Gale wrote:
> On Mon, Dec 28, 2009 at 5:19 AM, Jon Harrop wrote:
>> On Sunday 27 December 2009 20:56:51 Stephen Blackheath wrote:
>>> Jon Harrop wrote:
>>>> This is something that concerns me. Lots of discussions of parallelism,
>>>> including the Haskell literature I have read, neglect this critical
>>>> problem
>>>> of making sure that more time is spent doing useful work than spawning
>>>> tasks
>>>> (sparks). How is this solved in Haskell? Do you just put magic numbers in
>>>> that work on the machine you're currently using?
>>>
>>> It is simply not true that Haskell literature neglects the question of
>>> spark granularity - this is very basic and is always covered. Read
>>> "Real World Haskell" (available free online). There's no 'magic
>>> number'. You must explicitly write your code to give the right granule
>>> size.
>>
>> There is no "right granule" size. That's the whole point: the optimum is a
>> function of the machine. If you hardcode the granularity then your code isn't
>> future proof and isn't portable.
While that's true, in practice it's often not a problem.
Sometimes you can pick a granularity that is small enough to give you
thousands of parallel tasks, each of which is plenty big enough to
outweigh the overheads of task creation (which are very low in GHC
anyway). Scaling to thousands of cores is likely to be good enough for
quite a while.
You can always pick a granularity based on the number of cores (GHC
gives you access to that as GHC.Conc.numCapabilities).
Or, the program might have a natural granularity, which leaves you
little room for parallelising it in a different way.
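For the numCapabilities route, here's a minimal sketch (mine, not from
any of the quoted code; the factor of 4 is an arbitrary
oversubscription factor, and parListChunk/rdeepseq are from a recent
Control.Parallel.Strategies):

    import Control.Parallel.Strategies (parListChunk, rdeepseq, using)
    import GHC.Conc (numCapabilities)

    -- Square a big list in parallel, sizing the chunks so we get
    -- roughly 4 tasks per capability, leaving the scheduler some
    -- slack for load balancing.
    parSquares :: [Int] -> [Int]
    parSquares xs = map (^ 2) xs `using` parListChunk chunkSize rdeepseq
      where
        chunkSize = max 1 (length xs `div` (numCapabilities * 4))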
>> Moreover, their approach of subdividing a fixed number of times is suboptimal
>> because it inhibits load balancing.
If you create enough tasks, then load balancing works just fine. GHC
(in 6.12.1) uses work-stealing for load balancing, incidentally.
>> Later, about parallelized IO, they give the code:
>>
>> chunkedReadWith :: (NFData a) =>
>>                    ([LB.ByteString] -> a) -> FilePath -> IO a
>> chunkedReadWith func path =
>>     withChunks (lineChunks (numCapabilities * 4)) func path
>>
>> where "4" is one magic number that gets multiplied by the magic number the
>> user supplied via the +RTS -N<n> command-line option.
>>
>> They make no attempt to adapt the granularity to the machine at all and rely
>> entirely upon magic numbers. Consequently, their parallel sort that got a 25%
>> speedup on two cores achieves a 30% slowdown on my 8 core.
I'd recommend trying again with 6.12.1. You might also want to
experiment with the parallel GC settings - the default settings aren't
perfect for every program.
>>> I don't know the exact cost of sparking, but in my experience it
>>> is quite small - so - as long as your granule is doing *some* real work,
>>> it should speed up.
>>
>> Can you quantify it, e.g. How many FLOPS?
You can't measure time in FLOPS, which is a rate. But my guess is that
a spark could be as low as 20 cycles if everything is in the L1 cache
and the branches are predicted correctly (there are 2 function calls,
2 branches, and about 3 memory references; we could inline to remove
the function calls). It's just a push onto a lock-free work-stealing
queue, and involves no atomic instructions.
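For reference, a spark is just an unevaluated expression handed to
`par`; a minimal sketch (fib is only a stand-in for real work):

    import Control.Parallel (par, pseq)

    -- `par` pushes the thunk for `a` onto the spark queue (the cheap
    -- push described above); an idle capability may steal and
    -- evaluate it while we evaluate `b` here.
    parFib :: Int -> Int
    parFib n = a `par` (b `pseq` a + b)
      where
        a = fib n
        b = fib (n + 1)

    fib :: Int -> Int
    fib m | m < 2     = m
          | otherwise = fib (m - 1) + fib (m - 2)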
>>> 2. The defaults generally work better than giving huge heap sizes. Your
>>> -K100000000 - maximum heap size per thread - will either do nothing or
>>> cause an artificial slowdown (I have observed this with the minimum heap
>>> size option). Don't use it, unless experimentation proves it makes
>>> things better.
Yes, this is because cache use and locality become more important with
multicore execution. The default GC settings use a 512KB young
generation and we use locality-optimised parallel GC in 6.12.1.
>> On the contrary, the Haskell program dies without it:
>>
>> $ time ./mergesort +RTS -N8
>> Stack space overflow: current size 8388608 bytes.
Don't confuse stack size (-K) with heap size (-H or -M). The stack size
is just a limit and shouldn't have any effect on performance; the stack
itself grows dynamically.
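If you do hit the limit, raising it is harmless; for example:

    $ ./mergesort +RTS -N8 -K64m

(-H<size> is the flag that actually suggests an overall heap size.)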
>> It says 78.5% GC time (with GHC 6.10).
A clear sign that there's a problem. I haven't tried parallelising
merge sort recently, but it has been a tricky one to parallelise with
GHC in the past due to the amount of GC going on.
>>> 5. There seems to be a scalability problem with the parallel gc for
>>> larger numbers of cores (namely 8). I am guessing somewhat, but my
>>> experiments tend to confirm the issue raised in Simon Marlow's (the
>>> implementor of GHC parallelization) recent paper that it's to do with
>>> "stopping the world" for gc.
>>
>> Do you mean this bug:
>>
>> http://hackage.haskell.org/trac/ghc/ticket/3553
Probably, yes. I'm working on independent young-generation collection
at the moment which will fix that bug properly. For now, make sure your
cores are all available (nice and +RTS -qa are useful here), and on 8
core machines I usually drop back to -N7.
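For example:

    $ nice ./mergesort +RTS -N7 -qa

(-qa asks the RTS to use the OS's affinity facilities to try to pin
its OS threads to cores.)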
>>> If GHC's lack of perfection at this point in time makes Haskell "look
>>> bad" I don't mind. I am not selling anything, so the reader at least
>>> knows they're getting the truth. I see this as one of the great
>>> advantages of open source.
Quite, and I'm interested in looking at parallel programs that don't
speed up, particularly if the lack of speedup is due to a bottleneck in
the runtime or GC.
>> My original intent was to test a claim someone made: that mergesort in Haskell
>> is competitively performant and trivially parallelized.
I wouldn't claim that list sorting is trivially parallelized at this
stage. There are problems that are: often you can insert a parList or a
parBuffer and get a speedup, and lots of people have had good
experiences using concurrency to get parallel speedup.
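For instance, a sketch of the parBuffer case (the window size of 100
is an arbitrary choice, and expensive is a stand-in for real work):

    import Control.Parallel.Strategies (parBuffer, rdeepseq, using)

    -- Spark list elements ahead of the consumer, keeping a rolling
    -- window of 100 outstanding sparks.
    process :: [Int] -> [Int]
    process xs = map expensive xs `using` parBuffer 100 rdeepseq
      where
        expensive n = sum [1 .. n]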
The problem with list sorting is that you have to be careful to get the
forcing and sparking right, and even then you have to be careful that
the forcing itself hasn't added too much overhead (people often forget
to compare against the sequential version, not the parallel version
running on one CPU). I think I would try to rewrite it so that the
subtasks only return when their results are fully evaluated, to avoid
the extra forcing.
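A minimal sketch of that rewrite (my own, not code from this thread;
it assumes Control.DeepSeq.force, and the depth cutoff bounds the
number of sparks at 2^depth):

    import Control.DeepSeq (NFData, force)
    import Control.Parallel (par, pseq)

    -- Each parallel subtask returns a *fully evaluated* list, so the
    -- caller never forces the result again; below the depth cutoff
    -- we fall back to the sequential sort to keep granules big enough.
    parMergeSort :: (NFData a, Ord a) => Int -> [a] -> [a]
    parMergeSort _     []  = []
    parMergeSort _     [x] = [x]
    parMergeSort depth xs
      | depth <= 0 = mergeSort xs
      | otherwise  =
          let (as, bs) = halve xs
              as'      = force (parMergeSort (depth - 1) as)
              bs'      = force (parMergeSort (depth - 1) bs)
          in  as' `par` (bs' `pseq` merge as' bs')

    mergeSort :: Ord a => [a] -> [a]
    mergeSort []  = []
    mergeSort [x] = [x]
    mergeSort xs  = let (as, bs) = halve xs
                    in  merge (mergeSort as) (mergeSort bs)

    halve :: [a] -> ([a], [a])
    halve []       = ([], [])
    halve [x]      = ([x], [])
    halve (x:y:zs) = let (as, bs) = halve zs in (x : as, y : bs)

    merge :: Ord a => [a] -> [a] -> [a]
    merge []     ys = ys
    merge xs     [] = xs
    merge (x:xs) (y:ys)
      | x <= y    = x : merge xs (y : ys)
      | otherwise = y : merge (x : xs) ys

On 8 cores, parMergeSort 3 gives exactly 8 sequential granules; a
slightly larger depth leaves the work-stealing scheduler some room to
balance.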
I wonder if it would be possible to write some kind of generic "divide
and conquer" Strategy that would help for this kind of problem.
Cheers,
Simon