parallel garbage collection performance
jwlato at gmail.com
Tue Jun 26 02:02:46 CEST 2012
Thanks very much for this information. My observations match your
recommendations, insofar as I can test them.
On Mon, Jun 25, 2012 at 11:42 PM, Simon Marlow <marlowsd at gmail.com> wrote:
> On 19/06/12 02:32, John Lato wrote:
>> Thanks for the suggestions. I'll try them and report back. Although
>> I've since found that out of 3 not-identical systems, this problem
>> only occurs on one. So I may try different kernel/system libs and see
>> where that gets me.
>> -qg is funny. My interpretation from the results so far is that, when
>> the parallel collector doesn't get stalled, it results in a big win.
>> But when parGC does stall, it's slower than disabling parallel gc.
> Parallel GC is usually a win for idiomatic Haskell code; it may or may not
> be a good idea for things like Repa - I haven't done much analysis of those
> types of programs yet. Experiment with the -A flag, e.g. -A1m is often
> better than the default if your processor has a large cache.
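When experimenting with -A and -qg it helps to have one number to compare between runs. The following is a minimal sketch, not something from this thread: it reports the fraction of elapsed time spent in GC using GHC.Stats. It assumes a GHC recent enough to provide getRTSStats (8.2 or later; GHCs of the 7.4 era exposed a different function, getGCStats) and a program started with +RTS -T. The helper name gcShare and the stand-in workload are hypothetical.

```haskell
import Data.Int (Int64)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- | Fraction of elapsed time spent in GC, given GC and mutator
-- elapsed times in nanoseconds. Returns 0 when no time was recorded.
-- (gcShare is a hypothetical helper, not part of any library.)
gcShare :: Int64 -> Int64 -> Double
gcShare gcNs mutNs
  | total <= 0 = 0
  | otherwise  = fromIntegral gcNs / fromIntegral total
  where
    total = gcNs + mutNs

main :: IO ()
main = do
  -- Stand-in workload; replace with the real computation.
  print (length (show (product [1 .. 20000 :: Integer])))
  -- Statistics are only collected when the program runs with +RTS -T.
  enabled <- getRTSStatsEnabled
  if enabled
    then do
      s <- getRTSStats
      putStrLn ("GC share of elapsed time: "
                ++ show (gcShare (gc_elapsed_ns s) (mutator_elapsed_ns s)))
    else putStrLn "run with +RTS -T to enable RTS statistics"
```

Running the same binary with, say, +RTS -T -A1m and then with +RTS -T -A1m -qg and comparing the printed share gives a rough view of whether parallel GC helps for that workload. The summary printed by +RTS -s carries similar information without any code changes.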
> However, the parallel GC will be a problem if one or more of your cores is
> being used by other process(es) on the machine. In that case, the GC
> synchronisation will stall and performance will go down the drain. You can
> often see this on a ThreadScope profile as a big delay during GC while the
> other cores wait for the delayed core. Make sure your machine is quiet
> and/or use one fewer cores than the total available. It's not usually a
> good idea to use hyperthreaded cores either.
> I'm also seeing unpredictable performance on a 32-core AMD machine with
> NUMA. I'd avoid NUMA for Haskell for the time being if you can. Indeed you
> get unpredictable performance on this machine even for single-threaded code,
> because it makes a difference on which node the pages of your executable are
> cached (I heard a rumour that Linux has some kind of a fix for this in the
> pipeline, but I don't know the details).
>> I had thought the last core parallel slowdown problem was fixed a
>> while ago, but apparently not?
> We improved matters by inserting some "yield"s into the spinlock loops.
> This helped a lot, but the problem still exists.
>> On Tue, Jun 19, 2012 at 8:49 AM, Ben Lippmeier<benl at ouroborus.net> wrote:
>>> On 19/06/2012, at 24:48 , Tyson Whitehead wrote:
>>>> On June 18, 2012 04:20:51 John Lato wrote:
>>>>> Given this, can anyone suggest any likely causes of this issue, or
>>>>> anything I might want to look for? Also, should I be concerned about
>>>>> the much larger gc_alloc_block_sync level for the slow run? Does that
>>>>> indicate the allocator waiting to alloc a new block, or is it
>>>>> something else? Am I on completely the wrong track?
>>>> A total shot in the dark here, but wasn't there something about really
>>>> bad performance when you used all the CPUs on your machine under Linux?
>>>> Presumably very tight coupling that is causing all the threads to stall
>>>> every time the OS needs to do something or something?
>>> This can be a problem for data parallel computations (like in Repa). In
>>> Repa all threads in the gang are supposed to run for the same time, but if
>>> one gets swapped out by the OS then the whole gang is stalled.
>>> I tend to get best results using -N7 for an 8 core machine.
>>> It is also important to enable thread affinity (with the -qa flag).
>>> For a Repa program on an 8 core machine I use +RTS -N7 -qa -qg
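The "one fewer than the number of cores" rule can also be applied from inside the program rather than hard-coding -N7. This is only a sketch under some assumptions: it uses getNumProcessors and setNumCapabilities from GHC.Conc, it needs the program to be linked with -threaded to have any effect, and the helper name workerCount is hypothetical. Note that getNumProcessors counts logical processors, so on a hyperthreaded machine the result may still be higher than the number of physical cores.

```haskell
import GHC.Conc (getNumProcessors, setNumCapabilities)

-- | Capabilities to use: every processor but one, and never fewer than 1.
-- (workerCount is a hypothetical helper, not part of any library.)
workerCount :: Int -> Int
workerCount procs = max 1 (procs - 1)

main :: IO ()
main = do
  procs <- getNumProcessors
  let n = workerCount procs
  -- Only meaningful when the program is linked with -threaded.
  setNumCapabilities n
  putStrLn ("using " ++ show n ++ " of " ++ show procs ++ " processors")
```

Flags such as -qa and -qg still have to be given as RTS options on the command line; this sketch covers only the capability count.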