parallel garbage collection performance

John Lato jwlato at gmail.com
Tue Jun 26 02:02:46 CEST 2012


Thanks very much for this information.  My observations match your
recommendations, insofar as I can test them.

Cheers,
John

On Mon, Jun 25, 2012 at 11:42 PM, Simon Marlow <marlowsd at gmail.com> wrote:
> On 19/06/12 02:32, John Lato wrote:
>>
>> Thanks for the suggestions.  I'll try them and report back.  I've since
>> found that, out of three not-identical systems, this problem only occurs
>> on one, so I may try different kernels/system libs and see where that
>> gets me.
>>
>> -qg is funny.  My interpretation from the results so far is that, when
>> the parallel collector doesn't get stalled, it results in a big win.
>> But when parGC does stall, it's slower than disabling parallel gc
>> entirely.
>
>
> Parallel GC is usually a win for idiomatic Haskell code, but it may or may
> not be a good idea for things like Repa - I haven't done much analysis of
> those types of programs yet.  Experiment with the -A flag: e.g. -A1m is
> often better than the default if your processor has a large cache.
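>
> For instance (illustrative only, not numbers from any particular
> measurement in this thread):
>
>     $ ghc -O2 -threaded -rtsopts Main.hs
>     $ ./Main +RTS -N8 -A1m -s
>
> The -s summary reports the time spent in GC, which makes it easy to see
> whether a given -A value actually helps.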
>
> However, the parallel GC will be a problem if one or more of your cores is
> being used by other process(es) on the machine.  In that case, the GC
> synchronisation will stall and performance will go down the drain.  You can
> often see this on a ThreadScope profile as a big delay during GC while the
> other cores wait for the delayed core.  Make sure your machine is quiet
> and/or use one fewer core than the total available.  It's not usually a
> good idea to use hyperthreaded cores either.
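>
> Concretely, on an 8-core box that means running with something like
>
>     $ ./Main +RTS -N7 -s
>
> rather than -N8, so there is always a spare core left for whatever else
> the machine is doing.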
>
> I'm also seeing unpredictable performance on a 32-core AMD machine with
> NUMA.  I'd avoid NUMA for Haskell for the time being if you can.  Indeed you
> get unpredictable performance on this machine even for single-threaded code,
> because it makes a difference on which node the pages of your executable are
> cached (I heard a rumour that Linux has some kind of a fix for this in the
> pipeline, but I don't know the details).
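>
> (One general Linux workaround - I haven't verified it with GHC here, so
> treat it as an assumption - is to interleave the program's pages across
> the NUMA nodes with numactl, e.g.
>
>     $ numactl --interleave=all ./Main +RTS -N16
>
> which at least makes page placement consistent from run to run.)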
>
>
>> I had thought the last-core parallel slowdown problem was fixed a
>> while ago, but apparently not?
>
>
> We improved matters by inserting some "yield"s into the spinlock loops.
> This helped a lot, but the problem still exists.
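>
> The idea is essentially spin-then-yield.  Sketched very roughly in
> Haskell (the real loops live in the C RTS, so this is only an
> illustration of the technique, not the actual code):
>
>     import Control.Concurrent (yield)
>     import Data.IORef
>
>     -- Poll a flag a bounded number of times; if it still isn't set,
>     -- call 'yield' so a descheduled peer thread gets a chance to run.
>     spinThenYield :: IORef Bool -> IO ()
>     spinThenYield flag = go (0 :: Int)
>       where
>         go n = do
>           ready <- readIORef flag
>           if ready
>             then return ()
>             else if n < 1000
>                    then go (n + 1)
>                    else yield >> go 0
>
> Without the yield, a spinning capability can burn its whole time slice
> while the capability it is waiting for sits descheduled.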
>
> Cheers,
>        Simon
>
>
>
>> Thanks,
>> John
>>
>> On Tue, Jun 19, 2012 at 8:49 AM, Ben Lippmeier <benl at ouroborus.net> wrote:
>>>
>>>
>>> On 19/06/2012, at 24:48, Tyson Whitehead wrote:
>>>
>>>> On June 18, 2012 04:20:51 John Lato wrote:
>>>>>
>>>>> Given this, can anyone suggest any likely causes of this issue, or
>>>>> anything I might want to look for?  Also, should I be concerned about
>>>>> the much larger gc_alloc_block_sync level for the slow run?  Does that
>>>>> indicate the allocator waiting to alloc a new block, or is it
>>>>> something else?  Am I on completely the wrong track?
>>>>
>>>>
>>>> A total shot in the dark here, but wasn't there something about really
>>>> bad performance when you used all the CPUs on your machine under Linux?
>>>>
>>>> Presumably there's some very tight coupling that causes all the threads
>>>> to stall every time the OS needs to do something, or something along
>>>> those lines?
>>>
>>>
>>> This can be a problem for data parallel computations (like in Repa). In
>>> Repa all threads in the gang are supposed to run for the same time, but if
>>> one gets swapped out by the OS then the whole gang is stalled.
>>>
>>> I tend to get the best results using -N7 on an 8-core machine.
>>>
>>> It is also important to enable thread affinity (with the -qa flag).
>>>
>>> For a Repa program on an 8-core machine I use: +RTS -N7 -qa -qg
>>>
>>> Ben.
>>>
>>>
>>
>> _______________________________________________
>> Glasgow-haskell-users mailing list
>> Glasgow-haskell-users at haskell.org
>> http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
>
>
