[GHC] #9221: (super!) linear slowdown of parallel builds on 40 core machine
GHC
ghc-devs at haskell.org
Fri Aug 19 18:17:10 UTC 2016
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by slyfox):
I've experimented a bit more, trying to pin down where the slowdown comes
from.
Some observations:
Observation 1. -j <K> not only allows <K> modules to be compiled at the
same time, but also enables:
- <K> Capabilities
- and <K> garbage collection threads
I've locally removed the Capability adjustment from -j handling
and used -j <K> +RTS -N. With that change performance does not get
as bad with increasing K. That makes sense, since GC OS threads then
don't fight over the same cache.
It would be nice if '''+RTS -N''' took precedence over the -j option.
Observation 2. [Warning: I have no idea how parallel GC works].
The more GC threads we have, the higher the chance that one of
the GC threads finishes scanning its part of the heap and then sits
in a sched_yield() loop on a free core while the main GC thread waits
for the other threads, which are still doing useful work, to complete.
I found this out by changing yieldThread() to print its caller.
The vast majority of calls come from any_work():
{{{
static rtsBool
any_work (void)
{
    int g;
    gen_workspace *ws;

    gct->any_work++;

    write_barrier();

    // scavenge objects in compacted generation
    if (mark_stack_bd != NULL && !mark_stack_empty()) {
        return rtsTrue;
    }

    // Check for global work in any gen. We don't need to check for
    // local work, because we have already exited scavenge_loop(),
    // which means there is no local work for this thread.
    for (g = 0; g < (int)RtsFlags.GcFlags.generations; g++) {
        ws = &gct->gens[g];
        if (ws->todo_large_objects) return rtsTrue;
        if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue;
        if (ws->todo_overflow) return rtsTrue;
    }

#if defined(THREADED_RTS)
    if (work_stealing) {
        uint32_t n;
        // look for work to steal
        for (n = 0; n < n_gc_threads; n++) {
            if (n == gct->thread_index) continue;
            for (g = RtsFlags.GcFlags.generations-1; g >= 0; g--) {
                ws = &gc_threads[n]->gens[g];
                if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue;
            }
        }
    }
#endif

    gct->no_work++;
#if defined(THREADED_RTS)
    yieldThread("any_work");
#endif

    return rtsFalse;
}
}}}
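The idle-spinning pattern from Observation 2 can be reproduced outside the RTS with a toy program. What follows is only a sketch under assumptions, not GHC code: worker_t, worker_main, run_workers and MAX_WORKERS are hypothetical names, the "scanning" is a plain counting loop, and there is no work stealing. Each thread does its share of work, then polls a shared counter and calls sched_yield() until every thread has finished, counting how often it yielded.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_WORKERS 64  /* arbitrary limit for this sketch */

/* Hypothetical stand-in for a GC thread; not an RTS structure. */
typedef struct {
    unsigned long work_units;   /* amount of simulated scanning */
    unsigned long yields;       /* sched_yield() calls made while idle */
    atomic_int *still_working;  /* shared count of unfinished threads */
} worker_t;

static void *worker_main(void *arg)
{
    worker_t *w = arg;
    volatile unsigned long sink = 0;

    /* Simulated scanning of this thread's part of the heap. */
    for (unsigned long i = 0; i < w->work_units; i++) sink += i;

    /* This thread is done with its own work. */
    atomic_fetch_sub(w->still_working, 1);

    /* Nothing left to do: yield in a loop until all threads are done. */
    while (atomic_load(w->still_working) > 0) {
        w->yields++;
        sched_yield();
    }
    return NULL;
}

/* Runs n workers, storing each worker's yield count in yields_out[i].
 * Returns 0 on success, -1 if a thread could not be created. */
static int run_workers(int n, const unsigned long *work,
                       unsigned long *yields_out)
{
    pthread_t tids[MAX_WORKERS];
    worker_t ws[MAX_WORKERS];
    atomic_int still_working;
    int started = 0, status = 0;

    assert(n > 0 && n <= MAX_WORKERS);
    atomic_init(&still_working, n);

    for (int i = 0; i < n; i++) {
        ws[i] = (worker_t){ work[i], 0, &still_working };
        if (pthread_create(&tids[i], NULL, worker_main, &ws[i]) != 0) {
            /* Account for threads that never started so that the
             * running ones can leave their yield loops. */
            atomic_fetch_sub(&still_working, n - i);
            status = -1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        yields_out[i] = ws[i].yields;
    }
    return status;
}
```

Calling run_workers() with one large work_units value and several small ones (build with -pthread) should leave the short-lived workers with large yield counts, while the worker that finishes last never enters the loop. The actual numbers depend on the scheduler and on how many cores are free, so they are only a rough illustration of the effect.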
I need to dig more into how parallel GC traverses the heap to understand
how much of a problem this is.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9221#comment:53>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler