[GHC] #9221: (super!) linear slowdown of parallel builds on 40 core machine
GHC
ghc-devs at haskell.org
Tue Aug 23 21:47:21 UTC 2016
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by slyfox):
Replying to [comment:54 simonmar]:
> @slyfox the GC threads all spin in this `any_work()` loop looking for
work they can steal from other threads. It's how the GC work gets
distributed amongst the available worker threads.
Aha, thanks! I've read through
http://community.haskell.org/~simonmar/papers/parallel-gc.pdf
The paper does not look outdated.
The crucial detail: work stealing is enabled by default for gen=1 and
upper.
As default nursery is tiny we don't do stealing from it. That's why I see
poor
GC parallelism on large nurseries for this compilation workload:
[ The numbers below are for 4-CPU system, not 8-CPU i used to post before.
I'll redo tests in a few days
once I get access to it. ]
-qb1/-qb0 comparison:
-qb1 (default):
{{{
$ time ./mk.bash -j4 +RTS -N4 -RTS +RTS -sstderr -A768M -RTS -O0 -j4 +RTS
-l
55,162,983,632 bytes allocated in the heap
639,999,688 bytes copied during GC
88,807,552 bytes maximum residency (2 sample(s))
4,100,072 bytes maximum slop
3379 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 20 colls, 20 par 5.617s 1.582s 0.0791s
0.1410s
Gen 1 2 colls, 1 par 0.227s 0.081s 0.0405s
0.0500s
Parallel GC work balance: 22.34% (serial 0%, perfect 100%)
TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.027s ( 0.026s elapsed)
MUT time 59.245s ( 18.963s elapsed)
GC time 5.843s ( 1.663s elapsed)
EXIT time 0.032s ( 0.034s elapsed)
Total time 65.174s ( 20.686s elapsed)
Alloc rate 931,102,797 bytes per MUT second
Productivity 91.0% of total user, 91.8% of total elapsed
gc_alloc_block_sync: spin=10830; yield=5
whitehole_spin: 0
gen[0].sync: spin=6297; yield=5
gen[1].sync: spin=4240; yield=3
real 0m20.863s
user 1m5.752s
sys 0m4.724s
}}}
-qb0:
{{{
55,154,416,472 bytes allocated in the heap
840,077,056 bytes copied during GC
135,018,976 bytes maximum residency (3 sample(s))
5,739,552 bytes maximum slop
2375 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 31 colls, 31 par 3.078s 0.859s 0.0277s
0.0682s
Gen 1 3 colls, 2 par 0.607s 0.177s 0.0591s
0.0947s
Parallel GC work balance: 74.48% (serial 0%, perfect 100%)
TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.023s ( 0.023s elapsed)
MUT time 59.476s ( 18.325s elapsed)
GC time 3.685s ( 1.036s elapsed)
EXIT time 0.027s ( 0.028s elapsed)
Total time 63.244s ( 19.412s elapsed)
Alloc rate 927,345,481 bytes per MUT second
Productivity 94.1% of total user, 94.5% of total elapsed
gc_alloc_block_sync: spin=58847; yield=38
whitehole_spin: 0
gen[0].sync: spin=126024; yield=104
gen[1].sync: spin=34648; yield=31
real 0m19.575s
user 1m6.804s
sys 0m1.760s
}}}
Note how GC parallelism increases from ~20% to ~75% for this huge nursery
of 768M.
Something hinders mutator's parallelism still. threadscope says threads
are usually
blocked on MVars for long periods of time (order of 0.1-0.2 seconds).
Perhaps those wait for external gcc/gas to assemble modules.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9221#comment:55>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list