[GHC] #9221: (super!) linear slowdown of parallel builds on 40 core machine
GHC
ghc-devs at haskell.org
Tue Aug 30 21:42:14 UTC 2016
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Compile-time | Test Case:
performance bug |
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by carter):
Replying to [comment:61 slyfox]:
> [warning: not a NUMA expert]
>
> TL;DR:
>
> I think it depends on what exactly we hit as a bottleneck.
>
> I have a suspicion we saturate RAM bandwidth, not the CPUs' ability
> to retire instructions across hyperthreads. Basically GHC does too
> many non-local references, and one way to speed GHC up is to either
> increase memory locality or decrease heap usage.
That's exactly why I'm wondering if hyperthreading is messing with us!
Each pair of hyperthreads shares its physical core's L1 and L2 caches,
so if we're memory limited that might be triggering a higher rate of
cache thrash. Also, in some cases when the number of capabilities is
below the number of cores, I think we needlessly pin two capabilities
to the same physical core. I need to dig up those references and
revisit that though :)
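One quick way to probe that on Linux (a sketch; the CPU numbering varies
by machine, and I'm assuming '''synth.bash''' forwards '''-j4''' to
'''ghc --make'''):
{{{
# Show which logical CPUs are hyperthread siblings of each other.
lscpu --extended=CPU,CORE,SOCKET

# Restrict the whole build to one logical CPU per physical core
# (here CPUs 0-3, assuming siblings are paired as 0/4, 1/5, ...).
# GHC's RTS also has +RTS -qa to set thread affinity itself.
taskset -c 0-3 ./synth.bash -j4 +RTS -A256M -qb0 -RTS
}}}
If '''-j4''' pinned to four distinct physical cores beats a plain
'''-j4''' run (which may land two capabilities on one core), that would
confirm the pinning suspicion.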
>
> Long version:
>
> For a while I tried to figure out why exactly I don't see
> perfect scaling of '''ghc --make''' on my box.
>
> It's easy to see/compare with the '''synth.bash +RTS -A256M -RTS'''
> benchmark run with the '''-j1''' / '''-j''' options.
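>
> A sketch of that sweep (assuming '''synth.bash''' passes its first
> argument through to '''ghc --make'''):
> {{{
> # Time the benchmark across a range of parallelism levels.
> for j in 1 2 4 6 8 10; do
>     echo "== -j$j =="
>     time ./synth.bash -j$j +RTS -A256M -qb0 -RTS
> done
> }}}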
>
> I don't have hard evidence, but I suspect the bottleneck is not the
> hyperthread/real-core execution engines but the RAM bandwidth limit
> on the CPU-to-memory path. One of the hints is '''perf stat''':
>
> {{{
> $ perf stat -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations,mem-loads,mem-stores ./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS
>
>  Performance counter stats for './synth.bash -j +RTS -sstderr -A256M -qb0 -RTS':
>
>      3 248 577 545      cache-references                                   (28,64%)
>        740 590 736      cache-misses      # 22,797 % of all cache refs     (42,93%)
>    390 025 361 812      cycles                                             (57,18%)
>    171 496 925 132      instructions      # 0,44  insn per cycle           (71,45%)
>     33 736 976 296      branches                                           (71,47%)
>          1 061 039      faults
>              1 524      migrations
>             67 895      mem-loads                                          (71,42%)
>     27 652 025 890      mem-stores                                         (14,27%)
>
>       15,131608490 seconds time elapsed
> }}}
>
> 22% of all cache refs are misses. A huge number. I think it dominates
> performance (assuming memory access is ~100 times slower than CPU
> cache access), but I have no hard evidence :)
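>
> A rough back-of-envelope in support (my assumptions, not
> measurements: ~100ns per miss served from RAM, a ~3GHz clock):
> {{{
>   740,590,736 misses  x ~100 ns/miss  ~=  74 s aggregate stall time
>   390,025,361,812 cycles / ~3 GHz     ~= 130 s aggregate CPU time
> }}}
> i.e. on the order of half of all CPU time plausibly spent waiting on
> memory, which lines up with the 0,44 insn/cycle figure above.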
>
> I have 4 cores with 2 hyperthreads each and get the best performance
> from -j8, not the -j4 one would expect if hyperthread instruction
> retirement were the limit:
>
> -j1: 55s; -j4: 18s; -j6: 15s; -j8: 14.2s; -j10: 15.0s
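>
> (For scale: that's a 55/18 ~= 3.1x speedup at -j4 on 4 real cores,
> but only 55/14.2 ~= 3.9x at -j8; the 4 extra hyperthreads buy ~25%,
> which is more consistent with a memory-bound workload than a
> retirement-bound one.)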
>
> {{{
> ./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS
>
>   66,769,724,456 bytes allocated in the heap
>    1,658,350,288 bytes copied during GC
>      127,385,728 bytes maximum residency (5 sample(s))
>        1,722,080 bytes maximum slop
>             2389 MB total memory in use (0 MB lost due to fragmentation)
>
>                                      Tot time (elapsed)  Avg pause  Max pause
>   Gen  0        31 colls,    31 par    6.535s   0.831s     0.0268s    0.0579s
>   Gen  1         5 colls,     4 par    1.677s   0.225s     0.0449s    0.0687s
>
>   Parallel GC work balance: 80.03% (serial 0%, perfect 100%)
>
>   TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8)
>
>   SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
>
>   INIT    time    0.002s  (  0.002s elapsed)
>   MUT     time   87.599s  ( 12.868s elapsed)
>   GC      time    8.212s  (  1.056s elapsed)
>   EXIT    time    0.013s  (  0.015s elapsed)
>   Total   time   95.841s  ( 13.942s elapsed)
>
>   Alloc rate    762,222,437 bytes per MUT second
>
>   Productivity  91.4% of total user, 92.4% of total elapsed
>
> gc_alloc_block_sync: 83395
> whitehole_spin: 0
> gen[0].sync: 280927
> gen[1].sync: 134537
>
> real    0m14.070s
> user    1m44.835s
> sys     0m2.899s
> }}}
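>
> Another way to read those numbers (again a back-of-envelope, not a
> measurement): 66,769,724,456 bytes allocated over 12.868s of elapsed
> MUT time is ~5.2GB/s of allocation traffic alone, before counting GC
> copying and ordinary reads, which is already a sizeable slice of
> typical DRAM bandwidth.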
>
> I've noticed that building GHC with '''-fno-worker-wrapper
> -fno-spec-constr''' makes GHC 4% faster (-j8) (memory allocation is
> 7% lower; bug #11565 is likely at fault), which also hints at memory
> throughput as the bottleneck.
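>
> (One way to apply those flags when building GHC itself; a sketch
> against the make-based build system, where '''GhcStage2HcOpts'''
> holds the flags used to compile the stage-2 compiler:)
> {{{
> # mk/build.mk: extra flags for compiling the stage-2 compiler.
> GhcStage2HcOpts += -fno-worker-wrapper -fno-spec-constr
> }}}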
>
> The conclusion:
>
> AFAIU, to make the most of GHC we should aim for just enough active
> threads to saturate all the memory I/O channels the machine has (but
> not many more).
>
> '''perf bench mem all''' suggests RAM bandwidth is in the range of
> 2-32GB/s depending on how unfriendly the access pattern is. I would
> assume GHC's workload is very non-linear (and thus on the bad end).
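>
> (That range comes straight from running perf's memory
> micro-benchmarks, which report throughput per memcpy/memset routine:)
> {{{
> perf bench mem all
> }}}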
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9221#comment:63>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler