[GHC] #9221: (super!) linear slowdown of parallel builds on 40 core machine
GHC
ghc-devs at haskell.org
Wed Aug 31 15:44:05 UTC 2016
#9221: (super!) linear slowdown of parallel builds on 40 core machine
-------------------------------------+-------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #910, #8224 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by slyfox):
Used the following GNUMakefile for '''./synth.bash''' to compare separate
processes:
{{{
OBJECTS := $(patsubst %.hs,%.o,$(wildcard src/*.hs))

all: $(OBJECTS)

src/%.o: src/%.hs
	~/dev/git/ghc-perf/inplace/bin/ghc-stage2 -c +RTS -A256M -RTS $< -o $@

clean:
	$(RM) $(OBJECTS)

.PHONY: clean
}}}
CPU topology:
{{{
$ lstopo-no-graphics
Machine (30GB)
  Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#4)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#5)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#6)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#7)
$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31122 MB
node 0 free: 28003 MB
node distances:
node 0
0: 10
}}}
Separate processes:
{{{
$ make clean; time make -j1
real 1m2.561s
user 0m56.523s
sys 0m5.560s
$ make clean; time taskset --cpu-list 0-3 make -j4
real 0m18.756s
user 1m7.758s
sys 0m6.460s
$ make clean; time make -j4
real 0m18.936s
user 1m7.549s
sys 0m6.857s
$ make clean; time make -j6
real 0m17.365s
user 1m32.107s
sys 0m9.155s
$ make clean; time make -j8
real 0m15.964s
user 1m52.058s
sys 0m9.929s
}}}
The speedup compared to -j1 approaches 4x (about 3.9x at -j8), but only
with -j higher than 4; at -j4 it is about 3.3x. Pinning the build to
CPUs 0-3 with taskset makes -j4 marginally faster (18.76s vs 18.94s).
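
The speedup figures can be rechecked from the "real" times above. The
following is a small sketch, not part of the original measurements: the
dictionary and function names are made up for illustration, and the count
of 4 physical cores is read off the lstopo output (4 cores with 2 PUs
each).

```python
# Wall-clock ("real") times in seconds, copied from the make runs above.
MAKE_REAL = {
    "-j1": 62.561,
    "-j4 (taskset 0-3)": 18.756,
    "-j4": 18.936,
    "-j6": 17.365,
    "-j8": 15.964,
}

# Assumption taken from the lstopo output: 4 cores, 2 hardware threads each.
PHYSICAL_CORES = 4


def speedups(real_times, baseline="-j1"):
    """Return {label: speedup of that run over the baseline run}."""
    base = real_times[baseline]
    return {label: base / t for label, t in real_times.items()}


if __name__ == "__main__":
    for label, s in speedups(MAKE_REAL).items():
        print(f"{label:>18}: {s:.2f}x speedup, "
              f"{s / PHYSICAL_CORES:.0%} of ideal on {PHYSICAL_CORES} cores")
```

Dividing by the number of physical cores rather than by the -j value keeps
the -j6 and -j8 rows comparable with -j4, since the extra jobs run on
hardware threads that share a core.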
{{{
$ ./synth.bash -j1 +RTS -sstderr -A256M -qb0 -RTS
real 0m51.702s
user 0m50.840s
sys 0m0.844s
$ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -RTS
real 0m17.526s
user 1m6.978s
sys 0m1.412s
$ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -qa -RTS
real 0m17.007s
user 1m4.867s
sys 0m1.508s
$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -RTS
real 0m13.829s
user 1m44.295s
sys 0m2.669s
$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -qa -RTS
real 0m14.597s
user 1m43.145s
sys 0m3.285s
}}}
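The speedup compared to -j1 is around 3.5x (3.7x at -j8 without -qa), and
again it is only reached with -j higher than 4; at -j4 it is about 3x.
According to the timings above, CPU affinity (-qa) makes things slightly
better on -j4 (17.0s vs 17.5s) but worse on -j8 (14.6s vs 13.8s).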
In absolute times '''ghc --make -j''' is slightly better than separate
processes due to less startup(?) overhead. But something else slowly
creeps up, and we don't see a 4x factor.
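
One way to look at what "creeps up" is to compare total user CPU time
across -j levels, next to the wall-clock advantage of '''ghc --make -j'''.
The sketch below only rearranges the numbers reported above (runs with
-qa and taskset are left out); the names are hypothetical and it makes no
claim about the cause. Part of the growth at -j8 is to be expected, since
two hardware threads then share each core.

```python
# (real, user) times in seconds, copied from the runs above, keyed by -j level.
SEPARATE = {1: (62.561, 56.523), 4: (18.936, 67.549), 8: (15.964, 112.058)}
GHC_MAKE = {1: (51.702, 50.840), 4: (17.526, 66.978), 8: (13.829, 104.295)}


def user_growth(runs):
    """Total user CPU time at each -j level, relative to the -j1 run."""
    base_user = runs[1][1]
    return {j: user / base_user for j, (_real, user) in runs.items()}


def wall_ratio(separate, ghc_make):
    """How many times longer separate processes take than ghc --make -j."""
    return {j: separate[j][0] / ghc_make[j][0]
            for j in separate if j in ghc_make}


if __name__ == "__main__":
    for name, runs in (("separate", SEPARATE), ("ghc --make", GHC_MAKE)):
        for j, growth in user_growth(runs).items():
            print(f"{name:>10} -j{j}: user time x{growth:.2f} vs -j1")
    for j, ratio in wall_ratio(SEPARATE, GHC_MAKE).items():
        print(f"-j{j}: separate processes take x{ratio:.2f} the wall time")
```

If the same amount of work were simply spread over more cores, the user
time factor would stay near 1; any value above that is CPU time spent on
something other than the -j1 workload.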
It's more visible on a 24-core VM; I will post those results in a few
minutes.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9221#comment:65>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler