parallel garbage collection performance
John Lato
jwlato at gmail.com
Mon Jun 18 10:20:51 CEST 2012
Hello,
I have a program that is intermittently experiencing performance
issues that I believe are related to parallel GC, and I was hoping to
get some advice on how I might improve it. Essentially, any given
execution is either slow or fast (the same executable, without
recompiling), most often slow. So far I can't find anything that
would trigger either case. This is with ghc-7.4.2 on 64bit linux.
Here are the statistics from running with -N4 -A8m -s:
slow run:
16,647,460,328 bytes allocated in the heap
313,767,248 bytes copied during GC
17,305,120 bytes maximum residency (22 sample(s))
601,952 bytes maximum slop
73 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1268 colls, 1267 par 8.62s 8.00s 0.0063s 0.0389s
Gen 1 22 colls, 22 par 0.63s 0.60s 0.0275s 0.0603s
Parallel GC work balance: 1.53 (39176141 / 25609887, ideal 4)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.00s ( 0.01s) 0.00s ( 0.00s)
Task 1 (worker) : 0.00s ( 13.66s) 0.01s ( 0.04s)
Task 2 (bound) : 0.00s ( 13.98s) 0.00s ( 0.00s)
Task 3 (worker) : 0.00s ( 18.14s) 0.16s ( 0.44s)
Task 4 (worker) : 0.53s ( 17.49s) 1.29s ( 4.25s)
Task 5 (worker) : 0.00s ( 17.45s) 1.25s ( 4.42s)
Task 6 (worker) : 0.00s ( 14.98s) 1.75s ( 6.90s)
Task 7 (worker) : 0.00s ( 21.87s) 0.02s ( 0.06s)
Task 8 (worker) : 0.01s ( 37.12s) 0.06s ( 0.17s)
Task 9 (worker) : 0.00s ( 21.41s) 4.88s ( 15.99s)
Task 10 (worker) : 0.84s ( 43.06s) 1.99s ( 8.25s)
Task 11 (bound) : 6.39s ( 51.13s) 0.06s ( 0.18s)
Task 12 (worker) : 0.00s ( 0.00s) 8.04s ( 21.42s)
Task 13 (worker) : 0.43s ( 28.38s) 8.14s ( 22.94s)
Task 14 (worker) : 5.35s ( 29.30s) 5.81s ( 22.02s)
SPARKS: 7 (7 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.03s ( 0.01s elapsed)
MUT time 43.88s ( 42.71s elapsed)
GC time 9.26s ( 8.60s elapsed)
EXIT time 0.01s ( 0.01s elapsed)
Total time 53.65s ( 51.34s elapsed)
Alloc rate 374,966,825 bytes per MUT second
Productivity 82.7% of total user, 86.4% of total elapsed
gc_alloc_block_sync: 1388000
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
-- --------------------------
Fast run:
42,061,441,560 bytes allocated in the heap
725,062,720 bytes copied during GC
36,963,480 bytes maximum residency (21 sample(s))
1,382,536 bytes maximum slop
141 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 3206 colls, 3205 par 8.34s 1.87s 0.0006s 0.0089s
Gen 1 21 colls, 21 par 0.76s 0.17s 0.0081s 0.0275s
Parallel GC work balance: 1.78 (90535973 / 50955059, ideal 4)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s)
Task 1 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s)
Task 2 (worker) : 0.00s ( 11.50s) 0.00s ( 0.00s)
Task 3 (worker) : 0.00s ( 12.40s) 0.00s ( 0.00s)
Task 4 (worker) : 0.58s ( 12.20s) 0.59s ( 0.61s)
Task 5 (bound) : 0.00s ( 12.89s) 0.00s ( 0.00s)
Task 6 (worker) : 0.00s ( 13.40s) 0.02s ( 0.02s)
Task 7 (worker) : 0.00s ( 14.66s) 0.00s ( 0.00s)
Task 8 (worker) : 0.95s ( 14.18s) 0.69s ( 0.76s)
Task 9 (worker) : 2.82s ( 13.50s) 1.37s ( 1.44s)
Task 10 (worker) : 1.72s ( 17.59s) 1.07s ( 1.16s)
Task 11 (worker) : 3.99s ( 24.68s) 0.37s ( 0.38s)
Task 12 (worker) : 1.24s ( 24.25s) 0.80s ( 0.82s)
Task 13 (bound) : 6.18s ( 25.02s) 0.04s ( 0.04s)
Task 14 (worker) : 1.46s ( 23.42s) 1.59s ( 1.65s)
Task 15 (worker) : 0.00s ( 0.00s) 0.66s ( 0.66s)
Task 16 (worker) : 11.00s ( 23.36s) 1.67s ( 1.70s)
SPARKS: 28 (28 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.04s ( 0.02s elapsed)
MUT time 42.08s ( 23.02s elapsed)
GC time 9.10s ( 2.04s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 51.69s ( 25.09s elapsed)
Alloc rate 987,695,300 bytes per MUT second
Productivity 82.3% of total user, 169.6% of total elapsed
gc_alloc_block_sync: 164572
whitehole_spin: 0
gen[0].sync: 164
gen[1].sync: 18147
When I record an eventlog and view it with Threadscope, the slow run
shows long frequent pauses for GC, whereas on a fast run GC is
extremely fast.
Running with the parallel collector disabled (-qg) is more-or-less
consistently between these two runs.
Given this, can anyone suggest any likely causes of this issue, or
anything I might want to look for? Also, should I be concerned about
the much larger gc_alloc_block_sync level for the slow run? Does that
indicate the allocator waiting to alloc a new block, or is it
something else? Am I on completely the wrong track?
Thanks very much,
John L.
More information about the Glasgow-haskell-users
mailing list