parallel garbage collection performance

Mon Jun 18 10:20:51 CEST 2012

Hello,

I have a program that is intermittently experiencing performance
issues that I believe are related to parallel GC, and I was hoping to
get some advice on how I might improve it.  Essentially, any given
execution is either slow or fast (the same executable, without
recompiling), most often slow.  So far I can't find anything that
would trigger either case.  This is with ghc-7.4.2 on 64bit linux.
Here are the statistics from running with -N4 -A8m -s:

slow run:

  16,647,460,328 bytes allocated in the heap
     313,767,248 bytes copied during GC
      17,305,120 bytes maximum residency (22 sample(s))
         601,952 bytes maximum slop
              73 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1268 colls,  1267 par    8.62s    8.00s     0.0063s    0.0389s
  Gen  1        22 colls,    22 par    0.63s    0.60s     0.0275s    0.0603s

  Parallel GC work balance: 1.53 (39176141 / 25609887, ideal 4)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.01s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    ( 13.66s)       0.01s    (  0.04s)
  Task  2 (bound)  :    0.00s    ( 13.98s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.00s    ( 18.14s)       0.16s    (  0.44s)
  Task  4 (worker) :    0.53s    ( 17.49s)       1.29s    (  4.25s)
  Task  5 (worker) :    0.00s    ( 17.45s)       1.25s    (  4.42s)
  Task  6 (worker) :    0.00s    ( 14.98s)       1.75s    (  6.90s)
  Task  7 (worker) :    0.00s    ( 21.87s)       0.02s    (  0.06s)
  Task  8 (worker) :    0.01s    ( 37.12s)       0.06s    (  0.17s)
  Task  9 (worker) :    0.00s    ( 21.41s)       4.88s    ( 15.99s)
  Task 10 (worker) :    0.84s    ( 43.06s)       1.99s    (  8.25s)
  Task 11 (bound)  :    6.39s    ( 51.13s)       0.06s    (  0.18s)
  Task 12 (worker) :    0.00s    (  0.00s)       8.04s    ( 21.42s)
  Task 13 (worker) :    0.43s    ( 28.38s)       8.14s    ( 22.94s)
  Task 14 (worker) :    5.35s    ( 29.30s)       5.81s    ( 22.02s)

  SPARKS: 7 (7 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.03s  (  0.01s elapsed)
  MUT     time   43.88s  ( 42.71s elapsed)
  GC      time    9.26s  (  8.60s elapsed)
  EXIT    time    0.01s  (  0.01s elapsed)
  Total   time   53.65s  ( 51.34s elapsed)

  Alloc rate    374,966,825 bytes per MUT second

  Productivity  82.7% of total user, 86.4% of total elapsed

gc_alloc_block_sync: 1388000
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

-- --------------------------
Fast run:

  42,061,441,560 bytes allocated in the heap
     725,062,720 bytes copied during GC
      36,963,480 bytes maximum residency (21 sample(s))
       1,382,536 bytes maximum slop
             141 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3206 colls,  3205 par    8.34s    1.87s     0.0006s    0.0089s
  Gen  1        21 colls,    21 par    0.76s    0.17s     0.0081s    0.0275s

  Parallel GC work balance: 1.78 (90535973 / 50955059, ideal 4)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  2 (worker) :    0.00s    ( 11.50s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.00s    ( 12.40s)       0.00s    (  0.00s)
  Task  4 (worker) :    0.58s    ( 12.20s)       0.59s    (  0.61s)
  Task  5 (bound)  :    0.00s    ( 12.89s)       0.00s    (  0.00s)
  Task  6 (worker) :    0.00s    ( 13.40s)       0.02s    (  0.02s)
  Task  7 (worker) :    0.00s    ( 14.66s)       0.00s    (  0.00s)
  Task  8 (worker) :    0.95s    ( 14.18s)       0.69s    (  0.76s)
  Task  9 (worker) :    2.82s    ( 13.50s)       1.37s    (  1.44s)
  Task 10 (worker) :    1.72s    ( 17.59s)       1.07s    (  1.16s)
  Task 11 (worker) :    3.99s    ( 24.68s)       0.37s    (  0.38s)
  Task 12 (worker) :    1.24s    ( 24.25s)       0.80s    (  0.82s)
  Task 13 (bound)  :    6.18s    ( 25.02s)       0.04s    (  0.04s)
  Task 14 (worker) :    1.46s    ( 23.42s)       1.59s    (  1.65s)
  Task 15 (worker) :    0.00s    (  0.00s)       0.66s    (  0.66s)
  Task 16 (worker) :   11.00s    ( 23.36s)       1.67s    (  1.70s)

  SPARKS: 28 (28 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.04s  (  0.02s elapsed)
  MUT     time   42.08s  ( 23.02s elapsed)
  GC      time    9.10s  (  2.04s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   51.69s  ( 25.09s elapsed)

  Alloc rate    987,695,300 bytes per MUT second

  Productivity  82.3% of total user, 169.6% of total elapsed

gc_alloc_block_sync: 164572
whitehole_spin: 0
gen[0].sync: 164
gen[1].sync: 18147

When I record an eventlog and view it with Threadscope, the slow run
shows long frequent pauses for GC, whereas on a fast run GC is
extremely fast.

Running with the parallel collector disabled (-qg) is more-or-less
consistently between these two runs.

Given this, can anyone suggest any likely causes of this issue, or
anything I might want to look for?  Also, should I be concerned about
the much larger gc_alloc_block_sync level for the slow run?  Does that
indicate the allocator waiting to alloc a new block, or is it
something else?  Am I on completely the wrong track?

Thanks very much,
John L.