[Haskell-cafe] SMP parallelism increasing GC time dramatically

Wed Oct 5 22:36:13 CEST 2011

Hi Tom,

I think debugging this sort of problem is exactly what we need to be doing
(and making easier).  Have you tried Duncan's newest version of Threadscope
by the way?

It looks like -- completely aside from the GC time -- this program is not
scaling.  The mutator time itself, disregarding GC, isn't going down much
with parallelism (with the total mutator time increasing drastically).
 Either this is completely memory bottlenecked or there is some other kind
of bad interaction (e.g. false sharing, contention on a hot lock, etc).

My inclination would be to figure this out first before worrying about the
GC time.  Is this code that you would be able to share for debugging?

I think we need to get together some general documentation on how to debug
this kind of problem.  For example, you can get some hints as to the memory
behavior by running valgrind/cachegrind on the program.  Also, what does
"top" say, by the way?  Is the process using 1200% CPU?

Cheers,
  -Ryan

On Wed, Oct 5, 2011 at 2:15 PM, Tom Thorne <thomas.thorne21 at gmail.com>wrote:

> I am having some strange performance issues when using SMP parallelism,
> that I think may be something to do with GC. Apologies for the large
> readouts below but I'm not familiar enough to know what is and isn't
> relevant!
>
> I have a pure function that is mapped over a list of around 10 values, and
> this happens several times for each iteration of my program. It does some
> fairly intensive calculations using hmatrix, generating intermediate
> matrices along the way. The computation is significantly more complex for
> some values, so the work done by each call is not spread equally. I did some
> profiling and it seems like the program is spending about 50% of its time in
> that function. First of all, without any attempts at parallelism, I see this
> from ./Main +RTS -s
>
>   67,142,126,336 bytes allocated in the heap
>      147,759,264 bytes copied during GC
>          109,384 bytes maximum residency (58 sample(s))
>          354,408 bytes maximum slop
>                3 MB total memory in use (0 MB lost due to fragmentation)
>
>   Generation 0: 104551 collections,     0 parallel,  1.13s,  1.11s elapsed
>   Generation 1:    58 collections,     0 parallel,  0.01s,  0.01s elapsed
>
>   Parallel GC work balance: -nan (0 / 0, ideal 1)
>
>                         MUT time (elapsed)       GC time  (elapsed)
>   Task  0 (worker) :    0.00s    ( 67.06s)       0.00s    (  0.00s)
>   Task  1 (worker) :    0.00s    ( 67.09s)       0.00s    (  0.00s)
>   Task  2 (bound)  :   66.95s    ( 67.09s)       1.14s    (  1.12s)
>
>   SPARKS: 0 (0 converted, 0 pruned)
>
>   INIT  time    0.00s  (  0.00s elapsed)
>   MUT   time   66.95s  ( 67.09s elapsed)
>   GC    time    1.14s  (  1.12s elapsed)
>   EXIT  time    0.00s  (  0.00s elapsed)
>   Total time   68.09s  ( 68.21s elapsed)
>
>   %GC time       1.7%  (1.6% elapsed)
>
>   Alloc rate    1,002,835,517 bytes per MUT second
>
>   Productivity  98.3% of total user, 98.2% of total elapsed
>
> gc_alloc_block_sync: 0
> whitehole_spin: 0
> gen[0].sync_large_objects: 0
> gen[1].sync_large_objects: 0
>
> This looks ok to me...
>
> Then if I try to use Control.Parallel to parallelise my code, simpy
> replacing a map with parMap (rdeepseq), on a 12 core machine using +RTS -N12
> -s I get this:
>
>   66,065,148,144 bytes allocated in the heap
>      197,202,056 bytes copied during GC
>          181,312 bytes maximum residency (251 sample(s))
>          387,240 bytes maximum slop
>               12 MB total memory in use (3 MB lost due to fragmentation)
>
>   Generation 0: 37592 collections, 37591 parallel, 245.32s, 26.67s elapsed
>   Generation 1:   251 collections,   251 parallel,  3.12s,  0.33s elapsed
>
>   Parallel GC work balance: 2.41 (24219609 / 10058220, ideal 12)
>
>                         MUT time (elapsed)       GC time  (elapsed)
>   Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
>   Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
>   Task  2 (worker) :    0.00s    ( 17.97s)       0.00s    (  0.00s)
>   Task  3 (worker) :    0.00s    ( 19.35s)       0.00s    (  0.00s)
>   Task  4 (worker) :    0.00s    ( 40.28s)       0.00s    (  0.00s)
>   Task  5 (worker) :    0.00s    ( 45.08s)       0.00s    (  0.00s)
>   Task  6 (worker) :    0.00s    ( 47.06s)       0.00s    (  0.00s)
>   Task  7 (worker) :   18.30s    ( 49.73s)      16.24s    (  1.71s)
>   Task  8 (worker) :    0.00s    ( 51.22s)       0.00s    (  0.00s)
>   Task  9 (worker) :    0.00s    ( 53.75s)       0.00s    (  0.00s)
>   Task 10 (worker) :    0.00s    ( 54.17s)       0.00s    (  0.00s)
>   Task 11 (worker) :    5.65s    ( 54.30s)       0.70s    (  0.08s)
>   Task 12 (worker) :    0.00s    ( 54.41s)       0.41s    (  0.04s)
>   Task 13 (worker) :    4.34s    ( 54.58s)       4.50s    (  0.48s)
>   Task 14 (worker) :    5.82s    ( 54.76s)       5.91s    (  0.64s)
>   Task 15 (worker) :    6.50s    ( 55.01s)       3.37s    (  0.38s)
>   Task 16 (worker) :    7.60s    ( 55.21s)       8.56s    (  0.94s)
>   Task 17 (worker) :   11.05s    ( 55.21s)       9.04s    (  0.96s)
>   Task 18 (worker) :   11.75s    ( 55.21s)      12.94s    (  1.43s)
>   Task 19 (worker) :   16.02s    ( 55.21s)      13.32s    (  1.43s)
>   Task 20 (worker) :   26.98s    ( 55.23s)       7.43s    (  0.77s)
>   Task 21 (worker) :    7.36s    ( 55.23s)       7.47s    (  0.83s)
>   Task 22 (worker) :   16.08s    ( 55.23s)      10.25s    (  1.12s)
>   Task 23 (worker) :    7.04s    ( 55.23s)       4.99s    (  0.57s)
>   Task 24 (worker) :   28.47s    ( 55.23s)       8.78s    (  0.94s)
>   Task 25 (worker) :    7.43s    ( 55.23s)       1.62s    (  0.18s)
>   Task 26 (worker) :    6.33s    ( 55.23s)      11.42s    (  1.23s)
>   Task 27 (worker) :    9.80s    ( 55.23s)       8.72s    (  0.95s)
>   Task 28 (worker) :    4.88s    ( 55.26s)       8.92s    (  0.99s)
>   Task 29 (worker) :    0.00s    ( 55.26s)       0.00s    (  0.00s)
>   Task 30 (bound)  :    5.59s    ( 55.26s)       0.59s    (  0.06s)
>   Task 31 (worker) :   41.16s    ( 55.26s)       3.48s    (  0.38s)
>   Task 32 (worker) :   17.03s    ( 55.26s)       3.90s    (  0.42s)
>   Task 33 (worker) :   14.89s    ( 55.26s)       5.29s    (  0.58s)
>   Task 34 (worker) :    6.30s    ( 55.26s)       1.99s    (  0.21s)
>   Task 35 (worker) :   16.13s    ( 55.26s)      13.95s    (  1.50s)
>   Task 36 (worker) :   16.70s    ( 55.26s)      13.02s    (  1.41s)
>   Task 37 (worker) :   11.68s    ( 55.26s)      15.45s    (  1.68s)
>   Task 38 (worker) :    7.65s    ( 55.26s)      13.61s    (  1.50s)
>   Task 39 (worker) :   14.18s    ( 55.26s)      12.48s    (  1.35s)
>   Task 40 (worker) :    5.51s    ( 55.26s)      11.92s    (  1.31s)
>   Task 41 (worker) :    3.10s    ( 55.26s)       8.17s    (  0.93s)
>
>   SPARKS: 42525 (37707 converted, 519 pruned)
>
>   INIT  time    0.00s  (  0.00s elapsed)
>   MUT   time  365.55s  ( 55.26s elapsed)
>   GC    time  248.45s  ( 27.00s elapsed)
>   EXIT  time    0.00s  (  0.00s elapsed)
>   Total time  614.01s  ( 82.26s elapsed)
>
>   %GC time      40.5%  (32.8% elapsed)
>
>   Alloc rate    180,724,403 bytes per MUT second
>
>   Productivity  59.5% of total user, 444.4% of total elapsed
>
> gc_alloc_block_sync: 1339708
> whitehole_spin: 36
> gen[0].sync_large_objects: 197621
> gen[1].sync_large_objects: 6396
>
> It seems like suddenly GC is taking a lot longer.
>
> I also tried using Control.Monad.Par, with runPar $ do parMap, and using
> +RTS -N12 -s I get this:
>
> 71,480,561,920 bytes allocated in the heap
>      275,752,912 bytes copied during GC
>          198,704 bytes maximum residency (343 sample(s))
>          394,232 bytes maximum slop
>               12 MB total memory in use (3 MB lost due to fragmentation)
>
>   Generation 0: 37575 collections, 37574 parallel, 169.08s, 16.67s elapsed
>   Generation 1:   343 collections,   343 parallel,  6.24s,  0.61s elapsed
>
>   Parallel GC work balance: 2.56 (33928871 / 13275450, ideal 12)
>
>                         MUT time (elapsed)       GC time  (elapsed)
>   Task  0 (worker) :    0.70s    ( 46.46s)       0.00s    (  0.00s)
>   Task  1 (worker) :    2.94s    ( 46.46s)       0.00s    (  0.00s)
>   Task  2 (worker) :    0.10s    ( 46.46s)       0.00s    (  0.00s)
>   Task  3 (worker) :    0.06s    ( 46.47s)       0.00s    (  0.00s)
>   Task  4 (worker) :    1.70s    ( 46.47s)       0.00s    (  0.00s)
>   Task  5 (worker) :    0.07s    ( 46.47s)       0.00s    (  0.00s)
>   Task  6 (worker) :    2.75s    ( 46.47s)       0.00s    (  0.00s)
>   Task  7 (worker) :    1.21s    ( 46.47s)       0.00s    (  0.00s)
>   Task  8 (worker) :    0.43s    ( 46.47s)       0.00s    (  0.00s)
>   Task  9 (worker) :   10.44s    ( 46.47s)       0.00s    (  0.00s)
>   Task 10 (worker) :    0.06s    ( 46.47s)       0.00s    (  0.00s)
>   Task 11 (worker) :    1.36s    ( 46.47s)       0.00s    (  0.00s)
>   Task 12 (worker) :    0.00s    ( 46.50s)      46.69s    (  4.63s)
>   Task 13 (worker) :    0.00s    ( 46.50s)       0.00s    (  0.00s)
>   Task 14 (bound)  :    0.05s    ( 46.50s)       0.56s    (  0.06s)
>   Task 15 (worker) :   24.84s    ( 46.50s)       2.24s    (  0.21s)
>   Task 16 (worker) :   28.00s    ( 46.50s)       1.95s    (  0.19s)
>   Task 17 (worker) :   35.30s    ( 46.50s)       2.79s    (  0.27s)
>   Task 18 (worker) :   29.03s    ( 46.50s)       4.06s    (  0.39s)
>   Task 19 (worker) :   32.71s    ( 46.50s)       7.04s    (  0.69s)
>   Task 20 (worker) :   32.74s    ( 46.50s)      11.47s    (  1.09s)
>   Task 21 (worker) :   23.35s    ( 46.50s)      15.16s    (  1.49s)
>   Task 22 (worker) :   36.00s    ( 46.50s)       7.94s    (  0.76s)
>   Task 23 (worker) :    0.00s    ( 46.50s)      35.84s    (  3.63s)
>   Task 24 (worker) :   24.70s    ( 46.50s)      21.26s    (  2.06s)
>   Task 25 (worker) :   23.07s    ( 46.50s)      18.33s    (  1.82s)
>
>   SPARKS: 0 (0 converted, 0 pruned)
>
>   INIT  time    0.00s  (  0.00s elapsed)
>   MUT   time  307.74s  ( 46.50s elapsed)
>   GC    time  175.32s  ( 17.29s elapsed)
>   EXIT  time    0.00s  (  0.00s elapsed)
>   Total time  483.06s  ( 63.79s elapsed)
>
>   %GC time      36.3%  (27.1% elapsed)
>
>   Alloc rate    232,274,501 bytes per MUT second
>
>   Productivity  63.7% of total user, 482.4% of total elapsed
>
> gc_alloc_block_sync: 1619540
> whitehole_spin: 258
> gen[0].sync_large_objects: 232661
> gen[1].sync_large_objects: 6378
>
> This seems slightly better, but the GC time is still much bigger than I
> would expect it to be.
>
> I don't think it is solely because of parallel GC, since running the
> parallel code with +RTS -N1 gives results similar to the non-parallel
> version.
>
> Am I simply misunderstanding the way the GC time is represented in this
> output, or is there something amiss?
>
> (I also noticed that there seem to be a lot more than 12 threads running if
> I look at top while the program is running, but I assume this is something
> to do with the runtime spawning more threads for GC etc)
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20111005/ef6a36a2/attachment.htm>