memory fragmentation with ghc-7.6.1
Ryan Newton
rrnewton at gmail.com
Mon Oct 1 16:56:26 CEST 2012
Hi Ben,
I would bet on the same memory issues Simon mentioned. But... while you're
at it, would you mind trying a little experiment: share your work items
through a lock-free queue rather than a TQueue?
http://hackage.haskell.org/package/lockfree-queue
In some situations this can yield some benefit. But again, you didn't
see workers retrying transactions, so this is probably not the issue.
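
A minimal sketch of the swap, assuming lockfree-queue's
Data.Concurrent.Queue.MichaelScott API (WorkItem and the polling loop are
just placeholders for whatever your TQueue currently carries):

  import Control.Concurrent (threadDelay)
  import Data.Concurrent.Queue.MichaelScott (LinkedQueue, newQ, pushL, tryPopR)

  -- Placeholder for whatever the TQueue currently carries.
  data WorkItem = WorkItem Int

  -- Producer side: push work items onto the left end of the queue.
  feedWork :: LinkedQueue WorkItem -> [WorkItem] -> IO ()
  feedWork q = mapM_ (pushL q)

  -- Worker side: tryPopR is non-blocking, so back off briefly when the
  -- queue is empty instead of blocking in STM as readTQueue would.
  workerLoop :: LinkedQueue WorkItem -> (WorkItem -> IO ()) -> IO ()
  workerLoop q process = do
    mitem <- tryPopR q
    case mitem of
      Just item -> process item >> workerLoop q process
      Nothing   -> threadDelay 100 >> workerLoop q process

  -- Create the shared queue with: q <- newQ

The main semantic difference is that pops are non-blocking, so idle workers
poll rather than sleep on a transaction.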
-Ryan
On Mon, Oct 1, 2012 at 4:18 AM, Simon Marlow <marlowsd at gmail.com> wrote:
> Hi Ben,
>
> My guess would be that you're running into some kind of memory bottleneck.
> Three common ones are:
>
> (1) total memory bandwidth
> (2) cache ping-ponging
> (3) NUMA overheads
>
> You would run into (1) if you were using an allocation area size (-A or
> -H) larger than the L2 cache. Your stats seem to indicate that you're
> running with a large heap - could that be the case?
>
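> For example, something along these lines (the program name here is just a
> placeholder) keeps the allocation area closer to a typical per-core L2:
>
>   ./MyProgram +RTS -N24 -A512k -s
>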
> (2) happens if you share data a lot between cores. It can also happen if
> the RTS shares data between cores, but I've tried to squash as much of that
> as I can.
>
> (3) is sadly something that happens on these large AMD machines (and to
> some extent large multicore Intel boxes too). Improving our NUMA support
> is something we really need to do. NUMA overheads tend to manifest as very
> unpredictable runtimes.
>
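> A crude way to probe (3) is to impose a memory policy from outside the
> RTS, e.g. interleaving pages across nodes with numactl (program name again
> a placeholder):
>
>   numactl --interleave=all ./MyProgram +RTS -N24 -s
>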
> I suggest using perf to gather some low-level stats about cache misses and
> suchlike.
>
> http://hackage.haskell.org/trac/ghc/wiki/Debugging/LowLevelProfiling/Perf
>
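> For example, something like this (exact event names vary by CPU; the
> program name is a placeholder):
>
>   perf stat -e cycles,instructions,cache-references,cache-misses \
>       ./MyProgram +RTS -N16 -s
>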
> Cheers,
> Simon
>
>
>
> On 29/09/2012 07:47, Ben Gamari wrote:
>
>> Simon Marlow <marlowsd at gmail.com> writes:
>>
>>> On 28/09/12 17:36, Ben Gamari wrote:
>>>
>>>> Unfortunately, after poking around I found a few obvious problems with
>>>> both the code and my testing configuration which explained the
>>>> performance drop. Things seem to be back to normal now. Sorry for the
>>>> noise! Great job on the new codegen.
>>>>
>>>
>>> That's good to hear, thanks for letting me know!
>>>
>> Of course!
>>
>> That being said, I have run in to a bit of a performance issue which
>> could be related to the runtime system. In particular, as I scale up in
>> thread count (from 6 up to 48, the core count of the machine) in my
>> program[1] (test data available), I'm seeing the total runtime increase,
>> as well as a corresponding increase in CPU-seconds used. This is despite
>> the RTS claiming consistently high (~94%) productivity. Meanwhile,
>> Threadscope shows that nearly all of my threads are working busily, with
>> very few STM retries and no idle time. This is in an application which
>> should scale reasonably well (or so I believe). Attached below you will
>> find a crude listing of various runtime statistics over a variety of
>> thread counts (ranging into what should probably be regarded as the
>> absurdly large).
>>
>> The application is a parallel Gibbs sampler for learning probabilistic
>> graphical models. It involves a set of worker threads (updateWorkers)
>> pulling work units off a common TQueue. After grabbing a work unit,
>> the thread will read a reference to the current global state from an
>> IORef. It will then begin a long-running calculation, resulting in a
>> small value (forced to normal form with deepseq) which it then
>> communicates back to a global update thread (diffWorker) in the form of
>> a lambda through another TQueue. The global update thread then maps the
>> global state (the same as was read from the IORef earlier) through this
>> lambda with atomicModifyIORef'. This is all implemented in [2].
>>
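>> In outline, the pattern looks something like this (a simplified sketch
>> with illustrative names and types; the real implementation is in [2]):
>>
>>   import Control.Concurrent.STM (TQueue, atomically, readTQueue, writeTQueue)
>>   import Control.DeepSeq (NFData, force)
>>   import Control.Exception (evaluate)
>>   import Data.IORef (IORef, readIORef, atomicModifyIORef')
>>
>>   -- A worker: grab a unit, read the current global state, run the
>>   -- long computation to a small, fully forced result, and queue an
>>   -- update function for the global update thread.
>>   updateWorker :: NFData d
>>                => TQueue w          -- work units
>>                -> IORef s           -- global state
>>                -> (s -> w -> d)     -- the long-running calculation
>>                -> (d -> s -> s)     -- fold a result into the state
>>                -> TQueue (s -> s)   -- updates bound for diffWorker
>>                -> IO ()
>>   updateWorker workQ stateRef compute apply diffQ = loop
>>     where
>>       loop = do
>>         unit  <- atomically (readTQueue workQ)
>>         state <- readIORef stateRef
>>         d     <- evaluate (force (compute state unit))
>>         atomically (writeTQueue diffQ (apply d))
>>         loop
>>
>>   -- The single global update thread folds updates into the state.
>>   diffWorker :: TQueue (s -> s) -> IORef s -> IO ()
>>   diffWorker diffQ stateRef = loop
>>     where
>>       loop = do
>>         f <- atomically (readTQueue diffQ)
>>         atomicModifyIORef' stateRef (\s -> (f s, ()))
>>         loop
>>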
>> I do understand that I'm asking a lot of the language and I have been
>> quite impressed by how well Haskell and GHC have stood up to the
>> challenge thusfar. That being said, the behavior I'm seeing seems a bit
>> strange. If synchronization overhead were the culprit, I'd expect to
>> observe STM retries or thread blocking, which I do not see (by eye it
>> seems that STM retries occur on the order of 5/second and worker threads
>> otherwise appear to run uninterrupted except for GC; GHC event log
>> from a 16 thread run available here[3]). If GC were the problem, I would
>> expect this to be manifested in the productivity, which it is clearly
>> not. Do you have any idea what else might be causing such extreme
>> performance degradation with higher thread counts? I would appreciate
>> any input you would have to offer.
>>
>> Thanks for all of your work!
>>
>> Cheers,
>>
>> - Ben
>>
>>
>> [1] https://github.com/bgamari/bayes-stack/v2
>> [2] https://github.com/bgamari/bayes-stack/blob/v2/BayesStack/Core/Gibbs.hs
>> [3] http://goldnerlab.physics.umass.edu/~bgamari/RunCI.eventlog
>>
>>
>>
>> Performance of Citation Influence model on lda-handcraft data set
>> 1115 arcs, 702 nodes, 50 items per node average
>> 100 sweeps in blocks of 10, 200 topics
>> Running with +RTS -A1G
>> ghc-7.7 9c15249e082642f9c4c0113133afd78f07f1ade2
>>
>> Cores User time (s) Walltime (s) CPU % Productivity
>> ====== ============= ============= ====== =============
>> 2 488.66 269.41 188% 93.7%
>> 3 533.43 195.28 281% 94.1%
>> 4 603.92 166.94 374% 94.3%
>> 5 622.40 138.16 466% 93.8%
>> 6 663.73 123.00 558% 94.2%
>> 7 713.96 114.17 647% 94.0%
>> 8 724.66 101.98 736% 93.7%
>> 9 802.75 100.59 826% .
>> 10 865.05 97.69 917% .
>> 11 966.97 99.09 1010% .
>> 12 1238.42 114.28 1117%
>> 13 1242.43 106.53 1206%
>> 14 1428.59 112.48 1310%
>> 15 1299.40 97.52 1387%
>> 16 1559.99 108.86 1481%
>> 17 1972.02 126.49 1604%
>> 18 2157.03 130.91 1696%
>> 19 1966.24 115.29 1770% .
>> 20 2693.64 146.76 1887% .
>> 21 3051.16 158.48 1990% .
>> 22 4100.88 199.18 2109% 93.7%
>> 23 4156.94 193.38 2201% 93.5%
>> 24 4780.46 212.13 2303% 94.3%
>> 25 5733.64 242.78 2407% .
>> 26 7806.47 313.92 2526% .
>> 27 6368.32 249.65 2596% .
>> 28 8563.18 320.26 2717%
>>
>>
>>
>> -sstderr output from 2 cores:
>> ======================================
>>
>> 214,358,463,960 bytes allocated in the heap
>> 2,261,367,592 bytes copied during GC
>> 71,053,384 bytes maximum residency (7 sample(s))
>> 2,077,816 bytes maximum slop
>> 2299 MB total memory in use (0 MB lost due to fragmentation)
>>
>>                        Tot time (elapsed)  Avg pause  Max pause
>> Gen  0    100 colls, 100 par    28.58s    14.27s     0.1427s    0.3345s
>> Gen  1      7 colls,   6 par     3.36s     1.74s     0.2486s    0.3241s
>>
>> Parallel GC work balance: 25.06% (serial 0%, perfect 100%)
>>
>> TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
>>
>> SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
>>
>> INIT time 0.03s ( 0.03s elapsed)
>> MUT time 474.28s (252.98s elapsed)
>> GC time 31.94s ( 16.01s elapsed)
>> EXIT time 0.12s ( 0.12s elapsed)
>> Total time 506.49s (269.26s elapsed)
>>
>> Alloc rate 451,969,681 bytes per MUT second
>>
>> Productivity 93.7% of total user, 176.2% of total elapsed
>>
>> gc_alloc_block_sync: 108391
>> whitehole_spin: 0
>> gen[0].sync: 7392
>> gen[1].sync: 97126
>>
>>
>>
>> -sstderr output from 24 cores:
>> =======================================
>>
>> 262,335,150,000 bytes allocated in the heap
>> 1,323,818,912 bytes copied during GC
>> 67,648,048 bytes maximum residency (7 sample(s))
>> 2,927,176 bytes maximum slop
>> 25172 MB total memory in use (0 MB lost due to fragmentation)
>>
>>                        Tot time (elapsed)  Avg pause  Max pause
>> Gen  0     29 colls,  29 par   162.40s     6.77s     0.2335s    0.3469s
>> Gen  1      7 colls,   6 par   118.34s     5.32s     0.7600s    1.2448s
>>
>> Parallel GC work balance: 5.61% (serial 0%, perfect 100%)
>>
>> TASKS: 27 (1 bound, 26 peak workers (26 total), using -N24)
>>
>> SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
>>
>> INIT time 0.34s ( 0.34s elapsed)
>> MUT time 4606.75s (197.59s elapsed)
>> GC time 280.74s ( 12.09s elapsed)
>> EXIT time 0.40s ( 0.40s elapsed)
>> Total time 4888.61s (210.80s elapsed)
>>
>> Alloc rate 56,945,837 bytes per MUT second
>>
>> Productivity 94.3% of total user, 2185.8% of total elapsed
>>
>> gc_alloc_block_sync: 5397994
>> whitehole_spin: 0
>> gen[0].sync: 32282
>> gen[1].sync: 5072
>>
>>
>