memory fragmentation with ghc-7.6.1
Ryan Newton
rrnewton at gmail.com
Mon Oct 1 16:56:26 CEST 2012
Hi Ben,
I would bet on the same memory issues Simon mentioned. But... while you're
at it, would you mind trying a little experiment: share your work items
through a lock-free queue rather than a TQueue?
http://hackage.haskell.org/package/lockfree-queue
In some situations this can yield some benefit. But again, you didn't
see workers retrying transactions, so this is probably not the issue.
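
A minimal sketch of the swap, assuming lockfree-queue's
Data.Concurrent.Queue.MichaelScott API (WorkItem and the polling loop are
just placeholders for whatever your TQueue currently carries):

  import Control.Concurrent (threadDelay)
  import Data.Concurrent.Queue.MichaelScott (LinkedQueue, newQ, pushL, tryPopR)

  -- Placeholder for whatever the TQueue currently carries.
  data WorkItem = WorkItem Int

  -- Producer side: push work items onto the left end of the queue.
  feedWork :: LinkedQueue WorkItem -> [WorkItem] -> IO ()
  feedWork q = mapM_ (pushL q)

  -- Worker side: tryPopR is non-blocking, so back off briefly when the
  -- queue is empty instead of blocking in STM as readTQueue would.
  workerLoop :: LinkedQueue WorkItem -> (WorkItem -> IO ()) -> IO ()
  workerLoop q process = do
    mitem <- tryPopR q
    case mitem of
      Just item -> process item >> workerLoop q process
      Nothing   -> threadDelay 100 >> workerLoop q process

  -- Create the shared queue with: q <- newQ

The main semantic difference is that pops are non-blocking, so idle workers
poll rather than sleep on a transaction.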
-Ryan
On Mon, Oct 1, 2012 at 4:18 AM, Simon Marlow <marlowsd at gmail.com> wrote:
> Hi Ben,
>
> My guess would be that you're running into some kind of memory bottleneck.
> Three common ones are:
>
> (1) total memory bandwidth
> (2) cache ping-ponging
> (3) NUMA overheads
>
> You would run into (1) if you were using an allocation area size (-A or
> -H) larger than the L2 cache. Your stats seem to indicate that you're
> running with a large heap - could that be the case?
>
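> For example, something along these lines (the program name here is just a
> placeholder) keeps the allocation area closer to a typical per-core L2:
>
>   ./MyProgram +RTS -N24 -A512k -s
>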
> (2) happens if you share data a lot between cores. It can also happen if
> the RTS shares data between cores, but I've tried to squash as much of that
> as I can.
>
> (3) is sadly something that happens on these large AMD machines (and to
> some extent large multicore Intel boxes too). Improving our NUMA support
> is something we really need to do. NUMA overheads tend to manifest as very
> unpredictable runtimes.
>
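> A crude way to probe (3) is to impose a memory policy from outside the
> RTS, e.g. interleaving pages across nodes with numactl (program name again
> a placeholder):
>
>   numactl --interleave=all ./MyProgram +RTS -N24 -s
>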
> I suggest using perf to gather some low-level stats about cache misses and
> suchlike.
>
> http://hackage.haskell.org/trac/ghc/wiki/Debugging/LowLevelProfiling/Perf
>
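> For example, something like this (exact event names vary by CPU; the
> program name is a placeholder):
>
>   perf stat -e cycles,instructions,cache-references,cache-misses \
>       ./MyProgram +RTS -N16 -s
>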
> Cheers,
> Simon
>
>
>
> On 29/09/2012 07:47, Ben Gamari wrote:
>
>> Simon Marlow <marlowsd at gmail.com> writes:
>>
>>> On 28/09/12 17:36, Ben Gamari wrote:
>>>
>>>> Unfortunately, after poking around I found a few obvious problems with
>>>> both the code and my testing configuration which explained the
>>>> performance drop. Things seem to be back to normal now. Sorry for the
>>>> noise! Great job on the new codegen.
>>>>
>>>
>>> That's good to hear, thanks for letting me know!
>>>
>> Of course!
>>
>> That being said, I have run in to a bit of a performance issue which
>> could be related to the runtime system. In particular, as I scale up in
>> thread count (from 6 up to 48, the core count of the machine) in my
>> program[1] (test data available), I'm seeing the total runtime increase,
>> as well as a corresponding increase in CPU-seconds used. This is despite
>> the RTS claiming consistently high (~94%) productivity. Meanwhile,
>> Threadscope shows that nearly all of my threads are working busily, with
>> very few STM retries and no idle time. This is in an application which
>> should scale reasonably well (or so I believe). Attached below you will
>> find a crude listing of various runtime statistics over a variety of
>> thread counts (ranging into what should probably be regarded as the
>> absurdly large).
>>
>> The application is a parallel Gibbs sampler for learning probabilistic
>> graphical models. It involves a set of worker threads (updateWorkers)
>> pulling work units off a common TQueue. After grabbing a work unit,
>> the thread will read a reference to the current global state from an
>> IORef. It will then begin a long-running calculation, resulting in a
>> small value (forced to normal form with deepseq) which it then
>> communicates back to a global update thread (diffWorker) in the form of
>> a lambda through another TQueue. The global update thread then maps the
>> global state (the same as was read from the IORef earlier) through this
>> lambda with atomicModifyIORef'. This is all implemented in [2].
>>
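>> In outline, the pattern looks something like this (a simplified sketch
>> with illustrative names and types; the real implementation is in [2]):
>>
>>   import Control.Concurrent.STM (TQueue, atomically, readTQueue, writeTQueue)
>>   import Control.DeepSeq (NFData, force)
>>   import Control.Exception (evaluate)
>>   import Data.IORef (IORef, readIORef, atomicModifyIORef')
>>
>>   -- A worker: grab a unit, read the current global state, run the
>>   -- long computation to a small, fully forced result, and queue an
>>   -- update function for the global update thread.
>>   updateWorker :: NFData d
>>                => TQueue w          -- work units
>>                -> IORef s           -- global state
>>                -> (s -> w -> d)     -- the long-running calculation
>>                -> (d -> s -> s)     -- fold a result into the state
>>                -> TQueue (s -> s)   -- updates bound for diffWorker
>>                -> IO ()
>>   updateWorker workQ stateRef compute apply diffQ = loop
>>     where
>>       loop = do
>>         unit  <- atomically (readTQueue workQ)
>>         state <- readIORef stateRef
>>         d     <- evaluate (force (compute state unit))
>>         atomically (writeTQueue diffQ (apply d))
>>         loop
>>
>>   -- The single global update thread folds updates into the state.
>>   diffWorker :: TQueue (s -> s) -> IORef s -> IO ()
>>   diffWorker diffQ stateRef = loop
>>     where
>>       loop = do
>>         f <- atomically (readTQueue diffQ)
>>         atomicModifyIORef' stateRef (\s -> (f s, ()))
>>         loop
>>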
>> I do understand that I'm asking a lot of the language and I have been
>> quite impressed by how well Haskell and GHC have stood up to the
>> challenge thusfar. That being said, the behavior I'm seeing seems a bit
>> strange. If synchronization overhead were the culprit, I'd expect to
>> observe STM retries or thread blocking, which I do not see (by eye it
>> seems that STM retries occur on the order of 5/second and worker threads
>> otherwise appear to run uninterrupted except for GC; GHC event log
>> from a 16 thread run available here[3]). If GC were the problem, I would
>> expect this to be manifested in the productivity, which it is clearly
>> not. Do you have any idea what else might be causing such extreme
>> performance degradation with higher thread counts? I would appreciate
>> any input you would have to offer.
>>
>> Thanks for all of your work!
>>
>> Cheers,
>>
>> - Ben
>>
>>
>> [1] https://github.com/bgamari/bayes-stack/v2
>> [2] https://github.com/bgamari/bayes-stack/blob/v2/BayesStack/Core/Gibbs.hs
>> [3] http://goldnerlab.physics.umass.edu/~bgamari/RunCI.eventlog
>>
>>
>>
>> Performance of Citation Influence model on lda-handcraft data set
>> 1115 arcs, 702 nodes, 50 items per node average
>> 100 sweeps in blocks of 10, 200 topics
>> Running with +RTS -A1G
>> ghc-7.7 9c15249e082642f9c4c0113133afd78f07f1ade2
>>
>> Cores User time (s) Walltime (s) CPU % Productivity
>> ====== ============= ============= ====== =============
>> 2 488.66 269.41 188% 93.7%
>> 3 533.43 195.28 281% 94.1%
>> 4 603.92 166.94 374% 94.3%
>> 5 622.40 138.16 466% 93.8%
>> 6 663.73 123.00 558% 94.2%
>> 7 713.96 114.17 647% 94.0%
>> 8 724.66 101.98 736% 93.7%
>> 9 802.75 100.59 826% .
>> 10 865.05 97.69 917% .
>> 11 966.97 99.09 1010% .
>> 12 1238.42 114.28 1117%
>> 13 1242.43 106.53 1206%
>> 14 1428.59 112.48 1310%
>> 15 1299.40 97.52 1387%
>> 16 1559.99 108.86 1481%
>> 17 1972.02 126.49 1604%
>> 18 2157.03 130.91 1696%
>> 19 1966.24 115.29 1770% .
>> 20 2693.64 146.76 1887% .
>> 21 3051.16 158.48 1990% .
>> 22 4100.88 199.18 2109% 93.7%
>> 23 4156.94 193.38 2201% 93.5%
>> 24 4780.46 212.13 2303% 94.3%
>> 25 5733.64 242.78 2407% .
>> 26 7806.47 313.92 2526% .
>> 27 6368.32 249.65 2596% .
>> 28 8563.18 320.26 2717%
>>
>>
>>
>> -sstderr output from 2 cores:
>> ======================================
>>
>> 214,358,463,960 bytes allocated in the heap
>> 2,261,367,592 bytes copied during GC
>> 71,053,384 bytes maximum residency (7 sample(s))
>> 2,077,816 bytes maximum slop
>> 2299 MB total memory in use (0 MB lost due to fragmentation)
>>
>>                        Tot time (elapsed)  Avg pause  Max pause
>> Gen  0    100 colls, 100 par    28.58s    14.27s     0.1427s    0.3345s
>> Gen  1      7 colls,   6 par     3.36s     1.74s     0.2486s    0.3241s
>>
>> Parallel GC work balance: 25.06% (serial 0%, perfect 100%)
>>
>> TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
>>
>> SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
>>
>> INIT time 0.03s ( 0.03s elapsed)
>> MUT time 474.28s (252.98s elapsed)
>> GC time 31.94s ( 16.01s elapsed)
>> EXIT time 0.12s ( 0.12s elapsed)
>> Total time 506.49s (269.26s elapsed)
>>
>> Alloc rate 451,969,681 bytes per MUT second
>>
>> Productivity 93.7% of total user, 176.2% of total elapsed
>>
>> gc_alloc_block_sync: 108391
>> whitehole_spin: 0
>> gen[0].sync: 7392
>> gen[1].sync: 97126
>>
>>
>>
>> -sstderr output from 24 cores:
>> =======================================
>>
>> 262,335,150,000 bytes allocated in the heap
>> 1,323,818,912 bytes copied during GC
>> 67,648,048 bytes maximum residency (7 sample(s))
>> 2,927,176 bytes maximum slop
>> 25172 MB total memory in use (0 MB lost due to fragmentation)
>>
>>                        Tot time (elapsed)  Avg pause  Max pause
>> Gen  0     29 colls,  29 par   162.40s     6.77s     0.2335s    0.3469s
>> Gen  1      7 colls,   6 par   118.34s     5.32s     0.7600s    1.2448s
>>
>> Parallel GC work balance: 5.61% (serial 0%, perfect 100%)
>>
>> TASKS: 27 (1 bound, 26 peak workers (26 total), using -N24)
>>
>> SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
>>
>> INIT time 0.34s ( 0.34s elapsed)
>> MUT time 4606.75s (197.59s elapsed)
>> GC time 280.74s ( 12.09s elapsed)
>> EXIT time 0.40s ( 0.40s elapsed)
>> Total time 4888.61s (210.80s elapsed)
>>
>> Alloc rate 56,945,837 bytes per MUT second
>>
>> Productivity 94.3% of total user, 2185.8% of total elapsed
>>
>> gc_alloc_block_sync: 5397994
>> whitehole_spin: 0
>> gen[0].sync: 32282
>> gen[1].sync: 5072
>>
>>
>