[GHC] #8578: Improvements to SpinLock implementation
GHC
ghc-devs at haskell.org
Sat Nov 30 20:42:53 UTC 2013
#8578: Improvements to SpinLock implementation
-------------------------------------+------------------------------------
Reporter: parcs | Owner: parcs
Type: task | Status: patch
Priority: normal | Milestone:
Component: Runtime System | Version: 7.7
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: None/Unknown | Difficulty: Unknown
Test Case: | Blocked By:
Blocking: | Related Tickets:
-------------------------------------+------------------------------------
Comment (by thoughtpolice):
Very good stuff, Patrick; this looks nice to me, and I can see how it'd
help scalability.
I did a quick glance, and one question I have is: why are we using
`busy_wait_nop` in the wait loop, which just becomes `rep; nop` on x86?
For spinlock implementations, Intel at least has a `PAUSE` instruction
which specifically informs the CPU that this is a spinlock wait loop,
allowing it to optimize cache and memory accesses where possible and
avoid memory-ordering violations that would require the CPUs to
synchronize. This can be quite dramatic on older processors, I believe
(I wouldn't be surprised if `rep; nop` devolved into `pause` on some
microarchitectures, but that's probably not guaranteed). `PAUSE` may
also actually *delay* the CPU so this optimization can happen, whereas
`rep; nop` will merely run as fast as possible. So I'd suggest
replacing `busy_wait_nop` on x86 with something like `_mm_pause` as
defined via GCC inline assembly:
{{{
/* Tell the CPU this is a spin-wait loop */
#define _mm_pause()                                 \
    ({                                              \
        __asm__ __volatile__("pause" ::: "memory"); \
    })
}}}
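For context, here's a minimal, self-contained sketch of where the pause
would sit: a test-and-test-and-set lock that only issues locked
operations when the lock actually looks free (this is not the RTS code;
the `SpinLock` type and function names here are illustrative):
{{{
/* Sketch only, not GHC's implementation: a test-and-test-and-set
 * spinlock using GCC builtins, pausing in the read-only wait loop. */
#include <stdint.h>

#define _mm_pause()                                 \
    ({                                              \
        __asm__ __volatile__("pause" ::: "memory"); \
    })

typedef volatile uint32_t SpinLock;

static inline void spin_lock(SpinLock *l)
{
    while (__sync_lock_test_and_set(l, 1)) {
        /* Spin read-only until the lock looks free; pause keeps the
         * core from flooding the memory system with locked ops. */
        while (*l != 0)
            _mm_pause();
    }
}

static inline void spin_unlock(SpinLock *l)
{
    __sync_lock_release(l);
}
}}}
The inner read-only loop is the part that benefits from `pause`; the
expensive exclusive cacheline traffic only happens once the lock
actually appears free.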
Finally, one thing I did for fun a while back was to write an
accelerated spinlock implementation using the new TSX extensions in
Intel processors. In my very unscientific experiments, this essentially
eliminated lock contention (or even the need to take a lock at all) in
a lot of cases. I wonder whether it's worth adding here to see if
there's any difference in throughput or contention. You can find the
code here (pinpointed to the code in question):
https://gist.github.com/thoughtpolice/7123036#file-spinlock-rtm-c-L230
Note my example does *not* use lock-elision prefixes for `xchg` (which
are backwards compatible), but instead uses the new TSX RTM
instructions to wrap lock/unlock in a transaction, allowing speculative
elision. In practice this works just as well, and RTM is more flexible.
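As a rough illustration of the RTM pattern (a hedged sketch in the
spirit of the gist, not the gist's code; it reuses the illustrative
`SpinLock`/`spin_lock` names from the sketch above and needs a compiler
with RTM support, e.g. GCC with `-mrtm`):
{{{
/* Sketch of speculative lock elision with RTM: run the critical
 * section as a transaction, falling back to the real lock on abort. */
#include <immintrin.h>   /* _xbegin/_xend/_xabort; build with -mrtm */

static inline void elided_lock(SpinLock *l)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        if (*l == 0)        /* put the lock word in our read-set */
            return;         /* speculate without taking the lock */
        _xabort(0xff);      /* lock is held: abort the transaction */
    }
    spin_lock(l);           /* abort/unsupported path: real lock */
}

static inline void elided_unlock(SpinLock *l)
{
    if (*l == 0)
        _xend();            /* still speculating: commit */
    else
        spin_unlock(l);     /* we took the lock for real */
}
}}}
Because the lock word sits in the transaction's read-set, any thread
that takes the lock for real aborts the speculating threads, so
correctness always falls back to the ordinary lock.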
It would, however, require checking `cpuid` to see whether TSX is
available and, if so, dynamically replacing the code path as I have
done. On Haswell, this indirect branch would probably be predicted so
well that its overhead is basically zero, but I don't know what it
might do to e.g. AMD machines in terms of prediction or throughput.
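The detection only has to happen once at startup; a sketch using GCC's
`<cpuid.h>` helpers (the `lock_impl` function-pointer dispatch is mine,
just to show where the indirect branch would live):
{{{
/* Sketch: detect RTM (CPUID leaf 7, subleaf 0, EBX bit 11) once at
 * startup and dispatch through a function pointer thereafter. */
#include <cpuid.h>

static int have_rtm(void)
{
    unsigned int a, b, c, d;
    if (__get_cpuid_max(0, 0) < 7)
        return 0;
    __cpuid_count(7, 0, a, b, c, d);
    return (b >> 11) & 1;     /* EBX bit 11 = RTM */
}

static void (*lock_impl)(SpinLock *);

static void init_spinlocks(void)
{
    lock_impl = have_rtm() ? elided_lock : spin_lock;
}
}}}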
In any case, thanks, Patrick. If you'd like to test the elision
approach, or the impact of `_mm_pause`, I have a powerful Haswell
server available (8 hardware threads/32GB RAM) that you can play around
with for testing.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8578#comment:4>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list