Justifying sched_yield() in the RTS

Fri May 1 08:18:31 UTC 2020

Hello GHC devs,

Through the course of reading some recent material posted by Linus
Torvalds ^1 ^2, I remembered stumbling upon GHC Trac #3553 ^3 some
years ago.

Looking past the harsh tone Linus used in his notes, I think he makes
some pretty reasonable points about the problems that can be caused by
using spinlocks in userspace. Specifically:

- A group of threads yielding while waiting for a spinlock are, in
effect, simply trying to coax the scheduler into scheduling the thread
that is holding the lock. This is much easier for the scheduler to do
correctly with a futex or similar, since the scheduler has some
context around which tasks are blocking/waking each other up.
- Apparently the Linux scheduler (and maybe other schedulers) has some
smarts for preserving cache locality while choosing which physical
execution units run which tasks, and reordering threads in the run
list in an ad-hoc way with sched_yield() interferes with this
mechanism.
- Using sched_yield() (or other random interference with run lists)
can cause disproportionate havoc on NUMA systems, where jostling
around the lock-holding thread by burning time slices and yielding is
especially costly.

All that said, there is perhaps one concrete advantage (aside from
absolute performance) to the current spinlock implementation: the only
functionality it requires from the host OS is a yield-like operation.
A yielding spinlock occupies a comfortable local optimum, achieving
decent performance across platforms with minimal exposure to cross-OS
API differences.

Revisiting #3553, it seems the only reason that a futex was not used
in place of a spinlock with sched_yield() was that, despite all of the
points against it, the spinlock code empirically performed better.
However, these tests were performed years ago and the futex
implementation in the Linux kernel has changed quite a bit.

I think there is a healthy amount of empirical evidence from the GHC
user community that there are problems afoot with the way parallel GC
does locking, and indeed I wonder if the bad cases Linus described are
playing a role. Advice like "use -qg" or "use -qnM with small M" is
often parroted (this Twitter thread ^4 contains both pieces of
advice). Furthermore, I would wager that an RTS using futexes for
locking rather than spinning plays nicer with other CPU intensive
tasks on the same machine. The "scheduler storm" caused by a large
number of threads waiting for a spinlock could have a negative impact
on neighboring processes, e.g. a large number of HECs spinning
certainly don't do any favors for a busy neighboring DB process.

I'm curious if others have thoughts on reviving the futex
investigation. Perhaps the futexes provided by modern Linux are more
competitive with the current spinlock implementation, or perhaps
better scaling on machines with high core counts is worth some cost.
In the case that futexes remain handily defeated by yielding
spinlocks, it's a worthy endeavor to explain precisely _why_ this
happens. Other programs have seen performance issues crop up when the
kernel makes subtle changes to how sched_yield() works. ^5

1: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723
2: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752
3: https://gitlab.haskell.org/ghc/ghc/issues/3553
4: https://twitter.com/ndm_haskell/status/1076187051610042368?s=20
5: https://lwn.net/Articles/31462/