Extended periods of "waking up thread %d on cap %d"
Simon Marlow
marlowsd at gmail.com
Mon Jan 28 12:40:18 CET 2013
On 26/01/13 00:28, Ben Gamari wrote:
> Recently I've been benchmarking my concurrent Gibbs sampler[1] on a
> largish multicore machine. I've been using GHC HEAD due to various
> GC-related fixes that are present. One thing that I've noticed in
> looking over the event logs is that there are large durations (tens of
> milliseconds) when HECs appear to be constantly bombarded with wake-ups
> from other capabilities.
>
> In these periods, the eventlog will be filled with blocks of messages
> such as this snippet from one benchmark run[2] (long repeated blocks are
> marked with ellipses; a few irrelevant messages have been omitted but
> not marked; see the eventlog for the full log):
>
> 28.320958s HEC 7: running thread 293
> ...
> 28.391802s HEC 6: running thread 209
> 28.392070s HEC 6: waking up thread 284 on cap 7
> 28.392302s HEC 6: waking up thread 284 on cap 7
> 28.392579s HEC 6: waking up thread 284 on cap 7
> 28.392808s HEC 6: waking up thread 284 on cap 7
> .
> .
> .
> 28.405971s HEC 16: waking up thread 284 on cap 7
> 28.406026s HEC 15: waking up thread 284 on cap 7
> 28.406080s HEC 12: waking up thread 284 on cap 7
> 28.406091s HEC 23: waking up thread 284 on cap 7
> 28.406095s HEC 6: waking up thread 284 on cap 7
> 28.406095s HEC 23: waking up thread 284 on cap 7
> 28.406225s HEC 16: waking up thread 284 on cap 7
> .
> .
> .
> 28.490449s HEC 7: stopping thread 293 (thread yielding)
> 28.490454s HEC 7: waking up thread 284 on cap 7
> 28.490454s HEC 7: waking up thread 284 on cap 7
> 28.490455s HEC 7: waking up thread 284 on cap 7
> 28.490456s HEC 7: waking up thread 284 on cap 7
> 28.490456s HEC 7: waking up thread 284 on cap 7
> 28.490457s HEC 7: waking up thread 284 on cap 7
> 28.490458s HEC 7: waking up thread 284 on cap 7
> .
> .
> .
> 28.491841s HEC 7: waking up thread 284 on cap 7
> 28.491850s HEC 7: migrating thread 284 to cap 0
> 28.491968s HEC 7: running thread 293
> .
> .
> .
>
> In other words, it appears that HEC 7 took a vacation of more than a
> millisecond as it tried (ultimately unsuccessfully) to wake up thread
> 284. These blocks are easy to spot in this event log and quite difficult
> to explain. In this case I was being silly and spawned a few too many
> capabilities (30 capabilities whereas the machine has 24 real cores, 2
> threads per core due to hyperthreading; I was running 24 threads).
Thread 284 is blocked in an STM transaction:
28320680000: cap 7: running thread 284
28320708000: cap 7: thread 284 has label "diff worker"
28320716000: cap 7: stopping thread 284 (blocked in STM retry)
Anything that commits a write to one of the TVars that thread 284 read
during its transaction will cause a wakeup, and that wakeup entails
sending a message to cap 7. Sending a message to a cap involves taking a
lock, so it's not cheap and could lead to contention if there are lots
of attempts to wake up the thread.
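
Roughly, the sending side looks like this (a simplified, from-memory
sketch of tryWakeupThread() in rts/Threads.c, not the verbatim source):

    /* Simplified sketch of tryWakeupThread() in rts/Threads.c (from
     * memory, not verbatim).  Each committing transaction that wrote
     * one of the TVars thread 284 is watching ends up here via
     * STM.c:unpark_tso(). */
    void tryWakeupThread (Capability *cap, StgTSO *tso)
    {
        /* This posts the "waking up thread %d on cap %d" event that
         * fills the eventlog. */
        traceEventThreadWakeup(cap, tso, tso->cap->no);

        if (tso->cap != cap) {
            /* The TSO belongs to another capability: allocate a
             * MSG_TRY_WAKEUP message and put it in that capability's
             * inbox.  sendMessage() takes the target capability's lock,
             * which is where the cost and the contention come from. */
            MessageWakeup *msg =
                (MessageWakeup *)allocate(cap, sizeofW(MessageWakeup));
            SET_HDR(msg, &stg_MSG_TRY_WAKEUP_info, CCS_SYSTEM);
            msg->tso = tso;
            sendMessage(cap, tso->cap, (Message *)msg);
            return;
        }

        /* ...otherwise we own the TSO and can unblock it directly... */
    }
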
So what we seem to be looking at is lots of wakeup messages sent to cap
7, and then at 28.490454s we see cap 7 processing all those messages,
which takes a full millisecond.
So a way to improve this would probably be to make STM wakeups atomic,
so we don't need to send repeated wakeup messages if the thread hasn't
woken up yet. This part of the system is very tricky, which is one
reason we simplify things by allowing any number of wakeup messages.
Only the capability that owns a TSO is allowed to modify it; everyone
else has to send a message.
For MVar we do break the rules though: when waking up a thread blocked
on an MVar, the waker removes the TSO from the MVar's queue (holding a
lock on the MVar) and then sends a message to the owner. The owner knows
whether to really wake up the TSO by checking whether it has already
been removed from the queue; the removal can only happen once, because
of the lock on the MVar, so at most one message leads to a real wakeup.
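
On the receiving side the owning capability decides whether to really
unblock the thread; roughly (again a from-memory sketch of the relevant
cases inside tryWakeupThread(), once we know we own the TSO, not the
verbatim source):

    /* Sketch of what the owning capability does with a wakeup it
     * receives for tso (from memory, not verbatim). */
    switch (tso->why_blocked) {

    case BlockedOnMVar:
        /* The waker already unlinked the TSO from the MVar's queue
         * while holding the MVar lock and marked the operation as
         * completed (setting tso->_link to END_TSO_QUEUE).  That can
         * only happen once, so a stale or duplicate message is simply
         * ignored. */
        if (tso->_link == END_TSO_QUEUE) {
            tso->why_blocked = NotBlocked;
            appendToRunQueue(cap, tso);
        }
        break;

    case BlockedOnSTM:
        /* No such check for STM today: any message that arrives while
         * the thread is still blocked unblocks it, and later duplicates
         * are processed and dropped, so any number of wakeup messages
         * is tolerated. */
        tso->why_blocked = NotBlocked;
        appendToRunQueue(cap, tso);
        break;

    default:
        break;
    }
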
So I think we could adopt a similar policy for STM. The mutex is
lockTSO()/unlockTSO(), which is already taken in STM.c:unpark_tso(). To
indicate that a thread has already been sent a wakeup message, we could
set tso->block_info.closure to a special value - normally it would be
END_TSO_QUEUE for a thread blocked in STM. I've been peering at the
code a bit, and I'm pretty sure this is safe and doesn't involve many
changes to the RTS.
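
In code, the change would be something like this (an untested sketch;
STM_AWOKEN is a made-up name for whatever distinguished marker value we
pick, not something that exists today):

    /* Untested sketch of the proposed change to unpark_tso() in
     * rts/STM.c.  block_info.closure is normally END_TSO_QUEUE for a
     * thread blocked in STM, so we can reuse it, under the TSO lock, as
     * a "wakeup message already sent" flag.  STM_AWOKEN is a
     * hypothetical new sentinel value. */
    static void unpark_tso (Capability *cap, StgTSO *tso)
    {
        lockTSO(tso);
        if (tso->why_blocked == BlockedOnSTM
            && tso->block_info.closure == (StgClosure *)END_TSO_QUEUE) {
            /* First wakeup for this blocked transaction: set the flag
             * so that later committers, who also take the TSO lock,
             * skip the message send. */
            tso->block_info.closure = STM_AWOKEN;
            tryWakeupThread(cap, tso);
        }
        /* Otherwise either the thread is no longer blocked in STM, or a
         * wakeup message is already on its way; don't send another. */
        unlockTSO(tso);
    }

    /* The owning capability would reset block_info.closure when it
     * actually wakes the thread, so that the thread can block in STM
     * and be woken again later. */
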
Do you want to try this out and let me know if you get stuck?
Cheers,
Simon
> That being said, even under more reasonable conditions[3] (where I was
> running 24 capabilities with 20 threads on the same 24-core machine) I
> still see wakeup problems. For example, from 32.293455s to 32.327349s I
> see numerous attempts by many HECs to wake up thread 517 on cap 12. In
> this period, cap 12 briefly runs thread 537 (at 32.326781s), but even
> while it is running, HEC 17 attempts to wake up thread 517 again.
>
> All in all, something seems a bit strange. Any ideas what might be going
> on here?
>
> I've CC'd Edward Yang (who I understand has recently been doing a
> rework of the scheduler) and Simon Marlow.
>
> Thanks,
>
> - Ben
>
>
> [1] https://github.com/bgamari/bayes-stack
> [2] http://goldnerlab.physics.umass.edu/~bgamari/Benchmark-wakeup.eventlog
> [3] http://goldnerlab.physics.umass.edu/~bgamari/Benchmark-wakeup-smaller.eventlog
>