Extended periods of "waking up thread %d on cap %d"

Sat Jan 26 01:28:27 CET 2013

Recently I've been benchmarking my concurrent Gibbs sampler[1] on a
largish multicore machine. I've been using GHC HEAD due to various
GC-related fixes that are present. One thing that I've noticed in
looking over the event logs is that there are large durations (tens of
milliseconds) when HECs appear to be constantly bombarded with wake-ups
from other capabilities.

In these periods, the eventlog will be filled with blocks of messages
such as this snippet from one benchmark run[2] (long repeated blocks
marked with ellipses, a few irrelevant messages have been omitted yet
not marked, see eventlog for full log),

     28.320958s    HEC 7: running thread 293
     ...
     28.391802s    HEC 6: running thread 209
     28.392070s    HEC 6: waking up thread 284 on cap 7
     28.392302s    HEC 6: waking up thread 284 on cap 7
     28.392579s    HEC 6: waking up thread 284 on cap 7
     28.392808s    HEC 6: waking up thread 284 on cap 7
     .
     .
     .
     28.405971s    HEC 16: waking up thread 284 on cap 7
     28.406026s    HEC 15: waking up thread 284 on cap 7
     28.406080s    HEC 12: waking up thread 284 on cap 7
     28.406091s    HEC 23: waking up thread 284 on cap 7
     28.406095s    HEC 6: waking up thread 284 on cap 7
     28.406095s    HEC 23: waking up thread 284 on cap 7
     28.406225s    HEC 16: waking up thread 284 on cap 7
     .
     .
     .
     28.490449s    HEC 7: stopping thread 293 (thread yielding)
     28.490454s    HEC 7: waking up thread 284 on cap 7
     28.490454s    HEC 7: waking up thread 284 on cap 7
     28.490455s    HEC 7: waking up thread 284 on cap 7
     28.490456s    HEC 7: waking up thread 284 on cap 7
     28.490456s    HEC 7: waking up thread 284 on cap 7
     28.490457s    HEC 7: waking up thread 284 on cap 7
     28.490458s    HEC 7: waking up thread 284 on cap 7
     .
     .
     .
     28.491841s    HEC 7: waking up thread 284 on cap 7
     28.491850s    HEC 7: migrating thread 284 to cap 0
     28.491968s    HEC 7: running thread 293
     .
     .
     .

In other words, it appears that HEC 7 took a vacation of more than a
millisecond as it tried (ultimately unsuccessfully) to wake up thread
284. These blocks are easy to spot in this event log and quite difficult
to explain. In this case I was being silly and spawned a few too many
capabilities (30 capabilities whereas the machine has 24 real cores, 2
threads per core due to hyperthreading; I was running 24 threads).

That being said, even under more reasonable conditions[3] (where I was
running 24 capabilities with 20 threads on the same 24-core machine) I
still see problems waking up. For example, from 32.293455s to 32.327349s
see numerous attempts by many HECs to wake-up thread 517 on cap
12. In this period, cap 12 briefly runs thread 537 (32.326781s)
but while running HEC 17 attempts to wake up 517 again.

All-in-all, something seems a bit strange. Any ideas what might be going
on here?

I've CC'd Edward Yang (who I understand has recently been doing a
rework on the scheduler) and Simon Marlow.

Thanks,

- Ben

[1] https://github.com/bgamari/bayes-stack
[2] http://goldnerlab.physics.umass.edu/~bgamari/Benchmark-wakeup.eventlog
[3] http://goldnerlab.physics.umass.edu/~bgamari/Benchmark-wakeup-smaller.eventlog