[GHC] #9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess
GHC
ghc-devs at haskell.org
Tue Jul 8 10:22:47 UTC 2014
#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess
------------------------------------+-------------------------------------
Reporter: edsko | Owner:
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.8.2
Keywords: | Operating System: Unknown/Multiple
Architecture: Unknown/Multiple | Type of failure: None/Unknown
Difficulty: Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: |
------------------------------------+-------------------------------------
The attached Haskell program is a stress test for `forkProcess`. It starts
100 child processes, each of which does a single, safe FFI call, after
which the main process waits for all child processes to terminate.
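For readers without the attachment, the fork-then-wait shape of the test can be sketched in plain C. This is only an analogue under stated assumptions: the real test is Haskell and uses `forkProcess`, and `fork_and_wait` and `MAX_CHILDREN` are hypothetical names that do not appear in the attached program. The child body is a placeholder for the single FFI call.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 256  /* arbitrary cap for this sketch */

/* Fork n_children processes, each of which exits immediately, then wait
   for every one of them. Returns the number of children that exited with
   status 0, or -1 if the argument is out of range or a fork failed. */
int fork_and_wait(int n_children)
{
    pid_t pids[MAX_CHILDREN];
    int started = 0;
    int clean = 0;

    if (n_children < 0 || n_children > MAX_CHILDREN)
        return -1;

    for (int i = 0; i < n_children; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;              /* fork failed; still reap what we started */
        if (pid == 0)
            _exit(0);           /* child: placeholder for the single FFI call */
        pids[started++] = pid;
    }

    /* A child that never terminates would make this loop block forever,
       which is the symptom the stress test is looking for. */
    for (int i = 0; i < started; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i]
            && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            clean++;
    }

    return started == n_children ? clean : -1;
}
```

Calling `fork_and_wait(100)` mirrors the 100 children used in the ticket. The plain C version has no RTS to shut down in the child, so it is not expected to reproduce the hang.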
I compile the test with
{{{
# gcc -c -o TestForkProcessC.o -g TestForkProcessC.c
# ghc -debug -threaded -fforce-recomp -Wall TestForkProcess.hs TestForkProcessC.o
}}}
and then start running it until it fails (that is, until one or more of
the child processes fail to terminate):
{{{
# while ./TestForkProcess +RTS -N1 ; do echo "OK"; done
}}}
Actually, most of the time this happens pretty quickly (often even on the
first call to `TestForkProcess`).
Those child processes that do fail to terminate get stuck in an infinite
loop in `shutdownCapability`, which looks something like:
{{{
void shutdownCapability (Capability *cap, Task *task, rtsBool safe)
{
    nat i;
    task->cap = cap;
    for (i = 0; /* i < 50 */; i++) {
        // ... other conditionals omitted
        if (cap->suspended_ccalls && safe) {
            cap->running_task = NULL;
            RELEASE_LOCK(&cap->lock);
            // The IO manager thread might have been slow to start up,
            // so the first attempt to kill it might not have
            // succeeded. Just in case, try again - the kill message
            // will only be sent once.
            ioManagerDie();
            yieldThread();
            continue;
        }
        traceSparkCounters(cap);
        RELEASE_LOCK(&cap->lock);
        break;
    }
}
}}}
(Note that I'm only considering the threaded RTS.) In the child processes
that loop indefinitely, this `cap->suspended_ccalls && safe` condition is
triggered time and again.
When it does, it gets stuck waiting for a single `InCall`. This `InCall`
is created by a call to `newInCall` in `workerStart` -- i.e., it is
created on pthread startup. That raises the question of where this worker
task was created; I don't know for sure, but I am fairly sure that it
happens during the initialization of the IO manager. (The initialization
sequence of the IO manager involves the creation of 4 tasks before we even
get to `main`, so it's a bit hard to navigate.)
I have some further evidence that the I/O manager is involved, although
not necessarily the cause of the problem. On normal termination, the I/O
manager is asked to shut down by the call to `ioManagerDie` in
`shutdownCapability`, shown above. This will send `IO_MANAGER_DIE`
(`0xFE`) on the I/O manager's "control pipe" (created in
`GHC.Event.Thread.startTimerManagerThread`). When the timer manager thread
receives this (in `GHC.Event.TimerManager.handleControlEvent`) it calls
`shutdownManagers`, which shuts down the IO manager threads by sending
them `io_MANAGER_DIE` on their respective pipes. This gets received by
`GHC.Event.Manager.handleControlEvent` and the IO manager threads exit.
(Note on capitalization: `IO_MANAGER_DIE` is the C symbol;
`io_MANAGER_DIE` is the Haskell symbol.)
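The mechanism described above is a one-byte message sent over a pipe. A minimal C sketch of that idea follows. It is not the RTS or `GHC.Event` code: the only detail taken from the ticket is the value `0xFE`, and the names `CONTROL_MSG_DIE`, `send_control_byte`, `recv_control_byte` and `control_pipe_round_trip` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Value quoted in the ticket for IO_MANAGER_DIE; the macro name is ours. */
#define CONTROL_MSG_DIE 0xFE

/* Write a single control byte to the write end of a pipe.
   Returns 0 on success, -1 on failure. */
int send_control_byte(int write_fd, uint8_t msg)
{
    assert(write_fd >= 0);
    return write(write_fd, &msg, 1) == 1 ? 0 : -1;
}

/* Block until one control byte arrives on the read end of a pipe.
   Returns the byte (0..255), or -1 on end-of-file or error. */
int recv_control_byte(int read_fd)
{
    uint8_t msg;
    assert(read_fd >= 0);
    return read(read_fd, &msg, 1) == 1 ? (int)msg : -1;
}

/* Create a pipe, send the "die" byte, and read it back.
   Returns 1 if the reader saw CONTROL_MSG_DIE, 0 otherwise. */
int control_pipe_round_trip(void)
{
    int fds[2];
    int got = -1;

    if (pipe(fds) != 0)
        return 0;
    if (send_control_byte(fds[1], CONTROL_MSG_DIE) == 0)
        got = recv_control_byte(fds[0]);
    close(fds[0]);
    close(fds[1]);
    return got == CONTROL_MSG_DIE;
}
```

In this sketch the writer and reader share one process, so the byte always arrives. In the real RTS the reader is a separate thread waiting on the pipe, so a reader that is missing or blocked elsewhere would leave the message unread.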
When the child process fails to terminate, the first part of this process
still happens. The timer manager thread receives `IO_MANAGER_DIE` and
calls `shutdownManagers`. However, now things go wrong, and it seems they
go wrong in one of two ways. The very first thing that `shutdownManagers`
does is acquire the `ioManagerLock`. Sometimes it gets stuck right there.
However, this is not ''always'' the case. Sometimes it does manage to
acquire the lock, and I can see it going through the loop and sending the
shutdown signal to the IO manager thread (I'm saying "the" because I've
exclusively been testing with `-N1`). Either way, in the case that the
child process gets stuck, this signal somehow never arrives at the IO
manager thread (that is, I have a print statement in `readControlMessage`
that prints a message when it receives `IO_MANAGER_DIE`, along with a bit
of information where it was called from, and that print statement never
triggers).
I am not sure where to go from here. Note that I have only been able to
reproduce this on OSX/ghc 7.8. I have not been able to reproduce this
problem on Linux/7.8 (although there ''are'' other problems with
`forkProcess` on Linux, which unfortunately are proving even more
elusive). The attached stress test ''does'' very often get stuck on
Linux/7.4, but of course that's a different I/O manager altogether and is
probably an unrelated bug.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9284>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler