[GHC] #13497: GHC does not use select()/poll() correctly on non-Linux platforms
GHC
ghc-devs at haskell.org
Sat Jul 29 22:50:05 UTC 2017
#13497: GHC does not use select()/poll() correctly on non-Linux platforms
-------------------------------------+-------------------------------------
Reporter: nh2 | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone:
Component: Runtime System | Version: 8.0.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: #8684, #12912 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by nh2):
**An update for GHC 8.2:**
GHC 8.2 has improved on this problem for POSIX platforms, two-fold:
1. GHC 8.2 has fixed the crashing problem of `fdReady()` that was
introduced in GHC 8.0.2 (but not present in 8.0.1), that I mentioned in
Comment 2 further above.
2. That fix (commit
[http://github.com/ghc/ghc/commit/ae69eaed6e2a5dff7f3a61d4373b7c52e715e3ad
ae69eaed] - "base: Fix hWaitForInput with timeout on POSIX") works by
repeatedly inspecting the current time (because `select()` is gone and was
replaced by `poll()` in commit
[http://github.com/ghc/ghc/commit/f46369b8a1bf90a3bdc30f2b566c3a7e03672518
f46369b8], which doesn't have the functionality of advancing a timeout
pointer even on Linux), so the problem that was described in the original
issue description is gone for FreeBSD / OSX / any platform that's not
Windows.
However, there are remaining problems with the current in implementation:
* It currently uses `gettimeofday()`, which doesn't use a monotonic clock,
so any time adjustment can make `fdReady()` wait for significantly more or
less than it should.
* It keeps track of the total time waited by adding up time differences
between calls to `gettimeofday()`:
{{{
while ((res = poll(fds, 1, msecs)) < 0) {
if (errno == EINTR) {
if (msecs > 0) {
struct timeval tv;
if (gettimeofday(&tv, NULL) != 0) {
fprintf(stderr, "fdReady: gettimeofday failed: %s\n",
strerror(errno));
abort();
}
int elapsed = 1000 * (tv.tv_sec - tv0.tv_sec)
+ (tv.tv_usec - tv0.tv_usec) / 1000;
msecs -= elapsed;
if (msecs <= 0) return 0;
tv0 = tv;
}
} else {
return (-1);
}
}
}}}
This is inaccurate, because the code between `gettimeofday()` and `tv0 =
tv;` is not tracked. If the process is descheduled by the OS in these
lines, then that time is "lost" and `fdReady()` will wait way too long.
This inacurracy can easily be observed and magnified by increasing the
frequency of signals arriving at the program. Consider this simple program
`ghc-bug-13497-accuracy-test.hs` that waits 5 seconds for input on stdin:
{{{
import System.IO
main = hWaitForInput stdin (5 * 1000)
}}}
When run normally on GHC 8.2.1 (release commit 0cee25253), this program
terminates within 5 seconds when run with the command line
{{{
inplace/bin/ghc-stage2 --make -fforce-recomp -rtsopts ghc-bug-13497
-accuracy-test.hs
/usr/bin/time ./ghc-bug-13497-accuracy-test
}}}
But it starts taking much longer when `+RTS -V...` is added for
increasingly frequent values of `-V`; the effect is even stronger when
setting the idle GC timer to something large (e.g. `-I10` for every 10
seconds):
{{{
no `-V` passed 0.00user 0.00system 0:05.02elapsed 0%CPU
-V0.1 0.00user 0.00system 0:05.01elapsed 0%CPU
-V0.01 0.00user 0.00system 0:05.01elapsed 0%CPU
-V0.001 0.00user 0.00system 0:05.13elapsed 0%CPU
-V0.0001 0.00user 0.00system 0:05.30elapsed 0%CPU
-V0.00001 0.06user 0.00system 0:05.31elapsed 1%CPU
-V0.000001 0.37user 0.20system 0:05.73elapsed 10%CPU
-V0.0000001 2.67user 3.30system 0:12.47elapsed 47%CPU
-V0.00000001 50.44user 7.32system 1:17.50elapsed 74%CPU
-I10 -V0.1 0.00user 0.00system 0:05.10elapsed 0%CPU
-I10 -V0.01 0.00user 0.00system 0:05.25elapsed 0%CPU
-I10 -V0.001 0.00user 0.10system 0:08.47elapsed 1%CPU
-I10 -V0.0001 the program did not terminate within 2 minutes of waiting
}}}
It reason it's worse with `-I10` is that, as described above, without `-I`
ghc stops the timer signal after 0.3 seconds (so no `EINTR`s are occurring
beyond that time), and with `-I` given it doesn't.
Not all of the above is in the non-threaded runtime. Doing the same with
`-threaded` on Linux gives reliable `0.00user 0.00system 0:05.01elapsed
0%CPU` no matter what `-V` or `-I` is passed, and `strace` shows that
there are no `EINTR`s happening in that case. I suspect this is because
`hWaitForInput` calls `fdReady()` as a `safe` foreign call, which makes it
have its own thread in the threaded runtime.
There is also an `unsafe` call to `fdReady()` but that one is only used
with timeouts of `0` so that's not a problem.
On FreeBSD 11, non-threaded, the situation is worse:
{{{
no `-V` passed 5.14 real 0.00 user 0.00 sys
-V0.1 5.10 real 0.00 user 0.00 sys
-V0.01 5.16 real 0.00 user 0.00 sys
-V0.001 5.17 real 0.00 user 0.02 sys
-V0.0001 5.24 real 0.00 user 0.01 sys
-V0.00001 5.89 real 0.01 user 0.08 sys
-V0.000001 5.81 real 0.00 user 0.11 sys
-V0.0000001 6.05 real 0.00 user 0.09 sys
-V0.00000001 5.77 real 0.00 user 0.09 sys
-I10 -V0.1 5.13 real 0.00 user 0.01 sys
-I10 -V0.01 5.24 real 0.00 user 0.01 sys
-I10 -V0.001 5.90 real 0.00 user 0.10 sys
-I10 -V0.0001 5.82 real 0.00 user 0.09 sys
}}}
And with `-threaded` on FreeBSD 11:
{{{
no `-V` passed 5.15 real 0.00 user 0.01 sys
-V0.1 5.30 real 0.00 user 0.00 sys
-V0.01 5.31 real 0.00 user 0.01 sys
-V0.001 5.45 real 0.00 user 0.13 sys
-V0.0001 5.98 real 0.00 user 0.13 sys
-V0.00001 5.93 real 0.00 user 0.15 sys
-V0.000001 5.79 real 0.00 user 0.15 sys
-V0.0000001 5.83 real 0.00 user 0.13 sys
-V0.00000001 5.80 real 0.00 user 0.18 sys
-I10 -V0.1 5.13 real 0.00 user 0.01 sys
-I10 -V0.01 5.27 real 0.00 user 0.03 sys
-I10 -V0.001 5.77 real 0.00 user 0.12 sys
-I10 -V0.0001 5.90 real 0.00 user 0.18 sys
}}}
As you can see, on FreeBSD 11 the `-threaded` doesn't fix the issues as it
does on Linux, and `truss` suggests that that is because `EINTR`s arrive
(while they didn't on Linux).
I'm not sure why with `threaded` on FreeBSD there's `EINTR`s happening but
not on Linux, but I observed that on Linux we have instead:
{{{
[pid 30502] timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC) = 7
...
[pid 30502] read(7, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 30502] read(7, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 30502] read(7, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 30502] read(7, "\1\0\0\0\0\0\0\0", 8) = 8
...
}}}
So I suspect that the difference is that Linux has `timerfd` and FreeBSD
doesn't.
----
OK, so far the problem description. **The summary is:**
* In the nonthreaded runtime, a high precision `-V` destroys accuracy
* On non-Linux (systems without `timerfd`), this happens even with
`-threaded`
* In any runtime, accuracy can be screwed with due to non-use of monotonic
clocks.
**The fix** is simple:
Use the monotonic clock, and instead of tracking waited time as a sum of
wait intervals, always compare the current time with the _end_ time (time
at entry of `fdReady()` + `msec`).
I have implemented this in commit
[https://github.com/nh2/ghc/commit/12f9d66b5c837c221be080b526dcb61fecb7cf1c
12f9d66b] of my branch
[https://github.com/ghc/ghc/compare/ghc-8.2.1-release...nh2:ghc-8.2.1
-improve-fdRready-precision ghc-8.2.1-improve-fdRready-precision] (the
first link is stable, the latter may change as I write more fixes for
Windows).
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/13497#comment:18>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list