[GHC] #13497: GHC does not use select()/poll() correctly on non-Linux platforms

GHC ghc-devs at haskell.org
Sat Jul 29 22:50:05 UTC 2017


#13497: GHC does not use select()/poll() correctly on non-Linux platforms
-------------------------------------+-------------------------------------
        Reporter:  nh2               |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Runtime System    |              Version:  8.0.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #8684, #12912     |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by nh2):

 **An update for GHC 8.2:**

 GHC 8.2 has improved on this problem for POSIX platforms, two-fold:

 1. GHC 8.2 has fixed the crashing problem of `fdReady()` that was
 introduced in GHC 8.0.2 (but not present in 8.0.1), that I mentioned in
 Comment 2 further above.
 2. That fix (commit
 [http://github.com/ghc/ghc/commit/ae69eaed6e2a5dff7f3a61d4373b7c52e715e3ad
 ae69eaed] - "base: Fix hWaitForInput with timeout on POSIX") works by
 repeatedly inspecting the current time (because `select()` is gone and was
 replaced by `poll()` in commit
 [http://github.com/ghc/ghc/commit/f46369b8a1bf90a3bdc30f2b566c3a7e03672518
 f46369b8], which doesn't have the functionality of advancing a timeout
 pointer even on Linux), so the problem that was described in the original
 issue description is gone for FreeBSD / OSX / any platform that's not
 Windows.

 However, there are remaining problems with the current implementation:

 * It currently uses `gettimeofday()`, which doesn't use a monotonic clock,
 so any time adjustment can make `fdReady()` wait for significantly more or
 less than it should.
 * It keeps track of the total time waited by adding up time differences
 between calls to `gettimeofday()`:

 {{{
     while ((res = poll(fds, 1, msecs)) < 0) {
         if (errno == EINTR) {
             if (msecs > 0) {
                 struct timeval tv;
                 if (gettimeofday(&tv, NULL) != 0) {
                     fprintf(stderr, "fdReady: gettimeofday failed: %s\n",
                             strerror(errno));
                     abort();
                 }

                 int elapsed = 1000 * (tv.tv_sec - tv0.tv_sec)
                             + (tv.tv_usec - tv0.tv_usec) / 1000;
                 msecs -= elapsed;
                 if (msecs <= 0) return 0;
                 tv0 = tv;
             }
         } else {
             return (-1);
         }
     }
 }}}


 This is inaccurate, because the code between `gettimeofday()` and `tv0 =
 tv;` is not tracked. If the process is descheduled by the OS in these
 lines, then that time is "lost" and `fdReady()` will wait way too long.
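
 A further source of lost time can be read directly off the arithmetic in
 the loop above: `elapsed` is an integer number of milliseconds, and `tv0`
 is advanced on every `EINTR`, so any sub-millisecond remainder is
 discarded each time. The following is a minimal simulation of that
 bookkeeping, not GHC code; the function name `simulate_summed_wait_us`
 and the fixed signal period are assumptions made for illustration, and
 simulated microsecond timestamps stand in for `gettimeofday()`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulation (not GHC code) of the summed-interval loop.
 * A signal is assumed to interrupt poll() every period_us microseconds.
 * Returns the simulated time, in microseconds, at which the loop decides
 * that the timeout of `msecs` milliseconds has expired. */
static int64_t simulate_summed_wait_us(int msecs, int64_t period_us)
{
    assert(msecs > 0 && period_us > 0);
    int64_t now = 0;   /* simulated clock, microseconds */
    int64_t t0 = 0;    /* plays the role of tv0 */

    for (;;) {
        /* If the remaining timeout ends before the next signal,
         * poll() times out normally. */
        if (period_us >= (int64_t)msecs * 1000)
            return now + (int64_t)msecs * 1000;

        now += period_us;  /* poll() was interrupted: EINTR */

        /* Same arithmetic as the quoted code, with the timestamps split
         * into seconds and microseconds like a struct timeval. */
        int elapsed = (int)(1000 * (now / 1000000 - t0 / 1000000)
                            + (now % 1000000 - t0 % 1000000) / 1000);
        msecs -= elapsed;
        if (msecs <= 0)
            return now;
        t0 = now;          /* the sub-millisecond remainder is dropped here */
    }
}
```

 With a 5000 ms timeout and an assumed signal every 300 microseconds, the
 simulated loop only makes progress when a second boundary is crossed, so
 it runs far beyond 5 seconds; with a signal every 1000 microseconds
 nothing is discarded and it returns at 5 seconds.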

 This inaccuracy can easily be observed and magnified by increasing the
 frequency of signals arriving at the program. Consider this simple program
 `ghc-bug-13497-accuracy-test.hs` that waits 5 seconds for input on stdin:

 {{{
 import System.IO

 main = hWaitForInput stdin (5 * 1000)
 }}}

 When run normally on GHC 8.2.1 (release commit 0cee25253), this program
 terminates within 5 seconds when run with the command line

 {{{
 inplace/bin/ghc-stage2 --make -fforce-recomp -rtsopts ghc-bug-13497-accuracy-test.hs
 /usr/bin/time ./ghc-bug-13497-accuracy-test
 }}}

 But it starts taking much longer when `+RTS -V...` is added for
 increasingly frequent values of `-V`; the effect is even stronger when
 setting the idle GC timer to something large (e.g. `-I10` for every 10
 seconds):

 {{{
 no `-V` passed  0.00user 0.00system 0:05.02elapsed 0%CPU
 -V0.1           0.00user 0.00system 0:05.01elapsed 0%CPU
 -V0.01          0.00user 0.00system 0:05.01elapsed 0%CPU
 -V0.001         0.00user 0.00system 0:05.13elapsed 0%CPU
 -V0.0001        0.00user 0.00system 0:05.30elapsed 0%CPU
 -V0.00001       0.06user 0.00system 0:05.31elapsed 1%CPU
 -V0.000001      0.37user 0.20system 0:05.73elapsed 10%CPU
 -V0.0000001     2.67user 3.30system 0:12.47elapsed 47%CPU
 -V0.00000001   50.44user 7.32system 1:17.50elapsed 74%CPU

 -I10 -V0.1      0.00user 0.00system 0:05.10elapsed 0%CPU
 -I10 -V0.01     0.00user 0.00system 0:05.25elapsed 0%CPU
 -I10 -V0.001    0.00user 0.10system 0:08.47elapsed 1%CPU
 -I10 -V0.0001   the program did not terminate within 2 minutes of waiting
 }}}

 The reason it's worse with `-I10` is that, as described above, without `-I`
 ghc stops the timer signal after 0.3 seconds (so no `EINTR`s are occurring
 beyond that time), and with `-I` given it doesn't.

 Note that all of the above is in the non-threaded runtime. Doing the same with
 `-threaded` on Linux gives reliable `0.00user 0.00system 0:05.01elapsed
 0%CPU` no matter what `-V` or `-I` is passed, and `strace` shows that
 there are no `EINTR`s happening in that case. I suspect this is because
 `hWaitForInput` calls `fdReady()` as a `safe` foreign call, which makes it
 have its own thread in the threaded runtime.

 There is also an `unsafe` call to `fdReady()` but that one is only used
 with timeouts of `0` so that's not a problem.

 On FreeBSD 11, non-threaded, the situation is worse:

 {{{
 no `-V` passed 5.14 real         0.00 user         0.00 sys
 -V0.1          5.10 real         0.00 user         0.00 sys
 -V0.01         5.16 real         0.00 user         0.00 sys
 -V0.001        5.17 real         0.00 user         0.02 sys
 -V0.0001       5.24 real         0.00 user         0.01 sys
 -V0.00001      5.89 real         0.01 user         0.08 sys
 -V0.000001     5.81 real         0.00 user         0.11 sys
 -V0.0000001    6.05 real         0.00 user         0.09 sys
 -V0.00000001   5.77 real         0.00 user         0.09 sys

 -I10 -V0.1     5.13 real         0.00 user         0.01 sys
 -I10 -V0.01    5.24 real         0.00 user         0.01 sys
 -I10 -V0.001   5.90 real         0.00 user         0.10 sys
 -I10 -V0.0001  5.82 real         0.00 user         0.09 sys
 }}}

 And with `-threaded` on FreeBSD 11:

 {{{
 no `-V` passed 5.15 real         0.00 user         0.01 sys
 -V0.1          5.30 real         0.00 user         0.00 sys
 -V0.01         5.31 real         0.00 user         0.01 sys
 -V0.001        5.45 real         0.00 user         0.13 sys
 -V0.0001       5.98 real         0.00 user         0.13 sys
 -V0.00001      5.93 real         0.00 user         0.15 sys
 -V0.000001     5.79 real         0.00 user         0.15 sys
 -V0.0000001    5.83 real         0.00 user         0.13 sys
 -V0.00000001   5.80 real         0.00 user         0.18 sys

 -I10 -V0.1     5.13 real         0.00 user         0.01 sys
 -I10 -V0.01    5.27 real         0.00 user         0.03 sys
 -I10 -V0.001   5.77 real         0.00 user         0.12 sys
 -I10 -V0.0001  5.90 real         0.00 user         0.18 sys
 }}}

 As you can see, on FreeBSD 11 `-threaded` doesn't fix the issue as it
 does on Linux, and `truss` suggests that this is because `EINTR`s arrive
 (while they didn't on Linux).

 I'm not sure why `EINTR`s happen with `-threaded` on FreeBSD but not on
 Linux, but I observed that on Linux we have instead:

 {{{
 [pid 30502] timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC) = 7
 ...
 [pid 30502] read(7, "\1\0\0\0\0\0\0\0", 8) = 8
 [pid 30502] read(7, "\1\0\0\0\0\0\0\0", 8) = 8
 [pid 30502] read(7, "\1\0\0\0\0\0\0\0", 8) = 8
 [pid 30502] read(7, "\1\0\0\0\0\0\0\0", 8) = 8
 ...
 }}}

 So I suspect that the difference is that Linux has `timerfd` and FreeBSD
 doesn't.

 ----

 OK, so far the problem description. **The summary is:**

 * In the nonthreaded runtime, a high precision `-V` destroys accuracy
 * On non-Linux (systems without `timerfd`), this happens even with
 `-threaded`
 * In any runtime, accuracy can be screwed with due to non-use of monotonic
 clocks.

 **The fix** is simple:

 Use the monotonic clock, and instead of tracking waited time as a sum of
 wait intervals, always compare the current time with the _end_ time (time
 at entry of `fdReady()` + `msecs`).
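
 As a minimal sketch of that idea (not the code in the commit linked
 below), the end time is computed once from the monotonic clock and the
 remaining timeout is recomputed from it after every wakeup. The names
 `monotonic_ns` and `wait_readable_deadline` are hypothetical, and the
 sketch assumes `msecs >= 0` and a platform that provides
 `CLOCK_MONOTONIC`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds; not affected by adjustments of
 * the wall clock. */
static int64_t monotonic_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Hypothetical helper (not GHC's fdReady()): wait until poll() reports an
 * event on fd or until msecs milliseconds have passed. Assumes msecs >= 0.
 * Returns 1 if an event was reported, 0 on timeout, -1 on error. */
static int wait_readable_deadline(int fd, int msecs)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    /* The end time is fixed once, at entry. */
    const int64_t end_ns = monotonic_ns() + (int64_t)msecs * 1000000;

    for (;;) {
        int64_t left_ns = end_ns - monotonic_ns();
        /* Round up so that poll() is never asked to wait for less than
         * what actually remains. */
        int left_ms = left_ns > 0 ? (int)((left_ns + 999999) / 1000000) : 0;

        int res = poll(&pfd, 1, left_ms);
        if (res > 0)
            return 1;
        if (res < 0 && errno != EINTR)
            return -1;
        if (monotonic_ns() >= end_ns)
            return 0;
        /* Interrupted by a signal: retry with the remainder recomputed
         * from the fixed end time, so no time is lost between calls. */
    }
}
```

 Because the remaining time is always derived from the fixed end time,
 neither the number of interruptions nor any delay between two clock reads
 can accumulate into a longer total wait.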

 I have implemented this in commit
 [https://github.com/nh2/ghc/commit/12f9d66b5c837c221be080b526dcb61fecb7cf1c
 12f9d66b] of my branch
 [https://github.com/ghc/ghc/compare/ghc-8.2.1-release...nh2:ghc-8.2.1-improve-fdRready-precision
 ghc-8.2.1-improve-fdRready-precision] (the
 first link is stable, the latter may change as I write more fixes for
 Windows).

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/13497#comment:18>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler

