[GHC] #12004: Windows unexpected failures

GHC ghc-devs at haskell.org
Wed Nov 30 02:13:54 UTC 2016


#12004: Windows unexpected failures
-------------------------------------+-------------------------------------
        Reporter:  enolan            |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:
      Resolution:                    |             Keywords:
Operating System:  Windows           |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):  Phab:D2684
       Wiki Page:                    |  Phab:D2759
-------------------------------------+-------------------------------------

Comment (by Ben Gamari <ben@…>):

 In [changeset:"0ce59be3a2723f814a3e929fd32a44ff4e890a49/ghc" 0ce59be/ghc]:
 {{{
 #!CommitTicketReference repository="ghc"
 revision="0ce59be3a2723f814a3e929fd32a44ff4e890a49"
 Fix testsuite threading, timeout, encoding and performance issues on
 Windows

 In a land far far away, a project called Cygwin was born.
 Cygwin used newlib as its standard C library implementation.

 But Cygwin wanted to emulate POSIX systems as closely as possible.
 So it implemented `execv` using the Windows function `spawnve`.

 Specifically

 ```
 spawnve (_P_OVERLAY, path, argv, cur_environ ())
 ```

 `_P_OVERLAY` is crucial, as it makes the function behave *sort of*
 like execv on Linux: the child process replaces the original process.

 There is one major difference, caused by the Windows process model:
 the original process signals its caller that it is done.

 This is why the file is still locked. Control was returned because the
 parent process was destroyed, but the child is still running.

 I think it's just pure dumb luck that the older runtimes are slow
 enough to give the process time to terminate before we try deleting
 the file. That also explains why you do see sporadic failures of a test
 or two (like T7307) even on older runtimes like 2.5.0.

 So this patch fixes a couple of things. I leverage the existing
 `timeout.exe` to implement a workaround for this issue.

 a) The old timeout used to start the process and then assign it to the
    job. This is slightly faulty, since child processes are only assigned
    to a job if their parent was assigned at the time they started. So
    this was a race condition. I now create the process suspended, assign
    it to the job and then resume it. This means all child processes are
    now running under the same job.

 b) First, to prevent dangling child processes, I mark the job with
    `JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`, so when the last handle to the
    job is closed, all processes still under the job are killed.

 c) Secondly, I change the way we wait for results. Instead of waiting
    for the parent process to terminate, I wait for the job itself to
    terminate.

    There's a slight subtlety here: we can't wait on the job itself.
    Instead we have to create an I/O completion port and wait for signals
    on it.  See
    https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743

 This fixes the issues on all runtimes for me and makes T7307 pass
 consistently.

 The threading was also simplified by hiding all the locking in a single
 semaphore and a completion class. Furthermore, some additional error
 reporting was added.
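
 As a rough illustration of that idea (not the testsuite driver's actual
 code), the sketch below bounds concurrency with one semaphore and keeps the
 result lock inside a small class. The names `Completion`, `run_tests` and
 `max_threads` are hypothetical and chosen only for this example.

```python
import threading


class Completion:
    """Hypothetical helper: hides the lock that guards the shared results."""

    def __init__(self):
        self._lock = threading.Lock()
        self.results = {}

    def record(self, name, passed):
        with self._lock:
            self.results[name] = passed


def run_tests(tests, max_threads):
    """Run each (name, fn) pair in a thread, at most max_threads at once."""
    slots = threading.BoundedSemaphore(max_threads)
    done = Completion()

    def worker(name, fn):
        try:
            try:
                passed = bool(fn())
            except Exception:
                passed = False  # an exception in a test counts as a failure
            done.record(name, passed)
        finally:
            slots.release()  # always free the slot for the next test

    threads = []
    for name, fn in tests:
        slots.acquire()  # blocks while max_threads tests are already running
        t = threading.Thread(target=worker, args=(name, fn))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return done.results


results = run_tests(
    [("t1", lambda: True), ("t2", lambda: False), ("t3", lambda: 1 / 0)],
    max_threads=2,
)
```

 Callers never touch a lock directly here: the semaphore is the only
 throttle and `Completion` owns the only mutex.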

 For encoding, the testsuite no longer passes a file handle to the
 subprocess, since on Windows sh.exe seems to acquire a lock on the file
 that is not released in a timely fashion.

 I suspect this is because Cygwin seems to emulate console handles by
 creating file handles and using those for std handles. So when we give
 it an existing file handle it just locks the file. I think what's
 happening is that it's not releasing the handle until all shared Cygwin
 processes are dead, which explains why it worked in single-threaded mode.

 So now instead we pass a pipe and do not interpret the resulting data.

 Any bytes written to stdin or read out of stdout/stderr are handled in
 binary mode and we do not interpret the data. The reason for this is
 that we have encoding tests in GHC which pass invalid UTF-8. If we tried
 to handle the data as text, Python would throw an exception instead
 of a test comparison failing.
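
 A minimal sketch of the pipe-plus-bytes approach, assuming Python 3 and
 using only the standard `subprocess` module. The helper `run_binary` and the
 child command are hypothetical stand-ins, not the driver's real code.

```python
import subprocess
import sys


def run_binary(cmd, stdin_bytes=b""):
    """Run cmd with pipes for all std handles and return raw bytes."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate(stdin_bytes)  # bytes in, bytes out
    return proc.returncode, out, err


# Hypothetical child process that emits bytes which are not valid UTF-8.
CHILD = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(b'ok \\xff\\xfe')"]

code, out, err = run_binary(CHILD)
expected = b"ok \xff\xfe"

# Comparing bytes with bytes gives an ordinary pass/fail result.
matches = out == expected

# Decoding the same output as text raises instead of failing a comparison.
try:
    out.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

 Because nothing is decoded, a test with invalid output shows up as a
 mismatch in the comparison rather than as an exception in the driver.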

 Also I have fixed the ability to override `PYTHON` when calling `make
 tests`. This now works the same as with `.\validate`.

 Finally, after cleaning up the locks I was able to make the abort
 behavior work correctly as I believe it was intended: when you press
 Ctrl+C and send an interrupt signal, the testsuite finishes the active
 tests and then gracefully exits, showing you a report of the progress it
 did make. So pressing Ctrl+C no longer just *dies* as it did before.
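
 The behavior described above can be sketched roughly as follows. This is
 an illustration under assumptions, not the driver's implementation:
 `run_until_interrupted`, `summarize` and `fake_schedule` are hypothetical
 names, and Ctrl+C is simulated by raising `KeyboardInterrupt` while tests are
 being scheduled.

```python
import threading


def run_until_interrupted(tests):
    """Start each (name, fn) in a thread; on Ctrl+C stop scheduling, let the
    tests that already started finish, and return the partial results."""
    results = {}
    lock = threading.Lock()
    active = []

    def worker(name, fn):
        passed = bool(fn())
        with lock:
            results[name] = passed

    interrupted = False
    try:
        for name, fn in tests:
            t = threading.Thread(target=worker, args=(name, fn))
            t.start()
            active.append(t)
    except KeyboardInterrupt:
        interrupted = True  # stop scheduling new tests
    for t in active:
        t.join()  # let the active tests run to completion
    return results, interrupted


def summarize(results, total, interrupted):
    status = "interrupted" if interrupted else "complete"
    passed = sum(results.values())
    return "%s: ran %d of %d tests, %d passed" % (
        status, len(results), total, passed)


def fake_schedule():
    """Stand-in for a test list during which the user presses Ctrl+C."""
    yield "t1", lambda: True
    yield "t2", lambda: False
    raise KeyboardInterrupt  # simulates Ctrl+C arriving in the main thread
    yield "t3", lambda: True  # never scheduled


results, interrupted = run_until_interrupted(fake_schedule())
report = summarize(results, 3, interrupted)
print(report)  # prints "interrupted: ran 2 of 3 tests, 1 passed"
```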

 These changes lift the restriction on which Python you use (msys or
 mingw, Python 2 or Python 3) and on which runtime. All combinations
 should now be supported.

 Test Plan:
 PATH=/usr/local/bin:/mingw64/bin:$APPDATA/cabal/bin:$PATH &&
 PYTHON=/usr/bin/python THREADS=9 make test
 THREADS=9 make test
 PATH=/usr/local/bin:/mingw64/bin:$APPDATA/cabal/bin:$PATH &&
 PYTHON=/usr/bin/python ./validate --quiet --testsuite-only

 Reviewers: erikd, RyanGlScott, bgamari, austin

 Subscribers: jrtc27, mpickering, thomie, #ghc_windows_task_force

 Differential Revision: https://phabricator.haskell.org/D2684

 GHC Trac Issues: #12725, #12554, #12661, #12004
 }}}

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12004#comment:8>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler

