Lost output in multithreaded program

David Hughes David.W.Hughes@cern.ch
Tue, 29 Jul 2003 12:15:27 +0200


Dear list,

I am puzzled by the behaviour of this (stripped-down, uglified) program. 
  It is supposed to run a bunch of shell commands simultaneously, 
collate their standard output and error, and print their output as 
though they had run sequentially.

     module Main where

     import Control.Concurrent
     import Data.Maybe
     import GHC.Handle
     import GHC.IO
     import Posix (popen)

     main = do
         cmds    <- getContents >>= return.lines
         outCh   <- newChan
         mainThr <- myThreadId
         forkIO $ do
                 mapM_ (startCommand outCh) cmds
                 writeChan outCh $ killThread mainThr
         getChanContents outCh >>= sequence_

     startCommand :: Chan (IO ()) -> String -> IO ()
     startCommand ch cmd = do
         -- Prevent lazy reads after forking
         length cmd `seq` return ()
         (out,err,pid) <- popen "/bin/sh" ["-c",cmd] Nothing
         -- Prevent deadlock (waiting to read stdout
         -- while the child waits to write stderr).
         forkIO $ length err `seq` return ()  -- culprit line?
         writeChan ch $ hPutStr stdout out
         writeChan ch $ hFlush stdout
         writeChan ch $ hPutStr stderr err
         writeChan ch $ hFlush stderr

Most of the time it works but every so often a chunk of child output 
gets lost.  (I'm sure I've seen it duplicating chunks of output too, but 
I haven't been able to isolate a test case.)  Corruption of stderr seems 
more common than corruption of stdout.  The corruption doesn't get more 
predictable if I generate a list of commands with fixed 'random' sleeps 
and use that (so I don't think it's a gross timing issue).  The lost 
data is not necessarily a prefix of the output string, nor a suffix. 
The lost data does not reappear later in the program's output.  The lost 
data *does* appear to correspond to byte sequences that are written in a 
single write() by the child process.

Example:

     $ cat test.sh
     #!/bin/sh
     echo -n "std"        ; sleep 1
     echo -n "STD"    >&2 ; sleep 1
     echo       "out"     ; sleep 1
     echo       "ERR" >&2 ; sleep 1
     $ for i in `seq 100`; do echo "sleep $((RANDOM % 10)); ./test.sh"; 
done | ./a.out
   <snip>
     stdout
     STDERR
     stdout
     ERR
     stdout
     STDERR
   <snip>
     stdout
     STDERR
     stdout
     STDstdout
     STDERR
     stdout
     STDERR
   <snip>

If I comment the 'culprit line' above then the program runs without 
error.  This confounds me: I believe that that line should have no 
effect except to suck the child's stderr into the program as soon as 
possible.  (Am I wrong?  And is there a better way of doing this?)

Platform: ghc-6.0-7 / RedHat 8.0 / x86.
Compilation: ghc -package posix test.hs

Where did my output go?  Have I run into some gotcha like the 'lazy 
reads after forking' thing?  Can someone tell me what I'm doing wrong?

Hopeful thanks,
// David
-- 
David Hughes
UNIX sysadmin, Serco SA         -+-         Tel.: +41 22 767 8997
Computing Centre, CERN          -+-        David.W.Hughes@cern.ch

This message expresses my own opinions and should not be construed
as the opinions of Serco (who employ me) or of CERN (where I work).