Strange GHC/STM behaviour

Mon Mar 15 07:58:59 EDT 2010

On 15/03/2010 08:59, Michael Lesniak wrote:
> Hello,
>
>
> One of my example programs shows strange behaviour. It is a very
> simple task pool using STM; in pseudocode:
>
> 1. generate data structures
> 2. initialize data structures
> 3. fork threads
> 4. wait (using STM) until the pool is empty and all threads are finished
> 5. print a final message
>
> In very few cases, depending on the number of threads spawned, the
> program hangs *after* the final message of step 5 has been printed.
> "Few cases" means, for example, 50,000 good, terminating runs before
> it hangs. If you increase the number of spawned threads (to a few
> hundred or a few thousand), it hangs much sooner. Since forked threads
> are terminated when the main thread exits (which it should do after
> printing the message), this behaviour is quite unexpected.
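
For readers without the attachment, the five steps quoted above can be sketched roughly as follows. This is not the original poster's code: it is a minimal illustration assuming the `stm` package, and the names `takeTask`, `worker` and `runPool`, the use of a plain list as the pool, and the summing "work" are all made up for the example.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (replicateM_)

-- | Take one task from the pool, or Nothing if the pool is empty.
takeTask :: TVar [a] -> STM (Maybe a)
takeTask pool = do
  tasks <- readTVar pool
  case tasks of
    []       -> return Nothing
    (t : ts) -> writeTVar pool ts >> return (Just t)

-- | Worker loop: process tasks until the pool is empty, then
-- decrement the count of running workers and exit.
worker :: TVar [Int] -> TVar Int -> TVar Int -> IO ()
worker pool running total = loop
  where
    loop = do
      next <- atomically (takeTask pool)
      case next of
        Nothing -> atomically (modifyTVar' running (subtract 1))
        Just n  -> atomically (modifyTVar' total (+ n)) >> loop

-- | Hypothetical driver covering steps 1 to 4; the "work" is just a sum.
runPool :: Int -> [Int] -> IO Int
runPool nThreads tasks = do
  pool    <- newTVarIO tasks      -- steps 1 and 2: create and fill the pool
  running <- newTVarIO nThreads   -- number of workers still alive
  total   <- newTVarIO 0
  replicateM_ nThreads (forkIO (worker pool running total))  -- step 3
  atomically $ do                 -- step 4: block until everything is done
    remaining <- readTVar pool
    active    <- readTVar running
    check (null remaining && active == 0)
  readTVarIO total

main :: IO ()
main = do
  s <- runPool 8 [1 .. 1000]
  putStrLn ("done, sum = " ++ show s)  -- step 5: final message
```

In a sketch like this, `main` returns right after the final message, so any hang after that point would have to come from runtime shutdown rather than from the STM logic itself. To exercise the threaded runtime it would need to be built with `-threaded`.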

I've fixed three deadlocks since 6.12.1 was released: two were IO
manager-related, and one was caused by an interaction between the
scheduler and GC.  It's likely that one of these is your problem.  All
of them are fixed in 6.12.2, so if you are able to grab a snapshot and
test it, that would be very helpful.

Tue Mar  9 09:58:31 GMT 2010  Simon Marlow <marlowsd at gmail.com>
   * Fix a rare deadlock when the IO manager thread is slow to start up
   This fixes occasional failures of ffi002(threaded1) on a loaded
   machine.

     M ./rts/Capability.c -1 +9

Tue Jan 26 15:00:37 GMT 2010  Simon Marlow <marlowsd at gmail.com>
   * Fix a deadlock, and possibly other problems
   After a bound thread had completed, its TSO remains in the heap until
   it has been GC'd, although the associated Task is returned to the
   caller where it is freed and possibly re-used.

   The bug was that GC was following the pointer to the Task and updating
   the TSO field, meanwhile the Task had already been recycled (it was
   being used by exitScheduler()). Confusion ensued, leading to a very
   occasional deadlock at shutdown, but in principle it could result in
   other crashes too.

   The fix is to remove the link between the TSO and the Task when the
   TSO has completed and the call to schedule() has returned; see
   comments in Schedule.c.

     M ./rts/Schedule.c -3 +18

Thu Feb 25 12:02:55 GMT 2010  Simon Marlow <marlowsd at gmail.com>
   * Plug two race conditions that could lead to deadlocks in the IO manager

     M ./GHC/Conc.lhs -6 +16

> Since I've experienced strange behaviour in the past which was the
> fault of my system configuration[1], I am a bit cautious before
> reporting a bug on GHC's bugtracker, especially since its reproduction
> is so difficult and random.

I've been doing a lot of testing recently that involves running a
program repeatedly in a loop until it goes wrong; such is the nature of
non-deterministic concurrency :-)
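
One possible shape for such a loop, as a rough sketch only: it assumes the GNU coreutils `timeout` command is on the PATH (it exits with status 124 when the time limit is hit), and the helper name `firstBadRun` and the limits used in `main` are invented for illustration.

```haskell
import System.Environment (getArgs)
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)

-- | Run a command up to maxRuns times, each under a time limit enforced
-- by the external coreutils `timeout` command (an assumption of this
-- sketch). Returns the 1-based index and exit code of the first run
-- that did not exit successfully, or Nothing if every run was clean.
firstBadRun :: Int -> Int -> FilePath -> [String] -> IO (Maybe (Int, ExitCode))
firstBadRun maxRuns limitSecs prog args = go 1
  where
    go i
      | i > maxRuns = return Nothing
      | otherwise = do
          (code, _out, _err) <-
            readProcessWithExitCode "timeout" (show limitSecs : prog : args) ""
          case code of
            ExitSuccess -> go (i + 1)
            failure     -> return (Just (i, failure))

main :: IO ()
main = do
  argv <- getArgs
  case argv of
    [] -> putStrLn "usage: looper PROGRAM [ARGS...]"
    (prog : args) -> do
      -- 100000 runs and a 10 second limit are arbitrary example values.
      result <- firstBadRun 100000 10 prog args
      case result of
        Nothing        -> putStrLn "no failing run observed"
        Just (i, code) -> putStrLn ("run " ++ show i ++ " failed: " ++ show code)
```

A run that hangs at shutdown shows up here as `ExitFailure 124`, which makes a rare hang distinguishable from an ordinary crash.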

> So my question is: how much circumspection is expected/needed before
> one should enter a bug in the bug tracker? I've tested the attached
> code on three different systems (running different Linux
> installations, but always GHC 6.12.1, since it's a bit costly to
> install the older versions) and observed the mentioned behaviour. Is
> this enough to justify a bug report?

Sure, by all means submit a bug report.  As mentioned earlier, you might 
be able to avoid doing so if you find that the 6.12.2 snapshot fixes it, 
though.

Cheers,
	Simon