[Haskell-cafe] When is a bug GHC's fault/strange STM behaviour

Sat Mar 13 16:53:38 EST 2010

On Saturday 13 March 2010 17:36:49, Michael Lesniak wrote:
> Hello,
>
> In one of my example programs I have a strange behaviour: it is a very
> simple taskpool using STM; in pseudocode it's
>
> 1. generate data structures
> 2. initialize data structures
> 3. fork threads
> 4. wait (using STM) until the pool is empty and all threads are finished
> 5. print a final message
>
> In very few cases, which depend on the number of threads spawned, the
> program hangs *after* the final message of step 5 has been printed.
> "Few cases" means, for example, 50.000 good, terminating runs before
> it hangs. If you increment the number of spawned threads (to a few
> hundred or thousands), it hangs much faster. Since forked threads
> terminate after the main thread terminates (which it should after
> printing the message), this behaviour is quite unexpected.

I won't pretend I really understand what's going on, but it seems that 
occasionally a couple of threads are caught in a retry-loop. When I have 
each thread print its ThreadId once it's done, only one thread reports 
being done in a run that hangs.

I don't see how that could happen, but that's what I found.

For the attached programme, in the task-getting,

            else if Set.null work
                    then return Nothing
                    else retry

doesn't really make sense: when the channel is empty, we could return 
Nothing right away. I suppose that in the real programme some threads might 
write further tasks to the channel, so that the channel might not be 
permanently empty while not all threads have finished?
If not, "return Nothing" whenever the channel is empty ought to reliably 
end all threads and prevent hanging. If yes, writing strict values to 
working:

get chan working = do
    tid <- myThreadId

    -- atomically commit that this thread is not working anymore (since we
    -- try to get a task, we must be quasi-idle!)
    atomically $ do
        work  <- Set.delete tid `fmap` readTVar working
        writeTVar working $! work

    -- waits for a new task. if all threads are idle and the pool is empty,
    -- return.
    atomically $ do
        empty <- isEmptyTChan chan
        work  <- readTVar working

        if (not empty)
            then do
                task <- readTChan chan
                writeTVar working $! (Set.insert tid work)
                return (Just task)
            else if Set.null work
                    then return Nothing
                    else retry

seems to prevent hanging on my box (it runs fine with "100 64 1 +RTS -N", 
nearing task 60000; without the strict writes it typically hangs after a 
few dozen or hundred runs).
I think the strict write in "writeTVar working $! (Set.insert tid work)" 
isn't necessary, but I haven't yet tested it.
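
To make the lazy/strict distinction concrete, here is a minimal sketch of 
what ($!) changes about a writeTVar. It is not taken from the attached 
programme: the name strictVsLazy and the use of error as a stand-in for an 
expensive thunk are my own assumptions for illustration. As far as I know, 
Data.Set is strict in its spine, so forcing a Set to WHNF evaluates the 
whole structure, which is why one ($!) suffices in the code above.

```haskell
import Control.Concurrent.STM
import Control.Exception (ErrorCall, evaluate, try)
import Data.Either (isLeft)

-- Hypothetical helper, for illustration only. Returns:
--   (lazy write blew up only when the value was read and forced,
--    strict write blew up inside the writing transaction,
--    value left in the TVar afterwards)
strictVsLazy :: IO (Bool, Bool, Int)
strictVsLazy = do
    v <- newTVarIO (0 :: Int)

    -- Lazy write: the TVar now holds an unevaluated thunk. The
    -- transaction commits without ever looking at it; whoever reads
    -- and forces the value later pays for the evaluation.
    atomically $ writeTVar v (error "thunk forced by the reader")
    r1 <- try (readTVarIO v >>= evaluate) :: IO (Either ErrorCall Int)

    -- Strict write: ($!) forces the value to WHNF before writeTVar
    -- runs, so the error surfaces in the writing transaction, which
    -- is rolled back and leaves the previous value in place.
    atomically $ writeTVar v 1
    r2 <- try (atomically $ writeTVar v $! error "thunk forced by the writer")
              :: IO (Either ErrorCall ())

    x <- readTVarIO v
    return (isLeft r1, isLeft r2, x)

main :: IO ()
main = strictVsLazy >>= print   -- prints (True,True,1)
```

This only shows who evaluates the thunk and when; it does not explain why 
a thunk in the TVar would make the task pool hang.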
Why writing a thunk in

    atomically $ do
        work  <- Set.delete tid `fmap` readTVar working
        writeTVar working work

should cause it to hang sometimes, I've no idea. Nor whether that really 
fixes it or it's just a fluke.
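
For anyone who wants to check whether it is a fluke, here is a small 
self-contained harness around the strict version of get. It is a sketch 
under assumptions, not the attached programme: Task, worker, runPool and 
the two counters are names I made up, the tasks are plain Ints that do no 
work, and no thread ever adds new tasks to the channel.

```haskell
import Control.Concurrent (ThreadId, forkIO, myThreadId)
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM_)
import qualified Data.Set as Set

-- Hypothetical task type; the real programme's tasks are not shown.
type Task = Int

-- The strict version of get from above, with type signature added.
get :: TChan Task -> TVar (Set.Set ThreadId) -> IO (Maybe Task)
get chan working = do
    tid <- myThreadId
    -- This thread is idle while it asks for a task.
    atomically $ do
        work <- Set.delete tid `fmap` readTVar working
        writeTVar working $! work
    -- Take a task, or stop once the pool is empty and nobody is working.
    atomically $ do
        noTasks <- isEmptyTChan chan
        work    <- readTVar working
        if not noTasks
            then do
                task <- readTChan chan
                writeTVar working $! Set.insert tid work
                return (Just task)
            else if Set.null work
                    then return Nothing
                    else retry

-- Assumed worker loop: count each task, then count the thread as finished.
worker :: TChan Task -> TVar (Set.Set ThreadId) -> TVar Int -> TVar Int -> IO ()
worker chan working done finished = loop
  where
    loop = do
        mtask <- get chan working
        case mtask of
            Just _  -> atomically (modifyTVar' done (+ 1)) >> loop
            Nothing -> atomically (modifyTVar' finished (+ 1))

-- Fill the pool, fork the workers, block in STM until all have finished,
-- and return the number of tasks that were processed.
runPool :: Int -> Int -> IO Int
runPool nThreads nTasks = do
    chan     <- newTChanIO
    working  <- newTVarIO Set.empty
    done     <- newTVarIO 0
    finished <- newTVarIO 0
    atomically $ forM_ [1 .. nTasks] (writeTChan chan)
    replicateM_ nThreads (forkIO (worker chan working done finished))
    atomically $ readTVar finished >>= check . (== nThreads)
    readTVarIO done

main :: IO ()
main = do
    n <- runPool 8 1000
    putStrLn ("tasks processed: " ++ show n)
```

Compiled with -threaded and run with +RTS -N in a shell loop, this should 
print the message and exit every time. Swapping the ($!) writes for plain 
writeTVar gives the variant to compare against; I have not verified that 
this reduced version reproduces the hang.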

>
> Since I've experienced strange behaviour in the past which was the
> fault of my system configuration[1], I am a bit cautious before
> reporting a bug on GHC's bugtracker, especially since its reproduction
> is so difficult and random.
>
> So my question is how much circumspection is expected/needed before
> one should enter a bug in the bug tracker? I've tested the attached
> code on three different systems (with different linux systems, but
> always GHC 6.12.1 (since it's a bit costly to install the older
> versions)) and observed the mentioned behaviour. Is this enough to
> justify a bug report? Or, on the other hand, could someone spot the

I'd ask such things on glasgow-haskell-users: it has less traffic, it's a 
GHC-specific list, and it's more likely that one of the GHC experts notices 
it there and can tell you whether it's a bug, a feature or an error in your 
code.

> error in the attached code. Given my history with strange parallel
> behaviour, I am much more sure that it's the fault of my code, but I
> can't spot the error and the described behaviour (halting *after* the
> final message) is really strange.
>
>
> Cheers,
>   Michael
>
> [1] http://www.haskell.org/pipermail/haskell-cafe/2010-March/073938.html