Where STM is unstable at the moment, and how we can fix it

Mon Sep 1 05:39:08 EDT 2008

Sterling Clover wrote:
> This email is inspired by the discussion here: 
> http://hackage.haskell.org/trac/ghc/ticket/2401
> 
> As the ticket discusses, unsafeIOToSTM is, unlike unsafePerformIO or 
> unsafeInterleaveIO, genuinely completely unsafe in that there is no way 
> to use it such that a segfault or deadlock is not at least somewhat 
> encouraged. The code attached to the ticket creates a deadlock solely 
> through using it to write to stdout. But, for the same reason that 
> unsafeIOToSTM is unstable, unsafeInterleaveIO now is very unstable as 
> well -- conceivably, data generated from functions with lazy IO 
> (including those in the prelude) could cause deadlocks within STM, and 
> even segfaults.
> 
> In summary, a "validation" step is performed on all threads inside 
> atomically blocks during garbage collection. This validation step will, 
> on encountering invalid threads (i.e. ones which should be rolled back) 
> immediately kill them dead and retry. This is different than the 
> implementation described in the STM paper, where rollbacks only occur on 
> commit. However, it does add a measure of efficiency.

It's not just an efficiency trick, in fact.  The validation step is 
absolutely necessary for correctness.  The problem is that a transaction 
may have seen an inconsistent view of memory, and as a result it may have 
gone into an infinite loop.  The only way to catch and recover from this 
situation is to validate at regular intervals, say before a GC.  (This 
suffers from the problem that the transaction has to be allocating in order 
to be stopped, but that's another matter.)  For example, the code might be 
something like

   atomically $ do
     a <- readTVar ta
     b <- readTVar tb
     -- loop :: STM () stands for any computation that never terminates
     if a == b then loop else return ()

Now we might know that a is never equal to b under normal conditions, 
because all the transactions in the program preserve that invariant.  
However, since we use optimistic concurrency, this thread might still see 
an inconsistent view of memory in which a == b.  That case would normally 
be caught at commit time, but this thread isn't going to commit: it goes 
into an infinite loop instead.
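
To make the invariant concrete, here is a minimal sketch, not taken from the 
ticket.  The names bump, sawEqual and countViolations are hypothetical, and 
the invariant tb == ta + 1 is chosen only for illustration.  The reader 
returns a Bool instead of looping, so an attempt that saw a == b fails 
validation at commit time and is retried; every committed result should 
therefore be False.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM)

-- Writer (hypothetical): advances both variables in one transaction,
-- so every committed state satisfies tb == ta + 1, hence ta /= tb.
bump :: TVar Int -> TVar Int -> STM ()
bump ta tb = do
  a <- readTVar ta
  writeTVar ta (a + 1)
  writeTVar tb (a + 2)

-- Reader (hypothetical): reports whether this transaction observed a == b.
-- Unlike the looping version, it always reaches its commit point.
sawEqual :: TVar Int -> TVar Int -> STM Bool
sawEqual ta tb = do
  a <- readTVar ta
  b <- readTVar tb
  return (a == b)

-- Runs n writer transactions concurrently with n reader transactions and
-- counts how many committed reads observed a == b.
countViolations :: Int -> IO Int
countViolations n = do
  ta <- newTVarIO 0
  tb <- newTVarIO 1
  done <- newEmptyMVar
  _ <- forkIO $ do
    forM_ [1 .. n] $ \_ -> atomically (bump ta tb)
    putMVar done ()
  results <- replicateM n (atomically (sawEqual ta tb))
  takeMVar done
  return (length (filter id results))

main :: IO ()
main = countViolations 10000 >>= print
```

Replacing the final return in sawEqual with a non-terminating loop gives 
the problematic case: an attempt that sees a == b never reaches the commit 
point where the inconsistency would otherwise be detected.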

> As Simon M. notes, the obvious solution would be to turn rollbacks into 
> regular exceptions, but this would open a number of cans of worms.
> 
> A start, though not sufficient, would be for stm validation to respect 
> blocked status -- not to block on it, obviously, but simply to refuse to 
> rollback a transaction within it.

That wouldn't be correct, because the thread might be in an infinite loop 
inside a block.  However, it would probably work in the cases you're 
interested in, so I wouldn't object to a patch that implemented this 
workaround for the time being.

I do agree that we have a problem here, and I'll re-open the ticket (sorry 
for leaving it closed).  I think raising an (asynchronous) exception is the 
right solution.  We have to make sure the exception cannot be caught by an 
STM catch, but I think that's do-able.
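
For reference, this is roughly what an STM catch does with an ordinary 
exception.  It is a sketch that assumes the catchSTM and throwSTM exported 
by the current stm package; Boom and attempt are hypothetical names.  A 
rollback delivered as an exception would have to bypass a handler like 
this one.

```haskell
import Control.Concurrent.STM
import Control.Exception (Exception)

-- Hypothetical exception type used only for this illustration.
data Boom = Boom deriving Show

instance Exception Boom

-- The inner action writes to the TVar and then throws.  catchSTM runs the
-- handler after discarding the inner action's writes, so the handler reads
-- the value the TVar held before the inner action started.
attempt :: TVar Int -> STM Int
attempt tv =
  (writeTVar tv 99 >> throwSTM Boom)
    `catchSTM` \Boom -> readTVar tv

main :: IO ()
main = do
  tv <- newTVarIO 0
  r <- atomically (attempt tv)
  print r
```

The handler here matches only Boom, but a handler written against 
SomeException would match any exception, which is why the rollback signal 
needs special treatment.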

However, another problem we have is that when the IO system re-raises the 
exception, it'll be raised as a synchronous exception rather than an 
asynchronous exception.  I've just spent an hour or so talking this over 
here with Simon PJ and we have some ideas for fixing it; I'll try to write 
them up in a ticket later.

Cheers,
	Simon