Where STM is unstable at the moment, and how we can fix it

Sat Aug 30 14:53:19 EDT 2008

This email is inspired by the discussion here:
http://hackage.haskell.org/trac/ghc/ticket/2401

As the ticket discusses, unsafeIOToSTM is, unlike unsafePerformIO or
unsafeInterleaveIO, genuinely unsafe: there is currently no way to use
it that does not risk a segfault or deadlock. The code attached to the
ticket creates a deadlock solely by using it to write to stdout. But,
for the same reason that unsafeIOToSTM is unstable, unsafeInterleaveIO
is now very unstable as well -- conceivably, data generated by
functions that use lazy IO (including those in the Prelude) could
cause deadlocks, and even segfaults, when forced within STM.

In summary, a "validation" step is performed during garbage collection
on all threads that are inside atomically blocks. On encountering
invalid threads (i.e. ones whose transactions should be rolled back),
this validation step immediately kills the transaction and retries it.
This is different from the implementation described in the STM paper,
where rollbacks only occur on commit. However, it does add a measure
of efficiency. The problem is that the validation code disregards
exception handlers, since rollback is not an exception, and so
anything embedded in STM that brackets an IO action, for example, can
be rolled back without the final (cleanup) part of the bracket ever
being called.
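
To make the hazard concrete, here is a minimal sketch of the kind of
code that is at risk. It is not the code attached to the ticket; the
names loggedIncrement and demo are made up for illustration, and only
functions exported from base (GHC.Conc, Control.Exception) are used.

```haskell
import Control.Exception (bracket_)
import GHC.Conc (STM, TVar, atomically, newTVar, readTVar, writeTVar,
                 unsafeIOToSTM)

-- Hypothetical helper (not from the ticket): bump a counter and log the
-- value that was read, with the logging wrapped in a bracket.
loggedIncrement :: TVar Int -> STM ()
loggedIncrement tv = do
  n <- readTVar tv
  unsafeIOToSTM $
    bracket_
      (putStrLn "acquire")           -- setup
      (putStrLn "release")           -- cleanup: not run if validation aborts
                                     -- the transaction during the body
      (putStrLn ("saw " ++ show n))  -- body
  writeTVar tv (n + 1)

-- Uncontended run: nothing invalidates the transaction, so it commits
-- and the bracket's cleanup runs normally.
demo :: IO Int
demo = do
  tv <- atomically (newTVar 0)
  atomically (loggedIncrement tv)
  atomically (readTVar tv)
```

In a single-threaded run like demo nothing goes wrong. The trouble only
arises when another thread invalidates the transaction and a GC-time
validation lands between "acquire" and "release".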

As Simon M. notes, the obvious solution would be to turn rollbacks  
into regular exceptions, but this would open a number of cans of worms.

A start, though not sufficient, would be for STM validation to
respect blocked status -- not to block on it, obviously, but simply
to refuse to roll back a transaction that is inside block. Validation
on GC is, after all, only an efficiency trick and an implementation
detail, and if it lets the occasional invalid transaction stand due to
its blocked status, that transaction will simply be cleaned up later
anyway.

A more thorough solution would be, as I suggest at the end of the  
ticket, to add a new primitive with similar semantics to block --  
blockRollback, of type STM () -> STM (). Anything that took place  
within blockRollback could not be stopped by validation.
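
As a rough sketch of how such a primitive might look from the user's
side: blockRollback does not exist in GHC, so the definition below is
only a stand-in (the identity function) that records the proposed type.
A real version would need runtime support. recordAndBump is likewise a
made-up name.

```haskell
import GHC.Conc (STM, TVar, atomically, newTVar, readTVar, writeTVar,
                 unsafeIOToSTM)

-- HYPOTHETICAL: stand-in for the proposed primitive. It has the proposed
-- type but provides no protection at all; it is just 'id'.
blockRollback :: STM () -> STM ()
blockRollback = id

-- Hypothetical usage: the embedded IO is wrapped so that, under the
-- proposal, validation could not abort the transaction part-way
-- through it.
recordAndBump :: TVar Int -> STM ()
recordAndBump tv = do
  n <- readTVar tv
  blockRollback (unsafeIOToSTM (putStrLn ("saw " ++ show n)))
  writeTVar tv (n + 1)
```

The point is only that the protected region is delimited by the caller,
much as block delimits a region for asynchronous exceptions.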

Finally, we could "split the difference" between block and
blockRollback, by simply setting a rollbackBlocked flag on a *top
level* invocation of block within STM, and thenceforth not unsetting
it until that block is exited, regardless of calls to unblock nested
inside. This would effectively, without introducing a new primitive,
ensure that rollback did not interrupt a bracketed action part-way
through, and thus would be the solution that handled the lazy IO issue
the best as well.

There are lots of interesting applications of STM that require the
ability to extend its semantics. Doing this is going to require
unsafeIOToSTM, just as unsafePerformIO (or unsafeCoerce, for that
matter) is used on occasion as a low level tool on top of which safer
and better things are built. However, the current state of STM means
that writing these extensions of STM semantics safely is impossible.

I'm not sure which, if any, of the solutions that I'm presenting seems
the most reasonable. However, without some sort of resolution for
this issue, STM is far less powerful and useful than it can and
should be.

--Sterl.