[GHC] #15136: High CPU when asynchronous exception and unblocking retry on TVar raced

GHC ghc-devs at haskell.org
Mon Jun 11 14:09:38 UTC 2018


#15136: High CPU when asynchronous exception and unblocking retry on TVar raced
-------------------------------------+-------------------------------------
        Reporter:  nshimaza          |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  highest           |            Milestone:  8.6.1
       Component:  Runtime System    |              Version:  8.4.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  Runtime crash     |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by carlostome):

 After some investigation I found that indeed there is a deadlock between
 the mutexes of a TSO and a TVar. An excerpt of the debugging log is the
 follwing:

 {{{
 Task 2: cap 1: running thread 7012 (ThreadRunGHC)
 Task 2: STM: 0xad6d08 : stmStartTransaction with 316 tokens
 Task 2: STM: 0xad6d08 : stmStartTransaction()=0x4200224000
 Task 2: STM: 0x4200224000 : stmWriteTVar(0x4200225c68, 0xad5e72)
 Task 2: STM: 0x4200224000 : get_entry_for TVar 0x4200225c68
 Task 2: STM: 0x4200224000 : FOR_EACH_ENTRY, current_chunk=0x420027e4d0
 limit=0
 Task 2: STM: 0x4200224000 : read_current_value(0x4200225c68)=0xad5e69
 Task 2: STM: 0x4200224000 : stmWriteTVar done
 Task 2: STM: 0x4200224000 : stmCommitTransaction()
 Task 2: STM: 0x4200224000 : lock_stm()
 Task 2: STM: 0x4200224000 : FOR_EACH_ENTRY, current_chunk=0x420027e4d0
 limit=1
 Task 2: STM: 0x4200224000 : trying to acquire 0x4200225c68
 Task 2: STM: 0x4200224000 : cond_lock_tvar(0x4200225c68, 0xad5e69)
 -- (1)
 Task 2: STM: 0x4200224000 : success
 Task 2: STM: 0x4200224000 : doing read check
 Task 2: STM: 0x4200224000 : FOR_EACH_ENTRY, current_chunk=0x420027e4d0
 limit=1
 Task 2: STM: 0x4200224000 : read-check succeeded
 Task 2: STM: 0x4200224000 : FOR_EACH_ENTRY, current_chunk=0x420027e4d0
 limit=1
 Task 2: STM: 0x4200224000 : writing 0xad5e72 to 0x4200225c68, waking
 waiters      -- (2)
 Task 2: STM: unpark_waiters_on tvar=0x4200225c68
 Task 2: STM: unpark_waiters_on tvar=0x4200225c68, status=0x4200223b80
 Task 2: STM: cap 1: unpark_waiters_on tvar=0x4200225c68,
 status2=0x4200223b80     -- (3)
 Task 2: STM: cap 1: unpark_tso id 6866
 -- (4)
 Task 1: BlockedOnSTM: lockTSO id 6866
 Task 1: cap 0 BlockedOnSTM: raiseAsync id 6866
 Task 1: cap 0: raising exception in thread 6866.
 Task 1: cap 0: raiseAsync: freezing atomically frame
 -- (5)
 Task 1: STM: 0x4200225f58 : stmAbortTransaction
 -- (6)
 Task 1: STM: 0x4200225f58 : lock_stm()
 Task 1: STM: 0x4200225f58 : aborting top-level transaction cap 0
 Task 1: STM: 0x4200225f58 : aborting top-level transaction
 Task 1: STM: 0x4200225f58 : stmAbortTransaction aborting waiting
 transaction
 Task 1: STM: 0x4200225f58 : remove_watch_queue_entries_for_trec()
 -- (7)
 Task 1: STM: 0x4200225f58 : FOR_EACH_ENTRY, current_chunk=0x4200223cc8
 limit=1
 Task 1: STM: 0x4200225f58 : lock_tvar (0x4200225c68)
 -- (8)
 }}}

 What is happening here has to do with how async exceptions are
 implemented.

 The thread 6866 is waiting on the TVar 0x4200225c68. In the meantime the
 thread 7012 is able to change the value of such TVar to True. To do so,
 first it has to acquire the lock for the TVar (1), then actually change
 its value (2), and finally wake any thread that was waiting on this TVar
 (3). To wake up a concrete waiter it needs to hold its mutex through
 lock_tso() (4).

 The problem is that when an async exception is thrown to a thread whose
 status is BlockedOnSTM, the first thing the thread throwing the exception
 does is to grab the mutex of the target thread with lock_tso()
 (https://github.com/ghc/ghc/blob/93220d46fceabf3afeae36f1fda94e1698c3639a/rts/RaiseAsync.c#L419),
 then because the target thread was in the middle of an STM transaction
 (5) that dependes on the same TVar, it has to abort it (6) for what it has
 to be removed from the TVar watch queue (7) needing the mutex on the TVar
 to do so (8).

 Basically the thread completing the STM grabs first the TVar lock and then
 the TSO lock while for the async exception the locks are taken in reverse
 order.

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/15136#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler


More information about the ghc-tickets mailing list