[Git][ghc/ghc][wip/tsan/nonmoving] 22 commits: rts: Encapsulate sched_state

Ben Gamari (@bgamari) gitlab at gitlab.haskell.org
Tue Dec 6 12:56:45 UTC 2022



Ben Gamari pushed to branch wip/tsan/nonmoving at Glasgow Haskell Compiler / GHC


Commits:
16a95a38 by Ben Gamari at 2022-12-06T07:56:28-05:00
rts: Encapsulate sched_state

- - - - -
6a4fd558 by Ben Gamari at 2022-12-06T07:56:28-05:00
PrimOps: Fix benign MutVar race

Relaxed ordering is fine here since the later CAS implies a release.

- - - - -
149a91be by Ben Gamari at 2022-12-06T07:56:28-05:00
rts: Style fix

- - - - -
c5b78290 by Ben Gamari at 2022-12-06T07:56:28-05:00
compiler: Use release store in eager blackholing

- - - - -
ab656ae7 by Ben Gamari at 2022-12-06T07:56:28-05:00
rts: Fix ordering of makeStableName

- - - - -
93265cc3 by Ben Gamari at 2022-12-06T07:56:28-05:00
rts: Use ordered accesses instead of explicit barriers

- - - - -
a6ca5913 by Ben Gamari at 2022-12-06T07:56:28-05:00
rts: Statically allocate capabilities

This is a rather simplistic way of solving #17289.

- - - - -
60ffe15f by Ben Gamari at 2022-12-06T07:56:28-05:00
rts: Ensure that all accesses to pending_sync are atomic

- - - - -
5e565a67 by Ben Gamari at 2022-12-06T07:56:28-05:00
rts: Note race with wakeBlockingQueue

- - - - -
1e04cf59 by Ben Gamari at 2022-12-06T07:56:28-05:00
nonmoving: Fix race in marking of blackholes

We must use an acquire-fence when marking to ensure that the indirectee
is visible.

- - - - -
2f938591 by Ben Gamari at 2022-12-06T07:56:28-05:00
nonmoving: Fix segment list races

- - - - -
2726d722 by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Use atomic when looking at bd->gen

Since it may have been mutated by a moving GC.

- - - - -
d3cd49cf by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Eliminate race in bump_static_flag

To ensure that we don't race with a mutator entering a new CAF, we take
the SM mutex before touching static_flag. The other option here would be
to modify newCAF to use a CAS instead, but the present approach is a bit
safer.

- - - - -
6f52b643 by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Ensure that mutable fields have acquire barrier

- - - - -
947feeab by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Fix races in collector status tracking

Mark a number of accesses involved in tracking the status of the
concurrent collection thread as atomic. There are no interesting races
here; the changes are merely necessary to satisfy TSAN.

- - - - -
5b09ee54 by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Make segment state updates atomic

- - - - -
651f6f3f by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Refactor update remembered set initialization

This avoids a lock inversion between the storage manager mutex and
the stable pointer table mutex by not dropping the SM_MUTEX in
nonmovingCollect. This requires quite a bit of rejiggering, but it
does seem like a better strategy.

- - - - -
b2b2008f by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Ensure that we aren't holding locks when closing them

TSAN complains about this sort of thing.

- - - - -
c868eee0 by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Make bitmap accesses atomic

This is a benign race on any sensible hardware since these are byte
accesses. Nevertheless, atomic accesses are necessary to satisfy
TSAN.

- - - - -
87e3d20d by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Fix benign race in update remembered set check

Relaxed load is fine here since we will take the lock before looking at
the list.

- - - - -
a5b1d1d2 by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Fix race in shortcutting

We must use an acquire load to read the info table pointer since if we
find an indirection we must be certain that we see the indirectee.

- - - - -
e090f105 by Ben Gamari at 2022-12-06T07:56:29-05:00
nonmoving: Make free list counter accesses atomic

Since these may race with the allocator(s).

- - - - -


24 changed files:

- compiler/GHC/StgToCmm/Bind.hs
- rts/Capability.c
- rts/Capability.h
- rts/PrimOps.cmm
- rts/Schedule.c
- rts/Schedule.h
- rts/StgMiscClosures.cmm
- rts/eventlog/EventLog.c
- rts/include/Cmm.h
- rts/include/rts/Config.h
- rts/include/rts/storage/ClosureMacros.h
- rts/include/stg/SMP.h
- rts/posix/Select.c
- rts/posix/Signals.c
- rts/sm/GC.c
- rts/sm/NonMoving.c
- rts/sm/NonMoving.h
- rts/sm/NonMovingMark.c
- rts/sm/NonMovingMark.h
- rts/sm/NonMovingShortcut.c
- rts/sm/Storage.c
- rts/win32/AsyncMIO.c
- rts/win32/AwaitEvent.c
- rts/win32/ConsoleHandler.c


Changes:

=====================================
compiler/GHC/StgToCmm/Bind.hs
=====================================
@@ -703,8 +703,8 @@ emitBlackHoleCode node = do
     whenUpdRemSetEnabled $ emitUpdRemSetPushThunk node
     emitStore (cmmOffsetW platform node (fixedHdrSizeW profile)) currentTSOExpr
     -- See Note [Heap memory barriers] in SMP.h.
-    emitPrimCall [] MO_WriteBarrier []
-    emitStore node (CmmReg (CmmGlobal EagerBlackholeInfo))
+    let w = wordWidth platform
+    emitPrimCall [] (MO_AtomicWrite w MemOrderRelease) [node, CmmReg (CmmGlobal EagerBlackholeInfo)]
 
 setupUpdate :: ClosureInfo -> LocalReg -> FCode () -> FCode ()
         -- Nota Bene: this function does not change Node (even if it's a CAF),
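
The new code generation above follows the usual publication pattern: the
payload (the owning TSO) is written first, and the blackhole info pointer
is then published with a release store, so a thread that acquire-loads the
info pointer is guaranteed to also see the payload. A minimal C11 sketch of
the same pattern (the closure layout and names are illustrative only, not
the RTS's):

    #include <stdatomic.h>

    struct closure {
        _Atomic(const void *) info;  /* info-table pointer (closure header) */
        void *payload;               /* owning TSO once blackholed */
    };

    static void eagerly_blackhole(struct closure *c, void *tso,
                                  const void *blackhole_info)
    {
        c->payload = tso;  /* plain store of the payload */
        /* Release store: publishes the payload together with the new info. */
        atomic_store_explicit(&c->info, blackhole_info, memory_order_release);
    }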


=====================================
rts/Capability.c
=====================================
@@ -45,7 +45,7 @@ uint32_t enabled_capabilities = 0;
 // Capabilities, because there may be pointers to them in use
 // (e.g. threads in waitForCapability(), see #8209), so this is
 // an array of Capability* rather than an array of Capability.
-Capability **capabilities = NULL;
+Capability *capabilities[MAX_N_CAPABILITIES];
 
 // Holds the Capability which last became free.  This is used so that
 // an in-call has a chance of quickly finding a free Capability.
@@ -82,7 +82,7 @@ Capability * rts_unsafeGetMyCapability (void)
 STATIC_INLINE bool
 globalWorkToDo (void)
 {
-    return RELAXED_LOAD(&sched_state) >= SCHED_INTERRUPTING
+    return getSchedState() >= SCHED_INTERRUPTING
       || getRecentActivity() == ACTIVITY_INACTIVE; // need to check for deadlock
 }
 #endif
@@ -387,6 +387,12 @@ void initCapabilities (void)
     }
 #endif
 
+    if (RtsFlags.ParFlags.nCapabilities > MAX_N_CAPABILITIES) {
+        errorBelch("warning: this GHC runtime system only supports up to %d capabilities",
+                   MAX_N_CAPABILITIES);
+        RtsFlags.ParFlags.nCapabilities = MAX_N_CAPABILITIES;
+    }
+
     n_capabilities = 0;
     moreCapabilities(0, RtsFlags.ParFlags.nCapabilities);
     n_capabilities = RtsFlags.ParFlags.nCapabilities;
@@ -394,7 +400,6 @@ void initCapabilities (void)
 #else /* !THREADED_RTS */
 
     n_capabilities = 1;
-    capabilities = stgMallocBytes(sizeof(Capability*), "initCapabilities");
     capabilities[0] = &MainCapability;
 
     initCapability(&MainCapability, 0);
@@ -415,8 +420,6 @@ void
 moreCapabilities (uint32_t from USED_IF_THREADS, uint32_t to USED_IF_THREADS)
 {
 #if defined(THREADED_RTS)
-    Capability **new_capabilities = stgMallocBytes(to * sizeof(Capability*), "moreCapabilities");
-
     // We must disable the timer while we do this since the tick handler may
     // call contextSwitchAllCapabilities, which may see the capabilities array
     // as we free it. The alternative would be to protect the capabilities
@@ -428,30 +431,22 @@ moreCapabilities (uint32_t from USED_IF_THREADS, uint32_t to USED_IF_THREADS)
         // THREADED_RTS must work on builds that don't have a mutable
         // BaseReg (eg. unregisterised), so in this case
         // capabilities[0] must coincide with &MainCapability.
-        new_capabilities[0] = &MainCapability;
+        capabilities[0] = &MainCapability;
         initCapability(&MainCapability, 0);
     }
     else
     {
         for (uint32_t i = 0; i < to; i++) {
-            if (i < from) {
-                new_capabilities[i] = capabilities[i];
-            } else {
-                new_capabilities[i] = stgMallocBytes(sizeof(Capability),
+            if (i >= from) {
+                capabilities[i] = stgMallocBytes(sizeof(Capability),
                                                      "moreCapabilities");
-                initCapability(new_capabilities[i], i);
+                initCapability(capabilities[i], i);
             }
         }
     }
 
     debugTrace(DEBUG_sched, "allocated %d more capabilities", to - from);
 
-    Capability **old_capabilities = ACQUIRE_LOAD(&capabilities);
-    RELEASE_STORE(&capabilities, new_capabilities);
-    if (old_capabilities != NULL) {
-        stgFree(old_capabilities);
-    }
-
     startTimer();
 #endif
 }
@@ -581,7 +576,7 @@ releaseCapability_ (Capability* cap,
         // is interrupted, we only create a worker task if there
         // are threads that need to be completed.  If the system is
         // shutting down, we never create a new worker.
-        if (RELAXED_LOAD(&sched_state) < SCHED_SHUTTING_DOWN || !emptyRunQueue(cap)) {
+        if (getSchedState() < SCHED_SHUTTING_DOWN || !emptyRunQueue(cap)) {
             debugTrace(DEBUG_sched,
                        "starting new worker on capability %d", cap->no);
             startWorkerTask(cap);
@@ -1153,7 +1148,7 @@ shutdownCapability (Capability *cap USED_IF_THREADS,
     // isn't safe, for one thing).
 
     for (i = 0; /* i < 50 */; i++) {
-        ASSERT(sched_state == SCHED_SHUTTING_DOWN);
+        ASSERT(getSchedState() == SCHED_SHUTTING_DOWN);
 
         debugTrace(DEBUG_sched,
                    "shutting down capability %d, attempt %d", cap->no, i);
@@ -1285,7 +1280,6 @@ freeCapabilities (void)
 #else
     freeCapability(&MainCapability);
 #endif
-    stgFree(capabilities);
     traceCapsetDelete(CAPSET_OSPROCESS_DEFAULT);
     traceCapsetDelete(CAPSET_CLOCKDOMAIN_DEFAULT);
 }


=====================================
rts/Capability.h
=====================================
@@ -261,11 +261,11 @@ INLINE_HEADER void releaseCapability_ (Capability* cap STG_UNUSED,
 // extern uint32_t enabled_capabilities;
 
 // Array of all the capabilities
-extern Capability **capabilities;
+extern Capability *capabilities[MAX_N_CAPABILITIES];
 
 INLINE_HEADER Capability *getCapability(uint32_t i)
 {
-    return RELAXED_LOAD(&capabilities)[i];
+    return RELAXED_LOAD(&capabilities[i]);
 }
 
 //


=====================================
rts/PrimOps.cmm
=====================================
@@ -776,7 +776,7 @@ stg_atomicModifyMutVar2zh ( gcptr mv, gcptr f )
     StgThunk_payload(y,0) = z;
 
   retry:
-    x = StgMutVar_var(mv);
+    x = %relaxed StgMutVar_var(mv);
     StgThunk_payload(z,1) = x;
 #if defined(THREADED_RTS)
     (h) = prim %cmpxchgW(mv + SIZEOF_StgHeader + OFFSET_StgMutVar_var, x, y);
@@ -829,7 +829,7 @@ stg_atomicModifyMutVarzuzh ( gcptr mv, gcptr f )
     StgThunk_payload(z,0) = f;
 
   retry:
-    x = StgMutVar_var(mv);
+    x = %relaxed StgMutVar_var(mv);
     StgThunk_payload(z,1) = x;
 #if defined(THREADED_RTS)
     (h) = prim %cmpxchgW(mv + SIZEOF_StgHeader + OFFSET_StgMutVar_var, x, z);
@@ -1728,7 +1728,7 @@ stg_takeMVarzh ( P_ mvar /* :: MVar a */ )
         // Write barrier before we make the new MVAR_TSO_QUEUE
         // visible to other cores.
         // See Note [Heap memory barriers]
-        prim_write_barrier;
+        RELEASE_FENCE;
 
         if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) {
             StgMVar_head(mvar) = q;
@@ -1895,7 +1895,7 @@ stg_putMVarzh ( P_ mvar, /* :: MVar a */
 
         SET_HDR(q, stg_MVAR_TSO_QUEUE_info, CCS_SYSTEM);
         // See Note [Heap memory barriers]
-        prim_write_barrier;
+        RELEASE_FENCE;
 
         if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) {
             StgMVar_head(mvar) = q;
@@ -2104,7 +2104,7 @@ stg_readMVarzh ( P_ mvar, /* :: MVar a */ )
 
         SET_HDR(q, stg_MVAR_TSO_QUEUE_info, CCS_SYSTEM);
         // See Note [Heap memory barriers]
-        prim_write_barrier;
+        RELEASE_FENCE;
 
         StgTSO__link(CurrentTSO)       = q;
         StgTSO_block_info(CurrentTSO)  = mvar;
@@ -2237,7 +2237,7 @@ stg_readIOPortzh ( P_ ioport /* :: IOPort a */ )
 
         SET_HDR(q, stg_MVAR_TSO_QUEUE_info, CCS_SYSTEM);
         // See Note [Heap memory barriers]
-        prim_write_barrier;
+        RELEASE_FENCE;
 
         StgMVar_head(ioport) = q;
         StgTSO__link(CurrentTSO)       = q;
@@ -2389,7 +2389,8 @@ stg_makeStableNamezh ( P_ obj )
     /* Is there already a StableName for this heap object?
      *  stable_name_table is a pointer to an array of snEntry structs.
      */
-    if ( snEntry_sn_obj(W_[stable_name_table] + index*SIZEOF_snEntry) == NULL ) {
+    sn_obj = %acquire snEntry_sn_obj(W_[stable_name_table] + index*SIZEOF_snEntry);
+    if (sn_obj  == NULL) {
         // At this point we have a snEntry, but it doesn't look as used to the
         // GC yet because we don't have a StableName object for the sn_obj field
         // (remember that sn_obj == NULL means the entry is free). So if we call
@@ -2406,10 +2407,7 @@ stg_makeStableNamezh ( P_ obj )
         // This will make the StableName# object visible to other threads;
         // be sure that its completely visible to other cores.
         // See Note [Heap memory barriers] in SMP.h.
-        prim_write_barrier;
-        snEntry_sn_obj(W_[stable_name_table] + index*SIZEOF_snEntry) = sn_obj;
-    } else {
-        sn_obj = snEntry_sn_obj(W_[stable_name_table] + index*SIZEOF_snEntry);
+        %release snEntry_sn_obj(W_[stable_name_table] + index*SIZEOF_snEntry) = sn_obj;
     }
 
     return (sn_obj);
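
The makeStableName hunk above is the matching half of the same handshake:
the sn_obj slot is acquire-loaded, and when it is empty the new StableName
object is fully constructed before being published with a release store. A
hedged C11 analogue (the snEntry layout and the mk_stable_name helper are
hypothetical, for illustration only):

    #include <stdatomic.h>
    #include <stddef.h>

    struct sn_entry {
        _Atomic(void *) sn_obj;  /* NULL while the entry is unused */
    };

    /* Return the published StableName object for an entry, creating it if
     * needed. mk_stable_name() stands in for the allocation done in Cmm. */
    static void *get_or_make_stable_name(struct sn_entry *e,
                                         void *(*mk_stable_name)(void))
    {
        void *sn = atomic_load_explicit(&e->sn_obj, memory_order_acquire);
        if (sn == NULL) {
            sn = mk_stable_name();  /* initialize the object completely... */
            /* ...then publish it; this release pairs with the acquire loads
             * performed by other threads reading the slot. */
            atomic_store_explicit(&e->sn_obj, sn, memory_order_release);
        }
        return sn;
    }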


=====================================
rts/Schedule.c
=====================================
@@ -91,7 +91,7 @@ StgWord recent_activity = ACTIVITY_YES;
 /* if this flag is set as well, give up execution
  * LOCK: none (changes monotonically)
  */
-volatile StgWord sched_state = SCHED_RUNNING;
+StgWord sched_state = SCHED_RUNNING;
 
 /*
  * This mutex protects most of the global scheduler data in
@@ -166,7 +166,6 @@ static void deleteAllThreads (void);
 static void deleteThread_(StgTSO *tso);
 #endif
 
-
 /* ---------------------------------------------------------------------------
    Main scheduling loop.
 
@@ -254,7 +253,7 @@ schedule (Capability *initialCapability, Task *task)
     //   * We might be left with threads blocked in foreign calls,
     //     we should really attempt to kill these somehow (TODO).
 
-    switch (RELAXED_LOAD(&sched_state)) {
+    switch (getSchedState()) {
     case SCHED_RUNNING:
         break;
     case SCHED_INTERRUPTING:
@@ -266,7 +265,7 @@ schedule (Capability *initialCapability, Task *task)
         // other Capability did the final GC, or we did it above,
         // either way we can fall through to the SCHED_SHUTTING_DOWN
         // case now.
-        ASSERT(RELAXED_LOAD(&sched_state) == SCHED_SHUTTING_DOWN);
+        ASSERT(getSchedState() == SCHED_SHUTTING_DOWN);
         // fall through
 
     case SCHED_SHUTTING_DOWN:
@@ -321,7 +320,7 @@ schedule (Capability *initialCapability, Task *task)
          */
         awaitEvent (cap, emptyRunQueue(cap));
 #else
-        ASSERT(sched_state >= SCHED_INTERRUPTING);
+        ASSERT(getSchedState() >= SCHED_INTERRUPTING);
 #endif
     }
 #endif
@@ -371,7 +370,7 @@ schedule (Capability *initialCapability, Task *task)
     // killed, kill it now.  This sometimes happens when a finalizer
     // thread is created by the final GC, or a thread previously
     // in a foreign call returns.
-    if (RELAXED_LOAD(&sched_state) >= SCHED_INTERRUPTING &&
+    if (getSchedState() >= SCHED_INTERRUPTING &&
         !(t->what_next == ThreadComplete || t->what_next == ThreadKilled)) {
         deleteThread(t);
     }
@@ -688,7 +687,7 @@ scheduleYield (Capability **pcap, Task *task)
     if (!shouldYieldCapability(cap,task,false) &&
         (!emptyRunQueue(cap) ||
          !emptyInbox(cap) ||
-         RELAXED_LOAD(&sched_state) >= SCHED_INTERRUPTING)) {
+         getSchedState() >= SCHED_INTERRUPTING)) {
         return;
     }
 
@@ -991,7 +990,7 @@ scheduleDetectDeadlock (Capability **pcap, Task *task)
             }
 
             // either we have threads to run, or we were interrupted:
-            ASSERT(!emptyRunQueue(cap) || sched_state >= SCHED_INTERRUPTING);
+            ASSERT(!emptyRunQueue(cap) || getSchedState() >= SCHED_INTERRUPTING);
 
             return;
         }
@@ -1343,7 +1342,7 @@ scheduleHandleThreadFinished (Capability *cap, Task *task, StgTSO *t)
               if (task->incall->ret) {
                   *(task->incall->ret) = NULL;
               }
-              if (RELAXED_LOAD(&sched_state) >= SCHED_INTERRUPTING) {
+              if (getSchedState() >= SCHED_INTERRUPTING) {
                   if (heap_overflow) {
                       task->incall->rstat = HeapExhausted;
                   } else {
@@ -1431,7 +1430,7 @@ void stopAllCapabilitiesWith (Capability **pCap, Task *task, SyncType sync_type)
 
     acquireAllCapabilities(pCap ? *pCap : NULL, task);
 
-    pending_sync = 0;
+    RELAXED_STORE(&pending_sync, 0);
     signalCondition(&sync_finished_cond);
 }
 #endif
@@ -1603,7 +1602,7 @@ scheduleDoGC (Capability **pcap, Task *task USED_IF_THREADS,
       // cycle.
 #endif
 
-    if (RELAXED_LOAD(&sched_state) == SCHED_SHUTTING_DOWN) {
+    if (getSchedState() == SCHED_SHUTTING_DOWN) {
         // The final GC has already been done, and the system is
         // shutting down.  We'll probably deadlock if we try to GC
         // now.
@@ -1622,7 +1621,7 @@ scheduleDoGC (Capability **pcap, Task *task USED_IF_THREADS,
     major_gc = (collect_gen == RtsFlags.GcFlags.generations-1);
 
 #if defined(THREADED_RTS)
-    if (RELAXED_LOAD(&sched_state) < SCHED_INTERRUPTING
+    if (getSchedState() < SCHED_INTERRUPTING
         && RtsFlags.ParFlags.parGcEnabled
         && collect_gen >= RtsFlags.ParFlags.parGcGen
         && ! oldest_gen->mark)
@@ -1715,7 +1714,7 @@ scheduleDoGC (Capability **pcap, Task *task USED_IF_THREADS,
             }
             if (was_syncing &&
                 (prev_sync == SYNC_GC_SEQ || prev_sync == SYNC_GC_PAR) &&
-                !(RELAXED_LOAD(&sched_state) == SCHED_INTERRUPTING && force_major)) {
+                !(getSchedState() == SCHED_INTERRUPTING && force_major)) {
                 // someone else had a pending sync request for a GC, so
                 // let's assume GC has been done and we don't need to GC
                 // again.
@@ -1723,7 +1722,7 @@ scheduleDoGC (Capability **pcap, Task *task USED_IF_THREADS,
                 // need to do the final GC.
                 return;
             }
-            if (RELAXED_LOAD(&sched_state) == SCHED_SHUTTING_DOWN) {
+            if (getSchedState() == SCHED_SHUTTING_DOWN) {
                 // The scheduler might now be shutting down.  We tested
                 // this above, but it might have become true since then as
                 // we yielded the capability in requestSync().
@@ -1826,7 +1825,7 @@ delete_threads_and_gc:
      * threads in the system.
      * Checking for major_gc ensures that the last GC is major.
      */
-    if (RELAXED_LOAD(&sched_state) == SCHED_INTERRUPTING && major_gc) {
+    if (getSchedState() == SCHED_INTERRUPTING && major_gc) {
         deleteAllThreads();
 #if defined(THREADED_RTS)
         // Discard all the sparks from every Capability.  Why?
@@ -1840,7 +1839,7 @@ delete_threads_and_gc:
             discardSparksCap(getCapability(i));
         }
 #endif
-        RELAXED_STORE(&sched_state, SCHED_SHUTTING_DOWN);
+        setSchedState(SCHED_SHUTTING_DOWN);
     }
 
     /*
@@ -1877,7 +1876,7 @@ delete_threads_and_gc:
 #if defined(THREADED_RTS)
     // reset pending_sync *before* GC, so that when the GC threads
     // emerge they don't immediately re-enter the GC.
-    pending_sync = 0;
+    RELAXED_STORE(&pending_sync, 0);
     signalCondition(&sync_finished_cond);
     GarbageCollect(collect_gen, heap_census, is_overflow_gc, deadlock_detect, gc_type, cap, idle_cap);
 #else
@@ -1885,7 +1884,7 @@ delete_threads_and_gc:
 #endif
 
     // If we're shutting down, don't leave any idle GC work to do.
-    if (RELAXED_LOAD(&sched_state) == SCHED_SHUTTING_DOWN) {
+    if (getSchedState() == SCHED_SHUTTING_DOWN) {
         doIdleGCWork(cap, true /* all of it */);
     }
 
@@ -1962,7 +1961,7 @@ delete_threads_and_gc:
         releaseGCThreads(cap, idle_cap);
     }
 #endif
-    if (heap_overflow && RELAXED_LOAD(&sched_state) == SCHED_RUNNING) {
+    if (heap_overflow && getSchedState() == SCHED_RUNNING) {
         // GC set the heap_overflow flag.  We should throw an exception if we
         // can, or shut down otherwise.
 
@@ -1974,7 +1973,7 @@ delete_threads_and_gc:
             // shutdown now.  Ultimately we want the main thread to return to
             // its caller with HeapExhausted, at which point the caller should
             // call hs_exit().  The first step is to delete all the threads.
-            RELAXED_STORE(&sched_state, SCHED_INTERRUPTING);
+            setSchedState(SCHED_INTERRUPTING);
             goto delete_threads_and_gc;
         }
 
@@ -2720,7 +2719,7 @@ startWorkerTasks (uint32_t from USED_IF_THREADS, uint32_t to USED_IF_THREADS)
 void
 initScheduler(void)
 {
-  sched_state    = SCHED_RUNNING;
+  setSchedState(SCHED_RUNNING);
   setRecentActivity(ACTIVITY_YES);
 
 
@@ -2763,8 +2762,8 @@ exitScheduler (bool wait_foreign USED_IF_THREADS)
     Task *task = newBoundTask();
 
     // If we haven't killed all the threads yet, do it now.
-    if (RELAXED_LOAD(&sched_state) < SCHED_SHUTTING_DOWN) {
-        RELAXED_STORE(&sched_state, SCHED_INTERRUPTING);
+    if (getSchedState() < SCHED_SHUTTING_DOWN) {
+        setSchedState(SCHED_INTERRUPTING);
         nonmovingStop();
         Capability *cap = task->cap;
         waitForCapability(&cap,task);
@@ -2772,7 +2771,7 @@ exitScheduler (bool wait_foreign USED_IF_THREADS)
         ASSERT(task->incall->tso == NULL);
         releaseCapability(cap);
     }
-    ASSERT(sched_state == SCHED_SHUTTING_DOWN);
+    ASSERT(getSchedState() == SCHED_SHUTTING_DOWN);
 
     shutdownCapabilities(task, wait_foreign);
 
@@ -2851,8 +2850,8 @@ performMajorGC(void)
 void
 interruptStgRts(void)
 {
-    ASSERT(sched_state != SCHED_SHUTTING_DOWN);
-    sched_state = SCHED_INTERRUPTING;
+    ASSERT(getSchedState() != SCHED_SHUTTING_DOWN);
+    setSchedState(SCHED_INTERRUPTING);
     interruptAllCapabilities();
 #if defined(THREADED_RTS)
     wakeUpRts();


=====================================
rts/Schedule.h
=====================================
@@ -64,11 +64,23 @@ void releaseAllCapabilities(uint32_t n, Capability *keep_cap, Task *task);
 /* The state of the scheduler.  This is used to control the sequence
  * of events during shutdown.  See Note [shutdown] in Schedule.c.
  */
-#define SCHED_RUNNING       0  /* running as normal */
-#define SCHED_INTERRUPTING  1  /* before threads are deleted */
-#define SCHED_SHUTTING_DOWN 2  /* final shutdown */
+enum SchedState {
+    SCHED_RUNNING       = 0,  /* running as normal */
+    SCHED_INTERRUPTING  = 1,  /* before threads are deleted */
+    SCHED_SHUTTING_DOWN = 2,  /* final shutdown */
+};
+
+extern StgWord sched_state;
 
-extern volatile StgWord sched_state;
+INLINE_HEADER void setSchedState(enum SchedState ss)
+{
+    SEQ_CST_STORE_ALWAYS(&sched_state, (StgWord) ss);
+}
+
+INLINE_HEADER enum SchedState getSchedState(void)
+{
+    return (enum SchedState) SEQ_CST_LOAD_ALWAYS(&sched_state);
+}
 
 /*
  * flag that tracks whether we have done any execution in this time
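
In portable C11 terms, the new accessors roughly amount to wrapping the
scheduler-state word in sequentially consistent atomic loads and stores.
A sketch of the idea (not the RTS's SEQ_CST_*_ALWAYS macros themselves):

    #include <stdatomic.h>

    enum SchedState {
        SCHED_RUNNING       = 0,
        SCHED_INTERRUPTING  = 1,
        SCHED_SHUTTING_DOWN = 2,
    };

    static _Atomic(unsigned long) sched_state_word = SCHED_RUNNING;

    static inline void set_sched_state(enum SchedState ss)
    {
        atomic_store_explicit(&sched_state_word, (unsigned long) ss,
                              memory_order_seq_cst);
    }

    static inline enum SchedState get_sched_state(void)
    {
        return (enum SchedState) atomic_load_explicit(&sched_state_word,
                                                      memory_order_seq_cst);
    }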


=====================================
rts/StgMiscClosures.cmm
=====================================
@@ -540,7 +540,8 @@ retry:
         return (p);
     }
 
-    info = GET_INFO(p);
+    // May race with OVERWRITE_INFO in wakeBlockingQueue()
+    info = %relaxed GET_INFO(p);
     if (info == stg_IND_info) {
         // This could happen, if e.g. we got a BLOCKING_QUEUE that has
         // just been replaced with an IND by another thread in


=====================================
rts/eventlog/EventLog.c
=====================================
@@ -478,7 +478,7 @@ endEventLogging(void)
     //
     // N.B. Don't flush if shutting down: this was done in
     // finishCapEventLogging and the capabilities have already been freed.
-    if (sched_state != SCHED_SHUTTING_DOWN) {
+    if (getSchedState() != SCHED_SHUTTING_DOWN) {
         flushEventLog(NULL);
     }
 


=====================================
rts/include/Cmm.h
=====================================
@@ -278,8 +278,7 @@
 // "used".
 
 #define LOAD_INFO_ACQUIRE(ret,x)                \
-    info = %INFO_PTR(UNTAG(x));                 \
-    prim_read_barrier;
+    info = %acquire StgHeader_info(UNTAG(x));
 
 #define UNTAG_IF_PROF(x) UNTAG(x)
 
@@ -289,8 +288,7 @@
   if (GETTAG(x) != 0) {                         \
       ret(x);                                   \
   }                                             \
-  info = %INFO_PTR(x);                          \
-  prim_read_barrier;
+  info = %acquire StgHeader_info(x);
 
 #define UNTAG_IF_PROF(x) (x) /* already untagged */
 


=====================================
rts/include/rts/Config.h
=====================================
@@ -76,3 +76,9 @@ code.
 #if defined(DEBUG)
 #define PROF_SPIN
 #endif
+
+#if defined(THREADED_RTS)
+#define MAX_N_CAPABILITIES 256
+#else
+#define MAX_N_CAPABILITIES 1
+#endif


=====================================
rts/include/rts/storage/ClosureMacros.h
=====================================
@@ -97,6 +97,12 @@ EXTERN_INLINE const StgInfoTable *get_itbl(const StgClosure *c)
     return INFO_PTR_TO_STRUCT(RELAXED_LOAD(&c->header.info));
 }
 
+EXTERN_INLINE const StgInfoTable *get_itbl_acquire(const StgClosure *c);
+EXTERN_INLINE const StgInfoTable *get_itbl_acquire(const StgClosure *c)
+{
+    return INFO_PTR_TO_STRUCT(ACQUIRE_LOAD(&c->header.info));
+}
+
 EXTERN_INLINE const StgRetInfoTable *get_ret_itbl(const StgClosure *c);
 EXTERN_INLINE const StgRetInfoTable *get_ret_itbl(const StgClosure *c)
 {


=====================================
rts/include/stg/SMP.h
=====================================
@@ -574,6 +574,7 @@ load_load_barrier(void) {
 // These are typically necessary only in very specific cases (e.g. WSDeque)
 // where the ordered operations aren't expressive enough to capture the desired
 // ordering.
+#define ACQUIRE_FENCE() __atomic_thread_fence(__ATOMIC_ACQUIRE)
 #define RELEASE_FENCE() __atomic_thread_fence(__ATOMIC_RELEASE)
 #define SEQ_CST_FENCE() __atomic_thread_fence(__ATOMIC_SEQ_CST)
 
@@ -608,6 +609,7 @@ EXTERN_INLINE void load_load_barrier () {} /* nothing */
 #define SEQ_CST_RELAXED_CAS(p,o,n) cas(p,o,n)
 
 // Fences
+#define ACQUIRE_FENCE()
 #define RELEASE_FENCE()
 #define SEQ_CST_FENCE()
 


=====================================
rts/posix/Select.c
=====================================
@@ -362,7 +362,7 @@ awaitEvent(Capability *cap, bool wait)
 
           /* we were interrupted, return to the scheduler immediately.
            */
-          if (sched_state >= SCHED_INTERRUPTING) {
+          if (getSchedState() >= SCHED_INTERRUPTING) {
               return; /* still hold the lock */
           }
 
@@ -459,7 +459,7 @@ awaitEvent(Capability *cap, bool wait)
           }
       }
 
-    } while (wait && sched_state == SCHED_RUNNING
+    } while (wait && getSchedState() == SCHED_RUNNING
                   && emptyRunQueue(cap));
 }
 


=====================================
rts/posix/Signals.c
=====================================
@@ -350,7 +350,7 @@ anyUserHandlers(void)
 void
 awaitUserSignals(void)
 {
-    while (!signals_pending() && sched_state == SCHED_RUNNING) {
+    while (!signals_pending() && getSchedState() == SCHED_RUNNING) {
         pause();
     }
 }
@@ -521,7 +521,7 @@ shutdown_handler(int sig STG_UNUSED)
     // If we're already trying to interrupt the RTS, terminate with
     // extreme prejudice.  So the first ^C tries to exit the program
     // cleanly, and the second one just kills it.
-    if (sched_state >= SCHED_INTERRUPTING) {
+    if (getSchedState() >= SCHED_INTERRUPTING) {
         stg_exit(EXIT_INTERRUPTED);
     } else {
         interruptStgRts();


=====================================
rts/sm/GC.c
=====================================
@@ -355,7 +355,7 @@ GarbageCollect (uint32_t collect_gen,
   deadlock_detect_gc = deadlock_detect;
 
 #if defined(THREADED_RTS)
-  if (major_gc && RtsFlags.GcFlags.useNonmoving && concurrent_coll_running) {
+  if (major_gc && RtsFlags.GcFlags.useNonmoving && RELAXED_LOAD(&concurrent_coll_running)) {
       /* If there is already a concurrent major collection running then
        * there is no benefit to starting another.
        * TODO: Catch heap-size runaway.
@@ -833,23 +833,18 @@ GarbageCollect (uint32_t collect_gen,
     live_blocks += genLiveBlocks(gen);
 
     // add in the partial blocks in the gen_workspaces
-    {
-        uint32_t i;
-        for (i = 0; i < getNumCapabilities(); i++) {
-            live_words  += gcThreadLiveWords(i, gen->no);
-            live_blocks += gcThreadLiveBlocks(i, gen->no);
-        }
+    for (uint32_t i = 0; i < getNumCapabilities(); i++) {
+        live_words  += gcThreadLiveWords(i, gen->no);
+        live_blocks += gcThreadLiveBlocks(i, gen->no);
     }
   } // for all generations
 
   // Flush the update remembered sets. See Note [Eager update remembered set
   // flushing] in NonMovingMark.c
   if (RtsFlags.GcFlags.useNonmoving) {
-      RELEASE_SM_LOCK;
       for (n = 0; n < getNumCapabilities(); n++) {
-          nonmovingAddUpdRemSetBlocks(&getCapability(n)->upd_rem_set.queue);
+          nonmovingAddUpdRemSetBlocks(&getCapability(n)->upd_rem_set);
       }
-      ACQUIRE_SM_LOCK;
   }
 
   // Mark and sweep the oldest generation.
@@ -870,8 +865,6 @@ GarbageCollect (uint32_t collect_gen,
       // old_weak_ptr_list should be empty.
       ASSERT(oldest_gen->old_weak_ptr_list == NULL);
 
-      // we may need to take the lock to allocate mark queue blocks
-      RELEASE_SM_LOCK;
       // dead_weak_ptr_list contains weak pointers with dead keys. Those need to
       // be kept alive because we'll use them in finalizeSchedulers(). Similarly
       // resurrected_threads are also going to be used in resurrectedThreads()
@@ -881,10 +874,9 @@ GarbageCollect (uint32_t collect_gen,
 #if !defined(THREADED_RTS)
       // In the non-threaded runtime this is the only time we push to the
       // upd_rem_set
-      nonmovingAddUpdRemSetBlocks(&gct->cap->upd_rem_set.queue);
+      nonmovingAddUpdRemSetBlocks(&gct->cap->upd_rem_set);
 #endif
       nonmovingCollect(&dead_weak_ptr_list, &resurrected_threads);
-      ACQUIRE_SM_LOCK;
   }
 
   // Update the max size of older generations after a major GC:


=====================================
rts/sm/NonMoving.c
=====================================
@@ -85,6 +85,10 @@ Mutex concurrent_coll_finished_lock;
  *  - A set of *filled* segments, which contain no unallocated blocks and will
  *    be collected during the next major GC cycle
  *
+ * These sets are maintained as atomic singly-linked lists. This is not
+ * susceptible to the ABA problem since we are guaranteed to push a given
+ * segment to a list only once per garbage collection cycle.
+ *
  * Storage for segments is allocated using the block allocator using an aligned
  * group of NONMOVING_SEGMENT_BLOCKS blocks. This makes the task of locating
  * the segment header for a clone a simple matter of bit-masking (as
@@ -518,7 +522,7 @@ static void nonmovingInitSegment(struct NonmovingSegment *seg, uint8_t log_block
 void nonmovingPushFreeSegment(struct NonmovingSegment *seg)
 {
     // See Note [Live data accounting in nonmoving collector].
-    if (nonmovingHeap.n_free > NONMOVING_MAX_FREE) {
+    if (RELAXED_LOAD(&nonmovingHeap.n_free) > NONMOVING_MAX_FREE) {
         bdescr *bd = Bdescr((StgPtr) seg);
         ACQUIRE_SM_LOCK;
         ASSERT(oldest_gen->n_blocks >= bd->blocks);
@@ -543,7 +547,7 @@ void nonmovingPushFreeSegment(struct NonmovingSegment *seg)
 static struct NonmovingSegment *nonmovingPopFreeSegment(void)
 {
     while (true) {
-        struct NonmovingSegment *seg = nonmovingHeap.free;
+        struct NonmovingSegment *seg = ACQUIRE_LOAD(&nonmovingHeap.free);
         if (seg == NULL) {
             return NULL;
         }
@@ -641,13 +645,15 @@ static bool advance_next_free(struct NonmovingSegment *seg, const unsigned int b
 static struct NonmovingSegment *pop_active_segment(struct NonmovingAllocator *alloca)
 {
     while (true) {
-        struct NonmovingSegment *seg = alloca->active;
+        // Synchronizes with CAS in nonmovingPushActiveSegment
+        struct NonmovingSegment *seg = ACQUIRE_LOAD(&alloca->active);
         if (seg == NULL) {
             return NULL;
         }
+        struct NonmovingSegment *next = RELAXED_LOAD(&seg->link);
         if (cas((StgVolatilePtr) &alloca->active,
                 (StgWord) seg,
-                (StgWord) seg->link) == (StgWord) seg) {
+                (StgWord) next) == (StgWord) seg) {
             return seg;
         }
     }
@@ -742,11 +748,12 @@ void nonmovingStop(void)
 {
     if (! RtsFlags.GcFlags.useNonmoving) return;
 #if defined(THREADED_RTS)
-    if (mark_thread) {
+    if (RELAXED_LOAD(&mark_thread)) {
         debugTrace(DEBUG_nonmoving_gc,
                    "waiting for nonmoving collector thread to terminate");
         ACQUIRE_LOCK(&concurrent_coll_finished_lock);
         waitCondition(&concurrent_coll_finished, &concurrent_coll_finished_lock);
+        RELEASE_LOCK(&concurrent_coll_finished_lock);
     }
 #endif
 }
@@ -759,6 +766,9 @@ void nonmovingExit(void)
     nonmovingStop();
 
 #if defined(THREADED_RTS)
+    ACQUIRE_LOCK(&nonmoving_collection_mutex);
+    RELEASE_LOCK(&nonmoving_collection_mutex);
+
     closeMutex(&concurrent_coll_finished_lock);
     closeCondition(&concurrent_coll_finished);
     closeMutex(&nonmoving_collection_mutex);
@@ -892,7 +902,7 @@ static void nonmovingPrepareMark(void)
 static void nonmovingMarkWeakPtrList(MarkQueue *mark_queue, StgWeak *dead_weak_ptr_list)
 {
     for (StgWeak *w = oldest_gen->weak_ptr_list; w; w = w->link) {
-        markQueuePushClosure_(mark_queue, (StgClosure*)w);
+        markQueuePushClosureGC(mark_queue, (StgClosure*)w);
         // Do not mark finalizers and values here, those fields will be marked
         // in `nonmovingMarkDeadWeaks` (for dead weaks) or
         // `nonmovingTidyWeaks` (for live weaks)
@@ -911,8 +921,14 @@ static void nonmovingMarkWeakPtrList(MarkQueue *mark_queue, StgWeak *dead_weak_p
     // - So, to be able to traverse `dead_weak_ptr_list` and run finalizers we
     //   need to mark it.
     for (StgWeak *w = dead_weak_ptr_list; w; w = w->link) {
-        markQueuePushClosure_(mark_queue, (StgClosure*)w);
-        nonmovingMarkDeadWeak(mark_queue, w);
+        markQueuePushClosureGC(mark_queue, (StgClosure*)w);
+
+        // Mark the value and finalizer since they will be needed regardless of
+        // whether we find the weak is live.
+        if (w->cfinalizers != &stg_NO_FINALIZER_closure) {
+            markQueuePushClosureGC(mark_queue, w->value);
+        }
+        markQueuePushClosureGC(mark_queue, w->finalizer);
     }
 }
 
@@ -921,7 +937,7 @@ void nonmovingCollect(StgWeak **dead_weaks, StgTSO **resurrected_threads)
 #if defined(THREADED_RTS)
     // We can't start a new collection until the old one has finished
     // We also don't run in final GC
-    if (concurrent_coll_running || sched_state > SCHED_RUNNING) {
+    if (RELAXED_LOAD(&concurrent_coll_running) || getSchedState() > SCHED_RUNNING) {
         return;
     }
 #endif
@@ -953,7 +969,7 @@ void nonmovingCollect(StgWeak **dead_weaks, StgTSO **resurrected_threads)
 
     // Mark threads resurrected during moving heap scavenging
     for (StgTSO *tso = *resurrected_threads; tso != END_TSO_QUEUE; tso = tso->global_link) {
-        markQueuePushClosure_(mark_queue, (StgClosure*)tso);
+        markQueuePushClosureGC(mark_queue, (StgClosure*)tso);
     }
     trace(TRACE_nonmoving_gc, "Finished marking roots for nonmoving GC");
 
@@ -994,14 +1010,16 @@ void nonmovingCollect(StgWeak **dead_weaks, StgTSO **resurrected_threads)
     // again for the sync if we let it go, because it'll immediately start doing
     // a major GC, because that's what we do when exiting scheduler (see
     // exitScheduler()).
-    if (sched_state == SCHED_RUNNING) {
-        concurrent_coll_running = true;
+    if (getSchedState() == SCHED_RUNNING) {
+        RELAXED_STORE(&concurrent_coll_running, true);
         nonmoving_write_barrier_enabled = true;
         debugTrace(DEBUG_nonmoving_gc, "Starting concurrent mark thread");
-        if (createOSThread(&mark_thread, "non-moving mark thread",
+        OSThreadId thread;
+        if (createOSThread(&thread, "non-moving mark thread",
                            nonmovingConcurrentMark, mark_queue) != 0) {
             barf("nonmovingCollect: failed to spawn mark thread: %s", strerror(errno));
         }
+        RELAXED_STORE(&mark_thread, thread);
     } else {
         nonmovingConcurrentMark(mark_queue);
     }
@@ -1086,7 +1104,7 @@ static void nonmovingMark_(MarkQueue *mark_queue, StgWeak **dead_weaks, StgTSO *
     Task *task = newBoundTask();
 
     // If at this point if we've decided to exit then just return
-    if (sched_state > SCHED_RUNNING) {
+    if (getSchedState() > SCHED_RUNNING) {
         // Note that we break our invariants here and leave segments in
         // nonmovingHeap.sweep_list, don't free nonmoving_large_objects etc.
         // However because we won't be running mark-sweep in the final GC this
@@ -1240,12 +1258,12 @@ finish:
     exitMyTask();
 
     // We are done...
-    mark_thread = 0;
+    RELAXED_STORE(&mark_thread, 0);
     stat_endNonmovingGc();
 
     // Signal that the concurrent collection is finished, allowing the next
     // non-moving collection to proceed
-    concurrent_coll_running = false;
+    RELAXED_STORE(&concurrent_coll_running, false);
     signalCondition(&concurrent_coll_finished);
     RELEASE_LOCK(&nonmoving_collection_mutex);
 #endif
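
The ACQUIRE_LOCK/RELEASE_LOCK pair added to nonmovingExit serves as a
rendezvous with the concurrent mark thread: the acquire cannot succeed
while the mark thread still holds nonmoving_collection_mutex, so by the
time closeMutex is reached the lock is no longer held. A hedged pthreads
sketch of the idiom (not the RTS's Mutex wrappers):

    #include <pthread.h>

    static pthread_mutex_t coll_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void shutdown_collector(void)
    {
        /* Wait until the thread currently holding the mutex (if any) lets
         * go of it, then drop it again immediately... */
        pthread_mutex_lock(&coll_mutex);
        pthread_mutex_unlock(&coll_mutex);
        /* ...so that destroying the mutex cannot race with a holder. */
        pthread_mutex_destroy(&coll_mutex);
    }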


=====================================
rts/sm/NonMoving.h
=====================================
@@ -44,7 +44,7 @@ enum NonmovingSegmentState {
     FREE, CURRENT, ACTIVE, FILLED, FILLED_SWEEPING
 };
 
-#define SET_SEGMENT_STATE(seg, st) (seg)->state = (st)
+#define SET_SEGMENT_STATE(seg, st) RELAXED_STORE(&(seg)->state, (st))
 #else
 #define SET_SEGMENT_STATE(_seg,_st)
 #endif
@@ -169,7 +169,7 @@ INLINE_HEADER void nonmovingPushActiveSegment(struct NonmovingSegment *seg)
         nonmovingHeap.allocators[nonmovingSegmentLogBlockSize(seg) - NONMOVING_ALLOCA0];
     SET_SEGMENT_STATE(seg, ACTIVE);
     while (true) {
-        struct NonmovingSegment *current_active = (struct NonmovingSegment*)VOLATILE_LOAD(&alloc->active);
+        struct NonmovingSegment *current_active = RELAXED_LOAD(&alloc->active);
         seg->link = current_active;
         if (cas((StgVolatilePtr) &alloc->active, (StgWord) current_active, (StgWord) seg) == (StgWord) current_active) {
             break;
@@ -184,8 +184,8 @@ INLINE_HEADER void nonmovingPushFilledSegment(struct NonmovingSegment *seg)
         nonmovingHeap.allocators[nonmovingSegmentLogBlockSize(seg) - NONMOVING_ALLOCA0];
     SET_SEGMENT_STATE(seg, FILLED);
     while (true) {
-        struct NonmovingSegment *current_filled = (struct NonmovingSegment*)VOLATILE_LOAD(&alloc->filled);
-        seg->link = current_filled;
+        struct NonmovingSegment *current_filled = (struct NonmovingSegment*) RELAXED_LOAD(&alloc->filled);
+        RELAXED_STORE(&seg->link, current_filled);
         if (cas((StgVolatilePtr) &alloc->filled, (StgWord) current_filled, (StgWord) seg) == (StgWord) current_filled) {
             break;
         }
@@ -276,12 +276,12 @@ extern uint8_t nonmovingMarkEpoch;
 
 INLINE_HEADER void nonmovingSetMark(struct NonmovingSegment *seg, nonmoving_block_idx i)
 {
-    seg->bitmap[i] = nonmovingMarkEpoch;
+    RELAXED_STORE(&seg->bitmap[i], nonmovingMarkEpoch);
 }
 
 INLINE_HEADER uint8_t nonmovingGetMark(struct NonmovingSegment *seg, nonmoving_block_idx i)
 {
-    return seg->bitmap[i];
+    return RELAXED_LOAD(&seg->bitmap[i]);
 }
 
 INLINE_HEADER void nonmovingSetClosureMark(StgPtr p)
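
The segment lists manipulated above are simple lock-free stacks: a push
reads the current head, stores it into the new segment's link field, and
CASes the head, retrying on failure. A hedged C11 sketch of the push side
(the types and field names are illustrative, not the RTS's):

    #include <stdatomic.h>

    struct segment {
        struct segment *link;
    };

    struct seg_list {
        _Atomic(struct segment *) head;
    };

    static void push_segment(struct seg_list *list, struct segment *seg)
    {
        struct segment *cur =
            atomic_load_explicit(&list->head, memory_order_relaxed);
        do {
            seg->link = cur;  /* cur is refreshed by the CAS on failure */
        } while (!atomic_compare_exchange_weak_explicit(
                      &list->head, &cur, seg,
                      memory_order_release,   /* publish seg->link */
                      memory_order_relaxed));
    }

As the new note in NonMoving.c points out, ABA is not a concern here
because a given segment is pushed onto a list at most once per collection
cycle.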


=====================================
rts/sm/NonMovingMark.c
=====================================
@@ -27,6 +27,8 @@
 #include "sm/Storage.h"
 #include "CNF.h"
 
+static void nonmovingResetUpdRemSetQueue (MarkQueue *rset);
+static void nonmovingResetUpdRemSet (UpdRemSet *rset);
 static bool check_in_nonmoving_heap(StgClosure *p);
 static void mark_closure (MarkQueue *queue, const StgClosure *p, StgClosure **origin);
 static void trace_tso (MarkQueue *queue, StgTSO *tso);
@@ -261,17 +263,9 @@ static uint32_t markQueueLength(MarkQueue *q);
 #endif
 static void init_mark_queue_(MarkQueue *queue);
 
-/* Transfers the given capability's update-remembered set to the global
- * remembered set.
- *
- * Really the argument type should be UpdRemSet* but this would be rather
- * inconvenient without polymorphism.
- */
-void nonmovingAddUpdRemSetBlocks(MarkQueue *rset)
+static void nonmovingAddUpdRemSetBlocks_(MarkQueue *rset)
 {
-    if (markQueueIsEmpty(rset)) return;
-
-    // find the tail of the queue
+    // find the tail of the remembered set mark queue
     bdescr *start = rset->blocks;
     bdescr *end = start;
     while (end->link != NULL)
@@ -282,14 +276,45 @@ void nonmovingAddUpdRemSetBlocks(MarkQueue *rset)
     end->link = upd_rem_set_block_list;
     upd_rem_set_block_list = start;
     RELEASE_LOCK(&upd_rem_set_lock);
+}
+
+/*
+ * Transfers the given capability's update-remembered set to the global
+ * remembered set.
+ *
+ * Really the argument type should be UpdRemSet* but this would be rather
+ * inconvenient without polymorphism.
+ */
+static void nonmovingAddUpdRemSetBlocks_lock(MarkQueue *rset)
+{
+    if (markQueueIsEmpty(rset)) return;
 
-    // Reset remembered set
+    nonmovingAddUpdRemSetBlocks_(rset);
+    // Reset the state of the remembered set.
     ACQUIRE_SM_LOCK;
     init_mark_queue_(rset);
     rset->is_upd_rem_set = true;
     RELEASE_SM_LOCK;
 }
 
+/*
+ * Transfers the given capability's update-remembered set to the global
+ * remembered set.
+ *
+ * Really the argument type should be UpdRemSet* but this would be rather
+ * inconvenient without polymorphism.
+ *
+ * Caller must hold SM_LOCK.
+ */
+void nonmovingAddUpdRemSetBlocks(UpdRemSet *rset)
+{
+    if (markQueueIsEmpty(&rset->queue)) return;
+
+    nonmovingAddUpdRemSetBlocks_(&rset->queue);
+    init_mark_queue_(&rset->queue);
+    rset->queue.is_upd_rem_set = true;
+}
+
 #if defined(THREADED_RTS)
 /* Called by capabilities to flush their update remembered sets when
  * synchronising with the non-moving collector as it transitions from mark to
@@ -301,7 +326,7 @@ void nonmovingFlushCapUpdRemSetBlocks(Capability *cap)
                "Capability %d flushing update remembered set: %d",
                cap->no, markQueueLength(&cap->upd_rem_set.queue));
     traceConcUpdRemSetFlush(cap);
-    nonmovingAddUpdRemSetBlocks(&cap->upd_rem_set.queue);
+    nonmovingAddUpdRemSetBlocks_lock(&cap->upd_rem_set.queue);
     atomic_inc(&upd_rem_set_flush_count, 1);
     signalCondition(&upd_rem_set_flushed_cond);
     // After this mutation will remain suspended until nonmovingFinishFlush
@@ -399,7 +424,7 @@ void nonmovingFinishFlush(Task *task)
 {
     // See Note [Unintentional marking in resurrectThreads]
     for (uint32_t i = 0; i < getNumCapabilities(); i++) {
-        reset_upd_rem_set(&getCapability(i)->upd_rem_set);
+        nonmovingResetUpdRemSet(&getCapability(i)->upd_rem_set);
     }
     // Also reset upd_rem_set_block_list in case some of the UpdRemSets were
     // filled and we flushed them.
@@ -424,7 +449,8 @@ push (MarkQueue *q, const MarkQueueEnt *ent)
     if (q->top->head == MARK_QUEUE_BLOCK_ENTRIES) {
         // Yes, this block is full.
         if (q->is_upd_rem_set) {
-            nonmovingAddUpdRemSetBlocks(q);
+            // Flush the block to the global update remembered set
+            nonmovingAddUpdRemSetBlocks_lock(q);
         } else {
             // allocate a fresh block.
             ACQUIRE_SM_LOCK;
@@ -780,7 +806,7 @@ void markQueuePushClosure (MarkQueue *q,
 /* TODO: Do we really never want to specify the origin here? */
 void markQueueAddRoot (MarkQueue* q, StgClosure** root)
 {
-    markQueuePushClosure(q, *root, NULL);
+    markQueuePushClosureGC(q, *root);
 }
 
 /* Push a closure to the mark queue without origin information */
@@ -908,18 +934,24 @@ void initMarkQueue (MarkQueue *queue)
 }
 
 /* Must hold sm_mutex. */
-void init_upd_rem_set (UpdRemSet *rset)
+void nonmovingInitUpdRemSet (UpdRemSet *rset)
 {
     init_mark_queue_(&rset->queue);
     rset->queue.is_upd_rem_set = true;
 }
 
-void reset_upd_rem_set (UpdRemSet *rset)
+static void nonmovingResetUpdRemSetQueue (MarkQueue *rset)
 {
     // UpdRemSets always have one block for the mark queue. This assertion is to
     // update this code if we change that.
-    ASSERT(rset->queue.blocks->link == NULL);
-    rset->queue.top->head = 0;
+    ASSERT(rset->is_upd_rem_set);
+    ASSERT(rset->blocks->link == NULL);
+    rset->top->head = 0;
+}
+
+void nonmovingResetUpdRemSet (UpdRemSet *rset)
+{
+    nonmovingResetUpdRemSetQueue(&rset->queue);
 }
 
 void freeMarkQueue (MarkQueue *queue)
@@ -1187,15 +1219,17 @@ trace_stack (MarkQueue *queue, StgStack *stack)
 static bool
 bump_static_flag(StgClosure **link_field, StgClosure *q STG_UNUSED)
 {
-    while (1) {
-        StgWord link = (StgWord) *link_field;
-        StgWord new = (link & ~STATIC_BITS) | static_flag;
-        if ((link & STATIC_BITS) == static_flag)
-            return false;
-        else if (cas((StgVolatilePtr) link_field, link, new) == link) {
-            return true;
-        }
+    ACQUIRE_SM_LOCK;
+    bool needs_marking;
+    StgWord link = (StgWord) *link_field;
+    if ((link & STATIC_BITS) == static_flag) {
+        needs_marking = false;
+    } else {
+        *link_field = (StgClosure *) ((link & ~STATIC_BITS) | static_flag);
+        needs_marking = true;
     }
+    RELEASE_SM_LOCK;
+    return needs_marking;
 }
 
 /* N.B. p0 may be tagged */
@@ -1211,10 +1245,16 @@ mark_closure (MarkQueue *queue, const StgClosure *p0, StgClosure **origin)
     StgWord tag = GET_CLOSURE_TAG(p);
     p = UNTAG_CLOSURE(p);
 
+    // Push an immutable field to the mark queue.
 #   define PUSH_FIELD(obj, field)                                \
         markQueuePushClosure(queue,                              \
                                 (StgClosure *) (obj)->field,     \
                                 (StgClosure **) &(obj)->field)
+    // Push a mutable field to the mark queue.
+#   define PUSH_FIELD_MUT(obj, field)                            \
+        markQueuePushClosure(queue,                              \
+                                (StgClosure *) ACQUIRE_LOAD(&(obj)->field),     \
+                                (StgClosure **) &(obj)->field)
 
     if (!HEAP_ALLOCED_GC(p)) {
         const StgInfoTable *info = get_itbl(p);
@@ -1281,7 +1321,10 @@ mark_closure (MarkQueue *queue, const StgClosure *p0, StgClosure **origin)
 
     bd = Bdescr((StgPtr) p);
 
-    if (bd->gen != oldest_gen) {
+    // This must be a relaxed load since the object may be a large object,
+    // in which case evacuation by the moving collector will result in
+    // mutation.
+    if (RELAXED_LOAD(&bd->gen) != oldest_gen) {
         // Here we have an object living outside of the non-moving heap. While
         // we likely evacuated nearly everything to the nonmoving heap during
         // preparation there are nevertheless a few ways in which we might trace
@@ -1386,16 +1429,16 @@ mark_closure (MarkQueue *queue, const StgClosure *p0, StgClosure **origin)
     case MVAR_CLEAN:
     case MVAR_DIRTY: {
         StgMVar *mvar = (StgMVar *) p;
-        PUSH_FIELD(mvar, head);
-        PUSH_FIELD(mvar, tail);
-        PUSH_FIELD(mvar, value);
+        PUSH_FIELD_MUT(mvar, head);
+        PUSH_FIELD_MUT(mvar, tail);
+        PUSH_FIELD_MUT(mvar, value);
         break;
     }
 
     case TVAR: {
         StgTVar *tvar = ((StgTVar *)p);
-        PUSH_FIELD(tvar, current_value);
-        PUSH_FIELD(tvar, first_watch_queue_entry);
+        PUSH_FIELD_MUT(tvar, current_value);
+        PUSH_FIELD_MUT(tvar, first_watch_queue_entry);
         break;
     }
 
@@ -1510,8 +1553,12 @@ mark_closure (MarkQueue *queue, const StgClosure *p0, StgClosure **origin)
     }
 
     case BLACKHOLE: {
-        PUSH_FIELD((StgInd *) p, indirectee);
-        StgClosure *indirectee = ((StgInd*)p)->indirectee;
+        // Synchronizes with the release-store in updateWithIndirection.
+        // See Note [Heap memory barriers] in SMP.h.
+        StgInd *ind = (StgInd *) p;
+        ACQUIRE_FENCE();
+        StgClosure *indirectee = RELAXED_LOAD(&ind->indirectee);
+        markQueuePushClosure(queue, indirectee, &ind->indirectee);
         if (GET_CLOSURE_TAG(indirectee) == 0 || origin == NULL) {
             // do nothing
         } else {
@@ -1522,7 +1569,7 @@ mark_closure (MarkQueue *queue, const StgClosure *p0, StgClosure **origin)
 
     case MUT_VAR_CLEAN:
     case MUT_VAR_DIRTY:
-        PUSH_FIELD((StgMutVar *)p, var);
+        PUSH_FIELD_MUT((StgMutVar *)p, var);
         break;
 
     case BLOCKING_QUEUE: {
@@ -1577,7 +1624,7 @@ mark_closure (MarkQueue *queue, const StgClosure *p0, StgClosure **origin)
         StgSmallMutArrPtrs *arr = (StgSmallMutArrPtrs *) p;
         for (StgWord i = 0; i < arr->ptrs; i++) {
             StgClosure **field = &arr->payload[i];
-            markQueuePushClosure(queue, *field, field);
+            markQueuePushClosure(queue, ACQUIRE_LOAD(field), field);
         }
         break;
     }
@@ -1639,6 +1686,7 @@ mark_closure (MarkQueue *queue, const StgClosure *p0, StgClosure **origin)
     }
 
 #   undef PUSH_FIELD
+#   undef PUSH_FIELD_MUT
 
     /* Set the mark bit: it's important that we do this only after we actually push
      * the object's pointers since in the case of marking stacks there may be a
@@ -1719,13 +1767,16 @@ nonmovingMark (MarkQueue *queue)
                 end = arr->ptrs;
             }
             for (StgWord i = start; i < end; i++) {
-                markQueuePushClosure_(queue, arr->payload[i]);
+                StgClosure *c = ACQUIRE_LOAD(&arr->payload[i]);
+                markQueuePushClosure_(queue, c);
             }
             break;
         }
         case NULL_ENTRY:
             // Perhaps the update remembered set has more to mark...
-            if (upd_rem_set_block_list) {
+            // N.B. This must be atomic since we have not yet taken
+            // upd_rem_set_lock.
+            if (RELAXED_LOAD(&upd_rem_set_block_list) != NULL) {
                 ACQUIRE_LOCK(&upd_rem_set_lock);
                 bdescr *old = queue->blocks;
                 queue->blocks = upd_rem_set_block_list;
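
The BLACKHOLE case above uses the fence form of acquire/release pairing:
the info pointer has already been read with a relaxed load, and the
acquire fence upgrades that read so that it synchronizes with the release
store performed by updateWithIndirection, making the indirectee written
before it visible. A hedged C11 sketch of the reader side (layout and
names are illustrative):

    #include <stdatomic.h>

    struct ind {
        _Atomic(const void *) info;  /* set by the writer's release store */
        _Atomic(void *) indirectee;  /* written before the info pointer */
    };

    static void *read_indirectee(struct ind *p)
    {
        /* The caller has already observed (via a relaxed load of p->info)
         * that this closure is an indirection. The acquire fence pairs with
         * the writer's release store to that info field, so the indirectee
         * written before it is guaranteed to be visible here. */
        atomic_thread_fence(memory_order_acquire);
        return atomic_load_explicit(&p->indirectee, memory_order_relaxed);
    }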


=====================================
rts/sm/NonMovingMark.h
=====================================
@@ -137,8 +137,7 @@ extern bdescr *upd_rem_set_block_list;
 
 void nonmovingMarkInitUpdRemSet(void);
 
-void init_upd_rem_set(UpdRemSet *rset);
-void reset_upd_rem_set(UpdRemSet *rset);
+void nonmovingInitUpdRemSet(UpdRemSet *rset);
 void updateRemembSetPushClosure(Capability *cap, StgClosure *p);
 void updateRemembSetPushThunk(Capability *cap, StgThunk *p);
 void updateRemembSetPushTSO(Capability *cap, StgTSO *tso);
@@ -165,7 +164,7 @@ void nonmovingResurrectThreads(struct MarkQueue_ *queue, StgTSO **resurrected_th
 bool nonmovingIsAlive(StgClosure *p);
 void nonmovingMarkDeadWeak(struct MarkQueue_ *queue, StgWeak *w);
 void nonmovingMarkLiveWeak(struct MarkQueue_ *queue, StgWeak *w);
-void nonmovingAddUpdRemSetBlocks(struct MarkQueue_ *rset);
+void nonmovingAddUpdRemSetBlocks(UpdRemSet *rset);
 
 void markQueuePush(MarkQueue *q, const MarkQueueEnt *ent);
 void markQueuePushClosureGC(MarkQueue *q, StgClosure *p);


=====================================
rts/sm/NonMovingShortcut.c
=====================================
@@ -153,7 +153,8 @@ selectee_changed:
     // Selectee is a non-moving object, mark it.
     markQueuePushClosure(queue, selectee, NULL);
 
-    const StgInfoTable *selectee_info_tbl = get_itbl(selectee);
+    // This may synchronize with the release in updateWithIndirection.
+    const StgInfoTable *selectee_info_tbl = get_itbl_acquire(selectee);
     switch (selectee_info_tbl->type) {
         case WHITEHOLE: {
             // Probably a loop. Abort.


=====================================
rts/sm/Storage.c
=====================================
@@ -325,7 +325,7 @@ void storageAddCapabilities (uint32_t from, uint32_t to)
     if (RtsFlags.GcFlags.useNonmoving) {
         nonmovingAddCapabilities(to);
         for (i = 0; i < to; ++i) {
-            init_upd_rem_set(&getCapability(i)->upd_rem_set);
+            nonmovingInitUpdRemSet(&getCapability(i)->upd_rem_set);
         }
     }
 


=====================================
rts/win32/AsyncMIO.c
=====================================
@@ -247,7 +247,7 @@ start:
     if (completed_hw == 0) {
         // empty table, drop lock and wait
         OS_RELEASE_LOCK(&queue_lock);
-        if ( wait && sched_state == SCHED_RUNNING ) {
+        if ( wait && getSchedState() == SCHED_RUNNING ) {
             DWORD dwRes = WaitForMultipleObjects(2, wait_handles,
                                                  FALSE, INFINITE);
             switch (dwRes) {


=====================================
rts/win32/AwaitEvent.c
=====================================
@@ -56,7 +56,7 @@ awaitEvent(Capability *cap, bool wait)
     //  - the run-queue is now non- empty
 
   } while (wait
-           && sched_state == SCHED_RUNNING
+           && getSchedState() == SCHED_RUNNING
            && emptyRunQueue(cap)
       );
 }


=====================================
rts/win32/ConsoleHandler.c
=====================================
@@ -91,7 +91,7 @@ static BOOL WINAPI shutdown_handler(DWORD dwCtrlType)
         // If we're already trying to interrupt the RTS, terminate with
         // extreme prejudice.  So the first ^C tries to exit the program
         // cleanly, and the second one just kills it.
-        if (sched_state >= SCHED_INTERRUPTING) {
+        if (getSchedState() >= SCHED_INTERRUPTING) {
             stg_exit(EXIT_INTERRUPTED);
         } else {
             interruptStgRts();



View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/801f97e827c3a661116b212c89c0f04d725e0776...e090f10543753358d5be370c721a9db9b2d43246
