[Git][ghc/ghc][wip/backports] 14 commits: configure.ac: Reset RELEASE to NO

Thu May 14 16:11:26 UTC 2020


Ben Gamari pushed to branch wip/backports at Glasgow Haskell Compiler / GHC


Commits:
584c3b05 by Ben Gamari at 2020-05-12T12:42:34-04:00
configure.ac: Reset RELEASE to NO
- - - - -
40cb7155 by Ben Gamari at 2020-05-14T12:10:51-04:00
rts: Add getCurrentThreadCPUTime helper

(cherry picked from commit cedd6f3041de6abe64dfa3257bec7730a9dced9f)

- - - - -
66be810f by Ben Gamari at 2020-05-14T12:10:51-04:00
rts: Prefer darwin-specific getCurrentThreadCPUTime

macOS Catalina now supports a non-POSIX-compliant version of clock_gettime
which cannot use the clock_gettime codepath.

Fixes #17906.

(cherry picked from commit bb586f894532baf1bcb822afd0df7f9fea198671)

- - - - -
76f679ea by Ben Gamari at 2020-05-14T12:10:51-04:00
nonmoving-gc: Track time usage of nonmoving marking

(cherry picked from commit ace618cd2294989e783bd453cee88e0e1c0dad77)

- - - - -
c8e88946 by Ben Gamari at 2020-05-14T12:10:51-04:00
nonmoving: Eagerly flush all capabilities' update remembered sets

(cherry picked from commit 2fa79119570b358a4db61446396889b8260d7957)

- - - - -
3837ad4c by Ben Gamari at 2020-05-14T12:10:51-04:00
nonmoving: Explicitly memoize block count

A profile cast doubt on whether the compiler hoisted the bound out the
loop as I would have expected here. It turns out it did but nevertheless
it seems clearer to just do this manually.

- - - - -
d9127b8f by Ben Gamari at 2020-05-14T12:10:51-04:00
nonmoving: Clear bitmap after initializing block size

Previously nonmovingInitSegment would clear the bitmap before
initializing the segment's block size. This is broken since
nonmovingClearBitmap looks at the segment's block size to determine how
much bitmap to clear.

- - - - -
a70192c8 by Ben Gamari at 2020-05-14T12:10:51-04:00
hadrian: Allow libnuma library path to be specified

- - - - -
ea602081 by Ben Gamari at 2020-05-14T12:10:51-04:00
hadrian: Refactor gmp arguments

Move the gmp configuration to its own binding.

- - - - -
b74dc41d by Ben Gamari at 2020-05-14T12:10:52-04:00
hadrian: Tell Cabal about integer-gmp library location

- - - - -
e8db3006 by Ben Gamari at 2020-05-14T12:10:52-04:00
rts: Zero block flags with -DZ

Block flags are very useful for determining the state of a block.
However, some block allocator users don't touch them, leading to
misleading values. Ensure that we zero then when zero-on-gc is set. This
is safe and makes the flags more useful during debugging.

- - - - -
e7524deb by Ben Gamari at 2020-05-14T12:10:52-04:00
nonmoving: Fix incorrect failed_to_evac value during deadlock gc

Previously we would incorrectly set the failed_to_evac flag if we
evacuated a value due to a deadlock GC. This would cause us to mark more
things as dirty than strictly necessary. It also turned up a nasty but
which I will fix next.

- - - - -
4c31e510 by Ben Gamari at 2020-05-14T12:10:52-04:00
nonmoving: Fix handling of dirty objects

Previously we (incorrectly) relied on failed_to_evac to be "precise".
That is, we expected it to only be true if *all* of an object's fields
lived outside of the non-moving heap. However, does not match the
behavior of failed_to_evac, which is true if *any* of the object's
fields weren't promoted (meaning that some others *may* live in the
non-moving heap).

This is problematic as we skip the non-moving write barrier for dirty
objects (which we can only safely do if *all* fields point outside of
the non-moving heap).

Clearly this arises due to a fundamental difference in the behavior
expected of failed_to_evac in the moving and non-moving collector.
e.g., in the moving collector it is always safe to conservatively say
failed_to_evac=true whereas in the non-moving collector the safe value
is false.

This issue went unnoticed as I never wrote down the dirtiness
invariant enforced by the non-moving collector. We now define this
invariant as

    An object being marked as dirty implies that all of its fields are
    on the mark queue (or, equivalently, update remembered set).

To maintain this invariant we teach nonmovingScavengeOne to push the
fields of objects which we fail to evacuate to the update remembered
set. This is a simple and reasonably cheap solution and avoids the
complexity and fragility that other, more strict alternative invariants
would require.

All of this is described in a new Note, Note [Dirty flags in the
non-moving collector] in NonMoving.c.

- - - - -
b9df14c4 by Ben Gamari at 2020-05-14T12:10:52-04:00
nonmoving: Optimise the write barrier

(cherry picked from commit a636eadac1f30bae37aeb6526f94893293f098b8)

- - - - -


25 changed files:

- configure.ac
- hadrian/cfg/system.config.in
- hadrian/src/Oracles/Flag.hs
- hadrian/src/Oracles/Setting.hs
- hadrian/src/Settings/Packages.hs
- includes/RtsAPI.h
- includes/rts/Time.h
- libraries/base/GHC/Stats.hsc
- libraries/text
- rts/GetTime.h
- rts/ProfHeap.c
- rts/Stats.c
- rts/Stats.h
- rts/Updates.h
- rts/posix/GetTime.c
- rts/sm/BlockAlloc.c
- rts/sm/Evac.c
- rts/sm/GC.c
- rts/sm/GCThread.h
- rts/sm/NonMoving.c
- rts/sm/NonMovingMark.c
- rts/sm/NonMovingScav.c
- rts/sm/NonMovingSweep.c
- rts/sm/Storage.c
- rts/win32/GetTime.c


Changes:

=====================================
configure.ac
=====================================
@@ -16,7 +16,7 @@ dnl
 AC_INIT([The Glorious Glasgow Haskell Compilation System], [8.10.1], [glasgow-haskell-bugs at haskell.org], [ghc-AC_PACKAGE_VERSION])
 
 # Set this to YES for a released version, otherwise NO
-: ${RELEASE=YES}
+: ${RELEASE=NO}
 
 # The primary version (e.g. 7.5, 7.4.1) is set in the AC_INIT line
 # above.  If this is not a released version, then we will append the
@@ -1285,6 +1285,28 @@ AC_DEFINE_UNQUOTED([USE_LIBDW], [$USE_LIBDW], [Set to 1 to use libdw])
 
 dnl ** Have libnuma?
 dnl --------------------------------------------------------------
+AC_ARG_WITH([libnuma-libraries],
+  [AC_HELP_STRING([--with-libnuma-libraries=ARG],
+    [Find libraries for libnuma in ARG [default=system default]])
+  ],
+  [
+    LibNumaLibDir="$withval"
+    LIBNUMA_LDFLAGS="-L$withval"
+  ])
+
+AC_SUBST(LibNumaLibDir)
+
+AC_ARG_WITH([libnuma-includes],
+  [AC_HELP_STRING([--with-libnuma-includes=ARG],
+    [Find includes for libnuma in ARG [default=system default]])
+  ],
+  [
+    LibNumaIncludeDir="$withval"
+    LIBNUMA_CFLAGS="-I$withval"
+  ])
+
+AC_SUBST(LibNumaIncludeDir)
+
 HaveLibNuma=0
 AC_ARG_ENABLE(numa,
     [AC_HELP_STRING([--enable-numa],
@@ -1292,6 +1314,11 @@ AC_ARG_ENABLE(numa,
          runtime system via numactl's libnuma [default=auto]])])
 
 if test "$enable_numa" != "no" ; then
+  CFLAGS2="$CFLAGS"
+  CFLAGS="$LIBNUMA_CFLAGS $CFLAGS"
+  LDFLAGS2="$LDFLAGS"
+  LDFLAGS="$LIBNUMA_LDFLAGS $LDFLAGS"
+
   AC_CHECK_HEADERS([numa.h numaif.h])
 
   if test "$ac_cv_header_numa_h$ac_cv_header_numaif_h" = "yesyes" ; then
@@ -1300,16 +1327,20 @@ if test "$enable_numa" != "no" ; then
   if test "$enable_numa:$HaveLibNuma" = "yes:0" ; then
       AC_MSG_ERROR([Cannot find system libnuma (required by --enable-numa)])]
   fi
+
+  CFLAGS="$CFLAGS2"
+  LDFLAGS="$LDFLAGS2"
 fi
 
 AC_DEFINE_UNQUOTED([HAVE_LIBNUMA], [$HaveLibNuma], [Define to 1 if you have libnuma])
 if test $HaveLibNuma = "1" ; then
+  AC_SUBST([UseLibNuma],[YES])
   AC_SUBST([CabalHaveLibNuma],[True])
 else
+  AC_SUBST([UseLibNuma],[NO])
   AC_SUBST([CabalHaveLibNuma],[False])
 fi
 
-
 dnl ** Documentation
 dnl --------------------------------------------------------------
 if test -n "$SPHINXBUILD"; then


=====================================
hadrian/cfg/system.config.in
=====================================
@@ -172,8 +172,12 @@ ffi-lib-dir       = @FFILibDir@
 libdw-include-dir   = @LibdwIncludeDir@
 libdw-lib-dir       = @LibdwLibDir@
 
+libnuma-include-dir   = @LibNumaIncludeDir@
+libnuma-lib-dir       = @LibNumaLibDir@
+
 # Optional Dependencies:
 #=======================
 
 with-libdw = @UseLibdw@
+with-libnuma = @UseLibNuma@
 have-lib-mingw-ex = @HaveLibMingwEx@


=====================================
hadrian/src/Oracles/Flag.hs
=====================================
@@ -21,6 +21,7 @@ data Flag = ArSupportsAtFile
           | LeadingUnderscore
           | SolarisBrokenShld
           | WithLibdw
+          | WithLibnuma
           | HaveLibMingwEx
           | UseSystemFfi
 
@@ -39,6 +40,7 @@ flag f = do
             LeadingUnderscore  -> "leading-underscore"
             SolarisBrokenShld  -> "solaris-broken-shld"
             WithLibdw          -> "with-libdw"
+            WithLibnuma        -> "with-libnuma"
             HaveLibMingwEx     -> "have-lib-mingw-ex"
             UseSystemFfi       -> "use-system-ffi"
     value <- lookupValueOrError configFile key


=====================================
hadrian/src/Oracles/Setting.hs
=====================================
@@ -56,6 +56,8 @@ data Setting = BuildArch
              | IconvLibDir
              | LibdwIncludeDir
              | LibdwLibDir
+             | LibnumaIncludeDir
+             | LibnumaLibDir
              | LlvmTarget
              | ProjectGitCommitId
              | ProjectName
@@ -145,6 +147,8 @@ setting key = lookupValueOrError configFile $ case key of
     IconvLibDir        -> "iconv-lib-dir"
     LibdwIncludeDir    -> "libdw-include-dir"
     LibdwLibDir        -> "libdw-lib-dir"
+    LibnumaIncludeDir  -> "libnuma-include-dir"
+    LibnumaLibDir      -> "libnuma-lib-dir"
     LlvmTarget         -> "llvm-target"
     ProjectGitCommitId -> "project-git-commit-id"
     ProjectName        -> "project-name"


=====================================
hadrian/src/Settings/Packages.hs
=====================================
@@ -147,18 +147,7 @@ packageArgs = do
           builder (Cabal Flags) ? arg "in-ghc-tree"
 
         ------------------------------ integerGmp ------------------------------
-        , package integerGmp ? mconcat
-          [ builder Cc ? arg includeGmp
-
-          , builder (Cabal Setup) ? mconcat
-            [ flag GmpInTree ? arg "--configure-option=--with-intree-gmp"
-            -- Windows is always built with inplace GMP until we have dynamic
-            -- linking working.
-            , windowsHost  ? arg "--configure-option=--with-intree-gmp"
-            , flag GmpFrameworkPref ?
-              arg "--configure-option=--with-gmp-framework-preferred"
-            , arg ("--configure-option=CFLAGS=" ++ includeGmp)
-            , arg ("--gcc-options="             ++ includeGmp) ] ]
+        , gmpPackageArgs
 
         ---------------------------------- rts ---------------------------------
         , package rts ? rtsPackageArgs -- RTS deserves a separate function
@@ -181,6 +170,32 @@ packageArgs = do
           builder (Cabal Flags) ? notStage0 ? intLib == integerSimple ?
           pure ["+integer-simple", "-bytestring-builder"] ]
 
+gmpPackageArgs :: Args
+gmpPackageArgs = do
+    -- These are only used for non-in-tree builds.
+    librariesGmp <- getSetting GmpLibDir
+    includesGmp <- getSetting GmpIncludeDir
+
+    -- Windows is always built with inplace GMP until we have dynamic
+    -- linking working.
+    inTreeFlag <- getFlag GmpInTree
+    let inTree = inTreeFlag || windowsHost
+
+    package integerGmp ? mconcat
+          [ builder (Cabal Setup) ? mconcat
+            [ inTree ? arg "--configure-option=--with-intree-gmp"
+            , flag GmpFrameworkPref ?
+              arg "--configure-option=--with-gmp-framework-preferred"
+
+              -- Ensure that the integer-gmp package registration includes
+              -- knowledge of the system gmp's library and include directories.
+            , notM (flag GmpInTree) ? mconcat
+              [ if not (null librariesGmp) then arg ("--extra-lib-dirs=" ++ librariesGmp) else mempty
+              , if not (null includesGmp) then arg ("--extra-include-dirs=" ++ includesGmp) else mempty
+              ]
+            ]
+          ]
+
 -- | RTS-specific command line arguments.
 rtsPackageArgs :: Args
 rtsPackageArgs = package rts ? do
@@ -208,6 +223,8 @@ rtsPackageArgs = package rts ? do
     ffiLibraryDir  <- getSetting FfiLibDir
     libdwIncludeDir   <- getSetting LibdwIncludeDir
     libdwLibraryDir   <- getSetting LibdwLibDir
+    libnumaIncludeDir <- getSetting LibnumaIncludeDir
+    libnumaLibraryDir <- getSetting LibnumaLibDir
 
     -- Arguments passed to GHC when compiling C and .cmm sources.
     let ghcArgs = mconcat
@@ -215,6 +232,8 @@ rtsPackageArgs = package rts ? do
           , arg $ "-I" ++ path
           , flag WithLibdw ? if not (null libdwIncludeDir) then arg ("-I" ++ libdwIncludeDir) else mempty
           , flag WithLibdw ? if not (null libdwLibraryDir) then arg ("-L" ++ libdwLibraryDir) else mempty
+          , flag WithLibnuma ? if not (null libnumaIncludeDir) then arg ("-I" ++ libnumaIncludeDir) else mempty
+          , flag WithLibnuma ? if not (null libnumaLibraryDir) then arg ("-L" ++ libnumaLibraryDir) else mempty
           , arg $ "-DRtsWay=\"rts_" ++ show way ++ "\""
           -- Set the namespace for the rts fs functions
           , arg $ "-DFS_NAMESPACE=rts"
@@ -324,6 +343,9 @@ rtsPackageArgs = package rts ? do
           , any (wayUnit Dynamic) rtsWays ? arg "dynamic"
           , Debug `wayUnit` way           ? arg "find-ptr"
           ]
+        , builder (Cabal Setup) ?
+               if not (null libnumaLibraryDir) then arg ("--extra-lib-dirs="++libnumaLibraryDir) else mempty
+            <> if not (null libnumaIncludeDir) then arg ("--extra-include-dirs="++libnumaIncludeDir) else mempty
         , builder (Cc FindCDependencies) ? cArgs
         , builder (Ghc CompileCWithGhc) ? map ("-optc" ++) <$> cArgs
         , builder Ghc ? ghcArgs


=====================================
includes/RtsAPI.h
=====================================
@@ -151,6 +151,23 @@ typedef struct GCDetails_ {
   Time cpu_ns;
     // The time elapsed during GC itself
   Time elapsed_ns;
+
+    //
+    // Concurrent garbage collector
+    //
+
+    // The CPU time used during the post-mark pause phase of the concurrent
+    // nonmoving GC.
+  Time nonmoving_gc_sync_cpu_ns;
+    // The time elapsed during the post-mark pause phase of the concurrent
+    // nonmoving GC.
+  Time nonmoving_gc_sync_elapsed_ns;
+    // The CPU time used during the post-mark pause phase of the concurrent
+    // nonmoving GC.
+  Time nonmoving_gc_cpu_ns;
+    // The time elapsed during the post-mark pause phase of the concurrent
+    // nonmoving GC.
+  Time nonmoving_gc_elapsed_ns;
 } GCDetails;
 
 //
@@ -241,6 +258,28 @@ typedef struct _RTSStats {
     // The number of times a GC thread has iterated it's outer loop across all
     // parallel GCs
   uint64_t scav_find_work;
+
+  // ----------------------------------
+  // Concurrent garbage collector
+
+    // The CPU time used during the post-mark pause phase of the concurrent
+    // nonmoving GC.
+  Time nonmoving_gc_sync_cpu_ns;
+    // The time elapsed during the post-mark pause phase of the concurrent
+    // nonmoving GC.
+  Time nonmoving_gc_sync_elapsed_ns;
+    // The maximum time elapsed during the post-mark pause phase of the
+    // concurrent nonmoving GC.
+  Time nonmoving_gc_sync_max_elapsed_ns;
+    // The CPU time used during the post-mark pause phase of the concurrent
+    // nonmoving GC.
+  Time nonmoving_gc_cpu_ns;
+    // The time elapsed during the post-mark pause phase of the concurrent
+    // nonmoving GC.
+  Time nonmoving_gc_elapsed_ns;
+    // The maximum time elapsed during the post-mark pause phase of the
+    // concurrent nonmoving GC.
+  Time nonmoving_gc_max_elapsed_ns;
 } RTSStats;
 
 void getRTSStats (RTSStats *s);


=====================================
includes/rts/Time.h
=====================================
@@ -33,6 +33,7 @@ typedef int64_t Time;
 
 #define SecondsToTime(t) ((Time)(t) * TIME_RESOLUTION)
 #define TimeToSeconds(t) ((t) / TIME_RESOLUTION)
+#define TimeToSecondsDbl(t) ((double)(t) / TIME_RESOLUTION)
 
 // Use instead of SecondsToTime() when we have a floating-point
 // seconds value, to avoid truncating it.


=====================================
libraries/base/GHC/Stats.hsc
=====================================
@@ -103,6 +103,25 @@ data RTSStats = RTSStats {
     -- | Total elapsed time (at the previous GC)
   , elapsed_ns :: RtsTime
 
+    -- | The CPU time used during the post-mark pause phase of the concurrent
+    -- nonmoving GC.
+  , nonmoving_gc_sync_cpu_ns :: RtsTime
+    -- | The time elapsed during the post-mark pause phase of the concurrent
+    -- nonmoving GC.
+  , nonmoving_gc_sync_elapsed_ns :: RtsTime
+    -- | The maximum time elapsed during the post-mark pause phase of the
+    -- concurrent nonmoving GC.
+  , nonmoving_gc_sync_max_elapsed_ns :: RtsTime
+    -- | The CPU time used during the post-mark pause phase of the concurrent
+    -- nonmoving GC.
+  , nonmoving_gc_cpu_ns :: RtsTime
+    -- | The time elapsed during the post-mark pause phase of the concurrent
+    -- nonmoving GC.
+  , nonmoving_gc_elapsed_ns :: RtsTime
+    -- | The maximum time elapsed during the post-mark pause phase of the
+    -- concurrent nonmoving GC.
+  , nonmoving_gc_max_elapsed_ns :: RtsTime
+
     -- | Details about the most recent GC
   , gc :: GCDetails
   } deriving ( Read -- ^ @since 4.10.0.0
@@ -146,6 +165,13 @@ data GCDetails = GCDetails {
   , gcdetails_cpu_ns :: RtsTime
     -- | The time elapsed during GC itself
   , gcdetails_elapsed_ns :: RtsTime
+
+    -- | The CPU time used during the post-mark pause phase of the concurrent
+    -- nonmoving GC.
+  , gcdetails_nonmoving_gc_sync_cpu_ns :: RtsTime
+    -- | The time elapsed during the post-mark pause phase of the concurrent
+    -- nonmoving GC.
+  , gcdetails_nonmoving_gc_sync_elapsed_ns :: RtsTime
   } deriving ( Read -- ^ @since 4.10.0.0
              , Show -- ^ @since 4.10.0.0
              )
@@ -192,6 +218,12 @@ getRTSStats = do
     gc_elapsed_ns <- (# peek RTSStats, gc_elapsed_ns) p
     cpu_ns <- (# peek RTSStats, cpu_ns) p
     elapsed_ns <- (# peek RTSStats, elapsed_ns) p
+    nonmoving_gc_sync_cpu_ns <- (# peek RTSStats, nonmoving_gc_sync_cpu_ns) p
+    nonmoving_gc_sync_elapsed_ns <- (# peek RTSStats, nonmoving_gc_sync_elapsed_ns) p
+    nonmoving_gc_sync_max_elapsed_ns <- (# peek RTSStats, nonmoving_gc_sync_max_elapsed_ns) p
+    nonmoving_gc_cpu_ns <- (# peek RTSStats, nonmoving_gc_cpu_ns) p
+    nonmoving_gc_elapsed_ns <- (# peek RTSStats, nonmoving_gc_elapsed_ns) p
+    nonmoving_gc_max_elapsed_ns <- (# peek RTSStats, nonmoving_gc_max_elapsed_ns) p
     let pgc = (# ptr RTSStats, gc) p
     gc <- do
       gcdetails_gen <- (# peek GCDetails, gen) pgc
@@ -211,5 +243,7 @@ getRTSStats = do
       gcdetails_sync_elapsed_ns <- (# peek GCDetails, sync_elapsed_ns) pgc
       gcdetails_cpu_ns <- (# peek GCDetails, cpu_ns) pgc
       gcdetails_elapsed_ns <- (# peek GCDetails, elapsed_ns) pgc
+      gcdetails_nonmoving_gc_sync_cpu_ns <- (# peek GCDetails, nonmoving_gc_sync_cpu_ns) pgc
+      gcdetails_nonmoving_gc_sync_elapsed_ns <- (# peek GCDetails, nonmoving_gc_sync_elapsed_ns) pgc
       return GCDetails{..}
     return RTSStats{..}


=====================================
libraries/text
=====================================
@@ -1 +1 @@
-Subproject commit c6768a2a07e94b8b26d0f0e53517773de1110ce2
+Subproject commit ebb98f3929360f3abb681dfca4caa8a190f9c5a8


=====================================
rts/GetTime.h
=====================================
@@ -13,6 +13,7 @@
 void initializeTimer       (void);
 
 Time getProcessCPUTime     (void);
+Time getCurrentThreadCPUTime (void);
 void getProcessTimes       (Time *user, Time *elapsed);
 
 /* Get the current date and time.


=====================================
rts/ProfHeap.c
=====================================
@@ -1166,6 +1166,8 @@ heapCensusChain( Census *census, bdescr *bd )
     }
 }
 
+// Time is process CPU time of beginning of current GC and is used as
+// the mutator CPU time reported as the census timestamp.
 void heapCensus (Time t)
 {
   uint32_t g, n;
@@ -1173,7 +1175,7 @@ void heapCensus (Time t)
   gen_workspace *ws;
 
   census = &censuses[era];
-  census->time  = mut_user_time_until(t);
+  census->time  = TimeToSecondsDbl(t);
   census->rtime = TimeToNS(stat_getElapsedTime());
 
 


=====================================
rts/Stats.c
=====================================
@@ -26,14 +26,14 @@
 
 #include <string.h> // for memset
 
-#define TimeToSecondsDbl(t) ((double)(t) / TIME_RESOLUTION)
-
 static Time
     start_init_cpu, start_init_elapsed,
     end_init_cpu,   end_init_elapsed,
     start_exit_cpu, start_exit_elapsed,
     start_exit_gc_elapsed, start_exit_gc_cpu,
-    end_exit_cpu,   end_exit_elapsed;
+    end_exit_cpu,   end_exit_elapsed,
+    start_nonmoving_gc_cpu, start_nonmoving_gc_elapsed,
+    start_nonmoving_gc_sync_elapsed;
 
 #if defined(PROFILING)
 static Time RP_start_time  = 0, RP_tot_time  = 0;  // retainer prof user time
@@ -84,7 +84,7 @@ Time stat_getElapsedTime(void)
 double
 mut_user_time_until( Time t )
 {
-    return TimeToSecondsDbl(t - stats.gc_cpu_ns);
+    return TimeToSecondsDbl(t - stats.gc_cpu_ns - stats.nonmoving_gc_cpu_ns);
     // heapCensus() time is included in GC_tot_cpu, so we don't need
     // to subtract it here.
 
@@ -125,6 +125,10 @@ initStats0(void)
     end_init_cpu     = 0;
     end_init_elapsed  = 0;
 
+    start_nonmoving_gc_cpu = 0;
+    start_nonmoving_gc_elapsed = 0;
+    start_nonmoving_gc_sync_elapsed = 0;
+
     start_exit_cpu    = 0;
     start_exit_elapsed = 0;
     start_exit_gc_cpu    = 0;
@@ -175,6 +179,11 @@ initStats0(void)
         .gc_elapsed_ns = 0,
         .cpu_ns = 0,
         .elapsed_ns = 0,
+        .nonmoving_gc_cpu_ns = 0,
+        .nonmoving_gc_elapsed_ns = 0,
+        .nonmoving_gc_max_elapsed_ns = 0,
+        .nonmoving_gc_sync_elapsed_ns = 0,
+        .nonmoving_gc_sync_max_elapsed_ns = 0,
         .gc = {
             .gen = 0,
             .threads = 0,
@@ -189,7 +198,10 @@ initStats0(void)
             .par_balanced_copied_bytes = 0,
             .sync_elapsed_ns = 0,
             .cpu_ns = 0,
-            .elapsed_ns = 0
+            .elapsed_ns = 0,
+            .nonmoving_gc_cpu_ns = 0,
+            .nonmoving_gc_elapsed_ns = 0,
+            .nonmoving_gc_sync_elapsed_ns = 0,
         }
     };
 }
@@ -274,6 +286,11 @@ stat_startExit(void)
     start_exit_gc_cpu = stats.gc_cpu_ns;
 }
 
+/* -----------------------------------------------------------------------------
+   Nonmoving (concurrent) collector statistics
+
+   These two measure the time taken in the concurrent mark & sweep collector.
+   -------------------------------------------------------------------------- */
 void
 stat_endExit(void)
 {
@@ -286,10 +303,115 @@ stat_startGCSync (gc_thread *gct)
     gct->gc_sync_start_elapsed = getProcessElapsedTime();
 }
 
+void
+stat_startNonmovingGc ()
+{
+    start_nonmoving_gc_cpu = getCurrentThreadCPUTime();
+    start_nonmoving_gc_elapsed = getProcessCPUTime();
+}
+
+void
+stat_endNonmovingGc ()
+{
+    Time cpu = getCurrentThreadCPUTime();
+    Time elapsed = getProcessCPUTime();
+    stats.gc.nonmoving_gc_elapsed_ns = elapsed - start_nonmoving_gc_elapsed;
+    stats.nonmoving_gc_elapsed_ns += stats.gc.nonmoving_gc_elapsed_ns;
+
+    stats.gc.nonmoving_gc_cpu_ns = cpu - start_nonmoving_gc_cpu;
+    stats.nonmoving_gc_cpu_ns += stats.gc.nonmoving_gc_cpu_ns;
+
+    stats.nonmoving_gc_max_elapsed_ns =
+      stg_max(stats.gc.nonmoving_gc_elapsed_ns,
+              stats.nonmoving_gc_max_elapsed_ns);
+}
+
+void
+stat_startNonmovingGcSync ()
+{
+    start_nonmoving_gc_sync_elapsed = getProcessElapsedTime();
+    traceConcSyncBegin();
+}
+
+void
+stat_endNonmovingGcSync ()
+{
+    Time end_elapsed = getProcessElapsedTime();
+    stats.gc.nonmoving_gc_sync_elapsed_ns = end_elapsed - start_nonmoving_gc_sync_elapsed;
+    stats.nonmoving_gc_sync_elapsed_ns +=  stats.gc.nonmoving_gc_sync_elapsed_ns;
+    stats.nonmoving_gc_sync_max_elapsed_ns =
+      stg_max(stats.gc.nonmoving_gc_sync_elapsed_ns,
+              stats.nonmoving_gc_sync_max_elapsed_ns);
+    traceConcSyncEnd();
+}
+
 /* -----------------------------------------------------------------------------
    Called at the beginning of each GC
    -------------------------------------------------------------------------- */
 
+/*
+ * Note [Time accounting]
+ * ~~~~~~~~~~~~~~~~~~~~~~
+ * In the "vanilla" configuration (using the standard copying GC) GHC keeps
+ * track of a two different sinks of elapsed and CPU time:
+ *
+ *  - time spent synchronising to initiate garbage collection
+ *  - garbage collection (per generation)
+ *  - mutation
+ *
+ * When using the (concurrent) non-moving garbage collector (see Note
+ * [Non-moving garbage collector]) we also track a few more sinks:
+ *
+ *  - minor GC
+ *  - major GC (namly time spent in the preparatory phase)
+ *  - concurrent mark
+ *  - final synchronization (elapsed only)
+ *  - mutation
+ *
+ * To keep track of these CPU times we rely on the system's per-thread CPU time
+ * clock (exposed via the runtime's getCurrentThreadCPUTime utility).
+ *
+ * CPU time spent in the copying garbage collector is tracked in each GC
+ * worker's gc_thread struct. At the beginning of scavenging each worker
+ * records its OS thread's CPU time its gc_thread (by stat_startGCWorker). At
+ * the end of scavenging we again record the CPU time (in stat_endGCworker).
+ * The differences of these are then summed over by the thread leading the GC
+ * at the end of collection in stat_endGC. By contrast, the elapsed time is
+ * recorded only by the leader.
+ *
+ * Mutator time is derived from the process's CPU time, subtracting out
+ * contributions from stop-the-world and concurrent GCs.
+ *
+ * Time spent in concurrent marking is recorded by stat_{start,end}NonmovingGc.
+ * Likewise, elapsed time spent in the final synchronization is recorded by
+ * stat_{start,end}NonmovingGcSync.
+ */
+
+void
+stat_startGCWorker (Capability *cap STG_UNUSED, gc_thread *gct)
+{
+    bool stats_enabled =
+        RtsFlags.GcFlags.giveStats != NO_GC_STATS ||
+        rtsConfig.gcDoneHook != NULL;
+
+    if (stats_enabled || RtsFlags.ProfFlags.doHeapProfile) {
+        gct->gc_start_cpu = getCurrentThreadCPUTime();
+    }
+}
+
+void
+stat_endGCWorker (Capability *cap STG_UNUSED, gc_thread *gct)
+{
+    bool stats_enabled =
+        RtsFlags.GcFlags.giveStats != NO_GC_STATS ||
+        rtsConfig.gcDoneHook != NULL;
+
+    if (stats_enabled || RtsFlags.ProfFlags.doHeapProfile) {
+        gct->gc_end_cpu = getCurrentThreadCPUTime();
+        ASSERT(gct->gc_end_cpu >= gct->gc_start_cpu);
+    }
+}
+
 void
 stat_startGC (Capability *cap, gc_thread *gct)
 {
@@ -297,7 +419,15 @@ stat_startGC (Capability *cap, gc_thread *gct)
         debugBelch("\007");
     }
 
-    getProcessTimes(&gct->gc_start_cpu, &gct->gc_start_elapsed);
+    bool stats_enabled =
+        RtsFlags.GcFlags.giveStats != NO_GC_STATS ||
+        rtsConfig.gcDoneHook != NULL;
+
+    if (stats_enabled || RtsFlags.ProfFlags.doHeapProfile) {
+        gct->gc_start_cpu = getCurrentThreadCPUTime();
+    }
+
+    gct->gc_start_elapsed = getProcessElapsedTime();
 
     // Post EVENT_GC_START with the same timestamp as used for stats
     // (though converted from Time=StgInt64 to EventTimestamp=StgWord64).
@@ -320,9 +450,9 @@ stat_startGC (Capability *cap, gc_thread *gct)
    -------------------------------------------------------------------------- */
 
 void
-stat_endGC (Capability *cap, gc_thread *gct, W_ live, W_ copied, W_ slop,
-            uint32_t gen, uint32_t par_n_threads, W_ par_max_copied,
-            W_ par_balanced_copied, W_ gc_spin_spin, W_ gc_spin_yield,
+stat_endGC (Capability *cap, gc_thread *initiating_gct, W_ live, W_ copied, W_ slop,
+            uint32_t gen, uint32_t par_n_threads, gc_thread **gc_threads,
+            W_ par_max_copied, W_ par_balanced_copied, W_ gc_spin_spin, W_ gc_spin_yield,
             W_ mut_spin_spin, W_ mut_spin_yield, W_ any_work, W_ no_work,
             W_ scav_find_work)
 {
@@ -364,9 +494,14 @@ stat_endGC (Capability *cap, gc_thread *gct, W_ live, W_ copied, W_ slop,
         stats.elapsed_ns = current_elapsed - start_init_elapsed;
 
         stats.gc.sync_elapsed_ns =
-            gct->gc_start_elapsed - gct->gc_sync_start_elapsed;
-        stats.gc.elapsed_ns = current_elapsed - gct->gc_start_elapsed;
-        stats.gc.cpu_ns = current_cpu - gct->gc_start_cpu;
+            initiating_gct->gc_start_elapsed - initiating_gct->gc_sync_start_elapsed;
+        stats.gc.elapsed_ns = current_elapsed - initiating_gct->gc_start_elapsed;
+        stats.gc.cpu_ns = 0;
+        for (unsigned int i=0; i < par_n_threads; i++) {
+            gc_thread *gct = gc_threads[i];
+            ASSERT(gct->gc_end_cpu >= gct->gc_start_cpu);
+            stats.gc.cpu_ns += gct->gc_end_cpu - gct->gc_start_cpu;
+        }
     }
     // -------------------------------------------------
     // Update the cumulative stats
@@ -473,8 +608,8 @@ stat_endGC (Capability *cap, gc_thread *gct, W_ live, W_ copied, W_ slop,
                     TimeToSecondsDbl(stats.gc.elapsed_ns),
                     TimeToSecondsDbl(stats.cpu_ns),
                     TimeToSecondsDbl(stats.elapsed_ns),
-                    faults - gct->gc_start_faults,
-                        gct->gc_start_faults - GC_end_faults,
+                    faults - initiating_gct->gc_start_faults,
+                        initiating_gct->gc_start_faults - GC_end_faults,
                     gen);
 
             GC_end_faults = faults;
@@ -700,6 +835,21 @@ static void report_summary(const RTSSummaryStats* sum)
                     TimeToSecondsDbl(gen_stats->avg_pause_ns),
                     TimeToSecondsDbl(gen_stats->max_pause_ns));
     }
+    if (RtsFlags.GcFlags.useNonmoving) {
+        const int n_major_colls = sum->gc_summary_stats[RtsFlags.GcFlags.generations-1].collections;
+        statsPrintf("  Gen  1     %5d syncs"
+                    ",                      %6.3fs     %3.4fs    %3.4fs\n",
+                    n_major_colls,
+                    TimeToSecondsDbl(stats.nonmoving_gc_sync_elapsed_ns),
+                    TimeToSecondsDbl(stats.nonmoving_gc_sync_elapsed_ns) / n_major_colls,
+                    TimeToSecondsDbl(stats.nonmoving_gc_sync_max_elapsed_ns));
+        statsPrintf("  Gen  1      concurrent"
+                    ",             %6.3fs  %6.3fs     %3.4fs    %3.4fs\n",
+                    TimeToSecondsDbl(stats.nonmoving_gc_cpu_ns),
+                    TimeToSecondsDbl(stats.nonmoving_gc_elapsed_ns),
+                    TimeToSecondsDbl(stats.nonmoving_gc_elapsed_ns) / n_major_colls,
+                    TimeToSecondsDbl(stats.nonmoving_gc_max_elapsed_ns));
+    }
 
     statsPrintf("\n");
 
@@ -736,6 +886,12 @@ static void report_summary(const RTSSummaryStats* sum)
     statsPrintf("  GC      time  %7.3fs  (%7.3fs elapsed)\n",
                 TimeToSecondsDbl(stats.gc_cpu_ns),
                 TimeToSecondsDbl(stats.gc_elapsed_ns));
+    if (RtsFlags.GcFlags.useNonmoving) {
+        statsPrintf(
+                "  CONC GC time  %7.3fs  (%7.3fs elapsed)\n",
+                TimeToSecondsDbl(stats.nonmoving_gc_cpu_ns),
+                TimeToSecondsDbl(stats.nonmoving_gc_elapsed_ns));
+    }
 
 #if defined(PROFILING)
     statsPrintf("  RP      time  %7.3fs  (%7.3fs elapsed)\n",
@@ -1094,7 +1250,8 @@ stat_exit (void)
 
             stats.mutator_cpu_ns     = start_exit_cpu
                                  - end_init_cpu
-                                 - (stats.gc_cpu_ns - exit_gc_cpu);
+                                 - (stats.gc_cpu_ns - exit_gc_cpu)
+                                 - stats.nonmoving_gc_cpu_ns;
             stats.mutator_elapsed_ns = start_exit_elapsed
                                  - end_init_elapsed
                                  - (stats.gc_elapsed_ns - exit_gc_elapsed);
@@ -1504,7 +1661,8 @@ void getRTSStats( RTSStats *s )
     s->cpu_ns = current_cpu - end_init_cpu;
     s->elapsed_ns = current_elapsed - end_init_elapsed;
 
-    s->mutator_cpu_ns = current_cpu - end_init_cpu - stats.gc_cpu_ns;
+    s->mutator_cpu_ns = current_cpu - end_init_cpu - stats.gc_cpu_ns -
+        stats.nonmoving_gc_cpu_ns;
     s->mutator_elapsed_ns = current_elapsed - end_init_elapsed -
         stats.gc_elapsed_ns;
 }


=====================================
rts/Stats.h
=====================================
@@ -30,13 +30,21 @@ void      stat_endInit(void);
 
 void      stat_startGCSync(struct gc_thread_ *_gct);
 void      stat_startGC(Capability *cap, struct gc_thread_ *_gct);
-void      stat_endGC  (Capability *cap, struct gc_thread_ *_gct, W_ live,
-                       W_ copied, W_ slop, uint32_t gen, uint32_t n_gc_threads,
+void      stat_startGCWorker (Capability *cap, struct gc_thread_ *_gct);
+void      stat_endGCWorker (Capability *cap, struct gc_thread_ *_gct);
+void      stat_endGC  (Capability *cap, struct gc_thread_ *initiating_gct, W_ live,
+                       W_ copied, W_ slop, uint32_t gen,
+                       uint32_t n_gc_threads, struct gc_thread_ **gc_threads,
                        W_ par_max_copied, W_ par_balanced_copied,
                        W_ gc_spin_spin, W_ gc_spin_yield, W_ mut_spin_spin,
                        W_ mut_spin_yield, W_ any_work, W_ no_work,
                        W_ scav_find_work);
 
+void      stat_startNonmovingGcSync(void);
+void      stat_endNonmovingGcSync(void);
+void      stat_startNonmovingGc (void);
+void      stat_endNonmovingGc (void);
+
 #if defined(PROFILING)
 void      stat_startRP(void);
 void      stat_endRP(uint32_t, int, double);


=====================================
rts/Updates.h
=====================================
@@ -50,22 +50,21 @@
                                                                 \
     prim_write_barrier;                                         \
     OVERWRITING_CLOSURE(p1);                                    \
-    IF_NONMOVING_WRITE_BARRIER_ENABLED {                        \
-      ccall updateRemembSetPushThunk_(BaseReg, p1 "ptr");       \
-    }                                                           \
-    StgInd_indirectee(p1) = p2;                                 \
-    prim_write_barrier;                                         \
-    SET_INFO(p1, stg_BLACKHOLE_info);                           \
-    LDV_RECORD_CREATE(p1);                                      \
     bd = Bdescr(p1);                                            \
     if (bdescr_gen_no(bd) != 0 :: bits16) {                     \
+      IF_NONMOVING_WRITE_BARRIER_ENABLED {                      \
+        ccall updateRemembSetPushThunk_(BaseReg, p1 "ptr");     \
+      }                                                         \
       recordMutableCap(p1, TO_W_(bdescr_gen_no(bd)));           \
       TICK_UPD_OLD_IND();                                       \
-      and_then;                                                 \
     } else {                                                    \
       TICK_UPD_NEW_IND();                                       \
-      and_then;                                                 \
-    }
+    }                                                           \
+    StgInd_indirectee(p1) = p2;                                 \
+    prim_write_barrier;                                         \
+    SET_INFO(p1, stg_BLACKHOLE_info);                           \
+    LDV_RECORD_CREATE(p1);                                      \
+    and_then;
 
 #else /* !CMINUSMINUS */
 
@@ -73,28 +72,26 @@ INLINE_HEADER void updateWithIndirection (Capability *cap,
                                           StgClosure *p1,
                                           StgClosure *p2)
 {
-    bdescr *bd;
-
     ASSERT( (P_)p1 != (P_)p2 );
     /* not necessarily true: ASSERT( !closure_IND(p1) ); */
     /* occurs in RaiseAsync.c:raiseAsync() */
     /* See Note [Heap memory barriers] in SMP.h */
     write_barrier();
-    OVERWRITING_CLOSURE(p1);
-    IF_NONMOVING_WRITE_BARRIER_ENABLED {
-        updateRemembSetPushThunk(cap, (StgThunk*)p1);
-    }
-    ((StgInd *)p1)->indirectee = p2;
-    write_barrier();
-    SET_INFO(p1, &stg_BLACKHOLE_info);
-    LDV_RECORD_CREATE(p1);
-    bd = Bdescr((StgPtr)p1);
+    bdescr *bd = Bdescr((StgPtr)p1);
     if (bd->gen_no != 0) {
+      IF_NONMOVING_WRITE_BARRIER_ENABLED {
+          updateRemembSetPushThunk(cap, (StgThunk*)p1);
+      }
         recordMutableCap(p1, cap, bd->gen_no);
         TICK_UPD_OLD_IND();
     } else {
         TICK_UPD_NEW_IND();
     }
+    OVERWRITING_CLOSURE(p1);
+    ((StgInd *)p1)->indirectee = p2;
+    write_barrier();
+    SET_INFO(p1, &stg_BLACKHOLE_info);
+    LDV_RECORD_CREATE(p1);
 }
 
 #endif /* CMINUSMINUS */


=====================================
rts/posix/GetTime.c
=====================================
@@ -25,18 +25,25 @@
 #error No implementation for getProcessCPUTime() available.
 #endif
 
+#if defined(darwin_HOST_OS)
+#include <mach/mach_time.h>
+#include <mach/mach_init.h>
+#include <mach/thread_act.h>
+#include <mach/mach_port.h>
+#endif
+
 #if defined(HAVE_GETTIMEOFDAY) && defined(HAVE_GETRUSAGE)
 // we'll implement getProcessCPUTime() and getProcessElapsedTime()
 // separately, using getrusage() and gettimeofday() respectively
 
-#if !defined(HAVE_CLOCK_GETTIME) && defined(darwin_HOST_OS)
+#if defined(darwin_HOST_OS)
 static uint64_t timer_scaling_factor_numer = 0;
 static uint64_t timer_scaling_factor_denom = 0;
 #endif
 
 void initializeTimer()
 {
-#if !defined(HAVE_CLOCK_GETTIME) && defined(darwin_HOST_OS)
+#if defined(darwin_HOST_OS)
     mach_timebase_info_data_t info;
     (void) mach_timebase_info(&info);
     timer_scaling_factor_numer = (uint64_t)info.numer;
@@ -44,11 +51,64 @@ void initializeTimer()
 #endif
 }
 
+#if defined(HAVE_CLOCK_GETTIME)
+static Time getClockTime(clockid_t clock)
+{
+    struct timespec ts;
+    int res = clock_gettime(clock, &ts);
+    if (res == 0) {
+        return SecondsToTime(ts.tv_sec) + NSToTime(ts.tv_nsec);
+    } else {
+        sysErrorBelch("clock_gettime");
+        stg_exit(EXIT_FAILURE);
+    }
+}
+#endif
+
+Time getCurrentThreadCPUTime(void)
+{
+    // N.B. Since macOS Catalina, Darwin supports clock_gettime but does not
+    // support clock_getcpuclockid. Hence we prefer to use the Darwin-specific
+    // path on Darwin, even if clock_gettime is available.
+#if defined(darwin_HOST_OS)
+    mach_port_t port = pthread_mach_thread_np(osThreadId());
+    thread_basic_info_data_t info = { 0 };
+    mach_msg_type_number_t info_count = THREAD_BASIC_INFO_COUNT;
+    kern_return_t kern_err = thread_info(mach_thread_self(), THREAD_BASIC_INFO,
+                                         (thread_info_t) &info, &info_count);
+    if (kern_err == KERN_SUCCESS) {
+        return SecondsToTime(info.user_time.seconds) + USToTime(info.user_time.microseconds);
+    } else {
+        sysErrorBelch("getThreadCPUTime");
+        stg_exit(EXIT_FAILURE);
+    }
+#elif defined(HAVE_CLOCK_GETTIME)        &&  \
+       defined(CLOCK_PROCESS_CPUTIME_ID) &&  \
+       defined(HAVE_SYSCONF)
+    static bool have_checked_usability = false;
+    if (!have_checked_usability) {
+        // The Linux clock_getres(2) manpage claims that some early versions of
+        // Linux will return values which are uninterpretable in the presence
+        // of migration across CPUs. They claim that clock_getcpuclockid(0)
+        // will return ENOENT in this case. Check this.
+        clockid_t clkid;
+        if (clock_getcpuclockid(0, &clkid)) {
+            sysErrorBelch("getCurrentThreadCPUTime: no supported");
+            stg_exit(EXIT_FAILURE);
+        }
+        have_checked_usability = true;
+    }
+    return getClockTime(CLOCK_THREAD_CPUTIME_ID);
+#else
+#error I know of no means to find the CPU time of current thread on this platform.
+#endif
+}
+
 Time getProcessCPUTime(void)
 {
 #if !defined(BE_CONSERVATIVE)            &&  \
        defined(HAVE_CLOCK_GETTIME)       &&  \
-       defined(_SC_CPUTIME)             &&  \
+       defined(_SC_CPUTIME)              &&  \
        defined(CLOCK_PROCESS_CPUTIME_ID) &&  \
        defined(HAVE_SYSCONF)
     static int checked_sysconf = 0;
@@ -59,15 +119,7 @@ Time getProcessCPUTime(void)
         checked_sysconf = 1;
     }
     if (sysconf_result != -1) {
-        struct timespec ts;
-        int res;
-        res = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
-        if (res == 0) {
-            return SecondsToTime(ts.tv_sec) + NSToTime(ts.tv_nsec);
-        } else {
-            sysErrorBelch("clock_gettime");
-            stg_exit(EXIT_FAILURE);
-        }
+        return getClockTime(CLOCK_PROCESS_CPUTIME_ID);
     }
 #endif
 
@@ -82,16 +134,7 @@ Time getProcessCPUTime(void)
 StgWord64 getMonotonicNSec(void)
 {
 #if defined(HAVE_CLOCK_GETTIME)
-    struct timespec ts;
-    int res;
-
-    res = clock_gettime(CLOCK_ID, &ts);
-    if (res != 0) {
-        sysErrorBelch("clock_gettime");
-        stg_exit(EXIT_FAILURE);
-    }
-    return (StgWord64)ts.tv_sec * 1000000000 +
-           (StgWord64)ts.tv_nsec;
+    return getClockTime(CLOCK_ID);
 
 #elif defined(darwin_HOST_OS)
 
@@ -102,7 +145,9 @@ StgWord64 getMonotonicNSec(void)
 
     struct timeval tv;
 
-    gettimeofday(&tv, (struct timezone *) NULL);
+    if (gettimeofday(&tv, (struct timezone *) NULL) != 0) {
+        debugBlech("getMonotonicNSec: gettimeofday failed: %s", strerror(errno));
+    };
     return (StgWord64)tv.tv_sec * 1000000000 +
            (StgWord64)tv.tv_usec * 1000;
 


=====================================
rts/sm/BlockAlloc.c
=====================================
@@ -233,6 +233,12 @@ initGroup(bdescr *head)
       last->blocks = 0;
       last->link = head;
   }
+
+#if defined(DEBUG)
+  for (uint32_t i=0; i < head->blocks; i++) {
+      head[i].flags = 0;
+  }
+#endif
 }
 
 #if SIZEOF_VOID_P == SIZEOF_LONG
@@ -792,6 +798,12 @@ freeGroup(bdescr *p)
 
   ASSERT(p->free != (P_)-1);
 
+#if defined(DEBUG)
+  for (uint32_t i=0; i < p->blocks; i++) {
+      p[i].flags = 0;
+  }
+#endif
+
   node = p->node;
 
   p->free = (void *)-1;  /* indicates that this block is free */


=====================================
rts/sm/Evac.c
=====================================
@@ -80,16 +80,15 @@ alloc_for_copy (uint32_t size, uint32_t gen_no)
     if (gen_no < gct->evac_gen_no) {
         if (gct->eager_promotion) {
             gen_no = gct->evac_gen_no;
+        } else if (RTS_UNLIKELY(RtsFlags.GcFlags.useNonmoving) && deadlock_detect_gc) {
+            /* See Note [Deadlock detection under nonmoving collector]. */
+            gen_no = oldest_gen->no;
         } else {
             gct->failed_to_evac = true;
         }
     }
 
     if (RTS_UNLIKELY(RtsFlags.GcFlags.useNonmoving)) {
-        /* See Note [Deadlock detection under nonmoving collector]. */
-        if (deadlock_detect_gc)
-            gen_no = oldest_gen->no;
-
         if (gen_no == oldest_gen->no) {
             gct->copied += size;
             to = nonmovingAllocate(gct->cap, size);


=====================================
rts/sm/GC.c
=====================================
@@ -209,6 +209,14 @@ GarbageCollect (uint32_t collect_gen,
   gc_thread *saved_gct;
 #endif
   uint32_t g, n;
+  // The time we should report our heap census as occurring at, if necessary.
+  Time mut_time = 0;
+
+  if (do_heap_census) {
+      RTSStats stats;
+      getRTSStats(&stats);
+      mut_time = stats.mutator_cpu_ns;
+  }
 
   // necessary if we stole a callee-saves register for gct:
 #if defined(THREADED_RTS)
@@ -730,11 +738,13 @@ GarbageCollect (uint32_t collect_gen,
     }
   } // for all generations
 
-  // Flush the update remembered set. See Note [Eager update remembered set
+  // Flush the update remembered sets. See Note [Eager update remembered set
   // flushing] in NonMovingMark.c
   if (RtsFlags.GcFlags.useNonmoving) {
       RELEASE_SM_LOCK;
-      nonmovingAddUpdRemSetBlocks(&gct->cap->upd_rem_set.queue);
+      for (n = 0; n < n_capabilities; n++) {
+          nonmovingAddUpdRemSetBlocks(&capabilities[n]->upd_rem_set.queue);
+      }
       ACQUIRE_SM_LOCK;
   }
 
@@ -846,7 +856,7 @@ GarbageCollect (uint32_t collect_gen,
   if (do_heap_census) {
       debugTrace(DEBUG_sched, "performing heap census");
       RELEASE_SM_LOCK;
-      heapCensus(gct->gc_start_cpu);
+      heapCensus(mut_time);
       ACQUIRE_SM_LOCK;
   }
 
@@ -927,9 +937,11 @@ GarbageCollect (uint32_t collect_gen,
 #endif
 
   // ok, GC over: tell the stats department what happened.
+  stat_endGCWorker(cap, gct);
   stat_endGC(cap, gct, live_words, copied,
              live_blocks * BLOCK_SIZE_W - live_words /* slop */,
-             N, n_gc_threads, par_max_copied, par_balanced_copied,
+             N, n_gc_threads, gc_threads,
+             par_max_copied, par_balanced_copied,
              gc_spin_spin, gc_spin_yield, mut_spin_spin, mut_spin_yield,
              any_work, no_work, scav_find_work);
 
@@ -1209,6 +1221,7 @@ gcWorkerThread (Capability *cap)
 
     SET_GCT(gc_threads[cap->no]);
     gct->id = osThreadId();
+    stat_startGCWorker (cap, gct);
 
     // Wait until we're told to wake up
     RELEASE_SPIN_LOCK(&gct->mut_spin);
@@ -1247,6 +1260,7 @@ gcWorkerThread (Capability *cap)
     gct->wakeup = GC_THREAD_WAITING_TO_CONTINUE;
     debugTrace(DEBUG_gc, "GC thread %d waiting to continue...",
                gct->thread_index);
+    stat_endGCWorker (cap, gct);
     ACQUIRE_SPIN_LOCK(&gct->mut_spin);
     debugTrace(DEBUG_gc, "GC thread %d on my way...", gct->thread_index);
 


=====================================
rts/sm/GCThread.h
=====================================
@@ -185,9 +185,11 @@ typedef struct gc_thread_ {
     W_ no_work;
     W_ scav_find_work;
 
-    Time gc_start_cpu;   // process CPU time
-    Time gc_sync_start_elapsed;  // start of GC sync
-    Time gc_start_elapsed;  // process elapsed time
+    Time gc_start_cpu;             // thread CPU time
+    Time gc_end_cpu;               // thread CPU time
+    Time gc_sync_start_elapsed;    // start of GC sync
+    Time gc_start_elapsed;         // process elapsed time
+    Time gc_end_elapsed;           // process elapsed time
     W_ gc_start_faults;
 
     // -------------------


=====================================
rts/sm/NonMoving.c
=====================================
@@ -18,6 +18,7 @@
 #include "GCThread.h"
 #include "GCTDecl.h"
 #include "Schedule.h"
+#include "Stats.h"
 
 #include "NonMoving.h"
 #include "NonMovingMark.h"
@@ -227,6 +228,10 @@ Mutex concurrent_coll_finished_lock;
  *  - Note [Static objects under the nonmoving collector] (Storage.c) describes
  *    treatment of static objects.
  *
+ *  - Note [Dirty flags in the non-moving collector] (NonMoving.c) describes
+ *    how we use the DIRTY flags associated with MUT_VARs and TVARs to improve
+ *    barrier efficiency.
+ *
  *
  * Note [Concurrent non-moving collection]
  * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -368,6 +373,7 @@ Mutex concurrent_coll_finished_lock;
  * approximate due to concurrent collection and ultimately seems more costly
  * than the problem demands.
  *
+ *
  * Note [Spark management under the nonmoving collector]
  * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  * Every GC, both minor and major, prunes the spark queue (using
@@ -386,6 +392,88 @@ Mutex concurrent_coll_finished_lock;
  *    BF_EVACUATED flag won't be set on the nursery blocks) and will consequently
  *    only prune dead sparks living in the non-moving heap.
  *
+ *
+ * Note [Dirty flags in the non-moving collector]
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * Some mutable object types (e.g. MUT_VARs, TVARs) have a one-bit dirty flag
+ * encoded in their info table pointer. The moving collector's uses this flag
+ * to minimize redundant mut_list entries. The flag is preserves the following
+ * simple invariant:
+ *
+ *     An object being marked as dirty implies that the object is on mut_list.
+ *
+ * This allows a nice optimisation in the write barrier (e.g. dirty_MUT_VAR):
+ * if we write to an already-dirty object there is no need to
+ * push it to the mut_list as we know it's already there.
+ *
+ * During GC (scavenging) we will then keep track of whether all of the
+ * object's reference have been promoted. If so we can mark the object as clean.
+ * If not then we re-add it to mut_list and mark it as dirty.
+ *
+ * In the non-moving collector we use the same dirty flag to implement a
+ * related optimisation on the non-moving write barrier: Specifically, the
+ * snapshot invariant only requires that the non-moving write barrier applies
+ * to the *first* mutation to an object after collection begins. To achieve this,
+ * we impose the following invariant:
+ *
+ *     An object being marked as dirty implies that all of its fields are on
+ *     the mark queue (or, equivalently, update remembered set).
+ *
+ * With this guarantee we can safely make the the write barriers dirty objects
+ * no-ops. We perform this optimisation for the following object types:
+ *
+ *  - MVAR
+ *  - TVAR
+ *  - MUT_VAR
+ *
+ * However, maintaining this invariant requires great care. For instance,
+ * consider the case of an MVar (which has two pointer fields) before
+ * preparatory collection:
+ *
+ *    Non-moving heap     ┊      Moving heap
+ *         gen 1          ┊         gen 0
+ *  ──────────────────────┼────────────────────────────────
+ *                        ┊
+ *         MVAR A  ────────────────→ X
+ *        (dirty)  ───────────╮
+ *                        ┊   ╰────→ Y
+ *                        ┊          │
+ *                        ┊          │
+ *           ╭───────────────────────╯
+ *           │            ┊
+ *           ↓            ┊
+ *           Z            ┊
+ *                        ┊
+ *
+ * During the preparatory collection we promote Y to the nonmoving heap but
+ * fail to promote X. Since the failed_to_evac field is conservative (being set
+ * if *any* of the fields are not promoted), this gives us:
+ *
+ *    Non-moving heap     ┊      Moving heap
+ *         gen 1          ┊         gen 0
+ *  ──────────────────────┼────────────────────────────────
+ *                        ┊
+ *         MVAR A  ────────────────→ X
+ *        (dirty)         ┊
+ *           │            ┊
+ *           │            ┊
+ *           ↓            ┊
+ *           Y            ┊
+ *           │            ┊
+ *           │            ┊
+ *           ↓            ┊
+ *           Z            ┊
+ *                        ┊
+ *
+ * This is bad. When we resume mutation a mutator may mutate MVAR A; since it's
+ * already dirty we would fail to add Y to the update remembered set, breaking the
+ * snapshot invariant and potentially losing track of the liveness of Z.
+ *
+ * To avoid this nonmovingScavengeOne we eagerly pushes the values of the
+ * fields of all objects which it fails to evacuate (e.g. MVAR A) to the update
+ * remembered set during the preparatory GC. This allows us to safely skip the
+ * non-moving write barrier without jeopardizing the snapshot invariant.
+ *
  */
 
 memcount nonmoving_live_words = 0;
@@ -401,10 +489,10 @@ static void nonmovingInitSegment(struct NonmovingSegment *seg, uint8_t log_block
     seg->link = NULL;
     seg->todo_link = NULL;
     seg->next_free = 0;
-    nonmovingClearBitmap(seg);
     bd->nonmoving_segment.log_block_size = log_block_size;
     bd->nonmoving_segment.next_free_snap = 0;
     bd->u.scan = nonmovingSegmentGetBlock(seg, 0);
+    nonmovingClearBitmap(seg);
 }
 
 // Add a segment to the free list.
@@ -951,6 +1039,7 @@ static void nonmovingMark_(MarkQueue *mark_queue, StgWeak **dead_weaks, StgTSO *
 {
     ACQUIRE_LOCK(&nonmoving_collection_mutex);
     debugTrace(DEBUG_nonmoving_gc, "Starting mark...");
+    stat_startNonmovingGc();
 
     // Walk the list of filled segments that we collected during preparation,
     // updated their snapshot pointers and move them to the sweep list.
@@ -1132,6 +1221,7 @@ finish:
 
     // We are done...
     mark_thread = 0;
+    stat_endNonmovingGc();
 
     // Signal that the concurrent collection is finished, allowing the next
     // non-moving collection to proceed


=====================================
rts/sm/NonMovingMark.c
=====================================
@@ -21,11 +21,13 @@
 #include "Printer.h"
 #include "Schedule.h"
 #include "Weak.h"
+#include "Stats.h"
 #include "STM.h"
 #include "MarkWeak.h"
 #include "sm/Storage.h"
 #include "CNF.h"
 
+static bool check_in_nonmoving_heap(StgClosure *p);
 static void mark_closure (MarkQueue *queue, const StgClosure *p, StgClosure **origin);
 static void mark_tso (MarkQueue *queue, StgTSO *tso);
 static void mark_stack (MarkQueue *queue, StgStack *stack);
@@ -316,6 +318,7 @@ void nonmovingBeginFlush(Task *task)
     debugTrace(DEBUG_nonmoving_gc, "Starting update remembered set flush...");
     traceConcSyncBegin();
     upd_rem_set_flush_count = 0;
+    stat_startNonmovingGcSync();
     stopAllCapabilitiesWith(NULL, task, SYNC_FLUSH_UPD_REM_SET);
 
     // XXX: We may have been given a capability via releaseCapability (i.e. a
@@ -407,6 +410,7 @@ void nonmovingFinishFlush(Task *task)
 
     debugTrace(DEBUG_nonmoving_gc, "Finished update remembered set flush...");
     traceConcSyncEnd();
+    stat_endNonmovingGcSync();
     releaseAllCapabilities(n_capabilities, NULL, task);
 }
 #endif
@@ -447,10 +451,17 @@ push (MarkQueue *q, const MarkQueueEnt *ent)
 void
 markQueuePushClosureGC (MarkQueue *q, StgClosure *p)
 {
+    if (!check_in_nonmoving_heap(p)) {
+        return;
+    }
+
     /* We should not make it here if we are doing a deadlock detect GC.
      * See Note [Deadlock detection under nonmoving collector].
+     * This is actually no longer true due to call in nonmovingScavengeOne
+     * introduced due to Note [Dirty flags in the non-moving collector]
+     * (see NonMoving.c).
      */
-    ASSERT(!deadlock_detect_gc);
+    //ASSERT(!deadlock_detect_gc);
 
     // Are we at the end of the block?
     if (q->top->head == MARK_QUEUE_BLOCK_ENTRIES) {


=====================================
rts/sm/NonMovingScav.c
=====================================
@@ -31,6 +31,11 @@ nonmovingScavengeOne (StgClosure *q)
         gct->eager_promotion = saved_eager_promotion;
         if (gct->failed_to_evac) {
             mvar->header.info = &stg_MVAR_DIRTY_info;
+
+            // Note [Dirty flags in the non-moving collector] in NonMoving.c
+            markQueuePushClosureGC(&gct->cap->upd_rem_set.queue, (StgClosure *) mvar->head);
+            markQueuePushClosureGC(&gct->cap->upd_rem_set.queue, (StgClosure *) mvar->tail);
+            markQueuePushClosureGC(&gct->cap->upd_rem_set.queue, (StgClosure *) mvar->value);
         } else {
             mvar->header.info = &stg_MVAR_CLEAN_info;
         }
@@ -46,6 +51,10 @@ nonmovingScavengeOne (StgClosure *q)
         gct->eager_promotion = saved_eager_promotion;
         if (gct->failed_to_evac) {
             tvar->header.info = &stg_TVAR_DIRTY_info;
+
+            // Note [Dirty flags in the non-moving collector] in NonMoving.c
+            markQueuePushClosureGC(&gct->cap->upd_rem_set.queue, (StgClosure *) tvar->current_value);
+            markQueuePushClosureGC(&gct->cap->upd_rem_set.queue, (StgClosure *) tvar->first_watch_queue_entry);
         } else {
             tvar->header.info = &stg_TVAR_CLEAN_info;
         }
@@ -160,16 +169,21 @@ nonmovingScavengeOne (StgClosure *q)
     }
 
     case MUT_VAR_CLEAN:
-    case MUT_VAR_DIRTY:
+    case MUT_VAR_DIRTY: {
+        StgMutVar *mv = (StgMutVar *) p;
         gct->eager_promotion = false;
-        evacuate(&((StgMutVar *)p)->var);
+        evacuate(&mv->var);
         gct->eager_promotion = saved_eager_promotion;
         if (gct->failed_to_evac) {
             ((StgClosure *)q)->header.info = &stg_MUT_VAR_DIRTY_info;
+
+            // Note [Dirty flags in the non-moving collector] in NonMoving.c
+            markQueuePushClosureGC(&gct->cap->upd_rem_set.queue, (StgClosure *) mv->var);
         } else {
             ((StgClosure *)q)->header.info = &stg_MUT_VAR_CLEAN_info;
         }
         break;
+    }
 
     case BLOCKING_QUEUE:
     {


=====================================
rts/sm/NonMovingSweep.c
=====================================
@@ -30,12 +30,11 @@ enum SweepResult {
 GNUC_ATTR_HOT static enum SweepResult
 nonmovingSweepSegment(struct NonmovingSegment *seg)
 {
+    const nonmoving_block_idx blk_cnt = nonmovingSegmentBlockCount(seg);
     bool found_free = false;
     bool found_live = false;
 
-    for (nonmoving_block_idx i = 0;
-         i < nonmovingSegmentBlockCount(seg);
-         ++i)
+    for (nonmoving_block_idx i = 0; i < blk_cnt; ++i)
     {
         if (seg->bitmap[i] == nonmovingMarkEpoch) {
             found_live = true;


=====================================
rts/sm/Storage.c
=====================================
@@ -1206,6 +1206,7 @@ dirty_MUT_VAR(StgRegTable *reg, StgMutVar *mvar, StgClosure *old)
         mvar->header.info = &stg_MUT_VAR_DIRTY_info;
         recordClosureMutated(cap, (StgClosure *) mvar);
         IF_NONMOVING_WRITE_BARRIER_ENABLED {
+            // See Note [Dirty flags in the non-moving collector] in NonMoving.c
             updateRemembSetPushClosure_(reg, old);
         }
     }
@@ -1228,6 +1229,7 @@ dirty_TVAR(Capability *cap, StgTVar *p,
         p->header.info = &stg_TVAR_DIRTY_info;
         recordClosureMutated(cap,(StgClosure*)p);
         IF_NONMOVING_WRITE_BARRIER_ENABLED {
+            // See Note [Dirty flags in the non-moving collector] in NonMoving.c
             updateRemembSetPushClosure(cap, old);
         }
     }
@@ -1309,6 +1311,7 @@ update_MVAR(StgRegTable *reg, StgClosure *p, StgClosure *old_val)
 {
     Capability *cap = regTableToCapability(reg);
     IF_NONMOVING_WRITE_BARRIER_ENABLED {
+        // See Note [Dirty flags in the non-moving collector] in NonMoving.c
         StgMVar *mvar = (StgMVar *) p;
         updateRemembSetPushClosure(cap, old_val);
         updateRemembSetPushClosure(cap, (StgClosure *) mvar->head);


=====================================
rts/win32/GetTime.c
=====================================
@@ -34,6 +34,20 @@ getProcessTimes(Time *user, Time *elapsed)
     *elapsed = getProcessElapsedTime();
 }
 
+Time
+getCurrentThreadCPUTime(void)
+{
+    FILETIME creationTime, exitTime, userTime, kernelTime = {0,0};
+
+    if (!GetThreadTimes(GetCurrentThread(), &creationTime,
+                        &exitTime, &kernelTime, &userTime)) {
+        sysErrorBelch("getCurrentThreadCPUTime: Win32 error %lu", GetLastError());
+        return 0;
+    }
+
+    return fileTimeToRtsTime(userTime);
+}
+
 Time
 getProcessCPUTime(void)
 {



View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/8ba7aacd3fc882715100788a8ac8f06b6cf914dd...b9df14c49bd6b70e403559e9a70669cc50d7e2f8

-- 
View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/8ba7aacd3fc882715100788a8ac8f06b6cf914dd...b9df14c49bd6b70e403559e9a70669cc50d7e2f8
You're receiving this email because of your account on gitlab.haskell.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-commits/attachments/20200514/45e39509/attachment-0001.html>