[Git][ghc/ghc][wip/T7275] 18 commits: CodeGen: Make folds User/DefinerOfRegs INLINEABLE.

Mon Dec 14 15:20:28 UTC 2020

Ben Gamari pushed to branch wip/T7275 at Glasgow Haskell Compiler / GHC


Commits:
51e3bb6d by Andreas Klebinger at 2020-12-08T22:43:21-05:00
CodeGen: Make folds User/DefinerOfRegs INLINEABLE.

Reduces allocation for the test case I was looking at by about 1.2%.
Mostly from avoiding allocation of some folding functions which turn
into let-no-escape bindings which just reuse their environment instead.

We also force inlining in a few key places in CmmSink which helps a bit
more.

- - - - -
69ae10c3 by Andreas Klebinger at 2020-12-08T22:43:21-05:00
CmmSink: Force inlining of foldRegsDefd

Helps avoid allocating the folding function. Improves
perf for T3294 by about 1%.

- - - - -
6e3da800 by Andreas Klebinger at 2020-12-08T22:43:21-05:00
Cmm: Make a few types and utility function slightly stricter.

About 0.6% reduction in allocations for the code I was looking at.

Not a huge difference but no need to throw away performance.

- - - - -
aef44d7f by Andreas Klebinger at 2020-12-08T22:43:21-05:00
Cmm.Sink: Optimize retaining of assignments, live sets.

Sinking requires us to track live local regs after each
cmm statement. We used to do this via "Set LocalReg".

However we can replace this with a solution based on IntSet
which is overall more efficient without losing much. The thing
we lose is width of the variables, which isn't used by the sinking
pass anyway.

I also reworked how we keep assignments to regs mentioned in
skipped assignments. I put the details into
Note [Keeping assignemnts mentioned in skipped RHSs].

The gist of it is instead of keeping track of it via the use count
which is a `IntMap Int` we now use the live regs set (IntSet) which
is quite a bit faster.

I think it also matches the semantics a lot better. The skipped
(not discarded) assignment does in fact keep the regs on it's rhs
alive so keeping track of this in the live set seems like the clearer
solution as well.

Improves allocations for T3294 by yet another 1%.

- - - - -
59f2249b by Andreas Klebinger at 2020-12-08T22:43:21-05:00
GHC.Cmm.Opt: Be stricter in results.

Optimization either returns Nothing if nothing is to be done or
`Just <cmmExpr>` otherwise. There is no point in being lazy in
`cmmExpr`. We usually inspect this element so the thunk gets forced
not long after.

We might eliminate it as dead code once in a blue moon but that's
not a case worth optimizing for.

Overall the impact of this is rather low. As Cmm.Opt doesn't allocate
much (compared to the rest of GHC) to begin with.

- - - - -
54b88eac by Andreas Klebinger at 2020-12-08T22:43:57-05:00
Bump time submodule.

This should fix #19002.

- - - - -
35e7b0c6 by Kirill Elagin at 2020-12-10T01:45:54-05:00
doc: Clarify the default for -fomit-yields

“Yield points enabled” is confusing (and probably wrong?
I am not 100% sure what it means). Change it to a simple “on”.

Undo this change from 2c23fff2e03e77187dc4d01f325f5f43a0e7cad2.
- - - - -
3551c554 by Kirill Elagin at 2020-12-10T01:45:54-05:00
doc: Extra-clarify -fomit-yields

Be more clear on what this optimisation being on by default means
in terms of yields.
- - - - -
6484f0d7 by Sergei Trofimovich at 2020-12-10T01:46:33-05:00
rts/linker/Elf.c: add missing <dlfcn.h> include (musl support)

The change fixes build failure on musl:

```
rts/linker/Elf.c:2031:3: error:
     warning: implicit declaration of function 'dlclose'; did you mean 'close'? [-Wimplicit-function-declaration]
     2031 |   dlclose(nc->dlopen_handle);
          |   ^~~~~~~
          |   close
```

Signed-off-by: Sergei Trofimovich <slyfox at gentoo.org>

- - - - -
ab24ed9b by Ben Gamari at 2020-12-11T03:55:51-05:00
users guide: Fix syntax errors

Fixes errors introduced by 3a55b3a2574f913d046f3a6f82db48d7f6df32e3.

- - - - -
d3a24d31 by Ben Gamari at 2020-12-11T03:55:51-05:00
users guide: Describe GC lifecycle events

Every time I am asked about how to interpret these events I need to
figure it out from scratch. It's well past time that the users guide
properly documents these.

- - - - -
741309b9 by Ben Gamari at 2020-12-11T03:56:27-05:00
gitlab-ci: Fix incorrect Docker image for nightly cross job

Also refactor the job definition to eliminate the bug by construction.

- - - - -
19703bc8 by Ben Gamari at 2020-12-11T03:56:27-05:00
gitlab-ci: Fix name of flavour in ThreadSanitizer job

It looks like I neglected to update this after introduce flavour
transformers.

- - - - -
ed82de5e by Ben Gamari at 2020-12-14T10:20:06-05:00
rts: Break up census logic

Move the logic for taking censuses of "normal" and pinned blocks to
their own functions.

- - - - -
c7f8e22c by Ben Gamari at 2020-12-14T10:20:06-05:00
rts: Implement heap census support for pinned objects

It turns out that this was fairly straightforward to implement since we
are now pretty careful about zeroing slop.

- - - - -
2c1f2e49 by Ben Gamari at 2020-12-14T10:20:06-05:00
Storage: Unconditionally enable zeroing of alignment slop

This is necessary since the user may enable `+RTS -hT` at any time.

- - - - -
370c9f41 by Ben Gamari at 2020-12-14T10:20:06-05:00
rts: Zero shrunk array slop in vanilla RTS

But only when profiling or DEBUG are enabled.

Fixes #17572.

- - - - -
ff664f40 by Ben Gamari at 2020-12-14T10:20:06-05:00
rts: Enforce that mark-region isn't used with -h

As noted in #9666, the mark-region GC is not compatible with heap
profiling. Also add documentation for this flag.

Closes #9666.

- - - - -


21 changed files:

- .gitlab-ci.yml
- compiler/GHC/Cmm/Expr.hs
- + compiler/GHC/Cmm/LRegSet.hs
- compiler/GHC/Cmm/Liveness.hs
- compiler/GHC/Cmm/Node.hs
- compiler/GHC/Cmm/Opt.hs
- compiler/GHC/Cmm/Sink.hs
- compiler/GHC/Cmm/Utils.hs
- compiler/GHC/Utils/Outputable.hs
- compiler/ghc.cabal.in
- docs/users_guide/eventlog-formats.rst
- docs/users_guide/runtime_control.rst
- docs/users_guide/using-optimisation.rst
- includes/Cmm.h
- libraries/time
- rts/ProfHeap.c
- rts/RtsFlags.c
- rts/Stats.c
- rts/linker/Elf.c
- rts/sm/Storage.c
- utils/deriveConstants/Main.hs


Changes:

=====================================
.gitlab-ci.yml
=====================================
@@ -257,27 +257,24 @@ validate-x86_64-linux-deb9-unreg-hadrian:
     CONFIGURE_ARGS: --enable-unregisterised
     TEST_ENV: "x86_64-linux-deb9-unreg-hadrian"
 
-validate-x86_64-linux-deb10-hadrian-cross-aarch64:
-  <<: *nightly
+.build-x86_64-linux-deb10-hadrian-cross-aarch64:
   extends: .validate-linux-hadrian
-  stage: full-build
   image: "registry.gitlab.haskell.org/ghc/ci-images/x86_64-linux-deb10:$DOCKER_REV"
   variables:
     BIN_DIST_NAME: "ghc-x86_64-deb9-linux"
-  rules:
-    - if: '$CI_MERGE_REQUEST_LABELS =~ /.*cross-compilation.*/'
-  variables:
     CONFIGURE_ARGS: --with-intree-gmp
     CROSS_TARGET: "aarch64-linux-gnu"
 
+validate-x86_64-linux-deb10-hadrian-cross-aarch64:
+  extends: .build-x86_64-linux-deb10-hadrian-cross-aarch64
+  stage: full-build
+  rules:
+    - if: '$CI_MERGE_REQUEST_LABELS =~ /.*cross-compilation.*/'
+
 nightly-x86_64-linux-deb10-hadrian-cross-aarch64:
   <<: *nightly
-  extends: .validate-linux-hadrian
+  extends: .build-x86_64-linux-deb10-hadrian-cross-aarch64
   stage: full-build
-  variables:
-    CONFIGURE_ARGS: --with-intree-gmp
-    CROSS_TARGET: "aarch64-linux-gnu"
-
 
 
 ############################################################
@@ -712,7 +709,7 @@ nightly-x86_64-linux-deb9-integer-simple:
   stage: full-build
   variables:
     TEST_ENV: "x86_64-linux-deb9-tsan"
-    BUILD_FLAVOUR: "thread-sanitizer"
+    BUILD_FLAVOUR: "default+thread_sanitizer"
     TSAN_OPTIONS: "suppressions=$CI_PROJECT_DIR/rts/.tsan-suppressions"
     # Haddock is large enough to make TSAN choke without massive quantities of
     # memory.


=====================================
compiler/GHC/Cmm/Expr.hs
=====================================
@@ -53,14 +53,14 @@ import GHC.Types.Basic (Alignment, mkAlignment, alignmentOf)
 -----------------------------------------------------------------------------
 
 data CmmExpr
-  = CmmLit CmmLit               -- Literal
+  = CmmLit !CmmLit               -- Literal
   | CmmLoad !CmmExpr !CmmType   -- Read memory location
   | CmmReg !CmmReg              -- Contents of register
   | CmmMachOp MachOp [CmmExpr]  -- Machine operation (+, -, *, etc.)
   | CmmStackSlot Area {-# UNPACK #-} !Int
                                 -- addressing expression of a stack slot
                                 -- See Note [CmmStackSlot aliasing]
-  | CmmRegOff !CmmReg Int
+  | CmmRegOff !CmmReg !Int
         -- CmmRegOff reg i
         --        ** is shorthand only, meaning **
         -- CmmMachOp (MO_Add rep) [x, CmmLit (CmmInt (fromIntegral i) rep)]
@@ -173,16 +173,16 @@ Now, the assignments of y go away,
 -}
 
 data CmmLit
-  = CmmInt !Integer  Width
+  = CmmInt !Integer  !Width
         -- Interpretation: the 2's complement representation of the value
         -- is truncated to the specified size.  This is easier than trying
         -- to keep the value within range, because we don't know whether
         -- it will be used as a signed or unsigned value (the CmmType doesn't
         -- distinguish between signed & unsigned).
-  | CmmFloat  Rational Width
+  | CmmFloat  Rational !Width
   | CmmVec [CmmLit]                     -- Vector literal
   | CmmLabel    CLabel                  -- Address of label
-  | CmmLabelOff CLabel Int              -- Address of label + byte offset
+  | CmmLabelOff CLabel !Int              -- Address of label + byte offset
 
         -- Due to limitations in the C backend, the following
         -- MUST ONLY be used inside the info table indicated by label2
@@ -191,7 +191,7 @@ data CmmLit
         -- Don't use it at all unless tablesNextToCode.
         -- It is also used inside the NCG during when generating
         -- position-independent code.
-  | CmmLabelDiffOff CLabel CLabel Int Width -- label1 - label2 + offset
+  | CmmLabelDiffOff CLabel CLabel !Int !Width -- label1 - label2 + offset
         -- In an expression, the width just has the effect of MO_SS_Conv
         -- from wordWidth to the desired width.
         --
@@ -363,6 +363,7 @@ instance DefinerOfRegs LocalReg CmmReg where
     foldRegsDefd _ _ z (CmmGlobal _)  = z
 
 instance UserOfRegs GlobalReg CmmReg where
+    {-# INLINEABLE foldRegsUsed #-}
     foldRegsUsed _ _ z (CmmLocal _)    = z
     foldRegsUsed _ f z (CmmGlobal reg) = f z reg
 
@@ -379,6 +380,7 @@ instance Ord r => DefinerOfRegs r r where
 instance (Ord r, UserOfRegs r CmmReg) => UserOfRegs r CmmExpr where
   -- The (Ord r) in the context is necessary here
   -- See Note [Recursive superclasses] in GHC.Tc.TyCl.Instance
+  {-# INLINEABLE foldRegsUsed #-}
   foldRegsUsed platform f !z e = expr z e
     where expr z (CmmLit _)          = z
           expr z (CmmLoad addr _)    = foldRegsUsed platform f z addr


=====================================
compiler/GHC/Cmm/LRegSet.hs
=====================================
@@ -0,0 +1,53 @@
+{-# LANGUAGE GADTs #-}
+{-# LANGUAGE ScopedTypeVariables #-}
+
+module GHC.Cmm.LRegSet (
+    LRegSet,
+    LRegKey,
+
+    emptyLRegSet,
+    nullLRegSet,
+    insertLRegSet,
+    elemLRegSet,
+
+    deleteFromLRegSet,
+    sizeLRegSet,
+
+    plusLRegSet,
+    elemsLRegSet
+  ) where
+
+import GHC.Prelude
+import GHC.Types.Unique
+import GHC.Cmm.Expr
+
+import Data.IntSet as IntSet
+
+-- Compact sets for membership tests of local variables.
+
+type LRegSet = IntSet.IntSet
+type LRegKey = Int
+
+emptyLRegSet :: LRegSet
+emptyLRegSet = IntSet.empty
+
+nullLRegSet :: LRegSet -> Bool
+nullLRegSet = IntSet.null
+
+insertLRegSet :: LocalReg -> LRegSet -> LRegSet
+insertLRegSet l = IntSet.insert (getKey (getUnique l))
+
+elemLRegSet :: LocalReg -> LRegSet -> Bool
+elemLRegSet l = IntSet.member (getKey (getUnique l))
+
+deleteFromLRegSet :: LRegSet -> LocalReg -> LRegSet
+deleteFromLRegSet set reg = IntSet.delete (getKey . getUnique $ reg) set
+
+sizeLRegSet :: IntSet -> Int
+sizeLRegSet = IntSet.size
+
+plusLRegSet :: IntSet -> IntSet -> IntSet
+plusLRegSet = IntSet.union
+
+elemsLRegSet :: IntSet -> [Int]
+elemsLRegSet = IntSet.toList


=====================================
compiler/GHC/Cmm/Liveness.hs
=====================================
@@ -6,9 +6,12 @@
 module GHC.Cmm.Liveness
     ( CmmLocalLive
     , cmmLocalLiveness
+    , cmmLocalLivenessL
     , cmmGlobalLiveness
     , liveLattice
+    , liveLatticeL
     , gen_kill
+    , gen_killL
     )
 where
 
@@ -22,11 +25,14 @@ import GHC.Cmm.Dataflow.Block
 import GHC.Cmm.Dataflow.Collections
 import GHC.Cmm.Dataflow
 import GHC.Cmm.Dataflow.Label
+import GHC.Cmm.LRegSet
 
 import GHC.Data.Maybe
 import GHC.Utils.Outputable
 import GHC.Utils.Panic
 
+import GHC.Types.Unique
+
 -----------------------------------------------------------------------------
 -- Calculating what variables are live on entry to a basic block
 -----------------------------------------------------------------------------
@@ -92,3 +98,66 @@ xferLive platform (BlockCC eNode middle xNode) fBase =
     in mapSingleton (entryLabel eNode) result
 {-# SPECIALIZE xferLive :: Platform -> TransferFun (CmmLive LocalReg) #-}
 {-# SPECIALIZE xferLive :: Platform -> TransferFun (CmmLive GlobalReg) #-}
+
+-----------------------------------------------------------------------------
+-- | Specialization that only retains the keys for local variables.
+--
+-- Local variablas are mostly glorified Ints, and some parts of the compiler
+-- really don't care about anything but the Int part. So we can avoid some
+-- overhead by computing a IntSet instead of a Set LocalReg which (unsurprisingly)
+-- is quite a bit faster.
+-----------------------------------------------------------------------------
+
+type BlockEntryLivenessL  = LabelMap LRegSet
+
+-- | The dataflow lattice
+liveLatticeL :: DataflowLattice LRegSet
+liveLatticeL = DataflowLattice emptyLRegSet add
+  where
+    add (OldFact old) (NewFact new) =
+        let !join = plusLRegSet old new
+        in changedIf (sizeLRegSet join > sizeLRegSet old) join
+
+
+cmmLocalLivenessL :: Platform -> CmmGraph -> BlockEntryLivenessL
+cmmLocalLivenessL platform graph =
+    check $ analyzeCmmBwd liveLatticeL (xferLiveL platform) graph mapEmpty
+  where
+    entry = g_entry graph
+    check facts =
+        noLiveOnEntryL entry (expectJust "check" $ mapLookup entry facts) facts
+
+-- | On entry to the procedure, there had better not be any LocalReg's live-in.
+noLiveOnEntryL :: BlockId -> LRegSet -> a -> a
+noLiveOnEntryL bid in_fact x =
+  if nullLRegSet in_fact then x
+  else pprPanic "LocalReg's live-in to graph" (ppr bid <+> ppr reg_uniques)
+    where
+        -- We convert the int's to uniques so that the printing matches that
+        -- of registers.
+        reg_uniques = map mkUniqueGrimily $ elemsLRegSet in_fact
+
+
+
+
+gen_killL
+    :: (DefinerOfRegs LocalReg n, UserOfRegs LocalReg n)
+    => Platform -> n -> LRegSet -> LRegSet
+gen_killL platform node set =
+    let !afterKill = foldRegsDefd platform deleteFromLRegSet set node
+    in foldRegsUsed platform (flip insertLRegSet) afterKill node
+{-# INLINE gen_killL #-}
+
+xferLiveL
+    :: ( UserOfRegs LocalReg (CmmNode O O)
+       , DefinerOfRegs LocalReg (CmmNode O O)
+       , UserOfRegs LocalReg (CmmNode O C)
+       , DefinerOfRegs LocalReg (CmmNode O C)
+       )
+    => Platform -> TransferFun LRegSet
+xferLiveL platform (BlockCC eNode middle xNode) fBase =
+    let joined = gen_killL platform xNode $! joinOutFacts liveLatticeL xNode fBase
+        !result = foldNodesBwdOO (gen_killL platform) middle joined
+    in mapSingleton (entryLabel eNode) result
+
+


=====================================
compiler/GHC/Cmm/Node.hs
=====================================
@@ -318,6 +318,7 @@ foreignTargetHints target
 -- Instances of register and slot users / definers
 
 instance UserOfRegs LocalReg (CmmNode e x) where
+  {-# INLINEABLE foldRegsUsed #-}
   foldRegsUsed platform f !z n = case n of
     CmmAssign _ expr -> fold f z expr
     CmmStore addr rval -> fold f (fold f z addr) rval
@@ -332,6 +333,7 @@ instance UserOfRegs LocalReg (CmmNode e x) where
           fold f z n = foldRegsUsed platform f z n
 
 instance UserOfRegs GlobalReg (CmmNode e x) where
+  {-# INLINEABLE foldRegsUsed #-}
   foldRegsUsed platform f !z n = case n of
     CmmAssign _ expr -> fold f z expr
     CmmStore addr rval -> fold f (fold f z addr) rval
@@ -348,10 +350,12 @@ instance UserOfRegs GlobalReg (CmmNode e x) where
 instance (Ord r, UserOfRegs r CmmReg) => UserOfRegs r ForeignTarget where
   -- The (Ord r) in the context is necessary here
   -- See Note [Recursive superclasses] in GHC.Tc.TyCl.Instance
+  {-# INLINEABLE foldRegsUsed #-}
   foldRegsUsed _        _ !z (PrimTarget _)      = z
   foldRegsUsed platform f !z (ForeignTarget e _) = foldRegsUsed platform f z e
 
 instance DefinerOfRegs LocalReg (CmmNode e x) where
+  {-# INLINEABLE foldRegsDefd #-}
   foldRegsDefd platform f !z n = case n of
     CmmAssign lhs _ -> fold f z lhs
     CmmUnsafeForeignCall _ fs _ -> fold f z fs
@@ -362,6 +366,7 @@ instance DefinerOfRegs LocalReg (CmmNode e x) where
           fold f z n = foldRegsDefd platform f z n
 
 instance DefinerOfRegs GlobalReg (CmmNode e x) where
+  {-# INLINEABLE foldRegsDefd #-}
   foldRegsDefd platform f !z n = case n of
     CmmAssign lhs _ -> fold f z lhs
     CmmUnsafeForeignCall tgt _ _  -> fold f z (foreignTargetRegs tgt)


=====================================
compiler/GHC/Cmm/Opt.hs
=====================================
@@ -58,7 +58,7 @@ cmmMachOpFoldM
     -> Maybe CmmExpr
 
 cmmMachOpFoldM _ op [CmmLit (CmmInt x rep)]
-  = Just $ case op of
+  = Just $! case op of
       MO_S_Neg _ -> CmmLit (CmmInt (-x) rep)
       MO_Not _   -> CmmLit (CmmInt (complement x) rep)
 
@@ -90,13 +90,13 @@ cmmMachOpFoldM platform conv_outer [CmmMachOp conv_inner [x]]
         -- but remember to use the signedness from the widening, just in case
         -- the final conversion is a widen.
         | rep1 < rep2 && rep2 > rep3 ->
-            Just $ cmmMachOpFold platform (intconv signed1 rep1 rep3) [x]
+            Just $! cmmMachOpFold platform (intconv signed1 rep1 rep3) [x]
         -- Nested widenings: collapse if the signedness is the same
         | rep1 < rep2 && rep2 < rep3 && signed1 == signed2 ->
-            Just $ cmmMachOpFold platform (intconv signed1 rep1 rep3) [x]
+            Just $! cmmMachOpFold platform (intconv signed1 rep1 rep3) [x]
         -- Nested narrowings: collapse
         | rep1 > rep2 && rep2 > rep3 ->
-            Just $ cmmMachOpFold platform (MO_UU_Conv rep1 rep3) [x]
+            Just $! cmmMachOpFold platform (MO_UU_Conv rep1 rep3) [x]
         | otherwise ->
             Nothing
   where
@@ -117,34 +117,34 @@ cmmMachOpFoldM platform mop [CmmLit (CmmInt x xrep), CmmLit (CmmInt y _)]
   = case mop of
         -- for comparisons: don't forget to narrow the arguments before
         -- comparing, since they might be out of range.
-        MO_Eq _   -> Just $ CmmLit (CmmInt (if x_u == y_u then 1 else 0) (wordWidth platform))
-        MO_Ne _   -> Just $ CmmLit (CmmInt (if x_u /= y_u then 1 else 0) (wordWidth platform))
-
-        MO_U_Gt _ -> Just $ CmmLit (CmmInt (if x_u >  y_u then 1 else 0) (wordWidth platform))
-        MO_U_Ge _ -> Just $ CmmLit (CmmInt (if x_u >= y_u then 1 else 0) (wordWidth platform))
-        MO_U_Lt _ -> Just $ CmmLit (CmmInt (if x_u <  y_u then 1 else 0) (wordWidth platform))
-        MO_U_Le _ -> Just $ CmmLit (CmmInt (if x_u <= y_u then 1 else 0) (wordWidth platform))
-
-        MO_S_Gt _ -> Just $ CmmLit (CmmInt (if x_s >  y_s then 1 else 0) (wordWidth platform))
-        MO_S_Ge _ -> Just $ CmmLit (CmmInt (if x_s >= y_s then 1 else 0) (wordWidth platform))
-        MO_S_Lt _ -> Just $ CmmLit (CmmInt (if x_s <  y_s then 1 else 0) (wordWidth platform))
-        MO_S_Le _ -> Just $ CmmLit (CmmInt (if x_s <= y_s then 1 else 0) (wordWidth platform))
-
-        MO_Add r -> Just $ CmmLit (CmmInt (x + y) r)
-        MO_Sub r -> Just $ CmmLit (CmmInt (x - y) r)
-        MO_Mul r -> Just $ CmmLit (CmmInt (x * y) r)
-        MO_U_Quot r | y /= 0 -> Just $ CmmLit (CmmInt (x_u `quot` y_u) r)
-        MO_U_Rem  r | y /= 0 -> Just $ CmmLit (CmmInt (x_u `rem`  y_u) r)
-        MO_S_Quot r | y /= 0 -> Just $ CmmLit (CmmInt (x `quot` y) r)
-        MO_S_Rem  r | y /= 0 -> Just $ CmmLit (CmmInt (x `rem` y) r)
-
-        MO_And   r -> Just $ CmmLit (CmmInt (x .&. y) r)
-        MO_Or    r -> Just $ CmmLit (CmmInt (x .|. y) r)
-        MO_Xor   r -> Just $ CmmLit (CmmInt (x `xor` y) r)
-
-        MO_Shl   r -> Just $ CmmLit (CmmInt (x `shiftL` fromIntegral y) r)
-        MO_U_Shr r -> Just $ CmmLit (CmmInt (x_u `shiftR` fromIntegral y) r)
-        MO_S_Shr r -> Just $ CmmLit (CmmInt (x `shiftR` fromIntegral y) r)
+        MO_Eq _   -> Just $! CmmLit (CmmInt (if x_u == y_u then 1 else 0) (wordWidth platform))
+        MO_Ne _   -> Just $! CmmLit (CmmInt (if x_u /= y_u then 1 else 0) (wordWidth platform))
+
+        MO_U_Gt _ -> Just $! CmmLit (CmmInt (if x_u >  y_u then 1 else 0) (wordWidth platform))
+        MO_U_Ge _ -> Just $! CmmLit (CmmInt (if x_u >= y_u then 1 else 0) (wordWidth platform))
+        MO_U_Lt _ -> Just $! CmmLit (CmmInt (if x_u <  y_u then 1 else 0) (wordWidth platform))
+        MO_U_Le _ -> Just $! CmmLit (CmmInt (if x_u <= y_u then 1 else 0) (wordWidth platform))
+
+        MO_S_Gt _ -> Just $! CmmLit (CmmInt (if x_s >  y_s then 1 else 0) (wordWidth platform))
+        MO_S_Ge _ -> Just $! CmmLit (CmmInt (if x_s >= y_s then 1 else 0) (wordWidth platform))
+        MO_S_Lt _ -> Just $! CmmLit (CmmInt (if x_s <  y_s then 1 else 0) (wordWidth platform))
+        MO_S_Le _ -> Just $! CmmLit (CmmInt (if x_s <= y_s then 1 else 0) (wordWidth platform))
+
+        MO_Add r -> Just $! CmmLit (CmmInt (x + y) r)
+        MO_Sub r -> Just $! CmmLit (CmmInt (x - y) r)
+        MO_Mul r -> Just $! CmmLit (CmmInt (x * y) r)
+        MO_U_Quot r | y /= 0 -> Just $! CmmLit (CmmInt (x_u `quot` y_u) r)
+        MO_U_Rem  r | y /= 0 -> Just $! CmmLit (CmmInt (x_u `rem`  y_u) r)
+        MO_S_Quot r | y /= 0 -> Just $! CmmLit (CmmInt (x `quot` y) r)
+        MO_S_Rem  r | y /= 0 -> Just $! CmmLit (CmmInt (x `rem` y) r)
+
+        MO_And   r -> Just $! CmmLit (CmmInt (x .&. y) r)
+        MO_Or    r -> Just $! CmmLit (CmmInt (x .|. y) r)
+        MO_Xor   r -> Just $! CmmLit (CmmInt (x `xor` y) r)
+
+        MO_Shl   r -> Just $! CmmLit (CmmInt (x `shiftL` fromIntegral y) r)
+        MO_U_Shr r -> Just $! CmmLit (CmmInt (x_u `shiftR` fromIntegral y) r)
+        MO_S_Shr r -> Just $! CmmLit (CmmInt (x `shiftR` fromIntegral y) r)
 
         _          -> Nothing
 
@@ -162,7 +162,7 @@ cmmMachOpFoldM platform mop [CmmLit (CmmInt x xrep), CmmLit (CmmInt y _)]
 
 cmmMachOpFoldM platform op [x@(CmmLit _), y]
    | not (isLit y) && isCommutableMachOp op
-   = Just (cmmMachOpFold platform op [y, x])
+   = Just $! (cmmMachOpFold platform op [y, x])
 
 -- Turn (a+b)+c into a+(b+c) where possible.  Because literals are
 -- moved to the right, it is more likely that we will find
@@ -183,7 +183,7 @@ cmmMachOpFoldM platform op [x@(CmmLit _), y]
 cmmMachOpFoldM platform mop1 [CmmMachOp mop2 [arg1,arg2], arg3]
    | mop2 `associates_with` mop1
      && not (isLit arg1) && not (isPicReg arg1)
-   = Just (cmmMachOpFold platform mop2 [arg1, cmmMachOpFold platform mop1 [arg2,arg3]])
+   = Just $! (cmmMachOpFold platform mop2 [arg1, cmmMachOpFold platform mop1 [arg2,arg3]])
    where
      MO_Add{} `associates_with` MO_Sub{} = True
      mop1 `associates_with` mop2 =
@@ -192,7 +192,7 @@ cmmMachOpFoldM platform mop1 [CmmMachOp mop2 [arg1,arg2], arg3]
 -- special case: (a - b) + c  ==>  a + (c - b)
 cmmMachOpFoldM platform mop1@(MO_Add{}) [CmmMachOp mop2@(MO_Sub{}) [arg1,arg2], arg3]
    | not (isLit arg1) && not (isPicReg arg1)
-   = Just (cmmMachOpFold platform mop1 [arg1, cmmMachOpFold platform mop2 [arg3,arg2]])
+   = Just $! (cmmMachOpFold platform mop1 [arg1, cmmMachOpFold platform mop2 [arg3,arg2]])
 
 -- special case: (PicBaseReg + lit) + N  ==>  PicBaseReg + (lit+N)
 --
@@ -205,27 +205,27 @@ cmmMachOpFoldM platform mop1@(MO_Add{}) [CmmMachOp mop2@(MO_Sub{}) [arg1,arg2],
 cmmMachOpFoldM _ MO_Add{} [ CmmMachOp op at MO_Add{} [pic, CmmLit lit]
                           , CmmLit (CmmInt n rep) ]
   | isPicReg pic
-  = Just $ CmmMachOp op [pic, CmmLit $ cmmOffsetLit lit off ]
+  = Just $! CmmMachOp op [pic, CmmLit $ cmmOffsetLit lit off ]
   where off = fromIntegral (narrowS rep n)
 
 -- Make a RegOff if we can
 cmmMachOpFoldM _ (MO_Add _) [CmmReg reg, CmmLit (CmmInt n rep)]
-  = Just $ cmmRegOff reg (fromIntegral (narrowS rep n))
+  = Just $! cmmRegOff reg (fromIntegral (narrowS rep n))
 cmmMachOpFoldM _ (MO_Add _) [CmmRegOff reg off, CmmLit (CmmInt n rep)]
-  = Just $ cmmRegOff reg (off + fromIntegral (narrowS rep n))
+  = Just $! cmmRegOff reg (off + fromIntegral (narrowS rep n))
 cmmMachOpFoldM _ (MO_Sub _) [CmmReg reg, CmmLit (CmmInt n rep)]
-  = Just $ cmmRegOff reg (- fromIntegral (narrowS rep n))
+  = Just $! cmmRegOff reg (- fromIntegral (narrowS rep n))
 cmmMachOpFoldM _ (MO_Sub _) [CmmRegOff reg off, CmmLit (CmmInt n rep)]
-  = Just $ cmmRegOff reg (off - fromIntegral (narrowS rep n))
+  = Just $! cmmRegOff reg (off - fromIntegral (narrowS rep n))
 
 -- Fold label(+/-)offset into a CmmLit where possible
 
 cmmMachOpFoldM _ (MO_Add _) [CmmLit lit, CmmLit (CmmInt i rep)]
-  = Just $ CmmLit (cmmOffsetLit lit (fromIntegral (narrowU rep i)))
+  = Just $! CmmLit (cmmOffsetLit lit (fromIntegral (narrowU rep i)))
 cmmMachOpFoldM _ (MO_Add _) [CmmLit (CmmInt i rep), CmmLit lit]
-  = Just $ CmmLit (cmmOffsetLit lit (fromIntegral (narrowU rep i)))
+  = Just $! CmmLit (cmmOffsetLit lit (fromIntegral (narrowU rep i)))
 cmmMachOpFoldM _ (MO_Sub _) [CmmLit lit, CmmLit (CmmInt i rep)]
-  = Just $ CmmLit (cmmOffsetLit lit (fromIntegral (negate (narrowU rep i))))
+  = Just $! CmmLit (cmmOffsetLit lit (fromIntegral (negate (narrowU rep i))))
 
 
 -- Comparison of literal with widened operand: perform the comparison
@@ -245,7 +245,7 @@ cmmMachOpFoldM platform cmp [CmmMachOp conv [x], CmmLit (CmmInt i _)]
         -- and the literal fits in the smaller size:
     i == narrow_fn rep i
         -- then we can do the comparison at the smaller size
-  = Just (cmmMachOpFold platform narrow_cmp [x, CmmLit (CmmInt i rep)])
+  = Just $! (cmmMachOpFold platform narrow_cmp [x, CmmLit (CmmInt i rep)])
  where
     maybe_conversion (MO_UU_Conv from to)
         | to > from
@@ -320,8 +320,8 @@ cmmMachOpFoldM platform mop [x, (CmmLit (CmmInt 1 rep))]
         MO_Mul    _ -> Just x
         MO_S_Quot _ -> Just x
         MO_U_Quot _ -> Just x
-        MO_S_Rem  _ -> Just $ CmmLit (CmmInt 0 rep)
-        MO_U_Rem  _ -> Just $ CmmLit (CmmInt 0 rep)
+        MO_S_Rem  _ -> Just $! CmmLit (CmmInt 0 rep)
+        MO_U_Rem  _ -> Just $! CmmLit (CmmInt 0 rep)
 
         -- Comparisons; trickier
         -- See Note [Comparison operators]
@@ -346,18 +346,18 @@ cmmMachOpFoldM platform mop [x, (CmmLit (CmmInt n _))]
   = case mop of
         MO_Mul rep
            | Just p <- exactLog2 n ->
-                 Just (cmmMachOpFold platform (MO_Shl rep) [x, CmmLit (CmmInt p rep)])
+                 Just $! (cmmMachOpFold platform (MO_Shl rep) [x, CmmLit (CmmInt p rep)])
         MO_U_Quot rep
            | Just p <- exactLog2 n ->
-                 Just (cmmMachOpFold platform (MO_U_Shr rep) [x, CmmLit (CmmInt p rep)])
+                 Just $! (cmmMachOpFold platform (MO_U_Shr rep) [x, CmmLit (CmmInt p rep)])
         MO_U_Rem rep
            | Just _ <- exactLog2 n ->
-                 Just (cmmMachOpFold platform (MO_And rep) [x, CmmLit (CmmInt (n - 1) rep)])
+                 Just $! (cmmMachOpFold platform (MO_And rep) [x, CmmLit (CmmInt (n - 1) rep)])
         MO_S_Quot rep
            | Just p <- exactLog2 n,
              CmmReg _ <- x ->   -- We duplicate x in signedQuotRemHelper, hence require
                                 -- it is a reg.  FIXME: remove this restriction.
-                Just (cmmMachOpFold platform (MO_S_Shr rep)
+                Just $! (cmmMachOpFold platform (MO_S_Shr rep)
                   [signedQuotRemHelper rep p, CmmLit (CmmInt p rep)])
         MO_S_Rem rep
            | Just p <- exactLog2 n,
@@ -366,7 +366,7 @@ cmmMachOpFoldM platform mop [x, (CmmLit (CmmInt n _))]
                 -- We replace (x `rem` 2^p) by (x - (x `quot` 2^p) * 2^p).
                 -- Moreover, we fuse MO_S_Shr (last operation of MO_S_Quot)
                 -- and MO_S_Shl (multiplication by 2^p) into a single MO_And operation.
-                Just (cmmMachOpFold platform (MO_Sub rep)
+                Just $! (cmmMachOpFold platform (MO_Sub rep)
                     [x, cmmMachOpFold platform (MO_And rep)
                       [signedQuotRemHelper rep p, CmmLit (CmmInt (- n) rep)]])
         _ -> Nothing


=====================================
compiler/GHC/Cmm/Sink.hs
=====================================
@@ -1,4 +1,6 @@
 {-# LANGUAGE GADTs #-}
+{-# LANGUAGE ScopedTypeVariables #-}
+
 module GHC.Cmm.Sink (
      cmmSink
   ) where
@@ -8,6 +10,7 @@ import GHC.Prelude
 import GHC.Cmm
 import GHC.Cmm.Opt
 import GHC.Cmm.Liveness
+import GHC.Cmm.LRegSet
 import GHC.Cmm.Utils
 import GHC.Cmm.Dataflow.Block
 import GHC.Cmm.Dataflow.Label
@@ -16,29 +19,13 @@ import GHC.Cmm.Dataflow.Graph
 import GHC.Platform.Regs
 
 import GHC.Platform
-import GHC.Types.Unique
 import GHC.Types.Unique.FM
 
 import qualified Data.IntSet as IntSet
 import Data.List (partition)
-import qualified Data.Set as Set
 import Data.Maybe

--- Compact sets for membership tests of local variables.
-
-type LRegSet = IntSet.IntSet
-
-emptyLRegSet :: LRegSet
-emptyLRegSet = IntSet.empty
-
-nullLRegSet :: LRegSet -> Bool
-nullLRegSet = IntSet.null
-
-insertLRegSet :: LocalReg -> LRegSet -> LRegSet
-insertLRegSet l = IntSet.insert (getKey (getUnique l))
-
-elemLRegSet :: LocalReg -> LRegSet -> Bool
-elemLRegSet l = IntSet.member (getKey (getUnique l))
+import GHC.Exts (inline)
 
 -- -----------------------------------------------------------------------------
 -- Sinking and inlining
@@ -167,8 +154,8 @@ type Assignments = [Assignment]
 cmmSink :: Platform -> CmmGraph -> CmmGraph
 cmmSink platform graph = ofBlockList (g_entry graph) $ sink mapEmpty $ blocks
   where
-  liveness = cmmLocalLiveness platform graph
-  getLive l = mapFindWithDefault Set.empty l liveness
+  liveness = cmmLocalLivenessL platform graph
+  getLive l = mapFindWithDefault emptyLRegSet l liveness
 
   blocks = revPostorder graph
 
@@ -188,8 +175,8 @@ cmmSink platform graph = ofBlockList (g_entry graph) $ sink mapEmpty $ blocks
       -- Annotate the middle nodes with the registers live *after*
       -- the node.  This will help us decide whether we can inline
       -- an assignment in the current node or not.
-      live = Set.unions (map getLive succs)
-      live_middle = gen_kill platform last live
+      live = IntSet.unions (map getLive succs)
+      live_middle = gen_killL platform last live
       ann_middles = annotate platform live_middle (blockToList middle)
 
       -- Now sink and inline in this block
@@ -201,7 +188,7 @@ cmmSink platform graph = ofBlockList (g_entry graph) $ sink mapEmpty $ blocks
       -- one predecessor), so identify the join points and the set
       -- of registers live in them.
       (joins, nonjoins) = partition (`mapMember` join_pts) succs
-      live_in_joins = Set.unions (map getLive joins)
+      live_in_joins = IntSet.unions (map getLive joins)
 
       -- We do not want to sink an assignment into multiple branches,
       -- so identify the set of registers live in multiple successors.
@@ -210,26 +197,28 @@ cmmSink platform graph = ofBlockList (g_entry graph) $ sink mapEmpty $ blocks
       -- now live in multiple branches.
       init_live_sets = map getLive nonjoins
       live_in_multi live_sets r =
-         case filter (Set.member r) live_sets of
+         case filter (elemLRegSet r) live_sets of
            (_one:_two:_) -> True
            _ -> False
 
       -- Now, drop any assignments that we will not sink any further.
       (dropped_last, assigs'') = dropAssignments platform drop_if init_live_sets assigs'
 
+      drop_if :: (LocalReg, CmmExpr, AbsMem)
+                      -> [LRegSet] -> (Bool, [LRegSet])
       drop_if a@(r,rhs,_) live_sets = (should_drop, live_sets')
           where
             should_drop =  conflicts platform a final_last
                         || not (isTrivial platform rhs) && live_in_multi live_sets r
-                        || r `Set.member` live_in_joins
+                        || r `elemLRegSet` live_in_joins
 
             live_sets' | should_drop = live_sets
                        | otherwise   = map upd live_sets
 
-            upd set | r `Set.member` set = set `Set.union` live_rhs
+            upd set | r `elemLRegSet` set = set `IntSet.union` live_rhs
                     | otherwise          = set
 
-            live_rhs = foldRegsUsed platform extendRegSet emptyRegSet rhs
+            live_rhs = foldRegsUsed platform (flip insertLRegSet) emptyLRegSet rhs
 
       final_middle = foldl' blockSnoc middle' dropped_last
 
@@ -266,9 +255,9 @@ isTrivial _ _          = False
 --
 -- annotate each node with the set of registers live *after* the node
 --
-annotate :: Platform -> LocalRegSet -> [CmmNode O O] -> [(LocalRegSet, CmmNode O O)]
+annotate :: Platform -> LRegSet -> [CmmNode O O] -> [(LRegSet, CmmNode O O)]
 annotate platform live nodes = snd $ foldr ann (live,[]) nodes
-  where ann n (live,nodes) = (gen_kill platform n live, (live,n) : nodes)
+  where ann n (live,nodes) = (gen_killL platform n live, (live,n) : nodes)
 
 --
 -- Find the blocks that have multiple successors (join points)
@@ -285,13 +274,13 @@ findJoinPoints blocks = mapFilter (>1) succ_counts
 -- filter the list of assignments to remove any assignments that
 -- are not live in a continuation.
 --
-filterAssignments :: Platform -> LocalRegSet -> Assignments -> Assignments
+filterAssignments :: Platform -> LRegSet -> Assignments -> Assignments
 filterAssignments platform live assigs = reverse (go assigs [])
   where go []             kept = kept
         go (a@(r,_,_):as) kept | needed    = go as (a:kept)
                                | otherwise = go as kept
            where
-              needed = r `Set.member` live
+              needed = r `elemLRegSet` live
                        || any (conflicts platform a) (map toNode kept)
                        --  Note that we must keep assignments that are
                        -- referred to by other assignments we have
@@ -312,7 +301,7 @@ filterAssignments platform live assigs = reverse (go assigs [])
 --
 
 walk :: Platform
-     -> [(LocalRegSet, CmmNode O O)]    -- nodes of the block, annotated with
+     -> [(LRegSet, CmmNode O O)]    -- nodes of the block, annotated with
                                         -- the set of registers live *after*
                                         -- this node.
 
@@ -366,11 +355,11 @@ shouldSink _ _other = Nothing
 -- out of inlining, but the inliner will see that r is live
 -- after the instruction and choose not to inline r in the rhs.
 --
-shouldDiscard :: CmmNode e x -> LocalRegSet -> Bool
+shouldDiscard :: CmmNode e x -> LRegSet -> Bool
 shouldDiscard node live
    = case node of
        CmmAssign r (CmmReg r') | r == r' -> True
-       CmmAssign (CmmLocal r) _ -> not (r `Set.member` live)
+       CmmAssign (CmmLocal r) _ -> not (r `elemLRegSet` live)
        _otherwise -> False
 
 
@@ -403,8 +392,9 @@ dropAssignments platform should_drop state assigs
 -- inlining opens up opportunities for doing so.
 
 tryToInline
-   :: Platform
-   -> LocalRegSet               -- set of registers live after this
+   :: forall x. Platform
+   -> LRegSet               -- set of registers live after this
+  --  -> LocalRegSet               -- set of registers live after this
                                 -- node.  We cannot inline anything
                                 -- that is live after the node, unless
                                 -- it is small enough to duplicate.
@@ -415,35 +405,42 @@ tryToInline
       , Assignments             -- Remaining assignments
       )
 
-tryToInline platform live node assigs = go usages node emptyLRegSet assigs
+tryToInline platform liveAfter node assigs =
+  -- pprTrace "tryToInline assig length:" (ppr $ length assigs) $
+    go usages liveAfter node emptyLRegSet assigs
  where
   usages :: UniqFM LocalReg Int -- Maps each LocalReg to a count of how often it is used
   usages = foldLocalRegsUsed platform addUsage emptyUFM node
 
-  go _usages node _skipped [] = (node, [])
+  go :: UniqFM LocalReg Int -> LRegSet -> CmmNode O x -> LRegSet -> Assignments
+     -> (CmmNode O x, Assignments)
+  go _usages _live node _skipped [] = (node, [])
 
-  go usages node skipped (a@(l,rhs,_) : rest)
-   | cannot_inline           = dont_inline
-   | occurs_none             = discard  -- Note [discard during inlining]
-   | occurs_once             = inline_and_discard
-   | isTrivial platform rhs  = inline_and_keep
-   | otherwise               = dont_inline
+  go usages live node skipped (a@(l,rhs,_) : rest)
+   | cannot_inline            = dont_inline
+   | occurs_none              = discard  -- Note [discard during inlining]
+   | occurs_once              = inline_and_discard
+   | isTrivial platform rhs   = inline_and_keep
+   | otherwise                = dont_inline
    where
-        inline_and_discard = go usages' inl_node skipped rest
+        inline_and_discard = go usages' live inl_node skipped rest
           where usages' = foldLocalRegsUsed platform addUsage usages rhs
 
-        discard = go usages node skipped rest
+        discard = go usages live node skipped rest
 
         dont_inline        = keep node  -- don't inline the assignment, keep it
         inline_and_keep    = keep inl_node -- inline the assignment, keep it
 
+        keep :: CmmNode O x -> (CmmNode O x, Assignments)
         keep node' = (final_node, a : rest')
-          where (final_node, rest') = go usages' node' (insertLRegSet l skipped) rest
-                usages' = foldLocalRegsUsed platform (\m r -> addToUFM m r 2)
-                                            usages rhs
-                -- we must not inline anything that is mentioned in the RHS
-                -- of a binding that we have already skipped, so we set the
-                -- usages of the regs on the RHS to 2.
+          where (final_node, rest') = go usages live' node' (insertLRegSet l skipped) rest
+
+                -- Avoid discarding of assignments to vars on the rhs.
+                -- See Note [Keeping assignemnts mentioned in skipped RHSs]
+                -- usages' = foldLocalRegsUsed platform (\m r -> addToUFM m r 2)
+                                            -- usages rhs
+                live' = inline foldLocalRegsUsed platform (\m r -> insertLRegSet r m)
+                                            live rhs
 
         cannot_inline = skipped `regsUsedIn` rhs -- Note [dependent assignments]
                         || l `elemLRegSet` skipped
@@ -451,7 +448,7 @@ tryToInline platform live node assigs = go usages node emptyLRegSet assigs
 
         -- How often is l used in the current node.
         l_usages = lookupUFM usages l
-        l_live   = l `elemRegSet` live
+        l_live   = l `elemLRegSet` live
 
         occurs_once = not l_live && l_usages == Just 1
         occurs_none = not l_live && l_usages == Nothing
@@ -467,6 +464,27 @@ tryToInline platform live node assigs = go usages node emptyLRegSet assigs
         inl_exp (CmmMachOp op args) = cmmMachOpFold platform op args
         inl_exp other = other
 
+{- Note [Keeping assignemnts mentioned in skipped RHSs]
+    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+    If we have to assignments: [z = y, y = e1] and we skip
+    z we *must* retain the assignment y = e1. This is because
+    we might inline "z = y" into another node later on so we
+    must ensure y is still defined at this point.
+
+    If we dropped the assignment of "y = e1" then we would end up
+    referencing a variable which hasn't been mentioned after
+    inlining.
+
+    We use a hack to do this.
+
+    We pretend the regs from the rhs are live after the current
+    node. Since we only discard assignments to variables
+    which are dead after the current block this prevents discarding of the
+    assignment. It still allows inlining should e1 be a trivial rhs
+    however.
+
+-}
 
 {- Note [improveConditional]
 
@@ -610,18 +628,34 @@ conflicts platform (r, rhs, addr) node
   -- (7) otherwise, no conflict
   | otherwise = False
 
+{- Note [Inlining foldRegsDefd]
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+   foldRegsDefd is, after optimization, *not* a small function so
+   it's only marked INLINEABLE, but not INLINE.
+
+   However in some specific cases we call it *very* often making it
+   important to avoid the overhead of allocating the folding function.
+
+   So we simply force inlining via the magic inline function.
+   For T3294 this improves allocation with -O by ~1%.
+
+-}
+
 -- Returns True if node defines any global registers that are used in the
 -- Cmm expression
 globalRegistersConflict :: Platform -> CmmExpr -> CmmNode e x -> Bool
 globalRegistersConflict platform expr node =
-    foldRegsDefd platform (\b r -> b || regUsedIn platform (CmmGlobal r) expr)
+    -- See Note [Inlining foldRegsDefd]
+    inline foldRegsDefd platform (\b r -> b || regUsedIn platform (CmmGlobal r) expr)
                  False node
 
 -- Returns True if node defines any local registers that are used in the
 -- Cmm expression
 localRegistersConflict :: Platform -> CmmExpr -> CmmNode e x -> Bool
 localRegistersConflict platform expr node =
-    foldRegsDefd platform (\b r -> b || regUsedIn platform (CmmLocal  r) expr)
+    -- See Note [Inlining foldRegsDefd]
+    inline foldRegsDefd platform (\b r -> b || regUsedIn platform (CmmLocal  r) expr)
                  False node
 
 -- Note [Sinking and calls]


=====================================
compiler/GHC/Cmm/Utils.hs
=====================================
@@ -264,9 +264,11 @@ cmmOffset platform  e byte_off = case e of
    CmmStackSlot area off -> CmmStackSlot area (off - byte_off)
   -- note stack area offsets increase towards lower addresses
    CmmMachOp (MO_Add rep) [expr, CmmLit (CmmInt byte_off1 _rep)]
-      -> CmmMachOp (MO_Add rep) [expr, CmmLit (CmmInt (byte_off1 + toInteger byte_off) rep)]
-   _ -> CmmMachOp (MO_Add width) [e, CmmLit (CmmInt (toInteger byte_off) width)]
-         where width = cmmExprWidth platform e
+      -> let !lit_off = (byte_off1 + toInteger byte_off)
+         in CmmMachOp (MO_Add rep) [expr, CmmLit (CmmInt lit_off rep)]
+   _ -> let !width = cmmExprWidth platform e
+        in
+        CmmMachOp (MO_Add width) [e, CmmLit (CmmInt (toInteger byte_off) width)]
 
 -- Smart constructor for CmmRegOff.  Same caveats as cmmOffset above.
 cmmRegOff :: CmmReg -> Int -> CmmExpr


=====================================
compiler/GHC/Utils/Outputable.hs
=====================================
@@ -115,6 +115,7 @@ import Data.Int
 import qualified Data.IntMap as IM
 import Data.Set (Set)
 import qualified Data.Set as Set
+import qualified Data.IntSet as IntSet
 import Data.String
 import Data.Word
 import System.IO        ( Handle )
@@ -863,6 +864,9 @@ instance (Outputable a) => Outputable (NonEmpty a) where
 instance (Outputable a) => Outputable (Set a) where
     ppr s = braces (fsep (punctuate comma (map ppr (Set.toList s))))
 
+instance Outputable IntSet.IntSet where
+    ppr s = braces (fsep (punctuate comma (map ppr (IntSet.toList s))))
+
 instance (Outputable a, Outputable b) => Outputable (a, b) where
     ppr (x,y) = parens (sep [ppr x <> comma, ppr y])
 


=====================================
compiler/ghc.cabal.in
=====================================
@@ -205,6 +205,7 @@ Library
         GHC.Cmm.Switch
         GHC.Cmm.Switch.Implement
         GHC.CmmToAsm
+        GHC.Cmm.LRegSet
         GHC.CmmToAsm.BlockLayout
         GHC.CmmToAsm.CFG
         GHC.CmmToAsm.CFG.Dominators


=====================================
docs/users_guide/eventlog-formats.rst
=====================================
@@ -207,9 +207,61 @@ Thread and scheduling events
    :base-ref:`Control.Concurrent.setThreadLabel`).
 
 
+.. _gc-events:
+
 Garbage collector events
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
+The following events mark various points of the lifecycle of a moving garbage
+collection.
+
+A typical garbage collection will look something like the following:
+
+1. A capability realizes that it needs a garbage collection (e.g. as a result
+   of running out of nursery) and requests a garbage collection.  This is
+   marked by :event-type:`REQUEST_SEQ_GC` or :event-type:`REQUEST_PAR_GC`.
+
+2. As other capabilities reach yield points and suspend execution they emit
+   :event-type:`STOP_THREAD` events.
+
+3. When all capabilities have suspended execution, collection will begin,
+   marked by a :event-type:`GC_START` event.
+
+4. As individual parallel GC threads commence with scavenging they will emit
+   :event-type:`GC_WORK` events.
+
+5. If a parallel GC thread runs out of work it will emit a
+   :event-type:`GC_IDLE` event. If it is later handed more work it will emit
+   another :event-type:`GC_WORK` event.
+
+6. Eventually when scavenging has finished a :event-type:`GC_DONE` event
+   will be emitted by each GC thread.
+
+7. A bit of book-keeping is performed.
+
+8. A :event-type:`GC_END` event will be emitted marking the end of the GC cycle.
+
+9. A :event-type:`HEAP_SIZE` event will be emitted giving the
+   cumulative heap allocations of the program until now.
+
+10. A :event-type:`GC_STATS_GHC` event will be emitted
+   containing various details of the collection and heap state.
+
+11. In the case of a major collection, a
+    :event-type:`HEAP_LIVE` event will be emitted describing
+    the current size of the live on-heap data.
+
+12. In the case of the :ghc-flag:`-threaded` RTS, a
+    :event-type:`SPARK_COUNTERS` event will be emitted giving
+    details on how many sparks have been created, evaluated, and GC'd.
+
+13. As mutator threads resume execution they will emit :event-type:`RUN_THREAD`
+    events.
+
+Note that in the case of the concurrent non-moving collector additional events
+will be emitted during the concurrent phase of collection. These are described
+in :ref:`nonmoving-gc-events`.
+
 .. event-type:: GC_START
 
    :tag: 9
@@ -685,6 +737,46 @@ These events mark various stages of the
 :rts-flag:`non-moving collection <--nonmoving-gc>` lifecycle. These are enabled
 with the ``+RTS -lg`` event-set.
 
+A typical non-moving collection cycle will look something like the following:
+
+1. The preparatory phase of collection will emit the usual events associated
+   with a moving collection. See :ref:`gc-events` for details.
+
+2. The concurrent write barrier is enabled and the concurrent mark thread is
+   started. From this point forward mutator threads may emit
+   :event-type:`CONC_UPD_REM_SET_FLUSH` events, indicating that they have
+   flushed their capability-local update remembered sets.
+
+3. Concurrent marking begins, denoted by a :event-type:`CONC_MARK_BEGIN` event.
+
+4. When the mark queue is depleted a :event-type:`CONC_MARK_END` is emitted.
+
+5. If necessary (e.g. due to weak pointer marking), the marking process will
+   continue, returning to step (3) above.
+
+6. When the collector has done as much concurrent marking as it can it will
+   enter the post-mark synchronization phase of collection, denoted by a
+   :event-type:`CONC_SYNC_BEGIN` event.
+
+7. Mutator threads will suspend execution and, if necessary, flush their update
+   remembered sets (indicated by :event-type:`CONC_UPD_REM_SET_FLUSH` events).
+
+8. The collector will do any final marking necessary (indicated by
+   :event-type:`CONC_MARK_BEGIN` and :event-type:`CONC_MARK_END` events).
+
+9. The collector will do a small amount of sweeping, disable the write barrier,
+   emit a :event-type:`CONC_SYNC_END` event, and allow mutators to resume
+
+10. The collector will begin the concurrent sweep phase, indicated by a
+    :event-type:`CONC_SWEEP_BEGIN` event.
+
+11. Once sweeping has concluded a :event-type:`CONC_SWEEP_END` event will be
+    emitted and the concurrent collector thread will terminate.
+
+12. A :event-type:`NONMOVING_HEAP_CENSUS` event will be emitted describing the
+    fragmentation state of the non-moving heap.
+
+
 .. event-type:: CONC_MARK_BEGIN
 
    :tag: 200
@@ -742,8 +834,9 @@ with the ``+RTS -lg`` event-set.
 Non-moving heap census
 ~~~~~~~~~~~~~~~~~~~~~~
 
-The non-moving heap census events (enabled with the ``+RTS -ln`` event-set) are
-intended to provide insight into fragmentation of the non-moving heap.
+The non-moving heap census events (enabled with the :rts-flag:`+RTS -ln <-l ⟨flags⟩>`
+event-set) are intended to provide insight into fragmentation of the non-moving
+heap.
 
 .. event-type:: NONMOVING_HEAP_CENSUS
 
@@ -760,8 +853,8 @@ Ticky counters
 ~~~~~~~~~~~~~~
 
 Programs compiled with :ghc-flag:`-ticky` and :ghc-flag:`-eventlog` and invoked
-with ``+RTS -lT`` will emit periodic samples of the ticky entry counters to the
-eventlog.
+with :rts-flag:`+RTS -lT <-l ⟨flags⟩>` will emit periodic samples of the ticky
+entry counters to the eventlog.
 
 .. event-type:: TICKY_COUNTER_DEF
 


=====================================
docs/users_guide/runtime_control.rst
=====================================
@@ -411,6 +411,17 @@ performance.
     Note that :rts-flag:`--nonmoving-gc` cannot be used with ``-G1``,
     :rts-flag:`profiling <-hc>` nor :rts-flag:`-c`.
 
+.. rts-flag:: -w
+
+    :default: off
+    :since: a long time ago
+    :reverse: none
+
+    Uses a mark-region garbage collection strategy for the oldest-generation heap.
+    Note that this cannot be used in conjunction with heap profiling
+    (:rts-flag:`-hT`) unless linked against the profiling runtime system with
+    :ghc-flag:`-prof`.
+
 .. rts-flag:: -xn
 
     :default: off
@@ -1194,6 +1205,9 @@ When the program is linked with the :ghc-flag:`-eventlog` option
 
     - ``f`` — parallel sparks (fully accurate). Disabled by default.
 
+    - ``T`` — :ghc-flag:`ticky-ticky profiler <-ticky>` events. Disabled by
+      default.
+
     - ``u`` — user events. These are events emitted from Haskell code using
       functions such as ``Debug.Trace.traceEvent``. Enabled by default.
 


=====================================
docs/users_guide/using-optimisation.rst
=====================================
@@ -720,7 +720,7 @@ by saying ``-fno-wombat``.
     :reverse: -fno-omit-yields
     :category:
 
-    :default: yield points enabled
+    :default: on (yields are *not* inserted)
 
     Tells GHC to omit heap checks when no allocation is
     being performed. While this improves binary sizes by about 5%, it
@@ -1197,16 +1197,16 @@ by saying ``-fno-wombat``.
 
     This is the full syntax for cardinalities, demands and sub-demands in BNF:
 
-    .. code-block::
+    .. code-block:: none
 
-      card ::= B | A | 1 | U | S | M    semantics as in the table above
+        card ::= B | A | 1 | U | S | M    semantics as in the table above
 
-      d    ::= card sd                  card = how often, sd = how deep
-            |  card                     abbreviation: Same as "card card"
+        d    ::= card sd                  card = how often, sd = how deep
+              |  card                     abbreviation: Same as "card card"
 
-      sd   ::= card                     polymorphic sub-demand, card at every level
-            |  P(d,d,..)                product sub-demand
-            |  Ccard(sd)                call sub-demand
+        sd   ::= card                     polymorphic sub-demand, card at every level
+              |  P(d,d,..)                product sub-demand
+              |  Ccard(sd)                call sub-demand
 
     For example, ``fst`` is strict in its argument, and also in the first
     component of the argument.  It will not evaluate the argument's second
@@ -1234,17 +1234,17 @@ by saying ``-fno-wombat``.
     We summarise a function's demand properties in its *demand signature*.
     This is the general syntax:
 
-    .. code-block::
+    .. code-block:: none
 
-     {x->dx,y->dy,z->dz...}<d1><d2><d3>...<dn>div
-             ^              ^   ^   ^      ^   ^
-             |              |   |   |      |   |
-             |              \---+---+------/   |
-             |                  |              |
-        demand on free        demand on      divergence
-          variables           arguments      information
-      (omitted if empty)                     (omitted if
-                                           no information)
+        {x->dx,y->dy,z->dz...}<d1><d2><d3>...<dn>div
+                ^              ^   ^   ^      ^   ^
+                |              |   |   |      |   |
+                |              \---+---+------/   |
+                |                  |              |
+           demand on free        demand on      divergence
+             variables           arguments      information
+         (omitted if empty)                     (omitted if
+                                              no information)
 
     We summarise ``fst``'s demand properties in its *demand signature*
     ``<SP(SU,A)>``, which just says "If ``fst`` is applied to one argument,
@@ -1260,13 +1260,11 @@ by saying ``-fno-wombat``.
 
     **Call sub-demands**
 
-    Consider ``maybe``:
+    Consider ``maybe``: ::
 
-    .. code-block::
-
-     maybe :: b -> (a -> b) -> Maybe a -> b
-     maybe n _ Nothing  = n
-     maybe _ s (Just a) = s a
+        maybe :: b -> (a -> b) -> Maybe a -> b
+        maybe n _ Nothing  = n
+        maybe _ s (Just a) = s a
 
     We give it demand signature ``<U><1C1(U)><SU>``.  The ``C1(U)`` is a *call
     sub-demand* that says "Called at most once, where the result is used


=====================================
includes/Cmm.h
=====================================
@@ -630,7 +630,11 @@
 #else
 #define OVERWRITING_CLOSURE_SIZE(c, size) /* nothing */
 #define OVERWRITING_CLOSURE(c) /* nothing */
-#define OVERWRITING_CLOSURE_MUTABLE(c, off) /* nothing */
+/* This is used to zero slop after shrunk arrays. It is important that we do
+ * this whenever profiling is enabled as described in Note [slop on the heap]
+ * in Storage.c. */
+#define OVERWRITING_CLOSURE_MUTABLE(c, off) \
+    if (TO_W_(RtsFlags_ProfFlags_doHeapProfile(RtsFlags)) != 0) { foreign "C" overwritingMutableClosureOfs(c "ptr", off); }
 #endif
 
 // Memory barriers.


=====================================
libraries/time
=====================================
@@ -1 +1 @@
-Subproject commit c25d6a76702b454426e149fb590da5cb69f3bd0a
+Subproject commit df292e1a74c6a87c2c1c889679074dd46ad39461


=====================================
rts/ProfHeap.c
=====================================
@@ -1103,214 +1103,217 @@ heapCensusCompactList(Census *census, bdescr *bd)
     }
 }
 
-/* -----------------------------------------------------------------------------
- * Code to perform a heap census.
- * -------------------------------------------------------------------------- */
+/*
+ * Take a census of the contents of a "normal" (e.g. not large, not compact)
+ * heap block. This can, however, handle PINNED blocks.
+ */
 static void
-heapCensusChain( Census *census, bdescr *bd )
+heapCensusBlock(Census *census, bdescr *bd)
 {
-    StgPtr p;
-    const StgInfoTable *info;
-    size_t size;
-    bool prim;
-
-    for (; bd != NULL; bd = bd->link) {
-
-        // HACK: pretend a pinned block is just one big ARR_WORDS
-        // owned by CCS_PINNED.  These blocks can be full of holes due
-        // to alignment constraints so we can't traverse the memory
-        // and do a proper census.
-        if (bd->flags & BF_PINNED) {
-            StgClosure arr;
-            SET_HDR(&arr, &stg_ARR_WORDS_info, CCS_PINNED);
-            heapProfObject(census, &arr, bd->blocks * BLOCK_SIZE_W, true);
-            continue;
-        }
-
-        p = bd->start;
-
-        // When we shrink a large ARR_WORDS, we do not adjust the free pointer
-        // of the associated block descriptor, thus introducing slop at the end
-        // of the object.  This slop remains after GC, violating the assumption
-        // of the loop below that all slop has been eliminated (#11627).
-        // The slop isn't always zeroed (e.g. in non-profiling mode, cf
-        // OVERWRITING_CLOSURE_OFS).
-        // Consequently, we handle large ARR_WORDS objects as a special case.
-        if (bd->flags & BF_LARGE
-            && get_itbl((StgClosure *)p)->type == ARR_WORDS) {
-            size = arr_words_sizeW((StgArrBytes *)p);
-            prim = true;
-            heapProfObject(census, (StgClosure *)p, size, prim);
-            continue;
-        }
+    StgPtr p = bd->start;
 
+    // In the case of PINNED blocks there can be (zeroed) slop at the beginning
+    // due to object alignment.
+    if (bd->flags & BF_PINNED) {
+        while (p < bd->free && !*p) p++;
+    }
 
-        while (p < bd->free) {
-            info = get_itbl((const StgClosure *)p);
-            prim = false;
+    while (p < bd->free) {
+        const StgInfoTable *info = get_itbl((const StgClosure *)p);
+        bool prim = false;
+        size_t size;
 
-            switch (info->type) {
+        switch (info->type) {
 
-            case THUNK:
-                size = thunk_sizeW_fromITBL(info);
-                break;
+        case THUNK:
+            size = thunk_sizeW_fromITBL(info);
+            break;
 
-            case THUNK_1_1:
-            case THUNK_0_2:
-            case THUNK_2_0:
-                size = sizeofW(StgThunkHeader) + 2;
-                break;
+        case THUNK_1_1:
+        case THUNK_0_2:
+        case THUNK_2_0:
+            size = sizeofW(StgThunkHeader) + 2;
+            break;
 
-            case THUNK_1_0:
-            case THUNK_0_1:
-            case THUNK_SELECTOR:
-                size = sizeofW(StgThunkHeader) + 1;
-                break;
+        case THUNK_1_0:
+        case THUNK_0_1:
+        case THUNK_SELECTOR:
+            size = sizeofW(StgThunkHeader) + 1;
+            break;
 
-            case FUN:
-            case BLACKHOLE:
-            case BLOCKING_QUEUE:
-            case FUN_1_0:
-            case FUN_0_1:
-            case FUN_1_1:
-            case FUN_0_2:
-            case FUN_2_0:
-            case CONSTR:
-            case CONSTR_NOCAF:
-            case CONSTR_1_0:
-            case CONSTR_0_1:
-            case CONSTR_1_1:
-            case CONSTR_0_2:
-            case CONSTR_2_0:
-                size = sizeW_fromITBL(info);
-                break;
+        case FUN:
+        case BLACKHOLE:
+        case BLOCKING_QUEUE:
+        case FUN_1_0:
+        case FUN_0_1:
+        case FUN_1_1:
+        case FUN_0_2:
+        case FUN_2_0:
+        case CONSTR:
+        case CONSTR_NOCAF:
+        case CONSTR_1_0:
+        case CONSTR_0_1:
+        case CONSTR_1_1:
+        case CONSTR_0_2:
+        case CONSTR_2_0:
+            size = sizeW_fromITBL(info);
+            break;
 
-            case IND:
-                // Special case/Delicate Hack: INDs don't normally
-                // appear, since we're doing this heap census right
-                // after GC.  However, GarbageCollect() also does
-                // resurrectThreads(), which can update some
-                // blackholes when it calls raiseAsync() on the
-                // resurrected threads.  So we know that any IND will
-                // be the size of a BLACKHOLE.
-                size = BLACKHOLE_sizeW();
-                break;
+        case IND:
+            // Special case/Delicate Hack: INDs don't normally
+            // appear, since we're doing this heap census right
+            // after GC.  However, GarbageCollect() also does
+            // resurrectThreads(), which can update some
+            // blackholes when it calls raiseAsync() on the
+            // resurrected threads.  So we know that any IND will
+            // be the size of a BLACKHOLE.
+            size = BLACKHOLE_sizeW();
+            break;
 
-            case BCO:
-                prim = true;
-                size = bco_sizeW((StgBCO *)p);
-                break;
+        case BCO:
+            prim = true;
+            size = bco_sizeW((StgBCO *)p);
+            break;
 
-            case MVAR_CLEAN:
-            case MVAR_DIRTY:
-            case TVAR:
-            case WEAK:
-            case PRIM:
-            case MUT_PRIM:
-            case MUT_VAR_CLEAN:
-            case MUT_VAR_DIRTY:
-                prim = true;
-                size = sizeW_fromITBL(info);
-                break;
+        case MVAR_CLEAN:
+        case MVAR_DIRTY:
+        case TVAR:
+        case WEAK:
+        case PRIM:
+        case MUT_PRIM:
+        case MUT_VAR_CLEAN:
+        case MUT_VAR_DIRTY:
+            prim = true;
+            size = sizeW_fromITBL(info);
+            break;
 
-            case AP:
-                size = ap_sizeW((StgAP *)p);
-                break;
+        case AP:
+            size = ap_sizeW((StgAP *)p);
+            break;
 
-            case PAP:
-                size = pap_sizeW((StgPAP *)p);
-                break;
+        case PAP:
+            size = pap_sizeW((StgPAP *)p);
+            break;
 
-            case AP_STACK:
-                size = ap_stack_sizeW((StgAP_STACK *)p);
-                break;
+        case AP_STACK:
+            size = ap_stack_sizeW((StgAP_STACK *)p);
+            break;
 
-            case ARR_WORDS:
-                prim = true;
-                size = arr_words_sizeW((StgArrBytes*)p);
-                break;
+        case ARR_WORDS:
+            prim = true;
+            size = arr_words_sizeW((StgArrBytes*)p);
+            break;
 
-            case MUT_ARR_PTRS_CLEAN:
-            case MUT_ARR_PTRS_DIRTY:
-            case MUT_ARR_PTRS_FROZEN_CLEAN:
-            case MUT_ARR_PTRS_FROZEN_DIRTY:
-                prim = true;
-                size = mut_arr_ptrs_sizeW((StgMutArrPtrs *)p);
-                break;
+        case MUT_ARR_PTRS_CLEAN:
+        case MUT_ARR_PTRS_DIRTY:
+        case MUT_ARR_PTRS_FROZEN_CLEAN:
+        case MUT_ARR_PTRS_FROZEN_DIRTY:
+            prim = true;
+            size = mut_arr_ptrs_sizeW((StgMutArrPtrs *)p);
+            break;
 
-            case SMALL_MUT_ARR_PTRS_CLEAN:
-            case SMALL_MUT_ARR_PTRS_DIRTY:
-            case SMALL_MUT_ARR_PTRS_FROZEN_CLEAN:
-            case SMALL_MUT_ARR_PTRS_FROZEN_DIRTY:
-                prim = true;
-                size = small_mut_arr_ptrs_sizeW((StgSmallMutArrPtrs *)p);
-                break;
+        case SMALL_MUT_ARR_PTRS_CLEAN:
+        case SMALL_MUT_ARR_PTRS_DIRTY:
+        case SMALL_MUT_ARR_PTRS_FROZEN_CLEAN:
+        case SMALL_MUT_ARR_PTRS_FROZEN_DIRTY:
+            prim = true;
+            size = small_mut_arr_ptrs_sizeW((StgSmallMutArrPtrs *)p);
+            break;
 
-            case TSO:
-                prim = true;
+        case TSO:
+            prim = true;
 #if defined(PROFILING)
-                if (RtsFlags.ProfFlags.includeTSOs) {
-                    size = sizeofW(StgTSO);
-                    break;
-                } else {
-                    // Skip this TSO and move on to the next object
-                    p += sizeofW(StgTSO);
-                    continue;
-                }
-#else
+            if (RtsFlags.ProfFlags.includeTSOs) {
                 size = sizeofW(StgTSO);
                 break;
+            } else {
+                // Skip this TSO and move on to the next object
+                p += sizeofW(StgTSO);
+                continue;
+            }
+#else
+            size = sizeofW(StgTSO);
+            break;
 #endif
 
-            case STACK:
-                prim = true;
+        case STACK:
+            prim = true;
 #if defined(PROFILING)
-                if (RtsFlags.ProfFlags.includeTSOs) {
-                    size = stack_sizeW((StgStack*)p);
-                    break;
-                } else {
-                    // Skip this TSO and move on to the next object
-                    p += stack_sizeW((StgStack*)p);
-                    continue;
-                }
-#else
+            if (RtsFlags.ProfFlags.includeTSOs) {
                 size = stack_sizeW((StgStack*)p);
                 break;
+            } else {
+                // Skip this TSO and move on to the next object
+                p += stack_sizeW((StgStack*)p);
+                continue;
+            }
+#else
+            size = stack_sizeW((StgStack*)p);
+            break;
 #endif
 
-            case TREC_CHUNK:
-                prim = true;
-                size = sizeofW(StgTRecChunk);
-                break;
-
-            case COMPACT_NFDATA:
-                barf("heapCensus, found compact object in the wrong list");
-                break;
+        case TREC_CHUNK:
+            prim = true;
+            size = sizeofW(StgTRecChunk);
+            break;
 
-            default:
-                barf("heapCensus, unknown object: %d", info->type);
-            }
+        case COMPACT_NFDATA:
+            barf("heapCensus, found compact object in the wrong list");
+            break;
 
-            heapProfObject(census,(StgClosure*)p,size,prim);
+        default:
+            barf("heapCensus, unknown object: %d", info->type);
+        }
 
-            p += size;
+        heapProfObject(census,(StgClosure*)p,size,prim);
+
+        p += size;
+
+        /* skip over slop, see Note [slop on the heap] */
+        while (p < bd->free && !*p) p++;
+        /* Note [skipping slop in the heap profiler]
+         * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+         *
+         * We make sure to zero slop that can remain after a major GC so
+         * here we can assume any slop words we see until the block's free
+         * pointer are zero. Since info pointers are always nonzero we can
+         * use this to scan for the next valid heap closure.
+         *
+         * Note that not all types of slop are relevant here, only the ones
+         * that can reman after major GC. So essentially just large objects
+         * and pinned objects. All other closures will have been packed nice
+         * and thight into fresh blocks.
+         */
+    }
+}
 
-            /* skip over slop, see Note [slop on the heap] */
+/* -----------------------------------------------------------------------------
+ * Code to perform a heap census.
+ * -------------------------------------------------------------------------- */
+static void
+heapCensusChain( Census *census, bdescr *bd )
+{
+    for (; bd != NULL; bd = bd->link) {
+        // When we shrink a large ARR_WORDS, we do not adjust the free pointer
+        // of the associated block descriptor, thus introducing slop at the end
+        // of the object.  This slop remains after GC, violating the assumption
+        // of the loop below that all slop has been eliminated (#11627).
+        // The slop isn't always zeroed (e.g. in non-profiling mode, cf
+        // OVERWRITING_CLOSURE_OFS).
+        // Consequently, we handle large ARR_WORDS objects as a special case.
+        if (bd->flags & BF_LARGE) {
+            StgPtr p = bd->start;
+            // There may be some initial zeros due to object alignment.
             while (p < bd->free && !*p) p++;
-            /* Note [skipping slop in the heap profiler]
-             * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-             *
-             * We make sure to zero slop that can remain after a major GC so
-             * here we can assume any slop words we see until the block's free
-             * pointer are zero. Since info pointers are always nonzero we can
-             * use this to scan for the next valid heap closure.
-             *
-             * Note that not all types of slop are relevant here, only the ones
-             * that can reman after major GC. So essentially just large objects
-             * and pinned objects. All other closures will have been packed nice
-             * and thight into fresh blocks.
-             */
+            if (get_itbl((StgClosure *)p)->type == ARR_WORDS) {
+                size_t size = arr_words_sizeW((StgArrBytes *)p);
+                bool prim = true;
+                heapProfObject(census, (StgClosure *)p, size, prim);
+                continue;
+            }
         }
+
+        heapCensusBlock(census, bd);
     }
 }
 


=====================================
rts/RtsFlags.c
=====================================
@@ -1849,6 +1849,16 @@ static void normaliseRtsOpts (void)
         barf("The non-moving collector doesn't support -G1");
     }
 
+#if !defined(PROFILING) && !defined(DEBUG)
+    // The mark-region collector is incompatible with heap census unless
+    // we zero slop of blackhole'd thunks, which doesn't happen in the
+    // vanilla way. See #9666.
+    if (RtsFlags.ProfFlags.doHeapProfile && RtsFlags.GcFlags.sweep) {
+        barf("The mark-region collector can only be used with profiling\n"
+             "when linked against the profiled RTS.");
+    }
+#endif
+
     if (RtsFlags.ProfFlags.doHeapProfile != NO_HEAP_PROFILING &&
             RtsFlags.GcFlags.useNonmoving) {
         barf("The non-moving collector doesn't support profiling");


=====================================
rts/Stats.c
=====================================
@@ -570,7 +570,7 @@ stat_endGC (Capability *cap, gc_thread *initiating_gct, W_ live, W_ copied, W_ s
         // Emit events to the event log
 
         // Has to be emitted while all caps stopped for GC, but before GC_END.
-        // See trac.haskell.org/ThreadScope/wiki/RTSsummaryEvents
+        // See https://gitlab.haskell.org/ghc/ghc/-/wikis/RTSsummaryEvents
         // for a detailed design rationale of the current setup
         // of GC eventlog events.
         traceEventGcGlobalSync(cap);


=====================================
rts/linker/Elf.c
=====================================
@@ -32,6 +32,9 @@
 #include <stdlib.h>
 #include <unistd.h>
 #include <string.h>
+#if defined(HAVE_DLFCN_H)
+#include <dlfcn.h>
+#endif
 #if defined(HAVE_SYS_STAT_H)
 #include <sys/stat.h>
 #endif


=====================================
rts/sm/Storage.c
=====================================
@@ -952,10 +952,20 @@ accountAllocation(Capability *cap, W_ n)
  * of closures. This trick is used by the sanity checking code and the heap
  * profiler, see Note [skipping slop in the heap profiler].
  *
- * When profiling we zero:
- *  - Pinned object alignment slop, see MEMSET_IF_PROFILING_W in allocatePinned.
+ * In general we zero:
+ *
+ *  - Pinned object alignment slop, see MEMSET_SLOP_W in allocatePinned.
+ *  - Large object alignment slop, see MEMSET_SLOP_W in allocatePinned.
  *  - Shrunk array slop, see OVERWRITING_CLOSURE_MUTABLE.
  *
+ * Note that this is necessary even in the vanilla (e.g. non-profiling) RTS
+ * since the user may trigger a heap census via +RTS -hT, which can be used
+ * even when not linking against the profiled RTS. Failing to zero slop
+ * due to array shrinking has resulted in a few nasty bugs (#17572, #9666).
+ * However, since array shrink may result in large amounts of slop (unlike
+ * alignment), we take care to only zero such slop when heap profiling or DEBUG
+ * are enabled.
+ *
  * When performing LDV profiling or using a (single threaded) debug RTS we zero
  * slop even when overwriting immutable closures, see Note [zeroing slop when
  * overwriting closures].
@@ -1126,12 +1136,7 @@ allocateMightFail (Capability *cap, W_ n)
  *
  * See Note [skipping slop in the heap profiler]
  */
-#if defined(PROFILING)
-#define MEMSET_IF_PROFILING_W(p, val, len_w) memset(p, val, (len_w) * sizeof(W_))
-#else
-#define MEMSET_IF_PROFILING_W(p, val, len_w) \
-    do { (void)(p); (void)(val); (void)(len_w); } while(0)
-#endif
+#define MEMSET_SLOP_W(p, val, len_w) memset(p, val, (len_w) * sizeof(W_))
 
 /* ---------------------------------------------------------------------------
    Allocate a fixed/pinned object.
@@ -1184,9 +1189,9 @@ allocatePinned (Capability *cap, W_ n /*words*/, W_ alignment /*bytes*/, W_ alig
         } else {
             Bdescr(p)->flags |= BF_PINNED;
             W_ off_w = ALIGN_WITH_OFF_W(p, alignment, align_off);
-            MEMSET_IF_PROFILING_W(p, 0, off_w);
+            MEMSET_SLOP_W(p, 0, off_w);
             p += off_w;
-            MEMSET_IF_PROFILING_W(p + n, 0, alignment_w - off_w - 1);
+            MEMSET_SLOP_W(p + n, 0, alignment_w - off_w - 1);
             return p;
         }
     }
@@ -1258,7 +1263,7 @@ allocatePinned (Capability *cap, W_ n /*words*/, W_ alignment /*bytes*/, W_ alig
 
     p = bd->free;
 
-    MEMSET_IF_PROFILING_W(p, 0, off_w);
+    MEMSET_SLOP_W(p, 0, off_w);
 
     n += off_w;
     p += off_w;


=====================================
utils/deriveConstants/Main.hs
=====================================
@@ -561,6 +561,8 @@ wanteds os = concat
           ,structField  C "StgCompactNFDataBlock" "owner"
           ,structField  C "StgCompactNFDataBlock" "next"
 
+          ,structField_ C "RtsFlags_ProfFlags_doHeapProfile"
+                          "RTS_FLAGS" "ProfFlags.doHeapProfile"
           ,structField_ C "RtsFlags_ProfFlags_showCCSOnException"
                           "RTS_FLAGS" "ProfFlags.showCCSOnException"
           ,structField_ C "RtsFlags_DebugFlags_apply"



View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/73f23069b34cea50d237c10125b5814df6b19bff...ff664f40d63a021fb815ec5e794a1c7556968cb8

-- 
View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/73f23069b34cea50d237c10125b5814df6b19bff...ff664f40d63a021fb815ec5e794a1c7556968cb8
You're receiving this email because of your account on gitlab.haskell.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-commits/attachments/20201214/88ba901e/attachment-0001.html>