[Git][ghc/ghc][wip/marge_bot_batch_merge_job] 8 commits: Implements MO_S_Mul2 and MO_U_Mul2 using the UMULH, UMULL and SMULH instructions for AArch64
Marge Bot (@marge-bot)
gitlab at gitlab.haskell.org
Thu Apr 11 15:13:52 UTC 2024
Marge Bot pushed to branch wip/marge_bot_batch_merge_job at Glasgow Haskell Compiler / GHC
Commits:
30a72192 by Alex Mason at 2024-04-11T11:13:14-04:00
Implements MO_S_Mul2 and MO_U_Mul2 using the UMULH, UMULL and SMULH instructions for AArch64
Also adds a test for MO_S_Mul2
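For context, the primops this code generation path serves can be exercised as
below (a sketch only; the result layouts match the new mul2int test further
down in this diff):

    {-# LANGUAGE MagicHash, UnboxedTuples #-}
    -- timesInt2# yields (# needed, high, low #): 'needed' is 1 when the
    -- product does not fit in the low word; on AArch64 the high word now
    -- comes from SMULH and 'needed' from the CMP/CSET sequence.
    -- timesWord2# yields (# high, low #), with the high word from UMULH.
    import GHC.Exts

    timesInt2 :: Int -> Int -> (Int, Int, Int)
    timesInt2 (I# x) (I# y) = case timesInt2# x y of
      (# needed, hi, lo #) -> (I# needed, I# hi, I# lo)

    timesWord2 :: Word -> Word -> (Word, Word)
    timesWord2 (W# x) (W# y) = case timesWord2# x y of
      (# hi, lo #) -> (W# hi, W# lo)

    main :: IO ()
    main = do
      print (timesInt2 maxBound 3)   -- high word needed
      print (timesInt2 5 6)          -- fits in the low word
      print (timesWord2 maxBound 3)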
- - - - -
181b6ff6 by Zubin Duggal at 2024-04-11T11:13:16-04:00
driver: Make `checkHomeUnitsClosed` faster
The implementation of `checkHomeUnitsClosed` was traversing every single path
in the unit dependency graph - the number of paths grows exponentially, so this
quickly becomes infeasible on larger unit dependency graphs.
Instead we replace this with a faster implementation which follows from the
specification of the closure property - there is a closure error if there are
units which are both (transitively) depended upon by home units and
(transitively) depend on home units, but are not themselves home units.
To compute the set of units required for closure, we first compute the closure
of the unit dependency graph, then the transpose of this closure, and find all
units that are reachable from the home units in the transpose of the closure.
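A minimal sketch of that computation over plain containers types (illustrative
names only; the real implementation in the diff below uses GHC's Graph and
UniqMap machinery):

    import qualified Data.Map.Strict as M
    import qualified Data.Set as S

    -- unit -> its direct dependencies
    type DepGraph a = M.Map a (S.Set a)

    -- Reflexive-transitive closure: everything reachable from 'roots'.
    reachableFrom :: Ord a => DepGraph a -> S.Set a -> S.Set a
    reachableFrom g roots = go S.empty roots
      where
        go seen todo = case S.minView todo of
          Nothing -> seen
          Just (x, rest)
            | x `S.member` seen -> go seen rest
            | otherwise ->
                let next = M.findWithDefault S.empty x g
                in go (S.insert x seen) (rest `S.union` next)

    -- Flip every edge: unit -> units that directly depend on it.
    transposeDeps :: Ord a => DepGraph a -> DepGraph a
    transposeDeps g = M.fromListWith S.union
      [ (dep, S.singleton uid) | (uid, deps) <- M.toList g, dep <- S.toList deps ]

    -- Closure errors: units (transitively) depended upon by home units and
    -- (transitively) depending on home units, but not home units themselves.
    closureErrors :: Ord a => DepGraph a -> S.Set a -> S.Set a
    closureErrors deps homes =
      let below = reachableFrom deps homes                  -- depended upon by home units
          above = reachableFrom (transposeDeps deps) homes  -- depend on home units
      in (below `S.intersection` above) S.\\ homes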
- - - - -
025ead8f by Andreas Klebinger at 2024-04-11T11:13:17-04:00
STM: Remove (unused) coarse-grained locking.
The STM code had a coarse-grained locking mode, guarded by #defines, that was unused.
This commit removes the code.
- - - - -
582a2592 by Andreas Klebinger at 2024-04-11T11:13:17-04:00
STM: Be more optimistic when validating in-flight transactions.
* Don't lock tvars when performing non-committal validation.
* If we encounter a locked tvar, don't consider it a failure.
This means in-flight validation will only fail if committing at the
moment of validation is *guaranteed* to fail.
This prevents in-flight validation from failing spuriously if it happens in
parallel on multiple threads or in parallel with a committing thread.
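A sketch of the kind of transaction this matters for (the same shape as the
example in the new Note [STM Validation] below): it only loops after reading an
inconsistent snapshot, so it never reaches commit and only the scheduler's
in-flight validation can condemn and restart it.

    import Control.Concurrent.STM
    import Control.Monad (when)

    -- Assume writers always keep 'ta' and 'tb' unequal. A reader that sees
    -- an inconsistent snapshot in which they look equal spins forever inside
    -- the transaction, so post-run validation is never reached.
    readPair :: TVar Int -> TVar Int -> STM (Int, Int)
    readPair ta tb = do
      a <- readTVar ta
      b <- readTVar tb
      when (a == b) spin    -- never taken under a consistent view of memory
      pure (a, b)
      where
        spin :: STM ()
        spin = spin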
- - - - -
088433a3 by Ben Gamari at 2024-04-11T11:13:18-04:00
base: Deprecate GHC.Pack
As proposed in #21461.
Closes #21540.
- - - - -
a5abdbb1 by Ben Gamari at 2024-04-11T11:13:18-04:00
rts: Implement set_initial_registers for AArch64
Fixes #23680.
- - - - -
e8644427 by Ben Gamari at 2024-04-11T11:13:19-04:00
ghcup-metadata: Use Debian 9 binaries on Ubuntu 16, 17
Closes #24646.
- - - - -
d229a260 by Ben Gamari at 2024-04-11T11:13:19-04:00
Bump unix submodule to 2.8.5.1
Closes #24640.
- - - - -
28 changed files:
- .gitlab/rel_eng/mk-ghcup-metadata/mk_ghcup_metadata.py
- compiler/GHC/CmmToAsm/AArch64/CodeGen.hs
- compiler/GHC/CmmToAsm/AArch64/Instr.hs
- compiler/GHC/CmmToAsm/AArch64/Ppr.hs
- compiler/GHC/Driver/Config/StgToCmm.hs
- compiler/GHC/Driver/Make.hs
- compiler/GHC/StgToCmm/Config.hs
- compiler/GHC/StgToCmm/Prim.hs
- libraries/base/changelog.md
- libraries/base/src/GHC/Pack.hs
- libraries/unix
- rts/Exception.cmm
- rts/Libdw.c
- rts/STM.c
- rts/STM.h
- rts/Schedule.c
- rts/include/stg/SMP.h
- + testsuite/tests/driver/multipleHomeUnits/mhu-perf/Makefile
- + testsuite/tests/driver/multipleHomeUnits/mhu-perf/Setup.hs
- + testsuite/tests/driver/multipleHomeUnits/mhu-perf/all.T
- + testsuite/tests/driver/multipleHomeUnits/mhu-perf/genLargeHMU
- + testsuite/tests/driver/multipleHomeUnits/mhu-perf/mhu-perf.stderr
- testsuite/tests/numeric/should_run/all.T
- + testsuite/tests/numeric/should_run/mul2int.hs
- + testsuite/tests/numeric/should_run/mul2int.stdout
- + testsuite/tests/rts/T24142.hs
- + testsuite/tests/rts/T24142.stdout
- testsuite/tests/rts/all.T
Changes:
=====================================
.gitlab/rel_eng/mk-ghcup-metadata/mk_ghcup_metadata.py
=====================================
@@ -216,7 +216,8 @@ def mk_new_yaml(release_mode, version, date, pipeline_type, job_map):
, ">= 11": deb11
, "unknown_versioning": deb11 }
, "Linux_Ubuntu" : { "unknown_versioning": ubuntu2004
- , "( >= 16 && < 19 )": ubuntu1804
+ , "( >= 16 && < 18 )": deb9
+ , "( >= 18 && < 19 )": ubuntu1804
}
, "Linux_Mint" : { "< 20": ubuntu1804
, ">= 20": ubuntu2004 }
=====================================
compiler/GHC/CmmToAsm/AArch64/CodeGen.hs
=====================================
@@ -1556,7 +1556,7 @@ genCCall target dest_regs arg_regs bid = do
-- pprTraceM "genCCall target" (ppr target)
-- pprTraceM "genCCall formal" (ppr dest_regs)
-- pprTraceM "genCCall actual" (ppr arg_regs)
-
+ platform <- getPlatform
case target of
-- The target :: ForeignTarget call can either
-- be a foreign procedure with an address expr
@@ -1584,7 +1584,6 @@ genCCall target dest_regs arg_regs bid = do
let (_res_hints, arg_hints) = foreignTargetHints target
arg_regs'' = zipWith (\(r, f, c) h -> (r,f,h,c)) arg_regs' arg_hints
- platform <- getPlatform
let packStack = platformOS platform == OSDarwin
(stackSpace', passRegs, passArgumentsCode) <- passArguments packStack allGpArgRegs allFpArgRegs arg_regs'' 0 [] nilOL
@@ -1625,6 +1624,139 @@ genCCall target dest_regs arg_regs bid = do
| [arg_reg] <- arg_regs, [dest_reg] <- dest_regs ->
unaryFloatOp W64 (\d x -> unitOL $ FABS d x) arg_reg dest_reg
+ PrimTarget (MO_S_Mul2 w)
+ -- Life is easier when we're working with word sized operands,
+ -- we can use SMULH to compute the high 64 bits, and dst_needed
+ -- checks if the high half's bits are all the same as the low half's
+ -- top bit.
+ | w == W64
+ , [src_a, src_b] <- arg_regs
+ -- dst_needed = did the result fit into just the low half
+ , [dst_needed, dst_hi, dst_lo] <- dest_regs
+ -> do
+ (reg_a, _format_x, code_x) <- getSomeReg src_a
+ (reg_b, _format_y, code_y) <- getSomeReg src_b
+
+ let lo = getRegisterReg platform (CmmLocal dst_lo)
+ hi = getRegisterReg platform (CmmLocal dst_hi)
+ nd = getRegisterReg platform (CmmLocal dst_needed)
+ return (
+ code_x `appOL`
+ code_y `snocOL`
+ MUL (OpReg W64 lo) (OpReg W64 reg_a) (OpReg W64 reg_b) `snocOL`
+ SMULH (OpReg W64 hi) (OpReg W64 reg_a) (OpReg W64 reg_b) `snocOL`
+ -- Are all high bits equal to the sign bit of the low word?
+ -- nd = (hi == ASR(lo,width-1)) ? 1 : 0
+ CMP (OpReg W64 hi) (OpRegShift W64 lo SASR (widthInBits w - 1)) `snocOL`
+ CSET (OpReg W64 nd) NE
+ , Nothing)
+ -- For sizes < platform width, we can just perform a multiply and shift
+ -- using the normal 64 bit multiply. Calculating the dst_needed value is
+ -- complicated a little by the need to be careful when truncation happens.
+ -- Currently this case can't be generated since
+ -- timesInt2# :: Int# -> Int# -> (# Int#, Int#, Int# #)
+ -- TODO: Should this be removed or would other primops be useful?
+ | w < W64
+ , [src_a, src_b] <- arg_regs
+ , [dst_needed, dst_hi, dst_lo] <- dest_regs
+ -> do
+ (reg_a', _format_x, code_a) <- getSomeReg src_a
+ (reg_b', _format_y, code_b) <- getSomeReg src_b
+
+ let lo = getRegisterReg platform (CmmLocal dst_lo)
+ hi = getRegisterReg platform (CmmLocal dst_hi)
+ nd = getRegisterReg platform (CmmLocal dst_needed)
+ -- Do everything in a full 64 bit registers
+ w' = platformWordWidth platform
+
+ (reg_a, code_a') <- signExtendReg w w' reg_a'
+ (reg_b, code_b') <- signExtendReg w w' reg_b'
+
+ return (
+ code_a `appOL`
+ code_b `appOL`
+ code_a' `appOL`
+ code_b' `snocOL`
+ -- the low 2w' of lo contains the full multiplication;
+ -- eg: int8 * int8 -> int16 result
+ -- so lo is in the last w of the register, and hi is in the second w.
+ SMULL (OpReg w' lo) (OpReg w' reg_a) (OpReg w' reg_b) `snocOL`
+ -- Make sure we hold onto the sign bits for dst_needed
+ ASR (OpReg w' hi) (OpReg w' lo) (OpImm (ImmInt $ widthInBits w)) `appOL`
+ -- lo can now be truncated so we can get at its top bit easily.
+ truncateReg w' w lo `snocOL`
+ -- Note the use of CMN (compare negative), not CMP: we want to
+ -- test if the top half is negative one and the top
+ -- bit of the bottom half is positive one. eg:
+ -- hi = 0b1111_1111 (actually 64 bits)
+ -- lo = 0b1010_1111 (-81, so the result didn't need the top half)
+ -- lo' = ASR(lo,7) (second reg of CMN)
+ -- = 0b0000_0001 (the shift gives us 1 for negative,
+ -- and 0 for positive)
+ -- hi == -lo'?
+ -- 0b1111_1111 == 0b1111_1111 (yes, top half is just overflow)
+ -- Another way to think of this is if hi + lo' == 0, which is what
+ -- CMN really is under the hood.
+ CMN (OpReg w' hi) (OpRegShift w' lo SLSR (widthInBits w - 1)) `snocOL`
+ -- Set dst_needed to 1 if hi and lo' were (negatively) equal
+ CSET (OpReg w' nd) EQ `appOL`
+ -- Finally truncate hi to drop any extraneous sign bits.
+ truncateReg w' w hi
+ , Nothing)
+ -- Can't handle > 64 bit operands
+ | otherwise -> unsupported (MO_S_Mul2 w)
+ PrimTarget (MO_U_Mul2 w)
+ -- The unsigned case is much simpler than the signed, all we need to
+ -- do is the multiplication straight into the destination registers.
+ | w == W64
+ , [src_a, src_b] <- arg_regs
+ , [dst_hi, dst_lo] <- dest_regs
+ -> do
+ (reg_a, _format_x, code_x) <- getSomeReg src_a
+ (reg_b, _format_y, code_y) <- getSomeReg src_b
+
+ let lo = getRegisterReg platform (CmmLocal dst_lo)
+ hi = getRegisterReg platform (CmmLocal dst_hi)
+ return (
+ code_x `appOL`
+ code_y `snocOL`
+ MUL (OpReg W64 lo) (OpReg W64 reg_a) (OpReg W64 reg_b) `snocOL`
+ UMULH (OpReg W64 hi) (OpReg W64 reg_a) (OpReg W64 reg_b)
+ , Nothing)
+ -- For sizes < platform width, we can just perform a multiply and shift
+ -- Need to be careful to truncate the low half, but the upper half should
+ -- be ok if the invariant in [Signed arithmetic on AArch64] is maintained.
+ -- Currently this case can't be produced by the compiler since
+ -- timesWord2# :: Word# -> Word# -> (# Word#, Word# #)
+ -- TODO: Remove? Or would the extra primop be useful for avoiding the extra
+ -- steps needed to do this in userland?
+ | w < W64
+ , [src_a, src_b] <- arg_regs
+ , [dst_hi, dst_lo] <- dest_regs
+ -> do
+ (reg_a, _format_x, code_x) <- getSomeReg src_a
+ (reg_b, _format_y, code_y) <- getSomeReg src_b
+
+ let lo = getRegisterReg platform (CmmLocal dst_lo)
+ hi = getRegisterReg platform (CmmLocal dst_hi)
+ w' = opRegWidth w
+ return (
+ code_x `appOL`
+ code_y `snocOL`
+ -- UMULL: Xd = Wa * Wb with 64 bit result
+ -- W64 inputs should have been caught by case above
+ UMULL (OpReg W64 lo) (OpReg w' reg_a) (OpReg w' reg_b) `snocOL`
+ -- Extract and truncate high result
+ -- hi[w:0] = lo[2w:w]
+ UBFX (OpReg W64 hi) (OpReg W64 lo)
+ (OpImm (ImmInt $ widthInBits w)) -- lsb
+ (OpImm (ImmInt $ widthInBits w)) -- width to extract
+ `appOL`
+ truncateReg W64 w lo
+ , Nothing)
+ | otherwise -> unsupported (MO_U_Mul2 w)
+
+
-- or a possibly side-effecting machine operation
-- mop :: CallishMachOp (see GHC.Cmm.MachOp)
PrimTarget mop -> do
@@ -1714,7 +1846,6 @@ genCCall target dest_regs arg_regs bid = do
-- Arithmatic
-- These are not supported on X86, so I doubt they are used much.
- MO_S_Mul2 _w -> unsupported mop
MO_S_QuotRem _w -> unsupported mop
MO_U_QuotRem _w -> unsupported mop
MO_U_QuotRem2 _w -> unsupported mop
@@ -1723,7 +1854,6 @@ genCCall target dest_regs arg_regs bid = do
MO_SubWordC _w -> unsupported mop
MO_AddIntC _w -> unsupported mop
MO_SubIntC _w -> unsupported mop
- MO_U_Mul2 _w -> unsupported mop
-- Memory Ordering
MO_AcquireFence -> return (unitOL DMBISH, Nothing)
=====================================
compiler/GHC/CmmToAsm/AArch64/Instr.hs
=====================================
@@ -79,11 +79,14 @@ regUsageOfInstr platform instr = case instr of
-- 1. Arithmetic Instructions ------------------------------------------------
ADD dst src1 src2 -> usage (regOp src1 ++ regOp src2, regOp dst)
CMP l r -> usage (regOp l ++ regOp r, [])
+ CMN l r -> usage (regOp l ++ regOp r, [])
MSUB dst src1 src2 src3 -> usage (regOp src1 ++ regOp src2 ++ regOp src3, regOp dst)
MUL dst src1 src2 -> usage (regOp src1 ++ regOp src2, regOp dst)
NEG dst src -> usage (regOp src, regOp dst)
SMULH dst src1 src2 -> usage (regOp src1 ++ regOp src2, regOp dst)
SMULL dst src1 src2 -> usage (regOp src1 ++ regOp src2, regOp dst)
+ UMULH dst src1 src2 -> usage (regOp src1 ++ regOp src2, regOp dst)
+ UMULL dst src1 src2 -> usage (regOp src1 ++ regOp src2, regOp dst)
SDIV dst src1 src2 -> usage (regOp src1 ++ regOp src2, regOp dst)
SUB dst src1 src2 -> usage (regOp src1 ++ regOp src2, regOp dst)
UDIV dst src1 src2 -> usage (regOp src1 ++ regOp src2, regOp dst)
@@ -209,11 +212,14 @@ patchRegsOfInstr instr env = case instr of
-- 1. Arithmetic Instructions ----------------------------------------------
ADD o1 o2 o3 -> ADD (patchOp o1) (patchOp o2) (patchOp o3)
CMP o1 o2 -> CMP (patchOp o1) (patchOp o2)
+ CMN o1 o2 -> CMN (patchOp o1) (patchOp o2)
MSUB o1 o2 o3 o4 -> MSUB (patchOp o1) (patchOp o2) (patchOp o3) (patchOp o4)
MUL o1 o2 o3 -> MUL (patchOp o1) (patchOp o2) (patchOp o3)
NEG o1 o2 -> NEG (patchOp o1) (patchOp o2)
SMULH o1 o2 o3 -> SMULH (patchOp o1) (patchOp o2) (patchOp o3)
SMULL o1 o2 o3 -> SMULL (patchOp o1) (patchOp o2) (patchOp o3)
+ UMULH o1 o2 o3 -> UMULH (patchOp o1) (patchOp o2) (patchOp o3)
+ UMULL o1 o2 o3 -> UMULL (patchOp o1) (patchOp o2) (patchOp o3)
SDIV o1 o2 o3 -> SDIV (patchOp o1) (patchOp o2) (patchOp o3)
SUB o1 o2 o3 -> SUB (patchOp o1) (patchOp o2) (patchOp o3)
UDIV o1 o2 o3 -> UDIV (patchOp o1) (patchOp o2) (patchOp o3)
@@ -540,6 +546,7 @@ data Instr
-- | ADR ...
-- | ADRP ...
| CMP Operand Operand -- rd - op2
+ | CMN Operand Operand -- rd + op2
-- | MADD ...
-- | MNEG ...
| MSUB Operand Operand Operand Operand -- rd = ra - rn × rm
@@ -562,8 +569,8 @@ data Instr
-- | UMADDL ... -- Xd = Xa + Wn × Wm
-- | UMNEGL ... -- Xd = - Wn × Wm
-- | UMSUBL ... -- Xd = Xa - Wn × Wm
- -- | UMULH ... -- Xd = (Xn × Xm)_127:64
- -- | UMULL ... -- Xd = Wn × Wm
+ | UMULH Operand Operand Operand -- Xd = (Xn × Xm)_127:64
+ | UMULL Operand Operand Operand -- Xd = Wn × Wm
-- 2. Bit Manipulation Instructions ----------------------------------------
| SBFM Operand Operand Operand Operand -- rd = rn[i,j]
@@ -644,12 +651,15 @@ instrCon i =
POP_STACK_FRAME{} -> "POP_STACK_FRAME"
ADD{} -> "ADD"
CMP{} -> "CMP"
+ CMN{} -> "CMN"
MSUB{} -> "MSUB"
MUL{} -> "MUL"
NEG{} -> "NEG"
SDIV{} -> "SDIV"
SMULH{} -> "SMULH"
SMULL{} -> "SMULL"
+ UMULH{} -> "UMULH"
+ UMULL{} -> "UMULL"
SUB{} -> "SUB"
UDIV{} -> "UDIV"
SBFM{} -> "SBFM"
=====================================
compiler/GHC/CmmToAsm/AArch64/Ppr.hs
=====================================
@@ -372,12 +372,15 @@ pprInstr platform instr = case instr of
CMP o1 o2
| isFloatOp o1 && isFloatOp o2 -> op2 (text "\tfcmp") o1 o2
| otherwise -> op2 (text "\tcmp") o1 o2
+ CMN o1 o2 -> op2 (text "\tcmn") o1 o2
MSUB o1 o2 o3 o4 -> op4 (text "\tmsub") o1 o2 o3 o4
MUL o1 o2 o3
| isFloatOp o1 && isFloatOp o2 && isFloatOp o3 -> op3 (text "\tfmul") o1 o2 o3
| otherwise -> op3 (text "\tmul") o1 o2 o3
SMULH o1 o2 o3 -> op3 (text "\tsmulh") o1 o2 o3
SMULL o1 o2 o3 -> op3 (text "\tsmull") o1 o2 o3
+ UMULH o1 o2 o3 -> op3 (text "\tumulh") o1 o2 o3
+ UMULL o1 o2 o3 -> op3 (text "\tumull") o1 o2 o3
NEG o1 o2
| isFloatOp o1 && isFloatOp o2 -> op2 (text "\tfneg") o1 o2
| otherwise -> op2 (text "\tneg") o1 o2
=====================================
compiler/GHC/Driver/Config/StgToCmm.hs
=====================================
@@ -76,7 +76,8 @@ initStgToCmmConfig dflags mod = StgToCmmConfig
| otherwise
-> const True
- , stgToCmmAllowIntMul2Instr = (ncg && x86ish) || llvm
+ , stgToCmmAllowIntMul2Instr = (ncg && (x86ish || aarch64)) || llvm
+ , stgToCmmAllowWordMul2Instr = (ncg && (x86ish || ppc || aarch64)) || llvm
-- SIMD flags
, stgToCmmVecInstrsErr = vec_err
, stgToCmmAvx = isAvxEnabled dflags
@@ -92,6 +93,9 @@ initStgToCmmConfig dflags mod = StgToCmmConfig
JSPrimitives -> (False, False)
NcgPrimitives -> (True, False)
LlvmPrimitives -> (False, True)
+ aarch64 = case platformArch platform of
+ ArchAArch64 -> True
+ _ -> False
x86ish = case platformArch platform of
ArchX86 -> True
ArchX86_64 -> True
=====================================
compiler/GHC/Driver/Make.hs
=====================================
@@ -1565,8 +1565,8 @@ downsweep hsc_env old_summaries excl_mods allow_dup_roots
(root_errs, rootSummariesOk) <- partitionWithM getRootSummary roots -- #17549
let root_map = mkRootMap rootSummariesOk
checkDuplicates root_map
- (deps, pkg_deps, map0) <- loopSummaries rootSummariesOk (M.empty, Set.empty, root_map)
- let closure_errs = checkHomeUnitsClosed (hsc_unit_env hsc_env) (hsc_all_home_unit_ids hsc_env) (Set.toList pkg_deps)
+ (deps, map0) <- loopSummaries rootSummariesOk (M.empty, root_map)
+ let closure_errs = checkHomeUnitsClosed (hsc_unit_env hsc_env)
let unit_env = hsc_unit_env hsc_env
let tmpfs = hsc_tmpfs hsc_env
@@ -1660,19 +1660,19 @@ downsweep hsc_env old_summaries excl_mods allow_dup_roots
-- This loops over all the mod summaries in the dependency graph, accumulates the actual dependencies for each module/unit
loopSummaries :: [ModSummary]
- -> (M.Map NodeKey ModuleGraphNode, Set.Set (UnitId, UnitId),
+ -> (M.Map NodeKey ModuleGraphNode,
DownsweepCache)
- -> IO ((M.Map NodeKey ModuleGraphNode), Set.Set (UnitId, UnitId), DownsweepCache)
+ -> IO ((M.Map NodeKey ModuleGraphNode), DownsweepCache)
loopSummaries [] done = return done
- loopSummaries (ms:next) (done, pkgs, summarised)
+ loopSummaries (ms:next) (done, summarised)
| Just {} <- M.lookup k done
- = loopSummaries next (done, pkgs, summarised)
+ = loopSummaries next (done, summarised)
-- Didn't work out what the imports mean yet, now do that.
| otherwise = do
- (final_deps, pkgs1, done', summarised') <- loopImports (calcDeps ms) done summarised
+ (final_deps, done', summarised') <- loopImports (calcDeps ms) done summarised
-- This has the effect of finding a .hs file if we are looking at the .hs-boot file.
- (_, _, done'', summarised'') <- loopImports (maybeToList hs_file_for_boot) done' summarised'
- loopSummaries next (M.insert k (ModuleNode final_deps ms) done'', pkgs1 `Set.union` pkgs, summarised'')
+ (_, done'', summarised'') <- loopImports (maybeToList hs_file_for_boot) done' summarised'
+ loopSummaries next (M.insert k (ModuleNode final_deps ms) done'', summarised'')
where
k = NodeKey_Module (msKey ms)
@@ -1692,18 +1692,17 @@ downsweep hsc_env old_summaries excl_mods allow_dup_roots
-- Visited set; the range is a list because
-- the roots can have the same module names
-- if allow_dup_roots is True
- -> IO ([NodeKey], Set.Set (UnitId, UnitId),
-
+ -> IO ([NodeKey],
M.Map NodeKey ModuleGraphNode, DownsweepCache)
-- The result is the completed NodeMap
- loopImports [] done summarised = return ([], Set.empty, done, summarised)
+ loopImports [] done summarised = return ([], done, summarised)
loopImports ((home_uid,mb_pkg, gwib) : ss) done summarised
| Just summs <- M.lookup cache_key summarised
= case summs of
[Right ms] -> do
let nk = NodeKey_Module (msKey ms)
- (rest, pkgs, summarised', done') <- loopImports ss done summarised
- return (nk: rest, pkgs, summarised', done')
+ (rest, summarised', done') <- loopImports ss done summarised
+ return (nk: rest, summarised', done')
[Left _err] ->
loopImports ss done summarised
_errs -> do
@@ -1715,69 +1714,77 @@ downsweep hsc_env old_summaries excl_mods allow_dup_roots
Nothing excl_mods
case mb_s of
NotThere -> loopImports ss done summarised
- External uid -> do
- (other_deps, pkgs, done', summarised') <- loopImports ss done summarised
- return (other_deps, Set.insert (homeUnitId home_unit, uid) pkgs, done', summarised')
+ External _ -> do
+ (other_deps, done', summarised') <- loopImports ss done summarised
+ return (other_deps, done', summarised')
FoundInstantiation iud -> do
- (other_deps, pkgs, done', summarised') <- loopImports ss done summarised
- return (NodeKey_Unit iud : other_deps, pkgs, done', summarised')
+ (other_deps, done', summarised') <- loopImports ss done summarised
+ return (NodeKey_Unit iud : other_deps, done', summarised')
FoundHomeWithError (_uid, e) -> loopImports ss done (Map.insert cache_key [(Left e)] summarised)
FoundHome s -> do
- (done', pkgs1, summarised') <-
- loopSummaries [s] (done, Set.empty, Map.insert cache_key [Right s] summarised)
- (other_deps, pkgs2, final_done, final_summarised) <- loopImports ss done' summarised'
+ (done', summarised') <-
+ loopSummaries [s] (done, Map.insert cache_key [Right s] summarised)
+ (other_deps, final_done, final_summarised) <- loopImports ss done' summarised'
-- MP: This assumes that we can only instantiate non home units, which is probably fair enough for now.
- return (NodeKey_Module (msKey s) : other_deps, pkgs1 `Set.union` pkgs2, final_done, final_summarised)
+ return (NodeKey_Module (msKey s) : other_deps, final_done, final_summarised)
where
cache_key = (home_uid, mb_pkg, unLoc <$> gwib)
home_unit = ue_unitHomeUnit home_uid (hsc_unit_env hsc_env)
GWIB { gwib_mod = L loc mod, gwib_isBoot = is_boot } = gwib
wanted_mod = L loc mod
--- This function checks then important property that if both p and q are home units
+-- | This function checks the important property that if both p and q are home units
-- then any dependency of p, which transitively depends on q is also a home unit.
-checkHomeUnitsClosed :: UnitEnv -> Set.Set UnitId -> [(UnitId, UnitId)] -> [DriverMessages]
--- Fast path, trivially closed.
-checkHomeUnitsClosed ue home_id_set home_imp_ids
- | Set.size home_id_set == 1 = []
- | otherwise =
- let res = foldMap loop home_imp_ids
- -- Now check whether everything which transitively depends on a home_unit is actually a home_unit
- -- These units are the ones which we need to load as home packages but failed to do for some reason,
- -- it's a bug in the tool invoking GHC.
- bad_unit_ids = Set.difference res home_id_set
- in if Set.null bad_unit_ids
- then []
- else [singleMessage $ mkPlainErrorMsgEnvelope rootLoc $ DriverHomePackagesNotClosed (Set.toList bad_unit_ids)]
-
+--
+-- See Note [Multiple Home Units], section 'Closure Property'.
+checkHomeUnitsClosed :: UnitEnv -> [DriverMessages]
+checkHomeUnitsClosed ue
+ | Set.null bad_unit_ids = []
+ | otherwise = [singleMessage $ mkPlainErrorMsgEnvelope rootLoc $ DriverHomePackagesNotClosed (Set.toList bad_unit_ids)]
where
+ home_id_set = unitEnv_keys $ ue_home_unit_graph ue
+ bad_unit_ids = upwards_closure Set.\\ home_id_set
rootLoc = mkGeneralSrcSpan (fsLit "<command line>")
- -- TODO: This could repeat quite a bit of work but I struggled to write this function.
- -- Which units transitively depend on a home unit
- loop :: (UnitId, UnitId) -> Set.Set UnitId -- The units which transitively depend on a home unit
- loop (from_uid, uid) =
- let us = ue_findHomeUnitEnv from_uid ue in
- let um = unitInfoMap (homeUnitEnv_units us) in
- case lookupUniqMap um uid of
- Nothing -> pprPanic "uid not found" (ppr uid)
- Just ui ->
- let depends = unitDepends ui
- home_depends = Set.fromList depends `Set.intersection` home_id_set
- other_depends = Set.fromList depends `Set.difference` home_id_set
- in
- -- Case 1: The unit directly depends on a home_id
- if not (null home_depends)
- then
- let res = foldMap (loop . (from_uid,)) other_depends
- in Set.insert uid res
- -- Case 2: Check the rest of the dependencies, and then see if any of them depended on
- else
- let res = foldMap (loop . (from_uid,)) other_depends
- in
- if not (Set.null res)
- then Set.insert uid res
- else res
+
+ graph :: Graph (Node UnitId UnitId)
+ graph = graphFromEdgedVerticesUniq graphNodes
+
+ -- downwards closure of graph
+ downwards_closure
+ = graphFromEdgedVerticesUniq [ DigraphNode uid uid (Set.toList deps)
+ | (uid, deps) <- M.toList (allReachable graph node_key)]
+
+ inverse_closure = transposeG downwards_closure
+
+ upwards_closure = Set.fromList $ map node_key $ reachablesG inverse_closure [DigraphNode uid uid [] | uid <- Set.toList home_id_set]
+
+ all_unit_direct_deps :: UniqMap UnitId (Set.Set UnitId)
+ all_unit_direct_deps
+ = unitEnv_foldWithKey go emptyUniqMap $ ue_home_unit_graph ue
+ where
+ go rest this this_uis =
+ plusUniqMap_C Set.union
+ (addToUniqMap_C Set.union external_depends this (Set.fromList $ this_deps))
+ rest
+ where
+ external_depends = mapUniqMap (Set.fromList . unitDepends) (unitInfoMap this_units)
+ this_units = homeUnitEnv_units this_uis
+ this_deps = [ toUnitId unit | (unit,Just _) <- explicitUnits this_units]
+
+ graphNodes :: [Node UnitId UnitId]
+ graphNodes = go Set.empty home_id_set
+ where
+ go done todo
+ = case Set.minView todo of
+ Nothing -> []
+ Just (uid, todo')
+ | Set.member uid done -> go done todo'
+ | otherwise -> case lookupUniqMap all_unit_direct_deps uid of
+ Nothing -> pprPanic "uid not found" (ppr (uid, all_unit_direct_deps))
+ Just depends ->
+ let todo'' = (depends Set.\\ done) `Set.union` todo'
+ in DigraphNode uid uid (Set.toList depends) : go (Set.insert uid done) todo''
-- | Update the every ModSummary that is depended on
-- by a module that needs template haskell. We enable codegen to
=====================================
compiler/GHC/StgToCmm/Config.hs
=====================================
@@ -70,6 +70,7 @@ data StgToCmmConfig = StgToCmmConfig
, stgToCmmAllowQuotRem2 :: !Bool -- ^ Allowed to generate QuotRem
, stgToCmmAllowExtendedAddSubInstrs :: !Bool -- ^ Allowed to generate AddWordC, SubWordC, Add2, etc.
, stgToCmmAllowIntMul2Instr :: !Bool -- ^ Allowed to generate IntMul2 instruction
+ , stgToCmmAllowWordMul2Instr :: !Bool -- ^ Allowed to generate WordMul2 instruction
, stgToCmmAllowFMAInstr :: FMASign -> Bool -- ^ Allowed to generate FMA instruction
, stgToCmmTickyAP :: !Bool -- ^ Disable use of precomputed standard thunks.
------------------------------ SIMD flags ------------------------------------
=====================================
compiler/GHC/StgToCmm/Prim.hs
=====================================
@@ -1623,7 +1623,7 @@ emitPrimOp cfg primop =
else Right genericIntSubCOp
WordMul2Op -> \args -> opCallishHandledLater args $
- if allowExtAdd
+ if allowWord2Mul
then Left (MO_U_Mul2 (wordWidth platform))
else Right genericWordMul2Op
@@ -1850,6 +1850,7 @@ emitPrimOp cfg primop =
allowQuotRem2 = stgToCmmAllowQuotRem2 cfg
allowExtAdd = stgToCmmAllowExtendedAddSubInstrs cfg
allowInt2Mul = stgToCmmAllowIntMul2Instr cfg
+ allowWord2Mul = stgToCmmAllowWordMul2Instr cfg
allowFMA = stgToCmmAllowFMAInstr cfg
=====================================
libraries/base/changelog.md
=====================================
@@ -4,6 +4,7 @@
* Improve `toInteger :: Word32 -> Integer` on 64-bit platforms ([CLC proposal #259](https://github.com/haskell/core-libraries-committee/issues/259))
## 4.20.0.0 *TBA*
+ * Deprecate `GHC.Pack` ([#21461](https://gitlab.haskell.org/ghc/ghc/-/issues/21461))
* Export `foldl'` from `Prelude` ([CLC proposal #167](https://github.com/haskell/core-libraries-committee/issues/167))
* The top-level handler for uncaught exceptions now displays the output of `displayException` rather than `show` ([CLC proposal #198](https://github.com/haskell/core-libraries-committee/issues/198))
* Add `permutations` and `permutations1` to `Data.List.NonEmpty` ([CLC proposal #68](https://github.com/haskell/core-libraries-committee/issues/68))
=====================================
libraries/base/src/GHC/Pack.hs
=====================================
@@ -25,6 +25,7 @@
--
module GHC.Pack
+ {-# DEPRECATED "The exports of this module should instead be imported from GHC.Exts" #-}
(packCString#,
unpackCString,
unpackCString#,
=====================================
libraries/unix
=====================================
@@ -1 +1 @@
-Subproject commit 930a8289f96c5353d120af4fd155446c574709f2
+Subproject commit 69552a5267c7dc5c46a8bceec5ec4b40d26b9463
=====================================
rts/Exception.cmm
=====================================
@@ -495,7 +495,7 @@ retry_pop_stack:
W_ trec, outer;
W_ r;
trec = StgTSO_trec(CurrentTSO);
- (r) = ccall stmValidateNestOfTransactions(MyCapability() "ptr", trec "ptr");
+ (r) = ccall stmValidateNestOfTransactions(MyCapability() "ptr", trec "ptr", 0);
outer = StgTRecHeader_enclosing_trec(trec);
ccall stmAbortTransaction(MyCapability() "ptr", trec "ptr");
ccall stmFreeAbortedTRec(MyCapability() "ptr", trec "ptr");
=====================================
rts/Libdw.c
=====================================
@@ -285,6 +285,9 @@ static bool memory_read(Dwfl *dwfl STG_UNUSED, Dwarf_Addr addr,
return true;
}
+// This function should persist the current machine state and call
+// dwfl_thread_state_registers. The register numbering should match
+// that defined by the platform's DWARF specification.
static bool set_initial_registers(Dwfl_Thread *thread, void *arg);
#if defined(x86_64_HOST_ARCH)
@@ -315,6 +318,53 @@ static bool set_initial_registers(Dwfl_Thread *thread,
);
return dwfl_thread_state_registers(thread, 0, 17, regs);
}
+#elif defined(aarch64_HOST_ARCH)
+// See https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst
+static bool set_initial_registers(Dwfl_Thread *thread,
+ void *arg STG_UNUSED) {
+ Dwarf_Word regs[33] = {};
+ __asm__ ("str x0, [%0, 0x000]\n\t"
+ "str x1, [%0, 0x008]\n\t"
+ "str x2, [%0, 0x010]\n\t"
+ "str x3, [%0, 0x018]\n\t"
+ "str x4, [%0, 0x020]\n\t"
+ "str x5, [%0, 0x028]\n\t"
+ "str x6, [%0, 0x030]\n\t"
+ "str x7, [%0, 0x038]\n\t"
+ "str x8, [%0, 0x040]\n\t"
+ "str x9, [%0, 0x048]\n\t"
+ "str x10, [%0, 0x050]\n\t"
+ "str x11, [%0, 0x058]\n\t"
+ "str x12, [%0, 0x060]\n\t"
+ "str x13, [%0, 0x068]\n\t"
+ "str x14, [%0, 0x070]\n\t"
+ "str x15, [%0, 0x078]\n\t"
+ "str x16, [%0, 0x080]\n\t"
+ "str x17, [%0, 0x088]\n\t"
+ "str x18, [%0, 0x090]\n\t"
+ "str x19, [%0, 0x098]\n\t"
+ "str x20, [%0, 0x0a0]\n\t"
+ "str x21, [%0, 0x0a8]\n\t"
+ "str x22, [%0, 0x0b0]\n\t"
+ "str x23, [%0, 0x0b8]\n\t"
+ "str x24, [%0, 0x0c0]\n\t"
+ "str x25, [%0, 0x0c8]\n\t"
+ "str x26, [%0, 0x0d0]\n\t"
+ "str x27, [%0, 0x0d8]\n\t"
+ "str x28, [%0, 0x0e0]\n\t"
+ "str x29, [%0, 0x0e8]\n\t"
+ "str x30, [%0, 0x0f0]\n\t"
+ "mov x1, sp\n\t"
+ "str x1, [%0, 0x0f8]\n\t"
+ ".here:\n\t"
+ "adr x1, .here\n\t"
+ "str x1, [%0, 0x100]\n\t"
+ : /* no output */
+ :"r" (&regs[0]) /* input */
+ :"x1" /* clobbered */
+ );
+ return dwfl_thread_state_registers(thread, 0, 33, regs);
+}
#elif defined(i386_HOST_ARCH)
static bool set_initial_registers(Dwfl_Thread *thread,
void *arg STG_UNUSED) {
=====================================
rts/STM.c
=====================================
@@ -31,10 +31,8 @@
* interface. In the Haskell RTS this means it is suitable only for
* non-THREADED_RTS builds.
*
- * STM_CG_LOCK uses coarse-grained locking -- a single 'stm lock' is acquired
- * during an invocation on the STM interface. Note that this does not mean that
- * transactions are simply serialized -- the lock is only held *within* the
- * implementation of stmCommitTransaction, stmWait etc.
+ * STM_CG_LOCK was a historic locking mode using coarse-grained locking.
+ * It has been removed; look at the git history if you are interested in it.
*
* STM_FG_LOCKS uses fine-grained locking -- locking is done on a per-TVar basis
* and, when committing a transaction, no locks are acquired for TVars that have
@@ -42,19 +40,14 @@
*
* Concurrency control is implemented in the functions:
*
- * lock_stm
- * unlock_stm
* lock_tvar / cond_lock_tvar
* unlock_tvar
*
- * The choice between STM_UNIPROC / STM_CG_LOCK / STM_FG_LOCKS affects the
+ * The choice between STM_UNIPROC / STM_FG_LOCKS affects the
* implementation of these functions.
*
- * lock_stm & unlock_stm are straightforward : they acquire a simple spin-lock
- * using STM_CG_LOCK, and otherwise they are no-ops.
- *
* lock_tvar / cond_lock_tvar and unlock_tvar are more complex because they have
- * other effects (present in STM_UNIPROC and STM_CG_LOCK builds) as well as the
+ * other effects (present in STM_UNIPROC builds) as well as the
* actual business of manipulating a lock (present only in STM_FG_LOCKS builds).
* This is because locking a TVar is implemented by writing the lock holder's
* TRec into the TVar's current_value field:
@@ -167,7 +160,6 @@ static int shake(void) {
/*......................................................................*/
#define IF_STM_UNIPROC(__X) do { } while (0)
-#define IF_STM_CG_LOCK(__X) do { } while (0)
#define IF_STM_FG_LOCKS(__X) do { } while (0)
#if defined(STM_UNIPROC)
@@ -175,14 +167,6 @@ static int shake(void) {
#define IF_STM_UNIPROC(__X) do { __X } while (0)
static const StgBool config_use_read_phase = false;
-static void lock_stm(StgTRecHeader *trec STG_UNUSED) {
- TRACE("%p : lock_stm()", trec);
-}
-
-static void unlock_stm(StgTRecHeader *trec STG_UNUSED) {
- TRACE("%p : unlock_stm()", trec);
-}
-
static StgClosure *lock_tvar(Capability *cap STG_UNUSED,
StgTRecHeader *trec STG_UNUSED,
StgTVar *s STG_UNUSED) {
@@ -210,64 +194,9 @@ static StgBool cond_lock_tvar(Capability *cap STG_UNUSED,
StgTVar *s STG_UNUSED,
StgClosure *expected) {
StgClosure *result;
- TRACE("%p : cond_lock_tvar(%p, %p)", trec, s, expected);
- result = ACQUIRE_LOAD(&s->current_value);
- TRACE("%p : %s", trec, (result == expected) ? "success" : "failure");
- return (result == expected);
-}
-#endif
-
-#if defined(STM_CG_LOCK) /*........................................*/
-
-#undef IF_STM_CG_LOCK
-#define IF_STM_CG_LOCK(__X) do { __X } while (0)
-static const StgBool config_use_read_phase = false;
-static volatile StgTRecHeader *smp_locked = NULL;
-
-static void lock_stm(StgTRecHeader *trec) {
- while (cas(&smp_locked, NULL, trec) != NULL) { }
- TRACE("%p : lock_stm()", trec);
-}
-
-static void unlock_stm(StgTRecHeader *trec STG_UNUSED) {
- TRACE("%p : unlock_stm()", trec);
- ASSERT(smp_locked == trec);
- RELEASE_STORE(&smp_locked, 0);
-}
-
-static StgClosure *lock_tvar(Capability *cap STG_UNUSED,
- StgTRecHeader *trec STG_UNUSED,
- StgTVar *s STG_UNUSED) {
- StgClosure *result;
- TRACE("%p : lock_tvar(%p)", trec, s);
- ASSERT(smp_locked == trec);
+ // TRACE("%p : cond_lock_tvar(%p, %p)", trec, s, expected);
result = ACQUIRE_LOAD(&s->current_value);
- return result;
-}
-
-static void *unlock_tvar(Capability *cap,
- StgTRecHeader *trec STG_UNUSED,
- StgTVar *s,
- StgClosure *c,
- StgBool force_update) {
- TRACE("%p : unlock_tvar(%p, %p)", trec, s, c);
- ASSERT(smp_locked == trec);
- if (force_update) {
- StgClosure *old_value = ACQUIRE_LOAD(&s->current_value);
- RELEASE_STORE(&s->current_value, c);
- dirty_TVAR(cap, s, old_value);
- }
-}
-
-static StgBool cond_lock_tvar(Capability *cap STG_UNUSED,
- StgTRecHeader *trec STG_UNUSED,
- StgTVar *s STG_UNUSED,
- StgClosure *expected) {
- StgClosure *result;
- TRACE("%p : cond_lock_tvar(%p, %p)", trec, s, expected);
- ASSERT(smp_locked == trec);
- result = ACQUIRE_LOAD(&s->current_value);
- TRACE("%p : %d", result ? "success" : "failure");
+ // TRACE("%p : %s", trec, (result == expected) ? "success" : "failure");
return (result == expected);
}
#endif
@@ -278,19 +207,11 @@ static StgBool cond_lock_tvar(Capability *cap STG_UNUSED,
#define IF_STM_FG_LOCKS(__X) do { __X } while (0)
static const StgBool config_use_read_phase = true;
-static void lock_stm(StgTRecHeader *trec STG_UNUSED) {
- TRACE("%p : lock_stm()", trec);
-}
-
-static void unlock_stm(StgTRecHeader *trec STG_UNUSED) {
- TRACE("%p : unlock_stm()", trec);
-}
-
static StgClosure *lock_tvar(Capability *cap,
StgTRecHeader *trec,
StgTVar *s STG_UNUSED) {
StgClosure *result;
- TRACE("%p : lock_tvar(%p)", trec, s);
+ // TRACE("%p : lock_tvar(%p)", trec, s);
do {
const StgInfoTable *info;
do {
@@ -313,7 +234,7 @@ static void unlock_tvar(Capability *cap,
StgTVar *s,
StgClosure *c,
StgBool force_update STG_UNUSED) {
- TRACE("%p : unlock_tvar(%p, %p)", trec, s, c);
+ // TRACE("%p : unlock_tvar(%p, %p)", trec, s, c);
ASSERT(ACQUIRE_LOAD(&s->current_value) == (StgClosure *)trec);
RELEASE_STORE(&s->current_value, c);
dirty_TVAR(cap, s, (StgClosure *) trec);
@@ -325,14 +246,14 @@ static StgBool cond_lock_tvar(Capability *cap,
StgClosure *expected) {
StgClosure *result;
StgWord w;
- TRACE("%p : cond_lock_tvar(%p, %p)", trec, s, expected);
+ // TRACE("%p : cond_lock_tvar(%p, %p)", trec, s, expected);
w = cas((void *)&(s -> current_value), (StgWord)expected, (StgWord)trec);
result = (StgClosure *)w;
IF_NONMOVING_WRITE_BARRIER_ENABLED {
if (result)
updateRemembSetPushClosure(cap, expected);
}
- TRACE("%p : %s", trec, result ? "success" : "failure");
+ // TRACE("%p : %s", trec, result ? "success" : "failure");
return (result == expected);
}
#endif
@@ -438,6 +359,8 @@ static StgTRecHeader *new_stg_trec_header(Capability *cap,
// Allocation / deallocation functions that retain per-capability lists
// of closures that can be re-used
+//TODO: I think some of these lack write barriers required by the non-moving gc.
+
static StgTVarWatchQueue *alloc_stg_tvar_watch_queue(Capability *cap,
StgClosure *closure) {
StgTVarWatchQueue *result = NULL;
@@ -760,6 +683,44 @@ static void revert_ownership(Capability *cap STG_UNUSED,
/*......................................................................*/
+// validate_optimistic()
+StgBool validate_trec_optimistic (Capability *cap, StgTRecHeader *trec);
+
+StgBool validate_trec_optimistic (Capability *cap, StgTRecHeader *trec) {
+ StgBool result;
+ TRACE("cap %d, trec %p : validate_trec_optimistic",
+ cap->no, trec);
+
+ if (shake()) {
+ TRACE("%p : shake, pretending trec is invalid when it may not be", trec);
+ return false;
+ }
+
+ ASSERT((trec -> state == TREC_ACTIVE) ||
+ (trec -> state == TREC_WAITING) ||
+ (trec -> state == TREC_CONDEMNED));
+ result = !((trec -> state) == TREC_CONDEMNED);
+ if (result) {
+ FOR_EACH_ENTRY(trec, e, {
+ StgTVar *s;
+ s = e -> tvar;
+ StgClosure *current = RELAXED_LOAD(&s->current_value);
+ if(current != e->expected_value &&
+ //If the trec is locked we optimistically assume our trec will still be valid after it's unlocked.
+ (GET_INFO(UNTAG_CLOSURE(current)) != &stg_TREC_HEADER_info))
+ { TRACE("%p : failed optimistic validate %p", trec, s);
+ result = false;
+ BREAK_FOR_EACH;
+ }
+ });
+ }
+
+
+ TRACE("%p : validate_trec_optimistic, result: %d", trec, result);
+ return result;
+}
+
+
// validate_and_acquire_ownership : this performs the twin functions
// of checking that the TVars referred to by entries in trec hold the
// expected values and:
@@ -778,6 +739,8 @@ static StgBool validate_and_acquire_ownership (Capability *cap,
int acquire_all,
int retain_ownership) {
StgBool result;
+ TRACE("cap %d, trec %p : validate_and_acquire_ownership, all: %d, retain: %d",
+ cap->no, trec, acquire_all, retain_ownership);
if (shake()) {
TRACE("%p : shake, pretending trec is invalid when it may not be", trec);
@@ -828,6 +791,7 @@ static StgBool validate_and_acquire_ownership (Capability *cap,
revert_ownership(cap, trec, acquire_all);
}
+ TRACE("%p : validate_and_acquire_ownership, result: %d", trec, result);
return result;
}
@@ -878,12 +842,10 @@ static StgBool check_read_only(StgTRecHeader *trec STG_UNUSED) {
/************************************************************************/
void stmPreGCHook (Capability *cap) {
- lock_stm(NO_TREC);
TRACE("stmPreGCHook");
cap->free_tvar_watch_queues = END_STM_WATCH_QUEUE;
cap->free_trec_chunks = END_STM_CHUNK_LIST;
cap->free_trec_headers = NO_TREC;
- unlock_stm(NO_TREC);
}
/************************************************************************/
@@ -959,8 +921,6 @@ void stmAbortTransaction(Capability *cap,
(trec -> state == TREC_WAITING) ||
(trec -> state == TREC_CONDEMNED));
- lock_stm(trec);
-
et = trec -> enclosing_trec;
if (et == NO_TREC) {
// We're a top-level transaction: remove any watch queue entries that
@@ -984,8 +944,6 @@ void stmAbortTransaction(Capability *cap,
}
trec -> state = TREC_ABORTED;
- unlock_stm(trec);
-
TRACE("%p : stmAbortTransaction done", trec);
}
@@ -1013,35 +971,149 @@ void stmCondemnTransaction(Capability *cap,
(trec -> state == TREC_WAITING) ||
(trec -> state == TREC_CONDEMNED));
- lock_stm(trec);
if (trec -> state == TREC_WAITING) {
ASSERT(trec -> enclosing_trec == NO_TREC);
TRACE("%p : stmCondemnTransaction condemning waiting transaction", trec);
remove_watch_queue_entries_for_trec(cap, trec);
}
trec -> state = TREC_CONDEMNED;
- unlock_stm(trec);
TRACE("%p : stmCondemnTransaction done", trec);
}
-/*......................................................................*/
-
-StgBool stmValidateNestOfTransactions(Capability *cap, StgTRecHeader *trec) {
+/*......................................................................
+
+Note [STM Validation]
+~~~~~~~~~~~~~~~~~~~~~
+We validate STM transactions for two purposes:
+* Ensure the trec (transaction log) is valid *after* execution, either during
+ commit or after an exception has occurred, potentially locking the tvars in
+ the process. This is done by validate_and_acquire_ownership.
+* Terminate transactions early once their trec has become invalid.
+ This is done by validate_trec_optimistic.
+
+Note that the second point is not merely an optimization. Consider transactions
+that are in an infinite loop as a result of seeing an inconsistent view of
+memory, e.g.
+
+ atomically $ do
+ [a,b] <- mapM readTVar [ta,tb]
+ -- a is never equal to b given a consistent view of memory.
+ when (a == b) loop
+
+We want to always get a precise result for both checks. And indeed for the
+non-threaded runtime this is reasonably straightforward (see STM paper "Composable Memory Transactions").
+However for SMP things are more difficult.
+
+The easiest way to avoid false positives is to lock all relevant tvars during
+validation. And indeed that is what we still use for post-run validation.
+While this can lead to validation spuriously failing in edge cases when multiple
+threads perform validation in parallel, the relevant transactions will simply be
+restarted, and as long as the false-negative rate is reasonably low this is not
+problematic.
+
+However, compared to post-run validation, in-flight validation can happen multiple
+times per transaction. This means even a fairly low rate of spurious validation
+failures will result in a large performance hit, in the worst case preventing
+progress altogether (see #24446).
+
+We also don't want to reduce validation frequency too much, since we still want
+to detect invalid transactions early. So we simply stick with the frequency
+"on return to scheduler" that's described in the STM paper.
+
+However we can improve the behaviour of in-flight validations by taking advantage
+of the fact that we can allow false positives for these.
+
+The biggest overhead we can reduce for in-flight validation is locking. We simply
+won't take any locks for in-flight validation. If the tvar is already locked we
+simply assume the value in our trec is still valid.
+
+This has the following effects:
+
+Benefits:
+* No lock contention between post-run and in-flight validations operating on the
+ same tvars. This reduces the false negative rate significantly for both.
+* Concurrent in-flight validations won't cause each other to fail spuriously
+ through lock contention.
+* No cas operations for in-flight validation, which reduces its overhead significantly.
+
+Drawbacks:
+* We will sometimes fail to recognize invalid trecs as such by assuming locked
+ tvars contain valid values.
+
+Why can we simply not lock tvars for in-flight validations? Unlike with post-run
+validation, if we miss part of an update which would invalidate the trec, it will
+be seen by a later validation (at the latest in the post-run validation, which
+still locks). However there is one exception: looping transactions.
+
+If a transaction loops it will *only* be validated optimistically. I think this
+is not an issue. The only way for in-flight validation to constantly
+result in false positives is for the conflicting tvar(s) to get constantly locked
+for updates by post-run validations, which seems exceedingly unlikely over a long
+period of time - at least not any more likely than some of the other similarly
+unlikely live-lock scenarios for the STM implementation.
+
+Alternatives:
+
+All of these primarily revolve around ways to ensure that we can recognize invalid
+looping transactions. However, without proof that this is a real problem,
+implementing them does not seem worthwhile.
+
+
+A1:
+Take locks for in-flight validation opportunistically to improve things.
+While this would solve lock contention/false positives caused
+by concurrent in-flight validations, it would still result in in-flight validation
+potentially triggering false negatives during post-run validation by holding a
+lock a post-run validation is trying to take. Neither is it guaranteed to
+recognize a looping transaction as invalid, so this does not seem like an
+improvement over the lock-free in-flight validation.
+
+A2:
+Perform occasional locking in-flight validation for long-running transactions.
+This would solve the theoretical looping-transaction recognition issue at the
+cost of some performance and complexity. This could be done by adding a counter to
+the trec, counting the number of validations it has endured.
+
+A2.1:
+Like A2, but instead of counting the number of validations count the number of
+potential false positives by keeping track of how often we couldn't validate
+locked tvars. Could be done fine-grained on a per-trec-entry basis or for the trec
+overall.
+
+A3:
+When encountering a locked tvar, validate the trec based on the value of the
+tvar before it was locked. This could be done by either adding another field
+to the tvar, or by looking for the expected value in the trec that holds the
+lock of the tvar. But neither option sounds great.
+
+
+*/
+
+// Check if a transaction is possibly invalid by this point.
+// Pessimistically - Currently we use this if an exception occurred inside a transaction,
+// to decide whether or not to abort by checking if the transaction was valid.
+// Optimistically - Currently we use this to eagerly abort invalid transactions from the scheduler.
+// See Note [STM Validation]
+StgBool stmValidateNestOfTransactions(Capability *cap, StgTRecHeader *trec, StgBool optimistically) {
StgTRecHeader *t;
- TRACE("%p : stmValidateNestOfTransactions", trec);
+ TRACE("%p : stmValidateNestOfTransactions, %b", trec, optimistically);
ASSERT(trec != NO_TREC);
ASSERT((trec -> state == TREC_ACTIVE) ||
(trec -> state == TREC_WAITING) ||
(trec -> state == TREC_CONDEMNED));
- lock_stm(trec);
-
t = trec;
StgBool result = true;
while (t != NO_TREC) {
- result &= validate_and_acquire_ownership(cap, t, true, false);
+ if(optimistically) {
+ result &= validate_trec_optimistic(cap, t);
+
+ } else {
+ // TODO: I don't think there is a need to lock all tvars here.
+ result &= validate_and_acquire_ownership(cap, t, true, false);
+ }
t = t -> enclosing_trec;
}
@@ -1049,12 +1121,9 @@ StgBool stmValidateNestOfTransactions(Capability *cap, StgTRecHeader *trec) {
trec -> state = TREC_CONDEMNED;
}
- unlock_stm(trec);
-
TRACE("%p : stmValidateNestOfTransactions()=%d", trec, result);
return result;
}
-
/*......................................................................*/
static TRecEntry *get_entry_for(StgTRecHeader *trec, StgTVar *tvar, StgTRecHeader **in) {
@@ -1087,8 +1156,6 @@ StgBool stmCommitTransaction(Capability *cap, StgTRecHeader *trec) {
TRACE("%p : stmCommitTransaction()", trec);
ASSERT(trec != NO_TREC);
- lock_stm(trec);
-
ASSERT(trec -> enclosing_trec == NO_TREC);
ASSERT((trec -> state == TREC_ACTIVE) ||
(trec -> state == TREC_CONDEMNED));
@@ -1112,6 +1179,7 @@ StgBool stmCommitTransaction(Capability *cap, StgTRecHeader *trec) {
max_concurrent_commits = ((max_commits_at_end - max_commits_at_start) +
(getNumCapabilities() * TOKEN_BATCH_SIZE));
if (((max_concurrent_commits >> 32) > 0) || shake()) {
+ TRACE("STM - Max commit number exceeded");
result = false;
}
}
@@ -1145,8 +1213,6 @@ StgBool stmCommitTransaction(Capability *cap, StgTRecHeader *trec) {
}
}
- unlock_stm(trec);
-
free_stg_trec_header(cap, trec);
TRACE("%p : stmCommitTransaction()=%d", trec, result);
@@ -1162,8 +1228,6 @@ StgBool stmCommitNestedTransaction(Capability *cap, StgTRecHeader *trec) {
TRACE("%p : stmCommitNestedTransaction() into %p", trec, trec -> enclosing_trec);
ASSERT((trec -> state == TREC_ACTIVE) || (trec -> state == TREC_CONDEMNED));
- lock_stm(trec);
-
et = trec -> enclosing_trec;
bool result = validate_and_acquire_ownership(cap, trec, (!config_use_read_phase), true);
if (result) {
@@ -1196,8 +1260,6 @@ StgBool stmCommitNestedTransaction(Capability *cap, StgTRecHeader *trec) {
}
}
- unlock_stm(trec);
-
free_stg_trec_header(cap, trec);
TRACE("%p : stmCommitNestedTransaction()=%d", trec, result);
@@ -1214,7 +1276,6 @@ StgBool stmWait(Capability *cap, StgTSO *tso, StgTRecHeader *trec) {
ASSERT((trec -> state == TREC_ACTIVE) ||
(trec -> state == TREC_CONDEMNED));
- lock_stm(trec);
bool result = validate_and_acquire_ownership(cap, trec, true, true);
if (result) {
// The transaction is valid so far so we can actually start waiting.
@@ -1237,7 +1298,6 @@ StgBool stmWait(Capability *cap, StgTSO *tso, StgTRecHeader *trec) {
// TRec.
} else {
- unlock_stm(trec);
free_stg_trec_header(cap, trec);
}
@@ -1249,7 +1309,6 @@ StgBool stmWait(Capability *cap, StgTSO *tso, StgTRecHeader *trec) {
void
stmWaitUnlock(Capability *cap, StgTRecHeader *trec) {
revert_ownership(cap, trec, true);
- unlock_stm(trec);
}
/*......................................................................*/
@@ -1263,7 +1322,6 @@ StgBool stmReWait(Capability *cap, StgTSO *tso) {
ASSERT((trec -> state == TREC_WAITING) ||
(trec -> state == TREC_CONDEMNED));
- lock_stm(trec);
bool result = validate_and_acquire_ownership(cap, trec, true, true);
TRACE("%p : validation %s", trec, result ? "succeeded" : "failed");
if (result) {
@@ -1280,7 +1338,6 @@ StgBool stmReWait(Capability *cap, StgTSO *tso) {
}
free_stg_trec_header(cap, trec);
}
- unlock_stm(trec);
TRACE("%p : stmReWait()=%d", trec, result);
return result;
=====================================
rts/STM.h
=====================================
@@ -6,24 +6,21 @@
*
*----------------------------------------------------------------------
- STM.h defines the C-level interface to the STM.
+ STM.h defines the C-level interface to the STM.
The design follows that of the PPoPP 2005 paper "Composable memory
transactions" extended to include fine-grained locking of TVars.
Three different implementations can be built. In overview:
-
+
STM_UNIPROC -- no locking at all: not safe for concurrent invocations
-
- STM_CG_LOCK -- coarse-grained locking : a single mutex protects all
- TVars
-
+
STM_FG_LOCKS -- per-TVar exclusion : each TVar can be owned by at
most one TRec at any time. This allows dynamically
non-conflicting transactions to commit in parallel.
The implementation treats reads optimistically --
- extra versioning information is retained in the
- saw_update_by field of the TVars so that they do not
+ extra versioning information is retained in the
+ num_updates field of the TVars so that they do not
need to be locked for reading.
STM.C contains more details about the locking schemes used.
@@ -72,7 +69,7 @@ void stmAbortTransaction(Capability *cap, StgTRecHeader *trec);
void stmFreeAbortedTRec(Capability *cap, StgTRecHeader *trec);
/*
- * Ensure that a subsequent commit / validation will fail. We use this
+ * Ensure that a subsequent commit / validation will fail. We use this
* in our current handling of transactions that may have become invalid
* and started looping. We strip their stack back to the ATOMICALLY_FRAME,
* and, when the thread is next scheduled, discover it to be invalid and
@@ -87,16 +84,23 @@ void stmCondemnTransaction(Capability *cap, StgTRecHeader *trec);
Validation
----------
- Test whether the specified transaction record, and all those within which
- it is nested, are still valid.
+ Test whether the specified transaction record, and all those within which
+ it is nested, are still valid.
+
+ stmValidateNestOfTransactions - optimistically
+ - Can return false positives when tvars are locked.
+ - Faster
+ - Does not take any locks
+
+ stmValidateNestOfTransactions - pessimistic
+ - Can return false negatives.
+ - Slower
+ - Takes locks, negatively affecting performance of other threads.
+ - Most importantly - no false positives!
- Note: the caller can assume that once stmValidateTransaction has
- returned false for a given trec then that transaction will never
- again be valid -- we rely on this in Schedule.c when kicking invalid
- threads at GC (in case they are stuck looping)
*/
-StgBool stmValidateNestOfTransactions(Capability *cap, StgTRecHeader *trec);
+StgBool stmValidateNestOfTransactions(Capability *cap, StgTRecHeader *trec, StgBool optimistically);
/*----------------------------------------------------------------------
@@ -106,14 +110,14 @@ StgBool stmValidateNestOfTransactions(Capability *cap, StgTRecHeader *trec);
These four operations return boolean results which should be interpreted
as follows:
- true => The transaction record was definitely valid
+ true => The transaction record was definitely valid
false => The transaction record may not have been valid
Note that, for nested operations, validity here is solely in terms
of the specified trec: it does not say whether those that it may be
- nested are themselves valid. Callers can check this with
- stmValidateNestOfTransactions.
+ nested are themselves valid. Callers can check this with
+ stmValidateNestOfTransactionsPessimistic.
The user of the STM should ensure that it is always safe to assume that a
transaction context is not valid when in fact it is (i.e. to return false in
@@ -151,7 +155,7 @@ StgBool stmCommitNestedTransaction(Capability *cap, StgTRecHeader *trec);
* Test whether the current transaction context is valid and, if so,
* start the thread waiting for updates to any of the tvars it has
* ready from and mark it as blocked. It is an error to call stmWait
- * if the thread is already waiting.
+ * if the thread is already waiting.
*/
StgBool stmWait(Capability *cap, StgTSO *tso, StgTRecHeader *trec);
@@ -180,7 +184,7 @@ StgBool stmReWait(Capability *cap, StgTSO *tso);
*/
StgClosure *stmReadTVar(Capability *cap,
- StgTRecHeader *trec,
+ StgTRecHeader *trec,
StgTVar *tvar);
/* Update the logical contents of 'tvar' within the context of the
@@ -189,7 +193,7 @@ StgClosure *stmReadTVar(Capability *cap,
void stmWriteTVar(Capability *cap,
StgTRecHeader *trec,
- StgTVar *tvar,
+ StgTVar *tvar,
StgClosure *new_value);
/*----------------------------------------------------------------------*/
=====================================
rts/Schedule.c
=====================================
@@ -1106,7 +1106,7 @@ schedulePostRunThread (Capability *cap, StgTSO *t)
// and a is never equal to b given a consistent view of memory.
//
if (t -> trec != NO_TREC && t -> why_blocked == NotBlocked) {
- if (!stmValidateNestOfTransactions(cap, t -> trec)) {
+ if (!stmValidateNestOfTransactions(cap, t -> trec, true)) {
debugTrace(DEBUG_sched | DEBUG_stm,
"trec %p found wasting its time", t);
=====================================
rts/include/stg/SMP.h
=====================================
@@ -201,14 +201,15 @@ EXTERN_INLINE void busy_wait_nop(void);
* - StgWeak: finalizer
* - StgMVar: head, tail, value
* - StgMVarTSOQueue: link
- * - StgTVar: current_value, first_watch_queue_entry
- * - StgTVarWatchQueue: {next,prev}_queue_entry
- * - StgTRecChunk: TODO
* - StgMutArrPtrs: payload
* - StgSmallMutArrPtrs: payload
* - StgThunk although this is a somewhat special case; see below
* - StgInd: indirectee
* - StgTSO: block_info
+
+ * - StgTVar: current_value, first_watch_queue_entry
+ * - StgTVarWatchQueue: {next,prev}_queue_entry
+ * - StgTRecChunk: TODO
*
* Finally, non-pointer fields can be safely mutated without barriers as
* they do not refer to other memory locations. Technically, concurrent
=====================================
testsuite/tests/driver/multipleHomeUnits/mhu-perf/Makefile
=====================================
@@ -0,0 +1,23 @@
+TOP=../../../..
+include $(TOP)/mk/boilerplate.mk
+include $(TOP)/mk/test.mk
+
+SETUP=../Setup -v0
+
+mhu-perf: clean
+ $(MAKE) -s --no-print-directory clean
+ ./genLargeHMU
+ '$(GHC_PKG)' init tmp.d
+ '$(TEST_HC)' $(TEST_HC_OPTS) -v0 --make Setup
+ for dir in unit-p*; do \
+ cd $$dir && $(SETUP) clean && $(SETUP) configure $(CABAL_MINIMAL_BUILD) --ipid=$$dir-0.1.0.0 --with-ghc='$(TEST_HC)' --with-hc-pkg='$(GHC_PKG)' --ghc-options='$(TEST_HC_OPTS)' --package-db=../tmp.d && $(SETUP) build && $(SETUP) register --inplace && cd ..; \
+ done;
+
+
+ifeq "$(CLEANUP)" "1"
+ $(MAKE) -s --no-print-directory clean
+endif
+
+clean :
+ $(RM) -r unitTop* unit-p* top*/ tmp*.d inst-* *.o *.hi */*.o */*.hi */Setup$(exeext) */dist Setup$(exeext)
+
=====================================
testsuite/tests/driver/multipleHomeUnits/mhu-perf/Setup.hs
=====================================
@@ -0,0 +1,2 @@
+import Distribution.Simple
+main = defaultMain
=====================================
testsuite/tests/driver/multipleHomeUnits/mhu-perf/all.T
=====================================
@@ -0,0 +1,10 @@
+test('mhu-perf',
+ [ collect_compiler_stats('bytes allocated',2),
+ extra_files(['genLargeHMU','Setup.hs']),
+ pre_cmd('$MAKE -s --no-print-directory mhu-perf'),
+ js_broken(22349),
+ when(arch('wasm32'), skip), # wasm32 doesn't like running Setup/Makefile tests
+ compile_timeout_multiplier(5)
+ ],
+ multiunit_compile,
+ [['unitTop1', 'unitTop2'], '-fhide-source-paths'])
=====================================
testsuite/tests/driver/multipleHomeUnits/mhu-perf/genLargeHMU
=====================================
@@ -0,0 +1,54 @@
+#!/usr/bin/env bash
+# Generate $DEPTH layers of packages with $WIDTH modules on each layer
+# Every package on layer N depends on all the packages on layer N-1
+# unitTop imports all the units from the last layer
+DEPTH=8
+WIDTH=8
+for i in $(seq -w 1 $WIDTH); do
+ mkdir unit-p0M$i
+ echo "module DummyLevel0M$i where" > unit-p0M$i/DummyLevel0M$i.hs;
+ cat > unit-p0M$i/unit-p0M$i.cabal <<EOF
+name: unit-p0M$i
+version: 0.1.0.0
+build-type: Simple
+cabal-version: >=1.10
+library
+ default-language: Haskell2010
+ exposed-modules: DummyLevel0M$i
+ build-depends: base
+EOF
+done
+for l in $(seq 1 $DEPTH); do
+ for i in $(seq -w 1 $WIDTH); do
+ mkdir unit-p${l}M$i
+ cat > unit-p${l}M$i/unit-p${l}M$i.cabal <<EOF
+name: unit-p${l}M$i
+version: 0.1.0.0
+build-type: Simple
+cabal-version: >=1.10
+library
+ default-language: Haskell2010
+ exposed-modules: DummyLevel${l}M$i
+ build-depends: base
+EOF
+ echo "module DummyLevel${l}M$i where" > unit-p${l}M$i/DummyLevel${l}M$i.hs;
+ for j in $(seq -w 1 $WIDTH); do
+ echo " , unit-p$((l-1))M$j" >> unit-p${l}M$i/unit-p${l}M$i.cabal
+ echo "import DummyLevel$((l-1))M$j" >> unit-p${l}M$i/DummyLevel${l}M$i.hs;
+ done
+ done
+done
+mkdir top1
+echo "module Top1 where" > top1/Top1.hs
+echo "-package-db ./tmp.d -i -itop1 Top1 -this-unit-id unit-top1 -package base" > unitTop1;
+for j in $(seq -w 1 $WIDTH); do
+ echo "-package unit-p${DEPTH}M$j" >> unitTop1;
+ echo "import DummyLevel${DEPTH}M$j" >> top1/Top1.hs;
+done
+mkdir top2
+echo "module Top2 where" > top2/Top2.hs
+echo "-package-db ./tmp.d -i -itop2 Top2 -this-unit-id unit-top2 -package base" > unitTop2;
+for j in $(seq -w 2 $WIDTH); do
+ echo "-package unit-p${DEPTH}M$j" >> unitTop2;
+ echo "import DummyLevel${DEPTH}M$j" >> top2/Top2.hs;
+done
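(Editorial aside.) Tracing the script above with its defaults (DEPTH=8, WIDTH=8): every package on layer l exposes a single module that imports all eight modules from layer l-1, so the test ends up with (DEPTH+1)*WIDTH = 72 generated packages plus the two top units unitTop1/unitTop2. For example, the generated unit-p2M3/DummyLevel2M3.hs would contain:

    module DummyLevel2M3 where
    import DummyLevel1M1
    import DummyLevel1M2
    import DummyLevel1M3
    import DummyLevel1M4
    import DummyLevel1M5
    import DummyLevel1M6
    import DummyLevel1M7
    import DummyLevel1M8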
=====================================
testsuite/tests/driver/multipleHomeUnits/mhu-perf/mhu-perf.stderr
=====================================
@@ -0,0 +1,2 @@
+[1 of 2] Compiling Top1[unit-top1]
+[2 of 2] Compiling Top2[unit-top2]
=====================================
testsuite/tests/numeric/should_run/all.T
=====================================
@@ -50,6 +50,7 @@ test('T4383', normal, compile_and_run, [''])
test('add2', normal, compile_and_run, ['-fobject-code'])
test('mul2', normal, compile_and_run, ['-fobject-code'])
+test('mul2int', normal, compile_and_run, ['-fobject-code'])
test('quotRem2', normal, compile_and_run, ['-fobject-code'])
test('T5863', normal, compile_and_run, [''])
=====================================
testsuite/tests/numeric/should_run/mul2int.hs
=====================================
@@ -0,0 +1,35 @@
+{-# LANGUAGE MagicHash, UnboxedTuples #-}
+
+import GHC.Exts
+import Data.Bits
+
+main :: IO ()
+main = do g 5 6
+ g (-5) 6
+ g 0x7ECA71DBFF1B7D8C 49
+ g (-0x7ECA71DBFF1B7D8C) 49
+ g 0x7ECA71DBFF1B7D8C 0x7E0EC51DFD94FE35
+ g 0x7ECA71DBFF1B7D8C (-0x7E0EC51DFD94FE35)
+
+
+g :: Int -> Int -> IO ()
+g wx@(I# x) wy@(I# y)
+ = do putStrLn "-----"
+ putStrLn ("Doing " ++ show wx ++ " * " ++ show wy)
+ case x `timesInt2#` y of
+ (# n, h, l #) ->
+ do let wh = I# h
+ wl = I# l
+ wlw = W# (int2Word# l)
+ wn = I# n
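+ -- When the 'needed' flag is 1, the product does not fit in one word:
+ -- combine the high word with the low word read as unsigned. Otherwise
+ -- the low word alone is the full, sign-extended product.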
+ r | wn == 1 = shiftL (fromIntegral wh) (finiteBitSize wh)
+ + fromIntegral wlw
+ | otherwise = fromIntegral wl
+
+ putStrLn ("High: " ++ show wh)
+ putStrLn ("Low: " ++ show wl)
+ putStrLn ("Needed: " ++ show wn)
+ putStrLn ("Result: " ++ show (r :: Integer))
+ putStrLn ("Should be: " ++ show (fromIntegral wx * fromIntegral wy :: Integer))
+
+
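(Editorial aside.) As a cross-check of the reconstruction in 'g': on a 64-bit Int the full product is the high word times 2^64 plus the low word taken as unsigned. For the first large case in the expected output below:

    -- timesInt2# reports high = 24, low = 4953901435516553164 (non-negative,
    -- so the signed and unsigned readings agree):
    --     24 * 2^64 + 4953901435516553164
    --   = 442721857769029238784 + 4953901435516553164
    --   = 447675759204545791948     -- i.e. 9136239983766240652 * 49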
=====================================
testsuite/tests/numeric/should_run/mul2int.stdout
=====================================
@@ -0,0 +1,42 @@
+-----
+Doing 5 * 6
+High: 0
+Low: 30
+Needed: 0
+Result: 30
+Should be: 30
+-----
+Doing -5 * 6
+High: -1
+Low: -30
+Needed: 0
+Result: -30
+Should be: -30
+-----
+Doing 9136239983766240652 * 49
+High: 24
+Low: 4953901435516553164
+Needed: 1
+Result: 447675759204545791948
+Should be: 447675759204545791948
+-----
+Doing -9136239983766240652 * 49
+High: -25
+Low: -4953901435516553164
+Needed: 1
+Result: -447675759204545791948
+Should be: -447675759204545791948
+-----
+Doing 9136239983766240652 * 9083414231051992629
+High: 4498802171008813567
+Low: 3355592377236579836
+Needed: 1
+Result: 82988252286848496451678442784944154108
+Should be: 82988252286848496451678442784944154108
+-----
+Doing 9136239983766240652 * -9083414231051992629
+High: -4498802171008813568
+Low: -3355592377236579836
+Needed: 1
+Result: -82988252286848496451678442784944154108
+Should be: -82988252286848496451678442784944154108
=====================================
testsuite/tests/rts/T24142.hs
=====================================
@@ -0,0 +1,63 @@
+{- This test constructs a program that used to trigger an excessive number of STM retries. -}
+{-# LANGUAGE NumericUnderscores #-}
+{-# LANGUAGE BangPatterns #-}
+module Main where
+
+import GHC.Conc
+import Control.Concurrent (newMVar, newEmptyMVar, takeMVar, putMVar)
+import Control.Monad
+import Control.Concurrent.STM.TArray
+import Data.Array.MArray
+import Data.IORef
+
+
+main :: IO ()
+main =
+ forM_ [2..40] $ \i -> do
+ -- Run the test with an increasing number of tvars
+ let tvar_count = i * 10
+ -- print $ "Tvars: " ++ show tvar_count
+ provokeLivelock tvar_count
+
+
+-- Forks two threads, each running an STM transaction; both access the same tvars, but in opposite order.
+provokeLivelock :: Int -> IO ()
+provokeLivelock n = do
+ -- Use a TArray as a convenient way to bundle up n TVars.
+ tvarArray <- atomically $ do
+ newListArray (0,n) [0.. fromIntegral n :: Integer] :: STM (TArray Int Integer)
+ m1 <- newEmptyMVar
+ m2 <- newEmptyMVar
+ updateCount <- newIORef (0 :: Int)
+
+ let useTvars :: [Int] -> Bool -> IO ()
+ useTvars tvar_order use_writes = atomically $ do
+ -- Walk the array once in the given order to add all tvars to the transaction log.
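+ -- unsafeIOToSTM effects are not rolled back and this line re-runs on
+ -- every restart of the transaction, so updateCount ends up counting
+ -- the total number of transaction attempts across both threads.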
+ unsafeIOToSTM $ atomicModifyIORef' updateCount (\i -> (i+1,()))
+ mapM_ (\i -> readArray tvarArray i >>= \(!_n) -> return ()) tvar_order
+
+
+ -- Then we just enter the scheduler a lot
+ forM_ tvar_order $ \i -> do
+ -- when use_writes $
+ -- readArray tvarArray i >>= \(!n) -> writeArray tvarArray i (n+1 :: Integer)
+ unsafeIOToSTM yield
+
+ _ <- forkIO $ do
+ useTvars [0..n] False
+ -- print "Thread1 done."
+ putMVar m1 True
+ _ <- forkIO $ do
+ useTvars (reverse [0..n]) False
+ -- print "Thread2 done."
+ putMVar m2 True
+ -- Wait for forked threads.
+ _ <- takeMVar m1
+ _ <- takeMVar m2
+ updates <- readIORef updateCount
+ if updates > n
+ then putStrLn $ "TVars: " ++ show n ++ ", ERROR: more than " ++ show n ++ " transaction attempts. (" ++ show updates ++")\n"
+ else putStrLn $ "TVars: " ++ show n ++ ", OK: no more than " ++ show n ++ " transaction attempts."
+
+ return ()
+
=====================================
testsuite/tests/rts/T24142.stdout
=====================================
@@ -0,0 +1,39 @@
+TVars: 20, OK: no more than 20 transaction attempts.
+TVars: 30, OK: no more than 30 transaction attempts.
+TVars: 40, OK: no more than 40 transaction attempts.
+TVars: 50, OK: no more than 50 transaction attempts.
+TVars: 60, OK: no more than 60 transaction attempts.
+TVars: 70, OK: no more than 70 transaction attempts.
+TVars: 80, OK: no more than 80 transaction attempts.
+TVars: 90, OK: no more than 90 transaction attempts.
+TVars: 100, OK: no more than 100 transaction attempts.
+TVars: 110, OK: no more than 110 transaction attempts.
+TVars: 120, OK: no more than 120 transaction attempts.
+TVars: 130, OK: no more than 130 transaction attempts.
+TVars: 140, OK: no more than 140 transaction attempts.
+TVars: 150, OK: no more than 150 transaction attempts.
+TVars: 160, OK: no more than 160 transaction attempts.
+TVars: 170, OK: no more than 170 transaction attempts.
+TVars: 180, OK: no more than 180 transaction attempts.
+TVars: 190, OK: no more than 190 transaction attempts.
+TVars: 200, OK: no more than 200 transaction attempts.
+TVars: 210, OK: no more than 210 transaction attempts.
+TVars: 220, OK: no more than 220 transaction attempts.
+TVars: 230, OK: no more than 230 transaction attempts.
+TVars: 240, OK: no more than 240 transaction attempts.
+TVars: 250, OK: no more than 250 transaction attempts.
+TVars: 260, OK: no more than 260 transaction attempts.
+TVars: 270, OK: no more than 270 transaction attempts.
+TVars: 280, OK: no more than 280 transaction attempts.
+TVars: 290, OK: no more than 290 transaction attempts.
+TVars: 300, OK: no more than 300 transaction attempts.
+TVars: 310, OK: no more than 310 transaction attempts.
+TVars: 320, OK: no more than 320 transaction attempts.
+TVars: 330, OK: no more than 330 transaction attempts.
+TVars: 340, OK: no more than 340 transaction attempts.
+TVars: 350, OK: no more than 350 transaction attempts.
+TVars: 360, OK: no more than 360 transaction attempts.
+TVars: 370, OK: no more than 370 transaction attempts.
+TVars: 380, OK: no more than 380 transaction attempts.
+TVars: 390, OK: no more than 390 transaction attempts.
+TVars: 400, OK: no more than 400 transaction attempts.
=====================================
testsuite/tests/rts/all.T
=====================================
@@ -609,3 +609,6 @@ test('T23400', [], compile_and_run, ['-with-rtsopts -A8k'])
test('IOManager', [js_skip, when(arch('wasm32'), skip), when(opsys('mingw32'), skip),
pre_cmd('$MAKE -s --no-print-directory IOManager.hs')],
compile_and_run, [''])
+
+test('T24142', [], compile_and_run, ['-threaded -with-rtsopts "-N2"'])
+
View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/e7f2abae8e15758629909ad6e8cded4aec4361bc...d229a2600ceaa880f691f776d871fe255a34e3b0