[Git][ghc/ghc][master] 4 commits: Zero out pinned block alignment slop when profiling

Marge Bot gitlab at gitlab.haskell.org
Wed Apr 15 03:31:13 UTC 2020



 Marge Bot pushed to branch master at Glasgow Haskell Compiler / GHC


Commits:
41230e26 by Daniel Gröber at 2020-04-14T23:31:01-04:00
Zero out pinned block alignment slop when profiling

The heap profiler currently cannot traverse pinned blocks because of
alignment slop. This used to just be a minor annoyance as the whole block
is accounted into a special cost center rather than the respective object's
CCS, cf. #7275. However for the new root profiler we would like to be able
to visit _every_ closure on the heap. We need to do this so we can get rid
of the current 'flip' bit hack in the heap traversal code.

Since info pointers are always non-zero we can in principle skip all the
slop in the profiler if we can rely on it being zeroed. This assumption
caused problems in the past, though: commit a586b33f8e ("rts: Correct
handling of LARGE ARR_WORDS in LDV profiler"), part of !1118, tried to use
the same trick for BF_LARGE objects but neglected to take into account that
shrink*Array# functions don't ensure that slop is zeroed when not
compiling with profiling.

Later, commit 0c114c6599 ("Handle large ARR_WORDS in heap census (fix
...)") had to work around this. Going forward we will only be assuming
slop is zeroed when profiling is on.

This commit also reduces the amount of slop we introduce in the first
place by calculating the needed alignment before doing the allocation for
small objects, where we know the next available address. For large objects
we don't yet know how much alignment we'll have to do, since those details
are hidden behind the allocateMightFail function, so there we continue to
allocate the maximum additional words we might need for the alignment.

So that we don't have to duplicate all this logic in the Cmm code, we pull
it into the RTS allocatePinned function instead.
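
As a rough illustration of the alignment calculation described above, here
is a standalone C sketch (not the RTS code itself; the names, addresses and
the 64-bit word size are assumptions made for the example):

    #include <stdint.h>
    #include <stdio.h>

    /* Number of padding words needed so that (p + off_bytes) is aligned to
     * align_bytes (a power of two); same formula as the new
     * ALIGN_WITH_OFF_W macro in rts/sm/Storage.c. */
    static uintptr_t pad_words(uintptr_t p, uintptr_t align_bytes,
                               uintptr_t off_bytes)
    {
        return ((-p - off_bytes) & (align_bytes - 1)) / sizeof(uintptr_t);
    }

    int main(void)
    {
        uintptr_t p     = 0x1010; /* hypothetical next free address      */
        uintptr_t align = 16;     /* requested payload alignment (bytes) */
        uintptr_t off   = 8;      /* e.g. size of the header (bytes)     */
        uintptr_t pad   = pad_words(p, align, off);
        printf("%zu padding word(s); payload at %#zx\n",
               (size_t)pad, (size_t)(p + pad * sizeof(uintptr_t) + off));
        return 0;
    }
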

Metric Decrease:
    T7257
    haddock.Cabal
    haddock.base

- - - - -
15fa9bd6 by Daniel Gröber at 2020-04-14T23:31:01-04:00
rts: Expand and add more notes regarding slop

- - - - -
caf3f444 by Daniel Gröber at 2020-04-14T23:31:01-04:00
rts: allocatePinned: Fix confusion about word/byte units

- - - - -
c3c0f662 by Daniel Gröber at 2020-04-14T23:31:01-04:00
rts: Underline some Notes as is conventional

- - - - -


7 changed files:

- includes/rts/storage/ClosureMacros.h
- includes/rts/storage/GC.h
- rts/Apply.cmm
- rts/PrimOps.cmm
- rts/ProfHeap.c
- rts/sm/Sanity.c
- rts/sm/Storage.c


Changes:

=====================================
includes/rts/storage/ClosureMacros.h
=====================================
@@ -474,31 +474,39 @@ INLINE_HEADER StgWord8 *mutArrPtrsCard (StgMutArrPtrs *a, W_ n)
    OVERWRITING_CLOSURE(p) on the old closure that is about to be
    overwritten.
 
-   Note [zeroing slop]
+   Note [zeroing slop when overwriting closures]
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-   In some scenarios we write zero words into "slop"; memory that is
-   left unoccupied after we overwrite a closure in the heap with a
-   smaller closure.
+   When we overwrite a closure in the heap with a smaller one, in some scenarios
+   we need to write zero words into "slop", the memory that is left
+   unoccupied. See Note [slop on the heap].
 
    Zeroing slop is required for:
 
-    - full-heap sanity checks (DEBUG, and +RTS -DS)
-    - LDV profiling (PROFILING, and +RTS -hb)
+    - full-heap sanity checks (DEBUG, and +RTS -DS),
 
-   Zeroing slop must be disabled for:
+    - LDV profiling (PROFILING, and +RTS -hb) and
 
-    - THREADED_RTS with +RTS -N2 and greater, because we cannot
-      overwrite slop when another thread might be reading it.
+   However we can get into trouble if we're zeroing slop for ordinarily
+   immutable closures when using multiple threads, since there is nothing
+   preventing another thread from still being in the process of reading the
+   memory we're about to zero.
 
-   Hence, slop is zeroed when either:
+   Thus, with the THREADED RTS and +RTS -N2 or greater we must not zero
+   immutable closures' slop.
 
-    - PROFILING && era <= 0 (LDV is on)
-    - !THREADED_RTS && DEBUG
+   Hence, an immutable closure's slop is zeroed when either:
 
-   And additionally:
+    - PROFILING && era > 0 (LDV is on) or
+    - !THREADED && DEBUG
 
-    - LDV profiling and +RTS -N2 are incompatible
-    - full-heap sanity checks are disabled for THREADED_RTS
+   Additionally:
+
+    - LDV profiling and +RTS -N2 are incompatible,
+
+    - full-heap sanity checks are disabled for the THREADED RTS, except when
+      they run right after GC, when there is no slop.
+      See Note [heap sanity checking with SMP].
 
    -------------------------------------------------------------------------- */
 
@@ -534,7 +542,7 @@ EXTERN_INLINE void overwritingClosure_ (StgClosure *p,
 EXTERN_INLINE void overwritingClosure_ (StgClosure *p, uint32_t offset, uint32_t size, bool prim USED_IF_PROFILING)
 {
 #if ZERO_SLOP_FOR_LDV_PROF && !ZERO_SLOP_FOR_SANITY_CHECK
-    // see Note [zeroing slop], also #8402
+    // see Note [zeroing slop when overwriting closures], also #8402
     if (era <= 0) return;
 #endif
 

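To make the conditions above concrete, here is a minimal sketch of what
zeroing slop amounts to when it is enabled; this is not the actual
overwritingClosure_ implementation, and it omits the era/LDV checks
described in the Note:

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified sketch: when a closure occupying old_size words is
     * replaced by one needing only new_size words, the leftover words are
     * slop and get overwritten with zeros so heap scanners can skip them.
     * The uintptr_t* stands in for an StgClosure*. */
    void zero_slop_sketch(uintptr_t *closure, size_t new_size, size_t old_size)
    {
        for (size_t i = new_size; i < old_size; i++) {
            closure[i] = 0;
        }
    }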

=====================================
includes/rts/storage/GC.h
=====================================
@@ -170,10 +170,13 @@ extern generation * oldest_gen;
                                 Allocates memory from the nursery in
                                 the current Capability.
 
-   StgPtr allocatePinned(Capability *cap, W_ n)
+   StgPtr allocatePinned(Capability *cap, W_ n, W_ alignment, W_ align_off)
                                 Allocates a chunk of contiguous store
                                 n words long, which is at a fixed
-                                address (won't be moved by GC).
+                                address (won't be moved by GC). The
+                                word at the byte offset 'align_off'
+                                will be aligned to 'alignment', which
+                                must be a power of two.
                                 Returns a pointer to the first word.
                                 Always succeeds.
 
@@ -191,7 +194,7 @@ extern generation * oldest_gen;
 
 StgPtr  allocate          ( Capability *cap, W_ n );
 StgPtr  allocateMightFail ( Capability *cap, W_ n );
-StgPtr  allocatePinned    ( Capability *cap, W_ n );
+StgPtr  allocatePinned    ( Capability *cap, W_ n, W_ alignment, W_ align_off);
 
 /* memory allocator for executable memory */
 typedef void* AdjustorWritable;

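For illustration, a C caller of the new interface might look like the
following sketch; the helper name is hypothetical and only mirrors the call
the Cmm code below now makes:

    #include "Rts.h"  /* Capability, StgPtr, W_, StgArrBytes, allocatePinned */

    /* Hypothetical helper: allocate 'words' pinned words such that the
     * payload starting right after the StgArrBytes header is aligned to
     * 'alignment' bytes (a power of two). */
    static StgPtr alloc_pinned_arr(Capability *cap, W_ words, W_ alignment)
    {
        StgPtr p = allocatePinned(cap, words, alignment, sizeof(StgArrBytes));
        if (p == NULL) {
            return NULL;  /* heap overflow; the Cmm code raises heapOverflow */
        }
        return p;
    }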

=====================================
rts/Apply.cmm
=====================================
@@ -689,7 +689,7 @@ for:
 
   // Because of eager blackholing the closure no longer has correct size so
   // threadPaused() can't correctly zero the slop, so we do it here. See #15571
-  // and Note [zeroing slop].
+  // and Note [zeroing slop when overwriting closures].
   OVERWRITING_CLOSURE_SIZE(ap, BYTES_TO_WDS(SIZEOF_StgThunkHeader) + 2 + Words);
 
   ENTER_R1();


=====================================
rts/PrimOps.cmm
=====================================
@@ -89,22 +89,15 @@ stg_newPinnedByteArrayzh ( W_ n )
     /* When we actually allocate memory, we need to allow space for the
        header: */
     bytes = bytes + SIZEOF_StgArrBytes;
-    /* And we want to align to BA_ALIGN bytes, so we need to allow space
-       to shift up to BA_ALIGN - 1 bytes: */
-    bytes = bytes + BA_ALIGN - 1;
     /* Now we convert to a number of words: */
     words = ROUNDUP_BYTES_TO_WDS(bytes);
 
-    ("ptr" p) = ccall allocatePinned(MyCapability() "ptr", words);
+    ("ptr" p) = ccall allocatePinned(MyCapability() "ptr", words, BA_ALIGN, SIZEOF_StgArrBytes);
     if (p == NULL) {
         jump stg_raisezh(base_GHCziIOziException_heapOverflow_closure);
     }
     TICK_ALLOC_PRIM(SIZEOF_StgArrBytes,WDS(payload_words),0);
 
-    /* Now we need to move p forward so that the payload is aligned
-       to BA_ALIGN bytes: */
-    p = p + ((-p - SIZEOF_StgArrBytes) & BA_MASK);
-
     /* No write barrier needed since this is a new allocation. */
     SET_HDR(p, stg_ARR_WORDS_info, CCCS);
     StgArrBytes_bytes(p) = n;
@@ -121,7 +114,7 @@ stg_newAlignedPinnedByteArrayzh ( W_ n, W_ alignment )
     /* we always supply at least word-aligned memory, so there's no
        need to allow extra space for alignment if the requirement is less
        than a word.  This also prevents mischief with alignment == 0. */
-    if (alignment <= SIZEOF_W) { alignment = 1; }
+    if (alignment <= SIZEOF_W) { alignment = SIZEOF_W; }
 
     bytes = n;
 
@@ -131,23 +124,15 @@ stg_newAlignedPinnedByteArrayzh ( W_ n, W_ alignment )
     /* When we actually allocate memory, we need to allow space for the
        header: */
     bytes = bytes + SIZEOF_StgArrBytes;
-    /* And we want to align to <alignment> bytes, so we need to allow space
-       to shift up to <alignment - 1> bytes: */
-    bytes = bytes + alignment - 1;
     /* Now we convert to a number of words: */
     words = ROUNDUP_BYTES_TO_WDS(bytes);
 
-    ("ptr" p) = ccall allocatePinned(MyCapability() "ptr", words);
+    ("ptr" p) = ccall allocatePinned(MyCapability() "ptr", words, alignment, SIZEOF_StgArrBytes);
     if (p == NULL) {
         jump stg_raisezh(base_GHCziIOziException_heapOverflow_closure);
     }
     TICK_ALLOC_PRIM(SIZEOF_StgArrBytes,WDS(payload_words),0);
 
-    /* Now we need to move p forward so that the payload is aligned
-       to <alignment> bytes. Note that we are assuming that
-       <alignment> is a power of 2, which is technically not guaranteed */
-    p = p + ((-p - SIZEOF_StgArrBytes) & (alignment - 1));
-
     /* No write barrier needed since this is a new allocation. */
     SET_HDR(p, stg_ARR_WORDS_info, CCCS);
     StgArrBytes_bytes(p) = n;


=====================================
rts/ProfHeap.c
=====================================
@@ -1275,8 +1275,22 @@ heapCensusChain( Census *census, bdescr *bd )
             heapProfObject(census,(StgClosure*)p,size,prim);
 
             p += size;
-            /* skip over slop */
-            while (p < bd->free && !*p) p++; // skip slop
+
+            /* skip over slop, see Note [slop on the heap] */
+            while (p < bd->free && !*p) p++;
+            /* Note [skipping slop in the heap profiler]
+             * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+             *
+             * We make sure to zero slop that can remain after a major GC, so
+             * here we can assume any slop words we see until the block's free
+             * pointer are zero. Since info pointers are always nonzero we can
+             * use this to scan for the next valid heap closure.
+             *
+             * Note that not all types of slop are relevant here, only the ones
+             * that can remain after major GC. So essentially just large objects
+             * and pinned objects. All other closures will have been packed
+             * nicely and tightly into fresh blocks.
+             */
         }
     }
 }

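The scan above relies only on the invariant that info pointers are never
zero while slop words always are. A toy standalone version of the same
loop, with made-up word values standing in for a heap block:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Toy "block": non-zero words stand for closure headers/payloads,
         * zero words stand for slop that the RTS has zeroed out. */
        uintptr_t block[] = { 0xdead, 0xbeef, 0, 0, 0, 0xcafe, 0 };
        uintptr_t *p = block;
        uintptr_t *free_ptr = block + sizeof(block) / sizeof(block[0]);

        while (p < free_ptr) {
            printf("closure at index %td\n", p - block);
            p += 2;                           /* pretend each closure is 2 words */
            while (p < free_ptr && !*p) p++;  /* skip slop, as in heapCensusChain */
        }
        return 0;
    }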

=====================================
rts/sm/Sanity.c
=====================================
@@ -475,7 +475,7 @@ void checkHeapChain (bdescr *bd)
                 ASSERT( size >= MIN_PAYLOAD_SIZE + sizeofW(StgHeader) );
                 p += size;
 
-                /* skip over slop */
+                /* skip over slop, see Note [slop on the heap] */
                 while (p < bd->free &&
                        (*p < 0x1000 || !LOOKS_LIKE_INFO_PTR(*p))) { p++; }
             }
@@ -796,12 +796,17 @@ static void checkGeneration (generation *gen,
     ASSERT(countBlocks(gen->large_objects) == gen->n_large_blocks);
 
 #if defined(THREADED_RTS)
+    // Note [heap sanity checking with SMP]
+    // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    //
     // heap sanity checking doesn't work with SMP for two reasons:
-    //   * we can't zero the slop (see Updates.h).  However, we can sanity-check
-    //     the heap after a major gc, because there is no slop.
     //
-    //   * the nonmoving collector may be mutating its large object lists, unless we
-    //     were in fact called by the nonmoving collector.
+    //   * We can't zero the slop. However, we can sanity-check the heap after a
+    //     major gc, because there is no slop. See also Updates.h and Note
+    //     [zeroing slop when overwriting closures].
+    //
+    //   * The nonmoving collector may be mutating its large object lists,
+    //     unless we were in fact called by the nonmoving collector.
     if (!after_major_gc) return;
 #endif
 


=====================================
rts/sm/Storage.c
=====================================
@@ -907,6 +907,54 @@ accountAllocation(Capability *cap, W_ n)
 
 }
 
+/* Note [slop on the heap]
+ * ~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ * We use the term "slop" to refer to allocated memory on the heap which isn't
+ * occupied by any closure. Usually closures are packed tightly into heap
+ * blocks, storage for one immediately following another. However, there are
+ * situations where slop is left behind:
+ *
+ * - Allocating large objects (BF_LARGE)
+ *
+ *   These are given an entire block, but if they don't fill the entire block
+ *   the rest is slop. See allocateMightFail in Storage.c.
+ *
+ * - Allocating pinned objects with alignment (BF_PINNED)
+ *
+ *   These are packed into blocks like normal closures; however, they
+ *   can have alignment constraints and any memory that needed to be skipped for
+ *   alignment becomes slop. See allocatePinned in Storage.c.
+ *
+ * - Shrinking (Small)Mutable(Byte)Array#
+ *
+ *    The size of these closures can be decreased after allocation, leaving
+ *    any now-unused memory behind as slop. See stg_resizzeMutableByteArrayzh,
+ *    stg_shrinkSmallMutableArrayzh, and stg_shrinkMutableByteArrayzh in
+ *    PrimOps.cmm.
+ *
+ *    This type of slop is extra tricky because it can also be pinned and
+ *    large.
+ *
+ * - Overwriting closures
+ *
+ *   During GC the RTS overwrites closures with forwarding pointers; this can
+ *   leave slop behind depending on the size of the closure being
+ *   overwritten. See Note [zeroing slop when overwriting closures].
+ *
+ * In various situations we actually zero slop so that we can linearly scan
+ * over blocks of closures. This trick is used by the sanity checking code and
+ * the heap profiler; see Note [skipping slop in the heap profiler].
+ *
+ * When profiling we zero:
+ *  - Pinned object alignment slop, see MEMSET_IF_PROFILING_W in allocatePinned.
+ *  - Shrunk array slop, see OVERWRITING_MUTABLE_CLOSURE.
+ *
+ * When performing LDV profiling or using a (single threaded) debug RTS we zero
+ * slop even when overwriting immutable closures, see Note [zeroing slop when
+ * overwriting closures].
+ */
+
 /* -----------------------------------------------------------------------------
    StgPtr allocate (Capability *cap, W_ n)
 
@@ -1059,6 +1107,26 @@ allocateMightFail (Capability *cap, W_ n)
     return p;
 }
 
+/**
+ * Calculate the number of words we need to add to 'p' so it satisfies the
+ * alignment constraint '(p + off) & (align-1) == 0'.
+ */
+#define ALIGN_WITH_OFF_W(p, align, off) \
+    (((-((uintptr_t)p) - off) & (align-1)) / sizeof(W_))
+
+/**
+ * When profiling we zero the space used for alignment. This allows us to
+ * traverse pinned blocks in the heap profiler.
+ *
+ * See Note [skipping slop in the heap profiler]
+ */
+#if defined(PROFILING)
+#define MEMSET_IF_PROFILING_W(p, val, len_w) memset(p, val, (len_w) * sizeof(W_))
+#else
+#define MEMSET_IF_PROFILING_W(p, val, len_w) \
+    do { (void)(p); (void)(val); (void)(len_w); } while(0)
+#endif
+
 /* ---------------------------------------------------------------------------
    Allocate a fixed/pinned object.
 
@@ -1084,29 +1152,49 @@ allocateMightFail (Capability *cap, W_ n)
    ------------------------------------------------------------------------- */
 
 StgPtr
-allocatePinned (Capability *cap, W_ n)
+allocatePinned (Capability *cap, W_ n /*words*/, W_ alignment /*bytes*/, W_ align_off /*bytes*/)
 {
     StgPtr p;
     bdescr *bd;
 
+    // Alignment and offset have to be powers of two
+    ASSERT(alignment && !(alignment & (alignment - 1)));
+    ASSERT(alignment >= sizeof(W_));
+
+    ASSERT(!(align_off & (align_off - 1)));
+
+    const StgWord alignment_w = alignment / sizeof(W_);
+
     // If the request is for a large object, then allocate()
     // will give us a pinned object anyway.
     if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
-        p = allocateMightFail(cap, n);
+        // For large objects we don't bother optimizing the number of words
+        // allocated for alignment reasons. Here we just allocate the maximum
+        // number of extra words we could possibly need to satisfy the alignment
+        // constraint.
+        p = allocateMightFail(cap, n + alignment_w - 1);
         if (p == NULL) {
             return NULL;
         } else {
             Bdescr(p)->flags |= BF_PINNED;
+            W_ off_w = ALIGN_WITH_OFF_W(p, alignment, align_off);
+            MEMSET_IF_PROFILING_W(p, 0, off_w);
+            p += off_w;
+            MEMSET_IF_PROFILING_W(p + n, 0, alignment_w - off_w - 1);
             return p;
         }
     }
 
-    accountAllocation(cap, n);
     bd = cap->pinned_object_block;
 
+    W_ off_w = 0;
+
+    if(bd)
+        off_w = ALIGN_WITH_OFF_W(bd->free, alignment, align_off);
+
     // If we don't have a block of pinned objects yet, or the current
     // one isn't large enough to hold the new object, get a new one.
-    if (bd == NULL || (bd->free + n) > (bd->start + BLOCK_SIZE_W)) {
+    if (bd == NULL || (bd->free + off_w + n) > (bd->start + BLOCK_SIZE_W)) {
 
         // stash the old block on cap->pinned_object_blocks.  On the
         // next GC cycle these objects will be moved to
@@ -1158,10 +1246,20 @@ allocatePinned (Capability *cap, W_ n)
         // the next GC the BF_EVACUATED flag will be cleared, and the
         // block will be promoted as usual (if anything in it is
         // live).
+
+        off_w = ALIGN_WITH_OFF_W(bd->free, alignment, align_off);
     }
 
     p = bd->free;
+
+    MEMSET_IF_PROFILING_W(p, 0, off_w);
+
+    n += off_w;
+    p += off_w;
     bd->free += n;
+
+    accountAllocation(cap, n);
+
     return p;
 }
 


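As a quick check of the over-allocation bound used in the large-object
branch of allocatePinned above, here is a standalone sketch with made-up
numbers, assuming a 64-bit word:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WORD 8  /* assume a 64-bit word for the example */

    int main(void)
    {
        uintptr_t alignment   = 16;   /* bytes, as for newAlignedPinnedByteArray# */
        uintptr_t align_off   = 8;    /* e.g. the header size in bytes            */
        uintptr_t alignment_w = alignment / WORD;
        uintptr_t n           = 1000; /* requested payload words                  */

        /* Pretend allocateMightFail returned this word-aligned address. */
        uintptr_t p = 0x42000;

        /* Same formula as ALIGN_WITH_OFF_W, converted to words. */
        uintptr_t off_w = ((-p - align_off) & (alignment - 1)) / WORD;

        /* off_w is at most alignment_w - 1, so n + alignment_w - 1 words
         * always suffice, and the shifted payload is aligned as requested. */
        assert(off_w <= alignment_w - 1);
        assert(((p + off_w * WORD + align_off) & (alignment - 1)) == 0);
        printf("allocate %zu words to fit %zu requested (off_w = %zu)\n",
               (size_t)(n + alignment_w - 1), (size_t)n, (size_t)off_w);
        return 0;
    }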

View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/7b41f21bbfa9e266ba6654b08c3f9fec549c8bca...c3c0f662df06500a11970fd391d0a88e081a5296


