[Git][ghc/ghc][wip/T24150] rts: Allocate non-moving segments with megablocks

Teo Camarasu (@teo) gitlab at gitlab.haskell.org
Thu May 16 11:49:32 UTC 2024



Teo Camarasu pushed to branch wip/T24150 at Glasgow Haskell Compiler / GHC


Commits:
5e36e7ce by Teo Camarasu at 2024-05-16T12:49:15+01:00
rts: Allocate non-moving segments with megablocks

Non-moving segments are 8 blocks long and need to be aligned.
Previously we serviced allocations by grabbing 15 blocks, finding
an aligned 8 block group in it and returning the rest.
This proved to lead to high levels of fragmentation as a de-allocating a segment
caused an 8 block gap to form, and this could not be reused for allocation.

This patch introduces a segment allocator based around using entire
megablocks to service segment allocations in bulk.

When there are no free segments, we grab an entire megablock and fill it
with aligned segments. As the megablock is free, we can easily guarantee
alignment. Any unused segments are placed on a free list.

It only makes sense to free segments in bulk when all of the segments in
a megablock are freeable. After sweeping, we grab the free list, sort it,
and find all groups of segments where they cover the megablock and free
them.
This introduces a period of time when free segments are not available to
the mutator, but the risk that this would lead to excessive allocation
is low. Right after sweep, we should have an abundance of partially full
segments, and this pruning step is relatively quick.

In implementing this we drop the logic that kept NONMOVING_MAX_FREE
segments on the free list.

We also introduce an eventlog event to log the amount of pruned/retained
free segments.

See Note [Segment allocation strategy]

Resolves #24150

-------------------------
Metric Decrease:
    T13253
    T19695
-------------------------

- - - - -


17 changed files:

- docs/users_guide/9.12.1-notes.rst
- docs/users_guide/eventlog-formats.rst
- rts/RtsStartup.c
- rts/Trace.c
- rts/Trace.h
- rts/eventlog/EventLog.c
- rts/eventlog/EventLog.h
- rts/gen_event_types.py
- rts/include/rts/storage/Block.h
- rts/include/stg/SMP.h
- rts/sm/BlockAlloc.c
- rts/sm/GC.c
- rts/sm/NonMoving.c
- rts/sm/NonMoving.h
- rts/sm/NonMovingAllocate.c
- rts/sm/Sanity.c
- testsuite/tests/rts/atomicinc.c


Changes:

=====================================
docs/users_guide/9.12.1-notes.rst
=====================================
@@ -53,6 +53,8 @@ GHCi
 Runtime system
 ~~~~~~~~~~~~~~
 
+- Reduce fragmentation incurred by the nonmoving GC's segment allocator. In one application this reduced resident set size by 26%. See :ghc-ticket:`24150`.
+
 ``base`` library
 ~~~~~~~~~~~~~~~~
 


=====================================
docs/users_guide/eventlog-formats.rst
=====================================
@@ -855,6 +855,9 @@ A typical non-moving collection cycle will look something like the following:
 12. A :event-type:`NONMOVING_HEAP_CENSUS` event will be emitted describing the
     fragmentation state of the non-moving heap.
 
+13. A :event-type:`NONMOVING_PRUNED_SEGMENTS` event will be emitted showing
+    information about freeing of segments.
+
 
 .. event-type:: CONC_MARK_BEGIN
 
@@ -929,6 +932,17 @@ heap.
 
    Describes the occupancy of the *blk_sz* sub-heap.
 
+.. event-type:: NONMOVING_PRUNED_SEGMENTS
+
+   :tag: 208
+   :length: fixed
+   :field Word32: number of pruned segments.
+   :field Word32: number of segments remaining on the free list.
+
+   Report the amount of segments pruned and those remaining on the nonmoving
+   heap's segment free list. Segments will be retained on the free list until
+   the entire megablock containing them can be freed.
+
 .. _ticky-event-format:
 
 Ticky counters


=====================================
rts/RtsStartup.c
=====================================
@@ -456,7 +456,7 @@ hs_exit_(bool wait_foreign)
     uint32_t g, i;
 
     // N.B. atomic_dec returns the new value.
-    StgInt init_count = (StgInt)atomic_dec(&hs_init_count);
+    StgInt init_count = (StgInt)atomic_dec(&hs_init_count, 1);
     if (init_count > 0) {
         // ignore until it's the last one
         return;


=====================================
rts/Trace.c
=====================================
@@ -929,6 +929,12 @@ void traceNonmovingHeapCensus(uint16_t blk_size,
         postNonmovingHeapCensus(blk_size, census);
 }
 
+void traceNonmovingPrunedSegments(uint32_t pruned_segments, uint32_t free_segments)
+{
+    if (eventlog_enabled && TRACE_nonmoving_gc)
+        postNonmovingPrunedSegments(pruned_segments, free_segments);
+}
+
 void traceThreadStatus_ (StgTSO *tso USED_IF_DEBUG)
 {
 #if defined(DEBUG)


=====================================
rts/Trace.h
=====================================
@@ -332,6 +332,7 @@ void traceConcSweepEnd(void);
 void traceConcUpdRemSetFlush(Capability *cap);
 void traceNonmovingHeapCensus(uint16_t blk_size,
                               const struct NonmovingAllocCensus *census);
+void traceNonmovingPrunedSegments(uint32_t pruned_segments, uint32_t free_segments);
 
 void traceIPE(const InfoProvEnt *ipe);
 void flushTrace(void);


=====================================
rts/eventlog/EventLog.c
=====================================
@@ -1154,6 +1154,15 @@ void postNonmovingHeapCensus(uint16_t blk_size,
     RELEASE_LOCK(&eventBufMutex);
 }
 
+void postNonmovingPrunedSegments(uint32_t pruned_segments, uint32_t free_segments)
+{
+    ACQUIRE_LOCK(&eventBufMutex);
+    postEventHeader(&eventBuf, EVENT_NONMOVING_PRUNED_SEGMENTS);
+    postWord32(&eventBuf, pruned_segments);
+    postWord32(&eventBuf, free_segments);
+    RELEASE_LOCK(&eventBufMutex);
+}
+
 void closeBlockMarker (EventsBuf *ebuf)
 {
     if (ebuf->marker)


=====================================
rts/eventlog/EventLog.h
=====================================
@@ -196,6 +196,7 @@ void postConcUpdRemSetFlush(Capability *cap);
 void postConcMarkEnd(StgWord32 marked_obj_count);
 void postNonmovingHeapCensus(uint16_t blk_size,
                              const struct NonmovingAllocCensus *census);
+void postNonmovingPrunedSegments(uint32_t pruned_segments, uint32_t free_segments);
 
 #if defined(TICKY_TICKY)
 void postTickyCounterDefs(StgEntCounter *p);


=====================================
rts/gen_event_types.py
=====================================
@@ -132,6 +132,7 @@ event_types = [
     EventType(205, 'CONC_SWEEP_END',               [],                    'End concurrent sweep phase'),
     EventType(206, 'CONC_UPD_REM_SET_FLUSH',       [CapNo],               'Update remembered set flushed'),
     EventType(207, 'NONMOVING_HEAP_CENSUS',        [Word16, Word32, Word32, Word32], 'Nonmoving heap census'),
+    EventType(208, 'NONMOVING_PRUNED_SEGMENTS',    [Word32, Word32],      'Report the amount of segments pruned and remaining on the free list.'),
 
     # Ticky-ticky profiling
     EventType(210, 'TICKY_COUNTER_DEF',            VariableLength,        'Ticky-ticky entry counter definition'),


=====================================
rts/include/rts/storage/Block.h
=====================================
@@ -323,6 +323,10 @@ bdescr *allocGroupOnNode(uint32_t node, W_ n);
 //
 bdescr *allocAlignedGroupOnNode(uint32_t node, W_ n);
 
+// Allocate a MBlock worth of `n` block sized chunks aligned at `n`-block boundry.
+// This returns a linked list of `bdescr` of length `BLOCKS_PER_MBLOCK / n`.
+bdescr *allocMBlockAlignedGroupOnNode(uint32_t node, W_ n);
+
 EXTERN_INLINE bdescr* allocBlockOnNode(uint32_t node);
 EXTERN_INLINE bdescr* allocBlockOnNode(uint32_t node)
 {


=====================================
rts/include/stg/SMP.h
=====================================
@@ -82,13 +82,13 @@ EXTERN_INLINE StgWord cas_seq_cst_relaxed(StgVolatilePtr p, StgWord o, StgWord n
 EXTERN_INLINE StgWord atomic_inc(StgVolatilePtr p, StgWord n);
 
 /*
- * Atomic decrement
+ * Atomic subtraction by the provided quantity.
  *
- * atomic_dec(p) {
- *   return --(*p);
+ * atomic_dec(p, n) {
+ *   return ((*p) -= n);
  * }
  */
-EXTERN_INLINE StgWord atomic_dec(StgVolatilePtr p);
+EXTERN_INLINE StgWord atomic_dec(StgVolatilePtr p, StgWord n);
 
 /*
  * Busy-wait nop: this is a hint to the CPU that we are currently in a
@@ -480,9 +480,9 @@ atomic_inc(StgVolatilePtr p, StgWord incr)
 }
 
 EXTERN_INLINE StgWord
-atomic_dec(StgVolatilePtr p)
+atomic_dec(StgVolatilePtr p, StgWord decr)
 {
-    return __atomic_sub_fetch(p, 1, __ATOMIC_SEQ_CST);
+    return __atomic_sub_fetch(p, decr, __ATOMIC_SEQ_CST);
 }
 
 /*
@@ -634,9 +634,9 @@ atomic_inc(StgVolatilePtr p, StgWord incr)
 }
 
 INLINE_HEADER StgWord
-atomic_dec(StgVolatilePtr p)
+atomic_dec(StgVolatilePtr p, StgWord decr)
 {
-    return --(*p);
+    return ((*p) -= decr);
 }
 #endif
 


=====================================
rts/sm/BlockAlloc.c
=====================================
@@ -394,6 +394,53 @@ split_block_low (bdescr *bd, W_ n)
 }
 
 
+// A variant of `split_block_high` where we keep both blocks.
+// The returned block has size `n`, which is split off `bd`.
+static bdescr *
+split_block_high_no_free (bdescr *bd, W_ n)
+{
+    ASSERT(bd->blocks > n);
+
+    bdescr* ret = bd + bd->blocks - n; // take n blocks off the end
+    ret->blocks = n;
+    ret->start = ret->free = bd->start + (bd->blocks - n)*BLOCK_SIZE_W;
+    ret->link = NULL;
+
+    bd->blocks -= n;
+
+    setup_tail(ret);
+    setup_tail(bd);
+
+    return ret;
+}
+
+// Allocate a MBlock worth of `n` block sized chunks aligned at `n`-block boundry.
+// This returns a linked list of `bdescr` of length `BLOCKS_PER_MBLOCK / n`.
+// We assume relevant locks are held.
+bdescr *
+allocMBlockAlignedGroupOnNode(uint32_t node, W_ n)
+{
+    bdescr *bd = allocGroupOnNode(node, BLOCKS_PER_MBLOCK);
+
+    // Free unaligned blocks, as we can't use these.
+    ASSERT(bd->blocks == BLOCKS_PER_MBLOCK);
+    bd = split_block_high(bd, bd->blocks - bd->blocks % n);
+    ASSERT(bd->blocks % n == 0);
+
+    bdescr *last = NULL;
+    bdescr *chunk = NULL;
+    // Chain the aligned groups together into a linked-list
+    while (bd->blocks > n) {
+      chunk = split_block_high_no_free(bd, n);
+      chunk->link = last;
+      last = chunk;
+    }
+    bd->link = chunk;
+
+    return bd;
+}
+
+
 /* Find a fitting block for the allocation request in the given free list.
    Returns:
      - not NULL: when an exact match was found in the free list.


=====================================
rts/sm/GC.c
=====================================
@@ -1313,7 +1313,7 @@ dec_running (void)
     ACQUIRE_LOCK(&gc_running_mutex);
 #endif
 
-    StgWord r = atomic_dec(&gc_running_threads);
+    StgWord r = atomic_dec(&gc_running_threads, 1);
 
 #if defined(THREADED_RTS)
     if (r == 0) {


=====================================
rts/sm/NonMoving.c
=====================================
@@ -248,6 +248,8 @@ static void nonmovingBumpEpoch(void) {
  *
  *  - Note [Allocator sizes] goes into detail about our choice of allocator sizes.
  *
+ *  - Note [Segment allocation strategy] explains our segment allocation strategy.
+ *
  * [ueno 2016]:
  *   Katsuhiro Ueno and Atsushi Ohori. 2016. A fully concurrent garbage
  *   collector for functional programs on multicore processors. SIGPLAN Not. 51,
@@ -560,6 +562,22 @@ static void nonmovingBumpEpoch(void) {
  *
  * See #23340
  *
+ * Note [Segment allocation strategy]
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * Non-moving segments must be aligned. In order, to efficiently service these
+ * allocations, we allocate segments in bulk
+ * We allocate an entire megablocks worth of segments at once.
+ * All unused segments are placed on the `nonmovingHeap.free` list.
+ *
+ * Symmetrically we only de-allocate segments if all the segments in a megablock are free-able, ie,
+ * are on `nonmovingHeap.free`. We prune the free list in `nonmovingPruneFreeSegmentList`,
+ * called during concurrent sweep phase.
+ * Note that during pruning of the free list, free segments are not available for use by the
+ * mutator. This might lead to extra allocation of segments. But the risk is low as just after sweep
+ * there is usually a large amount of partially full segments, and pruning the free list is quite
+ * quick.
+ *
+ * See #24150
  */
 
 memcount nonmoving_segment_live_words = 0;
@@ -578,19 +596,6 @@ static void nonmovingExitConcurrentWorker(void);
 // Add a segment to the free list.
 void nonmovingPushFreeSegment(struct NonmovingSegment *seg)
 {
-    // See Note [Live data accounting in nonmoving collector].
-    if (RELAXED_LOAD(&nonmovingHeap.n_free) > NONMOVING_MAX_FREE) {
-        bdescr *bd = Bdescr((StgPtr) seg);
-        ACQUIRE_SM_LOCK;
-        ASSERT(oldest_gen->n_blocks >= bd->blocks);
-        ASSERT(oldest_gen->n_words >= BLOCK_SIZE_W * bd->blocks);
-        oldest_gen->n_blocks -= bd->blocks;
-        oldest_gen->n_words  -= BLOCK_SIZE_W * bd->blocks;
-        freeGroup(bd);
-        RELEASE_SM_LOCK;
-        return;
-    }
-
     SET_SEGMENT_STATE(seg, FREE);
     while (true) {
         struct NonmovingSegment *old = nonmovingHeap.free;
@@ -601,6 +606,110 @@ void nonmovingPushFreeSegment(struct NonmovingSegment *seg)
     __sync_add_and_fetch(&nonmovingHeap.n_free, 1);
 }
 
+static int
+cmp_segment_ptr (const void *x, const void *y)
+{
+    const struct NonMovingSegment *p1 = *(const struct NonMovingSegment**)x;
+    const struct NonMovingSegment *p2 = *(const struct NonMovingSegment**)y;
+    if (p1 > p2) return +1;
+    else if (p1 < p2) return -1;
+    else return 0;
+}
+
+// Prune the free list of segments that can be freed.
+// Segments can be freed if all segments from a mblock are on the free list.
+void nonmovingPruneFreeSegmentList(void)
+{
+  trace(TRACE_nonmoving_gc, "Pruning free segment list.");
+  // Atomically grab the entire free list.
+  struct NonmovingSegment *free;
+  size_t length;
+  while (true) {
+    free = ACQUIRE_LOAD(&nonmovingHeap.free);
+    length = ACQUIRE_LOAD(&nonmovingHeap.n_free);
+    if (cas((StgVolatilePtr) &nonmovingHeap.free,
+            (StgWord) free,
+            (StgWord) NULL) == (StgWord) free) {
+        atomic_dec((StgVolatilePtr) &nonmovingHeap.n_free, length);
+        break;
+    }
+    // Save the current free list so the sanity checker can see these segments.
+    nonmovingHeap.saved_free = free;
+  }
+
+  // Sort the free list by address.
+  struct NonmovingSegment **sorted = stgMallocBytes(sizeof(struct NonmovingSegment*) * length, "sorted free segment list");
+  for(size_t i = 0; i<length; i++) {
+    sorted[i] = free;
+    free = free->link;
+  }
+  // we should have reached the end of the free list
+  ASSERT(free == NULL);
+
+  qsort(sorted, length, sizeof(struct NonmovingSegment*), cmp_segment_ptr);
+
+  // Walk the sorted list and either:
+  // - free segments if the entire megablock is free
+  // - put it back on the free list
+  size_t new_length = 0;
+  size_t free_in_megablock = 0;
+  // iterate through segments by megablock
+  for(size_t i = 0; i<length; i+=free_in_megablock) {
+    // count of free segments in the current megablock
+    free_in_megablock = 1;
+    for(;i + free_in_megablock < length; free_in_megablock++) {
+      if (((W_)sorted[i] & ~MBLOCK_MASK) != ((W_)sorted[i + free_in_megablock] & ~MBLOCK_MASK))
+        break;
+    }
+    if (free_in_megablock < BLOCKS_PER_MBLOCK / NONMOVING_SEGMENT_BLOCKS) {
+      // the entire block isn't free so put it back on the list
+      for(size_t j = 0; j < free_in_megablock;j++){
+        struct NonmovingSegment *last = free;
+        free = sorted[i+j];
+        free->link = last;
+        new_length++;
+      }
+    } else {
+      // the megablock is free, so let's free all the segments.
+      ACQUIRE_SM_LOCK;
+      for(size_t j = 0; j < free_in_megablock;j++){
+        bdescr *bd = Bdescr((StgPtr)sorted[i+j]);
+        freeGroup(bd);
+      }
+      RELEASE_SM_LOCK;
+    }
+  }
+  stgFree(sorted);
+  // If we have any segments left over, then put them back on the free list.
+  if(free) {
+    struct NonmovingSegment* tail = free;
+    while(tail->link) {
+      tail = tail->link;
+    }
+    while (true) {
+      struct NonmovingSegment* rest = ACQUIRE_LOAD(&nonmovingHeap.free);
+      tail->link = rest;
+      if (cas((StgVolatilePtr) &nonmovingHeap.free,
+              (StgWord) rest,
+              (StgWord) free) == (StgWord) rest) {
+          __sync_add_and_fetch(&nonmovingHeap.n_free, new_length);
+          break;
+      }
+    }
+  }
+  size_t pruned_segments = length - new_length;
+  // See Note [Live data accounting in nonmoving collector].
+  oldest_gen->n_blocks -= pruned_segments * NONMOVING_SEGMENT_BLOCKS;
+  oldest_gen->n_words  -= pruned_segments * NONMOVING_SEGMENT_SIZE;
+  nonmovingHeap.saved_free = NULL;
+  debugTrace(DEBUG_nonmoving_gc,
+            "Pruned %d free segments, leaving %d on the free segment list.",
+            pruned_segments, new_length);
+  traceNonmovingPrunedSegments(pruned_segments, new_length);
+  trace(TRACE_nonmoving_gc, "Finished pruning free segment list.");
+
+}
+
 void nonmovingInitAllocator(struct NonmovingAllocator* alloc, uint16_t block_size)
 {
   *alloc = (struct NonmovingAllocator)
@@ -1216,6 +1325,7 @@ concurrent_marking:
     nonmovingSweepStableNameTable();
 
     nonmovingSweep();
+    nonmovingPruneFreeSegmentList();
     ASSERT(nonmovingHeap.sweep_list == NULL);
     debugTrace(DEBUG_nonmoving_gc, "Finished sweeping.");
     traceConcSweepEnd();


=====================================
rts/sm/NonMoving.h
=====================================
@@ -119,16 +119,16 @@ extern uint8_t nonmoving_alloca_dense_cnt;
 // NONMOVING_SEGMENT_SIZE (in bytes)
 extern uint8_t nonmoving_alloca_cnt;
 
-// maximum number of free segments to hold on to
-#define NONMOVING_MAX_FREE 16
-
 struct NonmovingHeap {
     struct NonmovingAllocator *allocators;
-    // free segment list. This is a cache where we keep up to
-    // NONMOVING_MAX_FREE segments to avoid thrashing the block allocator.
+    // free segment list. This is a cache where we keep segments
+    // belonging to megablocks that are only partially free.
     // Note that segments in this list are still counted towards
     // oldest_gen->n_blocks.
     struct NonmovingSegment *free;
+    // saved free segment list, so the sanity checker can
+    // see the segments while the free list is being pruned.
+    struct NonmovingSegment *saved_free;
     // how many segments in free segment list? accessed atomically.
     unsigned int n_free;
 
@@ -172,6 +172,7 @@ void nonmovingCollect(StgWeak **dead_weaks,
                       bool concurrent);
 
 void nonmovingPushFreeSegment(struct NonmovingSegment *seg);
+void nonmovingPruneFreeSegmentList(void);
 
 INLINE_HEADER unsigned long log2_ceil(unsigned long x)
 {


=====================================
rts/sm/NonMovingAllocate.c
=====================================
@@ -65,19 +65,42 @@ static struct NonmovingSegment *nonmovingAllocSegment(enum AllocLockMode mode, u
     struct NonmovingSegment *ret;
     ret = nonmovingPopFreeSegment();
 
-    // Nothing in the free list, allocate a new segment...
+    // Nothing in the free list, allocate a new segment.
+    // We allocate a full megablock, and add spare segments to our free list.
     if (ret == NULL) {
         acquire_alloc_lock(mode);
-        bdescr *bd = allocAlignedGroupOnNode(node, NONMOVING_SEGMENT_BLOCKS);
-        // See Note [Live data accounting in nonmoving collector].
-        oldest_gen->n_blocks += bd->blocks;
-        oldest_gen->n_words  += BLOCK_SIZE_W * bd->blocks;
+        // Another thread might have allocated while we were waiting for the lock.
+        ret = nonmovingPopFreeSegment();
+        if (ret != NULL) {
+          release_alloc_lock(mode);
+          // Check alignment
+          ASSERT(((uintptr_t)ret % NONMOVING_SEGMENT_SIZE) == 0);
+          return ret;
+        }
+
+        bdescr *bd = allocMBlockAlignedGroupOnNode(node, NONMOVING_SEGMENT_BLOCKS);
         release_alloc_lock(mode);
 
-        for (StgWord32 i = 0; i < bd->blocks; ++i) {
+        W_ alloc_blocks = BLOCKS_PER_MBLOCK - (BLOCKS_PER_MBLOCK % NONMOVING_SEGMENT_BLOCKS);
+
+        // See Note [Live data accounting in nonmoving collector].
+        oldest_gen->n_blocks += alloc_blocks;
+        oldest_gen->n_words  += BLOCK_SIZE_W * alloc_blocks;
+
+        for (StgWord32 i = 0; i < alloc_blocks; ++i) {
             initBdescr(&bd[i], oldest_gen, oldest_gen);
             bd[i].flags = BF_NONMOVING;
         }
+
+        // Push all but the last segment to the free segment list.
+        while(bd->link) {
+          bdescr *next_bd = bd->link;
+          bd->link = NULL;
+          nonmovingPushFreeSegment((struct NonmovingSegment *)bd->start);
+          bd = next_bd;
+        }
+
+        // Use the last segment to service the allocation.
         ret = (struct NonmovingSegment *)bd->start;
     }
 


=====================================
rts/sm/Sanity.c
=====================================
@@ -1238,6 +1238,7 @@ countNonMovingHeap(struct NonmovingHeap *heap)
     }
     ret += countNonMovingSegments(heap->sweep_list);
     ret += countNonMovingSegments(heap->free);
+    ret += countNonMovingSegments(heap->saved_free);
     return ret;
 }
 


=====================================
testsuite/tests/rts/atomicinc.c
=====================================
@@ -11,7 +11,7 @@ int main(int argc, char *argv[])
     CHECK(j == 1);
     CHECK(i == 1);
 
-    j = atomic_dec(&i);
+    j = atomic_dec(&i,1);
     CHECK(j == 0);
     CHECK(i == 0);
 



View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/commit/5e36e7ced6a8aa183762e1df97da20b460e19bcf

-- 
View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/commit/5e36e7ced6a8aa183762e1df97da20b460e19bcf
You're receiving this email because of your account on gitlab.haskell.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-commits/attachments/20240516/33679c09/attachment-0001.html>


More information about the ghc-commits mailing list