Potential improvement in compacting GC

Wed Jul 14 06:27:11 UTC 2021

Two other ideas that should improve GHC's compacting GC much more
significantly. I've implemented both of these in another project and the
results were great. In a GC benchmark (mutator does almost no work other than
allocating data using a bump allocator), first one reduced Wasm instructions
executed by 14%, second one 19.8%.

Both of these ideas require pushing object headers to the mark stack with the
objects, which means larger mark stacks. This is the only downside of these
algorithms.

- Instead of marking and then threading in the next pass, mark phase threads
  all fields when pushing the fields to the mark stack. We still need two other
  passes: one to unthread headers, another to move the objects. (we can't do
  both in one pass, let me know if you're curious and I can elaborate)

  This has the same number of passes as the current implementation, but it only
  visits object fields once. Currently, we visit fields once when marking, to
  mark fields, then again in `update_fwd`. This implementation does both in one
  pass over the fields. `update_fwd` does not visit fields.

  This reduced Wasm instructions executed by 14% in my benchmark.

- Marking phase threads backwards pointers (ignores forwards pointers). Then we
  do one pass instead of two: for a marked object, unthread it (update
  forwards pointers to the object's new location), move it to its new location,
  then thread its forwards pointers. This completely eliminates one of the 3
  passes, but fields need to be visited twice as before (unlike the algorithm
  above).

  I think this one is originally described in "An Efficient Garbage Compaction
  Algorithm", but I found that paper to be difficult to follow. A short
  description of the same algorithm exists in "High-Performance Garbage
  Collection for Memory-Constrained Environments" in section 5.1.2.

  This reduced Wasm instructions executed by 19.8% in my benchmark.

  In this algorithm, fields that won't be moved can be threaded any time before
  the second pass (pointed objects need to be marked and pushed to the mark
  stack with headers before threading a field). For example, in GHC, mut list
  entries can be threaded before or after marking (but before the second pass)
  as IIRC mut lists are not moved. Same for fields of large objects.

As far as I can see, mark-compact GC is still the default when max heap size is
specified and the oldest generation size is (by default) more than 30% of the
max heap size. I'm not sure if max heap size is specified often (it's off by
default), so not sure what would be the impact of these improvements be, but if
anyone would be interested in funding me to implement these ideas (second
algorithm above, and the bitmap iteration in the previous email) I could try to
allocate one or two days a week to finish in a few months.

Normally these are simple changes, but it's difficult to test and debug GHC's
RTS as we don't have a test suite readily available and the code is not easily
testable. In my previous implementations of these algorithms I had unit tests
for the GC where I could easily generate arbitrary graphs (with cycles,
backwards ptrs, forwards ptrs, ptrs from/to roots etc.) and test GC in
isolation. Implementation of (2) took less than a day, and I didn't have to
debug it more once the tests passed. It's really unfortunate that GHC's RTS
makes this kind of thing difficult..

Ömer

Ömer Sinan Ağacan <omeragacan at gmail.com>, 7 Oca 2021 Per, 20:42
tarihinde şunu yazdı:
>
> Hello,
>
> I recently implemented the algorithm used by GHC's compacting GC in another
> project. The algorithm (after marking) makes two passes over the heap
> /generation. In GHC, these passes are implemented in [1] and in the next
> function.
>
> In my implementation I tried 3 ways of implementing these passes, one of which
> is the same as GHC's code, and benchmarked each version. I found that the
> fastest implementation is not what's used in GHC, but it could easily be used.
>
> I should mention that my code targets Wasm, and I benchmarked Wasm instructions
> executed. Not CPU cycles, CPU instructions, or anything like that. It's very
> likely that the results will be different when benchmarking code running on
> actual hardware.
>
> Secondly, in my case the heap is mostly dead (residency is low). In GHC,
> compaction for the oldest generation is enabled when residency passes a
> threshold, so the assumption is the heap is mostly live. I'm guessing this
> should also make some difference.
>
> Anyway, the first implementation I tried was similar to GHC's scan, but I did
>
>     scan += object_size(scan);
>
> instead of bumping scan by one, as GHC does in [2]. This was the slowest
> version.
>
> Second implementation did the same as GHC (bumped scan by one). This was
> faster, but still slower than the next version.
>
> What I found to be the best is scanning the bitmap, not the heap. The bitmap
> iterator reads one word at a time. In each iteration it checks if the bitmap
> word is 0. In GHC, in the best case this can skip 16 words on heap on 32-bit
> systems, and 32 words on 64-bit. Reminder: we need two bits per object in the
> bitmap, see [3]. (this is not the case in my implementation so the payoff is
> better)
>
> When the bitmap word is not 0 I use "count trailing zeros" (or "count leading
> zeros" depending on the bitmap implementation) to get the number of words to
> skip. This is a single instruction on Wasm and x86 (TZCNT or LZCNT, available
> via __builtin_ctzl and __builtin_clzl in gcc).
>
> So instead of skipping one word at a time, this can potentially skip 16 words
> (or 32 on 64-bit architectures). When that's not possible, it can still skip
> multiple words by using ctz/clz.
>
> Ömer
>
> [1]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.c#L824-L879
> [2]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.c#L838
> [3]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.h#L18-L55