Potential improvement in compacting GC

Thu Jan 7 17:42:08 UTC 2021

Hello,

I recently implemented the algorithm used by GHC's compacting GC in another
project. The algorithm (after marking) makes two passes over the heap
/generation. In GHC, these passes are implemented in [1] and in the next
function.

In my implementation I tried 3 ways of implementing these passes, one of which
is the same as GHC's code, and benchmarked each version. I found that the
fastest implementation is not what's used in GHC, but it could easily be used.

I should mention that my code targets Wasm, and I benchmarked Wasm instructions
executed. Not CPU cycles, CPU instructions, or anything like that. It's very
likely that the results will be different when benchmarking code running on
actual hardware.

Secondly, in my case the heap is mostly dead (residency is low). In GHC,
compaction for the oldest generation is enabled when residency passes a
threshold, so the assumption is the heap is mostly live. I'm guessing this
should also make some difference.

Anyway, the first implementation I tried was similar to GHC's scan, but I did

    scan += object_size(scan);

instead of bumping scan by one, as GHC does in [2]. This was the slowest
version.

Second implementation did the same as GHC (bumped scan by one). This was
faster, but still slower than the next version.

What I found to be the best is scanning the bitmap, not the heap. The bitmap
iterator reads one word at a time. In each iteration it checks if the bitmap
word is 0. In GHC, in the best case this can skip 16 words on heap on 32-bit
systems, and 32 words on 64-bit. Reminder: we need two bits per object in the
bitmap, see [3]. (this is not the case in my implementation so the payoff is
better)

When the bitmap word is not 0 I use "count trailing zeros" (or "count leading
zeros" depending on the bitmap implementation) to get the number of words to
skip. This is a single instruction on Wasm and x86 (TZCNT or LZCNT, available
via __builtin_ctzl and __builtin_clzl in gcc).

So instead of skipping one word at a time, this can potentially skip 16 words
(or 32 on 64-bit architectures). When that's not possible, it can still skip
multiple words by using ctz/clz.

Ömer

[1]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.c#L824-L879
[2]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.c#L838
[3]: https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.h#L18-L55