<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-3">

</head>

<body>

<div>

<div>

<div dir="ltr" style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">

Hey Ömer,</div>

<div dir="ltr" style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">

<br>

</div>

<div dir="ltr" style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">

Just in case you are wondering whether you are talking into the void: you aren't! Keep the good suggestions coming, someone will eventually be able to get around to implementing them!</div>

<div dir="ltr" style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">

Thanks for your write-ups.</div>

<div dir="ltr" style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">

<br>

</div>

<div dir="ltr" style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">

Cheers,</div>

<div dir="ltr" style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">

Sebastian</div>

</div>

<div id="ms-outlook-mobile-signature">

<div></div>

</div>

</div>

<hr style="display:inline-block;width:98%" tabindex="-1">

<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>Von:</b> ghc-devs <ghc-devs-bounces@haskell.org> im Auftrag von Ömer Sinan Aťacan <omeragacan@gmail.com><br>

<b>Gesendet:</b> Thursday, September 2, 2021 5:47:08 PM<br>

<b>An:</b> ghc-devs <ghc-devs@haskell.org><br>

<b>Betreff:</b> Re: Potential improvement in compacting GC</font>

<div> </div>

</div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">

<div class="PlainText">Here's another improvement that fixes a very old issue in GHC's compacting GC<br>

[1].<br>

<br>

To summarize the problem: when untreading an object we update references to the<br>

object that we've seen so far to the object's new location. But to get the<br>

object's new location we need to know the object's size, because depending on<br>

the size we may need to move the object to a new block (when the current block<br>

does not have enough space for the object).<br>

<br>

For this we currently walk the thread twice, once to get the info table (used<br>

to get the size), then again to update references to the object. Ideally we<br>

want to do just one pass when unthreading.<br>

<br>

The solution is explained in The GC Handbook, section 3.4. Instead of using one<br>

bit to mark an object, we use two bits: one for the first word of the object,<br>

one for the last word. Using two bits is not a problem in GHC because heap<br>

objects are at least 2 words. For example, an object with two words is marked<br>

with `11`, 3 words is marked with `101` and so on.<br>

<br>

Now we can scan the bitmap to find object size, and unthread it without having<br>

to find the info table first.<br>

<br>

Ömer<br>

<br>

[1]: <a href="https://github.com/ghc/ghc/blob/922c6bc8dd8d089cfe4b90ec2120cb48959ba2b5/rts/sm/Compact.c#L844-L848">

https://github.com/ghc/ghc/blob/922c6bc8dd8d089cfe4b90ec2120cb48959ba2b5/rts/sm/Compact.c#L844-L848</a><br>

<br>

Ömer Sinan Aťacan <omeragacan@gmail.com>, 14 Tem 2021 Çar, 09:27<br>

tarihinde şunu yazdš:<br>

><br>

> Two other ideas that should improve GHC's compacting GC much more<br>

> significantly. I've implemented both of these in another project and the<br>

> results were great. In a GC benchmark (mutator does almost no work other than<br>

> allocating data using a bump allocator), first one reduced Wasm instructions<br>

> executed by 14%, second one 19.8%.<br>

><br>

> Both of these ideas require pushing object headers to the mark stack with the<br>

> objects, which means larger mark stacks. This is the only downside of these<br>

> algorithms.<br>

><br>

> - Instead of marking and then threading in the next pass, mark phase threads<br>

>   all fields when pushing the fields to the mark stack. We still need two other<br>

>   passes: one to unthread headers, another to move the objects. (we can't do<br>

>   both in one pass, let me know if you're curious and I can elaborate)<br>

><br>

>   This has the same number of passes as the current implementation, but it only<br>

>   visits object fields once. Currently, we visit fields once when marking, to<br>

>   mark fields, then again in `update_fwd`. This implementation does both in one<br>

>   pass over the fields. `update_fwd` does not visit fields.<br>

><br>

>   This reduced Wasm instructions executed by 14% in my benchmark.<br>

><br>

> - Marking phase threads backwards pointers (ignores forwards pointers). Then we<br>

>   do one pass instead of two: for a marked object, unthread it (update<br>

>   forwards pointers to the object's new location), move it to its new location,<br>

>   then thread its forwards pointers. This completely eliminates one of the 3<br>

>   passes, but fields need to be visited twice as before (unlike the algorithm<br>

>   above).<br>

><br>

>   I think this one is originally described in "An Efficient Garbage Compaction<br>

>   Algorithm", but I found that paper to be difficult to follow. A short<br>

>   description of the same algorithm exists in "High-Performance Garbage<br>

>   Collection for Memory-Constrained Environments" in section 5.1.2.<br>

><br>

>   This reduced Wasm instructions executed by 19.8% in my benchmark.<br>

><br>

>   In this algorithm, fields that won't be moved can be threaded any time before<br>

>   the second pass (pointed objects need to be marked and pushed to the mark<br>

>   stack with headers before threading a field). For example, in GHC, mut list<br>

>   entries can be threaded before or after marking (but before the second pass)<br>

>   as IIRC mut lists are not moved. Same for fields of large objects.<br>

><br>

> As far as I can see, mark-compact GC is still the default when max heap size is<br>

> specified and the oldest generation size is (by default) more than 30% of the<br>

> max heap size. I'm not sure if max heap size is specified often (it's off by<br>

> default), so not sure what would be the impact of these improvements be, but if<br>

> anyone would be interested in funding me to implement these ideas (second<br>

> algorithm above, and the bitmap iteration in the previous email) I could try to<br>

> allocate one or two days a week to finish in a few months.<br>

><br>

> Normally these are simple changes, but it's difficult to test and debug GHC's<br>

> RTS as we don't have a test suite readily available and the code is not easily<br>

> testable. In my previous implementations of these algorithms I had unit tests<br>

> for the GC where I could easily generate arbitrary graphs (with cycles,<br>

> backwards ptrs, forwards ptrs, ptrs from/to roots etc.) and test GC in<br>

> isolation. Implementation of (2) took less than a day, and I didn't have to<br>

> debug it more once the tests passed. It's really unfortunate that GHC's RTS<br>

> makes this kind of thing difficult..<br>

><br>

> Ömer<br>

><br>

> Ömer Sinan Aťacan <omeragacan@gmail.com>, 7 Oca 2021 Per, 20:42<br>

> tarihinde şunu yazdš:<br>

> ><br>

> > Hello,<br>

> ><br>

> > I recently implemented the algorithm used by GHC's compacting GC in another<br>

> > project. The algorithm (after marking) makes two passes over the heap<br>

> > /generation. In GHC, these passes are implemented in [1] and in the next<br>

> > function.<br>

> ><br>

> > In my implementation I tried 3 ways of implementing these passes, one of which<br>

> > is the same as GHC's code, and benchmarked each version. I found that the<br>

> > fastest implementation is not what's used in GHC, but it could easily be used.<br>

> ><br>

> > I should mention that my code targets Wasm, and I benchmarked Wasm instructions<br>

> > executed. Not CPU cycles, CPU instructions, or anything like that. It's very<br>

> > likely that the results will be different when benchmarking code running on<br>

> > actual hardware.<br>

> ><br>

> > Secondly, in my case the heap is mostly dead (residency is low). In GHC,<br>

> > compaction for the oldest generation is enabled when residency passes a<br>

> > threshold, so the assumption is the heap is mostly live. I'm guessing this<br>

> > should also make some difference.<br>

> ><br>

> > Anyway, the first implementation I tried was similar to GHC's scan, but I did<br>

> ><br>

> >     scan += object_size(scan);<br>

> ><br>

> > instead of bumping scan by one, as GHC does in [2]. This was the slowest<br>

> > version.<br>

> ><br>

> > Second implementation did the same as GHC (bumped scan by one). This was<br>

> > faster, but still slower than the next version.<br>

> ><br>

> > What I found to be the best is scanning the bitmap, not the heap. The bitmap<br>

> > iterator reads one word at a time. In each iteration it checks if the bitmap<br>

> > word is 0. In GHC, in the best case this can skip 16 words on heap on 32-bit<br>

> > systems, and 32 words on 64-bit. Reminder: we need two bits per object in the<br>

> > bitmap, see [3]. (this is not the case in my implementation so the payoff is<br>

> > better)<br>

> ><br>

> > When the bitmap word is not 0 I use "count trailing zeros" (or "count leading<br>

> > zeros" depending on the bitmap implementation) to get the number of words to<br>

> > skip. This is a single instruction on Wasm and x86 (TZCNT or LZCNT, available<br>

> > via __builtin_ctzl and __builtin_clzl in gcc).<br>

> ><br>

> > So instead of skipping one word at a time, this can potentially skip 16 words<br>

> > (or 32 on 64-bit architectures). When that's not possible, it can still skip<br>

> > multiple words by using ctz/clz.<br>

> ><br>

> > Ömer<br>

> ><br>

> > [1]: <a href="https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.c#L824-L879">

https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.c#L824-L879</a><br>

> > [2]: <a href="https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.c#L838">

https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.c#L838</a><br>

> > [3]: <a href="https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.h#L18-L55">

https://github.com/ghc/ghc/blob/bd877edd9499a351db947cd51ed583872b2facdf/rts/sm/Compact.h#L18-L55</a><br>

_______________________________________________<br>

ghc-devs mailing list<br>

ghc-devs@haskell.org<br>

<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs">http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs</a><br>

</div>

</span></font></div>

</body>

</html>