Transparently hooking into the STG stack to validate an escape analysis

Wed Dec 8 10:19:33 UTC 2021

Hi Devs,

my master's student Sebastian and I (also Sebastian :)) are working on 
an escape analysis in STG, see 
https://gitlab.haskell.org/ghc/ghc/-/issues/16891#note_347903.

We have a prototype for the escape analysis that we want to 
validate/exploit now.
The original plan was to write the transformation that allocates 
non-escaping objects on the STG stack. But that is quite tricky for many 
reasons, one of them being treatment of the GC.

This mail is rather lengthy, so I suggest you skip to "What we hope 
could work" and see if you can answer it without the context I provide 
below. If you can't, I would be very grateful if you were willing to 
suffer through the exposition.

# Instrumentation

So instead we thought about doing a (easily changed and thus versatile) 
instrumentation-based approach:
Assign a sequence number to every instantiation (a term I will use to 
mean "allocation of g's closure") of things that we our analysis 
determines as escaping or non-escaping, such as STG's let bindings 
(focusing on non-let-no-escape functions for now).
(One sequence number *per allocation* of the let binding's closure, not 
based on its syntactic identity.)
Then, we set a bit in a (dynamically growing) global bit vector whenever 
the let "RHS is entered" and then unset it when we "leave the let-body". 
Example:

f = \x y ->
   let {
     g = [y] \z -> y + z;
   } in g x

Here, our analysis could see that no instantiation (which I say instead 
of "allocation of g's closure") of g will ever escape its scope within 
f.
Our validation would give a fresh sequence number to the instantiation 
of g whenever f is called and store it in g's closure (which we arrange 
by piggy-backing on -prof and adding an additional field to the 
profiling header).
Then, when g's RHS is entered, we set the bit in the global bit vector, 
indicating "this instantiation of g might escape".
After leaving the RHS of g, we also leave the body of the defining let, 
which means we unset the bit in the bit vector, meaning "every use so 
far wasn't in an escaping scenario".

So far so good. Modifying the code upon entering g takes a bit of 
tinkering but can be done by building on TickyTicky in StgToCmm.
But what is not done so easily is inserting the continuation upon 
entering the let that will unset the bit!

# What doesn't work: Modifying the Sequel

At first, we tried to modify the sequel 
<https://gitlab.haskell.org/ghc/ghc/-/blob/4c6985cc91b9cf89b393f69c9a603e8df0aea9e0/compiler/GHC/StgToCmm/Monad.hs#L210-222> 
of the let-body to an `AssignTo`.
That requires us to know the registers in which the let-body will return 
its results, which in turn means we have to know the representation of 
those results, so we have to write a function `stgExprPrimRep :: 
GenStgExpr p -> [PrimRep]`.
Urgh! We were very surprised that there was no such function. And while 
we tested our function, we very soon knew why. Consider the following 
pattern synonym matcher:

GHC.Natural.$mNatJ#
   :: forall {rep :: GHC.Types.RuntimeRep} {r :: TYPE rep}.
      GHC.Num.Natural.Natural
      -> (GHC.Num.BigNat.BigNat -> r) -> ((# #) -> r) -> r
  = {} \r [scrut_sBE cont_sBF fail_sBG]
         case scrut_sBE of {
           GHC.Num.Natural.NS _ -> fail_sBG GHC.Prim.(##);
           GHC.Num.Natural.NB ds_sBJ ->
               let {
                 sat_sBK :: GHC.Num.BigNat.BigNat
                  = CCCS GHC.Num.BigNat.BN#! [ds_sBJ];
               } in  cont_sBF sat_sBK;
         };

Note how its result is representation-polymorphic! It only works because 
our machine implementation allows tail-calls.
It's obvious in hindsight that we could never write `stgExprPrimRep` in 
such a way that it will work on the expression `cont_sBF sat_sBK`.
So the sequel approach seems infeasible.

# What we hope could work: A special stack frame

The other alternative would be to insert a special continuation frame on 
the stack when we enter the let-body (inspired by stg_restore_cccs).
This continuation frame would simply push all registers (FP regs, GP 
regs, Vec regs, ...) to the C stack, do its work (unsetting the bit), 
then pop all registers again and jump to the topmost continuation on the 
STG stack.
Example:

f :: forall rep (r :: TYPE rep). Int# -> (Int# -> r) -> r
f = \x g ->
   let {
     h = [x] \a -> x + a;
   } in
   case h x of b {
     __DEFAULT -> g b
   }

We are only interested in unsetting the bit for h here. Consider the 
stack when entering the body of h.

caller_of_f_cont_info <- Sp

Now push our special continuation frame:

caller_of_f_cont_info
seq_h
unset_bit_stk_info <- Sp

E.g., the stack frame contains the info pointer and the sequence number. 
(Btw., I hope I got the stack layout about right and this is even 
possible)
Then, after we entered the continuation of the __DEFAULT alt, we do a 
jump to g.
Plot twist: g returns an unboxed 8-tuple of `Int#`s (as 
caller_of_f_cont_info knows, but f certainly doesn't!), so before it 
returns it will push two args on the stack (correct?):

caller_of_f_cont_info
seq_h
unset_bit_stk_info
unboxed tuple component 7
unboxed tuple component 8 <- Sp

And then `g` jumps directly to the entry code for `unset_bit_stk_info` 
(which does the register saving I mentioned), which absolutely can't 
figure out from Sp alone where seq_h is.
Darn! I think Luite's recent work on the StgToByteCode had to work 
around similar issues, I found this wiki page 
<https://gitlab.haskell.org/ghc/ghc-wiki-mirror/-/blob/master/Unboxed-Tuples-and-Sums-in-GHCi-bytecode.md#returning-a-tuple>. 
But we aren't in a position where we know the representation of `r` *at 
all*!

So our idea was to scan the stack, beginning from `Sp`, until I find 
`unset_bit_stk_info`, giving us the following situation:

caller_of_f_cont_info
seq_h
unset_bit_stk_info <- Bp
unboxed tuple component 7
unboxed tuple component 8 <- Sp

I suggestively named the register in which we store the result Bp in 
analogy to the traditional base pointer. This information would be 
enough to unset the bit at index seq_h
and then copy the unboxed tuple components 7 and 8 up by two words:

caller_of_f_cont_info
unboxed tuple component 7
unboxed tuple component 8 <- Sp

Then jump to caller_of_f_cont_info, which knows what to make of the 
stack and the register state.

The stack scanning is horrible and too brittle and slow for a production 
setting, as any of the unboxed tuple components could have the same bit 
pattern as `unset_bit_stk_info`.
We simply hope that is never the case and it's fine for the purposes of 
a quick-and-dirty instrumentation.

QUESTION 1: What do you think? Could this work? Can you anticipate pit 
falls of this approach?

# What about Thunks and StgRhsCon?

QUESTION 2: The instrumentation as described won't work for Thunks 
(which are only entered once) and constructor applications (like sat_sBK 
in the BigNat matcher above). Not sure how to change that without 
incurring huge performance hits (by deactivating memoisation). Ideas are 
welcome here.

Thanks for reading this far. I hope you could follow and are now fizzing 
with excitement because you have a much better idea of how to do this 
and will let me know :)

Sebastian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20211208/b72cfde2/attachment.html>