Transparently hooking into the STG stack to validate an escape analysis

Fri Dec 17 09:24:38 UTC 2021

(Moving the conversation off the list.)

------ Originalnachricht ------
Von: "Csaba Hruska" <csaba.hruska at gmail.com>
An: "Sebastian Graf" <sgraf1337 at gmail.com>
Cc: "ghc-devs" <ghc-devs at haskell.org>; "Sebastian Scheper" 
<sebastian.scheper at student.kit.edu>
Gesendet: 16.12.2021 18:22:30
Betreff: Re: Re[2]: Transparently hooking into the STG stack to validate 
an escape analysis

>Thanks for the feedback!
>Let's have a video meeting. My schedule is flexible. What time is best 
>for you?
>
>Cheers,
>Csaba
>
>On Thu, Dec 16, 2021 at 5:29 PM Sebastian Graf <sgraf1337 at gmail.com> 
>wrote:
>>Hey Csaba,
>>
>>After catching up on your talk and reflecting about it a bit, it seems 
>>quite obvious that your tool is the right way to collect the data and 
>>validate our analysis!
>>Even if meanwhile we decided that a "transparent stack frame" (which I 
>>believe is something similar to what you are doing here 
>><https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3ed07d9f3167ad1afa01d1aa95aec2472b2708f/external-stg-interpreter/lib/Stg/Interpreter.hs#L130>, 
>>with an explicit `argCount` which we do not know) is not the ideal 
>>solution we've been looking for (for different reasons).
>>
>>Essentially, we have two things
>>An escape analysis, implemented as an STG-to-STG pass that attaches a 
>>boolean flag to each Id whether it "escapes its scope" (for a suitable 
>>definition of that).
>>We'd like to implement it in a way that it would be reusable within 
>>GHC with moderate effort (e.g., renaming Binder to Id or accounting 
>>for different fields), operating on a module at a time rather than 
>>whole program.The instrumentation that tries to measure how many heap 
>>objects could be allocated on the stack. E.g., set a closure-specific 
>>flag whenever the closure is entered, unset that bit (once) when we 
>>"leave the scope" that defines the closure.
>>If my understanding is right, we could just implement this 
>>"instrumentation" as a simple extra field to Closure 
>><https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3ed07d9f3167ad1afa01d1aa95aec2472b2708f/external-stg-interpreter/lib/Stg/Interpreter/Base.hs#L135>, 
>>right? Neat!
>>A bit tangential: I see that your interpreter currently allocates a 
>>fresh closure for let-no-escapes 
>><https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3ed07d9f3167ad1afa01d1aa95aec2472b2708f/external-stg-interpreter/lib/Stg/Interpreter.hs#L606> 
>>when it could just re-use the closure of its defining scope. That 
>>would skew our numbers somewhat compared to instrumenting GHC-compiled 
>>programs, but I think we'd be able to work around that. I also wonder 
>>if the semantics of your let-no-escapes are actually as typically 
>>specified (operationally), in that a jump to a let-no-escape should 
>>also reset the stack pointer. It should hardly matter for the programs 
>>that GHC generates, though.
>>
>>I would also be interested in knowing whether the +RTS -s "bytes 
>>allocated in the heap" metric (very nearly) coincides with a similar 
>>metric you could produce. It would be fantastic if that was the case! 
>>Theoretically, that should be possible, right?
>>
>>I think your interpreter work is very valuable to collect data we 
>>otherwise would only be able to measure with a TickyTicky-based 
>>approach. Nice!
>>Another, similar use case would be to identify the fraction of 
>>closures that are only entered once. I remember that there was a 
>>ticky-based patch with which Joachim used to measure this fraction 
>><https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/compiler/demand#instrumentation> 
>>(and similarly validate the analysis results), but unfortunately it 
>>couldn't end up in master. Ah, yes, we have a whole open ticket about 
>>it: #10613 <https://gitlab.haskell.org/ghc/ghc/-/issues/10613>. In 
>>fact, that instrumentation is also somewhat similar (adding a field to 
>>every closure) as what we want to do.
>>
>>Anyway, it seems like your work will be very valuable in replacing 
>>some of the annoying ticky-based instrumentation ideas!
>>
>>Maybe we can have a call some time this or next week to discuss 
>>details, once Sebastian and I are more familiar with the code base?
>>
>>Thanks for sticking with the project and doing all the hard work that 
>>can build upon!
>>Sebastian
>>
>>------ Originalnachricht ------
>>Von: "Csaba Hruska" <csaba.hruska at gmail.com>
>>An: "Sebastian Graf" <sgraf1337 at gmail.com>
>>Cc: "ghc-devs" <ghc-devs at haskell.org>; "Sebastian Scheper" 
>><sebastian.scheper at student.kit.edu>
>>Gesendet: 15.12.2021 16:16:27
>>Betreff: Re: Transparently hooking into the STG stack to validate an 
>>escape analysis
>>
>>>Hi,
>>>
>>>IMO the Cmm STG machine implementation is just too complex for 
>>>student projects. It's not fun to work with at all.
>>>Why did you choose this approach?
>>>IMO the escape analysis development and validation would be much 
>>>smoother and fun when you'd use the external STG interpreter.
>>>When you have a solid and working design of your analysis and 
>>>transformations then you could implement it in GHC's native backend 
>>>if it needs any changes at all.
>>>
>>>What do you think?
>>>Do you disagree?
>>>
>>>Have you seen my presentation about the stg interpreter?
>>>https://www.youtube.com/watch?v=Ey5OFPkxF_w
>>>
>>>Cheers,
>>>Csaba
>>>
>>>On Wed, Dec 8, 2021 at 11:20 AM Sebastian Graf <sgraf1337 at gmail.com> 
>>>wrote:
>>>>Hi Devs,
>>>>
>>>>my master's student Sebastian and I (also Sebastian :)) are working 
>>>>on an escape analysis in STG, see 
>>>>https://gitlab.haskell.org/ghc/ghc/-/issues/16891#note_347903.
>>>>
>>>>We have a prototype for the escape analysis that we want to 
>>>>validate/exploit now.
>>>>The original plan was to write the transformation that allocates 
>>>>non-escaping objects on the STG stack. But that is quite tricky for 
>>>>many reasons, one of them being treatment of the GC.
>>>>
>>>>This mail is rather lengthy, so I suggest you skip to "What we hope 
>>>>could work" and see if you can answer it without the context I 
>>>>provide below. If you can't, I would be very grateful if you were 
>>>>willing to suffer through the exposition.
>>>>
>>>># Instrumentation
>>>>
>>>>So instead we thought about doing a (easily changed and thus 
>>>>versatile) instrumentation-based approach:
>>>>Assign a sequence number to every instantiation (a term I will use 
>>>>to mean "allocation of g's closure") of things that we our analysis 
>>>>determines as escaping or non-escaping, such as STG's let bindings 
>>>>(focusing on non-let-no-escape functions for now).
>>>>(One sequence number *per allocation* of the let binding's closure, 
>>>>not based on its syntactic identity.)
>>>>Then, we set a bit in a (dynamically growing) global bit vector 
>>>>whenever the let "RHS is entered" and then unset it when we "leave 
>>>>the let-body". Example:
>>>>
>>>>f = \x y ->
>>>>   let {
>>>>     g = [y] \z -> y + z;
>>>>   } in g x
>>>>
>>>>Here, our analysis could see that no instantiation (which I say 
>>>>instead of "allocation of g's closure") of g will ever escape its 
>>>>scope within f.
>>>>Our validation would give a fresh sequence number to the 
>>>>instantiation of g whenever f is called and store it in g's closure 
>>>>(which we arrange by piggy-backing on -prof and adding an additional 
>>>>field to the profiling header).
>>>>Then, when g's RHS is entered, we set the bit in the global bit 
>>>>vector, indicating "this instantiation of g might escape".
>>>>After leaving the RHS of g, we also leave the body of the defining 
>>>>let, which means we unset the bit in the bit vector, meaning "every 
>>>>use so far wasn't in an escaping scenario".
>>>>
>>>>So far so good. Modifying the code upon entering g takes a bit of 
>>>>tinkering but can be done by building on TickyTicky in StgToCmm.
>>>>But what is not done so easily is inserting the continuation upon 
>>>>entering the let that will unset the bit!
>>>>
>>>># What doesn't work: Modifying the Sequel
>>>>
>>>>At first, we tried to modify the sequel 
>>>><https://gitlab.haskell.org/ghc/ghc/-/blob/4c6985cc91b9cf89b393f69c9a603e8df0aea9e0/compiler/GHC/StgToCmm/Monad.hs#L210-222> 
>>>>of the let-body to an `AssignTo`.
>>>>That requires us to know the registers in which the let-body will 
>>>>return its results, which in turn means we have to know the 
>>>>representation of those results, so we have to write a function 
>>>>`stgExprPrimRep :: GenStgExpr p -> [PrimRep]`.
>>>>Urgh! We were very surprised that there was no such function. And 
>>>>while we tested our function, we very soon knew why. Consider the 
>>>>following pattern synonym matcher:
>>>>
>>>>GHC.Natural.$mNatJ#
>>>>   :: forall {rep :: GHC.Types.RuntimeRep} {r :: TYPE rep}.
>>>>      GHC.Num.Natural.Natural
>>>>      -> (GHC.Num.BigNat.BigNat -> r) -> ((# #) -> r) -> r
>>>>  = {} \r [scrut_sBE cont_sBF fail_sBG]
>>>>         case scrut_sBE of {
>>>>           GHC.Num.Natural.NS _ -> fail_sBG GHC.Prim.(##);
>>>>           GHC.Num.Natural.NB ds_sBJ ->
>>>>               let {
>>>>                 sat_sBK :: GHC.Num.BigNat.BigNat
>>>>                  = CCCS GHC.Num.BigNat.BN#! [ds_sBJ];
>>>>               } in  cont_sBF sat_sBK;
>>>>         };
>>>>
>>>>Note how its result is representation-polymorphic! It only works 
>>>>because our machine implementation allows tail-calls.
>>>>It's obvious in hindsight that we could never write `stgExprPrimRep` 
>>>>in such a way that it will work on the expression `cont_sBF 
>>>>sat_sBK`.
>>>>So the sequel approach seems infeasible.
>>>>
>>>># What we hope could work: A special stack frame
>>>>
>>>>The other alternative would be to insert a special continuation 
>>>>frame on the stack when we enter the let-body (inspired by 
>>>>stg_restore_cccs).
>>>>This continuation frame would simply push all registers (FP regs, GP 
>>>>regs, Vec regs, ...) to the C stack, do its work (unsetting the 
>>>>bit), then pop all registers again and jump to the topmost 
>>>>continuation on the STG stack.
>>>>Example:
>>>>
>>>>f :: forall rep (r :: TYPE rep). Int# -> (Int# -> r) -> r
>>>>f = \x g ->
>>>>   let {
>>>>     h = [x] \a -> x + a;
>>>>   } in
>>>>   case h x of b {
>>>>     __DEFAULT -> g b
>>>>   }
>>>>
>>>>We are only interested in unsetting the bit for h here. Consider the 
>>>>stack when entering the body of h.
>>>>
>>>>caller_of_f_cont_info <- Sp
>>>>
>>>>Now push our special continuation frame:
>>>>
>>>>caller_of_f_cont_info
>>>>seq_h
>>>>unset_bit_stk_info <- Sp
>>>>
>>>>E.g., the stack frame contains the info pointer and the sequence 
>>>>number. (Btw., I hope I got the stack layout about right and this is 
>>>>even possible)
>>>>Then, after we entered the continuation of the __DEFAULT alt, we do 
>>>>a jump to g.
>>>>Plot twist: g returns an unboxed 8-tuple of `Int#`s (as 
>>>>caller_of_f_cont_info knows, but f certainly doesn't!), so before it 
>>>>returns it will push two args on the stack (correct?):
>>>>
>>>>caller_of_f_cont_info
>>>>seq_h
>>>>unset_bit_stk_info
>>>>unboxed tuple component 7
>>>>unboxed tuple component 8 <- Sp
>>>>
>>>>And then `g` jumps directly to the entry code for 
>>>>`unset_bit_stk_info` (which does the register saving I mentioned), 
>>>>which absolutely can't figure out from Sp alone where seq_h is.
>>>>Darn! I think Luite's recent work on the StgToByteCode had to work 
>>>>around similar issues, I found this wiki page 
>>>><https://gitlab.haskell.org/ghc/ghc-wiki-mirror/-/blob/master/Unboxed-Tuples-and-Sums-in-GHCi-bytecode.md#returning-a-tuple>. 
>>>>But we aren't in a position where we know the representation of `r` 
>>>>*at all*!
>>>>
>>>>So our idea was to scan the stack, beginning from `Sp`, until I find 
>>>>`unset_bit_stk_info`, giving us the following situation:
>>>>
>>>>caller_of_f_cont_info
>>>>seq_h
>>>>unset_bit_stk_info <- Bp
>>>>unboxed tuple component 7
>>>>unboxed tuple component 8 <- Sp
>>>>
>>>>I suggestively named the register in which we store the result Bp in 
>>>>analogy to the traditional base pointer. This information would be 
>>>>enough to unset the bit at index seq_h
>>>>and then copy the unboxed tuple components 7 and 8 up by two words:
>>>>
>>>>caller_of_f_cont_info
>>>>unboxed tuple component 7
>>>>unboxed tuple component 8 <- Sp
>>>>
>>>>Then jump to caller_of_f_cont_info, which knows what to make of the 
>>>>stack and the register state.
>>>>
>>>>The stack scanning is horrible and too brittle and slow for a 
>>>>production setting, as any of the unboxed tuple components could 
>>>>have the same bit pattern as `unset_bit_stk_info`.
>>>>We simply hope that is never the case and it's fine for the purposes 
>>>>of a quick-and-dirty instrumentation.
>>>>
>>>>QUESTION 1: What do you think? Could this work? Can you anticipate 
>>>>pit falls of this approach?
>>>>
>>>># What about Thunks and StgRhsCon?
>>>>
>>>>QUESTION 2: The instrumentation as described won't work for Thunks 
>>>>(which are only entered once) and constructor applications (like 
>>>>sat_sBK in the BigNat matcher above). Not sure how to change that 
>>>>without incurring huge performance hits (by deactivating 
>>>>memoisation). Ideas are welcome here.
>>>>
>>>>Thanks for reading this far. I hope you could follow and are now 
>>>>fizzing with excitement because you have a much better idea of how 
>>>>to do this and will let me know :)
>>>>
>>>>Sebastian
>>>>_______________________________________________
>>>>ghc-devs mailing list
>>>>ghc-devs at haskell.org
>>>>http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20211217/e68218f0/attachment-0001.html>