[GHC] #9476: Implement late lambda-lifting

GHC ghc-devs at haskell.org
Thu Sep 6 16:09:55 UTC 2018


#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  sgraf
            Type:  feature request   |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:  LateLamLift
Operating System:  Unknown/Multiple  |         Architecture:  Unknown/Multiple
 Type of failure:  Runtime           |            Test Case:
  performance bug                    |
      Blocked By:                    |             Blocking:
 Related Tickets:  #8763 #13286      |  Differential Rev(s):
       Wiki Page:  LateLamLift       |
-------------------------------------+-------------------------------------

Comment (by sgraf):

 I investigated several variations of the configuration that intuitively
 should give the best results. Specifically, I played with 3 parameters:

 1. `-fstg-lift-lams-known`: Allow turning known into unknown calls.
 Default: Don't allow.

    Imagine we lift `f` in the following program:
 {{{
 let g x = ...
     f y = ... g y ...
 in f 2
 ==>
 f g y = ... g y ...;
 let g x = ...
 in f g 2
 }}}

    This turns the known call to `g` within `f`'s body into an unknown one
 (let's call it ''anonymisation'' for the sake of this post), which needs
 to go through one of the generic apply functions. My gut tells me this
 probably isn't worth the risk: there's a (very small) chance that, when
 there's no matching slow call pattern, this turns into a very slow call,
 which, as I understand it, would allocate.
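 To make the worry concrete, here's a source-level Haskell sketch of
 anonymisation (the names are made up for exposition, not from any
 benchmark): after lifting, the call to `g` goes through a function-valued
 parameter, i.e. an unknown call.

```haskell
-- Before lifting: f's call to g is a known, direct call.
before :: Int -> Int
before n =
  let g x = x * 2     -- known function
      f y = g y + 1   -- direct (known) call to g
  in f n

-- After lifting f to the top level, f must abstract over its free
-- variable g, so the call to g goes through a function argument: an
-- unknown call that STG routes through a generic apply function.
fLifted :: (Int -> Int) -> Int -> Int
fLifted g y = g y + 1

after :: Int -> Int
after n =
  let g x = x * 2
  in fLifted g n
```

 Both compute the same result; only the lifted version pays for a generic
 (potentially slow) call to `g`.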

 2. `-fstg-lift-lams-rec-args <n>`: Allow lifting as long as the resulting
 recursive function has at most n arguments. Default: 5, the number of
 available registers.

    Excess arguments are passed on the stack, which would mean the same
 slow memory accesses we try to avoid. Still, allocating on the stack
 rather than on the heap might be favorable, but I'm not sure how
 stack-passed arguments interact with (non-tail) recursion: e.g. would we
 have to pass the same static args all over again for each nested
 recursive activation?

    Anyway, I measured.
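 As a source-level illustration of that static-argument concern (names are
 mine, purely for exposition): lifting a recursive local function turns its
 captured free variables into arguments that every recursive call has to
 pass again.

```haskell
-- Before lifting: go captures a and b from its environment; the
-- recursive call passes only the changing list argument.
sumScaledLocal :: Int -> Int -> [Int] -> Int
sumScaledLocal a b = go
  where
    go []     = 0
    go (x:xs) = a * x + b + go xs

-- After lifting: a and b are explicit parameters, so every recursive
-- activation re-passes the same static arguments -- on the stack once
-- the available argument registers are exhausted.
sumScaledLifted :: Int -> Int -> [Int] -> Int
sumScaledLifted _ _ []     = 0
sumScaledLifted a b (x:xs) = a * x + b + sumScaledLifted a b xs
```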

 3. `-fstg-lift-lams-nonrec-args <n>`: Allow lifting as long as the
 resulting non-recursive function has at most n arguments. Default: 5.

    Lifting non-recursive functions could have different effects than
 lifting recursive ones, because a) there are no recursive calls, so we pay
 the call overhead only once, and b) they are probably big enough that the
 call overhead is negligible.

 I'll abbreviate each configuration I tested by a triple
 `{t,f}-<nrec>-<nnonrec>`, so the (current) default parameter would be
 `f-5-5`.

 - `t-5-5` -- or: allow turning known calls into unknown calls. Total mean
 changes: -0.0% allocs, -0.0% counted instructions
   Numbers in attachment:unknown.txt. No regressions in allocations (that's
 what we have the cost model for, after all), with two benchmarks standing
 out:
   * `rewrite`: -1.5% allocs, -0.1% counted instructions. Changes were
 somewhere in `base`, so I didn't bother to investigate further
   * `mate`: -0.9% allocs, -0.1% counted instructions. This one lifted
 recursive functions of the form
 {{{
 let {
   $warg_scDq [InlPrag=NOUSERINLINE[2],
               Occ=OnceL*!,
               Dmd=<L,C(C1(C1(U)))>]
     :: Board.Kind
         -> Board.Square
         -> [(Move.MoveInFull, Board.Board)]
         -> [(Move.MoveInFull, Board.Board)]
   [LclId, Arity=3, Str=<S,U><L,U(U(U),U(U))><L,U>, Unf=OtherCon []] = ...
 } in ... let {
   go_scKg [Occ=LoopBreaker]
     :: [(Board.Kind, Board.Square)] -> [(Move.MoveInFull, Board.Board)]
   [LclId, Arity=1, Str=<S,1*U>, Unf=OtherCon []] =
       sat-only [go_scKg $warg_scDq] \r [ds_scKh]
           case ds_scKh of {
             [] -> [] [];
             : y_scKj [Occ=Once!] ys_scKk [Occ=Once] ->
                 case y_scKj of {
                   (,) ww3_scKm [Occ=Once] ww4_scKn [Occ=Once] ->
                       let {
                         sat_scKo [Occ=Once] :: [(Move.MoveInFull,
 Board.Board)]
                         [LclId] =
                             [ys_scKk go_scKg] \u [] go_scKg ys_scKk;
                       } in  $warg_scDq ww3_scKm ww4_scKn sat_scKo;
                 };
           };
 } in  go_scKg ww_scCR;
 }}}
     Which is exactly the kind of lift I thought we wouldn't want to make:
 lifting `go` to top-level will result in abstracting over `$warg`, which
 will turn the known call into an unknown one. Perhaps this is only
 beneficial because the unknown call isn't on the hot path.

 - `f-<n>-5`: This varied the maximum number of recursive args between 5
 and 10 (attachment:rec-5-6-8-10.txt).
   Allowing 6 parameters lifted some more functions, 8 parameters didn't do
 anything more than 6, and 10 parameters did another influential lift
 (-7.7% allocs in `mandel2`, but +0.3% in counted instructions). The only
 really beneficial lift here was in `fibheaps`, happening with n >= 6
 (-0.5% on both allocs and counted instructions). The rest seems to be
 just noise.

   So, what happened in `fibheaps`? It seems two recursive functions were
 lifted, both taking 6 arguments. Ah, but one of those arguments (the last
 one, in fact) is a 'void' parameter (so, slow call pattern `pppppv`),
 which is completely erased from the resulting Cmm! ... the tip of my
 branch should allow the lift here now.

 - `f-5-<n>`: This varied the maximum number of non-recursive args between
 5 and 10 (attachment:nonrec-5-6-8-10.txt).
   Allowing up to 8 parameters had a great effect on allocations in
 `cichelli` (-10.4%), while also improving counted instructions negligibly
 (-0.0%). Allowing 10 parameters also had a tiny effect on `simple` (-0.9%
 allocs, -0.1% counted instructions). Codegen for both benchmarks reveals
 that the changes hide somewhere in `base`, so I'm not investigating
 further at the moment; it doesn't seem worth the time.

 - `f-8-8`: To test the interaction of both tuning parameters. No
 surprising results: attachment:8-8.txt (the baseline doesn't use the
 `fibheaps` opportunity which is now optimised in `f-5-5`)

 I didn't bother evaluating the combination of allowing anonymisation of
 calls with the max bound on parameters, because I think they are largely
 independent of one another.

 Should I evaluate more variants, like allowing the lifting of
 undersaturated applications (are there even any in STG? I thought not, but
 then there's `satCallsOnly`)? I don't think these would be worthwhile,
 except when the resulting PAP is on a cold path...
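 For illustration (hypothetical names, not from `satCallsOnly` itself), the
 kind of undersaturated application in question allocates a PAP at the use
 site, and lifting the applied function can only move that allocation
 around, not remove it:

```haskell
addThree :: Int -> Int -> Int -> Int
addThree x y z = x + y + z

-- An undersaturated call: only two of three arguments are supplied, so
-- STG allocates a partial-application (PAP) closure for pap. Lifting
-- wouldn't remove this allocation; it would merely shift it to the call
-- site.
usePAP :: Int -> Int
usePAP n =
  let pap = addThree 1 2
  in pap n + pap (n + 1)
```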

 -----

 So, here's a **TLDR** and some questions:

 1. Should we allow anonymisation (see 1) above)? I don't see big wins
 (configuration `t-5-5`), and I don't really understand why there are any
 at all, which is probably the bigger problem here.
 2. I think the `f-5-5` configuration is a good default. Can you think of
 any other parameters I could vary? Looking at the Wiki page, I think these
 are the other ones Nicolas evaluated (his evaluation was on Core, which is
 a lot less explicit about these things):
    i. Abstracting over/lifting LNEs: That's a no-no, as the last few
 years have shown
    ii. Abstract undersaturated applications: As I wrote above, I don't
 think these are worthwhile, because they only shift allocation to call
 sites via PAPs (and they also aren't trivial to implement because of
 STG's ANF guarantees)
    iii. Abstract (over-)saturated applications: That's a no-brainer in my
 eyes. Oversats are just regular calls returning something unknown that we
 then call. If the known call is worthwhile to lift, just do it; the
 unknown call won't change a bit.
    iv. Create PAPs: That's a special case of ii., I think; or at least it
 is the case that would entail special handling in the transformation,
 because STG only allows atoms as arguments.
    v. Use strictness: Seems to be related to anticipating CorePrep, which
 fortunately we no longer have to do.
    vi. Use one-shot info: I'm not entirely sure what this means either.
 Lifting would spoil one-shot info, but I don't think this has any
 operational consequences, because one-shot info was already used to
 identify single-entry thunks.

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:27>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler


More information about the ghc-tickets mailing list