[GHC] #9476: Implement late lambda-lifting

GHC ghc-devs at haskell.org
Wed Jul 11 16:25:01 UTC 2018


#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  nfrisby
            Type:  feature request   |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #8763             |  Differential Rev(s):
       Wiki Page:  LateLamLift       |
-------------------------------------+-------------------------------------

Comment (by sgraf):

 Replying to [comment:16 simonpj]:
 > Thoughts
 >
 > * There are a handful of spectacular reductions in allocation (queens,
 n-body).  It'd be good to understand and explain them.  Perhaps we can
 more closely target LLF on those cases.

 It's hard to tell for `n-body`: The `-fno-llf` variant shows the same
 reduction in allocations, meaning that the reduction must happen somewhere
 in the base libraries rather than in `n-body` itself.

 `queens` has its `go1` function within `gen` lifted (arity 1 with 3 free
 vars). Non-lifted Core after CorePrep:

 {{{
 let {
   n_s5S3 [Occ=OnceL*] :: [[GHC.Types.Int]]
   [LclId]
   n_s5S3 = go_s5RY ys_s5S2 } in
 case GHC.Prim.># 1# ww1_s5RX of {
   __DEFAULT ->
     letrec {
       go1_s5S5 [Occ=LoopBreaker]
         :: GHC.Prim.Int# -> [[GHC.Types.Int]]
       [LclId, Arity=1, Str=<S,U>, Unf=OtherCon []]
       go1_s5S5
         = \ (x1_s5S6 :: GHC.Prim.Int#) ->
             case Main.main_$ssafe y_s5S1 1# x1_s5S6 of {
               GHC.Types.False ->
                 case GHC.Prim.==# x1_s5S6 ww1_s5RX of {
                   __DEFAULT ->
                     case GHC.Prim.+# x1_s5S6 1#
                     of sat_s5S9 [Occ=Once]
                     { __DEFAULT ->
                     go1_s5S5 sat_s5S9
                     };
                   1# -> n_s5S3
                 };
               GHC.Types.True ->
                 let {
                   sat_s5Se [Occ=Once] :: [[GHC.Types.Int]]
                   [LclId]
                   sat_s5Se
                     = case GHC.Prim.==# x1_s5S6 ww1_s5RX
                       of {
                         __DEFAULT ->
                           case GHC.Prim.+# x1_s5S6 1#
                           of sat_s5Sd [Occ=Once]
                           { __DEFAULT ->
                           go1_s5S5 sat_s5Sd
                           };
                         1# -> n_s5S3
                       } } in
                 let {
                   sat_s5Sa [Occ=Once] :: GHC.Types.Int
                   [LclId]
                   sat_s5Sa = GHC.Types.I# x1_s5S6 } in
                 let {
                   sat_s5Sb [Occ=Once] :: [GHC.Types.Int]
                   [LclId]
                   sat_s5Sb
                     = GHC.Types.:
                         @ GHC.Types.Int sat_s5Sa y_s5S1 } in
                 GHC.Types.:
                   @ [GHC.Types.Int] sat_s5Sb sat_s5Se
             }; } in
     go1_s5S5 1#;
   1# -> n_s5S3
 }
 }}}

 And when `go1` is lifted to top-level, the CorePrep'd call site changes to

 {{{
 case GHC.Prim.># 1# ww1_s5Ts of {
   __DEFAULT ->
     let {
       sat_s5Tz [Occ=Once, Dmd=<L,1*U>] :: [[GHC.Types.Int]]
       [LclId]
       sat_s5Tz = go_s5Tt ys_s5Tx } in
     llf_go_r5SK y_s5Tw ww1_s5Ts sat_s5Tz 1#;
   1# -> go_s5Tt ys_s5Tx
 }
 }}}
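 At the source level, the lift corresponds roughly to the following sketch
 (a simplified, hypothetical example with invented names; the real Core
 works on unboxed `Int#` and includes the `safe` test):

 {{{
 -- Before: go1 is a local letrec with free variables y, lim and n;
 -- entering f allocates a heap closure for go1.
 f :: [Int] -> Int -> [[Int]] -> [[Int]]
 f y lim n = go1 1
   where
     go1 x
       | x > lim   = n
       | otherwise = (x : y) : go1 (x + 1)

 -- After: the free variables become leading parameters of a
 -- top-level function, so no closure is allocated for go1 and
 -- the call site passes y, lim and n explicitly.
 llf_go1 :: [Int] -> Int -> [[Int]] -> Int -> [[Int]]
 llf_go1 y lim n x
   | x > lim   = n
   | otherwise = (x : y) : llf_go1 y lim n (x + 1)

 f' :: [Int] -> Int -> [[Int]] -> [[Int]]
 f' y lim n = llf_go1 y lim n 1
 }}}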

 The important detail is that `n_s5S3` corresponds to `sat_s5Tz` in the
 lifted version. Both have a 'single' occurrence, but `n_s5S3`'s occurrence
 is under a non-one-shot lambda (which, apparently and understandably, is
 enough for OccAnal to give up, as opposed to analysing parameters of
 recursive functions), so `n_s5S3` is allocated as an updateable thunk and
 causes an increase in heap allocation, while `sat_s5Tz` is single-entry.
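 Schematically (pseudo-Core with invented names, not the actual dump):

 {{{
 -- Before lifting: n occurs once syntactically, but under the
 -- \x of a recursive function that may be entered many times,
 -- so OccAnal can't mark it single-entry and it gets an update
 -- frame:
 let n = go ys in
 letrec go1 x = ... n ... go1 (x +# 1#) ... in
 go1 1#

 -- After lifting: n is an argument at the (single) call site,
 -- occurring once and under no lambda, so it can be allocated
 -- as a cheaper single-entry thunk:
 let n = go ys in
 llf_go1 y lim n 1#
 }}}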

 So, I'm afraid this gain can't entirely be claimed by LLF. A late demand
 analysis run would probably detect the same opportunity. There's also
 `cichelli`, which shows similar improvements (-10% allocations) due to
 something happening in `Prog.hs`, but there are too many lifts to bisect,
 so I'll postpone that to a follow-up post.

 > * I don't think we should float join points at all, recursive or non-
 recursive.  Think of them like labels in a control-flow graph.

 This was also what I thought, but there can be synergistic effects with
 the inliner if we manage to "outline" (that's what the transformation
 would be called in an imperative language) huge non-recursive join points,
 where the call overhead is negligible. At least that's what the wiki page
 mentions. But that brings me to the next point...
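 For contrast, a join point already compiles to a jump (schematic
 pseudo-Core):

 {{{
 -- j is a join point: both "calls" below compile to plain jumps,
 -- with no closure allocation and no call overhead.
 join j x = <big body> in
 case e of
   A -> j 1
   B -> j 2

 -- Lifting j to top level would turn those jumps into genuine
 -- function calls and pass j's free variables explicitly -- a
 -- pessimisation, unless <big body> is so huge that "outlining"
 -- it shrinks the enclosing function enough to enable inlining.
 }}}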

 > * I think of LLF as a code-generation strategy, that we do once all
 other transformations are done.  (Lambda-lifting ''can'' affect earlier
 optimisations.  It can make a big function into a small one (by floating
 out its guts), and thereby let it be inlined.  But that is subtle and
 difficult to get consistent gains for.  Let's not complicate LLF by
 thinking about this.)

 Yes, I wasn't sure whether making it play nicely with the simplifier was
 intended here. I take comfort in your confirmation that it's a code-gen
 strategy.

 > * Given that it's a code-gen strategy, doing it on STG makes perfect
 sense to me.  You've outlined the pros and cons well. Definitely worth a
 try.

 Will do.

 > I'm not sure what you meant by "It's not enough to look at Core alone to
 gauge allocation" as a disadvantage.

 Maybe it's just me only slowly understanding every layer of compilation in
 GHC, but I felt like I had a mental model of roughly where and how much
 things allocate after CorePrep. That's probably an illusion anyway,
 considering things like let-no-escapes (that story got better with join
 points, I suppose). Having another pass that turns heap allocation into
 stack allocation *after* CorePrep doesn't exactly help that intuition.

 Or were you thrown off by words I may have mistranslated ("gauge" for
 "estimate", roughly)?

 > When you say "Much less involved analysis that doesn't need to stay in
 sync with CorePrep", I think it would be v helpful to lay out "the
 analysis".  I have vague memories, but I don't know what this "stay in
 sync" stuff is about.

 There is
 [https://github.com/sgraf812/ghc/blob/f5c160c98830fdba83faa9a0634eeab38dbe04d4/compiler/simplCore/SetLevels.hs#L2169-L2214
 this note] that explains why it's necessary to approximate CorePrep. What
 follows is
 [https://github.com/sgraf812/ghc/blob/f5c160c98830fdba83faa9a0634eeab38dbe04d4/compiler/simplCore/SetLevels.hs#L2478
 this new analysis] that interleaves a free-variable analysis with
 computing use information for ids that are free in the floated
 bindings.

 > If you do it in STG you don't need to explain "stay in sync", but
 explaining the analysis would be excellent.

 Yes! I'm not really sure I understand all of it yet; (re-)implementing it
 on STG should help me figure things out.

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:17>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler


More information about the ghc-tickets mailing list