[GHC] #9476: Implement late lambda-lifting

Thu Nov 8 17:34:48 UTC 2018

#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  sgraf
            Type:  feature request   |               Status:  patch
        Priority:  normal            |            Milestone:  8.8.1
       Component:  Compiler          |              Version:  7.8.2
      Resolution:                    |             Keywords:  LateLamLift
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #8763 #13286      |  Differential Rev(s):  Phab:D5224
       Wiki Page:  LateLamLift       |
-------------------------------------+-------------------------------------

Comment (by sgraf):

 I'm currently trying to find the right configuration for runtime
 benchmarking.

 When using the NCG on the architecture I benchmark on, there are seemingly
 random performance outliers, even when ignoring benchmarks with less than
 200ms running time. Take `CSD` from `real/eff`, for example. On the target
 architecture (i7-6700), it is consistently 4.5% slower, yet ''there isn't
 a single lifted function in that benchmark''. It's basically just a
 counting loop. To make matters worse, I can't reproduce this on my local
 PC; if anything, I see the opposite there. Altogether this makes for a
 very meager improvement of -0.2% in runtime.

 This leads me to believe that the (relatively minor) benefits are obscured
 by code size and layout concerns. If I only include benchmarks that ran at
 least 500ms, things look much better (-0.4%), but that's probably because
 I excluded the `eff` 'microbenchmarks'.

 I tried another configuration that probably does the optimisation better
 justice: I re-ran the benchmarks with `-fllvm -optlo -Os` to have LLVM
 optimise for size, which in my experience yields results that depend less
 on code layout.

 Anyway, ignoring benchmarks with <200ms runtime yields an improvement of
 -1.0% (result:
 https://ghc.haskell.org/trac/ghc/attachment/ticket/9476/nofib.txt), while
 ignoring all benchmarks with <500ms runtime yields a -1.2% improvement.
 Ironically, runtime of `CSD` ''improved'' by -7.1%.
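
 To illustrate how much the runtime cutoff alone can move the summary
 figure, here is a minimal sketch. The benchmark names and timings are made
 up for illustration (they are not taken from nofib.txt), and it assumes
 the summary is a geometric mean of the patched/baseline runtime ratios;
 `mean_change` is a hypothetical helper, not part of nofib.

```python
from math import exp, log

# Hypothetical (baseline_seconds, patched_seconds) pairs, NOT real nofib data.
RESULTS = {
    "tiny-loop": (0.10, 0.11),   # short-running, dominated by layout noise
    "mid-a": (0.30, 0.297),
    "big-a": (0.80, 0.784),
    "big-b": (2.00, 1.98),
}


def mean_change(results, min_runtime):
    """Geometric-mean runtime change in percent, restricted to benchmarks
    whose baseline ran for at least `min_runtime` seconds.
    Negative values mean the patched build is faster."""
    ratios = [new / old for old, new in results.values() if old >= min_runtime]
    if not ratios:
        raise ValueError("no benchmark meets the runtime threshold")
    geo_mean = exp(sum(log(r) for r in ratios) / len(ratios))
    return (geo_mean - 1.0) * 100.0


if __name__ == "__main__":
    # Compare the summary with no cutoff, a 200ms cutoff and a 500ms cutoff.
    for threshold in (0.0, 0.2, 0.5):
        print(f"min {threshold:.1f}s: {mean_change(RESULTS, threshold):+.2f}%")
```

 With data shaped like this, a single noisy short-running benchmark is
 enough to flip the sign of the overall figure, which is why the cutoff
 matters so much when the real effect is around one percent.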

 Also notable: while `n-body` allocates 20% less (heap space!), it got
 slower by a non-meaningful margin of 0.1%. Maybe watching out for
 allocations isn't the be-all and end-all here.

 I really think we should flag which benchmarks are eligible for runtime
 measurements. I get hung up on what turn out to be architectural wibbles
 ''all the time''.

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:52>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler