[GHC] #9476: Implement late lambda-lifting
GHC
ghc-devs at haskell.org
Wed Jul 4 11:43:10 UTC 2018
#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
Reporter: simonpj | Owner: nfrisby
Type: feature request | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.8.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Runtime performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #8763 | Differential Rev(s):
Wiki Page: LateLamLift |
-------------------------------------+-------------------------------------
Comment (by sgraf):
It took me quite some time, but
[https://github.com/sgraf812/ghc/tree/c1f16ac245ca8f8c8452a5b3c1f116237adcb577
this commit] passes `./validate` (modulo 4 compiler perf tests). Fixing
the testsuite was rather simple, but investigating the various performance
regressions to see which knobs we could turn is really time-consuming, so
I figured I'd better post now rather than never.
I updated the wiki page with a summary of changes I made. For
completeness:
- A hopefully faithful rebase, removing previous LNE (= join point)
detection logic
- Activate all LLF flags (see the above llf-nr10-r6 configuration) by
default
- Actually use the `-fllf-nonrec-lam-limit` setting
- Don't stabilise Unfoldings mentioning `makeStatic`
- Respect RULEs and Unfoldings of the identifiers we abstract over
(previously, when SpecConstr added a RULE mentioning an otherwise absent
specialised join point, we would ignore it, which is not in line with how
CoreFVs works)
- Stabilise Unfoldings only when we lifted something out of a function
(Not doing so led to a huge regression in veritas' Edlib.lhs)
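For context, here is a hedged, source-level sketch of what lambda lifting does. The real pass works on Core/STG rather than Haskell source, and the names here are made up for illustration:

```haskell
-- Before lifting: 'go' closes over x, so evaluating f allocates a
-- closure capturing x for the local loop.
f :: Int -> Int
f x = go 10
  where
    go :: Int -> Int
    go 0 = x
    go n = go (n - 1)

-- After lifting: the free variable x becomes an explicit argument,
-- so the loop can float to the top level and needs no closure.
goLifted :: Int -> Int -> Int
goLifted x 0 = x
goLifted x n = goLifted x (n - 1)

fLifted :: Int -> Int
fLifted x = goLifted x 10
```

The trade-off the rest of this comment discusses is visible even here: the lifted version allocates less, but every call now passes one extra argument.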
I'll attach nofib results in a following post. Here's the summary:
{{{
Program Allocs Allocs Instrs Instrs
no-llf llf no-llf llf
--------------------------------------------------------------------------------
Min -20.3% -20.3% -7.8% -16.5%
Max +2.0% +1.6% +18.4% +18.4%
Geometric Mean -0.4% -1.0% +0.3% -0.0%
}}}
`llf` is a plain benchmark run, whereas `no-llf` means libraries compiled
with `-fllf`, but benchmarks compiled with `-fno-llf`. This is a useful
baseline, as it lets us detect test cases where the regression actually
happens in the test case itself rather than somewhere in the boot
libraries.
Hardly surprisingly, allocations go down. More surprisingly, not in a
consistent fashion. The most illustrative test case is `real/pic`:
{{{
Program     Allocs  Allocs  Instrs  Instrs
            no-llf     llf  no-llf     llf
pic          -0.0%   +1.0%   +0.0%   -3.4%
}}}
Lifting some functions results in functions of rather large arity (6 and
7), which can no longer be fast-called. Apparently, there's no
`stg_ap_pppnn` variant matching the call pattern.
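As a hedged illustration of how lifting inflates arity (a made-up function, not taken from `pic`): a loop with two original arguments that closes over five variables ends up as a top-level function of arity 7, and calls of such shapes can fall back to the generic (slow) apply path when no matching precompiled `stg_ap_*` entry point exists:

```haskell
-- Before lifting, 'loop' would have arity 2 (acc and n) and close
-- over a..e. After lifting, those five free variables become extra
-- arguments, giving arity 7 at every call site.
loopLifted :: Int -> Int -> Int -> Int -> Int -> Int -> Int -> Int
loopLifted a b c d e acc 0 = a + b + c + d + e + acc
loopLifted a b c d e acc n = loopLifted a b c d e (acc + 1) (n - 1)
```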
Also, counted instructions went up in some cases, so there's no real win
to be had. If I completely refrain from lifting non-recursive join
points, things look better with respect to counted instructions:
{{{
Program Allocs Allocs Instrs Instrs
no-llf llf no-llf llf
--------------------------------------------------------------------------------
Min -20.3% -20.3% -3.4% -17.1%
Max +2.0% +1.6% +6.4% +6.4%
Geometric Mean -0.3% -1.0% +0.1% -0.4%
}}}
But I have recently come to question whether cachegrind results (such as
the very relevant counted memory reads/writes) are a reliable metric
(#15333).
There are some open things that should be measured:
- Is it worthwhile at all to lift join points? (Related: wouldn't we
rather want 'custom calling conventions' that inherit register/closure
configurations for top-level bindings?)
- Isn't a reduction in allocations a lie when all we did is spill more
onto the stack? Imagine we lift a (non-tail-recursive) function to the
top level that would have arity > 5. Arguments would have to be passed on
the stack for each recursive call. I'd expect that to be worse than the
status quo. So maybe we shouldn't just count the number of free ids we
abstract over, but rather bound the resulting arity?
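The arity-bounding idea in the last bullet could be sketched as a tiny predicate. This is not the actual LLF heuristic; `maxFastArity` is an assumed cut-off standing in for whatever the largest fast-call pattern supports:

```haskell
-- Hypothetical cut-off: the largest arity we assume can still be
-- fast-called (value chosen for illustration only).
maxFastArity :: Int
maxFastArity = 5

-- | Would lifting a binder of original arity 'arity', abstracting
-- over 'nFreeIds' newly-free variables, keep the resulting top-level
-- arity within the fast-call limit? This bounds the *resulting*
-- arity rather than merely counting the abstracted free ids.
okToLift :: Int -> Int -> Bool
okToLift arity nFreeIds = arity + nFreeIds <= maxFastArity
```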
Finally, the whole transformation feels more like it belongs in the STG
layer: we anticipate CorePrep in a rather brittle way and have to pull
really low-level details into the analysis, all while having to preserve
unfoldings whenever we change anything. It seems like a very local
optimisation (except for enabling intra-module inlining opportunities)
that doesn't enable many other core2core optimisations (nor should it;
that's why we lift late).
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:13>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler