[GHC] #9476: Implement late lambda-lifting
GHC
ghc-devs at haskell.org
Thu Sep 6 16:09:55 UTC 2018
#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
Reporter: simonpj | Owner: sgraf
Type: feature request | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.8.2
Resolution: | Keywords: LateLamLift
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Runtime performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #8763 #13286 | Differential Rev(s):
Wiki Page: LateLamLift |
-------------------------------------+-------------------------------------
Comment (by sgraf):
I investigated several variations of the configuration that intuitively
should give the best results. Specifically, I played with three parameters:
1. `-fstg-lift-lams-known`: Allow turning known into unknown calls.
Default: Don't allow.
Imagine we lift `f` in the following program:
{{{
let g x = ...
    f y = ... g y ...   -- known call to g
in f 2
==>
f g y = ... g y ...;    -- the call to g is now unknown
let g x = ...
in f g 2
}}}
This turns the known call to `g` within `f`'s body into an unknown one
(let's call it ''anonymisation'' for the sake of this post), which needs
to go through one of the generic apply functions. My gut tells me this
probably isn't worth the risk: there's the (very small) chance that, when
there's no matching slow call pattern, this turns into a very slow call,
which, from my understanding, would allocate.
2. `-fstg-lift-lams-rec-args <n>`: Allow lifting as long as the resulting
recursive function has at most n arguments. Default: 5, the number of
available registers.
Excess arguments are passed on the stack, which would mean the same
slow memory accesses we are trying to avoid. Still, allocating on the stack
rather than on the heap might be favorable, but I'm not sure how passing
arguments on the stack interacts with (non-tail) recursion: would it mean
we had to pass the same, static arguments all over again for each nested
recursive activation? (See the sketch after this list for the kind of lift
these limits apply to.)
Anyway, I measured.
3. `-fstg-lift-lams-nonrec-args <n>`: Allow lifting as long as the
resulting non-recursive function has at most n arguments. Default: 5.
Lifting non-recursive functions could have different effects than
lifting recursive ones, because a) there are no recursive calls, so we pay
the call overhead only once, and b) they are probably huge enough that the
call overhead is negligible.
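To make the argument-count limits concrete, here is a hedged sketch
(hypothetical names, not taken from nofib) of the kind of lift these two
flags gate: lifting turns every free variable of the local function into
an extra parameter, and it is the arity of the lifted function that gets
bounded.
{{{
-- Hypothetical sketch: 'go' has five free variables (a..e) and one
-- parameter of its own, so the lifted version would take six arguments --
-- more than the five argument registers assumed by the default f-5-5.
module Sketch where

sumSteps :: Int -> Int -> Int -> Int -> Int -> Int -> Int
sumSteps a b c d e n0 = go n0
  where
    -- free variables of 'go': a, b, c, d, e
    go :: Int -> Int
    go 0 = 0
    go n = a * b + c * d + e + go (n - 1)

-- After lifting, roughly:
--
--   goLifted a b c d e 0 = 0
--   goLifted a b c d e n = a * b + c * d + e + goLifted a b c d e (n - 1)
--
-- i.e. the static arguments a..e are re-passed on every recursive call;
-- with -fstg-lift-lams-rec-args=5 this lift would be rejected (6 > 5),
-- and -fstg-lift-lams-nonrec-args plays the same role for non-recursive
-- functions.
}}}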
I'll abbreviate each configuration I tested by a triple
`{t,f}-<nrec>-<nnonrec>`, so the (current) default parameter would be
`f-5-5`.
- `t-5-5` -- or: allow turning known calls into unknown calls. Total mean
changes: -0.0% allocs, -0.0% counted instructions
Numbers in attachment:unknown.txt. No regressions in allocations (that's
what we have the cost model for, after all), with two benchmarks standing
out:
* `rewrite`: -1.5% allocs, -0.1% counted instructions. The changes were
somewhere in `base`, so I didn't bother to investigate further
* `mate`: -0.9% allocs, -0.1% counted instructions. This one lifted
recursive functions of the form
{{{
let {
$warg_scDq [InlPrag=NOUSERINLINE[2],
Occ=OnceL*!,
Dmd=<L,C(C1(C1(U)))>]
:: Board.Kind
-> Board.Square
-> [(Move.MoveInFull, Board.Board)]
-> [(Move.MoveInFull, Board.Board)]
[LclId, Arity=3, Str=<S,U><L,U(U(U),U(U))><L,U>, Unf=OtherCon []] = ...
} in ... let {
go_scKg [Occ=LoopBreaker]
:: [(Board.Kind, Board.Square)] -> [(Move.MoveInFull, Board.Board)]
[LclId, Arity=1, Str=<S,1*U>, Unf=OtherCon []] =
sat-only [go_scKg $warg_scDq] \r [ds_scKh]
case ds_scKh of {
[] -> [] [];
: y_scKj [Occ=Once!] ys_scKk [Occ=Once] ->
case y_scKj of {
(,) ww3_scKm [Occ=Once] ww4_scKn [Occ=Once] ->
let {
sat_scKo [Occ=Once] :: [(Move.MoveInFull,
Board.Board)]
[LclId] =
[ys_scKk go_scKg] \u [] go_scKg ys_scKk;
} in $warg_scDq ww3_scKm ww4_scKn sat_scKo;
};
};
} in go_scKg ww_scCR;
}}}
This is exactly the kind of lift that I thought we don't want to make:
lifting `go` to the top level will result in abstracting over `$warg`,
which will turn the known call into an unknown one. Perhaps this is only
beneficial because the unknown call isn't on the hot path.
- `f-<n>-5`: This varied the max number of recursive args between 5 and 10
(attachment:rec-5-6-8-10.txt).
Allowing 6 parameters lifted some more functions, 8 parameters didn't do
anything more than 6, and 10 parameters did another influential lift (-7.7%
allocs in `mandel2`, but +0.3% in ci). The only really beneficial lift here
was in `fibheaps`, happening with n >= 6 (-0.5% on both allocs and ci).
The rest seems to be just noise.
So, what happened in `fibheaps`? It seems two recursive functions were
lifted, both taking 6 arguments. Ah, but one of those arguments (the last
one, in particular) is a 'void' parameter (so, slow call pattern pppppv),
which is completely erased from the resulting Cmm! ... the tip of my branch
should allow the lift here now.
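For illustration, here is a hedged sketch (hypothetical code, not the
actual `fibheaps` definition) of how such a void argument arises: in IO
code, the state token becomes an extra void-representation argument at the
STG level, so it occupies a slot in the slow call pattern but no register
or stack slot in the generated Cmm.
{{{
-- Hypothetical sketch (assuming the arguments stay boxed): lifting 'go'
-- abstracts over its four free variables, so the lifted function takes
-- ref, a, b, c, n plus the IO state token -- six arguments with slow call
-- pattern pppppv, of which the trailing void one is erased in Cmm.
module VoidArg where

import Data.IORef

bumpN :: IORef Int -> Int -> Int -> Int -> Int -> IO ()
bumpN ref a b c n0 = go n0
  where
    -- free variables of 'go': ref, a, b, c; the State# RealWorld token
    -- that IO desugars to is the final, void, argument at the STG level
    go :: Int -> IO ()
    go 0 = return ()
    go n = do
      modifyIORef' ref (+ (a + b + c))
      go (n - 1)
}}}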
- `f-5-<n>`: This varied the max number of non-recursive args between 5 and
10 (attachment:nonrec-5-6-8-10.txt).
Allowing up to 8 parameters had a great effect on allocations in
`cichelli` (-10.4%), while also improving counted instructions negligibly
(-0.0%). Allowing 10 parameters also had a tiny effect on `simple` (-0.9%
allocs, -0.1% ci). Codegen for both benchmarks reveals that the changes
hide somewhere in `base`, so I'm not investigating further at the moment;
it doesn't seem worth the time.
- `f-8-8`: To test the interaction of both tuning parameters. No
surprising results: attachment:8-8.txt (the baseline doesn't use the
`fibheaps` opportunity which is now optimised in `f-5-5`)
I didn't bother evaluating the combination of allowing anonymisation of
calls with the max bound on parameters, because I think they are largely
independent of one another.
Should I evaluate more variants, like allowing the lifting of undersaturated
applications (are there even any in STG? I thought not, but then there's
`satCallsOnly`)? I don't think these would be worthwhile, except when the
resulting PAP is on a cold path...
-----
So, here's a **TLDR** and some questions:
1. Should we allow anonymisation (see 1) above)? I don't see big wins
(configuration `t-5-5`), and don't really understand why there are any at
all, which probably is the bigger problem here.
2. I think the `f-5-5` configuration is a good default. Can you think of
any other parameters I could vary? Looking at the Wiki page, I think these
are the other ones Nicolas evaluated (his evaluation was on Core, which is
a lot less explicit about these things):
i. Abstracting over/lifting LNEs: That's a no-no, as the last few years
have shown
ii. Abstracting undersaturated applications: As I wrote above, I don't
think these are worthwhile, because they only shift allocation to call
sites via PAPs (and they also aren't trivial to implement because of STG's
ANF guarantees); see the sketch below this list
iii. Abstracting (over-)saturated applications: That's a no-brainer in my
eyes. Oversaturated applications are just regular calls returning something
unknown that we then call. If the known call is worthwhile to lift, just do
it; the unknown call won't change a bit.
iv. Create PAPs: That's a special case of ii., I think, or at least it's
the case that would entail a special case in the transformation, because
STG only allows atoms as arguments.
v. Use strictness: Seems to be related to anticipating CorePrep, which
fortunately we no longer have to do.
vi. Use one-shot info: I'm not entirely sure what this means either.
Lifting would spoil one-shot info, but I don't think this has any
operational consequences, because one-shot info was already used to
identify single-entry thunks.
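As a hedged illustration of ii. and iii. (made-up functions, not from the
ticket or nofib), here is roughly what under- and over-saturated calls look
like at the source level:
{{{
-- Hypothetical sketch for ii./iii. above.
module Sat where

addMul :: Int -> Int -> Int
addMul x y = x * y + x

-- ii. Undersaturated: 'addMul x' supplies only one of two arguments, so a
-- partial application must be allocated at the call site (as a PAP or an
-- equivalent function closure); lifting choices would only move that
-- allocation around, not remove it.
papExample :: Int -> [Int] -> [Int]
papExample x = map (addMul x)

-- iii. Oversaturated (assuming GHC keeps 'mk' at arity 1; the expensive
-- shared 'let' should prevent eta-expansion): 'mk a b' first makes the
-- known, saturated call 'mk a' and then applies the unknown result to 'b'
-- via a generic apply function. Lifting 'mk' wouldn't change that second,
-- unknown call at all.
mk :: Int -> Int -> Int
mk x = let s = sum [1 .. x]
       in \y -> s + y

overExample :: Int -> Int -> Int
overExample a b = mk a b
}}}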
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:27>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler