[GHC] #10606: avoid redundant stores to the stack when examining already-tagged data
GHC
ghc-devs at haskell.org
Sun Jul 5 20:08:53 UTC 2015
#10606: avoid redundant stores to the stack when examining already-tagged data
-------------------------------------+-------------------------------------
Reporter: rwbarton | Owner:
Type: feature | Status: new
request | Milestone:
Priority: normal | Version: 7.11
Component: Compiler | Operating System: Unknown/Multiple
(CodeGen) | Type of failure: None/Unknown
Keywords: | Blocked By:
Architecture: | Related Tickets:
Unknown/Multiple |
Test Case: |
Blocking: |
Differential Revisions: |
-------------------------------------+-------------------------------------
GHC compiles a function that performs case analysis on a value of an ADT
like
{{{
bool :: a -> a -> Bool -> a
bool f t b = case b of
False -> f
True -> t
}}}
to Cmm of the form
{{{
{offset
cwV:
if ((Sp + -24) < SpLim) goto cwW; else goto cwX;
cwW:
R4 = R4;
R3 = R3;
R2 = R2;
R1 = Bool.bool_closure;
call (stg_gc_fun)(R4, R3, R2, R1) args: 8, res: 0, upd: 8;
cwX:
I64[Sp - 24] = cwL; -- (*)
R1 = R4;
P64[Sp - 16] = R2; -- (†1)
P64[Sp - 8] = R3; -- (†2)
Sp = Sp - 24; -- (‡)
if (R1 & 7 != 0) goto cwL; else goto cwM;
cwM:
call (I64[R1])(R1) returns to cwL, args: 8, res: 8, upd: 8;
cwL:
if (R1 & 7 >= 2) goto cwT; else goto cwU;
cwT:
R1 = P64[Sp + 16];
Sp = Sp + 24;
call stg_ap_0_fast(R1) args: 8, res: 0, upd: 8;
cwU:
R1 = P64[Sp + 8];
Sp = Sp + 24;
call stg_ap_0_fast(R1) args: 8, res: 0, upd: 8;
}
}}}
Statement (*) stores a return address for the evaluation of `b` to return
to, and statements (†1), (†2) save local variables that are live in case
alternatives, since they cannot be held in registers across the evaluation
of `b`. But in the event that `b` is already evaluated and represented by
a tagged pointer, all these stores are unnecessary: the return address
written by (*) is simply dead, and the values saved in (†1), (†2) are
still available in whatever locations they were copied to the stack from.
In many cases the data we examine is mostly tagged, and while the active
part of the stack is likely to be in L1 cache, the cost of these stores
and reads is probably still positive (barring secondary effects from
changes to pipelining, branch prediction, and so on).
In this case we could certainly move the return address store (*) into
block `cwM`, and possibly move the local variable stores (†1), (†2) into
`cwM` as well, though it's then not clear to me how to recover the values
in the alternatives (does Cmm have something like phi nodes?) I don't
propose to move the statement (‡), as arithmetic on registers is
essentially free anyways.
I tried implementing the part of this pertaining to the return address (*)
and ran into two complications.
* For some reason, when I moved the return address store (*) into the
"data is not tagged" branch in the Stg->Cmm translation, this also
resulted in both the local variable stores (†1), (†2) and the update to Sp
(‡) being sunk into both branches of the "is the data tagged" conditional
at some point in the Cmm optimization pipeline. This was useless since
they couldn't be pushed further past the branch on the returned tag value,
so the result was enlarged code size that outweighed the savings of
avoiding a single store. I didn't investigate exactly why this sinking was
dependent on the location of the store (*), but this should be fixable.
* There may be heap checks in the alternatives. In that case, the code
generator currently cleverly reuses the stack frame and info table set up
for the evaluation of `b` in the heap failure branches. If we move some of
the stores (*), (†1), (†2) into the evaluation branch `cwM`, then we
either have to duplicate them in heap failure branches, or set up a new
stack frame and info table, or do some other clever thing. Or in the worst
case, only do this optimization when performing the heap check before the
case (which may then become slightly more attractive).
I'm attaching the current version of my patch mainly for my own future
reference; it seems to produce correct, but larger and marginally slower
code, I believe for the reasons described above.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10606>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list