[GHC] #10606: avoid redundant stores to the stack when examining already-tagged data

Sun Jul 5 20:08:53 UTC 2015

#10606: avoid redundant stores to the stack when examining already-tagged data
-------------------------------------+-------------------------------------
              Reporter:  rwbarton    |             Owner:
                  Type:  feature     |            Status:  new
  request                            |         Milestone:
              Priority:  normal      |           Version:  7.11
             Component:  Compiler    |  Operating System:  Unknown/Multiple
  (CodeGen)                          |   Type of failure:  None/Unknown
              Keywords:              |        Blocked By:
          Architecture:              |   Related Tickets:
  Unknown/Multiple                   |
             Test Case:              |
              Blocking:              |
Differential Revisions:              |
-------------------------------------+-------------------------------------
 GHC compiles a function that performs case analysis on a value of an ADT
 like
 {{{
 bool :: a -> a -> Bool -> a
 bool f t b = case b of
   False -> f
   True  -> t
 }}}
 to Cmm of the form
 {{{
      {offset
        cwV:
            if ((Sp + -24) < SpLim) goto cwW; else goto cwX;
        cwW:
            R4 = R4;
            R3 = R3;
            R2 = R2;
            R1 = Bool.bool_closure;
            call (stg_gc_fun)(R4, R3, R2, R1) args: 8, res: 0, upd: 8;
        cwX:
            I64[Sp - 24] = cwL;     -- (*)
            R1 = R4;
            P64[Sp - 16] = R2;      -- (†1)
            P64[Sp - 8] = R3;       -- (†2)
            Sp = Sp - 24;           -- (‡)
            if (R1 & 7 != 0) goto cwL; else goto cwM;
        cwM:
            call (I64[R1])(R1) returns to cwL, args: 8, res: 8, upd: 8;
        cwL:
            if (R1 & 7 >= 2) goto cwT; else goto cwU;
        cwT:
            R1 = P64[Sp + 16];
            Sp = Sp + 24;
            call stg_ap_0_fast(R1) args: 8, res: 0, upd: 8;
        cwU:
            R1 = P64[Sp + 8];
            Sp = Sp + 24;
            call stg_ap_0_fast(R1) args: 8, res: 0, upd: 8;
      }
 }}}
 Statement (*) stores a return address for the evaluation of `b` to return
 to, and statements (†1), (†2) save local variables that are live in case
 alternatives, since they cannot be held in registers across the evaluation
 of `b`. But in the event that `b` is already evaluated and represented by
 a tagged pointer, all these stores are unnecessary: the return address
 written by (*) is simply dead, and the values saved in (†1), (†2) are
 still available in whatever locations they were copied to the stack from.

 In many cases the data we examine is mostly tagged, and while the active
 part of the stack is likely to be in L1 cache, the cost of these stores
 and reads is probably still positive (barring secondary effects from
 changes to pipelining, branch prediction, and so on).

 In this case we could certainly move the return address store (*) into
 block `cwM`, and possibly move the local variable stores (†1), (†2) into
 `cwM` as well, though it's then not clear to me how to recover the values
 in the alternatives (does Cmm have something like phi nodes?) I don't
 propose to move the statement (‡), as arithmetic on registers is
 essentially free anyways.

 I tried implementing the part of this pertaining to the return address (*)
 and ran into two complications.

 * For some reason, when I moved the return address store (*) into the
 "data is not tagged" branch in the Stg->Cmm translation, this also
 resulted in both the local variable stores (†1), (†2) and the update to Sp
 (‡) being sunk into both branches of the "is the data tagged" conditional
 at some point in the Cmm optimization pipeline. This was useless since
 they couldn't be pushed further past the branch on the returned tag value,
 so the result was enlarged code size that outweighed the savings of
 avoiding a single store. I didn't investigate exactly why this sinking was
 dependent on the location of the store (*), but this should be fixable.

 * There may be heap checks in the alternatives. In that case, the code
 generator currently cleverly reuses the stack frame and info table set up
 for the evaluation of `b` in the heap failure branches. If we move some of
 the stores (*), (†1), (†2) into the evaluation branch `cwM`, then we
 either have to duplicate them in heap failure branches, or set up a new
 stack frame and info table, or do some other clever thing. Or in the worst
 case, only do this optimization when performing the heap check before the
 case (which may then become slightly more attractive).

 I'm attaching the current version of my patch mainly for my own future
 reference; it seems to produce correct, but larger and marginally slower
 code, I believe for the reasons described above.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10606>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler