Better calling conventions for strict functions (bang patterns)?

Sun Oct 25 08:09:29 UTC 2015

Hi,

I'm interested in execution performance.

Maybe modern hardware (which implement IA64, ARMv8) is able to predict a
long chain of jumps [1].
But prediction accuracy for indirect jump is low,
especially dynamic addressed indirect jumps.

By the way, Ryan's example code will be fast by following optimization:
(If c3aX is most fast path, c3aX is reached without taken-branch.)

  // Skip to the chase if it's already evaluated:
  start:
//    if (R2 & 7 != 0) goto fastpath; else goto slowpath;
      if (R2 & 7 == 0) goto slowpath;         // *** (1) remove branch for
fastpath

  fastpath: // Formerly c3aO                  // *** (1) move fastpath here
//    if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
      if (R1 & 7 >= 2) goto c3aW;             // *** (2) remove branch for
prior path(c3aX)
  c3aX:                                       // *** (2) move else path to
here(without branch)
      R1 = PicBaseReg + lvl_r39S_closure;
      call (I64[R1])(R1) args: 8, res: 0, upd: 8;  // *** indirect jump,
but fixed address (100% hit)
  c3aW:
      R1 = P64[R1 + 6] & (-8);
      call (I64[R1])(R1) args: 8, res: 0, upd: 8;  // *** indirect jump,
dynamic address (hit or miss)
//c3aX:
//    R1 = PicBaseReg + lvl_r39S_closure;
//    call (I64[R1])(R1) args: 8, res: 0, upd: 8;

  slowpath:   // Formerly c3aY
      if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;
  c3aZ:
      // nop
      R1 = PicBaseReg + foo_closure;
      call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;
  c3b0:
      I64[Sp - 8] = PicBaseReg + block_c3aO_info;
      R1 = R2;
      Sp = Sp - 8;

      call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;
      // Sp bump moved to here so it's separate from "fastpath"
      Sp = Sp + 8;
      goto fastpath;  // ***

//fastpath: // Formerly c3aO
//    if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
//c3aW:
//    R1 = P64[R1 + 6] & (-8);
//    call (I64[R1])(R1) args: 8, res: 0, upd: 8;
//c3aX:
//    R1 = PicBaseReg + lvl_r39S_closure;
//    call (I64[R1])(R1) args: 8, res: 0, upd: 8;

[1]: Intel64 and IA-32 Architectures Optimization Reference Manual

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
       3.4 OPTIMIZING THE FRONT END
       2.3.2.3 Branch Prediction

I'm just studying and drawing about lazy evaluation.
This thread is helpful to me :)

Regards,
Takenobu

2015-10-25 5:53 GMT+09:00 Carter Schonwald <carter.schonwald at gmail.com>:

> Doesn't modern hardware have pretty good branch prediction? In which case
> the order of the branches may not matter unless it's a long chain of calls?
> Vs say an inner loop that hasn't been inlined?
>
> Either way, I'd love be stay in the loop on this topic, for work I'm
> building a strongly normalizing language that supports both strict and call
> by need evaluation strategies.
>
>
> On Friday, October 23, 2015, Ryan Newton <rrnewton at gmail.com> wrote:
>
>>
>>>    1. Small tweaks: The CMM code above seems to be *betting* than the
>>>    thunk is unevaluated, because it does the stack check and stack write
>>>    *before* the predicate test that checks if the thunk is evaluated (if
>>>    (R1 & 7 != 0) goto c3aO; else goto c3aP;).  With a bang-pattern
>>>    function, couldn't it make the opposite bet?  That is, branch on whether
>>>    the thunk is evaluated first, and then the wasted computation is only a
>>>    single correctly predicted branch (and a read of a tag that we need to read
>>>    anyway).
>>>
>>> Oh, a small further addition would be needed for this tweak.  In the
>> generated code above "Sp = Sp + 8;" happens *late*, but I think it could
>> happen right after the call to the thunk.  In general, does it seem
>> feasible to separate the slowpath from fastpath as in the following tweak
>> of the example CMM?
>>
>>
>> *  // Skip to the chase if it's already evaluated:*
>> *  start:*
>> *      if (R2 & 7 != 0) goto fastpath; else goto slowpath;*
>>
>> *  slowpath:   // Formerly c3aY*
>> *      if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;*
>> *  c3aZ:*
>> *      // nop*
>> *      R1 = PicBaseReg + foo_closure;*
>> *      call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;*
>> *  c3b0:*
>> *      I64[Sp - 8] = PicBaseReg + block_c3aO_info;*
>> *      R1 = R2;*
>> *      Sp = Sp - 8;*
>>
>> *      call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;*
>> *      // Sp bump moved to here so it's separate from "fastpath"*
>> *      Sp = Sp + 8;*
>>
>> *  fastpath: // Formerly c3aO*
>> *      if (R1 & 7 >= 2) goto c3aW; else goto c3aX;*
>> *  c3aW:*
>> *      R1 = P64[R1 + 6] & (-8);*
>> *      call (I64[R1])(R1) args: 8, res: 0, upd: 8;*
>> *  c3aX:*
>> *      R1 = PicBaseReg + lvl_r39S_closure;*
>> *      call (I64[R1])(R1) args: 8, res: 0, upd: 8;*
>>
>>
>>
>>
> _______________________________________________
> ghc-devs mailing list
> ghc-devs at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20151025/bc138f64/attachment-0001.html>