Better calling conventions for strict functions (bang patterns)?
Takenobu Tani
takenobu.hs at gmail.com
Sun Oct 25 08:09:29 UTC 2015
Hi,
I'm interested in execution performance.
Maybe modern hardware (which implement IA64, ARMv8) is able to predict a
long chain of jumps [1].
But prediction accuracy for indirect jump is low,
especially dynamic addressed indirect jumps.
By the way, Ryan's example code will be fast by following optimization:
(If c3aX is most fast path, c3aX is reached without taken-branch.)
// Skip to the chase if it's already evaluated:
start:
// if (R2 & 7 != 0) goto fastpath; else goto slowpath;
if (R2 & 7 == 0) goto slowpath; // *** (1) remove branch for
fastpath
fastpath: // Formerly c3aO // *** (1) move fastpath here
// if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
if (R1 & 7 >= 2) goto c3aW; // *** (2) remove branch for
prior path(c3aX)
c3aX: // *** (2) move else path to
here(without branch)
R1 = PicBaseReg + lvl_r39S_closure;
call (I64[R1])(R1) args: 8, res: 0, upd: 8; // *** indirect jump,
but fixed address (100% hit)
c3aW:
R1 = P64[R1 + 6] & (-8);
call (I64[R1])(R1) args: 8, res: 0, upd: 8; // *** indirect jump,
dynamic address (hit or miss)
//c3aX:
// R1 = PicBaseReg + lvl_r39S_closure;
// call (I64[R1])(R1) args: 8, res: 0, upd: 8;
slowpath: // Formerly c3aY
if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;
c3aZ:
// nop
R1 = PicBaseReg + foo_closure;
call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;
c3b0:
I64[Sp - 8] = PicBaseReg + block_c3aO_info;
R1 = R2;
Sp = Sp - 8;
call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;
// Sp bump moved to here so it's separate from "fastpath"
Sp = Sp + 8;
goto fastpath; // ***
//fastpath: // Formerly c3aO
// if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
//c3aW:
// R1 = P64[R1 + 6] & (-8);
// call (I64[R1])(R1) args: 8, res: 0, upd: 8;
//c3aX:
// R1 = PicBaseReg + lvl_r39S_closure;
// call (I64[R1])(R1) args: 8, res: 0, upd: 8;
[1]: Intel64 and IA-32 Architectures Optimization Reference Manual
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
3.4 OPTIMIZING THE FRONT END
2.3.2.3 Branch Prediction
I'm just studying and drawing about lazy evaluation.
This thread is helpful to me :)
Regards,
Takenobu
2015-10-25 5:53 GMT+09:00 Carter Schonwald <carter.schonwald at gmail.com>:
> Doesn't modern hardware have pretty good branch prediction? In which case
> the order of the branches may not matter unless it's a long chain of calls?
> Vs say an inner loop that hasn't been inlined?
>
> Either way, I'd love be stay in the loop on this topic, for work I'm
> building a strongly normalizing language that supports both strict and call
> by need evaluation strategies.
>
>
> On Friday, October 23, 2015, Ryan Newton <rrnewton at gmail.com> wrote:
>
>>
>>> 1. Small tweaks: The CMM code above seems to be *betting* than the
>>> thunk is unevaluated, because it does the stack check and stack write
>>> *before* the predicate test that checks if the thunk is evaluated (if
>>> (R1 & 7 != 0) goto c3aO; else goto c3aP;). With a bang-pattern
>>> function, couldn't it make the opposite bet? That is, branch on whether
>>> the thunk is evaluated first, and then the wasted computation is only a
>>> single correctly predicted branch (and a read of a tag that we need to read
>>> anyway).
>>>
>>> Oh, a small further addition would be needed for this tweak. In the
>> generated code above "Sp = Sp + 8;" happens *late*, but I think it could
>> happen right after the call to the thunk. In general, does it seem
>> feasible to separate the slowpath from fastpath as in the following tweak
>> of the example CMM?
>>
>>
>> * // Skip to the chase if it's already evaluated:*
>> * start:*
>> * if (R2 & 7 != 0) goto fastpath; else goto slowpath;*
>>
>> * slowpath: // Formerly c3aY*
>> * if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;*
>> * c3aZ:*
>> * // nop*
>> * R1 = PicBaseReg + foo_closure;*
>> * call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;*
>> * c3b0:*
>> * I64[Sp - 8] = PicBaseReg + block_c3aO_info;*
>> * R1 = R2;*
>> * Sp = Sp - 8;*
>>
>> * call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;*
>> * // Sp bump moved to here so it's separate from "fastpath"*
>> * Sp = Sp + 8;*
>>
>> * fastpath: // Formerly c3aO*
>> * if (R1 & 7 >= 2) goto c3aW; else goto c3aX;*
>> * c3aW:*
>> * R1 = P64[R1 + 6] & (-8);*
>> * call (I64[R1])(R1) args: 8, res: 0, upd: 8;*
>> * c3aX:*
>> * R1 = PicBaseReg + lvl_r39S_closure;*
>> * call (I64[R1])(R1) args: 8, res: 0, upd: 8;*
>>
>>
>>
>>
> _______________________________________________
> ghc-devs mailing list
> ghc-devs at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20151025/bc138f64/attachment-0001.html>
More information about the ghc-devs
mailing list