<div dir="ltr"><div>Hi,<br></div><div><br></div><div>I'm interested in execution performance.</div><div><br></div><div>Maybe modern hardware (which implement IA64, ARMv8) is able to predict a long chain of jumps [1].</div><div>But prediction accuracy for indirect jump is low,</div><div>especially dynamic addressed indirect jumps.</div><div><br></div><div><br></div><div>By the way, Ryan's example code will be fast by following optimization:</div><div>(If c3aX is most fast path, c3aX is reached without taken-branch.)</div><div><br></div><div> // Skip to the chase if it's already evaluated:</div><div> start:</div><div>// if (R2 & 7 != 0) goto fastpath; else goto slowpath;</div><div> if (R2 & 7 == 0) goto slowpath; // *** (1) remove branch for fastpath</div><div><br></div><div> fastpath: // Formerly c3aO // *** (1) move fastpath here</div><div>// if (R1 & 7 >= 2) goto c3aW; else goto c3aX;</div><div> if (R1 & 7 >= 2) goto c3aW; // *** (2) remove branch for prior path(c3aX)</div><div> c3aX: // *** (2) move else path to here(without branch)</div><div> R1 = PicBaseReg + lvl_r39S_closure;</div><div> call (I64[R1])(R1) args: 8, res: 0, upd: 8; // *** indirect jump, but fixed address (100% hit)</div><div> c3aW:</div><div> R1 = P64[R1 + 6] & (-8);</div><div> call (I64[R1])(R1) args: 8, res: 0, upd: 8; // *** indirect jump, dynamic address (hit or miss)</div><div>//c3aX:</div><div>// R1 = PicBaseReg + lvl_r39S_closure;</div><div>// call (I64[R1])(R1) args: 8, res: 0, upd: 8;</div><div><br></div><div> slowpath: // Formerly c3aY</div><div> if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;</div><div> c3aZ:</div><div> // nop</div><div> R1 = PicBaseReg + foo_closure;</div><div> call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;</div><div> c3b0:</div><div> I64[Sp - 8] = PicBaseReg + block_c3aO_info;</div><div> R1 = R2;</div><div> Sp = Sp - 8;</div><div><br></div><div> call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;</div><div> // Sp bump moved to here so it's separate from "fastpath"</div><div> Sp = Sp + 8;</div><div> goto fastpath; // ***</div><div><br></div><div>//fastpath: // Formerly c3aO</div><div>// if (R1 & 7 >= 2) goto c3aW; else goto c3aX;</div><div>//c3aW:</div><div>// R1 = P64[R1 + 6] & (-8);</div><div>// call (I64[R1])(R1) args: 8, res: 0, upd: 8;</div><div>//c3aX:</div><div>// R1 = PicBaseReg + lvl_r39S_closure;</div><div>// call (I64[R1])(R1) args: 8, res: 0, upd: 8;</div><div><br></div><div><br></div><div><br></div><div>[1]: Intel64 and IA-32 Architectures Optimization Reference Manual</div><div> <a href="http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf">http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf</a></div><div> 3.4 OPTIMIZING THE FRONT END</div><div> 2.3.2.3 Branch Prediction</div><div><br></div><div><br></div><div>I'm just studying and drawing about lazy evaluation.</div><div>This thread is helpful to me :)</div><div><br></div><div>Regards,</div><div>Takenobu</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">2015-10-25 5:53 GMT+09:00 Carter Schonwald <span dir="ltr"><<a href="mailto:carter.schonwald@gmail.com" target="_blank">carter.schonwald@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Doesn't modern hardware have pretty good branch prediction? In which case the order of the branches may not matter unless it's a long chain of calls? Vs say an inner loop that hasn't been inlined?<div><br></div><div>Either way, I'd love be stay in the loop on this topic, for work I'm building a strongly normalizing language that supports both strict and call by need evaluation strategies. <div><div class="h5"><span></span><br><br>On Friday, October 23, 2015, Ryan Newton <<a href="mailto:rrnewton@gmail.com" target="_blank">rrnewton@gmail.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><ol><li>Small tweaks: The CMM code above seems to be <i>betting</i> than the thunk is unevaluated, because it does the stack check and stack write <i>before</i> the predicate test that checks if the thunk is evaluated (<span style="font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap;color:rgb(167,29,93)">if</span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap"> (R1 & </span><span style="font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap;color:rgb(0,134,179)">7</span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap"> != </span><span style="font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap;color:rgb(0,134,179)">0</span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap">) </span><span style="font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap;color:rgb(167,29,93)">goto</span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap"> c3aO; </span><span style="font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap;color:rgb(167,29,93)">else</span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap"> </span><span style="font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap;color:rgb(167,29,93)">goto</span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap"> c3aP;</span>). With a bang-pattern function, couldn't it make the opposite bet? That is, branch on whether the thunk is evaluated first, and then the wasted computation is only a single correctly predicted branch (and a read of a tag that we need to read anyway). <br></li></ol></div></div></blockquote><div>Oh, a small further addition would be needed for this tweak. In the generated code above "<span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap">Sp = Sp + </span><span style="font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap;color:rgb(0,134,179)">8</span><span style="color:rgb(51,51,51);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18.2px;white-space:pre-wrap">;"</span> happens <i>late</i>, but I think it could happen right after the call to the thunk. In general, does it seem feasible to separate the slowpath from fastpath as in the following tweak of the example CMM?</div><div><br></div><div><div><br></div><div><font face="monospace, monospace"><b> // Skip to the chase if it's already evaluated:</b></font></div><div><font face="monospace, monospace"><b> start:</b></font></div><div><font face="monospace, monospace"><b> if (R2 & 7 != 0) goto fastpath; else goto slowpath;</b></font></div><div><font face="monospace, monospace"><b><br></b></font></div><div><font face="monospace, monospace"><b> slowpath: // Formerly c3aY</b></font></div><div><font face="monospace, monospace"><b> if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;</b></font></div><div><font face="monospace, monospace"><b> c3aZ:</b></font></div><div><font face="monospace, monospace"><b> // nop</b></font></div><div><font face="monospace, monospace"><b> R1 = PicBaseReg + foo_closure;</b></font></div><div><font face="monospace, monospace"><b> call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;</b></font></div><div><font face="monospace, monospace"><b> c3b0:</b></font></div><div><font face="monospace, monospace"><b> I64[Sp - 8] = PicBaseReg + block_c3aO_info;</b></font></div><div><font face="monospace, monospace"><b> R1 = R2;</b></font></div><div><font face="monospace, monospace"><b> Sp = Sp - 8;</b></font></div><div><br></div><div><font face="monospace, monospace"><b> call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;</b></font></div><div><font face="monospace, monospace"><b> // Sp bump moved to here so it's separate from "fastpath"</b></font></div><div><font face="monospace, monospace"><b> Sp = Sp + 8;</b></font></div><div><font face="monospace, monospace"><b><br></b></font></div><div><font face="monospace, monospace"><b> fastpath: // Formerly c3aO</b></font></div><div><font face="monospace, monospace"><b> if (R1 & 7 >= 2) goto c3aW; else goto c3aX;</b></font></div><div><font face="monospace, monospace"><b> c3aW:</b></font></div><div><font face="monospace, monospace"><b> R1 = P64[R1 + 6] & (-8);</b></font></div><div><font face="monospace, monospace"><b> call (I64[R1])(R1) args: 8, res: 0, upd: 8;</b></font></div><div><font face="monospace, monospace"><b> c3aX:</b></font></div><div><font face="monospace, monospace"><b> R1 = PicBaseReg + lvl_r39S_closure;</b></font></div><div><font face="monospace, monospace"><b> call (I64[R1])(R1) args: 8, res: 0, upd: 8;</b></font></div></div><div><br></div><div><br></div><div><br></div></div></div></div>
</blockquote></div></div></div>
<br>_______________________________________________<br>
ghc-devs mailing list<br>
<a href="mailto:ghc-devs@haskell.org">ghc-devs@haskell.org</a><br>
<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs" rel="noreferrer" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs</a><br>
<br></blockquote></div><br></div>