<div dir="ltr">My earlier experiment was on GHC-7.10.3. I repeated this on GHC-8.0.1 and the assembly traced was exactly the same except for a marginal improvement. The 8.0.1 code generator removed the r14/r11 swap but the rest of the register ring shift remains the same. I have updated the github gist with the 8.0.1 trace:<div><br></div><div><a href="https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447">https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447</a><br></div><div><br></div><div>thanks,</div><div>harendra</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 13 June 2016 at 00:08, Harendra Kumar <span dir="ltr"><<a href="mailto:harendra.kumar@gmail.com" target="_blank">harendra.kumar@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi,<div><br></div><div>I am implementing unicode normalization in Haskell. I challenged myself to match the performance with the best C/C++ implementation, the best being the ICU library. I am almost there, beating it in one of the benchmarks and within 30% for others. I am out of all application level tricks that I could think of and now need help from the compiler.</div><div><br></div><div>I started with a bare minimum loop and adding functionality incrementally watching where the performance trips. At one point I saw that adding just one 'if' condition reduced the performance by half. I looked at what's going on at the assembly level. Here is a github gist of the assembly instructions executed in the fast path of the loop, corresponding cmm snippets and also the full cmm corresponding to the loop:</div><div><br></div><div><a href="https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447" target="_blank">https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447</a><br></div><div><br></div><div>I have annotated the assembly code with labels matching the corresponding CMM.</div><div><br></div><div>With the addition of another "if" condition the loop which was pretty simple till now suddenly got bloated with a lot of register reassignments. Here is a snippet of the register movements added:</div><div><br></div><div><div># _n4se:</div><div># swap r14 <-> r11</div><div>=> 0x408d6b:<span style="white-space:pre-wrap">  </span>mov    %r11,0x98(%rsp)</div><div>=> 0x408d73:<span style="white-space:pre-wrap">  </span>mov    %r14,%r11</div><div>=> 0x408d76:<span style="white-space:pre-wrap">        </span>mov    0x98(%rsp),%r14</div><div><br></div><div># reassignments</div><div># rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx</div><div>=> 0x408d7e:<span style="white-space:pre-wrap"> </span>mov    %rbx,0x90(%rsp)</div><div>=> 0x408d86:<span style="white-space:pre-wrap">  </span>mov    %rcx,%rbx</div><div>=> 0x408d89:<span style="white-space:pre-wrap">        </span>mov    %rdx,%rcx</div><div>=> 0x408d8c:<span style="white-space:pre-wrap">        </span>mov    %rsi,%rdx</div><div>=> 0x408d8f:<span style="white-space:pre-wrap">        </span>mov    %rdi,%rsi</div><div>=> 0x408d92:<span style="white-space:pre-wrap">        </span>mov    %r8,%rdi</div><div>=> 0x408d95:<span style="white-space:pre-wrap"> </span>mov    %r9,%r8</div><div>=> 0x408d98:<span style="white-space:pre-wrap">  </span>mov    %r10,%r9</div><div>=> 0x408d9b:<span style="white-space:pre-wrap"> </span>mov    0x90(%rsp),%r10</div></div><div>.</div><div>.</div><div>.</div><div>loop logic here which uses only %rax, %r10 and %r9 . </div><div>.</div><div>.</div><div>.</div><div><div># _n4s8:</div><div># shuffle back to original assignments</div><div>=> 0x4090dc:<span style="white-space:pre-wrap">  </span>mov    %r14,%r11</div><div>=> 0x4090df:<span style="white-space:pre-wrap">        </span>mov    %r9,%r10</div><div>=> 0x4090e2:<span style="white-space:pre-wrap"> </span>mov    %r8,%r9</div><div>=> 0x4090e5:<span style="white-space:pre-wrap">  </span>mov    %rdi,%r8</div><div>=> 0x4090e8:<span style="white-space:pre-wrap"> </span>mov    %rsi,%rdi</div><div>=> 0x4090eb:<span style="white-space:pre-wrap">        </span>mov    %rdx,%rsi</div><div>=> 0x4090ee:<span style="white-space:pre-wrap">        </span>mov    %rcx,%rdx</div><div>=> 0x4090f1:<span style="white-space:pre-wrap">        </span>mov    %rbx,%rcx</div><div>=> 0x4090f4:<span style="white-space:pre-wrap">        </span>mov    %rax,%rbx</div><div>=> 0x4090f7:<span style="white-space:pre-wrap">        </span>mov    0x88(%rsp),%rax</div><div><br></div><div>=> 0x4090ff:<span style="white-space:pre-wrap"> </span>jmpq   0x408d2a</div></div><div><br></div><div><br></div><div>The registers seem to be getting reassigned here, data flowing from one to the next. In this particular path a lot of these register movements seem unnecessary and are only undone at the end without being used. </div><div><br></div><div>Maybe this is because these are reusable blocks and the movement is necessary when used in some other path?</div><div><br></div><div>Can this be avoided? Or at least avoided in a certain fast path somehow by hinting the compiler? Any pointers to the GHC code will be appreciated. I am not yet much familiar with the GHC code but can dig deeper pretty quickly. But before that I hope someone knowledgeable in this area can shed some light on this at a conceptual level or if at all it can be improved. I can provide more details and experiment more if needed.</div><div><br></div><div>Thanks,</div><div>Harendra</div></div>

</blockquote></div><br></div>