<div dir="ltr">Hi,<div><br></div><div>I am implementing unicode normalization in Haskell. I challenged myself to match the performance with the best C/C++ implementation, the best being the ICU library. I am almost there, beating it in one of the benchmarks and within 30% for others. I am out of all application level tricks that I could think of and now need help from the compiler.</div><div><br></div><div>I started with a bare minimum loop and adding functionality incrementally watching where the performance trips. At one point I saw that adding just one 'if' condition reduced the performance by half. I looked at what's going on at the assembly level. Here is a github gist of the assembly instructions executed in the fast path of the loop, corresponding cmm snippets and also the full cmm corresponding to the loop:</div><div><br></div><div><a href="https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447">https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447</a><br></div><div><br></div><div>I have annotated the assembly code with labels matching the corresponding CMM.</div><div><br></div><div>With the addition of another "if" condition the loop which was pretty simple till now suddenly got bloated with a lot of register reassignments. Here is a snippet of the register movements added:</div><div><br></div><div><div># _n4se:</div><div># swap r14 <-> r11</div><div>=> 0x408d6b:<span class="" style="white-space:pre">   </span>mov    %r11,0x98(%rsp)</div><div>=> 0x408d73:<span class="" style="white-space:pre">    </span>mov    %r14,%r11</div><div>=> 0x408d76:<span class="" style="white-space:pre">  </span>mov    0x98(%rsp),%r14</div><div><br></div><div># reassignments</div><div># rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx</div><div>=> 0x408d7e:<span class="" style="white-space:pre">   </span>mov    %rbx,0x90(%rsp)</div><div>=> 0x408d86:<span class="" style="white-space:pre">    </span>mov    %rcx,%rbx</div><div>=> 0x408d89:<span class="" style="white-space:pre">  </span>mov    %rdx,%rcx</div><div>=> 0x408d8c:<span class="" style="white-space:pre">  </span>mov    %rsi,%rdx</div><div>=> 0x408d8f:<span class="" style="white-space:pre">  </span>mov    %rdi,%rsi</div><div>=> 0x408d92:<span class="" style="white-space:pre">  </span>mov    %r8,%rdi</div><div>=> 0x408d95:<span class="" style="white-space:pre">   </span>mov    %r9,%r8</div><div>=> 0x408d98:<span class="" style="white-space:pre">    </span>mov    %r10,%r9</div><div>=> 0x408d9b:<span class="" style="white-space:pre">   </span>mov    0x90(%rsp),%r10</div></div><div>.</div><div>.</div><div>.</div><div>loop logic here which uses only %rax, %r10 and %r9 . </div><div>.</div><div>.</div><div>.</div><div><div># _n4s8:</div><div># shuffle back to original assignments</div><div>=> 0x4090dc:<span class="" style="white-space:pre">    </span>mov    %r14,%r11</div><div>=> 0x4090df:<span class="" style="white-space:pre">  </span>mov    %r9,%r10</div><div>=> 0x4090e2:<span class="" style="white-space:pre">   </span>mov    %r8,%r9</div><div>=> 0x4090e5:<span class="" style="white-space:pre">    </span>mov    %rdi,%r8</div><div>=> 0x4090e8:<span class="" style="white-space:pre">   </span>mov    %rsi,%rdi</div><div>=> 0x4090eb:<span class="" style="white-space:pre">  </span>mov    %rdx,%rsi</div><div>=> 0x4090ee:<span class="" style="white-space:pre">  </span>mov    %rcx,%rdx</div><div>=> 0x4090f1:<span class="" style="white-space:pre">  </span>mov    %rbx,%rcx</div><div>=> 0x4090f4:<span class="" style="white-space:pre">  </span>mov    %rax,%rbx</div><div>=> 0x4090f7:<span class="" style="white-space:pre">  </span>mov    0x88(%rsp),%rax</div><div><br></div><div>=> 0x4090ff:<span class="" style="white-space:pre">   </span>jmpq   0x408d2a</div></div><div><br></div><div><br></div><div>The registers seem to be getting reassigned here, data flowing from one to the next. In this particular path a lot of these register movements seem unnecessary and are only undone at the end without being used. </div><div><br></div><div>Maybe this is because these are reusable blocks and the movement is necessary when used in some other path?</div><div><br></div><div>Can this be avoided? Or at least avoided in a certain fast path somehow by hinting the compiler? Any pointers to the GHC code will be appreciated. I am not yet much familiar with the GHC code but can dig deeper pretty quickly. But before that I hope someone knowledgeable in this area can shed some light on this at a conceptual level or if at all it can be improved. I can provide more details and experiment more if needed.</div><div><br></div><div>Thanks,</div><div>Harendra</div></div>