<div dir="auto">I thought Andreas Klebinger in particular might find this interesting.<div dir="auto"><br></div><div dir="auto">The talk is here: <a href="https://youtu.be/IAdLwUXRUvg?t=1867">https://youtu.be/IAdLwUXRUvg?t=1867</a></div><div dir="auto"><br></div><div dir="auto"><div dir="auto">(timestamp is 31:07)</div><div dir="auto"><br></div><div dir="auto">basically, if you have an interpreter consisting of a switch statement inside an infinite loop, there's a mechanical transformation using GOTOs that causes a 15% speedup</div><div dir="auto">in the switch statement blob, there's n + 1 basic blocks, each of which has a jmp back to the first basic block with probability 1, and the first basic block has a jmp that will go to one of the other n basic blocks with probability 1 / n</div><div dir="auto">whereas in the goto statement setup, there's n basic blocks, each of which has a jmp that will go to one of the other n - 1 basic blocks with probability 1 / (n - 1)</div><div dir="auto">so if your sequence of instructions is produced by a markov decision process, then the branch predictor will efficiently predict in the latter situation but not necessarily the former</div><div dir="auto">in the goto setup the callgraph looks like a complete graph on n vertices</div><div dir="auto">whereas in the switch setup it looks like a bipartite graph on (1, n) vertices (where the edges represent both "is called by" and "calls")</div><div dir="auto">and because of all those extra edges (n^2 rather than 2n) there's more spots for the branch predictor to build up data</div><div dir="auto"><br></div><div dir="auto">At 39:13 he explains that you get per-address jmp predictions. If you only share a single relative jmp, the cpu only knows how common every opcode is. But, if you have a jmp from each opcode, to the next, the cpu can determine how common an opcode A is followed an opcode B.</div><div dir="auto"><br></div><div dir="auto">He references godbolt (<a href="https://godbolt.org/">https://godbolt.org/</a>), which seems like a useful tool. and mentions that ruby, jvm (on android only?), and some arm compilers implement this.</div><div dir="auto"><br></div><div dir="auto">There are some tradeoffs - code size does increase. According to a GCC developer in the audience, GCC is not currently able to do this. I have no idea about Clang.</div><div dir="auto"><br></div><div dir="auto">Thanks</div></div></div>