Re: Branchless implementation for literal case – is it worth it?

Mon Apr 20 14:29:12 UTC 2015

2015-04-19 9:44 GMT+02:00 Joachim Breitner <mail at joachim-breitner.de>:
> [...] So my question to the general audience: Is such branchless code really
> better than the current, branching code? Can someone provide us with an
> example that shows that it is better? Do I need to produce different
> branchless assembly? [...]

Just a few war stories regarding this from the trenches (= Chrome's
JavaScript JIT):

   * Branchless code in itself is a non-goal. What you care about is
performance and/or code size, but both don't have a direct
relationship to using branches or not.

   * "Hacker's Delight" is a cool book, but don't use the bit fiddling
tricks in there blindly. We actually reverted to straightforward code
with branches from some seemingly "better" branchless code, because
even with branches the performance was better.

   * Even within a processor family like x64, performance
characteristics vary vastly. What can be a performance improvement for
the modern beefy Xeon machine you're benchmarking on, can make things
worse for a low-end Atom. The same holds in the other direction.

   * The same holds for different architectures, i.e. an
"optimization" which makes things fast on most Intel cores could make
things worse on e.g. ARM cores (and vice versa).

   * On more powerful cores with heavy out-of-order execution, it's
hard to beat a well-predicted branch.

   * On Linux, the perf tool is your best friend. Without it you don't
have a clue what's making your code slower than expected, it could be
bad branch prediction, stalled units with the CPU, bad luck with
caches, etc.

   * Micro-benchmarks can be highly misleading, e.g. due to totally
different branching patterns, cache usage, etc.

In a nutshell: If you don't know the details of the architecture
you're compiling for, you simply don't know if the "optimization" you
have in mind actually makes things better or worse. Therefore these
kind of decision have to be pushed very far towards the end of the
compiler pipeline. Having some kind of feedback about previous runs of
the same code is very helpful, too, but this is a bit complicated in a
batch compiler (but doable).