wishlist item: more control over branch prediction

Duncan Coutts duncan.coutts at worc.ox.ac.uk
Tue Aug 8 08:47:23 EDT 2006


Now that GHC is so good at producing really fast low level code (see
the ByteString benchmarks), we are starting to run into low level
gubbins where, to get the fastest possible code, we need slightly more
influence over branch prediction.

In the code for ByteString's 'writeStrUp/Dn' functions we're doing a
tight loop writing bytes to memory.

The first version looks much like this:

loop fp n off s = case next s of
    Done       -> return $! PS fp 0 off
    Skip    s' -> loop fp n off s'
    Yield x s' -> do
             withForeignPtr fp $ \p -> pokeByteOff p off x
             loop fp n (off+1) s'

When we get to the end of a memory block we need to double the size of
the memory block and carry on. So this adds an additional conditional
test into this loop.

loop fp n off s = case next s of
    Done       -> trimUp fp n off
    Skip    s' -> loop fp n off s'
    Yield x s'
        | n == off -> realloc fp n off s' x
        | otherwise -> do
             withForeignPtr fp $ \p -> pokeByteOff p off x
             loop fp n (off+1) s'
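
For context, realloc here grows the buffer and carries on with the
loop. Roughly, it does something like this (only a sketch, not the
actual ByteString code; it reuses the names from the loop above and
assumes the usual Foreign.ForeignPtr / Foreign.Marshal.Utils imports):

realloc fp n off s x = do
    -- allocate a buffer of twice the size and copy across what we
    -- have written so far
    let n' = n * 2
    fp' <- mallocForeignPtrBytes n'
    withForeignPtr fp  $ \p  ->
      withForeignPtr fp' $ \p' ->
        copyBytes p' p off
    -- write the byte we were holding and carry on in the new buffer
    withForeignPtr fp' $ \p' -> pokeByteOff p' off x
    loop fp' n' (off+1) s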

There are dramatic differences between equivalent forms of this code.
Just by reversing the order of the (n==off) test, one form processes
50Mb of data in 0.20 seconds while the other takes 0.34 seconds.

That is:

        | n == off -> realloc fp n off s' x
        | otherwise -> do ...

vs

        | n /= off -> do ...
        | otherwise -> realloc fp n off s' x

I think that this is all down to branch prediction and the arrangement
of basic blocks. On a modern CPU correctly predicted branches are nearly
free while mis-predicted branches are very costly due to stalled
pipelines etc.

gcc has a branch prediction framework. It annotates conditionals with
prediction information and then uses that during code generation to do
things like arranging basic blocks so that unlikely blocks are moved
out of line. It gets the prediction information from a few sources.
One is a simple static branch predictor: for example, branches back to
the beginning of a loop are likely to be taken and branches to error
functions are not. gcc also has a profile feedback system to find the
real probabilities of branches from a sample run of the program.
Finally, there is a user annotation, __builtin_expect(), which many C
projects use in the form of a macro:

if (unlikely(x==0)) { ...

The Linux kernel uses this fairly heavily to move error handling code
out of the main 'fast path' code.


Anyway, so what I wish is that I could write something like:

loop fp n off s = case next s of
    Done       -> {-# UNLIKELY #-}
                  trimUp fp n off
    Skip    s' -> loop fp n off s'
    Yield x s'
        | n == off -> {-# UNLIKELY #-}
                      realloc fp n off s' x
        | otherwise -> do
             withForeignPtr fp $ \p -> pokeByteOff p off x
             loop fp n (off+1) s'

The best approximation I can use at the moment is to make the unlikely
realloc branch a local function in a where clause, but mark it with
{-# NOINLINE #-} so that the code for the realloc case doesn't pollute
the tight loop for the normal fast case (see the sketch below). The
other approximation is fiddling with the logic of the binary test and
looking at the benchmarks. The combination of these two techniques
makes a dramatic difference to the speed.
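
Concretely, the shape is something like this (again only a sketch; the
surrounding definitions are as in the loops above, and 'grow' is just
a local name for the out-of-line realloc case):

loop fp n off s = case next s of
    Done       -> trimUp fp n off
    Skip    s' -> loop fp n off s'
    Yield x s'
        | n /= off  -> do
             withForeignPtr fp $ \p -> pokeByteOff p off x
             loop fp n (off+1) s'
        | otherwise -> grow x s'
  where
    -- the cold realloc path, kept out of the code for the hot loop
    {-# NOINLINE grow #-}
    grow x s' = realloc fp n off s' x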

However, I need more control to do this reliably: while I was able to
make it work for writeStrUp, I could not find a combination that works
for writeStrDn - we lose about 50% of the performance when we add the
realloc branch. So a slightly more reliable way to hint at the
likelihood of conditionals would make this kind of low level code
faster and easier to write.

Some time ago as a quick hack I did add a branch prediction annotation
to the CMM conditional constructor. I only used it in the C backend to
turn it into uses of gcc's __builtin_expect(). I also didn't connect it
to the front end so I didn't have any high level static branch
prediction (which could be done by looking for branches with recursive
calls or calls to error etc). The only place I did actually use it was
in the heap check and stack check. I assumed that it would be
advantageous to predict heap and stack overflow branches as not taken.
It was only a quick hack - I didn't do any benchmarks to see if it had
any effect.
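
To illustrate the idea (purely a sketch with made-up types, not GHC's
actual Cmm representation): the conditional carries a likelihood hint
and the C backend wraps the condition in __builtin_expect() when it
has one.

data Likelihood = Likely | Unlikely | Unknown

-- a C condition (as text, for simplicity) together with a hint about
-- which way it usually goes
data Hinted = Hinted String Likelihood

-- print the condition, wrapping it in __builtin_expect() if we have
-- a hint
emitCond :: Hinted -> String
emitCond (Hinted e Likely)   = "__builtin_expect((" ++ e ++ "), 1)"
emitCond (Hinted e Unlikely) = "__builtin_expect((" ++ e ++ "), 0)"
emitCond (Hinted e Unknown)  = e

So the heap check, say, would get an Unlikely hint on its overflow
branch and come out as __builtin_expect((..), 0) in the generated C.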


Anyway, just a thought, and of course not something we should be
thinking about 'til post 6.6.

Duncan


