Re: Branchless implementation for literal case – is it worth it?

Mon Apr 20 01:31:33 UTC 2015

the optimization manual ben mentions is
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

it does a very good job laying out the scope of both *generality* and
*impact* for a huge range of systemsy tricks.

there are certain conditions when that transformation can be a win, but
they generally correspond with *inner loop* scenarios where bad branch
prediction heavily impacts performance. One class of algorithms that might
provide some grist for the optimization mill would be those in the hacker's
delight book http://www.hackersdelight.org/

I have written some code that has that bit fiddling vs bad branch
prediction encoding tradeoff before, i'll see if i can dig any of it up if
you'd like!

On Sun, Apr 19, 2015 at 10:41 AM, Ben Gamari <ben at smart-cactus.org> wrote:

> Joachim Breitner <mail at joachim-breitner.de> writes:
>
> > Dear devs,
> >
> > in  https://ghc.haskell.org/trac/ghc/ticket/10124 bgamari suggested that
> > code like
> >
> >         f :: Int -> Bool
> >         f a = case a of
> >                 1  -> True
> >                 5  -> True
> >                 8  -> True
> >                 9  -> True
> >                 11 -> True
> >                 19 -> True
> >                 _  -> False
> >         {-# NOINLINE f #-}
> >
> > should not compile to a series of conditional jumps, but rather a
> > branchless code akin to
> >
> >         let !p = a ==# 1 `orI#` a ==# 5 `orI#` a ==# 8 `orI#` a ==# 9
> >                          `orI#` a ==# 11 `orI#` a ==# 19
> >         case p of
> >           1 -> do_something
> >           0 -> do_something_else
> >
> Hi Joachim,
>
> Thanks for trying this and I'm sorry to hear that the optimization
> hasn't yielded the improvement I would have expected. I've been a bit
> silent on the issue as I've been working on finishing up my
> PhD. Hopefully in a couple of months after I'm settled in Germany I'll
> have time to delve into this more deeply.
>
> > Subsequently, I refactored the way we produce Cmm code from STG, opening
> > the possibility to implement this optimization at that stage¹.
> >
> > But when I then added this implementation, I could not measure any
> > runtime improvement, see
> > https://ghc.haskell.org/trac/ghc/ticket/10124#comment:15
> >
> > So my question to the general audience: Is such branchless code really
> > better than the current, branching code? Can someone provide us with an
> > example that shows that it is better? Do I need to produce different
> > branchless assembly?
> >
> Have you had a look at the Intel Optimization manual? I'll admit that I
> haven't looked myself but I'd imagine they'd probably have a few things
> to say about this.
>
> Cheers,
>
> - Ben
>
> _______________________________________________
> ghc-devs mailing list
> ghc-devs at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20150419/817ab08f/attachment-0001.html>