[Haskell-cafe] Fastest saturating add?

Sun Jan 14 21:38:01 UTC 2024

On Sun, 14 Jan 2024, amindfv--- via Haskell-Cafe wrote:

> I have a function that's called many (many) times, which involves a 
> saturating addition (addition without overflow).
>
> In other words, I'd like ((254 :: Word8) + 10) to equal 255 (which is 
> maxBound::Word8), not 8 (which is what you get with overflow).
>
> My current code, without particular regard to performance, is simply:
>
>    satAdd :: Word8 -> Word8 -> Word8
>    satAdd x y =
>       let m = x + y
>       in if m < x then 255 else m
>
> This seems like the type of code somebody's already worked on the 
> performance for, though I can't find much Haskell-specific in searches 
> online.

LLVM is pretty good at optimizing such code snippets. You may first watch 
what code it generates. I guess you do not have to apply manual 
optimizations. But you might be curious how LLVM translates your code.

The CPU sets a flag when it encounters a carry (aka unsigned overflow). 
Then it has an instruction for doing a conditional move. That is, even 
without special saturated arithmetic instructions, unsigned saturated 
addition can be done in two or three basic instructions.

LLVM also has a saturating add intrinsic:
    https://llvm.org/docs/LangRef.html#saturation-arithmetic-intrinsics

x86 provides saturating addition in its SSE and AVX subsystems. They even 
allow you to perform up to 64 byte additions in one instruction. I have 
looked up that x86 has PADDUSB instruction.

Ok, lets watch LLVM:

$ cat saturated-add.ll
target triple = "x86_64-unknown-linux-gnu"
define i8 @satAdd(i8 %x, i8 %y) {
   %m = add i8 %x, %y
   %carry = icmp ult i8 %m, %x
   %satSum = select i1 %carry, i8 255, i8 %m
   ret i8 %satSum
}

$ llc-17 <saturated-add.ll
...
satAdd:                                 # @satAdd
         .cfi_startproc
# %bb.0:
         addb    %sil, %dil
         movzbl  %dil, %ecx
         movl    $255, %eax
         cmovael %ecx, %eax
         retq
...

Now with optimizations:

$ opt-17 -O2 <saturated-add.ll | llvm-dis-17
...
define i8 @satAdd(i8 %x, i8 %y) local_unnamed_addr #0 {
   %satSum = tail call i8 @llvm.uadd.sat.i8(i8 %x, i8 %y)
   ret i8 %satSum
}
...
declare i8 @llvm.uadd.sat.i8(i8, i8) #1

Thus, it has detected your manual saturated add implementation.

To what instruction will it be translated?

$ opt-17 -O2 <saturated-add.ll | llc-17
...
satAdd:                                 # @satAdd
# %bb.0:
         addb    %sil, %dil
         movzbl  %dil, %ecx
         movl    $255, %eax
         cmovael %ecx, %eax
         retq

Sadly, it remains the same. No special instructions.