SIMD/SSE support & alignment

Nicolas Trangez nicolas at incubaid.com
Wed Mar 13 21:46:34 CET 2013


I have one more remark, about the prefetch instructions included in the
compilation result (if I understood the paper correctly, they're there
on purpose).

On Sun, 2013-03-10 at 22:52 +0100, Nicolas Trangez wrote:
> As an example, here's 'test.hs':
> 
> {-# OPTIONS_GHC -fllvm -O3  -optlo-O3 -optlc-O=3 -funbox-strict-fields
> #-}
> module Test (sum) where
> 
> import Prelude hiding (sum)
> import Data.Int (Int32)
> import Data.Vector.Unboxed (Vector)
> import qualified Data.Vector.Unboxed as U
> 
> sum :: Vector Int32 -> Int32
> sum v = U.mfold' (+) (+) 0 v
> 
> When compiling this into assembly (compiler/library version details at
> the end of this message), the 'sum' function yields (among other
> things)
> this code:
> 
> .LBB2_3:                                # %c1C0
>                                         # =>This Inner Loop Header: Depth=1
>         prefetcht0      (%rsi)
>         movdqu  -1536(%rsi), %xmm1
>         paddd   %xmm1, %xmm0
>         addq    $16, %rsi
>         addq    $4, %rcx
>         cmpq    %rdx, %rcx
>         jl      .LBB2_3

If I'm not mistaken, this causes 'prefetcht0 (%rsi)' to be executed once
per 16-byte block, i.e. in every loop iteration.

This seems to be overkill: prefetch* loads a full cache-line, which
(according to some cursory reading online) is guaranteed to be at least
32 bytes. It seems to be 64 bytes on my CPU.
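
For anyone who wants to check the line size on their own machine, here is
a minimal sketch in C. It assumes a libc that exposes
_SC_LEVEL1_DCACHE_LINESIZE through sysconf (glibc does; other libcs may
not). The function name 'cache_line_size' and the fallback of 64 bytes
are my own assumptions, not something taken from the paper.

```c
#include <unistd.h>

/* Returns the L1 data cache line size in bytes as reported by the
   system. Falls back to 64 (an assumption that matches many x86-64
   CPUs) when the value is unavailable or reported as 0. */
long cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0)
        return n;
#endif
    return 64;
}
```

Some virtualised environments report 0 here, which is why the fallback
exists.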

As a result, with 64-byte cache lines, 4 (potentially unaligned!)
prefetch instructions are executed per cache line (64 / 16 = 4), whilst
there's no real use for 3 of them.
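
To make the idea concrete, here is a rough sketch in C of a loop that
issues one prefetch per cache line instead of one per 16 bytes. It is
only an illustration, not what GHC or LLVM generate: it assumes GCC or
Clang (for __builtin_prefetch) and a 64-byte line, and the names
'sum_prefetch_per_line', CACHE_LINE and PREFETCH_AHEAD are made up. The
1536-byte look-ahead simply mirrors the distance between the prefetch
and the load in the assembly above.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE     64    /* assumed cache line size in bytes */
#define PREFETCH_AHEAD 1536  /* look-ahead in bytes, as in the assembly */

/* Sums n 32-bit integers, issuing a single prefetch for every
   CACHE_LINE bytes of input rather than one per 16-byte block. */
int32_t sum_prefetch_per_line(const int32_t *v, size_t n)
{
    const size_t per_line = CACHE_LINE / sizeof(int32_t); /* 16 elements */
    uint32_t acc = 0;  /* unsigned, so overflow wraps like paddd does */
    size_t i = 0;

    for (; i + per_line <= n; i += per_line) {
        /* Read prefetch with high temporal locality; on x86 this
           normally becomes prefetcht0. The address is computed as an
           integer because it may lie beyond the end of the array;
           prefetching such an address does not fault. */
        __builtin_prefetch(
            (const void *)((uintptr_t)(v + i) + PREFETCH_AHEAD), 0, 3);
        for (size_t j = 0; j < per_line; j++)
            acc += (uint32_t)v[i + j];
    }

    /* Remaining elements that do not fill a whole cache line. */
    for (; i < n; i++)
        acc += (uint32_t)v[i];

    return (int32_t)acc;
}
```

Since successive prefetch addresses are exactly one line apart, each
cache line is touched once even when the vector itself is not aligned to
a line boundary. Whether this beats having no prefetch at all is a
separate question that would need measuring.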

Next to this: as written in the paper, having suitable 'prefetch'
instructions generated automatically can be cool, but alas: in some
benchmarks I performed some time ago on linear, well-aligned vectors
using SSE instructions (in C with inline assembly), removing the
prefetch instructions improved runtime performance (I guess due to
reduced opcode dispatch, and the processor's heuristic prefetcher doing
a good job when scanning over a linear memory range).

There might be some more interesting research in here ;-)

Nicolas



