SIMD/SSE support & alignment
Nicolas Trangez
nicolas at incubaid.com
Wed Mar 13 21:46:34 CET 2013
I had one more remark, about the prefetch instructions included in the
compilation result (if I understood the paper correctly, they're
there on purpose).
On Sun, 2013-03-10 at 22:52 +0100, Nicolas Trangez wrote:
> As an example, here's 'test.hs':
>
> {-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields
> #-}
> module Test (sum) where
>
> import Prelude hiding (sum)
> import Data.Int (Int32)
> import Data.Vector.Unboxed (Vector)
> import qualified Data.Vector.Unboxed as U
>
> sum :: Vector Int32 -> Int32
> sum v = U.mfold' (+) (+) 0 v
>
> When compiling this into assembly (compiler/library version details at
> the end of this message), the 'sum' function yields (among other things)
> this code:
>
> .LBB2_3: # %c1C0
> # =>This Inner Loop Header: Depth=1
> prefetcht0 (%rsi)
> movdqu -1536(%rsi), %xmm1
> paddd %xmm1, %xmm0
> addq $16, %rsi
> addq $4, %rcx
> cmpq %rdx, %rcx
> jl .LBB2_3
If I'm not mistaken, this results in 'prefetcht0 (%rsi)' being executed
once per 16-byte block, i.e. in every loop iteration.
This seems to be overkill: prefetch* loads a full cache-line, which
(according to some cursory reading online) is guaranteed to be at least
32 bytes. It seems to be 64 bytes on my CPU.
As a result, with 64-byte cache lines, 4 (potentially unaligned!)
prefetch instructions are executed per cache line, whilst there's no
real use for 3 of them.
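To make that arithmetic concrete, here is a minimal sketch that counts
how many of the issued prefetches target a cache line the previous
iteration already asked for. The function names are made up for this
illustration, and the model is deliberately naive: it only assumes a
prefetch pulls in the whole line containing its address, and that the
loop walks memory in increasing 16-byte steps as in the assembly above.

```haskell
import Data.List (group)

-- Hypothetical helpers for illustration only; not part of vector or GHC.

-- | Addresses touched by the prefetch in the loop: one per iteration,
-- advancing by 'stride' bytes (16 in the assembly quoted above).
prefetchAddrs :: Int -> Int -> Int -> [Int]
prefetchAddrs base stride iterations =
  [base + i * stride | i <- [0 .. iterations - 1]]

-- | Index of the cache line that contains a given address.
cacheLine :: Int -> Int -> Int
cacheLine lineSize addr = addr `div` lineSize

-- | Number of prefetches that hit the same cache line as the prefetch
-- just before them. Assumes the addresses are in increasing order, so
-- each run of equal line indices corresponds to one distinct line.
redundantPrefetches :: Int -> [Int] -> Int
redundantPrefetches lineSize addrs =
  length addrs - length (group (map (cacheLine lineSize) addrs))
```

In GHCi, 'redundantPrefetches 64 (prefetchAddrs 0 16 16)' evaluates to
12, i.e. 3 out of every 4 prefetches over those 16 iterations. With a
32-byte line size the same call gives 8, i.e. every second one.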
Next to this, as written in the paper, having suitable 'prefetch'
instructions generated automatically can be cool, but alas: in some
benchmarks I performed some time ago on linear, well-aligned vectors
using SSE instructions (using C and inline assembly), removing the
prefetch instructions improved runtime performance (I guess due to
fewer instructions being dispatched, and the processor's heuristic
prefetcher doing a good job when scanning over a linear memory range).
There might be some more interesting research in here ;-)
Nicolas