SIMD/SSE support & alignment

Tue Mar 12 15:09:26 CET 2013

On 03/10/2013 09:52 PM, Nicolas Trangez wrote:
> All,
>
> I've been toying with the SSE code generation in GHC 7.7 and Geoffrey
> Mainland's work to integrate this into the 'vector' library in order to
> generate SIMD code from high-level Haskell code.
>
> While working with this, I wrote some simple code for testing purposes,
> then compiled this into LLVM IR and x86_64 assembly form in order to
> figure out how 'good' the resulting code would be.
>
> First and foremost: I'm really impressed. Whilst there's most certainly
> room for improvement (one of them touched in this mail, though I also
> noticed unnecessary constant memory reads inside a tight loop), the
> initial results look very promising, especially taking into account how
> high-level the source code is. This is pretty amazing!
>
> As an example, here's 'test.hs':
>
> {-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields
> #-}
> module Test (sum) where
>
> import Prelude hiding (sum)
> import Data.Int (Int32)
> import Data.Vector.Unboxed (Vector)
> import qualified Data.Vector.Unboxed as U
>
> sum :: Vector Int32 -> Int32
> sum v = U.mfold' (+) (+) 0 v
>
> When compiling this into assembly (compiler/library version details at
> the end of this message), the 'sum' function yields (among other things)
> this code:
>
> .LBB2_3: # %c1C0
> # =>This Inner Loop Header:
> Depth=1
> prefetcht0 (%rsi)
> movdqu -1536(%rsi), %xmm1
> paddd %xmm1, %xmm0
> addq $16, %rsi
> addq $4, %rcx
> cmpq %rdx, %rcx
> jl .LBB2_3
>
> The full LLVM IR and assembler output are attached to this message.
>
> Whilst this is a nice and tight loop, I noticed the use of 'movdqu',
> which is used for non-128bit aligned memory access in SSE code. For
> aligned memory, 'movdqa' can be used, and this can have a major
> performance impact.
>
> Whilst I understand why this code is currently generated as-is (also in
> other sample inputs), I wondered whether there are plans/approaches to
> tackle this. In some cases (e.g. in 'sum') this could be by using the
> scalar calculation at the beginning of the vector up until an aligned
> boundary, then use aligned access and handle the tail using scalars
> again, but I assume OTOH that's not trivial when multiple 'source'
> vectors are used in the calculation.
>
> This might even become more complex when using AVX code, which needs
> 256bit alignments.
>
> Whilst I can't propose an out-of-the-box solution, I'd like to point at
> the 'vector-simd' code [1] I wrote some months ago, which might propose
> some ideas. In this package, I created an unboxed vector-like type whose
> alignment is tracked at type level, and functions which consume a vector
> define the minimal required alignment. As such, vectors can be allocated
> at the minimal alignment they're required to be, throughout all code
> using them.
>
> As an example, if I'd use this code (OTOH):
>
> sseFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A16 o2)
> => Vector o1 a -> Vector o2 a
> sseFoo = undefined
>
> avxFoo :: (Storable a, AlignedToAtLeast A32 o1, AlignedToAtLeast A32 o2,
> AlignedToAtLeast A32 o3) => Vector o1 a -> Vector o2 a -> Vector o3 a
> avxFoo = undefined
>
> the type of
>
> combinedFoo v = avxFoo sv sv
> where
> sv = sseFoo v
>
> would automagically be
>
> combinedFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast
> A32 o2) => Vector o1 a -> Vector o2 a
>
> and when using this
>
> v1 = combinedFoo (Vector.fromList [1 :: Int32, 2, 3, 4, 5, 6, 7, 8])
>
> the allocated argument vector (result of Vector.fromList) will be
> 16byte-aligned as expected/required for the SSE function to work with
> unaligned loads internally (assuming no unaligned slices are supported,
> etc), whilst the intermediate result of 'sseFoo' ('sv') will be 32-byte
> aligned as required by 'avxFoo'.
>
> Attached: test.ll and test.s, compilation results of test.hs using
>
> $ ghc-7.7.20130302 -keep-llvm-files
> -package-db=cabal-dev/packages-7.7.20130302.conf -fforce-recomp -S
> test.hs
>
> GHC from HEAD/master compiled on my Fedora 18 system using system LLVM
> (3.1), 'primitive' 8aef578fa5e7fb9fac3eac17336b722cbae2f921 from
> git://github.com/mainland/primitive.git and 'vector'
> e1a6c403bcca07b4c8121753daf120d30dedb1b0 from
> git://github.com/mainland/vector.git
>
> Nicolas
>
> [1] https://github.com/NicolasT/vector-simd

Hi Nicolas,

Have you read our paper about the SIMD work? It's available here:

https://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-beats-C.pdf

The paper describes the issues involved with integrated SIMD
instructions with the vector fusion framework.

There are two primary issues with alignment: stack alignment and heap
alignment.

We cannot rely on the stack being properly aligned for AVX spills on any
platform, and LLVM's stack fixup code does not play well with GHC, so we
*rewrite* all AVX spill instructions to their unaligned counterparts. On
Win32 we must do the same for SSE.

Unboxed vectors are allocated by GHC, and it does not align memory on
16-byte boundaries, so our first cut at SSE intrinsics simply used
unaligned accesses. Obviously with ForeignPtr's we can control alignment
and potentially use the aligned variants of SSE instructions, but this
will almost double the number of primops. One could imagine extending
our fusion framework to transition to aligned move instructions.

Finally, LLVM 3.2 does not work with GHC. This means we cannot yet take
advantage of its new vectorization optimizations, which is a shame.

So, four projects for you or anyone else who is interested, in rough
dependency order:

1) Get LLVM 3.2 working with GHC's LLVM back end.

2) Fix the stack alignment issue with LLVM. This will likely require a
patch to LLVM.

3) Add support for aligned move primops.

4) Extend the current SIMD fusion framework to handle transitioning to
aligned move instructions. As an alternative, only use aligned move
instructions on memory that we know is aligned.

These are all on my todo list, but my plate is quite full at the moment.

Geoff