SIMD/SSE support & alignment

Tue Mar 12 21:33:00 CET 2013

On 03/12/2013 08:08 PM, Nicolas Trangez wrote:
> Hey,
>
> On Tue, 2013-03-12 at 14:09 +0000, Geoffrey Mainland wrote:
>> On 03/10/2013 09:52 PM, Nicolas Trangez wrote:
>>> ...
>>
>> Hi Nicolas,
>>
>> Have you read our paper about the SIMD work? It's available here:
>>
>> https://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-beats-C.pdf
>
> I didn't read that one before (read other stream-fusion related papers
> before), but did now. I got most of it already while reading the vector
> simd branch commits. The benchmark results look very nice!
>
> I'm afraid I didn't 'get' how the framework would allow for both AVX and
> SSE instructions to work on streams, since it seems to assume Multi's
> are always a fixed number of bytes wide (in this case 16 for SSE).

The width of a Multi depends on the platform. If the platform supports
AVX, the Multi will be 256 bits wide. Otherwise, it will be 128 bits
wide.
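
To make the width difference concrete, here is a minimal sketch of how
the lane count follows from the register width. The names SimdTarget,
registerBytes and lanes are made up for this illustration and are not
part of the vector library or the simd branch.

```haskell
-- Hypothetical helper for illustration only; not the vector library's API.
import Foreign.Storable (Storable, sizeOf)

-- | The two instruction sets discussed in this thread.
data SimdTarget = SSE | AVX
  deriving (Show, Eq)

-- | Register width in bytes: 128 bits for SSE, 256 bits for AVX.
registerBytes :: SimdTarget -> Int
registerBytes SSE = 16
registerBytes AVX = 32

-- | Number of elements of the argument's type that fit in one register.
-- The argument is only used to pick the element type.
lanes :: Storable a => SimdTarget -> a -> Int
lanes target x = registerBytes target `div` sizeOf x

main :: IO ()
main = do
  print (lanes SSE (0 :: Float))   -- four 4-byte floats in 16 bytes
  print (lanes AVX (0 :: Float))   -- eight 4-byte floats in 32 bytes
  print (lanes AVX (0 :: Double))  -- four 8-byte doubles in 32 bytes
```

Code written against the lane count rather than a fixed 16 bytes would
not need to change between the two targets.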

>> The paper describes the issues involved with integrating SIMD
>> instructions with the vector fusion framework.
>>
>> There are two primary issues with alignment: stack alignment and heap
>> alignment.
>>
>> We cannot rely on the stack being properly aligned for AVX spills on any
>> platform, and LLVM's stack fixup code does not play well with GHC, so we
>> *rewrite* all AVX spill instructions to their unaligned counterparts. On
>> Win32 we must do the same for SSE.
>
> Does this imply stack values are always 16-byte aligned?
> I haven't worked with AVX yet (my CPU doesn't support it).

On Linux, and, I believe, MacOS X, the stack is 16-byte aligned. On
Windows it is not.

>> Unboxed vectors are allocated by GHC, and it does not align memory on
>> 16-byte boundaries, so our first cut at SSE intrinsics simply used
>> unaligned accesses. Obviously with ForeignPtr's we can control alignment
>> and potentially use the aligned variants of SSE instructions, but this
>> will almost double the number of primops. One could imagine extending
>> our fusion framework to transition to aligned move instructions.
>
> Right. I created the patch of #7067
> (http://hackage.haskell.org/trac/ghc/ticket/7067) for vector-simd
> purposes back then (adding mallocForeignPtrAlignedBytes and
> mallocPlainForeignPtrAlignedBytes).
>
>> Finally, LLVM 3.2 does not work with GHC. This means we cannot yet take
>> advantage of its new vectorization optimizations, which is a shame.
>>
>> So, four projects for you or anyone else who is interested, in rough
>> dependency order:
>>
>> 1) Get LLVM 3.2 working with GHC's LLVM back end.
>
> According to other mails in this thread this should be fixed. I'll give
> it a go.

GHC doesn't bootstrap with LLVM 3.2 for me, so something seems
wrong. Let me know if you get it working.

>> 2) Fix the stack alignment issue with LLVM. This will likely require a
>> patch to LLVM.
>
> I'm afraid that's a bit out of my league for now :-)
>
>> 3) Add support for aligned move primops.
>
> I looked into this before, might give it a stab.
>
>> 4) Extend the current SIMD fusion framework to handle transitioning to
>> aligned move instructions. As an alternative, only use aligned move
>> instructions on memory that we know is aligned.
>
> This is why I sent my previous mail initially: is there any plan how to
> approach the 'memory that we know is aligned' bit? Would it make sense
> to have a more general 'alignment restriction' framework for arbitrary
> values, not only unboxed vectors (if there are any other use-cases)?

Not until the LLVM 3.2 and stack alignment issues are resolved.
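
For reference, "transitioning to aligned move instructions" in item 4
usually means handling a few leading elements one at a time until the
address reaches an aligned boundary, then running the aligned SIMD loop,
then finishing the leftover tail. The sketch below only computes that
three-way split. The function name splitForAlignment is hypothetical,
and it assumes the element size divides the alignment and that the start
address is a multiple of the element size. It is not code from the
fusion framework.

```haskell
-- Hypothetical illustration of the prefix/body/tail split; not library code.

-- | Split a run of @n@ elements starting at byte address @addr@ into
-- (scalar prefix, aligned SIMD body, scalar tail), counted in elements.
-- Assumes @elemSize@ divides @alignBytes@ and @addr@ is a multiple of
-- @elemSize@.
splitForAlignment :: Int -> Int -> Int -> Int -> (Int, Int, Int)
splitForAlignment alignBytes elemSize addr n = (prefix, body, rest)
  where
    lanesPerVec = alignBytes `div` elemSize
    misalign    = addr `mod` alignBytes
    -- Elements needed to reach the next aligned boundary.
    toBoundary
      | misalign == 0 = 0
      | otherwise     = (alignBytes - misalign) `div` elemSize
    -- Short inputs may end before the boundary is reached.
    prefix      = min n toBoundary
    -- Largest whole number of vectors that fits after the prefix.
    body        = ((n - prefix) `div` lanesPerVec) * lanesPerVec
    rest        = n - prefix - body

main :: IO ()
main =
  -- 11 floats starting 8 bytes past a 16-byte boundary.
  print (splitForAlignment 16 4 8 11)  -- prints (2,8,1)
```

Only the middle component would use aligned moves. The prefix and tail
would fall back to scalar or unaligned accesses.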

I'm not sure what a general alignment restriction framework would look
like. I think a framework for aligned Data.Vector.Unboxed/Storable
vectors makes a lot of sense though.
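
As a starting point for the Storable case, the allocation function from
#7067 mentioned above can already hand back memory with a chosen
alignment. The following is a minimal sketch using only base. The
helpers newAlignedBuffer and isAligned are names invented here, and the
32-byte alignment is just the AVX register width used as an example.

```haskell
-- Sketch only: newAlignedBuffer and isAligned are hypothetical helpers.
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Ptr (Ptr, ptrToIntPtr)
import GHC.ForeignPtr (mallocForeignPtrAlignedBytes)

-- | True if the pointer's address is a multiple of @alignBytes@.
isAligned :: Int -> Ptr a -> Bool
isAligned alignBytes p =
  fromIntegral (ptrToIntPtr p) `mod` alignBytes == (0 :: Int)

-- | Allocate @sizeBytes@ bytes whose start address is aligned to
-- @alignBytes@ (size first, alignment second).
newAlignedBuffer :: Int -> Int -> IO (ForeignPtr Word8)
newAlignedBuffer sizeBytes alignBytes =
  mallocForeignPtrAlignedBytes sizeBytes alignBytes

main :: IO ()
main = do
  fp <- newAlignedBuffer 1024 32
  ok <- withForeignPtr fp (pure . isAligned 32)
  putStrLn (if ok then "aligned" else "unaligned")
```

A check like isAligned only tells us about the start of the buffer. A
slice taken from the middle of such a vector can lose the alignment
again, which is one reason the alignment would probably have to be
tracked somewhere rather than assumed.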

>> These are all on my todo list, but my plate is quite full at the moment.
>
> Heh, sounds familiar ;-)
>
> Thanks,
>
> Nicolas
>