[Haskell-cafe] vector, alignment and SIMD through FFI

Fri Jul 6 22:06:47 CEST 2012

Hello Cafe,

Recently I've been playing with the implementation of an algorithm, for
which we already have highly-optimized implementations available (in
plain C/C++ as well as OCaml with calls to C through FFI).

The algorithm works on buffers/arrays/vectors/whatever you want to call
it, which needs to be combined in certain ways. This can be highly
optimized by using SIMD instructions (like the ones provides by several
SSE versions).

I'd like to get a to Haskell version which is comparable in efficiency
as the existing versions, whilst remaining as 'functional' as possible.
I don't mind jumping into some low-level C glue and FFI (using ccall or
custom primops), but this should be limited.

Currently I have something working (still highly unoptimized) using
(unboxed) vectors from the vector package, using mutable versions within
a well-contained ST environment in some places.

One hot zone of the current version is combining several vectors, and
the performance of this operation could be greatly improved by using
SIMD instructions. There's one catch though: when using these, memory
should be aligned on certain boundaries (16 byte in this case).

First and foremost, to be able to pass my vectors to some C functions, I
should change my code into using Storable vectors (which should be fine,
I guess I can expect similar performance characteristics?). I couldn't
find any information on alignment guarantees of these vectors though...

Which is how I get to my question: are there any such guarantees? If
not, are there any pointers to how to proceed with this? I guess
tracking alignment at the type level should be the goal, you can find
some code trying to explain my reasoning up to now at the end of this
email.

I have some issues with this:

- I'd need to re-implement almost all vector operations, which seems
stupid.
- It doesn't actually work right now ;-)
- It'd be nice to be able to encode 'compatible' alignment: as an
example, a 16 byte aligned buffer is also 8 byte aligned...

I hope the above explains somewhat my goal. Any thoughts & help on this
would be very welcome!

Thanks,

Nicolas

module Data.Vector.SIMD (
    -- ...
) where

import qualified Data.Vector.Storable as SV

import Foreign.Storable (Storable, sizeOf)
import Foreign.Ptr (Ptr, FunPtr)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr)
import System.IO.Unsafe (unsafePerformIO)

class Alignment a where
    alignment :: a -> Int

data A8Byte
instance Alignment A8Byte where
    alignment _ = 8

data A16Byte
instance Alignment A16Byte where
    alignment _ = 16

newtype Alignment a => SIMDVector a b = V (SV.Vector b)

replicate :: (Alignment a, Storable b) => a -> Int -> b -> SIMDVector a
b
replicate a n b = V v
  where
    ptr = unsafePerformIO $ do
            v <- _mm_malloc n (alignment a)
            -- memset etc
            return v

    v = SV.unsafeFromForeignPtr0 ptr n

-- These are 2 _stub versions of the procedures since xmminstr.h (or
mm_malloc.h
-- when using GCC) contains them as inline procedures which are not
available
-- as-is in a library. There should be some C module which exports
-- _mm_malloc_stub and _mm_free_stub, which simply includes xmminstr.h
and calls
-- the underlying procedures.
foreign import ccall "_mm_malloc_stub" _mm_malloc_stub :: Int -> Int ->
IO (Ptr a)
foreign import ccall "_mm_free_stub" _mm_free_stub :: FunPtr (Ptr a ->
IO ())

_mm_malloc :: Storable a => Int -> Int -> IO (ForeignPtr a)
_mm_malloc s l = do
    -- This fails:
    -- Ambiguous type variable `a0' in the constraint:
    --   (Storable a0) arising from a use of `sizeOf'
    -- v <- c_mm_malloc (s * sizeOf (undefined :: a)) l
    newForeignPtr _mm_free_stub undefined

-- This allocates a 16 byte aligned output buffer, takes 2 existing ones
and
-- calls some FFI function to perform some magic.
-- The implementation could run inside ST, if the FFI import (which e.g.
works
-- on a mutable buffer and returns IO ()) is lifted into ST using
unsafeIOToST
mySIMDFun :: SIMDVector A16Byte a -> SIMDVector A16Byte a -> SIMDVector
A16Byte a
mySIMDFun a b = undefined