[Haskell-cafe] vector-simd: some code available, and some questions

Sat Jul 7 21:13:58 CEST 2012

All, 

After my message of yesterday [1] I got down to it and implemented
something along those lines. I created a playground repository
containing the code at [2]. Initial benchmark results at [3]. More about
the benchmark at the end of this email.

First some questions and requests for help:

- I'm stuck with a typing issue related to 'sizeOf' calculation at [4].
I tried a couple of things, but wasn't able to figure out how to fix it.
- I'm using unsafePerformIO at [5], yet I'm not certain it's OK to do
so. Are there better (safer, more performant, ...) ways to get this
working?
- Currently Alignment phantom types (e.g. A8 and A16) are not related to
each other: a function (like Data.Vector.SIMD.Algorithms.unsafeXorSSE42)
can have this signature:

unsafeXorSSE42 :: Storable a => SV.Vector SV.A16 a -> SV.Vector SV.A16 a
-> SV.Vector SV.A16 a

Yet, imagine I have an "SV.Vector SV.A32 Word8" vector at hand: the
function should accept it as well (a 32-byte aligned vector is also
16-byte aligned). Is there any way to encode this at the type level?
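
To make the alignment question concrete, here is a rough sketch of one
encoding that might work: a multi-parameter type class relating a required
alignment to an actual one. Everything in it (Vec, AlignedAtLeast, xorA16,
toList) is a hypothetical stand-in built on plain lists, not the code in
the repository, and the instances have to be enumerated by hand.

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE FlexibleContexts #-}

import Data.Bits (xor)
import Data.Word (Word8)

-- Phantom alignment tags, mirroring A8/A16 from the mail (A32 added here).
data A8
data A16
data A32

-- Hypothetical stand-in for the SIMD vector type: a list tagged with the
-- alignment 'o' that its storage is assumed to have.
newtype Vec o a = Vec [a]

toList :: Vec o a -> [a]
toList (Vec xs) = xs

-- 'AlignedAtLeast required actual' holds when storage aligned to 'actual'
-- also satisfies 'required'. Only valid pairs get an instance.
class AlignedAtLeast required actual

instance AlignedAtLeast A8 A8
instance AlignedAtLeast A8 A16
instance AlignedAtLeast A8 A32
instance AlignedAtLeast A16 A16
instance AlignedAtLeast A16 A32
instance AlignedAtLeast A32 A32

-- Accepts any inputs that are at least 16-byte aligned. The result is
-- tagged A16 because that is all this function would guarantee.
xorA16 :: (AlignedAtLeast A16 o1, AlignedAtLeast A16 o2)
       => Vec o1 Word8 -> Vec o2 Word8 -> Vec A16 Word8
xorA16 (Vec xs) (Vec ys) = Vec (zipWith xor xs ys)

v16 :: Vec A16 Word8
v16 = Vec [1, 2, 3, 4]

v32 :: Vec A32 Word8
v32 = Vec [255, 0, 3, 4]

-- Mixing A16 and A32 inputs type-checks. A 'Vec A8 Word8' argument would
-- be rejected, since there is no 'AlignedAtLeast A16 A8' instance.
mixed :: Vec A16 Word8
mixed = xorA16 v16 v32
```

The obvious downside is that the number of instances grows quadratically
with the number of alignment tags. Type-level naturals might avoid that,
but I haven't tried it.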

That's about it :-)

As of now, I only implemented a couple of the vector API functions (the
ones required to execute my benchmark). Adding the others should be
trivial.

The benchmark works with Data.Vector.{Unboxed|Storable}.Vector (UV and
SV) vectors of Word8 values, as well as my custom
Data.Vector.SIMD.Vector type (MV) using 16-byte alignment (MV.Vector
MV.A16 Word8).

benchUV, benchSV and benchMV all take 2 pre-calculated Word8 vectors of
given size (1024 and 4096) and xor them pairwise into the result using
"zipWith xor". benchMVA takes 2 suitable MV vectors and xor's them into
a third using a rather simple and unoptimized C implementation using
SSE4.2 intrinsics [6]. This could be enhanced quite a bit (I guess using
the prim calling convention, FFI overhead can be reduced as well).
Currently, only vectors whose size is a multiple of 32 bytes are
supported (mostly because of laziness on my part).

As you can see, the zipWith Data.Vector.SIMD implementation is slightly
slower than the Data.Vector.Storable based one. I haven't done much
profiling yet, but I suspect allocation and ForeignPtr creation are to
blame; these seem to be highly optimized in
GHC.ForeignPtr.mallocPlainForeignPtrBytes as used by
Data.Vector.Storable.
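
To illustrate both the allocation point and the unsafePerformIO question,
here is a minimal sketch using only base. It assumes GHC, whose
GHC.ForeignPtr module also exports mallocPlainForeignPtrAlignedBytes
(size, then alignment). Buf, fromList16, xorBuf, toList and isAligned16
are hypothetical names, and the byte-by-byte loop merely stands in for the
C routine. The usual argument for unsafePerformIO here would be that the
result buffer is freshly allocated and never mutated after it is returned,
though I'm not sure that is the whole story.

```haskell
import Data.Bits (xor)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Ptr (ptrToWordPtr)
import Foreign.Storable (peekByteOff, pokeByteOff)
import GHC.ForeignPtr (mallocPlainForeignPtrAlignedBytes)
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical stand-in for a 16-byte aligned byte vector: pointer + length.
data Buf = Buf !(ForeignPtr Word8) !Int

-- Copy a list into a freshly allocated, 16-byte aligned buffer.
fromList16 :: [Word8] -> Buf
fromList16 xs = unsafePerformIO $ do
  let n = length xs
  fp <- mallocPlainForeignPtrAlignedBytes n 16
  withForeignPtr fp $ \p ->
    mapM_ (\(i, x) -> pokeByteOff p i x) (zip [0 ..] xs)
  return (Buf fp n)

-- Pure wrapper around an in-place style xor. The inputs are only read and
-- the output is private until returned, so no caller can observe the writes.
xorBuf :: Buf -> Buf -> Buf
xorBuf (Buf fa n) (Buf fb m) = unsafePerformIO $ do
  let len = min n m
  fr <- mallocPlainForeignPtrAlignedBytes len 16
  withForeignPtr fa $ \pa ->
    withForeignPtr fb $ \pb ->
      withForeignPtr fr $ \pr ->
        mapM_ (\i -> do
                 a <- peekByteOff pa i :: IO Word8
                 b <- peekByteOff pb i
                 pokeByteOff pr i (a `xor` b))
              [0 .. len - 1]
  return (Buf fr len)

-- Read the buffer back into a list.
toList :: Buf -> [Word8]
toList (Buf fp n) =
  unsafePerformIO $ withForeignPtr fp $ \p ->
    mapM (peekByteOff p) [0 .. n - 1]

-- Check that the underlying address is a multiple of 16.
isAligned16 :: Buf -> Bool
isAligned16 (Buf fp _) =
  unsafePerformIO $ withForeignPtr fp $ \p ->
    return (ptrToWordPtr p `rem` 16 == 0)
```

Whether the aligned allocator is as cheap as mallocPlainForeignPtrBytes is
something I would still have to measure.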

Thanks for any input,

Nicolas

[1] http://www.haskell.org/pipermail/haskell-cafe/2012-July/102167.html
[2] https://github.com/NicolasT/vector-simd/
[3] http://linode2.nicolast.be/files/vector-simd-xor1.html
[4]
https://github.com/NicolasT/vector-simd/blob/master/src/Data/Vector/SIMD/Algorithms.hs#L46
[5]
https://github.com/NicolasT/vector-simd/blob/master/src/Data/Vector/SIMD/Algorithms.hs#L43
[6]
https://github.com/NicolasT/vector-simd/blob/master/cbits/vector-simd.c#L47