simd branch ready for review

Tue Feb 5 01:36:47 CET 2013

On 02/04/2013 11:56 PM, Johan Tibell wrote:
> On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland <mainland at apeiron.net> wrote:
>
>> What would a sensible fallback be for AVX instructions? What should we
>> fall back on when the LLVM backend is not being used?
>
> Depends on the instruction. A 256-bit multiply could be replaced by N
> multiplies etc. For popcount we have a little bit of C code in
> ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to
> emit some different assembly in e.g. the x86-64 backend if AVX isn't
> enabled.
>
>> Maybe we could desugar AVX instructions to SSE instructions on platforms
>> that support SSE but not AVX, but in practice people would then #ifdef
>> anyway and just use SSE if AVX weren't available.
>
> I don't follow here. If you conditionally emitted different
> instructions in the backends depending on which -m flags are passed to
> GHC, why would people #ifdef?

I think you are suggesting that the user should always use 256-bit
short-vector instructions, and that on platforms where AVX is not
available, this would fall back to an implementation that performed
multiple SSE instructions for each 256-bit vector instruction---and used
multiple XMM registers to hold each 256-bit vector value (or spilled).
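
To make that fallback concrete, here is a minimal sketch of what such a
desugaring amounts to. The types D2 and D4 and the functions mulD2 and
mulD4 are hypothetical boxed stand-ins chosen for illustration; they are
not the actual unboxed primop types, and this is not code from the simd
branch.

```haskell
-- Hypothetical stand-in for a 128-bit SSE register holding two Doubles.
data D2 = D2 !Double !Double
  deriving (Eq, Show)

-- Hypothetical stand-in for a 256-bit value on a machine without AVX:
-- it occupies two 128-bit halves, i.e. two XMM registers (or spill slots).
data D4 = D4 !D2 !D2
  deriving (Eq, Show)

-- One 128-bit element-wise multiply.
mulD2 :: D2 -> D2 -> D2
mulD2 (D2 a b) (D2 c d) = D2 (a * c) (b * d)

-- One 256-bit multiply desugared into two 128-bit multiplies.
mulD4 :: D4 -> D4 -> D4
mulD4 (D4 lo1 hi1) (D4 lo2 hi2) = D4 (mulD2 lo1 lo2) (mulD2 hi1 hi2)
```

Every 256-bit operation becomes two 128-bit operations, and every
256-bit value ties up two registers instead of one.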

Anyone using low-level primops should only do so if they really want
low-level control. The most efficient SSE implementation of a function
is not going to be whatever implementation falls out of a desugaring of
generic 256-bit short-vector primitives. Therefore, I suspect that
anyone using low-level vector primops like this will #ifdef and provide
two implementations---one for SSE, one for AVX. Anyone who doesn't care
about this level of detail should use a higher-level interface---which
we have already implemented---and which does not require any
ifdefs. People will #ifdef because they can provide better SSE
implementations than GHC when AVX instructions are not available.

I am suggesting that we push the "ifdefs" into a library. The vast
majority of programmers will never see the ifdefs, because they will use
the library.

I think you are suggesting that we push the "ifdefs" into GHC. That way
nobody will have a choice---they get whatever desugaring GHC gives them.

I understand your point of view---having primops that don't work
everywhere is a real pain and aesthetically unpleasing---but I prefer
exposing more low-level details in our primops even if it means a bit of
unpleasantness once in a while. This does mean a tiny segment of
programmers will have to deal with ifdefs, but I suspect that this tiny
segment of programmers would prefer ifdefs to a lack of control.

If a population count operation translates to a few extra instructions,
I don't think anyone will care. If a body of code performing
short-vector operations desugars to twice as many instructions that
require twice as many registers, thereby resulting in a bunch of extra
spills, it will matter. Put differently, there is a more-or-less
canonical desugaring of population count. For a given function using
short-vector instructions of one width, there is not a canonical
desugaring into a function using short-vector instructions of a lesser
width.
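
As an illustration of what I mean by a canonical desugaring, here is a
sketch of the standard branch-free bit-counting fallback for a 64-bit
word. The name popCountFallback is made up for this example, and this is
not the C code in ghc-prim; it only shows that the operation has an
obvious, self-contained software replacement.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word64)

-- Software population count for a 64-bit word (hypothetical name).
-- Sums bits in 2-bit, then 4-bit, then 8-bit groups, and finally adds
-- the eight byte counts with one wrapping multiply.
popCountFallback :: Word64 -> Int
popCountFallback x0 = fromIntegral ((x3 * 0x0101010101010101) `shiftR` 56)
  where
    -- Each 2-bit field holds the count of its two bits.
    x1 = x0 - ((x0 `shiftR` 1) .&. 0x5555555555555555)
    -- Each 4-bit field holds the count of its four bits.
    x2 = (x1 .&. 0x3333333333333333) + ((x1 `shiftR` 2) .&. 0x3333333333333333)
    -- Each byte holds the count of its eight bits.
    x3 = (x2 + (x2 `shiftR` 4)) .&. 0x0f0f0f0f0f0f0f0f
```

It should agree with Data.Bits.popCount on every input, whatever the
surrounding code looks like. I do not know of an analogous
drop-in replacement for a whole function written against 256-bit
vectors.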

>> The current idea is to hide the #ifdefs in a library. Clients of the
>> library would then get the "best" short-vector implementation available
>> for their platform by using this library. Right now this library is a
>> modified version of primitive, and I have modified versions of vector
>> and DPH that use this version of the primitive library to generate SSE
>> code.
>
> You would still end up with a GHC.Exts that exports a different API
> depending on which flags (e.g. -m<something>) are passed to
> GHC. Couldn't you use ghc-prim for your fallbacks and have
> GHC.Exts.yourPrimOp use either those fallbacks or the AVX
> instructions?

This is basically what I've implemented, except there is a Multi type
family that "picks" the appropriate short-vector representation for a
type, e.g., DoubleX2# for Double on machines with SSE, DoubleX4# for
Double on machines with AVX, and an accompanying set of short-vector
operations.
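
A rough sketch of the idea, not the actual code in the modified
primitive package: the class MultiType, its methods, the boxed stand-in
types DoubleX2 and DoubleX4, and the ASSUME_AVX macro are all
hypothetical names invented for this example. The real implementation
works with the unboxed primop types.

```haskell
{-# LANGUAGE CPP #-}
{-# LANGUAGE TypeFamilies #-}

-- Hypothetical boxed stand-ins for the unboxed short-vector types.
data DoubleX2 = DoubleX2 !Double !Double
data DoubleX4 = DoubleX4 !Double !Double !Double !Double

-- Multi picks the short-vector representation for an element type and
-- comes with a small set of operations on that representation.
class MultiType a where
  data Multi a
  multiReplicate :: a -> Multi a
  multiPlus      :: Multi a -> Multi a -> Multi a
  multiToList    :: Multi a -> [a]

-- The only conditional compilation lives here, inside the library.
-- ASSUME_AVX is a hypothetical macro standing in for the platform check.
instance MultiType Double where
#if defined(ASSUME_AVX)
  newtype Multi Double = MultiDouble DoubleX4
  multiReplicate x = MultiDouble (DoubleX4 x x x x)
  multiPlus (MultiDouble (DoubleX4 a b c d)) (MultiDouble (DoubleX4 e f g h)) =
    MultiDouble (DoubleX4 (a + e) (b + f) (c + g) (d + h))
  multiToList (MultiDouble (DoubleX4 a b c d)) = [a, b, c, d]
#else
  newtype Multi Double = MultiDouble DoubleX2
  multiReplicate x = MultiDouble (DoubleX2 x x)
  multiPlus (MultiDouble (DoubleX2 a b)) (MultiDouble (DoubleX2 c d)) =
    MultiDouble (DoubleX2 (a + c) (b + d))
  multiToList (MultiDouble (DoubleX2 a b)) = [a, b]
#endif

-- Client code is written once against Multi and contains no conditionals.
doubleAll :: Double -> [Double]
doubleAll x = multiToList (multiPlus v v)
  where
    v = multiReplicate x
```

A client like doubleAll gets two lanes or four depending on how the
library was built, without ever seeing the conditional.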

We have a concrete design and implementation---take a look at the
primitive, vector, and dph packages on my github page
(http://github.com/mainland). I would be very happy to discuss any
concrete alternative design. We also have a paper with some performance
measurements
(http://www.eecs.harvard.edu/~mainland/publications/mainland12simd.pdf). I
would not be thrilled with a design that resulted in significantly
worse benchmarks.

Geoff