simd branch ready for review

Tue Feb 5 03:36:36 CET 2013

I notice that the linked paper (very nice results!) mentions prefetch
primops that were added to GHC.
Is there any documentation, current or pending?

https://github.com/mainland/vector/commit/cfce37d3a9c228fe4bdf627ffb777399f54af5e5#Data/Vector
seems to have the relevant primops mentioned in the paper.

thanks
-Carter

On Mon, Feb 4, 2013 at 7:36 PM, Geoffrey Mainland <mainland at apeiron.net> wrote:

> On 02/04/2013 11:56 PM, Johan Tibell wrote:
> > On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland <mainland at apeiron.net> wrote:
> >
> > > What would a sensible fallback be for AVX instructions? What should we
> > > fall back on when the LLVM backend is not being used?
> >
> > Depends on the instruction. A 256-bit multiply could be replaced by N
> > multiplies etc. For popcount we have a little bit of C code in
> > ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to
> > emit some different assembly in e.g. the x86-64 backend if AVX isn't
> > enabled.
> >
> > > Maybe we could desugar AVX instructions to SSE instructions on platforms
> > > that support SSE but not AVX, but in practice people would then #ifdef
> > > anyway and just use SSE if AVX weren't available.
> >
> > I don't follow here. If you conditionally emitted different
> > instructions in the backends depending on which -m flags are passed to
> > GHC, why would people #ifdef?
>
> I think you are suggesting that the user should always use 256-bit
> short-vector instructions, and that on platforms where AVX is not
> available, this would fall back to an implementation that performed
> multiple SSE instructions for each 256-bit vector instruction---and used
> multiple XMM registers to hold each 256-bit vector value (or spilled).
>
> Anyone using low-level primops should only do so if they really want
> low-level control. The most efficient SSE implementation of a function
> is not going to be whatever implementation falls out of a desugaring of
> generic 256-bit short-vector primitives. Therefore, I suspect that
> anyone using low-level vector primops like this will #ifdef and provide
> two implementations---one for SSE, one for AVX. Anyone who doesn't care
> about this level of detail should use a higher-level interface---which
> we have already implemented---and which does not require any
> ifdefs. People will #ifdef because they can provide better SSE
> implementations than GHC when AVX instructions are not available.
>
> I am suggesting that we push the "ifdefs" into a library. The vast
> majority of programmers will never see the ifdefs, because they will use
> the library.
>
> I think you are suggesting that we push the "ifdefs" into GHC. That way
> nobody will have a choice---they get whatever desugaring GHC gives them.
>
> I understand your point of view---having primops that don't work
> everywhere is a real pain and aesthetically unpleasing---but I prefer
> exposing more low-level details in our primops even if it means a bit of
> unpleasantness once in a while. This does mean a tiny segment of
> programmers will have to deal with ifdefs, but I suspect that this tiny
> segment of programmers would prefer ifdefs to a lack of control.
>
> If a population count operation translates to a few extra instructions,
> I don't think anyone will care. If a body of code performing
> short-vector operations desugars to twice as many instructions that
> require twice as many registers, thereby resulting in a bunch of extra
> spills, it will matter. Put differently, there is a more-or-less
> canonical desugaring of population count. For a given function using
> short-vector instructions of one width, there is not a canonical
> desugaring into a function using short-vector instructions of a lesser
> width.
>
> > > The current idea is to hide the #ifdefs in a library. Clients of the
> > > library would then get the "best" short-vector implementation available
> > > for their platform by using this library. Right now this library is a
> > > modified version of primitive, and I have modified versions of vector
> > > and DPH that use this version of the primitive library to generate SSE
> > > code.
> >
> > You would still end up with a GHC.Exts that exports a different API
> > depending on which flags (e.g. -m<something>) are passed to
> > GHC. Couldn't you use ghc-prim for your fallbacks and have
> > GHC.Exts.yourPrimOp use either those fallbacks or the AVX
> > instructions?
>
> This is basically what I've implemented, except there is a Multi type
> family that "picks" the appropriate short-vector representation for a
> type, e.g., DoubleX2# for Double on machines with SSE, DoubleX4# for
> Double on machines with AVX, and an accompanying set of short-vector
> operations.
>
> We have a concrete design and implementation---take a look at the
> primitive, vector, and dph packages on my github page
> (http://github.com/mainland). I would be very happy to discuss any
> concrete alternative design. We also have a paper with some performance
> measurements
> (http://www.eecs.harvard.edu/~mainland/publications/mainland12simd.pdf). I
> would not be thrilled with a design that resulted in significantly
> worse benchmarks.
>
> Geoff
>
>
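
The distinction drawn in the quoted message can be sketched in C, assuming
GCC or Clang on x86. Population count has one more-or-less canonical
fallback, whereas a 4-wide double addition is written once per vector width
and selected by #ifdefs hidden inside a library routine. The names
popcount64 and add4d are hypothetical and are not taken from ghc-prim,
primitive, or vector. The macros __POPCNT__, __AVX__, and __SSE2__ are the
ones GCC and Clang predefine when the corresponding -m flags are in effect.

```c
#include <stdint.h>

#if defined(__AVX__)
#include <immintrin.h>   /* 256-bit AVX intrinsics */
#elif defined(__SSE2__)
#include <emmintrin.h>   /* 128-bit SSE2 intrinsics */
#endif

/* Population count: the fallback is essentially canonical, so hiding the
 * choice from the caller costs only a few extra instructions. */
static unsigned popcount64(uint64_t x)
{
#if defined(__POPCNT__)
    /* GCC/Clang builtin; compiles to the POPCNT instruction here. */
    return (unsigned)__builtin_popcountll(x);
#else
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;      /* clear the lowest set bit */
        n++;
    }
    return n;
#endif
}

/* out[i] = a[i] + b[i] for i = 0..3.
 * Each width gets its own hand-written body; the caller never sees
 * the #ifdefs. */
static void add4d(const double *a, const double *b, double *out)
{
#if defined(__AVX__)
    /* One 256-bit add covering all four doubles. */
    _mm256_storeu_pd(out, _mm256_add_pd(_mm256_loadu_pd(a),
                                        _mm256_loadu_pd(b)));
#elif defined(__SSE2__)
    /* Two 128-bit adds, two doubles each. */
    _mm_storeu_pd(out,     _mm_add_pd(_mm_loadu_pd(a),
                                      _mm_loadu_pd(b)));
    _mm_storeu_pd(out + 2, _mm_add_pd(_mm_loadu_pd(a + 2),
                                      _mm_loadu_pd(b + 2)));
#else
    /* Scalar fallback for targets without either extension. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

A caller simply invokes add4d(a, b, out). Built with -mavx it performs one
256-bit add; otherwise it performs two 128-bit adds or a scalar loop. This
is only an illustration of the "ifdefs in a library" idea, not the design
used in the packages mentioned above.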