simd branch ready for review

Wed Feb 6 15:21:05 CET 2013

On 02/06/2013 09:24 AM, Simon Marlow wrote:
> On 05/02/13 10:34, Geoffrey Mainland wrote:
>> On 02/05/2013 09:06 AM, Simon Marlow wrote:
>>> On 05/02/13 00:36, Geoffrey Mainland wrote:
>>>> On 02/04/2013 11:56 PM, Johan Tibell wrote:
>>>>> On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland
>>>>> <mainland at apeiron.net> wrote:
>>>>>
>>>>> What would a sensible fallback be for AVX instructions? What should we
>>>>> fall back on when the LLVM backend is not being used?
>>>>>
>>>>> Depends on the instruction. A 256-bit multiply could be replaced by N
>>>>> multiplies etc. For popcount we have a little bit of C code in
>>>>> ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to
>>>>> emit some different assembly in e.g. the x86-64 backend if AVX isn't
>>>>> enabled.
>>>>>
>>>>> Maybe we could desugar AVX instructions to SSE instructions on
>>>>> platforms that support SSE but not AVX, but in practice people would
>>>>> then #ifdef anyway and just use SSE if AVX weren't available.
>>>>>
>>>>> I don't follow here. If you conditionally emitted different
>>>>> instructions in the backends depending on which -m flags are passed to
>>>>> GHC, why would people #ifdef?
>>>>
>>>> I think you are suggesting that the user should always use 256-bit
>>>> short-vector instructions, and that on platforms where AVX is not
>>>> available, this would fall back to an implementation that performed
>>>> multiple SSE instructions for each 256-bit vector instruction---and
>>>> used multiple XMM registers to hold each 256-bit vector value (or
>>>> spilled).
>>>>
>>>> Anyone using low-level primops should only do so if they really want
>>>> low-level control. The most efficient SSE implementation of a function
>>>> is not going to be whatever implementation falls out of a desugaring of
>>>> generic 256-bit short-vector primitives. Therefore, I suspect that
>>>> anyone using low-level vector primops like this will #ifdef and provide
>>>> two implementations---one for SSE, one for AVX. Anyone who doesn't care
>>>> about this level of detail should use a higher-level interface---which
>>>> we have already implemented---and which does not require any
>>>> ifdefs. People will #ifdef because they can provide better SSE
>>>> implementations than GHC when AVX instructions are not available.
>>>>
>>>> I am suggesting that we push the "ifdefs" into a library. The vast
>>>> majority of programmers will never see the ifdefs, because they
>>>> will use the library.
>>>>
>>>> I think you are suggesting that we push the "ifdefs" into GHC. That way
>>>> nobody will have a choice---they get whatever desugaring GHC gives
>>>> them.
>>>>
>>>> I understand your point of view---having primops that don't work
>>>> everywhere is a real pain and aesthetically unpleasing---but I prefer
>>>> exposing more low-level details in our primops even if it means a
>>>> bit of unpleasantness once in a while. This does mean a tiny segment
>>>> of programmers will have to deal with ifdefs, but I suspect that this
>>>> tiny segment of programmers would prefer ifdefs to a lack of control.
>>>>
>>>> If a population count operation translates to a few extra instructions,
>>>> I don't think anyone will care. If a body of code performing
>>>> short-vector operations desugars to twice as many instructions that
>>>> require twice as many registers, thereby resulting in a bunch of extra
>>>> spills, it will matter. Put differently, there is a more-or-less
>>>> canonical desugaring of population count. For a given function using
>>>> short-vector instructions of one width, there is not a canonical
>>>> desugaring into a function using short-vector instructions of a lesser
>>>> width.
>>>
>>> While I agree with Geoff, there's one thing we have to be careful
>>> about: inlining. If the primop is exposed via an inline definition,
>>> then either we have to check and disable the inlining if the primop is
>>> not available in the current compilation, or else prevent the inlining
>>> from being visible in the first place.
>>>
>>> I believe this is what Johan had in mind when he gave popcount a
>>> fallback. Geoff, maybe you've thought about this already - what's the
>>> plan for the vector library?
>>>
>>> Cheers,
>>> Simon
>>
>> Right now, the short-vector primops are only visible if you use the
>> -fllvm switch when compiling. If you compile the vector package with
>> -fllvm and then try to use this package with the native back end and an
>> SSE primop gets inlined, the native back end will error out and tell you
>> to use -fllvm. This is not a good solution.
>>
>> On the one hand, if you use an -msse4.2-compiled C library on a machine
>> without SSE 4.2 support, you should not expect it to work. I would be
>> fine with a world in which compiling the vector library with -mavx would
>> result in a package that the compiler would not allow the programmer to
>> use from a program that wasn't also compiled with -mavx, i.e., a world
>> in which the compiler checked flag compatibility. Having two back-ends
>> makes things more difficult, because we certainly don't want a package
>> compiled with -fllvm to be unusable from the native back end.
>>
>> I don't have a good solution. I am assuming that we decide that having
>> the set of available primops be a function of DynFlags is OK. Then there
>> are two problems.
>
> I think it will be difficult to make the set of primops vary depending
> on flags. The reason is that the contents of GHC.Prim is re-exported
> by various modules: GHC.Base and GHC.Exts for example, and each of
> those .hi files lists the names of the exported primops. So we can't
> change the set after these modules have been compiled. (well we can,
> but odd things will happen).
>
> So I think GHC.Prim should always export the full set of primops.
>
> It is OK for compilation to fail if the source code mentions an
> unsupported primop.
>
> What about unfoldings?
>
> We cannot have compilation failing if an unsupported primop gets
> inlined into the current module, that is a non-deterministic
> compilation failure.
>
> So then we have two options:
>
> 1) disable an unfolding if it contains an unsupported primop
> 2) implement unsupported primops via fallback C functions
>
> Both options lead to performance problems, so we want the compiler to
> warn if this happens. But we cannot fail the compilation.
>
> If we do (2), then we don't have to make it an error to use
> unsupported primops directly, but it should at least be a warning.
>
> Fallbacks are reasonably easy to implement I think: gcc provides
> generic vector operations that compile on any target (if I'm
> understanding the docs correctly).
>
> I suppose I don't mind whether we do (1) or (2).
>
> Cheers,
> Simon

To answer Johan's question from a separate email, LLVM is supposed to
lower vector instructions, but I have had it error out in odd ways on
certain platforms. The example I recall was projecting an element from a
vector using a non-constant index. So perhaps we can just rely on LLVM
to do the lowering, solving the inlining problem. I think users will
still want to test the various CPP defines to see whether or not, for
example, AVX instructions are really available, and provide alternate
implementations of low-level functions in case only SSE is
available.
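
For concreteness, here is a rough C sketch of the kind of generic
vector code under discussion, using the vector_size extension that both
gcc and clang accept (it is a compiler extension, not ISO C). The names
v8f, v8f_mul and v8f_extract are made up for illustration. The 256-bit
multiply compiles on any target, and the compiler lowers it to whatever
the target actually has. The second function is the non-constant-index
projection mentioned above.

```c
/* 8 x float = 256 bits. vector_size is a GCC/Clang extension. */
typedef float v8f __attribute__((vector_size(32)));

/* Element-wise 256-bit multiply. Without AVX the compiler is expected
   to lower this to narrower vector or scalar instructions. */
void v8f_mul(const v8f *a, const v8f *b, v8f *out)
{
    *out = *a * *b;
}

/* Project one element using an index that is not a compile-time
   constant. The mask keeps the index inside the 8 lanes. */
float v8f_extract(const v8f *v, unsigned i)
{
    return (*v)[i & 7u];
}
```

The sketch passes vectors by pointer rather than by value, so that it
does not depend on whether 256-bit values travel in ymm registers.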

I think the proposal is then that there will be a set of extra primops
available when compiling with LLVM, but that's it. We will provide
short-vector primops of multiple widths on all platforms, but some may
not produce efficient code. The user who cares can test the CPP defines
__SSE__, __AVX__, etc.
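
As an illustration of the pattern (in C rather than Haskell, since gcc
and clang define macros with these same names), a user who cares might
write something like the sketch below. add8 and add8_variant are
hypothetical names, and this is only meant to show the shape of the
ifdefs, not what any library actually does.

```c
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>   /* AVX intrinsics */
#elif defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for 8 floats. */
void add8(const float *a, const float *b, float *out)
{
#if defined(__AVX__)
    /* One 256-bit add. */
    _mm256_storeu_ps(out,
        _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
#elif defined(__SSE__)
    /* Two hand-written 128-bit adds. */
    _mm_storeu_ps(out,
        _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    _mm_storeu_ps(out + 4,
        _mm_add_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4)));
#else
    /* Scalar fallback when neither macro is defined. */
    for (size_t i = 0; i < 8; i++)
        out[i] = a[i] + b[i];
#endif
}

/* Reports which variant was selected at compile time. */
const char *add8_variant(void)
{
#if defined(__AVX__)
    return "avx";
#elif defined(__SSE__)
    return "sse";
#else
    return "scalar";
#endif
}
```

Which branch is compiled depends only on the flags given to the
compiler (for example -mavx), not on the machine the code later runs
on.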

Currently, for these extra primops to be available, the base libraries
must also be compiled with LLVM---in particular ghc-prim, due to the
fact that GHC.PrimopWrappers lives there. Is that acceptable?

There are still interoperability problems if we allow LLVM to perform
lowering. Turning on AVX code generation will change the calling
convention. With -mavx, 256-bit wide vectors will be passed in the ymm*
registers. Without -mavx, this obviously won't happen. How should we
deal with that? Also, inlining code from an LLVM-compiled module will
cause an error in a native-back-end-compiled module if any LLVM-only
primops show up in the unfolding. The error will tell the user to use
-fllvm. Is this acceptable?

Geoff

>> 1) What mechanism do we add to GHC to make the set of available primops
>> be a function of DynFlags? Right now we have a llvm_only attribute in
>> compiler/prelude/primops.txt.pp so that the SSE primops are only
>> available when using the LLVM back end. This is a stopgap measure and
>> not correct. What's the right way to do it? How do we then communicate
>> to the user which primops are available? -msse, and thus __SSE__,
>> doesn't mean that the SSE primops are available, because we might be
>> using the native back end. Does the user have to test __SSE__ and
>> __GLASGOW_HASKELL_LLVM__ to know that the SSE primops are available?
>> That's not a good solution, and it certainly doesn't scale. Note that
>> when I say "user" I mean the person who writes the Multi type family
>> instances---even in the current ifdef purgatory situation, most users
>> can use the Multi type family without worrying about ifdefs.
>>
>> 2) What do we do about unfoldings? As a straw-man proposal, we could
>> find a subset of DynFlags that uniquely determines the set of available
>> primops, and then disable unfoldings that come from a module that was
>> compiled with incompatible DynFlags. Once we solve (1), we could (I
>> think) straightforwardly implement this.