Vector primops sizes

Wed Feb 13 07:19:58 CET 2013

> By which I mean having this family of proposed primops. It's not obvious to
> me at least how GHC could intelligently infer / use these implicitly for
> the end user / library writer.

I have a couple of ideas on how to implement this, but having an explicit
set of primops will make the use of vector instructions less magical.

As for exposing only the set of primops that is valid for a given arch/CPU
target: that would make things much more complicated. LLVM already takes
care of implementing wide vector operations from smaller instructions, so
an operation on the DoubleX16# primitive type gets compiled into something
like

plusDoubleX16# :: DoubleX16# -> DoubleX16# -> DoubleX16#

        movq    %r13, 616(%rsp)
        movq    %rbp, 608(%rsp)
        movq    %r12, 600(%rsp)
        movq    %rbx, 592(%rsp)
        movq    %r15, 544(%rsp)
        movq    592(%rsp), %rax
        movq    %rax, 344(%rsp)
        movq    608(%rsp), %rax
        vmovups (%rax), %ymm0
        vmovups 32(%rax), %ymm1
        vmovups 64(%rax), %ymm2
        vmovups 96(%rax), %ymm3
        vmovaps %ymm3, 224(%rsp)
        vmovaps %ymm2, 192(%rsp)
        vmovaps %ymm1, 160(%rsp)
        vmovaps %ymm0, 128(%rsp)
        movq    608(%rsp), %rax
        vmovups 128(%rax), %ymm0
        vmovups 160(%rax), %ymm1
        vmovups 192(%rax), %ymm2
        vmovups 224(%rax), %ymm3
        vmovaps %ymm3, 96(%rsp)
        vmovaps %ymm2, 64(%rsp)
        vmovaps %ymm1, 32(%rsp)
        vmovaps %ymm0, (%rsp)
        movq    344(%rsp), %rbx
        movq    %rbx, 592(%rsp)
        movq    544(%rsp), %r15
        movq    600(%rsp), %r12
        movq    608(%rsp), %rax
        movq    616(%rsp), %r13
        movq    %rax, %rbp
        vzeroupper

(Still, it should be possible to compile this with fewer moves.)
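
To make the "built from smaller instructions" point concrete, here is a rough model in plain Haskell. It is only a sketch: it uses ordinary lists rather than the proposed primops, and every name in it (lanesPerYmm, chunksOf, plusChunk, plusDoubleX16Model) is hypothetical and not part of any GHC API. The one assumption taken from the target is that a 256-bit ymm register holds four Doubles, so sixteen lanes need four registers per operand, which matches the groups of four ymm moves in the listing above.

```haskell
-- Plain-list model of a 16-lane Double add split into 4-lane pieces.
-- All names are hypothetical; this is not the GHC primop API.
import Data.List (unfoldr)

-- | Number of Double lanes that fit in one 256-bit ymm register.
lanesPerYmm :: Int
lanesPerYmm = 4

-- | Split a list of lanes into chunks of n elements (n must be positive).
chunksOf :: Int -> [a] -> [[a]]
chunksOf n = unfoldr step
  where
    step [] = Nothing
    step xs = Just (splitAt n xs)

-- | Model of one register-wide add: lane-wise addition of two chunks.
plusChunk :: [Double] -> [Double] -> [Double]
plusChunk = zipWith (+)

-- | Model of the 16-lane add: independent register-wide adds, concatenated.
plusDoubleX16Model :: [Double] -> [Double] -> [Double]
plusDoubleX16Model xs ys =
  concat (zipWith plusChunk (chunksOf lanesPerYmm xs) (chunksOf lanesPerYmm ys))
```

Loaded in GHCi, plusDoubleX16Model [1 .. 16] (replicate 16 0.5) adds 0.5 to each of the sixteen lanes. The result is the same as a single lane-wise add over all sixteen elements, which is why the splitting can be left to the backend instead of being reflected in the primop set.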