Removing/deprecating -fvia-c
Simon Marlow
marlowsd at gmail.com
Tue Feb 16 09:57:52 EST 2010
On 15/02/2010 18:29, Don Stewart wrote:
> marlowsd:
>>>>
>>>> Simon Marlow has recently fixed FP performance for modern x86 chips in
>>>> the native code generator in the HEAD. That was the last reason we know
>>>> of to prefer via-C to the native code generators. But before we start
>>>> the removal process, does anyone know of any other problems with the
>>>> native code generators that need to be fixed first?
>>>>
>>>
>>> Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
>>>
>>> As recently as last year -fvia-C -optc-O3 was still useful for some
>>> microbenchmarks -- what's changed in that time, or is expected to change?
>>
>> If you have benchmarks that show a significant difference, I'd be
>> interested to see them.
>
> I've attached an example where there's a 40% variation (and it's a
> floating point benchmark). Roman would be seeing similar examples in the
> vector code.
>
> I'm all in favor of dropping the C backend, but I'm also wary that we
> don't have benchmarks to know what difference it is making.
>
> Here's a simple program testing a tight, floating point loop:
>
> import Data.Array.Vector
> import Data.Complex
>
> main = print . sumU $ replicateU (1000000000 :: Int) (1 :+ 1 :: Complex Double)
>
> Compiled with ghc 6.12, uvector-0.1.1.0 on a 64-bit Linux box.
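>
> For comparison, essentially the same fused loop written against the
> newer vector package (a sketch, assuming its Unbox instance for
> Complex Double; untested here) should stress the same code path in
> the backends:
>
> import qualified Data.Vector.Unboxed as U
> import Data.Complex
>
> main :: IO ()
> main = print . U.sum $ U.replicate (1000000000 :: Int) (1 :+ 1 :: Complex Double)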
>
> The -fvia-C -optc-O3 build is about 40% faster than -fasm.
> How does it fare with the new SSE patches?
>
> I've attached the assembly below for each case.
>
> -- Don
>
>
> ------------------------------------------------------------------------
>
> Fastest: 2.17s, about 40% faster than -fasm.
>
> $ time ./sum-complex
> 1.0e9 :+ 1.0e9
> ./sum-complex 2.16s user 0.00s system 99% cpu 2.175 total
>
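> # Reading the listings: %r12 is the STG heap pointer (Hp) and
> # 144(%r13) holds the heap limit, so the cmpq/ja at the top is the
> # per-iteration heap check. .L5 is the loop body, .L9 boxes the two
> # Double results on exit, and .L4 spills the loop state and jumps to
> # the garbage collector. The same shape recurs in the listings below.
>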
> Main_mainzuzdszdwfold_info:
> leaq 32(%r12), %rax
> movq %r12, %rdx
> cmpq 144(%r13), %rax
> movq %rax, %r12
> ja .L4
> cmpq $1000000000, %r14
> je .L9
> .L5:
> movsd .LC0(%rip), %xmm0
> leaq 1(%r14), %r14
> addsd %xmm0, %xmm5
> addsd %xmm0, %xmm6
> movq %rdx, %r12
> jmp Main_mainzuzdszdwfold_info
>
> .L4:
> leaq -24(%rbp), %rax
> movq $32, 184(%r13)
> movq %rax, %rbp
> movq %r14, (%rax)
> movsd %xmm5, 8(%rax)
> movsd %xmm6, 16(%rax)
> movl $Main_mainzuzdszdwfold_closure, %ebx
> jmp *-8(%r13)
> .L9:
> movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
> movsd %xmm5, -16(%rax)
> movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
> leaq 25(%rdx), %rbx
> movsd %xmm6, 32(%rdx)
> leaq 9(%rdx), %r14
> jmp *(%rbp)
>
> ------------------------------------------------------------------------
>
> Second fastest: 2.34s
>
> $ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3
> $ time ./sum-complex
> 1.0e9 :+ 1.0e9
> ./sum-complex 2.33s user 0.01s system 99% cpu 2.347 total
>
> Main_mainzuzdszdwfold_info:
> leaq 32(%r12), %rax
> cmpq 144(%r13), %rax
> movq %r12, %rdx
> movq %rax, %r12
> ja .L4
> cmpq $1000000000, %r14
> je .L9
> .L5:
> movsd .LC0(%rip), %xmm0
> leaq 1(%r14), %r14
> movq %rdx, %r12
> addsd %xmm0, %xmm5
> addsd %xmm0, %xmm6
> jmp Main_mainzuzdszdwfold_info
>
> .L4:
> leaq -24(%rbp), %rax
> movq $32, 184(%r13)
> movl $Main_mainzuzdszdwfold_closure, %ebx
> movsd %xmm5, 8(%rax)
> movq %rax, %rbp
> movq %r14, (%rax)
> movsd %xmm6, 16(%rax)
> jmp *-8(%r13)
>
> .L9:
> movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
> movsd %xmm5, -16(%rax)
> movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
> leaq 25(%rdx), %rbx
> movsd %xmm6, 32(%rdx)
> leaq 9(%rdx), %r14
> jmp *(%rbp)
>
> ------------------------------------------------------------------------
>
> Native codegen, 3.57s
>
> ghc 6.12 -fasm -O2
> $ time ./sum-complex
> 1.0e9 :+ 1.0e9
> ./sum-complex 3.57s user 0.01s system 99% cpu 3.574 total
>
>
> Main_mainzuzdszdwfold_info:
> .Lc1i7:
> addq $32,%r12
> cmpq 144(%r13),%r12
> ja .Lc1ia
> movq %r14,%rax
> cmpq $1000000000,%rax
> jne .Lc1id
> movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
> movsd %xmm5,-16(%r12)
> movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
> movsd %xmm6,(%r12)
> leaq -7(%r12),%rbx
> leaq -23(%r12),%r14
> jmp *(%rbp)
> .Lc1ia:
> movq $32,184(%r13)
> movl $Main_mainzuzdszdwfold_closure,%ebx
> addq $-24,%rbp
> movq %r14,(%rbp)
> movsd %xmm5,8(%rbp)
> movsd %xmm6,16(%rbp)
> jmp *-8(%r13)
> .Lc1id:
> movsd %xmm6,%xmm0
> addsd .Ln1if(%rip),%xmm0
> movsd %xmm5,%xmm7
> addsd .Ln1ig(%rip),%xmm7
> leaq 1(%rax),%r14
> movsd %xmm7,%xmm5
> movsd %xmm0,%xmm6
> addq $-32,%r12
> jmp Main_mainzuzdszdwfold_info
>
>
I managed to improve this:
Main_mainzuzdszdwfold_info:
.Lc1lP:
addq $32,%r12
cmpq 144(%r13),%r12
ja .Lc1lS
movq %r14,%rax
cmpq $1000000000,%rax
jne .Lc1lV
movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
movsd %xmm6,-16(%r12)
movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
movsd %xmm5,(%r12)
leaq -7(%r12),%rbx
leaq -23(%r12),%r14
jmp *(%rbp)
.Lc1lS:
movq $32,184(%r13)
movl $Main_mainzuzdszdwfold_closure,%ebx
addq $-24,%rbp
movsd %xmm5,(%rbp)
movsd %xmm6,8(%rbp)
movq %r14,16(%rbp)
jmp *-8(%r13)
.Lc1lV:
addsd .Ln1m2(%rip),%xmm5
addsd .Ln1m3(%rip),%xmm6
leaq 1(%rax),%r14
addq $-32,%r12
jmp Main_mainzuzdszdwfold_info
That takes the last block (the loop body) from 9 instructions down to
5, one fewer than gcc. I haven't commoned up the two constant 1.0s,
though; that would mean doing some CSE.
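
The transformation itself is small. As a toy sketch of commoning up
duplicate literal-pool entries (illustrative types only, nothing like
the NCG's real representation):

import qualified Data.Map as M

type Label = String

-- Fold duplicate literals into a single pool entry, recording a
-- renaming from each dropped label to the one that survives.
commonConsts :: [(Label, Double)] -> (M.Map Double Label, M.Map Label Label)
commonConsts = foldl step (M.empty, M.empty)
  where
    step (pool, renames) (lbl, val) =
      case M.lookup val pool of
        Just keep -> (pool, M.insert lbl keep renames)  -- duplicate: redirect
        Nothing   -> (M.insert val lbl pool, renames)   -- first use: keep

main :: IO ()
main = print (commonConsts [(".Ln1m2", 1.0), (".Ln1m3", 1.0)])
-- (fromList [(1.0,".Ln1m2")], fromList [(".Ln1m3",".Ln1m2")])

Rewriting the addsd operands through that renaming would leave both
adds referencing a single constant, as gcc does with its one .LC0.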
On my machine with GHC HEAD and gcc 4.3.0, the gcc version runs in 2.0s,
with the NCG at 2.3s. I put the difference down to a bit of instruction
scheduling done by gcc, and that extra constant load.
But let's face it, all of this code is crappy. It should be a tiny
little loop rather than a tail call with argument passing, and that's
what we'll get with the new backend (eventually). LLVM probably won't
turn it into a loop on its own; that needs to happen before the code
is handed to LLVM.
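
At the source level the worker is just this counted loop (a paraphrase
of the derived worker with made-up names, not the actual Core):

-- Each recursive call below currently compiles to a tail call with
-- argument shuffling and a per-iteration heap check, rather than a
-- straight-line loop keeping i, re and im in registers.
mainFold :: Int -> Double -> Double -> (Double, Double)
mainFold i re im
  | i == 1000000000 = (re, im)
  | otherwise       = mainFold (i + 1) (re + 1.0) (im + 1.0)

main :: IO ()
main = print (mainFold 0 0 0)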
Have you looked at this example on x86? It's *far* worse and runs about
5 times slower.
Cheers,
Simon