Removing/deprecating -fvia-c
Simon Marlow
marlowsd at gmail.com
Tue Feb 16 09:57:52 EST 2010
On 15/02/2010 18:29, Don Stewart wrote:
> marlowsd:
>>>>
>>>> Simon Marlow has recently fixed FP performance for modern x86 chips in
>>>> the native code generator in the HEAD. That was the last reason we know
>>>> of to prefer via-C to the native code generators. But before we start
>>>> the removal process, does anyone know of any other problems with the
>>>> native code generators that need to be fixed first?
>>>>
>>>
>>> Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
>>>
>>> As recently as last year -fvia-C -optc-O3 was still useful for some
>>> microbenchmarks -- what's changed in that time, or is expected to change?
>>
>> If you have benchmarks that show a significant difference, I'd be
>> interested to see them.
>
> I've attached an example where there's a 40% variation (and it's a
> floating point benchmark). Roman would be seeing similar examples in the
> vector code.
>
> I'm all in favor of dropping the C backend, but I'm also wary that we
> don't have benchmarks to know what difference it is making.
>
> Here's a simple program testing a tight, floating point loop:
>
> import Data.Array.Vector
> import Data.Complex
>
> main = print . sumU $ replicateU (1000000000 :: Int) (1 :+ 1 :: Complex Double)
>
> Compiled with ghc 6.12, uvector-0.1.1.0 on a 64-bit Linux box.
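>
> For comparison, essentially the same fused loop written against the
> newer vector package (a sketch, assuming its Unbox instance for
> Complex Double; untested here) should stress the same code path in
> the backends:
>
> import qualified Data.Vector.Unboxed as U
> import Data.Complex
>
> main :: IO ()
> main = print . U.sum $ U.replicate (1000000000 :: Int) (1 :+ 1 :: Complex Double)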
>
> The -fvia-C -optc-O3 build is about 40% faster than -fasm.
> How does it fare with the new SSE patches?
>
> I've attached the assembly below for each case.
>
> -- Don
>
>
> ------------------------------------------------------------------------
>
> Fastest: 2.17s, about 40% faster than -fasm.
>
> $ time ./sum-complex
> 1.0e9 :+ 1.0e9
> ./sum-complex 2.16s user 0.00s system 99% cpu 2.175 total
>
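> # Reading the listings: %r12 is the STG heap pointer (Hp) and
> # 144(%r13) holds the heap limit, so the cmpq/ja at the top is the
> # per-iteration heap check. .L5 is the loop body, .L9 boxes the two
> # Double results on exit, and .L4 spills the loop state and jumps to
> # the garbage collector. The same shape recurs in the listings below.
>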
> Main_mainzuzdszdwfold_info:
> leaq 32(%r12), %rax
> movq %r12, %rdx
> cmpq 144(%r13), %rax
> movq %rax, %r12
> ja .L4
> cmpq $1000000000, %r14
> je .L9
> .L5:
> movsd .LC0(%rip), %xmm0
> leaq 1(%r14), %r14
> addsd %xmm0, %xmm5
> addsd %xmm0, %xmm6
> movq %rdx, %r12
> jmp Main_mainzuzdszdwfold_info
>
> .L4:
> leaq -24(%rbp), %rax
> movq $32, 184(%r13)
> movq %rax, %rbp
> movq %r14, (%rax)
> movsd %xmm5, 8(%rax)
> movsd %xmm6, 16(%rax)
> movl $Main_mainzuzdszdwfold_closure, %ebx
> jmp *-8(%r13)
> .L9:
> movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
> movsd %xmm5, -16(%rax)
> movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
> leaq 25(%rdx), %rbx
> movsd %xmm6, 32(%rdx)
> leaq 9(%rdx), %r14
> jmp *(%rbp)
>
> ------------------------------------------------------------------------
>
> Second fastest: 2.34s
>
> $ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3
> $ time ./sum-complex
> 1.0e9 :+ 1.0e9
> ./sum-complex 2.33s user 0.01s system 99% cpu 2.347 total
>
> Main_mainzuzdszdwfold_info:
> leaq 32(%r12), %rax
> cmpq 144(%r13), %rax
> movq %r12, %rdx
> movq %rax, %r12
> ja .L4
> cmpq $1000000000, %r14
> je .L9
> .L5:
> movsd .LC0(%rip), %xmm0
> leaq 1(%r14), %r14
> movq %rdx, %r12
> addsd %xmm0, %xmm5
> addsd %xmm0, %xmm6
> jmp Main_mainzuzdszdwfold_info
>
> .L4:
> leaq -24(%rbp), %rax
> movq $32, 184(%r13)
> movl $Main_mainzuzdszdwfold_closure, %ebx
> movsd %xmm5, 8(%rax)
> movq %rax, %rbp
> movq %r14, (%rax)
> movsd %xmm6, 16(%rax)
> jmp *-8(%r13)
>
> .L9:
> movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
> movsd %xmm5, -16(%rax)
> movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
> leaq 25(%rdx), %rbx
> movsd %xmm6, 32(%rdx)
> leaq 9(%rdx), %r14
> jmp *(%rbp)
>
> ------------------------------------------------------------------------
>
> Native codegen, 3.57s
>
> ghc 6.12 -fasm -O2
> $ time ./sum-complex
> 1.0e9 :+ 1.0e9
> ./sum-complex 3.57s user 0.01s system 99% cpu 3.574 total
>
>
> Main_mainzuzdszdwfold_info:
> .Lc1i7:
> addq $32,%r12
> cmpq 144(%r13),%r12
> ja .Lc1ia
> movq %r14,%rax
> cmpq $1000000000,%rax
> jne .Lc1id
> movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
> movsd %xmm5,-16(%r12)
> movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
> movsd %xmm6,(%r12)
> leaq -7(%r12),%rbx
> leaq -23(%r12),%r14
> jmp *(%rbp)
> .Lc1ia:
> movq $32,184(%r13)
> movl $Main_mainzuzdszdwfold_closure,%ebx
> addq $-24,%rbp
> movq %r14,(%rbp)
> movsd %xmm5,8(%rbp)
> movsd %xmm6,16(%rbp)
> jmp *-8(%r13)
> .Lc1id:
> movsd %xmm6,%xmm0
> addsd .Ln1if(%rip),%xmm0
> movsd %xmm5,%xmm7
> addsd .Ln1ig(%rip),%xmm7
> leaq 1(%rax),%r14
> movsd %xmm7,%xmm5
> movsd %xmm0,%xmm6
> addq $-32,%r12
> jmp Main_mainzuzdszdwfold_info
>
>
I managed to improve this:
Main_mainzuzdszdwfold_info:
.Lc1lP:
addq $32,%r12
cmpq 144(%r13),%r12
ja .Lc1lS
movq %r14,%rax
cmpq $1000000000,%rax
jne .Lc1lV
movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
movsd %xmm6,-16(%r12)
movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
movsd %xmm5,(%r12)
leaq -7(%r12),%rbx
leaq -23(%r12),%r14
jmp *(%rbp)
.Lc1lS:
movq $32,184(%r13)
movl $Main_mainzuzdszdwfold_closure,%ebx
addq $-24,%rbp
movsd %xmm5,(%rbp)
movsd %xmm6,8(%rbp)
movq %r14,16(%rbp)
jmp *-8(%r13)
.Lc1lV:
addsd .Ln1m2(%rip),%xmm5
addsd .Ln1m3(%rip),%xmm6
leaq 1(%rax),%r14
addq $-32,%r12
jmp Main_mainzuzdszdwfold_info
That takes the last block (the loop body) from 9 instructions down to
5, one fewer than gcc. I haven't commoned up the two constant 1.0s,
though; that would mean doing some CSE.
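
The transformation itself is small. As a toy sketch of commoning up
duplicate literal-pool entries (illustrative types only, nothing like
the NCG's real representation):

import qualified Data.Map as M

type Label = String

-- Fold duplicate literals into a single pool entry, recording a
-- renaming from each dropped label to the one that survives.
commonConsts :: [(Label, Double)] -> (M.Map Double Label, M.Map Label Label)
commonConsts = foldl step (M.empty, M.empty)
  where
    step (pool, renames) (lbl, val) =
      case M.lookup val pool of
        Just keep -> (pool, M.insert lbl keep renames)  -- duplicate: redirect
        Nothing   -> (M.insert val lbl pool, renames)   -- first use: keep

main :: IO ()
main = print (commonConsts [(".Ln1m2", 1.0), (".Ln1m3", 1.0)])
-- (fromList [(1.0,".Ln1m2")], fromList [(".Ln1m3",".Ln1m2")])

Rewriting the addsd operands through that renaming would leave both
adds referencing a single constant, as gcc does with its one .LC0.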
On my machine with GHC HEAD and gcc 4.3.0, the gcc version runs in 2.0s,
with the NCG at 2.3s. I put the difference down to a bit of instruction
scheduling done by gcc, and that extra constant load.
But let's face it, all of this code is crappy. It should be a tiny
little loop rather than a tail call with argument passing, and that's
what we'll get with the new backend (eventually). LLVM probably won't
turn it into a loop on its own; that needs to happen before the code
is handed to LLVM.
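
At the source level the worker is just this counted loop (a paraphrase
of the derived worker with made-up names, not the actual Core):

-- Each recursive call below currently compiles to a tail call with
-- argument shuffling and a per-iteration heap check, rather than a
-- straight-line loop keeping i, re and im in registers.
mainFold :: Int -> Double -> Double -> (Double, Double)
mainFold i re im
  | i == 1000000000 = (re, im)
  | otherwise       = mainFold (i + 1) (re + 1.0) (im + 1.0)

main :: IO ()
main = print (mainFold 0 0 0)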
Have you looked at this example on x86? It's *far* worse and runs about
5 times slower.
Cheers,
Simon