Removing/deprecating -fvia-c

Mon Feb 15 13:29:20 EST 2010

marlowsd:
>>>
>>> Simon Marlow has recently fixed FP performance for modern x86 chips in
>>> the native code generator in the HEAD. That was the last reason we know
>>> of to prefer via-C to the native code generators. But before we start
>>> the removal process, does anyone know of any other problems with the
>>> native code generators that need to be fixed first?
>>>
>>
>> Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
>>
>> As recently as last year -fvia-C -optc-O3 was still useful for some
>> microbenchmarks -- what's changed in that time, or is expected to change?
>
> If you have benchmarks that show a significant difference, I'd be
> interested to see them.

I've attached an example where there's a 40% difference in run time (and
it's a floating-point benchmark). Roman would be seeing similar examples
in the vector code.

I'm all in favor of dropping the C backend, but I'm also wary that we
don't have benchmarks to know what difference it is making.

Here's a simple program testing a tight floating-point loop:

    import Data.Array.Vector
    import Data.Complex

    main = print . sumU $ replicateU (1000000000 :: Int) (1 :+ 1 :: Complex Double)

Compiled with GHC 6.12 and uvector-0.1.1.0 on a 64-bit Linux box.

The -fvia-C -optc-O3 build is about 40% faster than the -fasm build.
How does it fare with the new SSE patches?
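For anyone who wants to try the comparison without installing uvector, here is a minimal hand-written sketch of the same kind of loop: a counter plus two strict Double accumulators, which is roughly the shape of the fused worker in the assembly listings. The name `sumOnes` and the explicit `go` loop are illustrative assumptions, not what uvector's fusion actually generates, so timings may not match the figures in this message.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main, sumOnes) where

import Data.Complex (Complex ((:+)))

-- Hypothetical stand-in for `sumU (replicateU n (1 :+ 1))`:
-- add 1 to the real and imaginary accumulators n times.
sumOnes :: Int -> Complex Double
sumOnes n = go 0 0 0
  where
    -- Bang patterns keep the counter and both accumulators strict,
    -- so GHC can unbox them into registers with -O2.
    go :: Int -> Double -> Double -> Complex Double
    go !i !re !im
      | i >= n    = re :+ im
      | otherwise = go (i + 1) (re + 1) (im + 1)

main :: IO ()
main = print (sumOnes 1000000000)
```

Build it once per backend with the same flags as in the listings (for example -O2 -fasm, then -O2 -fvia-C -optc-O3) and time each binary.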

I've attached the assembly below for each case.

-- Don

------------------------------------------------------------------------

Fastest, 2.17s (about 40% faster than -fasm)

    $ time ./sum-complex
    1.0e9 :+ 1.0e9
    ./sum-complex  2.16s user 0.00s system 99% cpu 2.175 total

Main_mainzuzdszdwfold_info:
        leaq    32(%r12), %rax
        movq    %r12, %rdx
        cmpq    144(%r13), %rax
        movq    %rax, %r12
        ja      .L4
        cmpq    $1000000000, %r14
        je      .L9
.L5:
        movsd   .LC0(%rip), %xmm0
        leaq    1(%r14), %r14
        addsd   %xmm0, %xmm5
        addsd   %xmm0, %xmm6
        movq    %rdx, %r12
        jmp     Main_mainzuzdszdwfold_info

.L4:
        leaq    -24(%rbp), %rax
        movq    $32, 184(%r13)
        movq    %rax, %rbp
        movq    %r14, (%rax)
        movsd   %xmm5, 8(%rax)
        movsd   %xmm6, 16(%rax)
        movl    $Main_mainzuzdszdwfold_closure, %ebx
        jmp     *-8(%r13)
.L9:
        movq    $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
        movsd   %xmm5, -16(%rax)
        movq    $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
        leaq    25(%rdx), %rbx
        movsd   %xmm6, 32(%rdx)
        leaq    9(%rdx), %r14
        jmp     *(%rbp)

------------------------------------------------------------------------

Second, 2.34s

    $ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3
    $ time ./sum-complex
    1.0e9 :+ 1.0e9
    ./sum-complex  2.33s user 0.01s system 99% cpu 2.347 total

Main_mainzuzdszdwfold_info:
        leaq    32(%r12), %rax
        cmpq    144(%r13), %rax
        movq    %r12, %rdx
        movq    %rax, %r12
        ja      .L4
        cmpq    $1000000000, %r14
        je      .L9
.L5:
        movsd   .LC0(%rip), %xmm0
        leaq    1(%r14), %r14
        movq    %rdx, %r12
        addsd   %xmm0, %xmm5
        addsd   %xmm0, %xmm6
        jmp     Main_mainzuzdszdwfold_info

.L4:
        leaq    -24(%rbp), %rax
        movq    $32, 184(%r13)
        movl    $Main_mainzuzdszdwfold_closure, %ebx
        movsd   %xmm5, 8(%rax)
        movq    %rax, %rbp
        movq    %r14, (%rax)
        movsd   %xmm6, 16(%rax)
        jmp     *-8(%r13)

.L9:
        movq    $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
        movsd   %xmm5, -16(%rax)
        movq    $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
        leaq    25(%rdx), %rbx
        movsd   %xmm6, 32(%rdx)
        leaq    9(%rdx), %r14
        jmp     *(%rbp)

------------------------------------------------------------------------

Native codegen, 3.57s

    ghc 6.12 -fasm -O2
    $ time ./sum-complex
    1.0e9 :+ 1.0e9
    ./sum-complex  3.57s user 0.01s system 99% cpu 3.574 total

Main_mainzuzdszdwfold_info:
.Lc1i7:
        addq $32,%r12
        cmpq 144(%r13),%r12
        ja .Lc1ia
        movq %r14,%rax
        cmpq $1000000000,%rax
        jne .Lc1id
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
        movsd %xmm5,-16(%r12)
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
        movsd %xmm6,(%r12)
        leaq -7(%r12),%rbx
        leaq -23(%r12),%r14
        jmp *(%rbp)
.Lc1ia:
        movq $32,184(%r13)
        movl $Main_mainzuzdszdwfold_closure,%ebx
        addq $-24,%rbp
        movq %r14,(%rbp)
        movsd %xmm5,8(%rbp)
        movsd %xmm6,16(%rbp)
        jmp *-8(%r13)
.Lc1id:
        movsd %xmm6,%xmm0
        addsd .Ln1if(%rip),%xmm0
        movsd %xmm5,%xmm7
        addsd .Ln1ig(%rip),%xmm7
        leaq 1(%rax),%r14
        movsd %xmm7,%xmm5
        movsd %xmm0,%xmm6
        addq $-32,%r12
        jmp Main_mainzuzdszdwfold_info
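
To make the "about 40%" figure concrete, the three wall-clock totals reported above can be compared against the native code generator run. This is only a small arithmetic sketch: the helper name `timeSaved` is an assumption for illustration, and the flags for the fastest listing are not shown in this message, so it is labelled simply as "fastest".

```haskell
module Main (main, timeSaved) where

import Text.Printf (printf)

-- Fraction of wall-clock time saved relative to a baseline run.
timeSaved :: Double -> Double -> Double
timeSaved baseline t = 1 - t / baseline

main :: IO ()
main = do
  -- Wall-clock totals in seconds, copied from the `time` output above.
  let native  = 3.574  -- ghc 6.12 -fasm -O2
      fastest = 2.175  -- first listing (flags not shown in the message)
      viaC    = 2.347  -- -O2 -fvia-C -optc-O3
  printf "fastest listing:  %.1f%% less time than -fasm\n"
    (100 * timeSaved native fastest)
  printf "-fvia-C -optc-O3: %.1f%% less time than -fasm\n"
    (100 * timeSaved native viaC)
```

On these numbers the fastest listing saves roughly 39% of the run time and the -fvia-C -optc-O3 build roughly 34%.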