Removing/deprecating -fvia-c

Mon Feb 15 16:49:14 EST 2010

dons:
> marlowsd:
> >>>
> >>> Simon Marlow has recently fixed FP performance for modern x86 chips in
> >>> the native code generator in the HEAD. That was the last reason we know
> >>> of to prefer via-C to the native code generators. But before we start
> >>> the removal process, does anyone know of any other problems with the
> >>> native code generators that need to be fixed first?
> >>>
> >>
> >> Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
> >>
> >> As recently as last year -fvia-C -optc-O3 was still useful for some
> >> microbenchmarks -- what's changed in that time, or is expected to change?
> >
> > If you have benchmarks that show a significant difference, I'd be  
> > interested to see them.
> 
> I've attached an example where there's a 40% variation (and it's a
> floating point benchmark). Roman would be seeing similar examples in the
> vector code.

Here's an example that doesn't use floating point:

    import Data.Array.Vector
    import Data.Bits

    main = print . sumU $ zipWith3U (\x y z -> x * y * z)
                            (enumFromToU 1 (100000000 :: Int))
                            (enumFromToU 2 (100000001 :: Int))
                            (enumFromToU 7 (100000008 :: Int))

In core:

    main_$s$wfold :: Int# -> Int# -> Int# -> Int# -> Int#
    main_$s$wfold =
      \ (sc_s1l1 :: Int#)
        (sc1_s1l2 :: Int#)
        (sc2_s1l3 :: Int#)
        (sc3_s1l4 :: Int#) ->
        case ># sc2_s1l3 100000000 of _ {
          False ->
            case ># sc1_s1l2 100000001 of _ {
              False ->
                case ># sc_s1l1 100000008 of _ {
                  False ->
                    main_$s$wfold
                      (+# sc_s1l1 1)
                      (+# sc1_s1l2 1)
                      (+# sc2_s1l3 1)
                      (+#
                         sc3_s1l4 (*# (*# sc2_s1l3 sc1_s1l2) sc_s1l1));
                  True -> sc3_s1l4
                };
              True -> sc3_s1l4
            };
          True -> sc3_s1l4
        }

Rather nice!

-fvia-C -optc-O3

    Main_mainzuzdszdwfold_info:
            cmpq    $100000000, %rdi
            jg      .L6
            cmpq    $100000001, %rsi
            jg      .L6
            cmpq    $100000008, %r14
            jle     .L10
    .L6:
            movq    %r8, %rbx
            movq    (%rbp), %rax
            jmp     *%rax
    .L10:
            movq    %rsi, %r10
            leaq    1(%rsi), %rsi
            imulq   %rdi, %r10
            leaq    1(%rdi), %rdi
            imulq   %r14, %r10
            leaq    1(%r14), %r14
            leaq    (%r10,%r8), %r8
            jmp     Main_mainzuzdszdwfold_info

Which looks ok.

    $ time ./zipwith3                          
    3541230156834269568
    ./zipwith3  0.33s user 0.00s system 99% cpu 0.337 total

And -fasm we get very different code, and a bit of a slowdown:

    Main_mainzuzdszdwfold_info:
    .Lc1mo:
            cmpq $100000000,%rdi
            jg .Lc1mq
            cmpq $100000001,%rsi
            jg .Lc1ms
            cmpq $100000008,%r14
            jg .Lc1mv

            movq %rsi,%rax
            imulq %r14,%rax
            movq %rdi,%rcx
            imulq %rax,%rcx
            movq %r8,%rax
            addq %rcx,%rax
            leaq 1(%rdi),%rcx
            leaq 1(%rsi),%rdx
            incq %r14
            movq %rdx,%rsi
            movq %rcx,%rdi
            movq %rax,%r8
            jmp Main_mainzuzdszdwfold_info

    .Lc1mq:
            movq %r8,%rbx
            jmp *(%rbp)
    .Lc1ms:
            movq %r8,%rbx
            jmp *(%rbp)
    .Lc1mv:
            movq %r8,%rbx
            jmp *(%rbp)

Slower:

    $ time ./zipwith3
    3541230156834269568
    ./zipwith3  0.38s user 0.00s system 98% cpu 0.384 total

Now maybe we need to wait on the new backend optimizations to get there?

-- Don