Removing/deprecating -fvia-c
Don Stewart
dons at galois.com
Mon Feb 15 16:49:14 EST 2010
dons:
> marlowsd:
> >>>
> >>> Simon Marlow has recently fixed FP performance for modern x86 chips in
> >>> the native code generator in the HEAD. That was the last reason we know
> >>> of to prefer via-C to the native code generators. But before we start
> >>> the removal process, does anyone know of any other problems with the
> >>> native code generators that need to be fixed first?
> >>>
> >>
> >> Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
> >>
> >> As recently as last year -fvia-C -optc-O3 was still useful for some
> >> microbenchmarks -- what's changed in that time, or is expected to change?
> >
> > If you have benchmarks that show a significant difference, I'd be
> > interested to see them.
>
> I've attached an example where there's a 40% variation (and it's a
> floating point benchmark). Roman would be seeing similar examples in the
> vector code.
Here's an example that doesn't use floating point:
import Data.Array.Vector
import Data.Bits
main = print . sumU $ zipWith3U (\x y z -> x * y * z)
(enumFromToU 1 (100000000 :: Int))
(enumFromToU 2 (100000001 :: Int))
(enumFromToU 7 (100000008 :: Int))
In core:
main_$s$wfold :: Int# -> Int# -> Int# -> Int# -> Int#
main_$s$wfold =
\ (sc_s1l1 :: Int#)
(sc1_s1l2 :: Int#)
(sc2_s1l3 :: Int#)
(sc3_s1l4 :: Int#) ->
case ># sc2_s1l3 100000000 of _ {
False ->
case ># sc1_s1l2 100000001 of _ {
False ->
case ># sc_s1l1 100000008 of _ {
False ->
main_$s$wfold
(+# sc_s1l1 1)
(+# sc1_s1l2 1)
(+# sc2_s1l3 1)
(+#
sc3_s1l4 (*# (*# sc2_s1l3 sc1_s1l2) sc_s1l1));
True -> sc3_s1l4
};
True -> sc3_s1l4
};
True -> sc3_s1l4
}
Rather nice!
-fvia-C -optc-O3
Main_mainzuzdszdwfold_info:
cmpq $100000000, %rdi
jg .L6
cmpq $100000001, %rsi
jg .L6
cmpq $100000008, %r14
jle .L10
.L6:
movq %r8, %rbx
movq (%rbp), %rax
jmp *%rax
.L10:
movq %rsi, %r10
leaq 1(%rsi), %rsi
imulq %rdi, %r10
leaq 1(%rdi), %rdi
imulq %r14, %r10
leaq 1(%r14), %r14
leaq (%r10,%r8), %r8
jmp Main_mainzuzdszdwfold_info
Which looks ok.
$ time ./zipwith3
3541230156834269568
./zipwith3 0.33s user 0.00s system 99% cpu 0.337 total
And -fasm we get very different code, and a bit of a slowdown:
Main_mainzuzdszdwfold_info:
.Lc1mo:
cmpq $100000000,%rdi
jg .Lc1mq
cmpq $100000001,%rsi
jg .Lc1ms
cmpq $100000008,%r14
jg .Lc1mv
movq %rsi,%rax
imulq %r14,%rax
movq %rdi,%rcx
imulq %rax,%rcx
movq %r8,%rax
addq %rcx,%rax
leaq 1(%rdi),%rcx
leaq 1(%rsi),%rdx
incq %r14
movq %rdx,%rsi
movq %rcx,%rdi
movq %rax,%r8
jmp Main_mainzuzdszdwfold_info
.Lc1mq:
movq %r8,%rbx
jmp *(%rbp)
.Lc1ms:
movq %r8,%rbx
jmp *(%rbp)
.Lc1mv:
movq %r8,%rbx
jmp *(%rbp)
Slower:
$ time ./zipwith3
3541230156834269568
./zipwith3 0.38s user 0.00s system 98% cpu 0.384 total
Now maybe we need to wait on the new backend optimizations to get there?
-- Don
More information about the Glasgow-haskell-users
mailing list