[Haskell-cafe] speed: ghc vs gcc
Don Stewart
dons at galois.com
Fri Feb 20 11:41:33 EST 2009
bulat.ziganshin:
> Hello haskell-cafe,
>
> since there are no objective tests comparing ghc to gcc, i made my own
> one. these are 3 programs, calculating sum in c++ and haskell:
Wonderful. Thank you!
> main = print $ sum[1..10^9::Int]
This won't be comparable to your loop below, as 'sum' is a left fold
(which doesn't fuse under build/foldr).
You should use the list implementation from the stream-fusion package (or
uvector) if you're expecting it to fuse to the following loop:
> main = print $ sum0 (10^9) 0
>
> sum0 :: Int -> Int -> Int
> sum0 0 !acc = acc
> sum0 !x !acc = sum0 (x-1) (acc+x)
Note the bang patterns aren't required here. It compiles to the
following core:
$wsum0 :: Int# -> Int# -> Int#
$wsum0 =
\ (ww_sON :: Int#) (ww1_sOR :: Int#) ->
case ww_sON of ds_XD0 {
_ -> $wsum0 (-# ds_XD0 1) (+# ww1_sOR ds_XD0);
0 -> ww1_sOR
which is perfect.
Main_zdwsum0_info:
testq %rsi, %rsi
movq %rsi, %rax
jne .L2
movq %rdi, %rbx
jmp *(%rbp)
.L2:
leaq -1(%rsi), %rsi
addq %rax, %rdi
jmp Main_zdwsum0_info
Which seems ... OK.
$ ghc-core A.hs -fvia-C -optc-O3
$ time ./A
500000000500000000
./A 1.12s user 0.00s system 99% cpu 1.127 total
Works for me. That's on linux x86_64, gcc 4.4
Trying -fasm:
Main_zdwsum0_info:
.LcQs:
movq %rsi,%rax
testq %rax,%rax
jne .LcQw
movq %rdi,%rbx
jmp *(%rbp)
.LcQw:
movq %rdi,%rcx
addq %rax,%rcx
leaq -1(%rax),%rsi
movq %rcx,%rdi
jmp Main_zdwsum0_info
$ time ./A
500000000500000000
./A 1.65s user 0.00s system 98% cpu 1.677 total
Is a bit slower.
> main()
> {
> int sum=0;
> //for(int j=0; j<100;j++)
> for(int i=0; i<1000*1000*1000;i++)
> sum += i;
> return sum;
> }
Well, that's a bit different. It doesn't print the result, and it returns a different
results on 64 bit....
$ gcc -O0 t.c
$ time ./a.out
-1243309312
./a.out 3.99s user 0.00s system 88% cpu 4.500 total
$ gcc -O1 t.c
$ time ./a.out
-1243309312
./a.out 0.88s user 0.00s system 99% cpu 0.892 total
$ gcc -O3 -funroll-loops t.c
$ time ./a.out
-1243309312
./a.out 0.31s user 0.00s system 97% cpu 0.318 total
I don't get anything near the 0.062s which is interesting.
The print statement slows things down, I guess...
So we have:
ghc -fvia-C -O2 1.127
ghc -fasm 1.677
gcc -O0 4.500
gcc -O3 -funroll-loops 0.318
So. some lessons. GHC is around 3-4x slower on this tight loop. (Which isn't as
bad as it used to be).
That's actually a worse margin than any current shootout program, where we are no
worse than 2.9 slower on larger things:
http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=ghc&lang2=gcc&box=1
>
> execution times:
> sum:
> ghc 6.6.1 -O2 : 12.433 secs
> ghc 6.10.1 -O2 : 12.792 secs
> sum-fast:
> ghc 6.6.1 -O2 : 1.919 secs
> ghc 6.10.1 -O2 : 1.856 secs
> ghc 6.10.1 -O2 -fvia-C : 1.966 secs
> C++:
> gcc 3.4.5 -O3 -funroll-loops: 0.062 secs
>
I couldn't reproduce your final number.
Now, given GHC gets most of the way there -- I think this might make a good bug
report against GHC head, so we can see if the new register allocator helps any.
http://hackage.haskell.org/trac/ghc/newticket?type=bug
Thanks for the report, Bulat!
-- Don
More information about the Haskell-Cafe
mailing list