[Haskell-cafe] speed: ghc vs gcc

Fri Feb 20 15:56:44 EST 2009

Am Freitag, 20. Februar 2009 18:10 schrieb Bulat Ziganshin:
> Hello Don,
>
> Friday, February 20, 2009, 7:41:33 PM, you wrote:
> >> main = print $ sum[1..10^9::Int]
> >
> > This won't be comparable to your loop below, as 'sum' is a left fold
> > (which doesn't fuse under build/foldr).
> >
> > You should use the list implementation from the stream-fusion package (or
> > uvector) if you're expecting it to fuse to the following loop:
>
> it was comparison of native haskell, low-level haskell (which is
> harder to write than native C)

Um, not always, and certainly not in cases like this, at least if you call the 
simple worker loop "low-level Haskell".

> and native C. stream-fusion and any
> other packages provides libraries for some tasks but they can't make faster
> maps, for example. so i used plain list

Which is of course comparable with a tight loop in C, isn't it?
Really, you hurt your cause by including that.
You said you wanted to compare ghc to gcc, then compare what they do to 
comparable code.

>
> > Which seems ... OK.
>
> really? :D
>
> > Well, that's a bit different. It doesn't print the result, and it returns
> > a different results on 64 bit....
>
> doesn't matter for testing speed
>
> > I don't get anything near the 0.062s which is interesting.
>
> it was beautiful gcc optimization - it added 8 values at once. with
> xor results are:
>
> xor.hs      12.605
> xor-fast.hs  1.856
> xor.cpp      0.339
>
> > The print statement slows things down, I guess...
>
> are you really believe that printing one number needs so much time? :)
>
> > So we have:
> >
> >     ghc -fvia-C -O2             1.127
> >     ghc -fasm                   1.677
> >     gcc -O0                     4.500
> >     gcc -O3 -funroll-loops      0.318
>
> why not compare to ghc -O0? also you can disable loop unrolling in gcc
> and unroll loops manually in haskell. or you can generate asm code on
> the fly. there are plenty of tricks to "prove" that gcc generates bad
> code :D

That's not what he's doing at all. Sure, he's not comparing Haskell code 
compiled without optimisations, but he also includes gcc with highest 
optimisation level. Read the gcc -O0 figure as an indication of what 
optimisations can achieve.

>
> > So. some lessons. GHC is around 3-4x slower on this tight loop. (Which
> > isn't as bad as it used to be).
>
> really? what i see: low-level haskell code is usually 3 times harder
> to write and 3 times slower than gcc code.

I deny that low-level Haskell code is three times harder to write than 
ordinary C code, at least if we consider the worker/wrapper idiom low-level 
Haskell.
It is also my experience that gcc usually creates faster executables from good 
C code than ghc does from corresponding ordinary Haskell code (not using 
#-magic), but the margin does vary wildly.

> native haskell code is tens
> to thousands times slower than C code (just recall that real programs
> use type classes and monads in addition to laziness)

Okay, tens is realistic, but thousands?
Of course if you compare a tight loop that doesn't allocate to creating 
thousands of millions of cons-cells...
Just because lists are easier to use in Haskell than in any other language I 
know doesn't mean it's necessary to use lists when writing Haskell if other 
ways are more fitting for the goal.

Just for the record, timings on my machine, gcc-3.3 vs. ghc-6.8.3:
$ ./runtests
Sums in C, first counting up, then down
with -O0
-243309312

real    0m6.751s
user    0m6.660s
sys     0m0.020s
-243309312

real    0m6.318s
user    0m6.190s
sys     0m0.000s
with -O1
-243309312

real    0m2.533s
user    0m2.530s
sys     0m0.010s
-243309312

real    0m1.744s
user    0m1.700s
sys     0m0.000s
with -O2
-243309312

real    0m1.744s
user    0m1.710s
sys     0m0.000s
-243309312

real    0m1.687s
user    0m1.680s
sys     0m0.000s
with -O3
-243309312

real    0m1.753s
user    0m1.720s
sys     0m0.000s
-243309312

real    0m1.701s
user    0m1.700s
sys     0m0.000s
Sums in Haskell
First compiled with -O2, then with -O2 -fvia-C -optc-O3
Using uvector
-243309312

real    0m7.412s
user    0m7.290s
sys     0m0.000s
-243309312

real    0m5.726s
user    0m5.650s
sys     0m0.000s
Loop down with BangPatterns
-243309312

real    0m4.789s
user    0m4.750s
sys     0m0.010s
-243309312

real    0m4.561s
user    0m4.470s
sys     0m0.000s
Loop down without BangPatterns
-243309312

real    0m5.092s
user    0m4.890s
sys     0m0.000s
-243309312

real    0m4.747s
user    0m4.540s
sys     0m0.010s
Loop up (with BangPatterns)
-243309312

real    0m5.511s
user    0m5.320s
sys     0m0.000s
-243309312

real    0m4.449s
user    0m4.410s
sys     0m0.000s
Using strict left fold
-243309312

real    2m45.625s
user    2m41.930s
sys     0m0.260s
-243309312

real    2m43.890s
user    2m41.550s
sys     0m0.280s
Fully naive
-243309312

real    2m45.657s
user    2m42.980s
sys     0m0.250s
-243309312

real    2m42.403s
user    2m40.160s
sys     0m0.370s
Done