[Haskell-cafe] speed: ghc vs gcc
Daniel Fischer
daniel.is.fischer at web.de
Fri Feb 20 15:56:44 EST 2009
Am Freitag, 20. Februar 2009 18:10 schrieb Bulat Ziganshin:
> Hello Don,
>
> Friday, February 20, 2009, 7:41:33 PM, you wrote:
> >> main = print $ sum[1..10^9::Int]
> >
> > This won't be comparable to your loop below, as 'sum' is a left fold
> > (which doesn't fuse under build/foldr).
> >
> > You should use the list implementation from the stream-fusion package (or
> > uvector) if you're expecting it to fuse to the following loop:
>
> it was comparison of native haskell, low-level haskell (which is
> harder to write than native C)
Um, not always, and certainly not in cases like this, at least if you call the
simple worker loop "low-level Haskell".
> and native C. stream-fusion and any
> other packages provides libraries for some tasks but they can't make faster
> maps, for example. so i used plain list
Which is of course comparable with a tight loop in C, isn't it?
Really, you hurt your cause by including that.
You said you wanted to compare ghc to gcc, then compare what they do to
comparable code.
>
> > Which seems ... OK.
>
> really? :D
>
> > Well, that's a bit different. It doesn't print the result, and it returns
> > a different results on 64 bit....
>
> doesn't matter for testing speed
>
> > I don't get anything near the 0.062s which is interesting.
>
> it was beautiful gcc optimization - it added 8 values at once. with
> xor results are:
>
> xor.hs 12.605
> xor-fast.hs 1.856
> xor.cpp 0.339
>
> > The print statement slows things down, I guess...
>
> are you really believe that printing one number needs so much time? :)
>
> > So we have:
> >
> > ghc -fvia-C -O2 1.127
> > ghc -fasm 1.677
> > gcc -O0 4.500
> > gcc -O3 -funroll-loops 0.318
>
> why not compare to ghc -O0? also you can disable loop unrolling in gcc
> and unroll loops manually in haskell. or you can generate asm code on
> the fly. there are plenty of tricks to "prove" that gcc generates bad
> code :D
That's not what he's doing at all. Sure, he's not comparing Haskell code
compiled without optimisations, but he also includes gcc with highest
optimisation level. Read the gcc -O0 figure as an indication of what
optimisations can achieve.
>
> > So. some lessons. GHC is around 3-4x slower on this tight loop. (Which
> > isn't as bad as it used to be).
>
> really? what i see: low-level haskell code is usually 3 times harder
> to write and 3 times slower than gcc code.
I deny that low-level Haskell code is three times harder to write than
ordinary C code, at least if we consider the worker/wrapper idiom low-level
Haskell.
It is also my experience that gcc usually creates faster executables from good
C code than ghc does from corresponding ordinary Haskell code (not using
#-magic), but the margin does vary wildly.
> native haskell code is tens
> to thousands times slower than C code (just recall that real programs
> use type classes and monads in addition to laziness)
Okay, tens is realistic, but thousands?
Of course if you compare a tight loop that doesn't allocate to creating
thousands of millions of cons-cells...
Just because lists are easier to use in Haskell than in any other language I
know doesn't mean it's necessary to use lists when writing Haskell if other
ways are more fitting for the goal.
Just for the record, timings on my machine, gcc-3.3 vs. ghc-6.8.3:
$ ./runtests
Sums in C, first counting up, then down
with -O0
-243309312
real 0m6.751s
user 0m6.660s
sys 0m0.020s
-243309312
real 0m6.318s
user 0m6.190s
sys 0m0.000s
with -O1
-243309312
real 0m2.533s
user 0m2.530s
sys 0m0.010s
-243309312
real 0m1.744s
user 0m1.700s
sys 0m0.000s
with -O2
-243309312
real 0m1.744s
user 0m1.710s
sys 0m0.000s
-243309312
real 0m1.687s
user 0m1.680s
sys 0m0.000s
with -O3
-243309312
real 0m1.753s
user 0m1.720s
sys 0m0.000s
-243309312
real 0m1.701s
user 0m1.700s
sys 0m0.000s
Sums in Haskell
First compiled with -O2, then with -O2 -fvia-C -optc-O3
Using uvector
-243309312
real 0m7.412s
user 0m7.290s
sys 0m0.000s
-243309312
real 0m5.726s
user 0m5.650s
sys 0m0.000s
Loop down with BangPatterns
-243309312
real 0m4.789s
user 0m4.750s
sys 0m0.010s
-243309312
real 0m4.561s
user 0m4.470s
sys 0m0.000s
Loop down without BangPatterns
-243309312
real 0m5.092s
user 0m4.890s
sys 0m0.000s
-243309312
real 0m4.747s
user 0m4.540s
sys 0m0.010s
Loop up (with BangPatterns)
-243309312
real 0m5.511s
user 0m5.320s
sys 0m0.000s
-243309312
real 0m4.449s
user 0m4.410s
sys 0m0.000s
Using strict left fold
-243309312
real 2m45.625s
user 2m41.930s
sys 0m0.260s
-243309312
real 2m43.890s
user 2m41.550s
sys 0m0.280s
Fully naive
-243309312
real 2m45.657s
user 2m42.980s
sys 0m0.250s
-243309312
real 2m42.403s
user 2m40.160s
sys 0m0.370s
Done
More information about the Haskell-Cafe
mailing list