[Haskell-beginners] Performance of Idiomatic lazy Haskell

Daniel Fischer daniel.is.fischer at web.de
Sun Jan 31 13:10:59 EST 2010

Am Sonntag 31 Januar 2010 15:52:41 schrieb Stephen Tetley:
> Hi Daniel
> Thanks - the figures are very impressive for the stream fusion
> library. I knew the paper, but I hadn't looked at it the
> implementation.

Just for the record, the loops (my variant) coded in C take 2.7 seconds 
(icc) resp. 3.0 seconds (gcc) with -O3.
Markus' loop takes 5.32s (icc) resp. 6.04s (gcc) with -O3.
Well, that's largely due to the fact that that takes two divisions per 
iteration (and divisions are very slow on my box).
Rewriting it so that only one division per iteration is necessary:

module Main (main) where

main :: IO ()
main = do putStrLn "EPS: "
          eps <- readLn :: IO Double
          let pi14 = pisum 1 3 False (eps*0.25)
          putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14)

pisum s i b eps
    | d < eps   = s
    | b         = let h = s+d in h `seq` pisum h (i+2) False eps
    | otherwise = let h = s-d in h `seq` pisum h (i+2) True eps
        d = 1/i

it takes 5.42 seconds (eps = 1e-8) when compiled natively, that loop in C 
is practically indistinguishable from my loop variant.

So stream-fusion or hand-coded loops in Haskell are clearly slower than C 
loops, but don't stink when compiled with the native code generator.

However, I get identical to gcc performance with
ghc -O2 -fexcess-precision -fvia-C -optc-O3 
[-optc-ffast-math doesn't make any difference then] for the hand-coded 
loops (all, my variant, Markus' original, Markus' rewritten - of course, 
each identical to the corresponding C-loop).

Also for stream-fusion.

Some of the list creating variants also profit from compilation via C, 
others not. However, none come near the NCG loop performance, let alone C 
loop performance.

To sum up:
Although GHC's native code generator does a not too bad job with 
arithmetic-heavy loops, gcc is better (even without optimisations).
But if you compile via C, you can get identical performance - 
for these loops.
In general, things are more complicated, often the native code generator 
produces better code than the via-C route. It's usually worth a test which 
produces the better binary, native or via-C.

More information about the Beginners mailing list