Removing/deprecating -fvia-c

Mon Feb 15 15:20:57 EST 2010

Am Montag 15 Februar 2010 17:37:55 schrieb Simon Marlow:
> On 14/02/2010 17:58, Don Stewart wrote:
> > igloo:
> >> Hi all,
> >>
> >> We are planning to remove the -fvia-c way of compiling code
> >> (unregisterised compilers will continue to compile via C only, but
> >> registerised compilers will only use the native code generator).
> >> We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in
> >> 6.16.
> >>
> >> Simon Marlow has recently fixed FP performance for modern x86 chips
> >> in the native code generator in the HEAD. That was the last reason we
> >> know of to prefer via-C to the native code generators. But before we
> >> start the removal process, does anyone know of any other problems
> >> with the native code generators that need to be fixed first?
> >
> > Do we have the blessing of the DPH team, wrt. tight, numeric inner
> > loops?
> >
> > As recently as last year -fvia-C -optc-O3 was still useful for some
> > microbenchmarks -- what's changed in that time, or is expected to
> > change?
>
> If you have benchmarks that show a significant difference, I'd be
> interested to see them.

I have a benchmark (or a couple) from the Beginners mailing list two weeks 
ago (thread starting in January at 
http://www.haskell.org/pipermail/beginners/2010-January/003356.html and 
continued in February at 
http://www.haskell.org/pipermail/beginners/2010-February/003373.html ff) 
which show a significant difference.

Loop.hs:
========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx = (4/eps)
        !pi14 = pisum mx
    putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14)

pisum :: Double -> Double
pisum cut = go True 1 0
      where
        go b n s | cut < n = if b then s+1/(2*n) else s-1/(2*n)
        go True n !s = go False (n+2) (s+recip n)
        go False n !s = go True (n+2) (s-recip n)
========================================

$ echo '1e-8' | time ./Loop

ghc -O2 --make:
4.53s
ghc -O2 -fexcess-precision --make:
4.54s
ghc -O2 -fvia-C -optc-O3 --make:
7.52s
ghc -O2 -fvia-C -optc-O3 -optc-ffast-math --make:
7.53s
ghc -O2 -fvia-C -optc-O3 -optc-ffast-math -optc-fno-float-store --make:
3.02s
ghc -O2 -fvia-C -optc-O3 -optc-fno-float-store --make:
3.02s
ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make:
3.02s

The loop coded in C and compiled with gcc -O3 [-ffast-math, -fno-float-
store, -msse2 make no difference there] also takes 3.02s (gcc-4.3.2), 2.70s 
with icc -O3 (icc 11.0).

It is probably worth pointing out, however, that on Markus Böhm's box 
running Windows XP, the native code generator produced better code than the 
via-C route (NCG code was faster there than on my box [openSUSE 11.1], 
while -O2 -fexcess-precision -fvia-C -optc-O3 on his box was slower than 
NCG on mine).

Similar results for

Fusion.hs (uses stream-fusion package)
========================================
module Main (main) where

import qualified Data.List.Stream as S

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ S.sum $ S.take n step

step :: [Double]
step = S.unfoldr phi (True,1) where
   phi (sig,d) | sig         = Just (1/d, (False,d+2))
               | otherwise   = Just (negate (1/d), (True,d+2))
========================================

ghc -O2 [-fexcess-precision] --make:
4.22s
ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make:
3.02s

Using lists instead of loops,

List.hs
========================================
module Main (main) where

import Data.List (unfoldr)

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ sum $ take n step

step :: [Double]
step = unfoldr phi (True,1) where
   phi (sig,d) | sig         = Just (1/d, (False,d+2))
               | otherwise   = Just (negate (1/d), (True,d+2))
========================================

things are much slower, 23.60s vs. 18.15s, but the via-C route is again 
significantly faster.

>
> What I've done for 6.14.1 is to add the -msse2 flag to the x86 backend,
> so where previously we had to use -fvia-C -fexcess-precision -optc-O3
> etc. to get reasonable floating point performance, now we can use -msse2
> with the native code gen and get about the same results.

Can I test whether I get about the same results as with -fvia-C ... for the 
above?
I.e., is it in the HEAD, and would I have to pass -msse2 on the command 
line or is it implied by -O2 already?

>
> In the future we have a couple of ways that things could get better:
>
>   1. The new back-end, which eventually will incorporate more
>      optimisations at the C-- level, and potentially could produce
>      good loop code.  It will also free up some registers.
>
>   2. Compiling via LLVM.
>
> Dropping the C backend will give us more flexibility with calling
> conventions, letting us use more of the x86 registers for passing
> arguments.  We can only make this change by removing -fvia-C, though.
> There's low hanging fruit here particularly for the x86 backend, as soon
> as we drop -fvia-C.
>
> There are other reasons to want to get rid of -fvia-C:
>
>   - it doubles the testing surface
>
>   - it's associated with a bucketload of grotesque Perl 4 code and
>     gcc-specific hacks in the RTS headers.
>
> Cheers,
> 	Simon