Removing/deprecating -fvia-c

David Terei davidterei at gmail.com
Thu Feb 18 04:09:31 EST 2010


Don Stewart wrote:
> Here's an example that doesn't use floating point:
>
> import Data.Array.Vector
> import Data.Bits
>
> main = print . sumU $ zipWith3U (\x y z -> x * y * z)
> (enumFromToU 1 (100000000 :: Int))
> (enumFromToU 2 (100000001 :: Int))
> (enumFromToU 7 (100000008 :: Int))
>
> In core:
>
> main_$s$wfold :: Int# -> Int# -> Int# -> Int# -> Int#
> main_$s$wfold =
> \ (sc_s1l1 :: Int#)
> (sc1_s1l2 :: Int#)
> (sc2_s1l3 :: Int#)
> (sc3_s1l4 :: Int#) ->
> case># sc2_s1l3 100000000 of _ {
> False ->
> case># sc1_s1l2 100000001 of _ {
> False ->
> case># sc_s1l1 100000008 of _ {
> False ->
> main_$s$wfold
> (+# sc_s1l1 1)
> (+# sc1_s1l2 1)
> (+# sc2_s1l3 1)
> (+#
> sc3_s1l4 (*# (*# sc2_s1l3 sc1_s1l2) sc_s1l1));
> True -> sc3_s1l4
> };
> True -> sc3_s1l4
> };
> True -> sc3_s1l4
> }
>
> Rather nice!
>
> -fvia-C -optc-O3
>
> Main_mainzuzdszdwfold_info:
> cmpq $100000000, %rdi
> jg .L6
> cmpq $100000001, %rsi
> jg .L6
> cmpq $100000008, %r14
> jle .L10
> .L6:
> movq %r8, %rbx
> movq (%rbp), %rax
> jmp *%rax
> .L10:
> movq %rsi, %r10
> leaq 1(%rsi), %rsi
> imulq %rdi, %r10
> leaq 1(%rdi), %rdi
> imulq %r14, %r10
> leaq 1(%r14), %r14
> leaq (%r10,%r8), %r8
> jmp Main_mainzuzdszdwfold_info
>
> Which looks ok.
>
> $ time ./zipwith3
> 3541230156834269568
> ./zipwith3 0.33s user 0.00s system 99% cpu 0.337 total
>
> And -fasm we get very different code, and a bit of a slowdown:
>
> Main_mainzuzdszdwfold_info:
> .Lc1mo:
> cmpq $100000000,%rdi
> jg .Lc1mq
> cmpq $100000001,%rsi
> jg .Lc1ms
> cmpq $100000008,%r14
> jg .Lc1mv
>
> movq %rsi,%rax
> imulq %r14,%rax
> movq %rdi,%rcx
> imulq %rax,%rcx
> movq %r8,%rax
> addq %rcx,%rax
> leaq 1(%rdi),%rcx
> leaq 1(%rsi),%rdx
> incq %r14
> movq %rdx,%rsi
> movq %rcx,%rdi
> movq %rax,%r8
> jmp Main_mainzuzdszdwfold_info
>
> .Lc1mq:
> movq %r8,%rbx
> jmp *(%rbp)
> .Lc1ms:
> movq %r8,%rbx
> jmp *(%rbp)
> .Lc1mv:
> movq %r8,%rbx
> jmp *(%rbp)
>
> Slower:
>
> $ time ./zipwith3
> 3541230156834269568
> ./zipwith3 0.38s user 0.00s system 98% cpu 0.384 total
>
> Now maybe we need to wait on the new backend optimizations to get there?

I've been looking at some of these cases and seeing how the LLVM 
back-end performs. My general impression from benchmarking the LLVM 
back-end in the past has been that it generally performs with similar 
characteristics as the C code generator (that is, where the C code 
generator stood out compared to the NCG, so did LLVM).

(On x86-32/Mac OS X 10.5, had some issues getting x64 working at moment):

./zipWith3_viac 0.72s
./zipWith3_fasm 0.65s
./zipWith3_llvm 0.38s

Code that LLVM produces:

_Main_mainzuzdszdwfold_entry:
## BB#0: ## %c1qP
subl $12, %esp
jmp LBB2_1
.align 4, 0x90
LBB2_4: ## %n1re
## Loop Depth 1
## Loop Header is BB2_1
## Inner Loop
movl %ecx, %esi
incl %ecx
imull %eax, %esi
incl %eax
imull %edx, %esi
incl %edx
addl (%ebp), %esi
movl %edx, 12(%ebp)
movl %ecx, 8(%ebp)
movl %eax, 4(%ebp)
movl %esi, (%ebp)
LBB2_1: ## %tailrecurse
## Loop Depth 1
## Loop Header
## Inner Loop
movl 4(%ebp), %eax
cmpl $100000000, %eax
jg LBB2_5
## BB#2: ## %n1qX
## Loop Depth 1
## Loop Header is BB2_1
## Inner Loop
movl 8(%ebp), %ecx
cmpl $100000001, %ecx
jg LBB2_5
## BB#3: ## %n1r5
## Loop Depth 1
## Loop Header is BB2_1
## Inner Loop
movl 12(%ebp), %edx
cmpl $100000008, %edx
jle LBB2_4
LBB2_5: ## %c1qW
movl 16(%ebp), %eax
movl (%ebp), %esi
addl $16, %ebp
movl (%eax), %eax
addl $12, %esp
jmpl *%eax # TAILCALL

Which is very nice. (The comments in the code are inserted by LLVM, not me).

I also ran through some of the programs outlined here:

http://permalink.gmane.org/gmane.comp.lang.haskell.glasgow.user/18151

All ran with 'echo '1e-8' | ./$PRG'.

Loop.hs:
========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

main :: IO ()
main = do
putStrLn "EPS: "
eps <- readLn :: IO Double
let !mx = (4/eps)
!pi14 = pisum mx
putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14)

pisum :: Double -> Double
pisum cut = go True 1 0
where
go b n s | cut < n = if b then s+1/(2*n) else s-1/(2*n)
go True n !s = go False (n+2) (s+recip n)
go False n !s = go True (n+2) (s-recip n)
========================================

./Loops_fasm 4.53s
./Loops_viac 4.22s
./Loops_llvm 2.89s


Fusion.hs (uses stream-fusion package)
========================================
module Main (main) where

import qualified Data.List.Stream as S

main :: IO ()
main = do
putStrLn "EPS: "
eps <- readLn :: IO Double
let !mx = floor (4/eps)
!k = (mx+1) `quot` 2
putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ S.sum $ S.take n step

step :: [Double]
step = S.unfoldr phi (True,1) where
phi (sig,d) | sig = Just (1/d, (False,d+2))
| otherwise = Just (negate (1/d), (True,d+2))
========================================

./Fusion_fasm 4.61s
./Fusion_viac 4.22s
./Fusion_llvm 3.62s


List.hs
========================================
module Main (main) where

import Data.List (unfoldr)

main :: IO ()
main = do
putStrLn "EPS: "
eps <- readLn :: IO Double
let mx = floor (4/eps)
!k = (mx+1) `quot` 2
putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ sum $ take n step

step :: [Double]
step = unfoldr phi (True,1) where
phi (sig,d) | sig = Just (1/d, (False,d+2))
| otherwise = Just (negate (1/d), (True,d+2))
========================================

./List_fasm 18.21s
./List_viac 16.71s
./List_llvm 16.92s

So with these kinds of results (obviously I'm biased though since I 
wrote the llvm back-end) I think the sentiment that the -fvia-C approach 
should be eventually removed is the right way to go since with the LLVM 
back-end and the new code generator there is a promising and much more 
interesting future.

~ David


More information about the Glasgow-haskell-users mailing list