[Haskell-cafe] Re: speed: ghc vs gcc
Don Stewart
dons at galois.com
Fri Feb 20 16:19:49 EST 2009
claus.reinke:
> Concrete examples always help, thanks.
>
> In simple enough situations, one can roll one's own loop unrolling;),
> somewhat like shown below (worker/wrapper split to bring the function
> parameter representing the loop body into scope, then template haskell
> to unroll its applications syntactically, then optimization by
> transformation
> to get rid of the extra code). It is all rather more complicated than one
> would like it to be, what with TH scoping restrictions and all, but
> perhaps a library of self-unrolling loop combinators along these lines
> might help, as a workaround until ghc does its own unrolling.
>
> Claus
>
> {-# LANGUAGE TemplateHaskell #-}
> module Apply where
> import Language.Haskell.TH.Syntax
> apply i bound | i<bound = [| \f x -> $(apply (i+1) bound) f (f i x) |]
> | otherwise = [| \f x -> x |]
>
> {-# LANGUAGE CPP #-}
> {-# LANGUAGE TemplateHaskell #-}
> {-# LANGUAGE BangPatterns #-}
> {-# OPTIONS_GHC -DN=8 -ddump-splices #-}
> module Main(main) where
> import Apply
> main = print $ loopW 1 (10^9) body 0
>
> {-# INLINE loopW #-}
> loopW :: Int -> Int -> (Int -> Int -> Int) -> Int -> Int
> loopW i max body acc = loop i acc
> where
> loop :: Int -> Int -> Int
> loop !i !acc | i+N<=max = loop (i+N) ($(apply (0::Int) N) (\j acc->body (i+j) acc) acc)
> {-
> loop !i !acc | i+8<=max = loop (i+8) ( body (i+7)
> $ body (i+6)
> $ body (i+5)
> $ body (i+4)
> $ body (i+3)
> $ body (i+2)
> $ body (i+1)
> $ body i acc)
> -}
> loop !i !acc | i<=max = loop (i+1) (body i acc)
> | otherwise = acc
>
> body :: Int -> Int -> Int
> body !i !acc = i+acc
>
Great thinking! This is EXTREMELY COOL!
Main.hs:15:42-57: Splicing expression
let
apply = apply
$dOrd = GHC.Base.$f1
$dNum = GHC.Num.$f6
$dLift = Language.Haskell.TH.Syntax.$f18
in apply (0 :: Int) 8
======>
\ f[a1KU] x[a1KV]
-> \ f[a1KW] x[a1KX]
-> \ f[a1KY] x[a1KZ]
-> \ f[a1L0] x[a1L1]
-> \ f[a1L2] x[a1L3]
-> \ f[a1L4] x[a1L5]
-> \ f[a1L6] x[a1L7]
-> \ f[a1L8] x[a1L9]
-> \ f[a1La] x[a1Lb] -> x[a1Lb]
f[a1L8] (f[a1L8] 7 x[a1L9])
f[a1L6] (f[a1L6] 6 x[a1L7])
f[a1L4] (f[a1L4] 5 x[a1L5])
f[a1L2] (f[a1L2] 4 x[a1L3])
f[a1L0] (f[a1L0] 3 x[a1L1])
f[a1KY] (f[a1KY] 2 x[a1KZ])
f[a1KW] (f[a1KW] 1 x[a1KX])
f[a1KU] (f[a1KU] 0 x[a1KV])
In the second argument of `loop', namely
`($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc)'
In the expression:
loop
(i + 8) ($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc)
In the definition of `loop':
loop !i !acc
| i + 8 <= max
= loop
(i + 8) ($(apply (0 :: Int) 8) (\ j acc -> body (i + j) acc) acc)
So, that's the fastest yet:
$ time ./Main
500000000500000000
./Main 0.61s user 0.00s system 98% cpu 0.623 total
And within 2x the best GCC was doing,
gcc -O3 -funroll-loops 0.318
If we unroll even further...
$ ghc -O2 -fvia-C -optc-O3 -D64 Main.hs
$ time ./Main
500000000500000000
./Main 0.08s user 0.00s system 94% cpu 0.088 total
Very very nice, Claus!
Now I'm wondering if we can do this via rewrite rules....
-- Don
More information about the Haskell-Cafe
mailing list