[Haskell-cafe] Performance question
Don Stewart
dons at galois.com
Thu Feb 26 12:23:12 EST 2009
vandijk.roel:
> I replaced the standard random number generated with the one from
> mersenne-random. On my system this makes the resulting program about
> 14 times faster than the original. I also made a change to
> accumulateHit because it doesn't need to count to total. That is
> already known.
>
> {-# LANGUAGE BangPatterns #-}
>
> import System( getArgs )
> import Data.List( foldl' )
>
> import System.Random.Mersenne
>
> pairs :: [a] -> [(a,a)]
> pairs [] = []
> pairs (x:[]) = []
> pairs (x:y:rest) = (x, y) : pairs rest
>
> isInCircle :: (Double, Double) -> Bool
> isInCircle (x,y) = sqrt (x*x + y*y) <= 1.0
>
> accumulateHit :: Int -> (Double, Double) -> Int
> accumulateHit (!hits) pair | isInCircle pair = hits + 1
> | otherwise = hits
>
> countHits :: [(Double, Double)] -> Int
> countHits ps = foldl' accumulateHit 0 ps
>
> monteCarloPi :: Int -> [(Double, Double)] -> Double
> monteCarloPi n xs = 4.0 * fromIntegral hits / fromIntegral n
> where hits = countHits $ take n xs
>
> main = do
> args <- getArgs
> let samples = read $ head args
>
> randomNumberGenerator <- getStdGen
> randomNumbers <- randoms randomNumberGenerator
>
> let res = monteCarloPi samples $ pairs randomNumbers
> putStrLn $ show $ res
But note the lazy list of Double pairs, so the inner loop still looks like this though:
$wlgo :: Int# -> [(Double, Double)] -> Int
$wlgo =
\ (ww_s1pv :: Int#)
(w_s1px :: [(Double, Double)]) ->
case w_s1px of wild_aTl {
[] -> I# ww_s1pv;
: x_aTp xs_aTq ->
case x_aTp of wild1_B1 { (x1_ak3, y_ak5) ->
case x1_ak3 of wild2_aX8 { D# x2_aXa ->
case y_ak5 of wild3_XYs { D# x3_XYx ->
case <=##
(sqrtDouble#
(+##
(*## x2_aXa x2_aXa) (*## x3_XYx x3_XYx)))
1.0
of wild4_X1D {
False -> $wlgo ww_s1pv xs_aTq;
True -> $wlgo (+# ww_s1pv 1) xs_aTq
}
while we want to keep everything in registers with something like:
Int# -> Double# -> Double# -> Int#
So we'll be paying a penalty to force the next elem of the list (instead of
just calling the Double generator). This definitely has an impact on performance.
$ ghc-core B.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields
$ time ./B 10000000
3.1407688
./B 10000000 2.41s user 0.01s system 99% cpu 2.415 total
Now, what if we just rewrote that inner loop directly to avoid intermediate stuff? That'd give
us a decent lower bound.
{-# LANGUAGE BangPatterns #-}
import System.Environment
import System.Random.Mersenne
isInCircle :: Double -> Double -> Bool
isInCircle x y = sqrt (x*x + y*y) <= 1.0
countHits :: Int -> IO Int
countHits lim = do
g <- newMTGen Nothing
let go :: Int -> Int -> IO Int
go !throws !hits
| throws >= lim = return hits
| otherwise = do
x <- random g -- use mersenne-random-pure64 to stay pure!
y <- random g
if isInCircle x y
then go (throws+1) (hits+1)
else go (throws+1) hits
go 0 0
monteCarloPi :: Int -> IO Double
monteCarloPi n = do
hits <- countHits n
return $ 4.0 * fromIntegral hits / fromIntegral n
main = do
[n] <- getArgs
res <- monteCarloPi (read n)
print res
And now the inner loop looks like:
$wa_s1yW :: Int#
-> Int#
-> State# RealWorld
-> (# State# RealWorld, Int #)
Pretty good. Can't avoid the Int boxed return (and resulting heap check) due to use of IO monad.
But at least does away with heap allocs in the inner loop!
How does it go:
$ ghc-core A.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields
$ time ./A 10000000
3.1412564
./A 10000000 0.81s user 0.00s system 99% cpu 0.818 total
Ok. So 3 times faster. Now the goal is to recover the high level version.
We have many tools to employ: switching to mersenne-random-pure64 might help
here. And seeing if you can fuse filling a uvector with randoms, and folding
over it... t
-- Don
More information about the Haskell-Cafe
mailing list