[Haskell-cafe] Performance question

Thu Feb 26 12:23:12 EST 2009

vandijk.roel:
> I replaced the standard random number generated with the one from
> mersenne-random. On my system this makes the resulting program about
> 14 times faster than the original. I also made a change to
> accumulateHit because it doesn't need to count to total. That is
> already known.
> 
> {-# LANGUAGE BangPatterns #-}
> 
> import System( getArgs )
> import Data.List( foldl' )
> 
> import System.Random.Mersenne
> 
> pairs :: [a] -> [(a,a)]
> pairs [] = []
> pairs (x:[]) = []
> pairs (x:y:rest) = (x, y) : pairs rest
> 
> isInCircle :: (Double, Double) -> Bool
> isInCircle (x,y) = sqrt (x*x + y*y) <= 1.0
> 
> accumulateHit :: Int -> (Double, Double) -> Int
> accumulateHit (!hits) pair | isInCircle pair = hits + 1
>                            | otherwise       = hits
> 
> countHits :: [(Double, Double)] -> Int
> countHits ps = foldl' accumulateHit 0 ps
> 
> monteCarloPi :: Int -> [(Double, Double)] -> Double
> monteCarloPi n xs = 4.0 * fromIntegral hits / fromIntegral n
> 	where hits = countHits $ take n xs
> 
> main = do
> 	args <- getArgs
> 	let samples = read $ head args
> 
> 	randomNumberGenerator <- getStdGen
> 	randomNumbers <- randoms randomNumberGenerator
> 
> 	let res = monteCarloPi samples $ pairs randomNumbers
> 	putStrLn $ show $ res

But note the lazy list of Double pairs, so the inner loop still looks like this though:

    $wlgo :: Int# -> [(Double, Double)] -> Int

    $wlgo =
      \ (ww_s1pv :: Int#)
        (w_s1px :: [(Double, Double)]) ->
        case w_s1px of wild_aTl {
          [] -> I# ww_s1pv;
          : x_aTp xs_aTq ->
            case x_aTp of wild1_B1 { (x1_ak3, y_ak5) ->
            case x1_ak3 of wild2_aX8 { D# x2_aXa ->
            case y_ak5 of wild3_XYs { D# x3_XYx ->
            case <=##
                   (sqrtDouble#
                      (+##
                         (*## x2_aXa x2_aXa) (*## x3_XYx x3_XYx)))
                   1.0
            of wild4_X1D {
              False -> $wlgo ww_s1pv xs_aTq;
              True -> $wlgo (+# ww_s1pv 1) xs_aTq
            }

while we want to keep everything in registers with something like:

    Int# -> Double# -> Double# -> Int#

So we'll be paying a penalty to force the next elem of the list (instead of
just calling the Double generator).  This definitely has an impact on performance.

    $ ghc-core B.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields

    $ time ./B 10000000                                                                            
    3.1407688
    ./B 10000000  2.41s user 0.01s system 99% cpu 2.415 total

Now, what if we just rewrote that inner loop directly to avoid intermediate stuff? That'd give
us a decent lower bound.

    {-# LANGUAGE BangPatterns #-}

    import System.Environment
    import System.Random.Mersenne

    isInCircle :: Double -> Double -> Bool
    isInCircle x y = sqrt (x*x + y*y) <= 1.0

    countHits :: Int -> IO Int
    countHits lim = do
        g <- newMTGen Nothing
        let go :: Int -> Int -> IO Int
            go !throws !hits
                | throws >= lim  = return hits
                | otherwise = do
                    x <- random g   -- use mersenne-random-pure64 to stay pure!
                    y <- random g
                    if isInCircle x y
                        then go (throws+1) (hits+1)
                        else go (throws+1) hits
        go 0 0

    monteCarloPi :: Int -> IO Double
    monteCarloPi n = do
        hits <- countHits n
        return $ 4.0 * fromIntegral hits / fromIntegral n

    main = do
        [n] <- getArgs
        res <- monteCarloPi (read n)
        print res

And now the inner loop looks like:

      $wa_s1yW :: Int#
                  -> Int#
                  -> State# RealWorld
                  -> (# State# RealWorld, Int #)

Pretty good. Can't avoid the Int boxed return (and resulting heap check) due to use of IO monad. 
But at least does away with heap allocs in the inner loop!

How does it go:

    $ ghc-core A.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields

    $ time ./A 10000000
    3.1412564
    ./A 10000000  0.81s user 0.00s system 99% cpu 0.818 total

Ok. So 3 times faster. Now the goal is to recover the high level version.
We have many tools to employ: switching to mersenne-random-pure64 might help
here. And seeing if you can fuse filling a uvector with randoms, and folding
over it... t

-- Don