[Haskell-cafe] Performance question
Roel van Dijk
vandijk.roel at gmail.com
Thu Feb 26 14:44:38 EST 2009
On Thu, Feb 26, 2009 at 6:23 PM, Don Stewart <dons at galois.com> wrote:
> But note the lazy list of Double pairs, so the inner loop still looks like this though:
>
> $wlgo :: Int# -> [(Double, Double)] -> Int
>
> $wlgo =
> \ (ww_s1pv :: Int#)
> (w_s1px :: [(Double, Double)]) ->
> case w_s1px of wild_aTl {
> [] -> I# ww_s1pv;
> : x_aTp xs_aTq ->
> case x_aTp of wild1_B1 { (x1_ak3, y_ak5) ->
> case x1_ak3 of wild2_aX8 { D# x2_aXa ->
> case y_ak5 of wild3_XYs { D# x3_XYx ->
> case <=##
> (sqrtDouble#
> (+##
> (*## x2_aXa x2_aXa) (*## x3_XYx x3_XYx)))
> 1.0
> of wild4_X1D {
> False -> $wlgo ww_s1pv xs_aTq;
> True -> $wlgo (+# ww_s1pv 1) xs_aTq
> }
>
> while we want to keep everything in registers with something like:
>
> Int# -> Double# -> Double# -> Int#
>
> So we'll be paying a penalty to force the next elem of the list (instead of
> just calling the Double generator). This definitely has an impact on performance.
>
> $ ghc-core B.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields
>
> $ time ./B 10000000
> 3.1407688
> ./B 10000000 2.41s user 0.01s system 99% cpu 2.415 total
>
>
> Now, what if we just rewrote that inner loop directly to avoid intermediate stuff? That'd give
> us a decent lower bound.
>
> {-# LANGUAGE BangPatterns #-}
>
> import System.Environment
> import System.Random.Mersenne
>
> isInCircle :: Double -> Double -> Bool
> isInCircle x y = sqrt (x*x + y*y) <= 1.0
>
> countHits :: Int -> IO Int
> countHits lim = do
> g <- newMTGen Nothing
> let go :: Int -> Int -> IO Int
> go !throws !hits
> | throws >= lim = return hits
> | otherwise = do
> x <- random g -- use mersenne-random-pure64 to stay pure!
> y <- random g
> if isInCircle x y
> then go (throws+1) (hits+1)
> else go (throws+1) hits
> go 0 0
>
> monteCarloPi :: Int -> IO Double
> monteCarloPi n = do
> hits <- countHits n
> return $ 4.0 * fromIntegral hits / fromIntegral n
>
> main = do
> [n] <- getArgs
> res <- monteCarloPi (read n)
> print res
>
> And now the inner loop looks like:
>
> $wa_s1yW :: Int#
> -> Int#
> -> State# RealWorld
> -> (# State# RealWorld, Int #)
>
> Pretty good. Can't avoid the Int boxed return (and resulting heap check) due to use of IO monad.
> But at least does away with heap allocs in the inner loop!
>
> How does it go:
>
> $ ghc-core A.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields
>
> $ time ./A 10000000
> 3.1412564
> ./A 10000000 0.81s user 0.00s system 99% cpu 0.818 total
>
> Ok. So 3 times faster. Now the goal is to recover the high level version.
> We have many tools to employ: switching to mersenne-random-pure64 might help
> here. And seeing if you can fuse filling a uvector with randoms, and folding
> over it... t
>
> -- Don
>
Very nice! I also wrote a naive version which used uvector but it was
about twice as slow as the original Haskell version. I wanted to write
"lengthU . filterU isInCircle" because that clearly expresses the
algorithm. Sadly I was at work and didn't have time for profile the
program to see what was wrong. Still, I couldn't resist having a go at
the problem :-)
More information about the Haskell-Cafe
mailing list