[Haskell-cafe] Performance question

Thu Feb 26 14:44:38 EST 2009

On Thu, Feb 26, 2009 at 6:23 PM, Don Stewart <dons at galois.com> wrote:
> But note the lazy list of Double pairs, so the inner loop still looks like this though:
>
>    $wlgo :: Int# -> [(Double, Double)] -> Int
>
>    $wlgo =
>      \ (ww_s1pv :: Int#)
>        (w_s1px :: [(Double, Double)]) ->
>        case w_s1px of wild_aTl {
>          [] -> I# ww_s1pv;
>          : x_aTp xs_aTq ->
>            case x_aTp of wild1_B1 { (x1_ak3, y_ak5) ->
>            case x1_ak3 of wild2_aX8 { D# x2_aXa ->
>            case y_ak5 of wild3_XYs { D# x3_XYx ->
>            case <=##
>                   (sqrtDouble#
>                      (+##
>                         (*## x2_aXa x2_aXa) (*## x3_XYx x3_XYx)))
>                   1.0
>            of wild4_X1D {
>              False -> $wlgo ww_s1pv xs_aTq;
>              True -> $wlgo (+# ww_s1pv 1) xs_aTq
>            }
>
> while we want to keep everything in registers with something like:
>
>    Int# -> Double# -> Double# -> Int#
>
> So we'll be paying a penalty to force the next elem of the list (instead of
> just calling the Double generator).  This definitely has an impact on performance.
>
>    $ ghc-core B.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields
>
>    $ time ./B 10000000
>    3.1407688
>    ./B 10000000  2.41s user 0.01s system 99% cpu 2.415 total
>
>
> Now, what if we just rewrote that inner loop directly to avoid intermediate stuff? That'd give
> us a decent lower bound.
>
>    {-# LANGUAGE BangPatterns #-}
>
>    import System.Environment
>    import System.Random.Mersenne
>
>    isInCircle :: Double -> Double -> Bool
>    isInCircle x y = sqrt (x*x + y*y) <= 1.0
>
>    countHits :: Int -> IO Int
>    countHits lim = do
>        g <- newMTGen Nothing
>        let go :: Int -> Int -> IO Int
>            go !throws !hits
>                | throws >= lim  = return hits
>                | otherwise = do
>                    x <- random g   -- use mersenne-random-pure64 to stay pure!
>                    y <- random g
>                    if isInCircle x y
>                        then go (throws+1) (hits+1)
>                        else go (throws+1) hits
>        go 0 0
>
>    monteCarloPi :: Int -> IO Double
>    monteCarloPi n = do
>        hits <- countHits n
>        return $ 4.0 * fromIntegral hits / fromIntegral n
>
>    main = do
>        [n] <- getArgs
>        res <- monteCarloPi (read n)
>        print res
>
> And now the inner loop looks like:
>
>      $wa_s1yW :: Int#
>                  -> Int#
>                  -> State# RealWorld
>                  -> (# State# RealWorld, Int #)
>
> Pretty good. Can't avoid the Int boxed return (and resulting heap check) due to use of IO monad.
> But at least does away with heap allocs in the inner loop!
>
> How does it go:
>
>    $ ghc-core A.hs -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=core2 -funbox-strict-fields
>
>    $ time ./A 10000000
>    3.1412564
>    ./A 10000000  0.81s user 0.00s system 99% cpu 0.818 total
>
> Ok. So 3 times faster. Now the goal is to recover the high level version.
> We have many tools to employ: switching to mersenne-random-pure64 might help
> here. And seeing if you can fuse filling a uvector with randoms, and folding
> over it... t
>
> -- Don
>

Very nice! I also wrote a naive version which used uvector but it was
about twice as slow as the original Haskell version. I wanted to write
"lengthU . filterU isInCircle" because that clearly expresses the
algorithm. Sadly I was at work and didn't have time for profile the
program to see what was wrong. Still, I couldn't resist having a go at
the problem :-)