[Haskell-cafe] Performance Issue

Thu Feb 26 12:35:34 EST 2009

james.swaine:
> i'm implementing a benchmark which includes a detailed specification for a
> random number generator.  for any of the kernels outlined in the benchmark, i
> might have to generate a set of random numbers R, which has a length n, using
> the following formulas:
> 
> R[k] = ((2^-46)(X[k])) mod 2^46, where
> 
> X[k] = (a^k)s
> 
> where the values of a and s are constant and defined below. 
> many of the kernels in the benchmark require a large number of randoms to be
> generated (in the tens of millions).  when i invoke the following getRandAt
> function that many times to build up a list, evaluation of the list takes
> forever (somewhere between 5 and 10 minutes).  i've tried optimizing this
> several different ways, with no luck.  i though i might post my code here and
> see if anyone notices anything i'm doing wrong that might be causing such a
> large bottleneck:
> 
> --constants
> a :: Int64
> a = 5^13   
> 
> divisor :: Int64
> divisor = 2^46
> 
> multiplier :: Float
> multiplier = 2**(-46)
> 
> 
> --gets r[k], which is the value at the kth
> --position in the overall sequence of
> --pseudorandom numbers
> getRandAt :: Int64 -> Int64 -> Float
> getRandAt 0 seed = multiplier * (fromIntegral seed)
> getRandAt k seed = multiplier * (fromIntegral x_next)
>     where
>         x_prev = (a^k * seed) `mod` divisor
>         x_next = (a * x_prev) `mod` divisor
> 
> thanks all in advance for your help!

Using ghc -O2 --make

There's nothing wrong with this code, really:

    Z.$wgetRandAt :: Int# -> Int# -> Float#

and an inner loop of:

    Z.$w$j :: Int# -> Float#
    Z.$w$j =
      \ (w_sHx :: Int#) ->
        case Z.^1 Z.lit1 Z.lvl of w1_XFs { I64# ww_XFv ->
        case ww_XFv of wild_aFB {
          __DEFAULT ->
            case minBound3 of wild1_aFC { I64# b1_aFE ->
            case ==# w_sHx b1_aFE of wild2_aFG {
              False ->
                case modInt# w_sHx wild_aFB of wild3_aFJ { __DEFAULT ->
                timesFloat#
                  (powerFloat# __float 2.0 __float -46.0)
                  (int2Float# wild3_aFJ)
                };
              True ->
                case wild_aFB of wild3_aFM {
                  __DEFAULT ->
                    case modInt# w_sHx wild3_aFM of wild4_aFN { __DEFAULT ->
                    timesFloat#
                      (powerFloat# __float 2.0 __float -46.0)
                      (int2Float# wild4_aFN)
                    };
                  (-1) ->
                    overflowError
                    `cast` (CoUnsafe (forall a_aFS. a_aFS) Float#
                            :: forall a_aFS. a_aFS ~ Float#)
                }
            }
            };
          0 ->
            divZeroError
            `cast` (CoUnsafe (forall a_aFT. a_aFT) Float#
                    :: forall a_aFT. a_aFT ~ Float#)
        }

Which is just fine.

Inlining those constants explicitly might be a good idea, then we get an outer loop of:

    Z.$wgetRandAt :: Int# -> Int# -> Float#
    Z.$wgetRandAt =
      \ (ww_sHG :: Int#) (ww1_sHK :: Int#) ->
        case ww_sHG of wild_B1 {
          __DEFAULT ->
            case Z.lvl3 of wild1_aEd { I64# x#_aEf ->
            case Z.^ Z.a (I64# wild_B1)
            of wild2_XFe { I64# x#1_XFh ->
            case Z.lvl1 of w_aE7 { I64# ww2_aE9 ->
            case ww2_aE9 of wild3_aFB {
              __DEFAULT ->
                case minBound3 of wild11_aFC { I64# b1_aFE ->
                let {
                  ww3_aFz [ALWAYS Just L] :: Int#

                  ww3_aFz = *# x#1_XFh ww1_sHK } in
                case ==# ww3_aFz b1_aFE of wild21_aFG {
                  False ->
                    case modInt# ww3_aFz wild3_aFB
                    of wild31_aFJ { __DEFAULT ->
                    Z.$w$j (*# x#_aEf wild31_aFJ)
                    };
                  True ->
                    case wild3_aFB of wild31_aFM {
                      __DEFAULT ->
                        case modInt# ww3_aFz wild31_aFM
                        of wild4_aFN { __DEFAULT ->
                        Z.$w$j (*# x#_aEf wild4_aFN)
                        };
                      (-1) ->
                        overflowError
                        `cast` (CoUnsafe (forall a_aFS. a_aFS) Float#
                                :: forall a_aFS. a_aFS ~ Float#)
                    }
          0 ->
            divZeroError
            `cast` (CoUnsafe (forall a_aFT. a_aFT) Float#
                    :: forall a_aFT. a_aFT ~ Float#)
        }
        }
        }
        };
      0 ->
        timesFloat#
          (powerFloat# __float 2.0 __float -46.0)
          (int2Float# ww1_sHK)
    }

which is the fast path, then error / bounds checking.

this looks perfectly acceptable.

What does sound troublesome is using lazy lists .. that's more likely to be the bottleneck.

-- Don