[Haskell-cafe] Stacking State on State.....

Sun Mar 1 16:46:47 EST 2009

Am Sonntag, 1. März 2009 21:03 schrieb Phil:
> Hi,
>
> Thanks for the replies - I haven't had a chance to try out everything
> suggested yet - but your explanations of transformers nailed it for me.
>
> However, in terms of performance when stacking, I've come across something
> I'm struggling to explain - I was wondering if anyone could offer up and
> explanation.
> I've rewritten my code twice - one with 3 stacked monads, and one with 2
> stacked monads and a load of maps.  Heuristically I would have thought the
> 3 stacked monads would have performed as well, or even better than the 2
> stacked solution, but the 2 stacked solution is MUCH faster and MUCH less
> memory is used.  They are both using 90% of the same code and both chain
> together the same number of computations using replicateM.

Not quite, the triple stack uses replicateM_, whihc should be a little 
cheaper.

>  From profiling
> I can see that the pure function 'reflect' takes up most of the umph in
> both cases - which I'd expect.  But in the triple stacked version the
> garbage collector is using up >90% of the time.
>
> I've tried using BangPatterns to reduce memory usage in the Triple Stack
> version - doing this I can half the time it takes, but it is still running
> at over twice the time of the two stack version.  The BangPatterns were
> also put in Common Code in the reflect function - so I'd expect both
> solutions to need them?

One thing that helps much is to use

import Control.Monad.State.Strict

Using the default lazy State monad, you build enormous thunks in the states, 
which harms the triple stack even more than the double stack.
With the strict State monad (and a strict left fold instead of foldr in the 
double stack), I get

./v6tripleStrict +RTS -sstderr -K16M
10.450674088955589
444,069,720 bytes allocated in the heap
234,808,472 bytes copied during GC (scavenged)
 30,504,688 bytes copied during GC (not scavenged)
 41,074,688 bytes maximum residency (9 sample(s))

        786 collections in generation 0 ( 21.03s)
          9 collections in generation 1 (  2.54s)

        106 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.05s  (  4.21s elapsed)
  GC    time   23.57s  ( 24.18s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   27.62s  ( 28.40s elapsed)

  %GC time      85.3%  (85.2% elapsed)

  Alloc rate    109,646,844 bytes per MUT second

  Productivity  14.7% of total user, 14.3% of total elapsed

./v6doubleStrict +RTS -sstderr
10.450674088955592
388,795,972 bytes allocated in the heap
177,748,228 bytes copied during GC (scavenged)
 23,953,900 bytes copied during GC (not scavenged)
 44,560,384 bytes maximum residency (9 sample(s))

        710 collections in generation 0 ( 11.62s)
          9 collections in generation 1 (  3.03s)

         94 Mb total memory in use

  INIT  time    0.01s  (  0.00s elapsed)
  MUT   time   13.54s  ( 13.91s elapsed)
  GC    time   14.65s  ( 15.02s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   28.20s  ( 28.93s elapsed)

  %GC time      52.0%  (51.9% elapsed)

  Alloc rate    28,693,429 bytes per MUT second

  Productivity  48.0% of total user, 46.8% of total elapsed

So, yes, the triple stack uses more memory, but not terribly much more.
However, it spends much more time gc'ing, but as its MUT time is much less, 
the total times are not much different.

Now,if we give them enough heap space to begin with:

./v6tripleStrict +RTS -sstderr -K16M -H192M
10.450674088955589
444,077,972 bytes allocated in the heap
 95,828,976 bytes copied during GC (scavenged)
 15,441,936 bytes copied during GC (not scavenged)
 36,147,200 bytes maximum residency (2 sample(s))

          5 collections in generation 0 (  2.16s)
          2 collections in generation 1 (  0.43s)

        185 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.20s  (  4.55s elapsed)
  GC    time    2.59s  (  2.74s elapsed)
  EXIT  time    0.00s  (  0.95s elapsed)
  Total time    6.79s  (  7.29s elapsed)

  %GC time      38.1%  (37.6% elapsed)

  Alloc rate    105,732,850 bytes per MUT second

  Productivity  61.9% of total user, 57.6% of total elapsed

./v6doubleStrict +RTS -sstderr -K16M -H192M
10.450674088955592
388,806,408 bytes allocated in the heap
 46,446,680 bytes copied during GC (scavenged)
     77,852 bytes copied during GC (not scavenged)
    159,744 bytes maximum residency (2 sample(s))

          4 collections in generation 0 (  1.36s)
          2 collections in generation 1 (  0.03s)

        182 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.53s  (  5.11s elapsed)
  GC    time    1.39s  (  1.44s elapsed)
  EXIT  time    0.00s  (  0.02s elapsed)
  Total time    5.92s  (  6.55s elapsed)

  %GC time      23.5%  (21.9% elapsed)

  Alloc rate    85,829,229 bytes per MUT second

  Productivity  76.5% of total user, 69.2% of total elapsed

MUCH better. 
I have no idea why the MUT time for the double stack decreases so much, 
though.

>
> Even though both pieces of code are a bit untidy, the triple stacked monad
> 'feels' nicer to me - everything is encapsulated away and one evaluation in
> main yields the result.  From purely a design perspective I prefer it - but
> obviously not if it runs like a dog!
>
> Any ideas why the triple stack runs so slow?

It suffers horribly from laziness. One thing is the lazy State monad, another 
is the implementation of mc.

>
> Thanks again!
>
> Phil
>
> ***************** Triple Stack Specific Impl:
>
> type MonteCarloStateT = StateT Double
>
> mc :: MonteCarloStateT BoxMullerQuasiState ()
> mc = StateT $ \s -> do nextNormal <- generateNormal
>                        let stochastic = 0.2*1*nextNormal
>                        let drift = 0.05 - (0.5*(0.2*0.2))*1
>                        let newStockSum = payOff 100 ( 100 * exp ( drift +
> stochastic ) ) + s
>                        return ((),newStockSum)

Don't use a new let on each line, have it all in one let-block.
And, please, force the evaluation of newStockSum:

mc :: MonteCarloStateT BoxMullerQuasiState ()
mc = StateT $ \s -> do
    nextNormal <- generateNormal
    let stochastic = 0.2*1*nextNormal
        drift = 0.05 - (0.5*(0.2*0.2))*1
        !newStockSum = payOff 100 ( 100 * exp ( drift + stochastic ) ) + s
    return ((),newStockSum)

Now:

./v8tripleStrict +RTS -sstderr
10.450674088955589
396,391,172 bytes allocated in the heap
     65,252 bytes copied during GC (scavenged)
     22,272 bytes copied during GC (not scavenged)
     40,960 bytes maximum residency (1 sample(s))

        757 collections in generation 0 (  0.02s)
          1 collections in generation 1 (  0.00s)

          1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    3.36s  (  3.49s elapsed)
  GC    time    0.02s  (  0.05s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.38s  (  3.54s elapsed)

  %GC time       0.6%  (1.3% elapsed)

  Alloc rate    117,973,563 bytes per MUT second

  Productivity  99.4% of total user, 94.9% of total elapsed

w00t!

>
> iterations = 1000000
> main :: IO()
> main = do let sumOfPayOffs = evalState ( evalStateT ( execStateT (do
> replicateM_ iterations mc) $ 0 ) $ (Nothing,nextHalton) ) $ (1,[3,5])
>           let averagePO = sumOfPayOffs / fromIntegral iterations
>           let discountPO = averagePO * exp (-0.05)
>           print discountPO
>

Again, don't needlessly multiply the lets.

>
> ***************** Double Stack and Map Specific Impl:
>
>
> iterations = 1000000
> main :: IO()
> main = do let normals = evalState ( evalStateT (do replicateM iterations
> generateNormal) $ (Nothing,nextHalton) ) $ (1,[3,5])
>           let stochastic = map (0.2*1*) normals
>           let sde = map ((( 0.05 - (0.5*(0.2*0.2)) )*1)+) stochastic
>           let expiryMult = map exp sde
>           let expiry = map (100*) expiryMult
>           let payoff = map (payOff 100) expiry
>           let averagePO = (foldr (+) 0 payoff) / fromIntegral iterations
>           let discountPO = averagePO * exp (-0.05)
>           print discountPO
>

Same here, but important for performance is to replace the foldr with foldl'.

>
> ***************** Common Code Used By Both Methods:
>
>
> import Control.Monad.State
> import Debug.Trace
>
> -- State Monad for QRNGs - stores current iteration and list of
> -- bases to compute
> type QuasiRandomState = State (Int,[Int])
>
> nextHalton :: QuasiRandomState [Double]
> nextHalton = do (n,bases) <- get
>                 let !nextN = n+1
>             put (nextN,bases)
>             return $ map (reflect (n,1,0)) bases
>
> type ReflectionThreadState = (Int,Double,Double)
>
> reflect :: ReflectionThreadState -> Int -> Double
> reflect (k,f,h) base
>
>   | k <= 0 = h
>   | otherwise = reflect (newK,newF,newH) base
>
>       where
>         newK = k `div` base
>         newF = f / fromIntegral base
>         newH = h + fromIntegral(k `mod` base) * newF
>
> -- So we are defining a state transform which has state of 'maybe double'
> and an
> -- operating function for the inner monad of type QuasiRandomMonad
> returning a [Double]
> -- We then say that it wraps an QuasiRandomMonad (State Monad) - it must of
> course
> -- if we pass it a function that operates on these Monads we must wrap the
> same
> -- type of Monad.  And finally it returns a double
>
> type BoxMullerStateT = StateT (Maybe Double, QuasiRandomState [Double])
> type BoxMullerQuasiState = BoxMullerStateT QuasiRandomState
>
> generateNormal :: BoxMullerQuasiState Double
> generateNormal = StateT $ \s -> case s of
>                 (Just d,qrnFunc) -> return (d,(Nothing,qrnFunc))
>                 (Nothing,qrnFunc) -> do qrnBaseList <- qrnFunc
>                                     let (norm1,norm2) = boxMuller (head
> qrnBaseList) (head $ tail qrnBaseList)
>                                     return (norm1,(Just norm2,qrnFunc))
>
> boxMuller :: Double -> Double -> (Double,Double)
> -- boxMuller rn1 rn2 | trace ( "rn1 " ++ show rn1 ++ " rn2 " ++ show rn2 )
> False=undefined
> boxMuller rn1 rn2 = (normal1,normal2)
>   where
>     r        = sqrt ( (-2)*log rn1)
>     twoPiRn2 = 2 * pi * rn2
>     normal1  = r * cos ( twoPiRn2 )
>     normal2  = r * sin ( twoPiRn2 )
>
>
>
> payOff :: Double -> Double -> Double
> payOff strike stock | (stock - strike) > 0 = stock - strike
>
>                     | otherwise = 0
>

Cheers,
Daniel