[Haskell-cafe] Performance with do notation, mwc-random and unboxed vector

Mon Jun 11 13:25:43 CEST 2012

11.06.2012, 14:17, "Malcolm Wallace" <malcolm.wallace at me.com>:
> that there are no side-effects

There are: the PRNG state is updated in RealWorld, which is why the monadic replicateM is used.

You can add something like

  print $ (VU.!) e 500000

after e is bound and still get 0.057 sec with the do-less version.
This closely matches the performance claimed by the mwc-random package
and seems reasonable, since modern hardware shouldn't have any problem
generating twenty million random variates per second on a single thread.

Your note on laziness would be correct in a case like
------ 8< ------
import qualified Data.Vector.Unboxed as VU
import Data.Functor

import System.Random.MWC
import System.Random.MWC.Distributions (standard)

count = 100000000

main = do
  g <- create
  e <- return $ VU.replicate count (212.8506 :: Double)
  return ()
------ >8 -------
where the unused `e` is truly left unevaluated (you could force it
by matching with `!e`, for example).
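
A minimal sketch of that difference, using only base. The names
`expensive`, `lazyBind`, `strictBind` and `forces` are made up for
illustration, and `expensive` is a stand-in that throws when evaluated
rather than a real vector, so that forcing becomes observable:

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Exception (ErrorCall, try)

-- Hypothetical stand-in for an expensive value: evaluating it throws,
-- so we can observe whether it was ever forced.
expensive :: Int
expensive = error "expensive was evaluated"

-- Binds the value lazily, as in `e <- return $ ...`; nothing forces it.
lazyBind :: IO ()
lazyBind = do
  e <- return expensive
  return ()

-- The same bind with a bang pattern, which forces the value to WHNF.
strictBind :: IO ()
strictBind = do
  !e <- return expensive
  return ()

-- True if running the action forced `expensive`.
forces :: IO () -> IO Bool
forces act = do
  r <- try act :: IO (Either ErrorCall ())
  return (either (const True) (const False) r)

main :: IO ()
main = do
  forces lazyBind   >>= print
  forces strictBind >>= print
```

Only the bang-pattern version should report that the value was forced.
This says nothing about `replicateM`, where the sampling happens as a
side effect of running the action, not of evaluating its result.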

Profiling indicates that random number sampling really does occur in
both of the original versions with `replicateM` and, as expected, takes most of the time:

	Mon Jun 11 14:24 2012 Time and Allocation Profiling Report  (Final)

	   slow-mwc-vector +RTS -p -RTS

	total time  =        5.45 secs   (5453 ticks @ 1000 us, 1 processor)
	total alloc = 3,568,827,856 bytes  (excludes profiling overheads)

COST CENTRE   MODULE                          %time %alloc

uniform2      System.Random.MWC                45.0   53.7
uniformWord32 System.Random.MWC                31.3   31.5
standard.loop System.Random.MWC.Distributions   4.1    1.1
uniform1      System.Random.MWC                 3.9    4.5
nextIndex     System.Random.MWC                 3.6    1.4
uniform       System.Random.MWC                 2.8    3.3
uniform       System.Random.MWC                 2.5    1.4
wordsToDouble System.Random.MWC                 2.1    0.5

I could drop do notation and go with the simpler version if I just wanted
a vector of variates. But in reality I want a vector of tuples with random
components:
------ 8< ------
import qualified Data.Vector.Unboxed as VU
import Control.Monad

import System.Random.MWC
import System.Random.MWC.Distributions (standard)

count = 1000000

main = do
  g <- create
  e <- VU.replicateM count $ do
         v1 <- standard g
         v2 <- standard g
         v3 <- standard g
         return (v1, v2, v3)
  return ()
------ >8 -------
which runs for the same 11.412 seconds.
Since three times as many variates are generated and the run time stays the same,
this suggests that some optimizations in the vector package may interfere
with mwc-random. Can this be the case?
This becomes quite a bottleneck in my application.

On the other hand, mwc-random has the `normal` function implemented as follows:

------ 8< ------
normal m s gen = do
  x <- standard gen
  return $! m + s * x
------ >8 -------
which again uses explicit `do`. Both standard and normal are marked with INLINE.

Now if I try to write
------ 8< ------
  e <- VU.replicateM count $ normal 0 1 g
------ >8 -------
in my test case, I get, as expected, the same horrible run time of 11 seconds,
even though I'm not using do myself.
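
For reference, a do block is only sugar for `>>=`, so the two spellings
of `normal` below mean the same thing; any run-time difference would
have to come from how the code is optimized, not from the notation as
such. This is a sketch using only base: `draw` is a made-up stand-in for
`standard gen` that pops values from a fixed list, and `normalDo`,
`normalBind` and `runDemo` are illustrative names, not part of
mwc-random.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Hypothetical stand-in for `standard gen`: pops the next value from a
-- fixed list instead of sampling a real generator.
draw :: IORef [Double] -> IO Double
draw ref = do
  xs <- readIORef ref
  case xs of
    (y:ys) -> writeIORef ref ys >> return y
    []     -> error "draw: out of values"

-- Shape of `normal` as quoted above, written with do notation.
normalDo :: Double -> Double -> IORef [Double] -> IO Double
normalDo m s gen = do
  x <- draw gen
  return $! m + s * x

-- The same function after desugaring the do block by hand.
normalBind :: Double -> Double -> IORef [Double] -> IO Double
normalBind m s gen = draw gen >>= \x -> return $! m + s * x

-- Runs a sampler three times against a fresh copy of the same inputs.
runDemo :: (Double -> Double -> IORef [Double] -> IO Double) -> IO [Double]
runDemo f = do
  gen <- newIORef [0.5, -1.0, 2.0]
  mapM (const (f 10 2 gen)) [1 :: Int, 2, 3]

main :: IO ()
main = do
  a <- runDemo normalDo
  b <- runDemo normalBind
  print (a == b)
```

It only checks that both versions produce the same values; it does not
measure or explain the timings reported above.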