[Haskell-cafe] Efficient functional idiom for histogram

Sat Aug 1 16:38:05 EDT 2009

Am Samstag 01 August 2009 20:50:10 schrieb Sterling Clover:
> On Aug 1, 2009, at 2:06 PM, Paul Moore wrote:
> > Is the issue with random numbers just the implementation, or is it
> > something inherent in the non-pure nature of a random number generator
> > that makes it hard for Haskell to implement efficiently? If the
> > latter, that probably makes Haskell a somewhat poor choice for
> > simulation-type programs.

If you view a PRNG as a function from the seed to the sequence of generated numbers or as 
a function state -> (bitpattern, newstate), PRNGs are pure (at least, I know no 
counterexample), so it's not inherently inefficient in Haskell, though it's probably still 
faster in C. 
One thing that makes StdGen slow is splittability, as Sterling points out below.
For a simulation programme where you don't need splittability, choose a different PRNG.

>
> Well, I'm not sure of the details, but in your original
> implementation, you're performing IO to pull the seed out of a ref at
> every iteration. Pekka Karjalainen's doesn't do that, which probably
> helps with the speedup. Along with that, Haskell has a fairly slow
> random implementation. As I recall however, this is partially because
> it hasn't received a great deal of optimization, but mainly because
> the algorithm itself fulfills some rather strong properties -- in
> particular it must be fairly statistically robust, and it must
> provide a "split" function which produces generators that are
> independently robust [1]. This limits algorithm choice quite a bit.
>
> For other random numbers, with different properties (faster, but with
> tradeoffs in robustness, or ability to split, or both), you can check
> hackage for at least mersenne-random and gsl-random.

I didn't get much speedup with System.Random.Mersenne.Pure64 (might be because I have a 
32-bit system and Word64 goes through foreign calls, if that is still the case), but 
GSL.Random.Gen reduced the time by a factor of over 5.
It forces you back into IO and is a little less convenient, but if speed is a concern, 
it's a price worth to pay.

---------------------------------------
module Main (main) where

import GSL.Random.Gen
import qualified Data.Map as Map
import Data.Map (Map)
import Data.List
import System.IO.Unsafe
import System.Time
import Data.Word

dice :: RNG -> Int -> Int -> IO Int
dice _ 0 n = return 0
dice rng m n = do
  total <- dice rng (m - 1) n
  roll <- fmap (+1) $ getUniformInt rng n
  return (total + roll)

simulate _ 0 _ _ = return []
simulate rng count m n = unsafeInterleaveIO $ do
    val <- dice rng m n
    tl <- simulate rng (count-1) m n
    return (val:tl)

histogram :: Ord a => [a] -> [(a,Int)]
histogram = Map.assocs . foldl' f Map.empty
  where
    f m k = Map.insertWith' (+) k 1 m

simulation rng = do
  lst <- simulate rng 1000000 3 6
  return (histogram lst)

main = do
  rng <- newRNG mt19937
  sd <- getTimeSeed
  setSeed rng sd          -- omit seeding for reproducible results
  s <- simulation rng
  putStrLn (show s)

getTimeSeed :: IO Word64
getTimeSeed = do
    TOD a b <- getClockTime
    return . fromInteger $ 10^6*a + b `quot` (10^6)
--------------------------------------------------

> There may be others that I don't recall.
>
> Cheers,
> Sterl.