[Haskell-cafe] Splittable random numbers

Sat Jan 29 14:52:04 CET 2011

>
> perhaps performance? Is this approach less robust with a faster,
> non-cryptographic RNG?
>

Yes, I don't understand that either.  Is there a reason that using a weaker
PRNG in this case is WORSE than using it in the non-splitting case?  Is that
why there is more of an impetus to use the cryptographic approach in this
case?

Anyway, taking for granted that the Burton approach is a useful thing to
have implemented, I started developing a package for this stuff -- AES based
RNG including both a haskell implementation and wrapping an AESNI-based C
one .  I haven't posted it to Hackage yet, but you can find the git
repository here:

    https://github.com/rrnewton/intel-aes

If you build with cabal and run the benchmark-intel-aes-rng executable, it
will give you a breakdown like this:

    How many random numbers can we generate in a second on one thread?
      Cost of rdtsc (ffi call):    83
      Approx getCPUTime calls per second: 205,640
      Approx clock frequency:  3,306,891,339
      First, timing with System.Random interface:
       193,178,901 random ints generated [constant zero gen]
        14,530,358 random ints generated [System.Random stdGen]
            16,346 random ints generated [BurtonGenSlow/reference]
            32,965 random ints generated [BurtonGenSlow]
      Comparison to C's rand():
       118,766,285 random ints generated [rand/store in C loop]
       114,668,028 random ints generated [rand / Haskell loop]
       114,675,116 random ints generated [rand/store Haskell]

At the moment this is Haskell-only, I haven't included the wrapped Intel
assembly code library yet.  As you can see, the pure-Haskell AES based RNG
(BurtonGenSlow) is pretty slow.

Would anyone else be interested in running those RNG testing tools (diehard,
big crush) on this to make sure it is working correctly?

Also I'd be happy if anyone with a performance-oriented-eye would like to
take a look at what's going on.  Both for the sake of the serial performance
(above) and because the parallel performance is currently *mysterious* (see
below).

I figure one of the main reasons for splittable RNG is deterministic
parallel computations.  Thus it's important that all threads be able to run
the RNG efficiently.  Right now, if you look at SimpleRNGBench.hs I'm just
running the same RNG on numCapabilities threads.  Yet with that simple test
I'm running into problems, summarized thus:

  * I get substantial (3X) variance in program performance on consecutive
runs.
  * I see a minor hit in performance adding -threaded, but going from -N1 to
-N4 (even with a single-thread of work) yields a big hit in performance and
increase in variance.
  * -N4 with four actual threads of work is actually pretty good for the
pure haskell version.  All four threads on my nehalem 3.33ghz can maintain
93% of their throughput in the serial case.  BUT the variance problem
persists.
  * I run a busy-wait loop that measures cpu frequency... and this seems to
get messed up in threaded mode (even with -qm -qa).  I don't know why.
  * I cannot killThread a haskell thread (forkIO or forkOS) that is
currently running a divergent FFI call (safe or unsafe).  (See "time_c".)

You can find the details in the DEVLOG here:

   https://github.com/rrnewton/intel-aes/blob/master/CHANGELOG

Let me know if you have any ideas.  I'm going to leave the Haskell version
how it is and focus on wrapping the Intel asm (which has a permissive
license).

Cheers,
  -Ryan

P.S. Regarding this benchmarking -- would it be appropriate to use Criterion
for this?  Or is it sufficient to measure aggregate throughput as I've been
doing?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20110129/0111fc6a/attachment.htm>