genBits: small addition to random API + largely new implementation that needs code review

Tue Jun 28 20:58:00 CEST 2011

[CC'ing some people involved in the relevant ticket discussions or showing
up in git blame]

Hello all,

So it seems random was due for an overhaul.  I would like to initiate a
discussion period to talk about what changes should happen before the next
major release.  I think this is timely because there is *already* a pending
backwards-incompatible change in the API (factoring out the SplittableGen
class) <http://hackage.haskell.org/trac/ghc/ticket/4314>.  We might as well
make any other fixes now and make all the changes at once.  For
reference, I've attached a list of the open tickets I know about pertaining
to System.Random at the end of this email.

There implementation has two major pieces:

    (1) Sources of random bits (class RandomGen)
    (2) Instances of class Random that map use random bits to create Haskell
types

For now I've left (1) alone and have made heavy changes to part (2) in a
branch to fix correctness and performance problems:
   https://github.com/haskell/random/tree/new_api

These new changes require a small API change to RandomGen.  The issue is
that the legacy RandomGen API isn't very good at generating random *bits*.
 Rather, it next creates numbers in an arbitrary range: genRange.  First of
all, I would argue that -- completely aside from performance considerations
-- this extra flexibility for RandomGens increases the likelihood of errors
in the clients of the API.  As supporting evidence I would cite tickets
#5278 <http://hackage.haskell.org/trac/ghc/ticket/5278> and
#5280<http://hackage.haskell.org/trac/ghc/ticket/5280> --
it looks like even use of the API in System/Random.hs itself was incorrect!

*My proposed API addition/restriction:*

I propose that we add a "genBits" function that reports how many bits of
randomness are created by a generator.

  class RandomGen g where
    ...
    genBits :: g -> Int

I added a default implementation of genBits that computes the answer based
on genRange.  This need not be a backwards-incompatible change.

What about generators that have a genRange which is not a power of two?  We
could consider making genBits return a Maybe value to distinguish these.
 But I would prefer that we* *instead* restrict RandomGen instances,
requiring that they generate a clean number of random bits*.  This requires
that genRange be (0,2^N-1) or (-2^N,2^N-1).  But we can tolerate slight
deviations; for example, in the new_api
branch<https://github.com/haskell/random/tree/new_api>,
stdGen has a genRange of (0,2^31 - 86).  Probably close enough.  (In any
case, in the future I'd like to see stdGen replaced with something better
anyway.)

*An alternative would be to go even further and require every RandomGen
instances to generate the full range of Ints.*  I think any future
replacement for stdGen will probably meet this additional requirement,
though I would be interested in important generators that do not, and in any
references on the original API design of RandomGen.  As long as we have
genBits I don't feel strongly about it but am interested in other opinions.

Anyway, I've used genBits to reimplement the random and randomR methods.
 Let me know what you think.  This new version should fix tickets
#5278<http://hackage.haskell.org/trac/ghc/ticket/5278>,
#5280 <http://hackage.haskell.org/trac/ghc/ticket/5280>,
#5133<http://hackage.haskell.org/trac/ghc/ticket/5133>.
 But give me a code review and help me head off future new bugs!

Regarding other tickets:
    #3575 <http://hackage.haskell.org/trac/ghc/ticket/3575> and
#3620<http://hackage.haskell.org/trac/ghc/ticket/3620>will have to
wait for stdGen to get replaced.  Whereas
#427 <http://hackage.haskell.org/trac/ghc/ticket/427> and
#2280<http://hackage.haskell.org/trac/ghc/ticket/2280> are
complaints about the performance; they will also require a new stdGen to be
adequately addressed.  However, the current batch of changes *does* change
the performance of some Random instances significantly.  (See appendix.)
 There's plenty of room for improvement, and in particular, the new random
approach would work better if
#4102<http://hackage.haskell.org/trac/ghc/ticket/4102> were
implemented in Data.Bits.
   The excessive laziness problems are still there.
(#4218<http://hackage.haskell.org/trac/ghc/ticket/4218>is still
unfixed.)  Perhaps that's something we can fix in this audit.

I'd also be happy to hear other proposals for API changes and additions.
 System.Random can't depend on bytestring, so I doubt we want block RNG in
the RandomGen class.  I could, however, imagine having nextWord16,
nextWord32, etc.
   Something worthwhile if we're doing an API audit  is to look at some of
the other Hackage packages.  Here are some that I am familiar with:
    http://hackage.haskell.org/package/DRBG-0.2.2
    http://hackage.haskell.org/package/mersenne-random-pure64
    http://hackage.haskell.org/package/mwc-random
    http://hackage.haskell.org/package/intel-aes

And some that I'm not:
    http://hackage.haskell.org/package/random-fu
    http://hackage.haskell.org/package/gsl-random
    http://hackage.haskell.org/package/cprng-aes

Pardon the long email!

Cheers,
  -Ryan

*Appendix A: List of tickets*

#5278 System.Random.randomIvalInteger makes invalid assumptions about
RandomGen
#5280 System.Random commits (rand `mod` base) error.
#5133 Random instance for Float can generate values out of requested range
#4218 System.Random is way too lazy
#427 Random.StdGen slowness
#2280 randomR too slow
#3575 mkStdGen and split conspire to make some programs predictable
#3620 The seeds generated by split are not independent libraries/rando
#3352 Data.Word should define instances of Random for each type

*Appendix B:  Performance data*

Here's an "S curve" showing the speedup or slowdown of the new code when
generating different types of data with random or randomR (the R labeled
ones are randomR).  They range from from some massive speedups on floats and
doubles as well as a hefty performance regression when Integer is restricted
to a small range by randomR.  (I'm looking into it.)

speedup of new over old:
new_api revision f85c6a55b731d5e380c1d6e880105c9933aa9d48
old, revision  130e421e912d394653a43c987be99
     161.87843858371693  "Float"
    120.29838090036381  "Float R"
    54.66538723769602  "CDouble"
    41.01849643134359  "CDouble R"
    5.546651773234119  "Int"
    5.250176145938512  "Integer"
    2.4963231822924556  "Word16"
    2.340365231384218  "Bool R"
    2.2810769806267412  "Bool"
    2.1576446209293216  "Double R"
    2.059792819214518  "Double"
    1.4283216783216783  "Integer R Big"
    1.235562774008646  "Word16 R"
*    1.008346493029544  "stdGen/next"*
    0.713868706483777  "Int R"
    0.6738365001610773  "Char R"
    0.5974339051794343  "Char"
    0.16546783206240073  "Integer R"

The baseline stdGen/next hasn't been changed.  Thus it's "speedup" is 1.0.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/libraries/attachments/20110628/7154f8c0/attachment.htm>