[Haskell-cafe] Random number sampling and statistics libraries (was: Comment on "The Historical Futurism of Haskell" by Andrew Boardman)

Wed Sep 15 12:48:20 UTC 2021

Hi Dominic,

I can likewise only thank you for your detailed answer. I have also
added a few remarks below.

Dominic Steinitz <dominic at steinitz.org> writes:

> Hi Dominik
>
> Thanks very much for your reply - comments inline below :-)
>
> Dominic Steinitz
> dominic at steinitz.org
> http://idontgetoutmuch.org
> Twitter: @idontgetoutmuch
>
>  On 14 Sep 2021, at 09:27, Dominik Schrempf <dominik.schrempf at gmail.com> wrote:
>
>  Dominic Steinitz <dominic at steinitz.org> writes:
>
>  On 12 Sep 2021, at 13:00, haskell-cafe-request at haskell.org wrote:
>
>  In particular, I am a mathematician/statistician working in evolutionary
>  biology. I work with multivariate distributions (hardly any of those are readily
>  available on Hackage), I work with a lot of random numbers (the support for
>  random sampling is mediocre, at best; 'splitmix' is standard by now but not
>  supported by the most important statistics library of Haskell), I work with
>  numerical optimization (I envy Pythonians for their libraries, although I still
>  prefer Haskell because what I achieve, at least I get right), I work with Markov
>  chains (yes, I had to write my own MCMC library in order to run proper Markov
>  chains), I need to plot my data (there is no superb standard plotting library
>  available in Haskell). By now, I do maintain library packages providing answers
>  to some of these problems, but it was (and is) a lot of work.
>
>  I have to take issue with your statement about random sampling. I think we have a really good story with random numbers now. They are of high quality and fast. R and possibly Python and Julia by comparison still use Mersenne Twister, of lower quality, slower and without a good story for generating independent sequences for parallel computations. I maintain random-fu (sampling from distributions) and using the new random number generator it is now several times (x4?) faster than it was. Conceivably it could be made even faster.
>
>  Thank you for mentioning 'random-fu'. It makes me want to switch
>  from 'statistics' to 'random-fu'. I started using 'statistics' because I
>  liked (depended on?) the notion of a 'Distribution' which can be an instance of
>  many classes (but I just saw that this is also the case for 'random-fu', maybe I
>  overlooked it). I liked that there is a distinction between discrete and
>  continuous distributions, and that there are more statistical functions
>  available such as quantiles, and so on. The package 'statistics' only supports
>  random number generation using 'mwc-random' (a multiply-with-carry generator,
>  not 'splitmix'). It also does not support multivariate distributions. Right
>  now, I am considering changing to 'random-fu'.
>
>  What also kept me from using 'random-fu' is the following sentence in the
>  description of the package:
>
>  "Quality is prioritized over speed, but performance is an important goal too."
>
>  This sounds to me like 'random-fu' focuses on the generation of
>  cryptographically secure random numbers which is not what I need.
>
> I think the original author meant they were not aiming for C-like speed. The library certainly is not intended to generate cryptographic-strength random numbers.
>
> Here’s my take on what random-fu did:
>
> 1. Provides an interface to "sources of entropy" so you can plug in any RNG and produce random values for various specified types.
> 2. Provides a domain specific language so that you can manipulate random values using an early precursor of free monads (the prompt monad).
> 3. Provides a way of sampling from distributions.
> 4. Provides cumulative distribution functions and probability density functions (where they exist). I think this is a bit of a later addition and I would like it to be comparable to what R provides.
>
> The new random interface means that (1) is no longer required. With new random (1.2) you can plug in your favourite RNG without having to add anything to random-fu (this was not the case e.g. for adding MWC previously).
>
> In my free time, I have been looking at how to move from the prompt monad to the free monad: https://github.com/lehins/random-fu/pull/1
Debloating the interface would certainly be advantageous. I just read through
the documentation of "Data.Random" and was astonished how complicated the types
are (but maybe this is necessary, I do not have enough information).

It was also hard to find the reverse dependencies:
splitmix <- random <- random-fu
(splitmix seems to be hidden in 'StdGen', please correct me if this is wrong).

>
>  Please give details on where you think we can improve and better still contribute your own improvements :-)
>
>  In my opinion it would be great to:
>  - separate continuous from discrete distributions
>
> Certainly possible, but I am not sure of the benefits. What would it look like concretely?
I was thinking along the lines of the statistics package, which has a nice distinction:
https://hackage.haskell.org/package/statistics-0.15.2.0/docs/Statistics-Distribution.html
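
As a rough sketch of what such a separation can look like: the two classes and
the two distribution types below are made up for illustration and defined
locally. They are modelled on the split in 'Statistics.Distribution' but are
not the API of 'statistics' or 'random-fu'.

```haskell
-- Sketch only: these classes and types are hypothetical and defined locally.
-- They are not the classes exported by 'statistics' or 'random-fu'.

-- | Continuous distributions expose a density over real values.
class Continuous d where
  density :: d -> Double -> Double

-- | Discrete distributions expose a probability mass over integers.
class Discrete d where
  probability :: d -> Int -> Double

-- | Exponential distribution with the given rate (assumed > 0).
newtype Exponential = Exponential Double

-- | Bernoulli distribution with success probability p (assumed in [0, 1]).
newtype Bernoulli = Bernoulli Double

instance Continuous Exponential where
  density (Exponential rate) x
    | x < 0     = 0
    | otherwise = rate * exp (negate rate * x)

instance Discrete Bernoulli where
  probability (Bernoulli p) k
    | k == 1    = p
    | k == 0    = 1 - p
    | otherwise = 0
```

With this split, asking for the density of a 'Bernoulli' value is rejected by
the type checker instead of silently returning a number with the wrong meaning.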

>
>  - have one set of type classes used by 'random-fu' and 'statistics' (and all
>   other packages working with distributions)
>
> I find this harder to visualise, as well as what its consequences and benefits would be.
I think I made an error. I didn't mean the type classes would be shared (those
would be different in random-fu and statistics), but the (new)types. For
example, the normal distribution 'Normal' could be an instance of
'Data.Random.Distribution' of 'random-fu' but also of
'Statistics.Distribution.ContDistr' of 'statistics'. That way, 'random-fu'
can take care of the sampling, and 'statistics' of the probability density
functions etc. But maybe this is too naive an idea.
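
A rough sketch of the idea, using only 'base'. Both classes are hypothetical
stand-ins defined locally: 'Sampleable' plays the role of the sampling class
from 'random-fu' and 'HasDensity' the role of the density class from
'statistics'. To avoid depending on any RNG package, the sampler takes two
uniform draws from the caller and applies the Box-Muller transform.

```haskell
-- Sketch only: both classes are hypothetical stand-ins defined here, not the
-- real classes from 'random-fu' or 'statistics'.

-- | One shared distribution type: mean and standard deviation (assumed > 0).
data Normal = Normal Double Double

-- | Stand-in for a sampling class (the role 'random-fu' would play).
-- The caller supplies two uniform draws, the first in (0, 1].
class Sampleable d where
  sampleWith :: d -> Double -> Double -> Double

-- | Stand-in for a density class (the role 'statistics' would play).
class HasDensity d where
  logDensity :: d -> Double -> Double

instance Sampleable Normal where
  -- Box-Muller transform: two uniforms give one normal deviate.
  sampleWith (Normal m s) u1 u2 =
    m + s * sqrt (-2 * log u1) * cos (2 * pi * u2)

instance HasDensity Normal where
  logDensity (Normal m s) x =
    negate (log s) - 0.5 * log (2 * pi) - 0.5 * z * z
    where
      z = (x - m) / s
```

The point is only that a single 'Normal' type carries both instances, so code
that samples and code that evaluates densities can share the same values.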

In practice, however, that's exactly what I (and I guess, others) need. I do not
only need random samples, but also the (log) probability density function, etc.
With respect to the Dirichlet distribution which you mention below (thanks for
pointing this out), this is exactly the problem. The PDF is not available in
'random-fu' (or is it?), and so, another dependency or a personal implementation
is required.
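
For reference, a personal implementation of the Dirichlet log density can stay
fairly small. The sketch below uses only 'base'; 'logGamma' and
'dirichletLogDensity' are names chosen here for illustration and are not taken
from 'random-fu', 'statistics', or 'dirichlet'. The log-gamma function uses the
Lanczos approximation, and the inputs are assumed to be valid without being
checked.

```haskell
-- Sketch only: written for illustration, not taken from 'random-fu',
-- 'statistics', or the 'dirichlet' package.

-- | Natural logarithm of the gamma function for x > 0, using the Lanczos
-- approximation (g = 7, nine coefficients).
logGamma :: Double -> Double
logGamma x
  | x < 0.5   = logGamma (x + 1) - log x  -- Gamma(x + 1) = x * Gamma(x)
  | otherwise = 0.5 * log (2 * pi) + (y + 0.5) * log t - t + log a
  where
    y = x - 1
    t = y + 7.5
    a = c0 + sum (zipWith (\c i -> c / (y + i)) cs [1 ..])
    c0 = 0.99999999999980993
    cs =
      [ 676.5203681218851
      , -1259.1392167224028
      , 771.32342877765313
      , -176.61502916214059
      , 12.507343278686905
      , -0.13857109526572012
      , 9.9843695780195716e-6
      , 1.5056327351493116e-7
      ]

-- | Log density of Dirichlet(alphas) at the point xs.
-- Assumes (without checking) that both lists have the same length, that all
-- alphas and xs are strictly positive, and that xs sums to 1.
dirichletLogDensity :: [Double] -> [Double] -> Double
dirichletLogDensity alphas xs =
  logGamma (sum alphas)
    - sum (map logGamma alphas)
    + sum (zipWith (\a x -> (a - 1) * log x) alphas xs)
```

For the flat case with alphas [1, 1, 1], the density is 2 everywhere on the
simplex, so the function should return a value close to log 2.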

>
>  - implement more and multivariate distributions (I implemented the 'dirichlet'
>   distribution for 'statistics'; it is available on Hackage but it is not
>   completely finished, and I don't consider myself able enough to contribute to
>   core libraries yet; there is also 'random-fu-multivariate' but it only has the
>   multivariate normal distribution)
>
> Random-fu has a Dirichlet sampler (https://hackage.haskell.org/package/random-fu-0.2.7.7/docs/Data-Random-Distribution-Dirichlet.html), but maybe that is not what you meant?
>
> I created random-fu-multivariate with the intention of adding more multivariate distributions when I needed them - I haven’t thus far. It would be great if folks added to it. The reason to separate it from random-fu was that it relies on extra Haskell
> and external packages (LAPACK for Cholesky).
>
> Please do contribute. My approach is to look at what R / Python / Julia have already done and read the old masters, such as Devroye (http://www.eirene.de/Devroye.pdf). I no longer feel totally confident that other programming language ecosystems have
> optimal implementations (vide Mersenne Twister).

Thanks for your encouragement!

Best,
Dominik

>
>  In terms of MCMC, I think Jared Tobin wrote some libraries but I don’t think they are maintained. I maintain an SMC library but I don’t know how much use it gets. Tom Nielsen, Henrik Nilsson and I wrote Haskell “bindings” for Stan:
>  https://nottingham-repository.worktribe.com/output/1151875/getting-more-out-of-stan-some-ideas-from-the-haskell-bindings. It would be a lot of work to e.g. re-create Stan in Haskell natively.
>
>  I am aware of Jared Tobin's packages! They are great entry points but not
>  flexible enough for what I am doing. If you are interested, have a look at the
>  'mcmc' package, which I am developing.
>
>  Thank you, I didn't know about the Stan Haskell bindings and will have a look.
>
>  I agree about plotting but inline-r makes it possible to use ggplot in R via Haskell which makes things like drawing maps with reasonable projections relatively straightforward. 
>
>  More generally, I think we have a good set of bindings for the ODE solver library SUNDIALS and also for other numeric libraries (e.g. LAPACK and BLAS). The problem we have is not enough hands working on such things.
>
>  I now sadly return to programming in Julia.
>
>  Thanks for your input!
>
>  PS - there is probably more I could say on numerical stuff in Haskell but the above already looks like “stream of consciousness”.