[Haskell-cafe] Mathematics and Statistics libraries

Tom Doris tomdoris at gmail.com
Sun Mar 25 00:24:15 CET 2012


If the goal is to help Haskell become a more acceptable choice for
general statistical analysis tasks, then hmatrix, statistics, and the
various GSL wrappers already provide the majority of the functionality
needed. I think the bigger problems are that there is no guidance on
which libraries are industrial strength, there's no glue layer making
the APIs you'd want easy to use, and GHCi isn't always ideal as a REPL
for this workflow.

If you're interested in UI work, ideally we'd have something similar
to RStudio as an environment: a simple set of windows encapsulating an
editor, a REPL, a plotting panel, and help/history. This sounds
superficial, but it really has an impact when you're exploring a data
set and trying things out. However, it would be a bigger contribution
to get us to the point where we can just "import Quant.Prelude" to
bring into scope all the standard functionality assumed in an
environment like R or Matlab. In my experience most of this can come
from re-exporting existing libraries, occasionally wrapping functions
to simplify their interfaces and make them more consistent. For
example, a quant doesn't particularly need to know why
Statistics.Sample.KernelDensity.kde takes unboxed vectors when the
rest of that library works with generic vectors, and they certainly
won't want to spend their time remembering to convert before calling
that function.
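To make that concrete, here's a minimal sketch of what such a prelude
could look like. The module name Quant.Prelude and the kdeG wrapper
are hypothetical; only kde itself comes from the statistics package:

-- Hypothetical glue module: re-export a standard library wholesale
-- and wrap the odd inconsistent function.
module Quant.Prelude
  ( module Statistics.Sample
  , kdeG
  ) where

import Statistics.Sample
import Statistics.Sample.KernelDensity (kde)
import qualified Data.Vector.Generic as G
import qualified Data.Vector.Unboxed as U

-- kde is fixed to unboxed vectors; accept any generic vector and do
-- the conversions at the boundary, so callers never think about it.
kdeG :: G.Vector v Double => Int -> v Double -> (v Double, v Double)
kdeG points xs =
  let (mesh, density) = kde points (G.convert xs :: U.Vector Double)
  in  (G.convert mesh, G.convert density)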

As an exercise, in GHCi, try loading a few arbitrary CSV files of
tables that include floating-point columns, run a linear regression of
one such column on another, and then display a scatterplot with the
regression line; maybe throw in a check for the normality of the
residuals. Assume you'll need to handle large data sets, so you'll
want bytestring, attoparsec, and so on. Beware that there's a known
bug that causes a segfault/bus error if you use some hmatrix/GSL
functions from GHCi on x86_64, which is something of a blocker in
itself. Maybe I missed something obvious, but it took me a looong time
to figure out which containers, persistence, parsing, stats and
plotting packages I should choose.
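For the parsing and regression part, the skeleton looks something like
the following. This is only a sketch: it assumes a headerless, purely
numeric CSV named data.csv with no quoting, and it leaves the plotting
and the normality test to whichever packages you settle on:

import qualified Data.ByteString.Char8 as B
import Data.Attoparsec.ByteString.Char8
  (parseOnly, sepBy, sepBy1, char, double, endOfLine)
import qualified Data.Vector.Unboxed as U

-- Parse a whole file of comma-separated Double columns
-- (no header row, no quoting or escaping handled).
csv :: B.ByteString -> Either String [[Double]]
csv = parseOnly ((double `sepBy1` char ',') `sepBy` endOfLine)

-- Ordinary least squares fit of y = a + b*x.
ols :: U.Vector Double -> U.Vector Double -> (Double, Double)
ols xs ys = (a, b)
  where
    n  = fromIntegral (U.length xs)
    mx = U.sum xs / n
    my = U.sum ys / n
    b  = U.sum (U.zipWith (\x y -> (x - mx) * (y - my)) xs ys)
       / U.sum (U.map (\x -> (x - mx) * (x - mx)) xs)
    a  = my - b * mx

main :: IO ()
main = do
  raw <- B.readFile "data.csv"               -- hypothetical input file
  case csv raw of
    Left err   -> putStrLn ("parse error: " ++ err)
    Right rows -> do
      let xs = U.fromList (map head rows)    -- 1st column (predictor)
          ys = U.fromList (map (!! 1) rows)  -- 2nd column (response)
      print (ols xs ys)                      -- (intercept, slope)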

I really disagree that we need a data-frame-like structure; data
frames are an abomination in R: they try to accommodate both event
records and time series, and do neither well. Haskell records are fine
for inhomogeneous event series, and for homogeneous time series
parallel Vectors or Matrices are better, since they can be passed to
BLAS and LAPACK with consequent performance and clarity advantages.
Column-oriented storage rocks, and Haskell is already a good fit.
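A sketch of those two shapes, with invented field names:

import qualified Data.Vector.Unboxed as U

-- One inhomogeneous event: a plain record, one value per field.
data Trade = Trade
  { tradeTime  :: !Double
  , tradePrice :: !Double
  , tradeSize  :: !Int
  } deriving Show

-- A homogeneous series: one vector per column, indices kept aligned.
-- Each column is a flat contiguous array, exactly the layout that
-- BLAS/LAPACK-backed libraries want.
data PriceSeries = PriceSeries
  { psTimes  :: !(U.Vector Double)
  , psPrices :: !(U.Vector Double)
  } deriving Show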

Having used C++, Matlab and R (the latter for quite a while), I now
use Haskell for all of my statistical analysis work. Despite the many
shortcomings, it's definitely worth it for the code clarity and type
checking, to say nothing of the pre-optimization performance and
robustness.

Best of luck, happy to share some preliminary code with you directly
if you're interested!
Tom



On 21 March 2012 17:24, Ben Jones <ben.jamin.pwn3d at gmail.com> wrote:
> I am a student currently interested in participating in Google Summer of
> Code. I have a strong interest in Haskell, and a semester's worth of coding
> experience in the language. I am a mathematics and cs double major with only
> a semester left and I am looking for information regarding what the
> community is lacking as far as mathematics and statistics libraries are
> concerned. If there is enough interest I would like to put together a
> project with this. I understand that such libraries are probably low
> priority, but if anyone has anything I would love to hear it.
>
> Thanks for reading,
>       -Benjamin
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>


