[Haskell-cafe] Mathematics and Statistics libraries

Thu Mar 29 20:06:18 CEST 2012

Hey All,

There are actually a number of issues that come up with building an effective data-frame-like library for Haskell, and with data vis as well (both of which I have some strong personal opinions on for Haskell, and which I'm exploring / experimenting with this spring). While folks have touched on a bunch, I just thought I'd put my own opinions into the mix.

First of all: any good data manipulation (i.e. data-frame-like) library needs support for efficiently querying subsets of the data in various ways. Not just that, it really should provide a coherent way of dealing with out-of-core data! From there you might want to ask the question: "do I want to iterate through chunks of the data?" or "do I want to allow more general patterns of data access, and perhaps even ways to parallelize?". The basic thing (as others have remarked since I started drafting this email) is that you essentially want to support some SQL-like selection operations, and have them be efficient too, while playing nicely with columns of differing types.
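
To make the selection point concrete, here is a minimal sketch rather than a proposed design. It assumes a column-oriented layout (one list per column) and uses plain lists so that it needs nothing beyond base. The names Column, Frame, maskColumn and selectWhere are made up for illustration, and a real library would want unboxed vectors and a much richer predicate language to be efficient.

```haskell
-- A column holds values of a single type; a frame is a list of named columns.
-- All names here are hypothetical, chosen only for this sketch.
data Column
  = IntCol    [Int]
  | DoubleCol [Double]
  | TextCol   [String]
  deriving (Eq, Show)

type Frame = [(String, Column)]

-- Keep the elements whose position is marked True in the mask.
keep :: [Bool] -> [a] -> [a]
keep mask xs = [x | (True, x) <- zip mask xs]

-- Apply a row mask to a column, whatever its element type.
maskColumn :: [Bool] -> Column -> Column
maskColumn mask (IntCol xs)    = IntCol    (keep mask xs)
maskColumn mask (DoubleCol xs) = DoubleCol (keep mask xs)
maskColumn mask (TextCol xs)   = TextCol   (keep mask xs)

-- Roughly "SELECT cols FROM frame WHERE p(key)", restricted to a predicate
-- on one Double column. Returns Nothing if the key column is missing or is
-- not a Double column, or if any requested column is missing.
selectWhere :: [String] -> String -> (Double -> Bool) -> Frame -> Maybe Frame
selectWhere cols key p frame =
  case lookup key frame of
    Just (DoubleCol ks) ->
      let mask = map p ks
      in  traverse (\c -> fmap (\col -> (c, maskColumn mask col)) (lookup c frame)) cols
    _ -> Nothing
```

Note that a misspelled column name is only caught at run time here (as a Nothing), which is exactly the kind of thing one would like the types to rule out.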

What sort of abstractions you provide is somewhat crucial, because that in turn affects how you can write algorithms! If you look closely, this is tantamount to saying that any sufficiently well designed (industrial grade) data frame lib for Haskell might wind up leading into a model for supporting MapReduce or GraphLab (http://graphlab.org/) style algorithms in the multicore / non-distributed regime, though a first version would pragmatically just provide an interface to sequentially chunked data and use pipes-core or one of the other enumerator libraries. There's also some need for the aforementioned fancy types for managing data, but that's not even the real challenge (in my opinion). Probably the best lib to take ideas from is the Python pandas library, or at least that's my personal opinion.
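
As a rough illustration of the "sequentially chunked" first version, here is a sketch under some loud assumptions: a lazy list of chunks stands in for whatever stream pipes-core or an enumerator library would actually provide, and Summary, summariseChunk, summariseChunks and chunksOf are hypothetical names. The point is only that if the per-chunk result is a Monoid, the same algorithm can be folded over chunks one at a time, and the associativity of the combine step is what would, in principle, allow a map-reduce style parallel version.

```haskell
import Data.List (foldl')

-- Running summary of a numeric column. Strict fields keep the fold from
-- building up thunks.
data Summary = Summary { count :: !Int, total :: !Double }
  deriving (Eq, Show)

-- Combining two summaries is associative, so chunks can be merged in any grouping.
instance Semigroup Summary where
  Summary c1 t1 <> Summary c2 t2 = Summary (c1 + c2) (t1 + t2)

instance Monoid Summary where
  mempty = Summary 0 0

-- "Map" step: summarise one in-memory chunk.
summariseChunk :: [Double] -> Summary
summariseChunk = foldl' (\(Summary c t) x -> Summary (c + 1) (t + x)) mempty

-- "Reduce" step: fold over the chunks in order, one chunk at a time.
summariseChunks :: [[Double]] -> Summary
summariseChunks = foldl' (\acc chunk -> acc <> summariseChunk chunk) mempty

-- Mean of everything seen so far, if anything was seen.
mean :: Summary -> Maybe Double
mean (Summary 0 _) = Nothing
mean (Summary c t) = Just (t / fromIntegral c)

-- Hypothetical stand-in for reading fixed-size blocks from disk.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n <= 0 || null xs = []
  | otherwise         = let (h, t) = splitAt n xs in h : chunksOf n t
```

The client code (mean, summariseChunk) never mentions how the chunks are produced, which is the "invisible by default" property one would want from the real thing.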

Now in the space of data vis, probably the best example of a good library in terms of ease of getting informative (and pretty) outputs is ggplot2 (an R library). If you look there, you'll see that it's VERY much integrated with the model fitting and data analysis functionality of R, and has a very compositional approach which could be ported pretty directly over to Haskell.
However, as with a good data-frame-like library, certain obstacles come up. If we insist on a type-safe way of doing things while being at least as high level as R or Python, the absence of row types for frame column names makes it a bit tricky to specify linear models that are statically well formed (as in only referencing column names that are actually in the underlying data frame). While there are approaches that work some of the time, there's not really a good general-purpose way (as far as I can tell) to solve that small problem of resolving names as early as possible. Or at the very least I don't see a simple approach that I'm happy with.
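
For what it's worth, here is a minimal sketch of one approach of the "works some of the time" kind. It assumes a GHC with DataKinds and type-level string literals, and every name in it (Frame, Has, column, Term, evalTerm, people) is hypothetical. The frame's type carries the list of column names, and a model term can only be built for a name the type checker can find in that list.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeOperators #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits (Symbol)

-- A frame whose type lists its column names. To keep the sketch short,
-- every column holds Doubles.
data Frame (cols :: [Symbol]) where
  Nil  :: Frame '[]
  (:&) :: [Double] -> Frame cols -> Frame (name ': cols)
infixr 5 :&

-- "Has name cols" holds only when name occurs in cols; instance resolution
-- walks the type-level list at compile time.
class Has (name :: Symbol) (cols :: [Symbol]) where
  column :: Proxy name -> Frame cols -> [Double]

instance {-# OVERLAPPING #-} Has name (name ': cols) where
  column _ (xs :& _) = xs

instance Has name cols => Has name (other ': cols) where
  column p (_ :& rest) = column p rest

-- A model term can only be constructed for a column the frame has.
data Term (cols :: [Symbol]) where
  Col :: Has name cols => Proxy name -> Term cols

evalTerm :: Frame cols -> Term cols -> [Double]
evalTerm frame (Col p) = column p frame

-- Example frame with two named columns.
people :: Frame '["height", "weight"]
people = [1.6, 1.8] :& [60, 80] :& Nil

weights :: [Double]
weights = column (Proxy :: Proxy "weight") people

-- Rejected by the type checker, since there is no instance for Has "age" '[]:
-- ages = column (Proxy :: Proxy "age") people
```

The limitations are the interesting part: the column names have to be known at compile time, so a frame whose header is read from a file at run time doesn't fit without extra machinery, and the type errors for a missing column are not exactly friendly.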

These can be summarized, I think, as follows:
Any "practical" data frame lib needs to interact well with out-of-core data, and ideally should also simplify the task of writing algorithms on top, in a way that sort of gives out-of-core goodness for free. There are a lot of different ways this could be done under the covers, perhaps using one of the libraries like reducers, enumerator or pipes-core, but it really should be invisible to the client algorithm's author, or at least invisible by default. Moreover, I think any attack in that direction is essentially a precursor to sorting out MapReduce and GraphLab-like tools for Haskell.
Any really nice high-level data vis tool needs to have some data analysis / machine learning style library that it's working with, and this is probably best understood by looking at things already out there, such as ggplot2 in R.

That said, I'm all ears for other folks' takes on this, especially since I'm spending some time this spring experimenting in both these directions.

cheers
-Carter

On Sun, Mar 25, 2012 at 9:54 AM, Aleksey Khudyakov <alexey.skladnoy at gmail.com> wrote:
> On 25.03.2012 14:52, Tom Doris wrote:
> > Hi Heinrich,
> > 
> > If we compare the GHCi experience with R or IPython, leaving aside any
> > GUIs, the help system they have at the repl level is just a lot more
> > intuitive and easy to use, and you get access to the full manual
> > entries. For example, compare what you see if you type :info sort into
> > GHCi versus ?sort in R. R gives you a view of the full docs for the
> > function, whereas in GHCi you just get the type signature.
> > 
> Integrating haddock documentation into GHCi would be really helpful but it's a GSoC project on its own.
> 
> For me the most important difference between R's repl and GHCi is that :reload wipes all local bindings. Effectively it forces you to write everything in a file and to avoid doing anything which couldn't be fitted into a one-liner. It may not be bad but it's definitely a different style.
> 
> And of course data visualization. The only library I know of is Chart[1] but I don't like its API much.
> 
> I think talking about data frames is a bit pointless unless we specify what a data frame is. Basically there are two representations of a tabular data structure: array of tuples or tuple of arrays. If you want the first, go for Data.Vector.Vector YourData. If you want the second, you'll probably end up with some HList-like data structure to hold the arrays.
> 
> 
> 
> [1] http://hackage.haskell.org/package/Chart
> 
> 
