[Haskell-cafe] data analysis question

Dominic Steinitz dominic at steinitz.org
Thu Nov 13 05:44:22 UTC 2014


Tobias Pflug <tobias.pflug <at> gmx.net> writes:

> 
> Hi,
> 
> just the other day I talked to a friend of mine who works for an online 
> radio service. He told me he was looking into how best to work with 
> assorted usage data: currently 250 million entries in a 12GB CSV file, 
> comprising information such as which channel was tuned in, for how long, 
> with which user agent, and so on.
> 
> He stumbled upon the K and Q programming languages [1][2], which 
> apparently work nicely for this, unfamiliar as they might seem.
> 
> This certainly is not my area of expertise at all. I was just wondering 
> how some of you would suggest approaching this with Haskell. How would 
> you most efficiently parse such data and evaluate custom queries?
> 
> Thanks for your time,
> Tobi
> 
> [1] http://en.wikipedia.org/wiki/K_(programming_language)
> [2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)
> 

Hi Tobias,

I use Haskell and R (and Matlab) at work. You can certainly do data
analysis in Haskell; here is a fairly long example

http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-analysis-in-haskell/

IIRC the dataset was about 2GB, so not dissimilar to the one you are
thinking of analysing. I didn't seem to need pipes or conduits; I
just used cassava. The data were plotted on a map of London (yes, you
can draw maps in Haskell) with diagrams and shapefile
(http://hackage.haskell.org/package/shapefile).
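
To give a flavour of the cassava approach, here is a minimal sketch (not
the code from the blog post). It assumes a hypothetical three-column
layout of channel, seconds listened, and user agent, with no header row;
the names Row and totalByChannel are made up for illustration. It uses
cassava's streaming decoder so that a large file need not be held in
memory all at once.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL
import           Data.Csv (HasHeader (..))
import qualified Data.Csv.Streaming as S
import           Data.Foldable (foldl')
import qualified Data.Map.Strict as M

-- Hypothetical row layout (an assumption): channel, seconds, user agent.
type Row = (String, Int, String)

-- Sum listening time per channel in a single strict pass over the input.
-- Folding over 'S.Records' skips rows that fail to parse.
totalByChannel :: BL.ByteString -> M.Map String Int
totalByChannel bytes =
  foldl' step M.empty (S.decode NoHeader bytes :: S.Records Row)
  where
    step acc (channel, secs, _agent) = M.insertWith (+) channel secs acc
```

With a lazily read file this could be run as
totalByChannel <$> BL.readFile "usage.csv", where "usage.csv" is a
placeholder name. Silently skipping bad rows is a choice made for
brevity here; real code would probably want to count or report them.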

But R (and pandas in Python) make this sort of analysis easier. As a
small example, my data contained numbers like -.1.2 and dates and
times. R will happily parse these, but in Haskell you have to roll your
own. Not that this is difficult, but "someone" ought to write a library
like pandas so that the wheel is not continually re-invented.
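
As an illustration of what "rolling your own" can look like, here is a
minimal sketch using only base. It assumes the awkward values are
decimals with a missing leading zero (such as -.5) or a trailing dot,
both of which Haskell's own read rejects; that is only a guess at the
kind of data involved, and parseLenientDouble is a made-up name.
Exponent notation is deliberately not handled.

```haskell
import Data.Char (isDigit)
import Text.Read (readMaybe)

-- | Parse a decimal number, tolerating a leading sign, a missing leading
-- zero ("-.5") and a trailing dot ("3.").  Anything else gives Nothing.
parseLenientDouble :: String -> Maybe Double
parseLenientDouble raw =
  case strip raw of
    '-' : rest -> negate <$> unsigned rest
    '+' : rest -> unsigned rest
    rest       -> unsigned rest
  where
    -- Drop surrounding spaces.
    strip = reverse . dropWhile (== ' ') . reverse . dropWhile (== ' ')

    -- Split at the first dot, check both halves are digits only, then
    -- pad empty halves with "0" so that 'readMaybe' accepts the result.
    unsigned s =
      let (whole, frac) = break (== '.') s
          fracDigits    = drop 1 frac
          digits        = whole ++ fracDigits
      in if null digits || not (all isDigit digits)
           then Nothing
           else readMaybe (pad whole ++ "." ++ pad fracDigits)

    pad ds = if null ds then "0" else ds
```

For example, parseLenientDouble "-.5" gives Just (-0.5), whereas
readMaybe "-.5" :: Maybe Double gives Nothing.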

Also R (and Python) have extensive data analysis libraries, so if
e.g. you want to apply Nelder-Mead then a very well documented R
package exists; I searched in vain for this in Haskell. Similarly, if
you want to construct a GARCH model, then there is not only a package
but an active community upon whom you can call for help.

I have the benefit of being able to use this at work:

http://ifl2014.github.io/submissions/ifl2014_submission_16.pdf

I am hoping that it will be open-sourced "real soon now", but it
will probably not be available in time for your analysis.

I should also add that my workflow (for data analysis) in Haskell is
similar to that in R. I do a small amount of analysis either in a file
or at the command line and usually chart the results again using the
command line:

http://hackage.haskell.org/package/Chart

I haven't had time to try IHaskell, but I think the next time I have
some data analysis to do I will try it out:

http://gibiansky.github.io/IHaskell/demo.html
http://andrew.gibiansky.com/blog/haskell/finger-trees/

Finally, doing data analysis is quite different from writing quality
production code. I would imagine turning a Haskell data analysis into
production code would be a lot easier than doing the same in R.

Dominic.





