[Haskell-cafe] data analysis question
Dominic Steinitz
dominic at steinitz.org
Thu Nov 13 05:44:22 UTC 2014
Tobias Pflug <tobias.pflug <at> gmx.net> writes:
>
> Hi,
>
> just the other day I talked to a friend of mine who works for an online
> radio service who told me he was currently looking into how best to work
> with assorted usage data: currently 250 million entries in a 12GB csv
> comprising information such as which channel was tuned in, for how long,
> with which user agent, and so on.
>
> He accidentally ran into the K and Q programming languages [1][2], which
> apparently work nicely for this, as unfamiliar as they might seem.
>
> This certainly is not my area of expertise at all. I was just wondering
> how some of you would suggest approaching this with Haskell. How would
> you most efficiently parse such data and evaluate custom queries?
>
> Thanks for your time,
> Tobi
>
> [1] http://en.wikipedia.org/wiki/K_(programming_language)
> [2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)
>
Hi Tobias,
I use Haskell and R (and Matlab) at work. You can certainly do data
analysis in Haskell; here is a fairly long example:
http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-analysis-in-haskell/
IIRC the dataset was about 2GB, so not dissimilar to the one you are
thinking of analysing. I didn't seem to need pipes or conduit but
just used cassava. The data were plotted on a map of London (yes, you
can draw maps in Haskell) with diagrams and shapefile
(http://hackage.haskell.org/package/shapefile).
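For what it's worth, a minimal cassava sketch for data of roughly the
shape you describe might look like the following (the file name and the
column names channel, user_agent and seconds are my own guesses, not
your friend's actual schema):

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<$>), (<*>))
import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Vector as V

-- One row of the (hypothetical) usage log.
data Entry = Entry
  { channel   :: !String
  , userAgent :: !String
  , seconds   :: !Int
  }

instance FromNamedRecord Entry where
  parseNamedRecord r =
    Entry <$> r .: "channel" <*> r .: "user_agent" <*> r .: "seconds"

main :: IO ()
main = do
  csvData <- BL.readFile "usage.csv"
  case decodeByName csvData of
    Left err        -> putStrLn err
    -- e.g. total seconds listened across all entries
    Right (_, rows) -> print (V.sum (V.map seconds rows))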
But R (and pandas in Python) make this sort of analysis easier. As a
small example, my data contained numbers like -.1.2 as well as dates
and times. R will happily parse these, but in Haskell you have to roll
your own parser (not that this is difficult, and "someone" ought to
write a library like pandas so that the wheel is not continually
re-invented).
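"Rolling your own" usually just amounts to a small FromField instance
for cassava; here is a sketch for dates, assuming (purely for
illustration) a DD/MM/YYYY format:

import qualified Data.ByteString.Char8 as B
import Data.Csv (FromField (..))
import Data.Time (Day, defaultTimeLocale, parseTimeM)

-- A newtype so dates get their own parser.
newtype CsvDay = CsvDay Day deriving Show

instance FromField CsvDay where
  parseField s =
    case parseTimeM True defaultTimeLocale "%d/%m/%Y" (B.unpack s) of
      Just d  -> pure (CsvDay d)
      Nothing -> fail ("could not parse date: " ++ B.unpack s)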
Also, R (and Python) have extensive data analysis libraries, so if
e.g. you want to apply Nelder-Mead then a very well documented R
package exists; I searched in vain for this in Haskell. Similarly, if
you want to construct a GARCH model, then there is not only a package
but an active community upon whom you can call for help.
I have the benefit of being able to use this at work
http://ifl2014.github.io/submissions/ifl2014_submission_16.pdf
and I am hoping that it will be open-sourced "real soon now" but it
will probably not be available in time for your analysis.
I should also add that my workflow (for data analysis) in Haskell is
similar to that in R. I do a small amount of analysis either in a file
or at the command line and usually chart the results, again from the
command line:
http://hackage.haskell.org/package/Chart
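A minimal Chart sketch (using the Cairo backend and made-up numbers,
just to show the shape of the code):

import Graphics.Rendering.Chart.Easy
import Graphics.Rendering.Chart.Backend.Cairo (toFile)

main :: IO ()
main = toFile def "minutes.png" $ do
  layout_title .= "Minutes listened per channel index (hypothetical data)"
  plot (line "minutes" [zip [1 :: Double ..] [120, 95, 310, 42, 77 :: Double]])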
I haven't had time to try IHaskell, but I think the next time I have
some data analysis to do I will try it out.
http://gibiansky.github.io/IHaskell/demo.html
http://andrew.gibiansky.com/blog/haskell/finger-trees/
Finally, doing data analysis is quite different from writing quality
production code. I would imagine turning Haskell data analysis into
production code would be a lot easier than doing this in R.
Dominic.