[Haskell-cafe] data analysis question
Tobias Pflug
tobias.pflug at gmx.net
Thu Nov 13 09:37:29 UTC 2014
On 13.11.2014 02:22, Christopher Allen wrote:
> I'm working on a Haskell article for https://howistart.org/ which is
> actually about the rudiments of processing CSV data in Haskell.
>
> To that end, take a look at my rather messy workspace here:
> https://github.com/bitemyapp/csvtest
>
> And my in-progress article here:
> https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md
> (please don't post this anywhere, incomplete!)
>
> And here I'll link my notes on profiling memory use with different
> streaming abstractions:
> https://twitter.com/bitemyapp/status/531617919181258752
>
> csv-conduit isn't in the test results because I couldn't figure out
> how to use it. pipes-csv is proper streaming, but uses cassava's
> parsing machinery and data types. Possibly this is a problem if you
> have really wide rows but I've never seen anything that would be
> problematic in that realm even when I did a lot of HDFS/Hadoop
> ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not
> columns. With csv-conduit you might be able to incrementally process
> the columns too based on my guess from glancing at the rather scary code.
>
> Let me know if you have any further questions.
>
> Cheers all.
>
> --- Chris Allen
>
>
Thank you, this looks rather useful and I will certainly have a closer
look at it. I am surprised that csv-conduit was so troublesome; I was in
fact hoping for the opposite. I will just give it a try.
Thanks also to everyone else who replied. Let me add some tidbits to
refine the problem space a bit. As I said before, the data is around
12GB of CSV files, one file per month, with each line representing a
user tuning in to a stream:
[date-time-stamp], [radio-stream-name], [duration], [mobile|desktop],
[country], [areaCode]
which could be represented as:
data RadioStat = RadioStat
  { rStart    :: Integer  -- POSIX time stamp
  , rStation  :: Integer  -- index to station map
  , rDuration :: Integer  -- duration in seconds
  , rAgent    :: Integer  -- index to agent map ("mobile", "desktop", ..)
  , rCountry  :: Integer  -- index to country map ("DE", "CH", ..)
  , rArea     :: Integer  -- German geo location info
  }
I guess parsing the CSV into a list of RadioStat values, with the
station names kept as entries in a HashMap, should work just fine
(thanks again for your linked material, Chris).
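The HashMap idea for the station names amounts to interning: each
distinct name gets a small integer index, and the record stores the
index in place of the string. A minimal sketch, assuming
Data.Map.Strict from containers as a stand-in for a HashMap (to stay
within the libraries bundled with GHC); Interner, intern and internAll
are hypothetical names, not taken from any library:

```haskell
import qualified Data.Map.Strict as M
import Data.List (mapAccumL)

-- | Maps each distinct name to a small integer index.
type Interner = M.Map String Integer

-- | Look up a name, assigning the next free index if it is new.
intern :: Interner -> String -> (Interner, Integer)
intern tbl name =
  case M.lookup name tbl of
    Just i  -> (tbl, i)
    Nothing ->
      let i = fromIntegral (M.size tbl)  -- indices are handed out 0, 1, 2, ..
      in  (M.insert name i tbl, i)

-- | Replace every name in a list by its index and return the final table.
internAll :: [String] -> (Interner, [Integer])
internAll = mapAccumL intern M.empty
```

The same table could serve the agent and country columns, one Interner
per column.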
While this is straightforward, the type of queries I got as examples
might indicate that I should not try to reinvent a query language but
look for something else (?). Examples would be:
- summarize per day: total listening duration, average listening
  duration, number of listening actions
- summarize per day per agent: total listening duration, average
  listening duration, number of listening actions
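Both example queries have the shape of a grouped fold. A rough sketch,
assuming rows arrive as (key, duration) pairs and that the time stamps
are POSIX seconds in UTC; Summary, dayOf, summarize and average are
hypothetical names chosen for illustration:

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- | Running totals for one group; strict fields avoid thunk build-up.
data Summary = Summary
  { sTotal :: !Integer  -- total listening duration in seconds
  , sCount :: !Integer  -- number of listening actions
  } deriving (Eq, Show)

-- | Day number since the epoch for a POSIX time stamp in seconds.
dayOf :: Integer -> Integer
dayOf t = t `div` 86400

-- | Fold (key, duration) pairs into one Summary per key.
summarize :: Ord k => [(k, Integer)] -> M.Map k Summary
summarize = foldl' step M.empty
  where
    step m (k, d) = M.insertWith merge k (Summary d 1) m
    merge (Summary t1 c1) (Summary t2 c2) = Summary (t1 + t2) (c1 + c2)

-- | Average duration; every Summary built by summarize has sCount >= 1.
average :: Summary -> Double
average (Summary t c) = fromIntegral t / fromIntegral c
```

Keying on dayOf start gives the per-day summary; keying on the pair
(dayOf start, agent) gives the per-day, per-agent one.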
I don't think MySQL would perform all that well operating on a table
with 125 million entries ;] What approach would you guys take?
Thanks for your input and sorry for the broad scope of these questions.
best wishes,
Tobi