[Haskell-cafe] data analysis question

Thu Nov 13 09:37:29 UTC 2014

On 13.11.2014 02:22, Christopher Allen wrote:
> I'm working on a Haskell article for https://howistart.org/ which is 
> actually about the rudiments of processing CSV data in Haskell.
>
> To that end, take a look at my rather messy workspace here: 
> https://github.com/bitemyapp/csvtest
>
> And my in-progress article here: 
> https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md 
> (please don't post this anywhere, incomplete!)
>
> And here I'll link my notes on profiling memory use with different 
> streaming abstractions: 
> https://twitter.com/bitemyapp/status/531617919181258752
>
> csv-conduit isn't in the test results because I couldn't figure out 
> how to use it. pipes-csv is proper streaming, but uses cassava's 
> parsing machinery and data types. Possibly this is a problem if you 
> have really wide rows but I've never seen anything that would be 
> problematic in that realm even when I did a lot of HDFS/Hadoop 
> ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not 
> columns. With csv-conduit you might be able to incrementally process 
> the columns too based on my guess from glancing at the rather scary code.
>
> Let me know if you have any further questions.
>
> Cheers all.
>
> --- Chris Allen
>
>
Thank you, this looks rather useful. I will have a closer look at it for 
sure. Surprised that csv-conduit was so troublesome. I was in fact 
expecting/hoping for the opposite. I will just give it a try.

Thanks also to everyone else who replied. Let me add some tidbits to 
refine the problem space a bit. As I said before, the data is around 
12GB of CSV files, one file per month, with each line representing a 
user tuning in to a stream:

[date-time-stamp], [radio-stream-name], [duration], [mobile|desktop], 
[country], [areaCode]

which could be represented as:

data RadioStat = RadioStat {
                    rStart     :: Integer      -- POSIX time stamp
                  , rStation   :: Integer      -- index to station map
                  , rDuration  :: Integer      -- duration in seconds
                  , rAgent     :: Integer      -- index to agent map ("mobile", "desktop", ..)
                  , rCountry   :: Integer      -- index to country map ("DE", "CH", ..)
                  , rArea      :: Integer      -- German geo location info
                  }

I guess parsing the CSV into a list of RadioStat values, with 
corresponding entries in a HashMap for the station names, should work 
just fine (thanks again for your linked material, Chris).
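
A minimal sketch of that parsing step, under a few assumptions that are 
not from the actual data: the first column is taken to be POSIX seconds 
already (the real files carry a date-time stamp whose format I have not 
shown), fields contain no quoted commas, and Data.Map from containers 
stands in for a HashMap so that nothing beyond GHC's bundled libraries 
is needed. The names Tables, intern, parseLine and parseAll are made up 
for illustration.

```haskell
import           Data.Char       (isSpace)
import qualified Data.Map.Strict as M
import           Text.Read       (readMaybe)

-- One listening event, with names replaced by indices into lookup tables.
data RadioStat = RadioStat
  { rStart    :: Integer  -- POSIX time stamp (assumed already numeric)
  , rStation  :: Integer  -- index to station map
  , rDuration :: Integer  -- duration in seconds
  , rAgent    :: Integer  -- index to agent map
  , rCountry  :: Integer  -- index to country map
  , rArea     :: Integer  -- German geo location info
  } deriving (Show, Eq)

-- A lookup table from a name to the index assigned to it.
type Table = M.Map String Integer

-- The three tables that grow while parsing (hypothetical helper type).
data Tables = Tables
  { stations  :: Table
  , agents    :: Table
  , countries :: Table
  } deriving (Show, Eq)

emptyTables :: Tables
emptyTables = Tables M.empty M.empty M.empty

-- Return the index for a name, assigning the next free index if it is new.
intern :: String -> Table -> (Integer, Table)
intern name table =
  case M.lookup name table of
    Just ix -> (ix, table)
    Nothing -> let ix = fromIntegral (M.size table)
               in (ix, M.insert name ix table)

-- Naive field splitter: no support for quoted fields.
splitOn :: Char -> String -> [String]
splitOn sep s =
  case break (== sep) s of
    (field, [])       -> [field]
    (field, _ : rest) -> field : splitOn sep rest

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = f . f
  where f = reverse . dropWhile isSpace

-- Parse one line; Nothing if the field count or a number is wrong.
parseLine :: Tables -> String -> Maybe (RadioStat, Tables)
parseLine tables line =
  case map trim (splitOn ',' line) of
    [start, station, duration, agent, country, area] -> do
      t <- readMaybe start
      d <- readMaybe duration
      a <- readMaybe area
      let (si, st') = intern station (stations tables)
          (ai, ag') = intern agent (agents tables)
          (ci, co') = intern country (countries tables)
      return (RadioStat t si d ai ci a, Tables st' ag' co')
    _ -> Nothing

-- Parse all lines, silently skipping malformed ones.
parseAll :: [String] -> ([RadioStat], Tables)
parseAll = go emptyTables
  where
    go tables []       = ([], tables)
    go tables (l : ls) =
      case parseLine tables l of
        Nothing              -> go tables ls
        Just (stat, tables') ->
          let (stats, final) = go tables' ls
          in (stat : stats, final)
```

For the real 12GB this list would probably not be held in memory at 
once; the same parseLine step could be driven by one of the streaming 
libraries instead of parseAll.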

While this is straightforward, the type of queries I got as examples 
might indicate that I should not try to reinvent a query language but 
look for something else (?). Examples would be:

- summarize per day: total listening duration, average listening 
duration, number of listening actions
- summarize per day per agent: total listening duration, average 
listening duration, number of listening actions
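
Both example queries have the same shape: group by a key, then keep a 
running total and a count per key. A rough sketch of that as a strict 
fold follows. It assumes a "day" means days since the epoch in UTC 
(rStart divided by 86400), which may not be the day boundary actually 
wanted, and the names Summary, summarizeBy, perDay and perDayPerAgent 
are invented for the example.

```haskell
import           Data.List       (foldl')
import qualified Data.Map.Strict as M

data RadioStat = RadioStat
  { rStart    :: Integer  -- POSIX time stamp
  , rStation  :: Integer  -- index to station map
  , rDuration :: Integer  -- duration in seconds
  , rAgent    :: Integer  -- index to agent map
  , rCountry  :: Integer  -- index to country map
  , rArea     :: Integer  -- German geo location info
  } deriving (Show, Eq)

-- Running aggregate for one group: total seconds and number of actions.
data Summary = Summary
  { sTotal :: !Integer
  , sCount :: !Integer
  } deriving (Show, Eq)

-- Combine two aggregates for the same group.
merge :: Summary -> Summary -> Summary
merge (Summary t1 c1) (Summary t2 c2) = Summary (t1 + t2) (c1 + c2)

-- Average listening duration in seconds; 0 for an empty group.
average :: Summary -> Double
average (Summary total count)
  | count == 0 = 0
  | otherwise  = fromIntegral total / fromIntegral count

-- Day number since the epoch, in UTC (an assumption about "per day").
dayOf :: RadioStat -> Integer
dayOf s = rStart s `div` 86400

-- Group by an arbitrary key and aggregate in one pass over the input.
summarizeBy :: Ord k => (RadioStat -> k) -> [RadioStat] -> M.Map k Summary
summarizeBy key = foldl' step M.empty
  where
    step acc s = M.insertWith merge (key s) (Summary (rDuration s) 1) acc

-- Query 1: per day.
perDay :: [RadioStat] -> M.Map Integer Summary
perDay = summarizeBy dayOf

-- Query 2: per day per agent.
perDayPerAgent :: [RadioStat] -> M.Map (Integer, Integer) Summary
perDayPerAgent = summarizeBy (\s -> (dayOf s, rAgent s))
```

Since the fold only keeps one Summary per group, its memory use depends 
on the number of distinct days and agents rather than on the number of 
rows, provided the rows themselves are consumed incrementally.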

I don't think MySQL would perform all that well operating on a table 
with 125 million entries ;] What approach would you guys take?

Thanks for your input and sorry for the broad scope of these questions.
best wishes,
Tobi