[Haskell-cafe] data analysis question

Christopher Allen cma at bitemyapp.com
Thu Nov 13 16:49:38 UTC 2014


I wouldn't hold it against csv-conduit too much, conduit and Pipes both
take some getting used to and I hadn't used either in anger before I
started kicking around the CSV parsing stuff. I was a bit spoiled by how
easy Cassava was to use as well.

Thanks to Christopher Reichert's PR, there is an example for csv-conduit as
well, so you've now got four ways to try processing CSV, *three* of which
are streaming :)

I'd say just try each in turn and see which you're happy with, if you're
not married to a particular streaming abstraction.

> I don't think MySQL would perform all that well operating on a table with
> 125 million entries ;] What approach would you guys take?

Big enough machine with enough memory and it's fine. I used to keep a job
queue with a billion rows on MySQL at a gig long ago. Could do it with
PostgreSQL pretty easily too. On your personal work machine? I dunno.

Not trying to steer you away from using Haskell here by any means, but if
you can process your data in a SQL database efficiently, that's often
pretty optimal in terms of speed and ease of use until you start doing more
sophisticated analysis. I don't have a lot of experience in data analysis,
but I've known people to do some preliminary slicing/dicing in SQL before
moving on to building a custom model for understanding the data.

Cheers,
Chris Allen

On Thu, Nov 13, 2014 at 3:37 AM, Tobias Pflug <tobias.pflug at gmx.net> wrote:

> On 13.11.2014 02:22, Christopher Allen wrote:
>
>> I'm working on a Haskell article for https://howistart.org/ which is
>> actually about the rudiments of processing CSV data in Haskell.
>>
>> To that end, take a look at my rather messy workspace here:
>> https://github.com/bitemyapp/csvtest
>>
>> And my in-progress article here:
>> https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md
>> (please don't post this anywhere, incomplete!)
>>
>> And here I'll link my notes on profiling memory use with different
>> streaming abstractions:
>> https://twitter.com/bitemyapp/status/531617919181258752
>>
>> csv-conduit isn't in the test results because I couldn't figure out how
>> to use it. pipes-csv is proper streaming, but uses cassava's parsing
>> machinery and data types. Possibly this is a problem if you have really
>> wide rows but I've never seen anything that would be problematic in that
>> realm even when I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with
>> pipes-csv you're streaming rows, but not columns. With csv-conduit you
>> might be able to incrementally process the columns too based on my guess
>> from glancing at the rather scary code.
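>>
>> For what it's worth, here's roughly what I mean by row streaming with
>> pipes-csv, as an untested sketch typed from memory (so treat the exact
>> types as an assumption; the file name is a placeholder):
>>
>>     import Data.Csv (HasHeader (NoHeader))
>>     import qualified Data.ByteString as B
>>     import qualified Data.Vector as V
>>     import Pipes
>>     import qualified Pipes.ByteString as PB
>>     import qualified Pipes.Csv as PCsv
>>     import qualified Pipes.Prelude as P
>>     import System.IO (IOMode (ReadMode), withFile)
>>
>>     main :: IO ()
>>     main =
>>       withFile "data.csv" ReadMode $ \h -> do
>>         -- each record arrives as a whole Vector of fields: rows stream,
>>         -- the columns inside a row do not
>>         n <- P.length (PCsv.decode NoHeader (PB.fromHandle h)
>>                          :: Producer (Either String (V.Vector B.ByteString)) IO ())
>>         print n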
>>
>> Let me know if you have any further questions.
>>
>> Cheers all.
>>
>> --- Chris Allen
>>
>>
>
> Thank you, this looks rather useful. I will have a closer look at it for
> sure. Surprised that csv-conduit was so troublesome. I was in fact
> expecting/hoping for the opposite. I will just give it a try.
>
> Thanks also to everyone else who replied. Let me add some tidbits to
> refine the problem space a bit. As I said before, the size of the data is
> around 12GB of csv files, one file per month, with each line representing
> a user tuning in to a stream:
>
> [date-time-stamp], [radio-stream-name], [duration], [mobile|desktop],
> [country], [areaCode]
>
> which could be represented as:
>
> data RadioStat = RadioStat
>   { rStart     :: Integer   -- POSIX time stamp
>   , rStation   :: Integer   -- index to station map
>   , rDuration  :: Integer   -- duration in seconds
>   , rAgent     :: Integer   -- index to agent map ("mobile", "desktop", ..)
>   , rCountry   :: Integer   -- index to country map ("DE", "CH", ..)
>   , rArea      :: Integer   -- German geo location info
>   }
>
> I guess parsing a csv file into a [RadioStat] list, plus the respective
> entries in a HashMap for the station names, should work just fine (thanks
> again for your linked material, Chris).
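>
> Roughly what I have in mind, as an untested sketch (the RawStat type, the
> field order and the file name are just assumptions of mine), would be to
> parse the raw rows with cassava first and intern the station names along
> the way:
>
>     import Control.Monad (mzero)
>     import qualified Data.ByteString.Lazy as BL
>     import Data.Csv (FromRecord (..), HasHeader (NoHeader), decode, (.!))
>     import qualified Data.HashMap.Strict as HM
>     import Data.Text (Text)
>     import qualified Data.Vector as V
>
>     -- one raw CSV line, before station/agent/country become indices
>     data RawStat = RawStat
>       { rawStart    :: Text      -- raw date-time stamp, still to be converted
>       , rawStation  :: Text
>       , rawDuration :: Integer   -- duration in seconds
>       , rawAgent    :: Text
>       , rawCountry  :: Text
>       , rawArea     :: Integer
>       }
>
>     instance FromRecord RawStat where
>       parseRecord v
>         | V.length v == 6 =
>             RawStat <$> v .! 0 <*> v .! 1 <*> v .! 2
>                     <*> v .! 3 <*> v .! 4 <*> v .! 5
>         | otherwise = mzero
>
>     -- look a station name up or assign it the next free index
>     intern :: Text -> HM.HashMap Text Int -> (Int, HM.HashMap Text Int)
>     intern name m = case HM.lookup name m of
>       Just i  -> (i, m)
>       Nothing -> let i = HM.size m in (i, HM.insert name i m)
>
>     main :: IO ()
>     main = do
>       raw <- BL.readFile "201401.csv"
>       case decode NoHeader raw of
>         Left err   -> putStrLn err
>         Right rows -> do
>           let stations = V.foldl' (\m r -> snd (intern (rawStation r) m))
>                                   HM.empty
>                                   (rows :: V.Vector RawStat)
>           print (V.length rows, HM.size stations)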
>
> While this is straightforward, the type of queries I got as examples
> might indicate that I should not try to reinvent a query language but
> look for something else (?). Examples would be:
>
> - summarize per day: total listening duration, average listening
>   duration, number of listening actions
> - summarize per day per agent: total listening duration, average
>   listening duration, number of listening actions (see the sketch below)
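>
> Something like the following fold is roughly what I have in mind for those
> two summaries (an untested sketch over the RadioStat record above; Stats,
> dayOf and the other helpers are just made-up names):
>
>     import Data.List (foldl')
>     import qualified Data.Map.Strict as M
>
>     -- running totals for one group
>     data Stats = Stats
>       { stTotal :: !Integer   -- total listening duration in seconds
>       , stCount :: !Integer   -- number of listening actions
>       }
>
>     combine :: Stats -> Stats -> Stats
>     combine (Stats t1 c1) (Stats t2 c2) = Stats (t1 + t2) (c1 + c2)
>
>     dayOf :: RadioStat -> Integer
>     dayOf r = rStart r `div` 86400    -- POSIX seconds -> day number
>
>     -- total duration and action count per day
>     perDay :: [RadioStat] -> M.Map Integer Stats
>     perDay = foldl' step M.empty
>       where
>         step m r = M.insertWith combine (dayOf r) (Stats (rDuration r) 1) m
>
>     -- the same, keyed by (day, agent)
>     perDayPerAgent :: [RadioStat] -> M.Map (Integer, Integer) Stats
>     perDayPerAgent = foldl' step M.empty
>       where
>         step m r =
>           M.insertWith combine (dayOf r, rAgent r) (Stats (rDuration r) 1) m
>
>     avgDuration :: Stats -> Double
>     avgDuration (Stats t c) = fromIntegral t / fromIntegral c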
>
> I don't think MySQL would perform all that well operating on a table with
> 125 million entries ;] What approach would you guys take?
>
> Thanks for your input and sorry for the broad scope of these questions.
> best wishes,
> Tobi
>

