[Haskell-cafe] data analysis question

Christopher Allen cma at bitemyapp.com
Thu Nov 13 01:22:58 UTC 2014


I'm working on a Haskell article for https://howistart.org/ which is
actually about the rudiments of processing CSV data in Haskell.

To that end, take a look at my rather messy workspace here:
https://github.com/bitemyapp/csvtest

And my in-progress article here:
https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md
(please don't post this anywhere; it's incomplete!)

And here I'll link my notes on profiling memory use with different
streaming abstractions:
https://twitter.com/bitemyapp/status/531617919181258752

csv-conduit isn't in the test results because I couldn't figure out how to
use it. pipes-csv does proper streaming, but it uses cassava's parsing
machinery and data types. That could conceivably be a problem if you have
really wide rows, but I've never seen anything problematic in that realm,
even when I was doing a lot of HDFS/Hadoop ecosystem work. AFAICT pipes-csv
streams rows, but not columns. With csv-conduit you might be able to
process the columns incrementally as well, but that's only a guess from
glancing at the rather scary code.
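
For reference, here's a minimal sketch of row-wise streaming with
pipes-csv. The filename and the choice of raw ByteString fields are
placeholders; any FromRecord instance would do:

import qualified Data.ByteString as B
import qualified Data.Vector as V
import Data.Csv (HasHeader(NoHeader))
import Pipes (Producer)
import qualified Pipes.ByteString as PB
import qualified Pipes.Csv as Csv
import qualified Pipes.Prelude as P
import System.IO (IOMode(ReadMode), withFile)

-- Rows stream through one at a time; Left marks a malformed row.
rows :: Monad m
     => Producer B.ByteString m ()
     -> Producer (Either String (V.Vector B.ByteString)) m ()
rows = Csv.decode NoHeader

main :: IO ()
main = withFile "data.csv" ReadMode $ \h -> do
  n <- P.length (rows (PB.fromHandle h))
  putStrLn ("rows: " ++ show n)  -- constant-space row count

Each row is parsed and handed downstream before the next is read, so the
file never has to be resident in memory, but every row is still parsed as
a unit -- hence rows, not columns.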

Let me know if you have any further questions.

Cheers all.

--- Chris Allen





On Wed, Nov 12, 2014 at 4:17 PM, Markus Läll <markus.l2ll at gmail.com> wrote:

> Hi Tobias,
>
> What he could do is encode the column values as appropriately sized
> Words to reduce the size -- to make it fit in RAM. E.g. listening
> times as seconds, browsers as categorical variables (in statistics
> terms), etc. If some of the columns are arbitrary-length strings, then
> it seems possible to get the 12GB down by more than half.
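>
> Something like this is what I have in mind -- a sketch only, with
> field names and widths invented for illustration:
>
> import Data.Word (Word8, Word32)
>
> -- Strict fixed-width fields instead of arbitrary-length strings:
> -- 13 bytes of payload per row rather than a full line of text.
> data Entry = Entry
>   { channelId  :: !Word32  -- which channel was tuned in
>   , listenSecs :: !Word32  -- listening time, in seconds
>   , browserId  :: !Word8   -- user agent as a categorical code
>   , userId     :: !Word32
>   } deriving Show
>
> Packed into an unboxed vector, 250 million rows of this shape come to
> roughly 3.3GB of payload, which is consistent with cutting the 12GB by
> more than half.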
>
> If he doesn't know Haskell, then I'd suggest using another language.
> (Years ago I tried to do a bigger uni project in Haskell -- being a
> noob -- and failed miserably.)
> On Nov 12, 2014 10:45 AM, "Tobias Pflug" <tobias.pflug at gmx.net> wrote:
>
>> Hi,
>>
>> just the other day I talked to a friend of mine who works for an
>> online radio service. He told me he is currently looking into how best
>> to work with assorted usage data: currently 250 million entries in a
>> 12GB CSV file, comprising information such as which channel was tuned
>> in for how long, with which user agent, and so on.
>>
>> He stumbled upon the K and Q programming languages [1][2], which
>> apparently work nicely for this, unfamiliar as they might seem.
>>
>> This is certainly not my area of expertise at all. I was just
>> wondering how some of you would suggest approaching this with Haskell.
>> How would you most efficiently parse such data and evaluate custom
>> queries against it?
>>
>> Thanks for your time,
>> Tobi
>>
>> [1] http://en.wikipedia.org/wiki/K_(programming_language)
>> [2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)