[Haskell-cafe] Loading a csv file with ~200 columns into Haskell Record

Saurabh Nanda saurabhnanda at gmail.com
Mon Oct 2 02:22:09 UTC 2017


> Having to not have something which I can quickly start off on

What do you mean by that? And what precisely is the discomfort with
Haskell compared to Python for your use case?

On 02-Oct-2017 7:29 AM, "Guru Devanla" <gurudev.devanla at gmail.com> wrote:

> Thank you all for your helpful suggestions. As I wrote the original
> question, I too was trying to decide between using records to represent
> each row and defining a vector for each column, with each vector becoming
> an attribute of the record. I was also leaning towards the latter, given
> the performance needs.
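
A minimal sketch of the two layouts being weighed above. The column names
are hypothetical, and plain lists stand in for vectors so that the sketch
needs nothing beyond base; in real code each list would presumably be a
Data.Vector (or an unboxed vector).

```haskell
-- Hypothetical three-column table; all names are illustrative only.

-- Row-oriented: one record value per CSV row.
data Row = Row
  { rowId    :: Int
  , rowPrice :: Double
  , rowQty   :: Int
  } deriving (Show, Eq)

-- Column-oriented: one record for the whole table, one sequence per column.
-- Lists are a stand-in for vectors here.
data Table = Table
  { colId    :: [Int]
  , colPrice :: [Double]
  , colQty   :: [Int]
  } deriving (Show, Eq)

-- Convert a list of rows into the column layout.
toColumns :: [Row] -> Table
toColumns rows = Table
  { colId    = map rowId rows
  , colPrice = map rowPrice rows
  , colQty   = map rowQty rows
  }

-- Convert back; zipWith3 stops at the shortest column.
toRows :: Table -> [Row]
toRows t = zipWith3 Row (colId t) (colPrice t) (colQty t)

-- A column-wise aggregate touches only the column it needs.
totalQty :: Table -> Int
totalQty = sum . colQty

sampleRows :: [Row]
sampleRows = [Row 1 9.5 3, Row 2 4.0 10]
```

Loading this in GHCi and evaluating `totalQty (toColumns sampleRows)` sums
only the quantity column. The row layout is easier to write by hand; the
column layout is the one that tends to suit aggregates over wide tables.
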
>
> Since the file is currently available as a CSV, adding Persistent or any
> other ORM library would be an added dependency.
>
> I was trying to solve this problem without too many dependencies on other
> libraries and without having to learn new DSLs. It's a tempting time
> killer, as everyone here would understand.
>
> @Anthony Thank you for your answer as well. I have explored the Frames
> library in the past as I looked for Pandas-like features in Haskell.
> The library is useful and I have played around with it, but I was never
> confident in adopting it for a serious project. Part of my reluctance
> is the learning curve, plus I also need to familiarize myself with
> `lens`. But it looks like the project I have in hand is a good
> motivation to do both. I will try to use Frames and then report back. Also,
> apologies for not being able to share the data I am working on.
>
> With the original question, what I was trying to get at is how these
> kinds of problems are solved in real-world projects, such as when Haskell
> is used in data mining or in financial applications. I believe these
> applications deal with this kind of data, where the tables are wide. Having
> to not have something which I can quickly start off on troubles me and
> makes me wonder if the reason is my lack of understanding or just the pain
> of using static typing.
>
> Regards
>
>
> On Sun, Oct 1, 2017 at 1:58 PM, Anthony Cowley <acowley at seas.upenn.edu>
> wrote:
>
>>
>>
>> > On Sep 30, 2017, at 9:30 PM, Guru Devanla <gurudev.devanla at gmail.com> wrote:
>> >
>> > Hello All,
>> >
>> > I am in the process of replicating some Python code in Haskell.
>> >
>> > In Python, I load a couple of CSV files, each with more than 100
>> > columns, into a Pandas data frame. A Pandas data frame, in short, is a
>> > tabular structure which lets me perform a bunch of joins and filter out
>> > data. I generate different shapes of reports using these operations. Of
>> > course, I would love some type checking to help me with these merge and
>> > join operations as I create different reports.
>> >
>> > I am not looking to replicate the Pandas data-frame functionality in
>> > Haskell. The first thing I want to do is reach for the 'record' data
>> > structure. Here are some ideas I have:
>> >
>> > 1.  I need to declare all these 100+ columns in multiple record
>> >     structures.
>> > 2.  Some of the columns can have NULL/NaN values. Therefore, some of
>> >     the attributes of the record structure would be 'Maybe' values. Now, I
>> >     could drop some columns during load and cut down the number of
>> >     attributes I create per record structure.
>> > 3.  Create a dictionary of each record structure which will help me
>> >     index into them.
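
Points 2 and 3 above might look roughly like the following. This is only a
sketch under stated assumptions: the record and its three fields are made
up, the row is assumed to be already split into cells, and Data.Map comes
from the containers package that ships with GHC.

```haskell
import qualified Data.Map.Strict as Map
import Text.Read (readMaybe)

-- Hypothetical record for a few of the columns; field names are made up.
data Trade = Trade
  { tradeId :: Int
  , price   :: Maybe Double  -- Nothing when the cell is empty or NaN
  , region  :: Maybe String
  } deriving (Show, Eq)

-- Treat empty cells and the literal "NaN" as missing.
-- Note: a cell that fails to parse also ends up as Nothing.
nullable :: (String -> Maybe a) -> String -> Maybe a
nullable parse cell
  | cell == "" || cell == "NaN" = Nothing
  | otherwise                   = parse cell

-- Build a record from already-split cells; Nothing if the row is malformed
-- (wrong number of cells or an unreadable id).
parseTrade :: [String] -> Maybe Trade
parseTrade [i, p, r] = do
  tid <- readMaybe i
  pure (Trade tid (nullable readMaybe p) (nullable Just r))
parseTrade _ = Nothing

-- Index the records by their id column (point 3).
indexById :: [Trade] -> Map.Map Int Trade
indexById ts = Map.fromList [(tradeId t, t) | t <- ts]
```

With 100+ columns the record declaration and parseTrade are exactly the
boilerplate the question is about; the sketch only shows the shape of a
single record, not a way around writing it.
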
>> >
>> > I would like some feedback on the first 2 points. It seems like there is a
>> > lot of boilerplate code I have to generate for creating 100s of record
>> > attributes. Is this the only sane way to do this? What other patterns
>> > should I consider while solving such a problem?
>> >
>> > Also, I do not want to add too many dependencies to the project, but I am
>> > open to suggestions.
>> >
>> > Any input/advice on this would be very helpful.
>> >
>> > Thank you for the time!
>> > Guru
>>
>> The Frames package generates a vinyl record based on your data (like
>> hlist; with a functor parameter that can be Maybe to support missing data),
>> storing each column in a vector for very good runtime performance. As you
>> get past 100 columns, you may encounter compile-time performance issues. If
>> you have a sample data file you can make available, I can help diagnose
>> performance troubles.
>>
>> Anthony
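
The "functor parameter that can be Maybe" idea can be illustrated without
any library. The sketch below is NOT the Frames or vinyl API; it is a plain
data declaration with hypothetical field names, where every field is
wrapped in a type parameter f.

```haskell
import Data.Functor.Identity (Identity (..))

-- Each field is wrapped in f: Maybe while loading, Identity once complete.
data Row f = Row
  { rowId :: f Int
  , price :: f Double
  }

-- Succeeds only if every field is present.
complete :: Row Maybe -> Maybe (Row Identity)
complete (Row i p) = Row <$> fmap Identity i <*> fmap Identity p

-- Fill in defaults for missing fields instead of dropping the row.
withDefaults :: Row Identity -> Row Maybe -> Row Identity
withDefaults (Row di dp) (Row i p) =
  Row (maybe di Identity i) (maybe dp Identity p)
```

The same declaration serves both the "some cells may be missing" and the
"fully populated" views of a row. As described above, Frames derives its
record type from the data instead of having it written out by hand.
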
>>
>>
>>
>