[Haskell-cafe] Loading a csv file with ~200 columns into Haskell Record
Guru Devanla
gurudev.devanla at gmail.com
Mon Oct 2 04:02:12 UTC 2017
Yes, I am totally in agreement. My motivation to replicate this project and
demonstrate the power of Haskell in these scenarios boils down to the 2
reasons you rightly mentioned:
>> You also get the confidence of writing transformation and reduction
functions whose types are consistent with your actual data,
Just this aspect makes me lose sleep looking at Python code. I crave
such guarantees at compile time, and that is the reason why I am replicating
this implementation in Haskell. I am sure I will get this guarantee in
Haskell. **But at what cost is what I am in the process of understanding.**
>> The upside is compiler-verified safety, and runtime performance
informed by that compile-time work.
I agree with compiler-verified safety. *I will have to prove the
performance part of this exercise to myself.*
I was not able to share the data due to licensing restrictions. But, I will
get in touch with you offline once I am at a point of sharing some stats.
Thank you very much for your input and the effort you have been putting
into Frames.
Regards
Guru
On Sun, Oct 1, 2017 at 8:46 PM, Anthony Cowley <acowley at seas.upenn.edu>
wrote:
>
>
> On Oct 1, 2017, at 9:55 PM, Guru Devanla <gurudev.devanla at gmail.com>
> wrote:
>
> Thank you all for your helpful suggestions. When I wrote the original
> question, I was trying to decide between using a record to represent each
> row and defining a vector for each column, with each vector becoming an
> attribute of the record. I was leaning towards the latter given the
> performance needs.
>
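The two layouts weighed above can be put side by side in a minimal sketch. The field names are hypothetical, and `UArray` from the `array` package stands in for the `vector` package only so that the example needs nothing beyond the libraries bundled with GHC.

```haskell
import Data.Array.Unboxed (UArray, listArray, elems, (!))

-- Row-oriented: one record per CSV row (hypothetical fields).
data RowRec = RowRec { rPrice :: Double, rQty :: Int }

-- Column-oriented: one unboxed array per CSV column.
data Columns = Columns
  { priceCol :: UArray Int Double
  , qtyCol   :: UArray Int Int
  }

-- Transpose a list of rows into columns.
toColumns :: [RowRec] -> Columns
toColumns rows = Columns
  { priceCol = listArray (0, n - 1) (map rPrice rows)
  , qtyCol   = listArray (0, n - 1) (map rQty rows)
  }
  where n = length rows

-- A column reduction touches only the one array it needs.
totalQty :: Columns -> Int
totalQty = sum . elems . qtyCol

-- Reassemble a single row by index when row-wise access is needed.
rowAt :: Columns -> Int -> RowRec
rowAt cols i = RowRec (priceCol cols ! i) (qtyCol cols ! i)
```

The column layout keeps each column contiguous and unboxed, which is where the performance argument comes from; the row record remains available as a view through `rowAt`.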
> Since the file is currently available as a CSV, adding Persistent or any
> ORM library would be an added dependency.
>
> I was trying to solve this problem without too many dependencies on other
> libraries and without having to learn new DSLs. It's a tempting time
> killer, as everyone here would understand.
>
> @Anthony Thank you for your answer as well. I have explored the Frames
> library in the past as I looked for Pandas-like features in Haskell.
> The library is useful and I have played around with it. But I was never
> confident in adopting it for a serious project. Part of my reluctance
> is the learning curve, plus I also need to familiarize myself with
> `lens`. But it looks like the project I have in hand is a good
> motivation to do both. I will try to use Frames and then report back. Also,
> apologies for not being able to share the data I am working on.
>
> With the original question, what I was trying to get at is how these
> kinds of problems are solved in real-world projects, such as when Haskell
> is used in data mining or in financial applications. I believe these
> applications deal with this kind of data, where the tables are wide. Not
> having something I can quickly get started with troubles me and makes me
> wonder whether the reason is my lack of understanding or just the pain of
> using static typing.
>
> Regards
>
>
>
> The pain is that of a rock yet to be smoothed by a running current: it is
> neither your lack of understanding nor something inherent to static typing.
> I ask for a sample file because the only way we can improve is through
> contact with real world use. I can say that Frames has been demonstrated to
> give performance neck and neck with Pandas in conjunction with greatly
> reduced (i.e., an order of magnitude less) memory use. You also get the
> confidence of writing transformation and reduction functions whose types
> are consistent with your actual data, and that consistency can be verified
> as you type by tooling like Intero.
>
> Your concerns are justified: the problem with using Haskell for data
> processing is that without attempts like Frames, you still have this
> disconnect between the types that characterize your data and the types
> delineating your program code. Add to this the comparative dearth of
> statistical analysis and plotting options between Haskell and R or Python,
> and you can see that Haskell only makes sense if you want to use it for
> other reasons (e.g., familiarity, or interoperation with streaming or server
> libraries where the Haskell ecosystem is healthy). In the realm of data
> analysis, you are taking a risk choosing Haskell, but it is not a
> thoughtless risk. The upside is compiler-verified safety, and runtime
> performance informed by that compile-time work.
>
> So I’ll be happy if you can help improve the Frames story, but it is
> certainly a story still in progress.
>
> Anthony
>
>
>
>
> On Sun, Oct 1, 2017 at 1:58 PM, Anthony Cowley <acowley at seas.upenn.edu>
> wrote:
>
>>
>>
>> > On Sep 30, 2017, at 9:30 PM, Guru Devanla <gurudev.devanla at gmail.com>
>> wrote:
>> >
>> > Hello All,
>> >
>> > I am in the process of replicating some Python code in Haskell.
>> >
>> > In Python, I load a couple of csv files, each having more than 100
>> columns, into a Pandas data frame. A Pandas data frame, in short, is a
>> tabular structure which lets me perform a bunch of joins and filter out
>> data. I generate different shapes of reports using these operations. Of
>> course, I would love some type checking to help me with these merge and
>> join operations as I create different reports.
>> >
>> > I am not looking to replicate the Pandas data-frame functionality in
>> Haskell. First thing I want to do is reach out to the 'record' data
>> structure. Here are some ideas I have:
>> >
>> > 1. I need to declare all these 100+ columns into multiple record
>> structures.
>> > 2. Some of the columns can have NULL/NaN values. Therefore, some of
>> the attributes of the record structure would be 'Maybe' values. Now, I
>> could drop some columns during load and cut down the number of attributes I
>> create per record structure.
>> > 3. Create a dictionary of each record structure which will help me
>> index into them.
>> >
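The three ideas above can be sketched with nothing beyond `base` and `containers`. This is only an illustration: the `Trade` record, its field names, and the `tradeId` key are hypothetical rather than taken from the actual data, and a real 100+ column file would need many more fields.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical record for one CSV row; a real file would have 100+ fields.
data Trade = Trade
  { tradeId :: Int           -- always present, used as the lookup key
  , symbol  :: String
  , price   :: Maybe Double  -- column that may be NULL/NaN in the CSV
  , volume  :: Maybe Int     -- likewise optional
  } deriving (Show, Eq)

-- Idea 3: a dictionary from key to record, for indexed access.
indexById :: [Trade] -> Map.Map Int Trade
indexById trades = Map.fromList [(tradeId t, t) | t <- trades]

-- Missing values must be handled explicitly before arithmetic is allowed.
notional :: Trade -> Maybe Double
notional t = (*) <$> price t <*> (fromIntegral <$> volume t)

-- Made-up rows, one of them with a missing price.
sampleTrades :: [Trade]
sampleTrades =
  [ Trade 1 "ABC" (Just 10.0) (Just 3)
  , Trade 2 "XYZ" Nothing     (Just 5)
  ]
```

Here `notional` cannot be written without deciding what a missing price means, which is the kind of compile-time check the question asks for. The boilerplate concern is also visible: every column is one more hand-written field.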
>> > I would like some feedback on the first 2 points. Seems like there is a
>> lot of boilerplate code I have to generate for creating 100s of record
>> attributes. Is this the only sane way to do this? What other patterns
>> should I consider while solving such a problem?
>> >
>> > Also, I do not want to add too many dependencies into the project, but
>> I am open to suggestions.
>> >
>> > Any input/advice on this would be very helpful.
>> >
>> > Thank you for the time!
>> > Guru
>>
>> The Frames package generates a vinyl record based on your data (like
>> hlist; with a functor parameter that can be Maybe to support missing data),
>> storing each column in a vector for very good runtime performance. As you
>> get past 100 columns, you may encounter compile-time performance issues. If
>> you have a sample data file you can make available, I can help diagnose
>> performance troubles.
>>
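The idea of a record with a functor parameter can be illustrated in plain Haskell. The sketch below is not the vinyl or Frames record type and uses none of their API; `Row`, its fields, and `validate` are hypothetical names chosen only to show how one declaration can serve both rows with missing cells and fully populated rows.

```haskell
import Data.Functor.Identity (Identity (..))

-- Hypothetical two-column row, parameterised by a functor f.
data Row f = Row
  { rowPrice  :: f Double
  , rowVolume :: f Int
  }

-- A freshly parsed row: any cell may be missing.
type PartialRow = Row Maybe

-- A validated row: every cell is known to be present.
type FullRow = Row Identity

-- Succeeds only if no cell is missing.
validate :: PartialRow -> Maybe FullRow
validate (Row p v) = Row <$> (Identity <$> p) <*> (Identity <$> v)

-- Works on validated rows only, so it needs no Nothing case.
rowNotional :: FullRow -> Double
rowNotional (Row (Identity p) (Identity v)) = p * fromIntegral v
```

For example, `fmap rowNotional (validate (Row (Just 2.5) (Just 4)))` yields a value only because both cells are present, while a row with any `Nothing` is rejected once, at validation, instead of in every downstream function.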
>> Anthony
>>
>>
>>