[Haskell-cafe] Loading a csv file with ~200 columns into Haskell Record

Saurabh Nanda saurabhnanda at gmail.com
Tue Oct 3 05:04:01 UTC 2017


Do evaluate the option of peeking at the first few rows of the CSV and
generating the types via code-gen. This will allow your transformation
pipeline to fail-fast if your CSV format changes.
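
As a rough illustration of the code-gen idea, here is a minimal sketch that
reads a header line plus a few sample rows and prints a record declaration.
Everything in it is an assumption made for illustration: the names ColType,
inferColumn and generateRecord are hypothetical, cells are split naively on
commas (no quoted fields), only Int, Double and String are inferred, and
duplicate or reserved field names are not handled. It uses nothing outside
base.

```haskell
import Data.Char (isAlphaNum, isDigit, toLower)
import Text.Read (readMaybe)

-- | Column types this sketch can infer, ordered from narrowest to widest.
data ColType = TInt | TDouble | TString
  deriving (Eq, Ord, Show)

-- | Naive comma split; assumes no quoted fields or embedded commas.
splitOnComma :: String -> [String]
splitOnComma s = case break (== ',') s of
  (cell, [])       -> [cell]
  (cell, _ : rest) -> cell : splitOnComma rest

-- | Narrowest type that can hold one non-empty cell.
cellType :: String -> ColType
cellType c
  | Just _ <- (readMaybe c :: Maybe Int)    = TInt
  | Just _ <- (readMaybe c :: Maybe Double) = TDouble
  | otherwise                               = TString

-- | Infer (type, nullable) for one column from its sampled cells.
-- An empty cell marks the column as nullable.
inferColumn :: [String] -> (ColType, Bool)
inferColumn cells = (ty, nullable)
  where
    present  = filter (not . null) cells
    nullable = length present < length cells
    ty       = if null present then TString else maximum (map cellType present)

-- | Turn a CSV header cell into a plausible record field name.
fieldName :: String -> String
fieldName raw =
  case map clean raw of
    [] -> "col"
    (c : cs)
      | isDigit c -> "col_" ++ (c : cs)
      | otherwise -> toLower c : cs
  where
    clean ch = if isAlphaNum ch then ch else '_'

-- | Haskell type text for an inferred column.
renderType :: (ColType, Bool) -> String
renderType (ty, nullable) = (if nullable then "Maybe " else "") ++ base
  where
    base = case ty of
      TInt    -> "Int"
      TDouble -> "Double"
      TString -> "String"

-- | Emit a record declaration from a header line and a few sample rows.
generateRecord :: String -> String -> [String] -> String
generateRecord typeName header sampleRows =
  unlines $
    ["data " ++ typeName ++ " = " ++ typeName]
      ++ zipWith (++) ("  { " : repeat "  , ") fields
      ++ ["  } deriving (Show, Eq)"]
  where
    names  = map fieldName (splitOnComma header)
    pad r  = take (length names) (r ++ repeat "")  -- tolerate short rows
    rows   = map (pad . splitOnComma) sampleRows
    cols   = [map (!! i) rows | i <- [0 .. length names - 1]]
    fields = zipWith (\n c -> n ++ " :: " ++ renderType (inferColumn c)) names cols
```

For example, in GHCi,
putStr (generateRecord "Row" "trade_id,price,note" ["1,2.5,hello", "2,,"])
prints a Row record with the fields trade_id :: Int, price :: Maybe Double
and note :: Maybe String. The generated declaration would be written to a
module and compiled together with a decoder, so a changed column name or
type is more likely to show up as a build or decode failure than as a
silent downstream error.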

On 02-Oct-2017 8:27 PM, "Guru Devanla" <gurudev.devanla at gmail.com> wrote:

> Yes, Thank you for the encouraging words. I will keep at it.
>
> >> Also, be sure of what exactly is the warm fuzzy feeling that the
> compiler is giving you. From whatever you have described, most of your bugs
> are going to occur when you change your data transformation pipeline (core
> logic) or your CSV format. Compilation and static types will help in only
> one of those.
>
> Yes, I am aware of that. I have tests for the core logic, but the
> mechanical part of type checking the data that passes through this pipeline
> is much desired.
>
>
>
>
> On Sun, Oct 1, 2017 at 10:00 PM, Saurabh Nanda <saurabhnanda at gmail.com>
> wrote:
>
>> I wholeheartedly agree with your sentiment. I felt the same way in
>> my initial days, and only my stubborn head prevented me from giving up on
>> Haskell [1].
>>
>> Haskell is **unnecessarily** hard. It doesn't have to be that way. Stop
>> beating yourself up over what is essentially a tooling, API design, and
>> documentation problem. Start speaking up instead.
>>
>> Wrt the current problem at hand, try thinking of the types as a **spec**
>> rather than boilerplate. That spec is necessary to give you your
>> compile-time guarantees. Without the spec the compiler can't do anything.
>> This spec is non-existent in Python.
>>
>> Also, be sure of what exactly is the warm fuzzy feeling that the compiler
>> is giving you. From whatever you have described, most of your bugs are
>> going to occur when you change your data transformation pipeline (core
>> logic) or your CSV format. Compilation and static types will help in only
>> one of those.
>>
>> [1] https://medium.com/@saurabhnanda/why-building-web-apps-in-haskell-is-harder-than-it-ought-to-be-c9b13de0be4f
>>
>>
>> On 02-Oct-2017 8:20 AM, "Guru Devanla" <gurudev.devanla at gmail.com> wrote:
>>
>> Did not mean to complain. For example, using the DataFrame library in
>> Pandas did not involve a big learning curve to understand its syntax.
>> With a basic knowledge of Python it was easy to learn and start using
>> it. Trying to replicate that kind of program in Haskell seems to be a
>> lot more difficult for me. The leap from dynamic typing to static
>> typing does involve this kind of boilerplate, and I am fine with it.
>> But when I try to reach for the libraries in use, I have to learn a lot
>> of library syntax, special operators, etc. to get stuff done.
>> I understand that is the philosophy espoused by the Haskell community, but
>> it takes up a lot of the spare time I have to both learn and build my toy
>> projects. I love coding in Haskell. But that love takes a lot of time
>> before it translates to any good code I can show. It could be just me.
>>
>> Again, I am happy to do this out of my love for Haskell. But I am
>> hesitant to recommend it to other team members since it is difficult for
>> me to quantify the gains. I say this with limited experience building
>> real-world Haskell applications, so my train of thought may be
>> totally misguided.
>>
>> On Sun, Oct 1, 2017 at 7:22 PM, Saurabh Nanda <saurabhnanda at gmail.com>
>> wrote:
>>
>>> > Having to not have something which I can quickly start off on
>>>
>>> What do you mean by that? And what precisely is the discomfort with
>>> Haskell compared to Python for your use case?
>>>
>>> On 02-Oct-2017 7:29 AM, "Guru Devanla" <gurudev.devanla at gmail.com>
>>> wrote:
>>>
>>>> Thank you all for your helpful suggestions. As I wrote the original
>>>> question, I too was trying to decide between using records to represent
>>>> each row, or defining a vector for each column and making each vector
>>>> an attribute of the record. I was leaning towards the latter, given the
>>>> performance needs.
>>>>
>>>> Since the file is currently available as a CSV, adding Persistent or
>>>> any ORM library would be an added dependency.
>>>>
>>>> I was trying to solve this problem without too many dependencies on
>>>> other libraries and without having to learn new DSLs. It's a tempting
>>>> time killer, as everyone here would understand.
>>>>
>>>> @Anthony Thank you for your answer as well. I have explored the Frames
>>>> library in the past as I looked for Pandas-like features in Haskell.
>>>> The library is useful and I have played around with it. But I was never
>>>> confident in adopting it for a serious project. Part of my reluctance
>>>> is the learning curve, plus I also need to familiarize myself with
>>>> `lens`. But it looks like the project I have in hand is a good
>>>> motivation to do both. I will try to use Frames and then report back.
>>>> Also, apologies for not being able to share the data I am working on.
>>>>
>>>> With the original question, what I was trying to get at is how
>>>> these kinds of problems are solved in real-world projects, such as when
>>>> Haskell is used in data mining or in financial applications. I believe
>>>> these applications deal with this kind of data, where the tables are wide. Having
>>>> to not have something which I can quickly start off on troubles me and
>>>> makes me wonder if the reason is my lack of understanding or just the pain
>>>> of using static typing.
>>>>
>>>> Regards
>>>>
>>>>
>>>> On Sun, Oct 1, 2017 at 1:58 PM, Anthony Cowley <acowley at seas.upenn.edu>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> > On Sep 30, 2017, at 9:30 PM, Guru Devanla <gurudev.devanla at gmail.com> wrote:
>>>>> >
>>>>> > Hello All,
>>>>> >
>>>>> > I am in the process of replicating some Python code in Haskell.
>>>>> >
>>>>> > In Python, I load a couple of CSV files, each having more than 100
>>>>> > columns, into a Pandas data frame. A Pandas data frame, in short, is a
>>>>> > tabular structure which lets me perform a bunch of joins and filter out
>>>>> > data. I generate different shapes of reports using these operations. Of
>>>>> > course, I would love some type checking to help me with these merge and
>>>>> > join operations as I create different reports.
>>>>> >
>>>>> > I am not looking to replicate the Pandas data frame functionality in
>>>>> > Haskell. The first thing I want to do is reach for the 'record' data
>>>>> > structure. Here are some ideas I have:
>>>>> >
>>>>> > 1.  I need to declare all these 100+ columns in multiple record
>>>>> >     structures.
>>>>> > 2.  Some of the columns can have NULL/NaN values. Therefore, some of
>>>>> >     the attributes of the record structure would be 'Maybe' values. Now, I
>>>>> >     could drop some columns during load and cut down the number of
>>>>> >     attributes I create per record structure.
>>>>> > 3.  Create a dictionary of each record structure which will help me
>>>>> >     index into them.
>>>>> >
>>>>> > I would like some feedback on the first 2 points. It seems like there
>>>>> > is a lot of boilerplate code I have to generate for creating 100s of
>>>>> > record attributes. Is this the only sane way to do this? What other
>>>>> > patterns should I consider while solving such a problem?
>>>>> >
>>>>> > Also, I do not want to add too many dependencies to the project,
>>>>> > but I am open to suggestions.
>>>>> >
>>>>> > Any input/advice on this would be very helpful.
>>>>> >
>>>>> > Thank you for the time!
>>>>> > Guru
>>>>>
>>>>> The Frames package generates a vinyl record based on your data (like
>>>>> hlist; with a functor parameter that can be Maybe to support missing data),
>>>>> storing each column in a vector for very good runtime performance. As you
>>>>> get past 100 columns, you may encounter compile-time performance issues. If
>>>>> you have a sample data file you can make available, I can help diagnose
>>>>> performance troubles.
>>>>>
>>>>> Anthony