[Haskell-cafe] Loading a csv file with ~200 columns into Haskell Record

Anthony Cowley acowley at seas.upenn.edu
Sun Oct 1 20:58:36 UTC 2017


> On Sep 30, 2017, at 9:30 PM, Guru Devanla <gurudev.devanla at gmail.com> wrote:
> 
> Hello All,
> 
> I am in the process of replicating some Python code in Haskell.
> 
> In Python, I load a couple of CSV files, each with more than 100 columns, into a Pandas data frame. A Pandas data frame, in short, is a tabular structure that lets me perform a bunch of joins and filter out data. I generate different shapes of reports using these operations. Of course, I would love some type checking to help me with these merge and join operations as I create different reports.
>  
> I am not looking to replicate the Pandas data-frame functionality in Haskell. The first thing I want to do is reach for the 'record' data structure. Here are some ideas I have:
> 
> 1.  I need to declare all these 100+ columns as fields across multiple record structures.
> 2.  Some of the columns can have NULL/NaN values. Therefore, some of the attributes of the record structure would be 'Maybe' values. Now, I could drop some columns during load and cut down the number of attributes I create per record structure.
> 3.  Create a dictionary of each record structure, which will help me index into them.
> 
> I would like some feedback on the first 2 points. It seems like there is a lot of boilerplate code I have to generate to create 100s of record attributes. Is this the only sane way to do this? What other patterns should I consider while solving such a problem?
> 
> Also, I do not want to add too many dependencies to the project, but I am open to suggestions.
> 
> Any input/advice on this would be very helpful.
> 
> Thank you for the time!
> Guru

The Frames package generates a vinyl record type based on your data (a vinyl record is like an HList, with a functor parameter that can be Maybe to support missing data) and stores each column in a vector for very good runtime performance. Once you get past 100 columns, you may run into compile-time performance issues. If you have a sample data file you can make available, I can help diagnose performance troubles.
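
To make the "functor parameter" idea concrete, here is a minimal hand-written sketch using only base. It is not the Frames or vinyl API: the Row type, its two fields, and the helpers parseRow, readDouble, and complete are all hypothetical names made up for illustration. Frames would generate the record type for you rather than have you write it out, but the role of the functor parameter is roughly the same.

```haskell
module Main where

import Data.Functor.Identity (Identity (..))

-- Hypothetical two-column row (not generated by Frames).
-- Every field is wrapped in a functor f chosen by the caller.
data Row f = Row
  { rowName  :: f String
  , rowPrice :: f Double
  }

-- With f = Maybe, any field may be missing after parsing.
type PartialRow = Row Maybe

-- With f = Identity, every field is known to be present.
type FullRow = Row Identity

-- Parse a Double, rejecting empty or malformed cells.
readDouble :: String -> Maybe Double
readDouble s = case reads s of
  [(d, "")] -> Just d
  _         -> Nothing

-- Parse one already-split CSV line; empty cells become Nothing.
-- Lines with the wrong number of cells are rejected outright.
parseRow :: [String] -> Maybe PartialRow
parseRow [n, p] = Just (Row (cell n) (cell p >>= readDouble))
  where
    cell s = if null s then Nothing else Just s
parseRow _ = Nothing

-- Promote a partial row to a full one only if no field is missing.
complete :: PartialRow -> Maybe FullRow
complete (Row n p) = Row <$> fmap Identity n <*> fmap Identity p

main :: IO ()
main = do
  let rawRows = [["widget", "2.5"], ["gadget", ""]]
      parsed  = [r | Just r <- map parseRow rawRows]
      full    = [r | Just r <- map complete parsed]
  -- Only rows with every field present survive.
  print (map (runIdentity . rowName) full)
```

The same record declaration serves both the "may have holes" and the "fully populated" views, so you do not need a second copy of a 100+ field record with every field changed to Maybe.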

Anthony



