[Haskell-cafe] Loading a csv file with ~200 columns into Haskell Record

Saurabh Nanda saurabhnanda at gmail.com
Sun Oct 1 18:03:27 UTC 2017


If your data originates from a DB, read the DB schema and use code-gen
or TH to generate your record structure. Please confirm beforehand that
your Haskell data pipeline can handle records with 100+ fields. I have a
strange feeling that some library or other is going to break at the
64-field mark.

If you don't have access to the underlying DB, read the CSV header and
code-gen your data structures. This will still lead to a lot of boilerplate
because your code-gen script will need to maintain a mapping from column
name to data-type. See if you can peek at the first row of the data and take
an educated guess about each column's data-type based on the column values.
This will not be 100% accurate, but you can get good results by manually
specifying only a few data-types instead of all 100+ of them.
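
A minimal sketch of that code-gen step, using only base. Everything in it is
a hypothetical illustration rather than an existing library: the names
(guessType, renderRecord, the "col_" field prefix), the sample header and row,
and the override table are all made up. The CSV split is deliberately naive
(no quoted fields), an empty first-row value falls back to "Maybe String",
and the header is assumed to be non-empty.

```haskell
module Main where

import Data.Char (isAlphaNum, isDigit, isSpace, toLower)
import Data.List (dropWhileEnd)
import Data.Maybe (fromMaybe)

-- | Naive comma split: no support for quoted fields or embedded commas.
splitOnComma :: String -> [String]
splitOnComma s = case break (== ',') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOnComma rest

trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

dropSign :: String -> String
dropSign ('-' : rest) = rest
dropSign s            = s

isDigits :: String -> Bool
isDigits s = not (null s) && all isDigit s

-- | Educated guess at a Haskell type from a single sample value.
guessType :: String -> String
guessType raw
  | null v                 = "Maybe String"  -- nothing to infer from
  | isDigits (dropSign v)  = "Int"
  | looksLikeDouble v      = "Double"
  | otherwise              = "String"
  where
    v = trim raw
    looksLikeDouble s = case break (== '.') (dropSign s) of
      (whole, '.' : frac) -> isDigits whole && isDigits frac
      _                   -> False

-- | Turn a column name into a field name. Collisions are not handled.
fieldName :: String -> String
fieldName name = "col_" ++ map sanitize name
  where sanitize c = if isAlphaNum c then toLower c else '_'

-- | Render a record declaration; manual overrides win over guesses.
renderRecord :: String -> [(String, String)] -> [String] -> [String] -> String
renderRecord typeName overrides header firstRow =
  unlines $
    [ "data " ++ typeName ++ " = " ++ typeName ]
    ++ zipWith (++) ("  { " : repeat "  , ") fields
    ++ [ "  } deriving (Show, Eq)" ]
  where
    -- Pad a short first row so every header column still gets a field.
    fields = zipWith field header (firstRow ++ repeat "")
    field name value =
      fieldName name ++ " :: "
        ++ fromMaybe (guessType value) (lookup name overrides)

main :: IO ()
main = do
  let header    = splitOnComma "id,price,Trade Date,notes"
      firstRow  = splitOnComma "42,3.14,2017-10-01,"
      -- The generated module would need to import Day from the time package.
      overrides = [("Trade Date", "Day")]
  putStr (renderRecord "Trade" overrides header firstRow)
```

Here only one column out of four needs a manual entry; the other three are
guessed from the first row. A real script would want to sample more than one
row before trusting a guess.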

-- Saurabh.

On Sun, Oct 1, 2017 at 4:38 PM, Leandro Ostera <leandro at ostera.io> wrote:

> Two things come to mind.
>
> The first one is *Crazy idea, bad pitch*: generate the record code from
> the data.
>
> The second is to make the records dynamically typed:
>
> Would it be simpler to define a Column type you can parameterize with a
> string for its name (GADTs?) so you automatically get a type of that
> specific column?
>
> That way as you read the CSV files you could define the type of the
> columns based on the actual column name.
>
> Rows would then become sets of pairings of defined columns and values,
> perhaps having a Maybe would encode that any given value for a particular
> column is missing. You could encode these pairings as a list too.
>
> At least there you can have type guarantees that you’re joining fields
> that are of the same column type. I think.
>
> Either way, my 2 cents and keep it up!
>
>
> On Sun, Oct 1, 2017 at 03:34, Guru Devanla <gurudev.devanla at gmail.com> wrote:
>
>> Hello All,
>>
>> I am in the process of porting some Python code to Haskell.
>>
>> In Python, I load a couple of CSV files, each with more than 100 columns,
>> into a Pandas data frame. A Pandas data frame is, in short, a tabular
>> structure that lets me perform a bunch of joins and filter out data. I
>> generate different shapes of reports using these operations. Of course, I
>> would love some type checking to help me with these merge and join
>> operations as I create different reports.
>>
>> I am not looking to replicate the Pandas data-frame functionality in
>> Haskell. The first thing I want to do is reach for the 'record' data
>> structure. Here are some ideas I have:
>>
>> 1.  I need to declare all these 100+ columns in multiple record
>> structures.
>> 2.  Some of the columns can have NULL/NaN values. Therefore, some of the
>> attributes of the record structure would be 'Maybe' values. Now, I could
>> drop some columns during load and cut down the number of attributes I
>> create per record structure.
>> 3.  Create a dictionary of each record structure which will help me index
>> into them.
>>
>> I would like some feedback on the first 2 points. It seems like there is a
>> lot of boilerplate code I have to generate for creating 100s of record
>> attributes. Is this the only sane way to do this? What other patterns
>> should I consider while solving such a problem?
>>
>> Also, I do not want to add too many dependencies to the project, but
>> I am open to suggestions.
>>
>> Any input/advice on this would be very helpful.
>>
>> Thank you for the time!
>> Guru
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
>
>



-- 
http://www.saurabhnanda.com