[Haskell-cafe] Haskell and Big Data

Alexander Kjeldaas alexander.kjeldaas at gmail.com
Sat Dec 21 13:50:51 UTC 2013


In the HPCC documentation it is hard to cut through the buzzword jungle.
Is there an efficient storage solution lurking there?

I searched for Haskell packages related to the big-data storage layer, and
the only thing I found that could support efficient erasure-code-based
storage is this three-year-old binding to libhdfs, with only one commit
on GitHub:

https://github.com/kim/hdfs-haskell

Somewhat related are these bindings to zfec, dating from 2008 and part of
the Tahoe-LAFS project:

http://hackage.haskell.org/package/fec
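
If those bindings still build, an erasure-coding experiment is cheap to
start: zfec implements a k-of-n code in which any k of the n shares
reconstruct the original data.  A minimal round-trip sketch, assuming the
fec package's Codec.FEC module and its enFEC/deFEC convenience wrappers
behave as their Hackage documentation describes:

    import qualified Data.ByteString.Char8 as B
    import Codec.FEC (enFEC, deFEC)

    main :: IO ()
    main = do
      let k = 3  -- any 3 shares suffice to reconstruct
          n = 5  -- 5 shares are produced in total
          original = B.pack "a block of data to store redundantly"
          shares = enFEC k n original
          -- pretend two of the five shares were lost:
          survivors = take k (drop 2 shares)
      print (deFEC k n survivors == original)  -- should print True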


Alexander



On Fri, Dec 20, 2013 at 8:24 AM, Carter Schonwald <
carter.schonwald at gmail.com> wrote:

> Cloud Haskell is a substrate that could be used to build such a layer.
> I'm sure the Cloud Haskell people would love such experimentation.
>
>
> On Friday, December 20, 2013, He-chien Tsai wrote:
>
>> What I meant is to split the data into several parts, send each part to
>> a different computer, train the parts separately, and finally send the
>> results back and combine them. I didn't mean to use Cloud Haskell.
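>>
>> A minimal sketch of that split/train/combine idea, using a toy model
>> (a running mean) whose Monoid instance does the combining; distribution
>> is elided and only the algebra is shown (today's Semigroup/Monoid
>> classes are assumed):
>>
>>     -- Toy composable model: element count and running sum.
>>     data Mean = Mean !Int !Double
>>
>>     instance Semigroup Mean where
>>       Mean n s <> Mean m t = Mean (n + m) (s + t)
>>
>>     instance Monoid Mean where
>>       mempty = Mean 0 0
>>
>>     train :: [Double] -> Mean
>>     train xs = Mean (length xs) (sum xs)
>>
>>     mean :: Mean -> Double
>>     mean (Mean n s) = s / fromIntegral n
>>
>>     main :: IO ()
>>     main = do
>>       let chunks = [[1,2,3], [4,5], [6,7,8,9]]  -- data split across machines
>>           models = map train chunks             -- each part trained separately
>>       print (mean (mconcat models))             -- 5.0, as if trained on all the data
>>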
>>
>> On 2013/12/20 at 5:40 AM, "jean-christophe mincke" <
>> jeanchristophe.mincke at gmail.com> wrote:
>> >
>> > He-Chien Tsai,
>> >
>> > > its training result is designed to be composable
>> >
>> > Yes, it is indeed composable (the parallel function of that library),
>> > but parallelizing it on a cluster changes all the types, because
>> > running on a cluster implies IO.
>> > Moreover, using Cloud Haskell (for instance) implies that (see the
>> > sketch below):
>> > 1. training functions must be (serializable) closures, which can only
>> > be defined at module level (not as local let/where bindings);
>> > 2. train is a typeclass method and is not serializable.
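>> >
>> > To make (1) and (2) concrete: the usual workaround is a module-level,
>> > monomorphic wrapper around the typeclass method, registered with
>> > remotable, so that only a name travels over the wire.  A rough sketch
>> > against the distributed-process and network-transport-tcp APIs of the
>> > time (createTransport's signature has changed in later releases);
>> > trainChunk is a hypothetical stand-in for a real trainer:
>> >
>> >     {-# LANGUAGE TemplateHaskell #-}
>> >     import Control.Concurrent (threadDelay)
>> >     import Control.Distributed.Process
>> >     import Control.Distributed.Process.Closure (mkClosure, remotable)
>> >     import Control.Distributed.Process.Node
>> >     import Control.Monad.IO.Class (liftIO)
>> >     import Network.Transport.TCP (createTransport, defaultTCPParameters)
>> >
>> >     -- (1): must live at module level so remotable can name it;
>> >     -- (2): the polymorphic trainer is wrapped at one concrete type
>> >     -- (here just a sum, standing in for real training).
>> >     trainChunk :: [Double] -> Process ()
>> >     trainChunk xs = say ("partial result: " ++ show (sum xs))
>> >
>> >     remotable ['trainChunk]
>> >
>> >     main :: IO ()
>> >     main = do
>> >       Right t <- createTransport "127.0.0.1" "10501" defaultTCPParameters
>> >       node <- newLocalNode t (__remoteTable initRemoteTable)
>> >       runProcess node $ do
>> >         self <- getSelfNode  -- a real cluster would target remote NodeIds
>> >         _ <- spawn self ($(mkClosure 'trainChunk) [1, 2, 3 :: Double])
>> >         liftIO (threadDelay 100000)  -- let the spawned process print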
>> >
>> > So the ideas behind HLearn are interesting, but I do not see how it
>> > could be run on a cluster... Unfortunately, I am not a Haskell expert.
>> >
>> > What do you think?
>> >
>> > Regards
>> >
>> > J-C
>> >
>> >
>> >
>> > On Thu, Dec 19, 2013 at 6:15 PM, He-chien Tsai <depot051 at gmail.com>
>> > wrote:
>> >>
>> >> Have you taken a look at the hlearn and statistics packages? It is
>> >> even easy to parallelize hlearn on a cluster, because its training
>> >> result is designed to be composable, which means you can create two
>> >> models, train them separately, and finally combine them. You can also
>> >> use another database such as Redis or Cassandra, which have Haskell
>> >> bindings, as a backend. For parallelizing on clusters, hdph is also
>> >> good.
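>> >>
>> >> For the Redis route, a minimal sketch assuming the hedis package; the
>> >> key names and payloads are illustrative.  Each worker stores its
>> >> serialized partial model under a key, and a reducer fetches them back
>> >> (it would then decode and mconcat them):
>> >>
>> >>     {-# LANGUAGE OverloadedStrings #-}
>> >>     import Control.Monad.IO.Class (liftIO)
>> >>     import Database.Redis
>> >>
>> >>     main :: IO ()
>> >>     main = do
>> >>       conn <- connect defaultConnectInfo  -- localhost:6379 by default
>> >>       runRedis conn $ do
>> >>         -- workers would write serialized partial models:
>> >>         _ <- set "model:chunk:1" "<serialized model 1>"
>> >>         _ <- set "model:chunk:2" "<serialized model 2>"
>> >>         -- the reducer reads them back for decoding and combining:
>> >>         parts <- mapM get ["model:chunk:1", "model:chunk:2"]
>> >>         liftIO (print parts)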
>> >>
>> >> I personally prefer Python for data science, because it has much more
>> >> mature packages and is more interactive and more effective than
>> >> Haskell and Scala (not kidding: you can generate compiled C for the
>> >> core data structures and algorithms with the Python-like Cython, call
>> >> it from Python, and exploit GPUs for acceleration with Theano). Spark
>> >> also has an unfinished Python binding.
>> >>
>> >> On 2013/12/18 at 3:41 PM, "jean-christophe mincke" <
>> >> jeanchristophe.mincke at gmail.com> wrote:
>> >>
>> >>
>> >> >
>> >> > Hello Cafe,
>> >> >
>> >> > Big Data is a bit trendy these days.
>> >> >
>> >> > Does anybody know about plans to develop a Haskell ecosystem in
>> >> > that domain? That is, tools such as Storm or Spark (possibly on top
>> >> > of Cloud Haskell) or, at least, bindings to tools that exist in
>> >> > other languages.
>> >> >
>> >> > Thank you
>> >> >
>> >> > Regards
>> >> >
>> >> > J-C
>> >> >
>> >> >
>> >
>> >
>>
>