[Haskell-cafe] Haskell and Big Data

jean-christophe mincke jeanchristophe.mincke at gmail.com
Sat Dec 21 17:53:14 UTC 2013


Looking at the docs, it is indeed not clear whether it has locality-aware
scheduling (i.e. the way Spark has) or not.

J-C


On Sat, Dec 21, 2013 at 6:30 PM, Carter Schonwald <
carter.schonwald at gmail.com> wrote:

> Interesting. Does it have a design that lets computation be structured in
> a locality-aware way? (I'd imagine yes, but I'm AFK much of this week so
> it's a bit hard to read the docs)
>
>
> On Saturday, December 21, 2013, Flavio Villanustre wrote:
>
>> Alexander,
>>
>> The distributed storage in the HPCC platform relies on an underlying
>> POSIX-compliant Linux filesystem (any will do), and provides an abstraction
>> layer based on record oriented (as opposed to block oriented, like HDFS)
>> fileparts located in the local storage of the physical nodes. It also uses
>> a component called Dali which, among other things, is a metadata server
>> that provides a "logical file" view of these partitioned data files, and
>> the system provides the tooling to create them from an external data source
>> (in a process called spray).
>>
>> While you could conceivably use the distributed file system in HPCC as a
>> stand-alone data repository, I think it would be more interesting to
>> take advantage of the data processing machinery too. The HPCC platform
>> already has a declarative dataflow language called ECL which, coincidentally,
>> advocates purity, is non-strict (implemented through laziness) and compiles
>> into C++ (and uses g++/clang to compile this into machine code). Since ECL
>> already allows for embedded C++, Python, R, Java and JavaScript, allowing
>> Haskell to be embedded too (through the FFI?) would be the best integration
>> option, IMO.
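The embedding route suggested above can be sketched with GHC's FFI export
mechanism: a Haskell function exposed with a C calling convention that a C++
host (such as the code ECL generates) could call. This is only a minimal
sketch; the function, its name, and how ECL would actually load the resulting
object code are all invented/left open here.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble)

-- A toy "model scoring" function, purely for illustration.
score :: Double -> Double -> Double
score weight x = weight * x + 1

-- Marshal through C types so the host side can declare it as:
--   extern "C" double score_c(double, double);
score_c :: CDouble -> CDouble -> CDouble
score_c w x = realToFrac (score (realToFrac w) (realToFrac x))

foreign export ccall score_c :: CDouble -> CDouble -> CDouble
```

Compiling this with GHC produces a stub header the C++ side can include,
which is roughly the shape an ECL `BEGINC++`-style embedding would need.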
>>
>> I'm copying Richard, Jake and Gavin, who are the ones that wrote most of
>> the original code base for the distributed filesystem and ECL compiler
>> (among many other parts), and perhaps can provide some ideas/pointers.
>>
>> Flavio
>>
>> Flavio Villanustre
>>
>>
>> On Sat, Dec 21, 2013 at 8:50 AM, Alexander Kjeldaas <
>> alexander.kjeldaas at gmail.com> wrote:
>>
>>
>> In the HPCC documentation it is hard to cut through the buzzword jungle.
>> Is there an efficient storage solution lurking there?
>>
>> I searched for Haskell packages related to the big-data storage layer,
>> and the only thing I've found that could support efficient erasure
>> code-based storage is this three-year-old binding to libhdfs.  There is only
>> one commit on GitHub:
>>
>> https://github.com/kim/hdfs-haskell
>>
>> Somewhat related are these bindings to zfec, from 2008, and part of the
>> Tahoe LAFS project.
>>
>> http://hackage.haskell.org/package/fec
>>
>>
>> Alexander
>>
>>
>>
>> On Fri, Dec 20, 2013 at 8:24 AM, Carter Schonwald <
>> carter.schonwald at gmail.com> wrote:
>>
>> Cloud Haskell is a substrate that could be used to build such a layer.
>>  I'm sure the Cloud Haskell people would love such experimentation.
>>
>>
>> On Friday, December 20, 2013, He-chien Tsai wrote:
>>
>> What I meant was to split the data into several parts, send each part to
>> a different computer, train them separately, and finally send the
>> results back and combine those results. I didn't mean to use Cloud Haskell.
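The split/train/combine workflow described above can be sketched in plain
Haskell, with threads standing in for remote machines. Every name below is
invented for illustration; a "model" here is just the (min, max) range of
the samples seen, so the combination step is obvious.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- (smallest, largest) sample seen so far.
type Model = (Double, Double)

train :: [Double] -> Model
train xs = (minimum xs, maximum xs)

-- Merging two partial models loses nothing for this kind of model.
combine :: Model -> Model -> Model
combine (lo1, hi1) (lo2, hi2) = (min lo1 lo2, max hi1 hi2)

-- Train each chunk in its own thread ("worker"), then fold the
-- partial results together.
trainDistributed :: [[Double]] -> IO Model
trainDistributed chunks = do
  vars <- forM chunks $ \chunk -> do
    v <- newEmptyMVar
    _ <- forkIO (putMVar v (train chunk))
    return v
  partials <- mapM takeMVar vars
  return (foldr1 combine partials)
```

Replacing `forkIO` with actual message passing between machines is exactly
where the serialization issues discussed below come in.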
>>
>> On 2013/12/20 at 5:40 AM, "jean-christophe mincke" <
>> jeanchristophe.mincke at gmail.com> wrote:
>> >
>> > He-Chien Tsai,
>> >
>> > >  its training results are designed to be composable
>> >
>> > Yes, it is indeed composable (the parallel function of that lib), but
>> parallelizing it on a cluster changes all the types, because running on a
>> cluster implies IO.
>> > Moreover, using Cloud Haskell (for instance) implies that:
>> > 1. training functions should be (serializable) closures, which can only
>> be defined at module level (not as local let/where bindings).
>> > 2. train is a typeclass method and is not serializable.
>> >
>> > So the ideas behind HLearn are interesting, but I do not see how it
>> could be run on a cluster... But, unfortunately, I am not a Haskell expert.
>> >
>> > What do you think?
>> >
>> > Regards
>> >
>> > J-C
>> >
>> >
>> >
>> > On Thu, Dec 19, 2013 at 6:15 PM, He-chien Tsai <depot051 at gmail.com>
>> wrote:
>> >>
>> >> Have you taken a look at the hlearn and statistics packages? It's even
>> easy to parallelize hlearn on a cluster, because its training results are
>> designed to be composable, which means you can create two models, train
>> them separately and finally combine them. You can also use another database
>> such as Redis or Cassandra, which have Haskell bindings, as a backend. For
>> parallelizing on clusters, hdph is also good.
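The "composable training results" idea can be illustrated as a Monoid. The
`RunningMean` type below is invented for this sketch (HLearn's actual models
are richer), but the algebraic property is the one being described: training
on two halves of the data and combining gives the same model as training on
everything, i.e. `trainMean xs <> trainMean ys == trainMean (xs ++ ys)`.

```haskell
-- A trivially composable model: enough state to recover the mean.
data RunningMean = RunningMean
  { sampleCount :: Int
  , sampleSum   :: Double
  } deriving (Eq, Show)

-- Combining two partial models just adds their sufficient statistics.
instance Semigroup RunningMean where
  RunningMean n1 s1 <> RunningMean n2 s2 = RunningMean (n1 + n2) (s1 + s2)

instance Monoid RunningMean where
  mempty = RunningMean 0 0

-- "Training" is a monoid homomorphism from the data to the model.
trainMean :: [Double] -> RunningMean
trainMean = foldMap (RunningMean 1)

mean :: RunningMean -> Double
mean (RunningMean n s) = s / fromIntegral n
```

Because combination is associative, the chunks can be trained in any order
on any number of machines and merged pairwise at the end.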
>> >>
>> >> I personally prefer Python for data science because it has much more
>> mature packages and is more interactive and more effective than Haskell
>> and Scala (not kidding: you can generate compiled C for core data and
>> algorithms with the Python-like Cython and call it from Python, and
>> exploit GPUs for acceleration with Theano). Spark also has an unfinished
>> Python binding.
>> >>
>> >> 2013/12/18 下午3:41 於 "jean-christophe mincke" <
>> jeanchristophe.mincke at gmail.com> 寫道:
>> >>
>> >>
>> >> >
>> >> > Hello Cafe,
>> >> >
>> >> > Big Data is a bit trendy these days.
>> >> >
>> >> > Does anybody know about plans to develop a Haskell ecosystem in
>> that domain?
>> >> > I.e. tools such as Storm or Spark (possibly on top of Cloud Haskell)
>> or, at least, bindings to tools which exist in other languages.
>> >> >
>> >> > Thank you
>> >> >
>> >> > Regards
>> >> >
>> >> > J-C
>> >> >
>> >> > _______________________________________________
>> >> > Haskell-Cafe mailing list
>> >> > Haskell-Cafe at haskell.org
>> >> > http://www.haskell.org/mailman/listinfo/haskell-cafe
>>
>>


More information about the Haskell-Cafe mailing list