[Haskell-cafe] Haskell and Big Data

Flavio Villanustre fvillanustre at gmail.com
Sat Dec 21 20:12:55 UTC 2013


It does exploit data locality for efficiency, but this largely depends on the
type of activity. Some activities can be performed independently on each data
record in parallel (e.g., converting a field in every record to uppercase).
Activities that operate on groups of records (think of a group or rollup
operation keyed on one of the fields) may require that records be globally
redistributed as part of the operation. And some activities may require that
all records be reshuffled across the storage (a global/distributed sort, for
example). ECL abstracts away the complexities arising from the underlying
distribution, partitioning and parallelism, so Haskell could in theory do the
same.
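
A rough single-node analogy in plain Haskell of those two kinds of activity
(lists standing in for distributed fileparts; the names are purely
illustrative):

import qualified Data.Map.Strict as M
import Data.Char (toUpper)

-- Record-local: each partition can be processed independently, in parallel.
upperField :: [String] -> [String]
upperField = map (map toUpper)

-- Group/rollup: records sharing a key must end up together, which on a
-- cluster forces a global redistribution before the local combine.
rollupByKey :: Ord k => [(k, Int)] -> M.Map k Int
rollupByKey = M.fromListWith (+)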

Flavio

Flavio Villanustre


On Sat, Dec 21, 2013 at 12:32 PM, Carter Schonwald <
carter.schonwald at gmail.com> wrote:

> Interesting. Does it have a design that lets computation be structured in
> a locality-aware way? (I'd imagine yes, but I'm AFK much of this week so
> it's a bit hard to read docs)
>
> On Saturday, December 21, 2013, Flavio Villanustre wrote:
>
>> Alexander,
>>
>> The distributed storage in the HPCC platform relies on an underlying
>> POSIX-compliant Linux filesystem (any will do), and provides an abstraction
>> layer based on record-oriented (as opposed to block-oriented, like HDFS)
>> fileparts located in the local storage of the physical nodes. It also uses
>> a component called Dali which, among other things, is a metadata server
>> that provides a "logical file" view of these partitioned data files, and
>> the system provides the tooling to create them from an external data source
>> (in a process called spray).
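>>
>> To make the record-oriented split concrete, here is a hedged sketch of how
>> that logical view could be modelled in Haskell (illustrative types only,
>> not Dali's or HPCC's actual data structures):
>>
>> -- A logical file is an ordered collection of record-oriented fileparts,
>> -- each stored in the local POSIX filesystem of some physical node.
>> data FilePart = FilePart
>>   { partNode    :: String   -- host holding this part
>>   , partPath    :: FilePath -- location on that node's local storage
>>   , recordCount :: Int      -- records, not blocks, are the unit of split
>>   }
>>
>> newtype LogicalFile = LogicalFile [FilePart]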
>>
>> While you could conceivably use the distributed file system in HPCC as a
>> stand-alone data repository, I think it would be more interesting to take
>> advantage of the data processing machinery too. The HPCC platform already
>> has a declarative dataflow language called ECL which, coincidentally,
>> advocates purity, is non-strict (implemented through laziness) and compiles
>> into C++ (using g++/clang to compile that into machine code). Since ECL
>> already allows for embedded C++, Python, R, Java and JavaScript, allowing
>> Haskell to be embedded too (through the FFI?) would be the best integration
>> option, IMO.
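>>
>> A minimal sketch of what the Haskell side of such an embedding could look
>> like, assuming the host simply calls exported C symbols (the ECL/C++ side
>> is not shown, and the function is purely illustrative):
>>
>> {-# LANGUAGE ForeignFunctionInterface #-}
>> module RecordTransform where
>>
>> import Foreign.C.Types (CInt)
>>
>> -- A pure, record-level transform exposed over the C ABI; embedded C++ in
>> -- an ECL job could in principle call this once the GHC RTS is initialised.
>> scaleField :: CInt -> CInt
>> scaleField x = x * 10
>>
>> foreign export ccall scaleField :: CInt -> CInt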
>>
>> I'm copying Richard, Jake and Gavin, who are the ones that wrote most of
>> the original code base for the distributed filesystem and ECL compiler
>> (among many other parts), and perhaps can provide some ideas/pointers.
>>
>> Flavio
>>
>> Flavio Villanustre
>>
>>
>> On Sat, Dec 21, 2013 at 8:50 AM, Alexander Kjeldaas <
>> alexander.kjeldaas at gmail.com> wrote:
>>
>>
>> In the HPCC documentation it is hard to cut through the buzzword jungle.
>> Is there an efficient storage solution lurking there?
>>
>> I searched for Haskell packages related to the big data storage layer,
>> and the only thing I've found that could support efficient erasure
>> code-based storage is this three-year-old binding to libhdfs. There is
>> only one commit on GitHub:
>>
>> https://github.com/kim/hdfs-haskell
>>
>> Somewhat related are these bindings to zfec, from 2008, which are part of
>> the Tahoe-LAFS project.
>>
>> http://hackage.haskell.org/package/fec
>>
>>
>> Alexander
>>
>>
>>
>> On Fri, Dec 20, 2013 at 8:24 AM, Carter Schonwald <
>> carter.schonwald at gmail.com> wrote:
>>
>> Cloud Haskell is a substrate that could be used to build such a layer.
>> I'm sure the Cloud Haskell people would love such experimentation.
>>
>>
>> On Friday, December 20, 2013, He-chien Tsai wrote:
>>
>> What I meant was to split the data into several parts, send each part to
>> a different computer, train them separately, and finally send the results
>> back and combine them. I didn't mean to use Cloud Haskell.
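>>
>> A toy sketch of that split/train/combine shape, here just averaging
>> numbers so the partial results form a monoid (the Partial type and the
>> function names are made up for illustration; a real model would be
>> whatever the training library produces, as long as it combines):
>>
>> import Data.List (foldl')
>>
>> -- Partial result: running sum and count; combining two is associative.
>> data Partial = Partial !Double !Int
>>
>> instance Semigroup Partial where
>>   Partial s1 n1 <> Partial s2 n2 = Partial (s1 + s2) (n1 + n2)
>>
>> instance Monoid Partial where
>>   mempty = Partial 0 0
>>
>> trainChunk :: [Double] -> Partial       -- run on each machine's chunk
>> trainChunk = foldl' (\(Partial s n) x -> Partial (s + x) (n + 1)) mempty
>>
>> combine :: [Partial] -> Double          -- gather and merge the results
>> combine ps = let Partial s n = mconcat ps in s / fromIntegral n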
>>
>> On 2013/12/20 at 5:40 AM, "jean-christophe mincke" <
>> jeanchristophe.mincke at gmail.com> wrote:
>> >
>> > He-Chien Tsai,
>> >
>> > >  its training results are designed to be composable
>> >
>> > Yes, it is indeed composable (the parallel function of that library), but
>> > parallelizing it on a cluster changes all the types, because running on a
>> > cluster implies IO.
>> > Moreover, using Cloud Haskell (for instance) implies that:
>> > 1. training functions should be (serializable) closures, which can only
>> > be defined at module level (not as local let/where bindings).
>> > 2. train is a typeclass function and is not serializable.
>> >
>> > So the idea behind HLearn is interesting, but I do not see how it could
>> > be run on a cluster... But, unfortunately, I am not a Haskell expert.
>> >
>> > What do you think?
>> >
>> > Regards
>> >
>> > J-C
>> >
>> >
>> >
>> > On Thu, Dec 19, 2013 at 6:15 PM, He-chien Tsai <depot051 at gmail.com>
>> wrote:
>> >>
>> >> Have you taken a look at the hlearn and statistics packages? It is easy
>> >> to parallelize hlearn on a cluster because its training results are
>> >> designed to be composable, which means you can create two models, train
>> >> them separately and finally combine them. You can also use another
>> >> database, such as Redis or Cassandra (both have Haskell bindings), as a
>> >> backend. For parallelizing on clusters, hdph is also good.
>> >>
>> >> I personally prefer Python for data science because it has much more
>> >> mature packages and is more interactive and more effective than Haskell
>> >> and Scala (not kidding: you can generate compiled C for the core data
>> >> structures and algorithms with the Python-like Cython and call it from
>> >> Python, and exploit GPUs for acceleration with Theano). Spark also has
>> >> an unfinished Python binding.
>> >>
>> >> On 2013/12/18 at 3:41 PM, "jean-christophe mincke" <
>> >> jeanchristophe.mincke at gmail.com> wrote:
>> >>
>> >>
>> >> >
>> >> > Hello Cafe,
>> >> >
>> >> > Big Data is a bit trendy these days.
>> >> >
>> >> > Does anybody know about plans to develop a Haskell ecosystem in
>> >> > that domain?
>> >> > I.e., tools such as Storm or Spark (possibly on top of Cloud Haskell)
>> >> > or, at least, bindings to tools which exist in other languages.
>> >> >
>> >> > Thank you
>> >> >
>> >> > Regards
>> >> >
>> >> > J-C
>> >> >
>> >> > _______________________________________________
>> >> > Haskell-Cafe mailing list
>> >> > Haskell-Cafe at haskell.org
>> >> > http://www.haskell.org/mailman/listinfo/haskell-cafe
>>
>>

