[Haskell-cafe] Haskell and Big Data

Carter Schonwald carter.schonwald at gmail.com
Sat Dec 21 17:32:47 UTC 2013


Interesting. Does it have a design that lets computation be structured in
a locality-aware way? (I'd imagine yes, but I'm AFK much of this week, so
it's a bit hard to read the docs.)

On Saturday, December 21, 2013, Flavio Villanustre wrote:

> Alexander,
>
> The distributed storage in the HPCC platform relies on an underlying
> POSIX-compliant Linux filesystem (any will do), and provides an abstraction
> layer based on record-oriented (as opposed to block-oriented, like HDFS)
> file parts located in the local storage of the physical nodes. It also uses
> a component called Dali which, among other things, is a metadata server
> that provides a "logical file" view of these partitioned data files, and
> the system provides the tooling to create them from an external data source
> (in a process called spray).
>
> While you could conceivably use the distributed file system in HPCC as a
> stand-alone data repository, I think that it would be more interesting to
> take advantage of the data processing machinery too. The HPCC platform
> already has a declarative dataflow language called ECL which, coincidentally,
> advocates purity, is non-strict (implemented through laziness) and compiles
> into C++ (and uses g++/clang to compile this into machine code). Since ECL
> already allows for embedded C++, Python, R, Java and JavaScript, allowing
> Haskell to be embedded too (through the FFI?) would be the best integration
> option, IMO.
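>
> To give a rough idea of what the Haskell side of such an embedding might
> look like, here is a minimal sketch that uses GHC's FFI to export a function
> with a C calling convention, so the C++ that ECL generates could in principle
> call it; the module and function names are just illustrative, not anything
> HPCC provides today:
>
>     -- Minimal sketch: expose a Haskell function to C/C++ via the FFI.
>     {-# LANGUAGE ForeignFunctionInterface #-}
>     module Score where
>
>     import Foreign.C.Types (CDouble)
>
>     -- A pure function we might want to call from ECL-generated C++.
>     score :: Double -> Double -> Double
>     score x y = x * y + 1
>
>     -- GHC also emits a Score_stub.h header declaring score_c for the C++ side.
>     foreign export ccall score_c :: CDouble -> CDouble -> CDouble
>     score_c :: CDouble -> CDouble -> CDouble
>     score_c x y = realToFrac (score (realToFrac x) (realToFrac y))
>
> The calling C++ code would also have to start and stop the GHC runtime
> (hs_init/hs_exit) around any use of the exported symbol.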
>
> I'm copying Richard, Jake and Gavin, who wrote most of the original code
> base for the distributed filesystem and the ECL compiler (among many other
> parts), and who can perhaps provide some ideas/pointers.
>
> Flavio
>
> Flavio Villanustre
>
>
> On Sat, Dec 21, 2013 at 8:50 AM, Alexander Kjeldaas <
> alexander.kjeldaas at gmail.com> wrote:
>
>
> In the HPCC documentation it is hard to cut through the buzzword jungle.
> Is there an efficient storage solution lurking there?
>
> I searched for Haskell packages related to the big data storage layer, and
> the only thing I've found that could support efficient erasure-code-based
> storage is this three-year-old binding to libhdfs. There is only one commit
> on GitHub:
>
> https://github.com/kim/hdfs-haskell
>
> Somewhat related are these bindings to zfec (from 2008), which is part of
> the Tahoe-LAFS project.
>
> http://hackage.haskell.org/package/fec
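>
> For reference, a small sketch of what using those zfec bindings might look
> like; I'm going from memory of the Codec.FEC interface, so treat the
> enFEC/deFEC helpers and their exact signatures as assumptions to check
> against the package docs:
>
>     -- Erasure-coding sketch with the fec package (zfec bindings).
>     import qualified Data.ByteString.Char8 as B
>     import Codec.FEC (enFEC, deFEC)  -- assumed: tag-prefixed encode/decode helpers
>
>     main :: IO ()
>     main = do
>       let k = 3                               -- any 3 shares reconstruct the data
>           n = 5                               -- 5 shares are produced in total
>           payload   = B.pack "a block of data to store redundantly"
>           shares    = enFEC k n payload       -- n tagged shares
>           survivors = take k (drop 1 shares)  -- pretend shares 0 and 4 were lost
>       print (deFEC k n survivors == payload)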
>
>
> Alexander
>
>
>
> On Fri, Dec 20, 2013 at 8:24 AM, Carter Schonwald <
> carter.schonwald at gmail.com> wrote:
>
> Cloud Haskell is a substrate that could be used to build such a layer.
> I'm sure the Cloud Haskell people would love such experimentation.
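>
> To make that concrete, here is a rough sketch of the shape such an
> experiment could take with the distributed-process package; the Model type,
> trainChunk and the master loop are hypothetical stand-ins, and I'm writing
> the remotable/mkClosure plumbing from memory, so check it against the docs:
>
>     {-# LANGUAGE DeriveDataTypeable, DeriveGeneric, TemplateHaskell #-}
>     import Control.Distributed.Process
>     import Control.Distributed.Process.Closure (mkClosure, remotable)
>     import Data.Binary (Binary)
>     import Data.Typeable (Typeable)
>     import GHC.Generics (Generic)
>
>     -- Hypothetical partial result that can be merged with another one.
>     data Model = Model Int Double deriving (Generic, Typeable)
>     instance Binary Model
>
>     -- Must be a top-level, monomorphic function to be made remotable.
>     trainChunk :: ([Double], ProcessId) -> Process ()
>     trainChunk (xs, master) = send master (Model (length xs) (sum xs))
>
>     remotable ['trainChunk]
>
>     -- Master: ship one chunk to each node, then fold the partial models.
>     trainOnCluster :: [NodeId] -> [[Double]] -> Process Model
>     trainOnCluster nodes chunks = do
>       me <- getSelfPid
>       mapM_ (\(node, chunk) -> spawn node ($(mkClosure 'trainChunk) (chunk, me)))
>             (zip nodes chunks)
>       partials <- mapM (const expect) chunks
>       return (foldr1 merge partials)
>       where merge (Model c1 s1) (Model c2 s2) = Model (c1 + c2) (s1 + s2)
>
> A real program would also register the generated __remoteTable when it
> creates the local node.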
>
>
> On Friday, December 20, 2013, He-chien Tsai wrote:
>
> What I meant is to split the data into several parts, send each part to a
> different computer, train the parts separately, and finally send the results
> back and combine them. I didn't mean to use Cloud Haskell.
>
> On 2013/12/20 at 5:40 AM, "jean-christophe mincke" <
> jeanchristophe.mincke at gmail.com> wrote:
> >
> > He-Chien Tsai,
> >
> > >  its training result is designed for composable
> >
> > Yes, it is indeed composable (the parallel function of that lib), but
> > parallelizing it on a cluster changes all the types, because running on a
> > cluster implies IO.
> > Moreover, using Cloud Haskell (for instance) implies that:
> > 1. training functions should be (serializable) closures, which can only
> > be defined at module level (not as local let/where bindings).
> > 2. train is a typeclass function and is not serializable.
> >
> > So the idea behind HLearn is interesting, but I do not see how it could
> > be run on a cluster... But, unfortunately, I am not a Haskell expert.
> >
> > What do you think?
> >
> > Regards
> >
> > J-C
> >
> >
> >
> > On Thu, Dec 19, 2013 at 6:15 PM, He-chien Tsai <depot051 at gmail.com>
> wrote:
> >>
> >> Have you taken a look at the hlearn and statistics packages? It's even
> >> easy to parallelize hlearn on a cluster, because its training result is
> >> designed to be composable, which means you can create two models, train
> >> them separately and finally combine them. You can also use another
> >> database such as Redis or Cassandra, which have Haskell bindings, as a
> >> backend. For parallelizing on clusters, hdph is also good.
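> >>
> >> A tiny sketch of that composability, with a made-up Model and train
> >> (I'm not committing to hlearn's actual class or module names here):
> >>
> >>     import Data.Monoid (Monoid (..))
> >>
> >>     -- Made-up model: enough state to merge two partial results exactly.
> >>     data Model = Model { count :: Int, total :: Double }
> >>
> >>     instance Monoid Model where
> >>       mempty = Model 0 0
> >>       mappend (Model c1 t1) (Model c2 t2) = Model (c1 + c2) (t1 + t2)
> >>
> >>     train :: [Double] -> Model
> >>     train xs = Model (length xs) (sum xs)
> >>
> >>     -- The property that makes distribution easy:
> >>     -- train (xs ++ ys) == train xs `mappend` train ys
> >>
> >> Each chunk could be trained on a different machine and the resulting
> >> models merged with mappend on the way back.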
> >>
> >> I personally prefer Python for data science because it has much more
> >> mature packages and is more interactive and more effective than Haskell
> >> and Scala (not kidding: you can generate compiled C for the core data
> >> structures and algorithms with the Python-like Cython and call it from
> >> Python, and exploit GPUs for acceleration with Theano). Spark also has
> >> an unfinished Python binding.
> >>
> >> On 2013/12/18 at 3:41 PM, "jean-christophe mincke" <
> >> jeanchristophe.mincke at gmail.com> wrote:
> >>
> >>
> >> >
> >> > Hello Cafe,
> >> >
> >> > Big Data is a bit trendy these days.
> >> >
> >> > Does anybody know about plans to develop a Haskell ecosystem in
> >> > that domain? I.e. tools such as Storm or Spark (possibly on top of
> >> > Cloud Haskell) or, at least, bindings to tools which exist in other
> >> > languages.
> >> >
> >> > Thank you
> >> >
> >> > Regards
> >> >
> >> > J-C
> >> >
> >> > _______________________________________________
> >> > Haskell-Cafe mailing list
> >> > Haskell-Cafe at haskell.org
> >> > http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>

