[Haskell-cafe] repa parallelization results

Tue Mar 17 01:20:30 UTC 2015

You want to throw your parallelizable matrix operations to the GPU cores.

MATLAB can now do this and I believe it is starting to be built into R so
that R can use the GPU cores..

On Mon, Mar 16, 2015 at 5:11 AM, Anatoly Yakovenko <aeyakovenko at gmail.com>
wrote:

> hmm, so i was wrong
>
> https://gist.github.com/aeyakovenko/0af788390ee9d980c1d6
>
> the first version performed the best, even when running with -N1
> agains the sequential version.
>
> On Sun, Mar 15, 2015 at 8:04 PM, Carter Schonwald
> <carter.schonwald at gmail.com> wrote:
> > you're getting on the right track! :)
> >
> > the idea you're reaching for is "parallel work depth".  Eg, if instead of
> > foldl' (which has O(n) work depth), you had a "parallel" fold that kinda
> > looks like a recursive split and then merge version of the fold
> operation,
> > you'd have O(log n) work depth. (and that'd likely be faster!). But then
> > you'd notice "below some threshold, its better to compute sequentially,
> > because the overhead of parallization is too big".
> >
> > etc etc. (the point i'm trying to reach for is that effective
> > parallelization requires a pretty rich understanding of your application
> /
> > software / hardware cost model)
> >
> > likewise, REPA is really only going to shine on workloads that look
> > "pointwise" or "flat", at least with the current iteration. Its probably
> a
> > good idea to look at the various example codes that are available for
> repa
> > and acccelerate, because you'll notice that the codes which are
> especially
> > performant have that "flat" style of paralellims
> >
> >
> > On Sun, Mar 15, 2015 at 7:16 PM, Anatoly Yakovenko <
> aeyakovenko at gmail.com>
> > wrote:
> >>
> >> Ok, got it. I picked the wrong function to try to understand how Repa
> >> parallelizes :)
> >>
> >> So whats a good heuristic for using the parallel versions vs
> >> sequential for Repa?
> >>
> >> Do the internals try to parallelize every element?  or does it fuse
> >> them into some small number of parallelized tasks?
> >>
> >> So just based from my observations
> >>
> >> f (Z :. r :. c) = r * c
> >>
> >> a <- computeP (fromFunction f)
> >> a `deepSeqArray` sumAllP a
> >>
> >> should be faster then:
> >>
> >> let a = computeS $ fromFunction f
> >> a `deepSeqArray` sumAllP $ a
> >>
> >> but probably slower then
> >>
> >> sumAllS $ computeS $ fromFunction f
> >>
> >> Since an intermediate array is not even computed.
> >>
> >> Thanks,
> >> Anatoly
> >>
> >>
> >> On Sun, Mar 15, 2015 at 1:41 PM, Carter Schonwald
> >> <carter.schonwald at gmail.com> wrote:
> >> > Read that paper I linked. Anything else I say will be a rehash of that
> >> > paper. :)
> >> >
> >> > On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko" <aeyakovenko at gmail.com>
> >> > wrote:
> >> >>
> >> >> Ok, so whats the difference between the sequence and parallel
> >> >> versions? does the parallel one contain a thunk for every element in
> >> >> the output?
> >> >>
> >> >> On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald
> >> >> <carter.schonwald at gmail.com> wrote:
> >> >> > Read what I linked.
> >> >> > You are benchmarking repa for exactly the pessimal workload that it
> >> >> > is
> >> >> > bad
> >> >> > at.
> >> >> >
> >> >> > Repa is for point wise parallel and local convolution parallel
> >> >> > programs.
> >> >> > The way repa can express matrix multiplication is exactly the worst
> >> >> > way
> >> >> > to
> >> >> > implement a parallel matrix mult.  Like, pretty pessimal wrt a
> memory
> >> >> > traffic / communication complexity metric of performance.
> >> >> >
> >> >> > Benchmark something like image blur algorithms and repa will really
> >> >> > shine.
> >> >> >
> >> >> > Right now your benchmark is the repa equivalent of noticing that
> >> >> > random
> >> >> > access on singly linked lists is slow :)
> >> >> >
> >> >> > On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko" <
> aeyakovenko at gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> I am not really focusing on matrix multiply specifically.  So the
> >> >> >> real
> >> >> >> problem is that the implementation using parallelized functions is
> >> >> >> slower then the sequential one, and adding more threads makes it
> >> >> >> barely as fast as the sequential one.
> >> >> >>
> >> >> >> So why would i ever use the parallelized versions?
> >> >> >>
> >> >> >>
> >> >> >> On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald
> >> >> >> <carter.schonwald at gmail.com> wrote:
> >> >> >> > http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
> this
> >> >> >> > paper
> >> >> >> > (among many others by the blis project) articulates some of the
> >> >> >> > ideas
> >> >> >> > i
> >> >> >> > allude to pretty well (with pictures!)
> >> >> >> >
> >> >> >> > On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald
> >> >> >> > <carter.schonwald at gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> dense matrix product is not an algorithm that makes sense in
> >> >> >> >> repa's
> >> >> >> >> execution model,
> >> >> >> >> in square matrix multiply of two N x N matrices, each result
> >> >> >> >> entry
> >> >> >> >> depends
> >> >> >> >> on 2n values total across the  two input matrices.
> >> >> >> >> even then, thats actually the wrong way to parallelize dense
> >> >> >> >> matrix
> >> >> >> >> product! its worth reading the papers about goto blas and the
> >> >> >> >> more
> >> >> >> >> recent
> >> >> >> >> blis project. a high performance dense matrix multipy winds up
> >> >> >> >> needing
> >> >> >> >> to do
> >> >> >> >> some nested array parallelism with mutable updates to have
> >> >> >> >> efficient
> >> >> >> >> sharing
> >> >> >> >> of sub computations!
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko
> >> >> >> >> <aeyakovenko at gmail.com>
> >> >> >> >> wrote:
> >> >> >> >>>
> >> >> >> >>> you think the backed would make any difference?  this seems
> like
> >> >> >> >>> a
> >> >> >> >>> runtime issue to me, how are the threads scheduled by the ghc
> >> >> >> >>> runtime?
> >> >> >> >>>
> >> >> >> >>> On Fri, Mar 13, 2015 at 4:58 PM, KC <kc1956 at gmail.com> wrote:
> >> >> >> >>> > How is the LLVM?
> >> >> >> >>> >
> >> >> >> >>> > --
> >> >> >> >>> > --
> >> >> >> >>> >
> >> >> >> >>> > Sent from an expensive device which will be obsolete in a
> few
> >> >> >> >>> > months!
> >> >> >> >>> > :D
> >> >> >> >>> >
> >> >> >> >>> > Casey
> >> >> >> >>> >
> >> >> >> >>> >
> >> >> >> >>> > On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko"
> >> >> >> >>> > <aeyakovenko at gmail.com>
> >> >> >> >>> > wrote:
> >> >> >> >>> >>
> >> >> >> >>> >> https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8
> >> >> >> >>> >>
> >> >> >> >>> >>
> >> >> >> >>> >> so i am seeing basically results with N4 that are as good
> as
> >> >> >> >>> >> using
> >> >> >> >>> >> sequential computation on my macbook for the matrix
> multiply
> >> >> >> >>> >> algorithm.  any idea why?
> >> >> >> >>> >>
> >> >> >> >>> >> Thanks,
> >> >> >> >>> >> Anatoly
> >> >> >> >>> >> _______________________________________________
> >> >> >> >>> >> Haskell-Cafe mailing list
> >> >> >> >>> >> Haskell-Cafe at haskell.org
> >> >> >> >>> >>
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> >> >> >> >>> _______________________________________________
> >> >> >> >>> Haskell-Cafe mailing list
> >> >> >> >>> Haskell-Cafe at haskell.org
> >> >> >> >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >
> >
>

-- 

--

Sent from an expensive device which will be obsolete in a few months! :D
Casey
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20150316/1f985fe1/attachment.html>