[Haskell-cafe] repa parallelization results

Mon Mar 16 03:04:52 UTC 2015

you're getting on the right track! :)

the idea you're reaching for is "parallel work depth".  Eg, if instead of
foldl' (which has O(n) work depth), you had a "parallel" fold that kinda
looks like a recursive split and then merge version of the fold operation,
you'd have O(log n) work depth. (and that'd likely be faster!). But then
you'd notice "below some threshold, its better to compute sequentially,
because the overhead of parallization is too big".

etc etc. (the point i'm trying to reach for is that effective
parallelization requires a pretty rich understanding of your application /
software / hardware cost model)

likewise, REPA is really only going to shine on workloads that look
"pointwise" or "flat", at least with the current iteration. Its probably a
good idea to look at the various example codes that are available for repa
and acccelerate, because you'll notice that the codes which are especially
performant have that "flat" style of paralellims

On Sun, Mar 15, 2015 at 7:16 PM, Anatoly Yakovenko <aeyakovenko at gmail.com>
wrote:

> Ok, got it. I picked the wrong function to try to understand how Repa
> parallelizes :)
>
> So whats a good heuristic for using the parallel versions vs
> sequential for Repa?
>
> Do the internals try to parallelize every element?  or does it fuse
> them into some small number of parallelized tasks?
>
> So just based from my observations
>
> f (Z :. r :. c) = r * c
>
> a <- computeP (fromFunction f)
> a `deepSeqArray` sumAllP a
>
> should be faster then:
>
> let a = computeS $ fromFunction f
> a `deepSeqArray` sumAllP $ a
>
> but probably slower then
>
> sumAllS $ computeS $ fromFunction f
>
> Since an intermediate array is not even computed.
>
> Thanks,
> Anatoly
>
>
> On Sun, Mar 15, 2015 at 1:41 PM, Carter Schonwald
> <carter.schonwald at gmail.com> wrote:
> > Read that paper I linked. Anything else I say will be a rehash of that
> > paper. :)
> >
> > On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko" <aeyakovenko at gmail.com>
> wrote:
> >>
> >> Ok, so whats the difference between the sequence and parallel
> >> versions? does the parallel one contain a thunk for every element in
> >> the output?
> >>
> >> On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald
> >> <carter.schonwald at gmail.com> wrote:
> >> > Read what I linked.
> >> > You are benchmarking repa for exactly the pessimal workload that it is
> >> > bad
> >> > at.
> >> >
> >> > Repa is for point wise parallel and local convolution parallel
> programs.
> >> > The way repa can express matrix multiplication is exactly the worst
> way
> >> > to
> >> > implement a parallel matrix mult.  Like, pretty pessimal wrt a memory
> >> > traffic / communication complexity metric of performance.
> >> >
> >> > Benchmark something like image blur algorithms and repa will really
> >> > shine.
> >> >
> >> > Right now your benchmark is the repa equivalent of noticing that
> random
> >> > access on singly linked lists is slow :)
> >> >
> >> > On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko" <aeyakovenko at gmail.com>
> >> > wrote:
> >> >>
> >> >> I am not really focusing on matrix multiply specifically.  So the
> real
> >> >> problem is that the implementation using parallelized functions is
> >> >> slower then the sequential one, and adding more threads makes it
> >> >> barely as fast as the sequential one.
> >> >>
> >> >> So why would i ever use the parallelized versions?
> >> >>
> >> >>
> >> >> On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald
> >> >> <carter.schonwald at gmail.com> wrote:
> >> >> > http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf this
> >> >> > paper
> >> >> > (among many others by the blis project) articulates some of the
> ideas
> >> >> > i
> >> >> > allude to pretty well (with pictures!)
> >> >> >
> >> >> > On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald
> >> >> > <carter.schonwald at gmail.com> wrote:
> >> >> >>
> >> >> >> dense matrix product is not an algorithm that makes sense in
> repa's
> >> >> >> execution model,
> >> >> >> in square matrix multiply of two N x N matrices, each result entry
> >> >> >> depends
> >> >> >> on 2n values total across the  two input matrices.
> >> >> >> even then, thats actually the wrong way to parallelize dense
> matrix
> >> >> >> product! its worth reading the papers about goto blas and the more
> >> >> >> recent
> >> >> >> blis project. a high performance dense matrix multipy winds up
> >> >> >> needing
> >> >> >> to do
> >> >> >> some nested array parallelism with mutable updates to have
> efficient
> >> >> >> sharing
> >> >> >> of sub computations!
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko
> >> >> >> <aeyakovenko at gmail.com>
> >> >> >> wrote:
> >> >> >>>
> >> >> >>> you think the backed would make any difference?  this seems like
> a
> >> >> >>> runtime issue to me, how are the threads scheduled by the ghc
> >> >> >>> runtime?
> >> >> >>>
> >> >> >>> On Fri, Mar 13, 2015 at 4:58 PM, KC <kc1956 at gmail.com> wrote:
> >> >> >>> > How is the LLVM?
> >> >> >>> >
> >> >> >>> > --
> >> >> >>> > --
> >> >> >>> >
> >> >> >>> > Sent from an expensive device which will be obsolete in a few
> >> >> >>> > months!
> >> >> >>> > :D
> >> >> >>> >
> >> >> >>> > Casey
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko"
> >> >> >>> > <aeyakovenko at gmail.com>
> >> >> >>> > wrote:
> >> >> >>> >>
> >> >> >>> >> https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> so i am seeing basically results with N4 that are as good as
> >> >> >>> >> using
> >> >> >>> >> sequential computation on my macbook for the matrix multiply
> >> >> >>> >> algorithm.  any idea why?
> >> >> >>> >>
> >> >> >>> >> Thanks,
> >> >> >>> >> Anatoly
> >> >> >>> >> _______________________________________________
> >> >> >>> >> Haskell-Cafe mailing list
> >> >> >>> >> Haskell-Cafe at haskell.org
> >> >> >>> >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> >> >> >>> _______________________________________________
> >> >> >>> Haskell-Cafe mailing list
> >> >> >>> Haskell-Cafe at haskell.org
> >> >> >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> >> >> >>
> >> >> >>
> >> >> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20150315/a46e1f22/attachment.html>