[Haskell-cafe] repa parallelization results

Carter Schonwald carter.schonwald at gmail.com
Tue Mar 17 04:08:11 UTC 2015


If you want to understand writing matrix algorithms for the GPU,
http://icl.cs.utk.edu/magma/pubs/index.html has lots of reading resources.
The cost model for memory and computation on a GPU is different from the
CPU cost model.

And http://icl.cs.utk.edu/magma/software/ has a bunch of high-performance
GPU kernels for a bunch of architectures.

Accelerate is good to look at, at minimum because it manages to focus on a
fragment of functional array programs that are moderately easy to optimize
for the GPU.
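
For a flavour of that fragment: the classic dot-product example from the
accelerate documentation is just a pointwise zipWith fed into a fold.  A
minimal, untested version against the reference interpreter:

import Data.Array.Accelerate             as A
import Data.Array.Accelerate.Interpreter as I  -- swap in a GPU backend later

-- whole-array operations only: no indexing, no data-dependent control flow
dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = A.fromList (Z :. 10) [1 ..] :: Vector Float
      ys = A.fromList (Z :. 10) [1 ..] :: Vector Float
  print (I.run (dotp (A.use xs) (A.use ys)))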


On Mon, Mar 16, 2015 at 11:27 PM, Anatoly Yakovenko <aeyakovenko at gmail.com>
wrote:

> You mean by hand?  Or using accelerate?  I am not sure GPUs would do
> that much better at matrix multiplication.  According to that paper
> it's pretty cache bound, so having a bigger L1/L2 would have a higher
> impact on performance than more cores, or wider SIMD.
>
> Either way, I am more trying to understand how the parallelization
> techniques work in Repa, what use cases they are applicable in, and how
> to pick the sequential or parallel versions of functions.
>
> Anatoly
>
>
>
> On Mon, Mar 16, 2015 at 6:20 PM, KC <kc1956 at gmail.com> wrote:
> > You want to throw your parallelizable matrix operations to the GPU cores.
> >
> > MATLAB can now do this, and I believe it is starting to be built into R
> > so that R can use the GPU cores.
> >
> >
> > On Mon, Mar 16, 2015 at 5:11 AM, Anatoly Yakovenko
> > <aeyakovenko at gmail.com> wrote:
> >>
> >> Hmm, so I was wrong:
> >>
> >> https://gist.github.com/aeyakovenko/0af788390ee9d980c1d6
> >>
> >> The first version performed the best, even when running with -N1
> >> against the sequential version.
> >>
> >> On Sun, Mar 15, 2015 at 8:04 PM, Carter Schonwald
> >> <carter.schonwald at gmail.com> wrote:
> >> > you're getting on the right track! :)
> >> >
> >> > The idea you're reaching for is "parallel work depth".  E.g., if
> >> > instead of foldl' (which has O(n) work depth), you had a "parallel"
> >> > fold that kinda looks like a recursive split-and-then-merge version of
> >> > the fold operation, you'd have O(log n) work depth (and that'd likely
> >> > be faster!).  But then you'd notice "below some threshold, it's better
> >> > to compute sequentially, because the overhead of parallelization is
> >> > too big".
> >> >
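> >> > A rough, untested sketch of that split-then-merge shape, using plain
> >> > lists and par/pseq from the parallel package, just to make the
> >> > threshold idea concrete (on an array you'd split by index instead of
> >> > paying for splitAt):
> >> >
> >> > import Control.Parallel (par, pseq)
> >> > import Data.List (foldl')
> >> >
> >> > -- divide-and-conquer sum: O(log n) depth above the threshold,
> >> > -- plain sequential foldl' below it
> >> > psum :: Int -> [Double] -> Double
> >> > psum threshold xs = go (length xs) xs
> >> >   where
> >> >     go n ys
> >> >       | n <= threshold = foldl' (+) 0 ys
> >> >       | otherwise      =
> >> >           let half     = n `div` 2
> >> >               (ls, rs) = splitAt half ys
> >> >               l        = go half ls
> >> >               r        = go (n - half) rs
> >> >           in l `par` (r `pseq` l + r)  -- spark the left half, force the right
> >> >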
> >> > Etc., etc.  (The point I'm trying to reach for is that effective
> >> > parallelization requires a pretty rich understanding of your
> >> > application / software / hardware cost model.)
> >> >
> >> > Likewise, REPA is really only going to shine on workloads that look
> >> > "pointwise" or "flat", at least with the current iteration.  It's
> >> > probably a good idea to look at the various example codes that are
> >> > available for repa and accelerate, because you'll notice that the
> >> > codes which are especially performant have that "flat" style of
> >> > parallelism.
> >> >
> >> >
> >> > On Sun, Mar 15, 2015 at 7:16 PM, Anatoly Yakovenko
> >> > <aeyakovenko at gmail.com>
> >> > wrote:
> >> >>
> >> >> Ok, got it. I picked the wrong function to try to understand how Repa
> >> >> parallelizes :)
> >> >>
> >> >> So what's a good heuristic for choosing the parallel versions vs. the
> >> >> sequential ones in Repa?
> >> >>
> >> >> Do the internals try to parallelize every element?  Or do they fuse
> >> >> the elements into some small number of parallelized tasks?
> >> >>
> >> >> So just based on my observations:
> >> >>
> >> >> f (Z :. r :. c) = r * c
> >> >>
> >> >> a <- computeP (fromFunction (Z :. n :. n) f)
> >> >> a `deepSeqArray` sumAllP a
> >> >>
> >> >> should be faster than:
> >> >>
> >> >> let a = computeS $ fromFunction (Z :. n :. n) f
> >> >> a `deepSeqArray` sumAllP a
> >> >>
> >> >> but probably slower than
> >> >>
> >> >> sumAllS $ computeS $ fromFunction (Z :. n :. n) f
> >> >>
> >> >> Since an intermediate array is not even computed.
> >> >>
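> >> >> For completeness, a self-contained, untested version of those three
> >> >> variants, with the extent filled in, looks roughly like:
> >> >>
> >> >> import Data.Array.Repa as R
> >> >>
> >> >> n :: Int
> >> >> n = 2000
> >> >>
> >> >> f :: DIM2 -> Int
> >> >> f (Z :. r :. c) = r * c
> >> >>
> >> >> main :: IO ()
> >> >> main = do
> >> >>   -- 1. fill the array in parallel, then sum in parallel
> >> >>   a  <- computeP (fromFunction (Z :. n :. n) f) :: IO (Array U DIM2 Int)
> >> >>   s1 <- a `deepSeqArray` sumAllP a
> >> >>   -- 2. fill sequentially, then sum in parallel
> >> >>   let b = computeS (fromFunction (Z :. n :. n) f) :: Array U DIM2 Int
> >> >>   s2 <- b `deepSeqArray` sumAllP b
> >> >>   -- 3. fully sequential, no parallel scheduling at all
> >> >>   let s3 = sumAllS (computeS (fromFunction (Z :. n :. n) f) :: Array U DIM2 Int)
> >> >>   print (s1, s2, s3)
> >> >>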
> >> >> Thanks,
> >> >> Anatoly
> >> >>
> >> >>
> >> >> On Sun, Mar 15, 2015 at 1:41 PM, Carter Schonwald
> >> >> <carter.schonwald at gmail.com> wrote:
> >> >> > Read that paper I linked.  Anything else I say will be a rehash of
> >> >> > that paper. :)
> >> >> >
> >> >> > On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko"
> >> >> > <aeyakovenko at gmail.com> wrote:
> >> >> >>
> >> >> >> Ok, so what's the difference between the sequential and parallel
> >> >> >> versions?  Does the parallel one contain a thunk for every element
> >> >> >> in the output?
> >> >> >>
> >> >> >> On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald
> >> >> >> <carter.schonwald at gmail.com> wrote:
> >> >> >> > Read what I linked.
> >> >> >> > You are benchmarking repa for exactly the pessimal workload that
> >> >> >> > it is bad at.
> >> >> >> >
> >> >> >> > Repa is for pointwise parallel and local convolution parallel
> >> >> >> > programs.  The way repa can express matrix multiplication is
> >> >> >> > exactly the worst way to implement a parallel matrix mult.  Like,
> >> >> >> > pretty pessimal wrt a memory traffic / communication complexity
> >> >> >> > metric of performance.
> >> >> >> >
> >> >> >> > Benchmark something like image blur algorithms and repa will
> >> >> >> > really shine.
> >> >> >> >
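> >> >> >> > To make "pointwise" concrete, here is a tiny, untested sketch of
> >> >> >> > the kind of shape repa chunks up nicely across cores:
> >> >> >> >
> >> >> >> > import Data.Array.Repa as R
> >> >> >> >
> >> >> >> > -- each output element depends only on the matching input
> >> >> >> > -- elements, so computeP can hand every core an independent
> >> >> >> > -- chunk of the array
> >> >> >> > axpy :: Monad m
> >> >> >> >      => Double -> Array U DIM1 Double -> Array U DIM1 Double
> >> >> >> >      -> m (Array U DIM1 Double)
> >> >> >> > axpy alpha xs ys = computeP (R.zipWith (\x y -> alpha * x + y) xs ys)
> >> >> >> >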
> >> >> >> > Right now your benchmark is the repa equivalent of noticing that
> >> >> >> > random access on singly linked lists is slow :)
> >> >> >> >
> >> >> >> > On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko"
> >> >> >> > <aeyakovenko at gmail.com>
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> I am not really focusing on matrix multiply specifically.  So the
> >> >> >> >> real problem is that the implementation using parallelized
> >> >> >> >> functions is slower than the sequential one, and adding more
> >> >> >> >> threads makes it barely as fast as the sequential one.
> >> >> >> >>
> >> >> >> >> So why would I ever use the parallelized versions?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald
> >> >> >> >> <carter.schonwald at gmail.com> wrote:
> >> >> >> >> > http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
> >> >> >> >> > This paper (among many others by the BLIS project) articulates
> >> >> >> >> > some of the ideas I allude to pretty well (with pictures!)
> >> >> >> >> >
> >> >> >> >> > On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald
> >> >> >> >> > <carter.schonwald at gmail.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Dense matrix product is not an algorithm that makes sense in
> >> >> >> >> >> repa's execution model: in a square matrix multiply of two
> >> >> >> >> >> N x N matrices, each result entry depends on 2N values total
> >> >> >> >> >> across the two input matrices.
> >> >> >> >> >> Even then, that's actually the wrong way to parallelize dense
> >> >> >> >> >> matrix product!  It's worth reading the papers about GotoBLAS
> >> >> >> >> >> and the more recent BLIS project.  A high performance dense
> >> >> >> >> >> matrix multiply winds up needing to do some nested array
> >> >> >> >> >> parallelism with mutable updates to have efficient sharing
> >> >> >> >> >> of sub-computations!
> >> >> >> >> >>
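> >> >> >> >> >> For a feel of the blocking idea in those papers, here is a
> >> >> >> >> >> rough, untested, purely sequential sketch over flat unboxed
> >> >> >> >> >> vectors (the real GotoBLAS/BLIS designs add packing and
> >> >> >> >> >> nested parallelism on top of this):
> >> >> >> >> >>
> >> >> >> >> >> import qualified Data.Vector.Unboxed         as U
> >> >> >> >> >> import qualified Data.Vector.Unboxed.Mutable as M
> >> >> >> >> >> import Control.Monad (forM_)
> >> >> >> >> >> import Control.Monad.ST (runST)
> >> >> >> >> >>
> >> >> >> >> >> -- row-major n*n matmul, walked in b*b tiles so that each tile
> >> >> >> >> >> -- of the inputs is reused while it is still hot in cache
> >> >> >> >> >> matmulBlocked :: Int -> Int -> U.Vector Double -> U.Vector Double
> >> >> >> >> >>               -> U.Vector Double
> >> >> >> >> >> matmulBlocked n b am bm = runST $ do
> >> >> >> >> >>   c <- M.replicate (n * n) 0
> >> >> >> >> >>   forM_ [0, b .. n - 1] $ \ii ->
> >> >> >> >> >>     forM_ [0, b .. n - 1] $ \jj ->
> >> >> >> >> >>       forM_ [0, b .. n - 1] $ \kk ->
> >> >> >> >> >>         forM_ [ii .. min (ii + b) n - 1] $ \i ->
> >> >> >> >> >>           forM_ [jj .. min (jj + b) n - 1] $ \j -> do
> >> >> >> >> >>             acc <- M.read c (i * n + j)
> >> >> >> >> >>             let s = sum [ am U.! (i * n + k) * bm U.! (k * n + j)
> >> >> >> >> >>                         | k <- [kk .. min (kk + b) n - 1] ]
> >> >> >> >> >>             M.write c (i * n + j) (acc + s)
> >> >> >> >> >>   U.freeze c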
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko
> >> >> >> >> >> <aeyakovenko at gmail.com>
> >> >> >> >> >> wrote:
> >> >> >> >> >>>
> >> >> >> >> >>> You think the backend would make any difference?  This seems
> >> >> >> >> >>> like a runtime issue to me; how are the threads scheduled by
> >> >> >> >> >>> the GHC runtime?
> >> >> >> >> >>>
> >> >> >> >> >>> On Fri, Mar 13, 2015 at 4:58 PM, KC <kc1956 at gmail.com> wrote:
> >> >> >> >> >>> > How is the LLVM backend?
> >> >> >> >> >>> >
> >> >> >> >> >>> > --
> >> >> >> >> >>> > --
> >> >> >> >> >>> >
> >> >> >> >> >>> > Sent from an expensive device which will be obsolete in a
> >> >> >> >> >>> > few
> >> >> >> >> >>> > months!
> >> >> >> >> >>> > :D
> >> >> >> >> >>> >
> >> >> >> >> >>> > Casey
> >> >> >> >> >>> >
> >> >> >> >> >>> >
> >> >> >> >> >>> > On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko"
> >> >> >> >> >>> > <aeyakovenko at gmail.com>
> >> >> >> >> >>> > wrote:
> >> >> >> >> >>> >>
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8
> >> >> >> >> >>> >>
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> So I am basically seeing results with -N4 that are only
> >> >> >> >> >>> >> as good as sequential computation on my MacBook for the
> >> >> >> >> >>> >> matrix multiply algorithm.  Any idea why?
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> Thanks,
> >> >> >> >> >>> >> Anatoly
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >
> >> >
> >> >
> >
> >
> >
> >
> > --
> >
> > --
> >
> > Sent from an expensive device which will be obsolete in a few months! :D
> >
> > Casey
> >
> >
> >
>