<div dir="ltr"><div>You want to offload your parallelizable matrix operations to the GPU cores.<br><br></div>MATLAB can now do this, and I believe support is starting to be built into R so that R can use the GPU cores too.<br><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 16, 2015 at 5:11 AM, Anatoly Yakovenko <span dir="ltr"><<a href="mailto:aeyakovenko@gmail.com" target="_blank">aeyakovenko@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">hmm, so i was wrong<br>

<br>

<a href="https://gist.github.com/aeyakovenko/0af788390ee9d980c1d6" target="_blank">https://gist.github.com/aeyakovenko/0af788390ee9d980c1d6</a><br>

<br>

the first version performed the best, even when running with -N1<br>

against the sequential version.<br>

<br>

On Sun, Mar 15, 2015 at 8:04 PM, Carter Schonwald<br>

<div class="HOEnZb"><div class="h5"><<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>

> you're getting on the right track! :)<br>

><br>

> the idea you're reaching for is "parallel work depth".  E.g., if instead of<br>

> foldl' (which has O(n) work depth), you had a "parallel" fold that kinda<br>

> looks like a recursive split-and-merge version of the fold operation,<br>

> you'd have O(log n) work depth (and that'd likely be faster!). But then<br>

> you'd notice that "below some threshold, it's better to compute sequentially,<br>

> because the overhead of parallelization is too big".<br>

><br>

> etc etc. (the point i'm trying to reach for is that effective<br>

> parallelization requires a pretty rich understanding of your application /<br>

> software / hardware cost model)<br>
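A minimal sketch of the split-and-merge fold with a sequential cutoff described above, assuming only GHC's base library (`par` and `pseq` from `GHC.Conc`). The names `parFoldRange` and `cutoff` are hypothetical and not from the thread; the fold runs over an index range rather than a list so that splitting is cheap, and `op` is assumed to be associative with `z` as its identity.

```haskell
import Data.List (foldl')
import GHC.Conc (par, pseq)

-- | Combine g lo, g (lo+1), ..., g (hi-1) with 'op' by recursive splitting.
-- Assumes 'op' is associative and 'z' is its identity element.
-- Ranges of at most 'cutoff' elements are folded sequentially with foldl',
-- because sparking tiny pieces of work costs more than it saves.
parFoldRange :: Int -> (a -> a -> a) -> a -> (Int -> a) -> Int -> Int -> a
parFoldRange cutoff op z g = go
  where
    -- Clamp the cutoff so that splitting always makes progress.
    limit = max 1 cutoff
    go lo hi
      | hi - lo <= limit = foldl' (\acc i -> acc `op` g i) z [lo .. hi - 1]
      | otherwise = left `par` (right `pseq` (left `op` right))
      where
        mid = lo + (hi - lo) `div` 2
        left = go lo mid   -- sparked, may run on another core
        right = go mid hi  -- evaluated on the current thread
```

For example, `parFoldRange 1024 (+) 0 id 0 100000` sums the integers below 100000. Sparks only evaluate to weak head normal form, which is enough for an `Int` result, and real parallelism needs a program built with `-threaded` and run with something like `+RTS -N4`. The right cutoff is workload-dependent and has to be measured.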

><br>

> likewise, REPA is really only going to shine on workloads that look<br>

> "pointwise" or "flat", at least with the current iteration. It's probably a<br>

> good idea to look at the various example codes that are available for repa<br>

> and accelerate, because you'll notice that the codes which are especially<br>

> performant have that "flat" style of parallelism.<br>

><br>

><br>

> On Sun, Mar 15, 2015 at 7:16 PM, Anatoly Yakovenko <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>

> wrote:<br>

>><br>

>> Ok, got it. I picked the wrong function to try to understand how Repa<br>

>> parallelizes :)<br>

>><br>

>> So what's a good heuristic for using the parallel versions vs<br>

>> sequential for Repa?<br>

>><br>

>> Do the internals try to parallelize every element?  Or do they fuse<br>

>> them into some small number of parallelized tasks?<br>

>><br>

>> So, just based on my observations,<br>

>><br>

>> f (Z :. r :. c) = r * c<br>

>><br>

>> a <- computeP (fromFunction f)<br>

>> a `deepSeqArray` sumAllP a<br>

>><br>

>> should be faster than:<br>

>><br>

>> let a = computeS $ fromFunction f<br>

>> a `deepSeqArray` sumAllP $ a<br>

>><br>

>> but probably slower than<br>

>><br>

>> sumAllS $ computeS $ fromFunction f<br>

>><br>

>> Since an intermediate array is not even computed.<br>
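A hedged sketch of the three variants above, assuming the `repa` package (3.x API). The snippets in the message elide the extent argument that `fromFunction` takes first; here it is supplied as a hypothetical `n` x `n` shape. The names `delayed`, `sumViaComputeP`, `sumViaComputeS`, and `sumDelayed` are illustrative only. Note one deliberate difference: the third variant drops `computeS`, since, as far as I understand the library, `computeS` still builds a manifest array, whereas reducing the delayed array directly does not.

```haskell
import Data.Array.Repa
  ( Array, D, U, DIM2, Z (..), (:.) (..)
  , computeP, computeS, fromFunction, sumAllP, sumAllS )

-- Element function from the message above.
f :: DIM2 -> Int
f (Z :. r :. c) = r * c

-- Delayed n x n array: no elements are computed until it is consumed.
delayed :: Int -> Array D DIM2 Int
delayed n = fromFunction (Z :. n :. n) f

-- Variant 1: materialise in parallel, then reduce in parallel.
sumViaComputeP :: Int -> IO Int
sumViaComputeP n = do
  a <- computeP (delayed n) :: IO (Array U DIM2 Int)
  sumAllP a

-- Variant 2: materialise sequentially, then reduce in parallel.
sumViaComputeS :: Int -> IO Int
sumViaComputeS n = do
  let a = computeS (delayed n) :: Array U DIM2 Int
  sumAllP a

-- Variant 3: reduce the delayed array directly; no manifest array is built.
sumDelayed :: Int -> Int
sumDelayed n = sumAllS (delayed n)
```

All three compute the same sum; which is fastest depends on the array size and the cost of `f`, so the ordering guessed in the message should be checked by benchmarking rather than assumed.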

>><br>

>> Thanks,<br>

>> Anatoly<br>

>><br>

>><br>

>> On Sun, Mar 15, 2015 at 1:41 PM, Carter Schonwald<br>

>> <<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>

>> > Read that paper I linked. Anything else I say will be a rehash of that<br>

>> > paper. :)<br>

>> ><br>

>> > On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko" <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>

>> > wrote:<br>

>> >><br>

>> >> Ok, so what's the difference between the sequential and parallel<br>

>> >> versions? Does the parallel one contain a thunk for every element in<br>

>> >> the output?<br>

>> >><br>

>> >> On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald<br>

>> >> <<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>

>> >> > Read what I linked.<br>

>> >> > You are benchmarking repa for exactly the pessimal workload that it is<br>

>> >> > bad at.<br>

>> >> ><br>

>> >> > Repa is for pointwise-parallel and local-convolution-parallel<br>

>> >> > programs.<br>

>> >> > The way repa can express matrix multiplication is exactly the worst way<br>

>> >> > to implement a parallel matrix multiply.  Like, pretty pessimal with respect<br>

>> >> > to a memory traffic / communication complexity metric of performance.<br>

>> >> ><br>

>> >> > Benchmark something like image blur algorithms and repa will really<br>

>> >> > shine.<br>

>> >> ><br>

>> >> > Right now your benchmark is the repa equivalent of noticing that random<br>

>> >> > access on singly linked lists is slow :)<br>

>> >> ><br>

>> >> > On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko" <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>

>> >> > wrote:<br>

>> >> >><br>

>> >> >> I am not really focusing on matrix multiply specifically.  So the real<br>

>> >> >> problem is that the implementation using parallelized functions is<br>

>> >> >> slower than the sequential one, and adding more threads makes it<br>

>> >> >> barely as fast as the sequential one.<br>

>> >> >><br>

>> >> >> So why would i ever use the parallelized versions?<br>

>> >> >><br>

>> >> >><br>

>> >> >> On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald<br>

>> >> >> <<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>

>> >> >> > <a href="http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf" target="_blank">http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf</a> this<br>

>> >> >> > paper (among many others by the blis project) articulates some of the<br>

>> >> >> > ideas i allude to pretty well (with pictures!)<br>

>> >> >> ><br>

>> >> >> > On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald<br>

>> >> >> > <<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>

>> >> >> >><br>

>> >> >> >> dense matrix product is not an algorithm that makes sense in repa's<br>

>> >> >> >> execution model,<br>

>> >> >> >> in square matrix multiply of two N x N matrices, each result entry<br>

>> >> >> >> depends on 2N values total across the two input matrices.<br>

>> >> >> >> even then, that's actually the wrong way to parallelize dense matrix<br>

>> >> >> >> product! it's worth reading the papers about goto blas and the more<br>

>> >> >> >> recent blis project. a high performance dense matrix multiply winds up<br>

>> >> >> >> needing to do some nested array parallelism with mutable updates to have<br>

>> >> >> >> efficient sharing of sub computations!<br>

>> >> >> >><br>

>> >> >> >><br>

>> >> >> >><br>

>> >> >> >> On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko<br>

>> >> >> >> <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>

>> >> >> >> wrote:<br>

>> >> >> >>><br>

>> >> >> >>> you think the backend would make any difference?  this seems like a<br>

>> >> >> >>> runtime issue to me, how are the threads scheduled by the ghc<br>

>> >> >> >>> runtime?<br>

>> >> >> >>><br>

>> >> >> >>> On Fri, Mar 13, 2015 at 4:58 PM, KC <<a href="mailto:kc1956@gmail.com">kc1956@gmail.com</a>> wrote:<br>

>> >> >> >>> > How is the LLVM?<br>

>> >> >> >>> ><br>

>> >> >> >>> > --<br>

>> >> >> >>> > --<br>

>> >> >> >>> ><br>

>> >> >> >>> > Sent from an expensive device which will be obsolete in a few<br>

>> >> >> >>> > months!<br>

>> >> >> >>> > :D<br>

>> >> >> >>> ><br>

>> >> >> >>> > Casey<br>

>> >> >> >>> ><br>

>> >> >> >>> ><br>

>> >> >> >>> > On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko"<br>

>> >> >> >>> > <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>

>> >> >> >>> > wrote:<br>

>> >> >> >>> >><br>

>> >> >> >>> >> <a href="https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8" target="_blank">https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8</a><br>

>> >> >> >>> >><br>

>> >> >> >>> >><br>

>> >> >> >>> >> so i am basically seeing results with -N4 that are only as good as<br>

>> >> >> >>> >> using sequential computation on my macbook for the matrix multiply<br>

>> >> >> >>> >> algorithm.  any idea why?<br>

>> >> >> >>> >><br>

>> >> >> >>> >> Thanks,<br>

>> >> >> >>> >> Anatoly<br>

>> >> >> >>> >> _______________________________________________<br>

>> >> >> >>> >> Haskell-Cafe mailing list<br>

>> >> >> >>> >> <a href="mailto:Haskell-Cafe@haskell.org">Haskell-Cafe@haskell.org</a><br>

>> >> >> >>> >> <a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>


>> >> >> >><br>

>> >> >> >><br>

>> >> >> ><br>

><br>

><br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature"><div dir="ltr"><p dir="ltr">--<br>

</p><p dir="ltr">Sent from an expensive device which will be obsolete in a few months! :D</p>

Casey<br><br></div></div>

</div>