<div dir="ltr"><div>You want to throw your parallelizable matrix operations to the GPU cores.<br><br></div>MATLAB can now do this and I believe it is starting to be built into R so that R can use the GPU cores..<br><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 16, 2015 at 5:11 AM, Anatoly Yakovenko <span dir="ltr"><<a href="mailto:aeyakovenko@gmail.com" target="_blank">aeyakovenko@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">hmm, so i was wrong<br>
<br>
<a href="https://gist.github.com/aeyakovenko/0af788390ee9d980c1d6" target="_blank">https://gist.github.com/aeyakovenko/0af788390ee9d980c1d6</a><br>
<br>
the first version performed the best, even when running with -N1<br>
agains the sequential version.<br>
<br>
On Sun, Mar 15, 2015 at 8:04 PM, Carter Schonwald<br>
<div class="HOEnZb"><div class="h5"><<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>
> you're getting on the right track! :)<br>
><br>
> the idea you're reaching for is "parallel work depth". Eg, if instead of<br>
> foldl' (which has O(n) work depth), you had a "parallel" fold that kinda<br>
> looks like a recursive split and then merge version of the fold operation,<br>
> you'd have O(log n) work depth. (and that'd likely be faster!). But then<br>
> you'd notice "below some threshold, its better to compute sequentially,<br>
> because the overhead of parallization is too big".<br>
><br>
> etc etc. (the point i'm trying to reach for is that effective<br>
> parallelization requires a pretty rich understanding of your application /<br>
> software / hardware cost model)<br>
><br>
> likewise, REPA is really only going to shine on workloads that look<br>
> "pointwise" or "flat", at least with the current iteration. Its probably a<br>
> good idea to look at the various example codes that are available for repa<br>
> and acccelerate, because you'll notice that the codes which are especially<br>
> performant have that "flat" style of paralellims<br>
><br>
><br>
> On Sun, Mar 15, 2015 at 7:16 PM, Anatoly Yakovenko <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>
> wrote:<br>
>><br>
>> Ok, got it. I picked the wrong function to try to understand how Repa<br>
>> parallelizes :)<br>
>><br>
>> So whats a good heuristic for using the parallel versions vs<br>
>> sequential for Repa?<br>
>><br>
>> Do the internals try to parallelize every element? or does it fuse<br>
>> them into some small number of parallelized tasks?<br>
>><br>
>> So just based from my observations<br>
>><br>
>> f (Z :. r :. c) = r * c<br>
>><br>
>> a <- computeP (fromFunction f)<br>
>> a `deepSeqArray` sumAllP a<br>
>><br>
>> should be faster then:<br>
>><br>
>> let a = computeS $ fromFunction f<br>
>> a `deepSeqArray` sumAllP $ a<br>
>><br>
>> but probably slower then<br>
>><br>
>> sumAllS $ computeS $ fromFunction f<br>
>><br>
>> Since an intermediate array is not even computed.<br>
>><br>
>> Thanks,<br>
>> Anatoly<br>
>><br>
>><br>
>> On Sun, Mar 15, 2015 at 1:41 PM, Carter Schonwald<br>
>> <<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>
>> > Read that paper I linked. Anything else I say will be a rehash of that<br>
>> > paper. :)<br>
>> ><br>
>> > On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko" <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>
>> > wrote:<br>
>> >><br>
>> >> Ok, so whats the difference between the sequence and parallel<br>
>> >> versions? does the parallel one contain a thunk for every element in<br>
>> >> the output?<br>
>> >><br>
>> >> On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald<br>
>> >> <<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>
>> >> > Read what I linked.<br>
>> >> > You are benchmarking repa for exactly the pessimal workload that it<br>
>> >> > is<br>
>> >> > bad<br>
>> >> > at.<br>
>> >> ><br>
>> >> > Repa is for point wise parallel and local convolution parallel<br>
>> >> > programs.<br>
>> >> > The way repa can express matrix multiplication is exactly the worst<br>
>> >> > way<br>
>> >> > to<br>
>> >> > implement a parallel matrix mult. Like, pretty pessimal wrt a memory<br>
>> >> > traffic / communication complexity metric of performance.<br>
>> >> ><br>
>> >> > Benchmark something like image blur algorithms and repa will really<br>
>> >> > shine.<br>
>> >> ><br>
>> >> > Right now your benchmark is the repa equivalent of noticing that<br>
>> >> > random<br>
>> >> > access on singly linked lists is slow :)<br>
>> >> ><br>
>> >> > On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko" <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>
>> >> > wrote:<br>
>> >> >><br>
>> >> >> I am not really focusing on matrix multiply specifically. So the<br>
>> >> >> real<br>
>> >> >> problem is that the implementation using parallelized functions is<br>
>> >> >> slower then the sequential one, and adding more threads makes it<br>
>> >> >> barely as fast as the sequential one.<br>
>> >> >><br>
>> >> >> So why would i ever use the parallelized versions?<br>
>> >> >><br>
>> >> >><br>
>> >> >> On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald<br>
>> >> >> <<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>
>> >> >> > <a href="http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf" target="_blank">http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf</a> this<br>
>> >> >> > paper<br>
>> >> >> > (among many others by the blis project) articulates some of the<br>
>> >> >> > ideas<br>
>> >> >> > i<br>
>> >> >> > allude to pretty well (with pictures!)<br>
>> >> >> ><br>
>> >> >> > On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald<br>
>> >> >> > <<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>
>> >> >> >><br>
>> >> >> >> dense matrix product is not an algorithm that makes sense in<br>
>> >> >> >> repa's<br>
>> >> >> >> execution model,<br>
>> >> >> >> in square matrix multiply of two N x N matrices, each result<br>
>> >> >> >> entry<br>
>> >> >> >> depends<br>
>> >> >> >> on 2n values total across the two input matrices.<br>
>> >> >> >> even then, thats actually the wrong way to parallelize dense<br>
>> >> >> >> matrix<br>
>> >> >> >> product! its worth reading the papers about goto blas and the<br>
>> >> >> >> more<br>
>> >> >> >> recent<br>
>> >> >> >> blis project. a high performance dense matrix multipy winds up<br>
>> >> >> >> needing<br>
>> >> >> >> to do<br>
>> >> >> >> some nested array parallelism with mutable updates to have<br>
>> >> >> >> efficient<br>
>> >> >> >> sharing<br>
>> >> >> >> of sub computations!<br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >> On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko<br>
>> >> >> >> <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>
>> >> >> >> wrote:<br>
>> >> >> >>><br>
>> >> >> >>> you think the backed would make any difference? this seems like<br>
>> >> >> >>> a<br>
>> >> >> >>> runtime issue to me, how are the threads scheduled by the ghc<br>
>> >> >> >>> runtime?<br>
>> >> >> >>><br>
>> >> >> >>> On Fri, Mar 13, 2015 at 4:58 PM, KC <<a href="mailto:kc1956@gmail.com">kc1956@gmail.com</a>> wrote:<br>
>> >> >> >>> > How is the LLVM?<br>
>> >> >> >>> ><br>
>> >> >> >>> > --<br>
>> >> >> >>> > --<br>
>> >> >> >>> ><br>
>> >> >> >>> > Sent from an expensive device which will be obsolete in a few<br>
>> >> >> >>> > months!<br>
>> >> >> >>> > :D<br>
>> >> >> >>> ><br>
>> >> >> >>> > Casey<br>
>> >> >> >>> ><br>
>> >> >> >>> ><br>
>> >> >> >>> > On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko"<br>
>> >> >> >>> > <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>
>> >> >> >>> > wrote:<br>
>> >> >> >>> >><br>
>> >> >> >>> >> <a href="https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8" target="_blank">https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8</a><br>
>> >> >> >>> >><br>
>> >> >> >>> >><br>
>> >> >> >>> >> so i am seeing basically results with N4 that are as good as<br>
>> >> >> >>> >> using<br>
>> >> >> >>> >> sequential computation on my macbook for the matrix multiply<br>
>> >> >> >>> >> algorithm. any idea why?<br>
>> >> >> >>> >><br>
>> >> >> >>> >> Thanks,<br>
>> >> >> >>> >> Anatoly<br>
>> >> >> >>> >> _______________________________________________<br>
>> >> >> >>> >> Haskell-Cafe mailing list<br>
>> >> >> >>> >> <a href="mailto:Haskell-Cafe@haskell.org">Haskell-Cafe@haskell.org</a><br>
>> >> >> >>> >> <a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>
>> >> >> >>> _______________________________________________<br>
>> >> >> >>> Haskell-Cafe mailing list<br>
>> >> >> >>> <a href="mailto:Haskell-Cafe@haskell.org">Haskell-Cafe@haskell.org</a><br>
>> >> >> >>> <a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> ><br>
><br>
><br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature"><div dir="ltr"><p dir="ltr">--<br>
</p><p dir="ltr">Sent from an expensive device which will be obsolete in a few months! :D</p>
Casey<br><br></div></div>
</div>