<p dir="ltr">Read that paper I linked. Anything else I say will be a rehash of that paper. :)</p>

<div class="gmail_quote">On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko" <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">OK, so what's the difference between the sequential and parallel<br>

versions? Does the parallel one contain a thunk for every element in<br>

the output?<br>

<br>

On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald<br>

<<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>

> Read what I linked.<br>

> You are benchmarking repa on exactly the kind of workload that it is<br>

> worst at.<br>

><br>

> Repa is for pointwise-parallel and local-convolution-parallel programs.<br>

> The way repa expresses matrix multiplication is exactly the worst way to<br>

> implement a parallel matrix multiply: it is close to pessimal with respect<br>

> to memory traffic and communication complexity.<br>

><br>

> Benchmark something like image blur algorithms and repa will really shine.<br>

><br>

> Right now your benchmark is the repa equivalent of noticing that random<br>

> access on singly linked lists is slow :)<br>

><br>

> On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko" <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>> wrote:<br>

>><br>

>> I am not really focusing on matrix multiply specifically.  The real<br>

>> problem is that the implementation using parallelized functions is<br>

>> slower than the sequential one, and adding more threads makes it<br>

>> barely as fast as the sequential one.<br>

>><br>

>> So why would I ever use the parallelized versions?<br>

>><br>

>><br>

>> On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald<br>

>> <<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>

>> > <a href="http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf" target="_blank">http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf</a> This paper<br>

>> > (among many others by the BLIS project) articulates some of the ideas I<br>

>> > allude to pretty well (with pictures!)<br>

>> ><br>

>> > On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald<br>

>> > <<a href="mailto:carter.schonwald@gmail.com">carter.schonwald@gmail.com</a>> wrote:<br>

>> >><br>

>> >> Dense matrix product is not an algorithm that makes sense in repa's<br>

>> >> execution model.<br>

>> >> In a square matrix multiply of two N x N matrices, each result entry<br>

>> >> depends<br>

>> >> on 2N values in total across the two input matrices.<br>

>> >> Even then, that's actually the wrong way to parallelize dense matrix<br>

>> >> product! It's worth reading the papers about GotoBLAS and the more<br>

>> >> recent<br>

>> >> BLIS project. A high-performance dense matrix multiply winds up needing<br>

>> >> to do<br>

>> >> some nested array parallelism with mutable updates to get efficient<br>

>> >> sharing<br>

>> >> of subcomputations!<br>

>> >><br>

>> >><br>

>> >><br>

>> >> On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko<br>

>> >> <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>

>> >> wrote:<br>

>> >>><br>

>> >>> Do you think the backend would make any difference?  This seems like a<br>

>> >>> runtime issue to me. How are the threads scheduled by the GHC runtime?<br>

>> >>><br>

>> >>> On Fri, Mar 13, 2015 at 4:58 PM, KC <<a href="mailto:kc1956@gmail.com">kc1956@gmail.com</a>> wrote:<br>

>> >>> > How is the LLVM?<br>

>> >>> ><br>

>> >>> > --<br>

>> >>> > --<br>

>> >>> ><br>

>> >>> > Sent from an expensive device which will be obsolete in a few<br>

>> >>> > months!<br>

>> >>> > :D<br>

>> >>> ><br>

>> >>> > Casey<br>

>> >>> ><br>

>> >>> ><br>

>> >>> > On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko"<br>

>> >>> > <<a href="mailto:aeyakovenko@gmail.com">aeyakovenko@gmail.com</a>><br>

>> >>> > wrote:<br>

>> >>> >><br>

>> >>> >> <a href="https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8" target="_blank">https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8</a><br>

>> >>> >><br>

>> >>> >><br>

>> >>> >> So I am basically seeing results with -N4 that are about as good as<br>

>> >>> >> sequential computation on my MacBook for the matrix multiply<br>

>> >>> >> algorithm.  Any idea why?<br>

>> >>> >><br>

>> >>> >> Thanks,<br>

>> >>> >> Anatoly<br>

>> >>> >> _______________________________________________<br>

>> >>> >> Haskell-Cafe mailing list<br>

>> >>> >> <a href="mailto:Haskell-Cafe@haskell.org">Haskell-Cafe@haskell.org</a><br>

>> >>> >> <a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>

>> >><br>

>> >><br>

>> ><br>

</blockquote></div>
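<p dir="ltr">To make the "each result entry depends on 2N values" point from the thread concrete, here is a minimal sketch in plain Haskell lists. It is not repa code and not the BLIS algorithm. The names <code>mmulNaive</code>, <code>naiveReads</code>, and <code>blockedReads</code> are hypothetical, and the tile-reuse cost model (square tiles of side <code>t</code>, with <code>t</code> dividing <code>n</code>) is an assumption made only to show roughly why sharing sub-blocks cuts memory traffic.</p>

```haskell
import Data.List (transpose)

type Matrix = [[Double]]

-- Naive product: entry (i, j) is the dot product of row i of a and
-- column j of b, so every entry reads n values from a and n from b.
mmulNaive :: Matrix -> Matrix -> Matrix
mmulNaive a b = [[sum (zipWith (*) row col) | col <- cols] | row <- a]
  where
    cols = transpose b

-- Input reads for the naive n x n product: n * n entries, 2n reads each.
naiveReads :: Int -> Int
naiveReads n = n * n * (2 * n)

-- Assumed simplified cost model for a blocked product with t x t tiles:
-- each of the k * k output tiles (k = n / t) loads k pairs of input
-- tiles, and each pair holds 2 * t * t values.
blockedReads :: Int -> Int -> Int
blockedReads n t = k * k * k * (2 * t * t)
  where
    k = n `div` t

main :: IO ()
main = do
  print (mmulNaive [[1, 2], [3, 4]] [[5, 6], [7, 8]])
  print (naiveReads 1024, blockedReads 1024 32)
```

<p dir="ltr">Under this assumed model the blocked variant reads a factor of <code>t</code> fewer input values than the entry-at-a-time version, which is one way to see why an element-wise formulation is a poor fit for dense matrix product.</p>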