[Haskell-cafe] repa parallelization results

Anatoly Yakovenko aeyakovenko at gmail.com
Sun Mar 15 20:21:15 UTC 2015


Ok, so what's the difference between the sequential and parallel
versions? Does the parallel one contain a thunk for every element in
the output?
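
For concreteness, here is roughly the distinction I am asking about, as a toy
sketch (assuming the arrays get forced with repa's computeS / computeP; the
names runSeq and runPar and the workload are made up for illustration):

    import Data.Array.Repa as R

    -- A delayed (D) array: nothing is evaluated until it is computed.
    xs :: Array D DIM1 Double
    xs = R.map (\x -> x * x + 1)
       $ fromFunction (Z :. 1000000) (\(Z :. i) -> fromIntegral i)

    -- Sequential: a single thread fills the unboxed result buffer.
    runSeq :: Array U DIM1 Double
    runSeq = computeS xs

    -- Parallel: as I understand it, computeP does not build a thunk per
    -- element; it splits the index space into chunks and has the gang of
    -- worker threads fill the same unboxed buffer, then returns the
    -- manifest array.
    runPar :: IO (Array U DIM1 Double)
    runPar = computeP xs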

On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald
<carter.schonwald at gmail.com> wrote:
> Read what I linked.
> You are benchmarking repa on exactly the workload it is worst at.
>
> Repa is for pointwise-parallel and local-convolution-parallel programs.
> The way repa expresses matrix multiplication is just about the worst way
> to implement a parallel matrix multiply: close to pessimal with respect
> to a memory-traffic / communication-complexity measure of performance.
>
> Benchmark something like image blur algorithms and repa will really shine.
>
> Right now your benchmark is the repa equivalent of noticing that random
> access on singly linked lists is slow :)
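
For comparison, a minimal sketch of the kind of stencil workload being
described, using the repa-stencil package's Dim2 support (a 3x3 box blur;
image loading omitted, and this is an illustrative sketch rather than the
repa-examples blur):

    {-# LANGUAGE QuasiQuotes, TemplateHaskell #-}
    import Data.Array.Repa              as R
    import Data.Array.Repa.Stencil      (Boundary (BoundClamp))
    import Data.Array.Repa.Stencil.Dim2 (stencil2, mapStencil2)

    -- Each output pixel reads only its 9 neighbours, so the image splits
    -- into independent chunks with no shared intermediate results.
    blur :: Array U DIM2 Double -> IO (Array U DIM2 Double)
    blur img = do
        summed <- computeP (mapStencil2 BoundClamp sten img)
                    :: IO (Array U DIM2 Double)
        computeP (R.map (/ 9) summed)   -- normalise the box sum
      where
        sten = [stencil2| 1 1 1
                          1 1 1
                          1 1 1 |]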
>
> On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko" <aeyakovenko at gmail.com> wrote:
>>
>> I am not really focused on matrix multiply specifically.  The real
>> problem is that the implementation using the parallelized functions is
>> slower than the sequential one, and adding more threads only brings it
>> barely up to the sequential one's speed.
>>
>> So why would I ever use the parallelized versions?
>>
>>
>> On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald
>> <carter.schonwald at gmail.com> wrote:
>> > This paper, http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
>> > (among many others from the BLIS project), articulates some of the ideas I
>> > allude to pretty well (with pictures!).
>> >
>> > On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald
>> > <carter.schonwald at gmail.com> wrote:
>> >>
>> >> Dense matrix product is not an algorithm that makes sense in repa's
>> >> execution model: in a square matrix multiply of two N x N matrices,
>> >> each result entry depends on 2N values in total across the two input
>> >> matrices.
>> >> Even then, that's actually the wrong way to parallelize dense matrix
>> >> product! It's worth reading the papers about GotoBLAS and the more
>> >> recent BLIS project: a high-performance dense matrix multiply winds up
>> >> needing some nested array parallelism with mutable updates in order to
>> >> share subcomputations efficiently.
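
To make the "2N values per entry" point concrete, this is roughly what the
naive formulation looks like in repa (a hand-rolled sketch; mmultNaive is a
made-up name and not necessarily what the gist benchmarks):

    import Data.Array.Repa as R

    -- Every one of the N*N output entries walks a whole row of `a` and a
    -- whole column of `b` (2N reads), so both inputs get re-read N times
    -- from memory.  Parallelising over output entries does nothing about
    -- that traffic; blocked GotoBLAS/BLIS-style kernels instead reuse each
    -- loaded panel many times from cache.
    mmultNaive :: Monad m
               => Array U DIM2 Double -> Array U DIM2 Double
               -> m (Array U DIM2 Double)
    mmultNaive a b = computeP (fromFunction (Z :. rows :. cols) dot)
      where
        (Z :. rows :. n)    = extent a
        (Z :. _    :. cols) = extent b
        dot (Z :. i :. j) =
            sum [ (a ! (Z :. i :. k)) * (b ! (Z :. k :. j)) | k <- [0 .. n - 1] ]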
>> >>
>> >>
>> >>
>> >> On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko
>> >> <aeyakovenko at gmail.com>
>> >> wrote:
>> >>>
>> >>> Do you think the backend would make any difference?  This seems like a
>> >>> runtime issue to me; how are the threads scheduled by the GHC runtime?
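
For what it's worth, the knobs on my side are just compile-time and RTS flags,
something like this (assuming a Main module in MatMul.hs):

    $ ghc -O2 -threaded -rtsopts -fllvm MatMul.hs
    $ ./MatMul +RTS -N4 -s

-fllvm selects the LLVM code generator, -threaded plus +RTS -N4 gives the run
four capabilities for repa's thread gang to be scheduled onto, and -s prints
an RTS summary (GC and productivity figures) at exit.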
>> >>>
>> >>> On Fri, Mar 13, 2015 at 4:58 PM, KC <kc1956 at gmail.com> wrote:
>> >>> > How is the LLVM backend?
>> >>> >
>> >>> > --
>> >>> > --
>> >>> >
>> >>> > Sent from an expensive device which will be obsolete in a few
>> >>> > months!
>> >>> > :D
>> >>> >
>> >>> > Casey
>> >>> >
>> >>> >
>> >>> > On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko"
>> >>> > <aeyakovenko at gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8
>> >>> >>
>> >>> >>
>> >>> >> So for the matrix multiply algorithm on my MacBook, the results I
>> >>> >> see with -N4 are basically no better than the sequential
>> >>> >> computation.  Any idea why?
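
The gist has the full code; the shape of the comparison is roughly the
following sketch, here written against mmultS / mmultP from repa-algorithms
with synthetic input matrices (not necessarily what the gist itself does):

    import Data.Array.Repa                   as R
    import Data.Array.Repa.Algorithms.Matrix (mmultS, mmultP)

    main :: IO ()
    main = do
        let n  = 512
            sh = Z :. n :. n :: DIM2
            a  = computeS (fromFunction sh (\(Z :. i :. j) -> fromIntegral (i + j)))
                   :: Array U DIM2 Double
            b  = computeS (fromFunction sh (\(Z :. i :. j) -> fromIntegral (i * j)))
                   :: Array U DIM2 Double
        -- sequential reference
        print (sumAllS (mmultS a b))
        -- parallel version; run with +RTS -N4 to give it four capabilities
        c <- mmultP a b
        print (sumAllS c)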
>> >>> >>
>> >>> >> Thanks,
>> >>> >> Anatoly
>> >>
>> >>
>> >

