Anatoly Yakovenko aeyakovenko at gmail.com
Mon Mar 16 12:11:35 UTC 2015

hmm, so i was wrong

https://gist.github.com/aeyakovenko/0af788390ee9d980c1d6

the first version performed the best, even when running with -N1
agains the sequential version.

On Sun, Mar 15, 2015 at 8:04 PM, Carter Schonwald
<carter.schonwald at gmail.com> wrote:
> you're getting on the right track! :)
>
> the idea you're reaching for is "parallel work depth".  Eg, if instead of
> foldl' (which has O(n) work depth), you had a "parallel" fold that kinda
> looks like a recursive split and then merge version of the fold operation,
> you'd have O(log n) work depth. (and that'd likely be faster!). But then
> you'd notice "below some threshold, its better to compute sequentially,
> because the overhead of parallization is too big".
>
> etc etc. (the point i'm trying to reach for is that effective
> parallelization requires a pretty rich understanding of your application /
> software / hardware cost model)
>
> likewise, REPA is really only going to shine on workloads that look
> "pointwise" or "flat", at least with the current iteration. Its probably a
> good idea to look at the various example codes that are available for repa
> and acccelerate, because you'll notice that the codes which are especially
> performant have that "flat" style of paralellims
>
>
> On Sun, Mar 15, 2015 at 7:16 PM, Anatoly Yakovenko <aeyakovenko at gmail.com>
> wrote:
>>
>> Ok, got it. I picked the wrong function to try to understand how Repa
>> parallelizes :)
>>
>> So whats a good heuristic for using the parallel versions vs
>> sequential for Repa?
>>
>> Do the internals try to parallelize every element?  or does it fuse
>> them into some small number of parallelized tasks?
>>
>> So just based from my observations
>>
>> f (Z :. r :. c) = r * c
>>
>> a <- computeP (fromFunction f)
>> a `deepSeqArray` sumAllP a
>>
>> should be faster then:
>>
>> let a = computeS \$ fromFunction f
>> a `deepSeqArray` sumAllP \$ a
>>
>> but probably slower then
>>
>> sumAllS \$ computeS \$ fromFunction f
>>
>> Since an intermediate array is not even computed.
>>
>> Thanks,
>> Anatoly
>>
>>
>> On Sun, Mar 15, 2015 at 1:41 PM, Carter Schonwald
>> <carter.schonwald at gmail.com> wrote:
>> > Read that paper I linked. Anything else I say will be a rehash of that
>> > paper. :)
>> >
>> > On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko" <aeyakovenko at gmail.com>
>> > wrote:
>> >>
>> >> Ok, so whats the difference between the sequence and parallel
>> >> versions? does the parallel one contain a thunk for every element in
>> >> the output?
>> >>
>> >> On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald
>> >> <carter.schonwald at gmail.com> wrote:
>> >> > You are benchmarking repa for exactly the pessimal workload that it
>> >> > is
>> >> > at.
>> >> >
>> >> > Repa is for point wise parallel and local convolution parallel
>> >> > programs.
>> >> > The way repa can express matrix multiplication is exactly the worst
>> >> > way
>> >> > to
>> >> > implement a parallel matrix mult.  Like, pretty pessimal wrt a memory
>> >> > traffic / communication complexity metric of performance.
>> >> >
>> >> > Benchmark something like image blur algorithms and repa will really
>> >> > shine.
>> >> >
>> >> > Right now your benchmark is the repa equivalent of noticing that
>> >> > random
>> >> > access on singly linked lists is slow :)
>> >> >
>> >> > On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko" <aeyakovenko at gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> I am not really focusing on matrix multiply specifically.  So the
>> >> >> real
>> >> >> problem is that the implementation using parallelized functions is
>> >> >> slower then the sequential one, and adding more threads makes it
>> >> >> barely as fast as the sequential one.
>> >> >>
>> >> >> So why would i ever use the parallelized versions?
>> >> >>
>> >> >>
>> >> >> On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald
>> >> >> <carter.schonwald at gmail.com> wrote:
>> >> >> > http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf this
>> >> >> > paper
>> >> >> > (among many others by the blis project) articulates some of the
>> >> >> > ideas
>> >> >> > i
>> >> >> > allude to pretty well (with pictures!)
>> >> >> >
>> >> >> > On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald
>> >> >> > <carter.schonwald at gmail.com> wrote:
>> >> >> >>
>> >> >> >> dense matrix product is not an algorithm that makes sense in
>> >> >> >> repa's
>> >> >> >> execution model,
>> >> >> >> in square matrix multiply of two N x N matrices, each result
>> >> >> >> entry
>> >> >> >> depends
>> >> >> >> on 2n values total across the  two input matrices.
>> >> >> >> even then, thats actually the wrong way to parallelize dense
>> >> >> >> matrix
>> >> >> >> product! its worth reading the papers about goto blas and the
>> >> >> >> more
>> >> >> >> recent
>> >> >> >> blis project. a high performance dense matrix multipy winds up
>> >> >> >> needing
>> >> >> >> to do
>> >> >> >> some nested array parallelism with mutable updates to have
>> >> >> >> efficient
>> >> >> >> sharing
>> >> >> >> of sub computations!
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko
>> >> >> >> <aeyakovenko at gmail.com>
>> >> >> >> wrote:
>> >> >> >>>
>> >> >> >>> you think the backed would make any difference?  this seems like
>> >> >> >>> a
>> >> >> >>> runtime issue to me, how are the threads scheduled by the ghc
>> >> >> >>> runtime?
>> >> >> >>>
>> >> >> >>> On Fri, Mar 13, 2015 at 4:58 PM, KC <kc1956 at gmail.com> wrote:
>> >> >> >>> > How is the LLVM?
>> >> >> >>> >
>> >> >> >>> > --
>> >> >> >>> > --
>> >> >> >>> >
>> >> >> >>> > Sent from an expensive device which will be obsolete in a few
>> >> >> >>> > months!
>> >> >> >>> > :D
>> >> >> >>> >
>> >> >> >>> > Casey
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> > On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko"
>> >> >> >>> > <aeyakovenko at gmail.com>
>> >> >> >>> > wrote:
>> >> >> >>> >>
>> >> >> >>> >> https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8
>> >> >> >>> >>
>> >> >> >>> >>
>> >> >> >>> >> so i am seeing basically results with N4 that are as good as
>> >> >> >>> >> using
>> >> >> >>> >> sequential computation on my macbook for the matrix multiply
>> >> >> >>> >> algorithm.  any idea why?
>> >> >> >>> >>
>> >> >> >>> >> Thanks,
>> >> >> >>> >> Anatoly
>> >> >> >>> >> _______________________________________________
>> >> >> >>> >> Haskell-Cafe mailing list