<div dir="ltr">Not sure what changed, but after rerunning it I get the expected results:<div><br></div><div><span><div>anatolys-MacBook:rbm anatolyy$  dist/build/proto/proto +RTS -N2</div><div>benchmarking P</div><div>time                 1.791 s    (1.443 s .. 2.304 s)</div><div>                     0.991 R²   (0.974 R² .. 1.000 R²)</div><div>mean                 1.803 s    (1.750 s .. 1.855 s)</div><div>std dev              90.06 ms   (0.0 s .. 90.90 ms)</div><div>variance introduced by outliers: 19% (moderately inflated)</div><div><br></div><div>benchmarking S</div><div>time                 3.225 s    (2.685 s .. 3.837 s)</div><div>                     0.996 R²   (0.985 R² .. 1.000 R²)</div><div>mean                 3.033 s    (2.857 s .. 3.142 s)</div><div>std dev              165.0 ms   (0.0 s .. 188.7 ms)</div><div>variance introduced by outliers: 19% (moderately inflated)</div><div><br></div><div>perf log written to dist/perf-mmult.html</div><div>anatolys-MacBook:rbm anatolyy$  dist/build/proto/proto +RTS -N4</div><div>benchmarking P</div><div>time                 1.851 s    (1.326 s .. 2.316 s)</div><div>                     0.990 R²   (0.964 R² .. 1.000 R²)</div><div>mean                 1.784 s    (1.693 s .. 1.901 s)</div><div>std dev              106.3 ms   (0.0 s .. 119.8 ms)</div><div>variance introduced by outliers: 19% (moderately inflated)</div><div><br></div><div>benchmarking S</div><div>time                 3.329 s    (3.041 s .. 3.944 s)</div><div>                     0.996 R²   (0.993 R² .. 1.000 R²)</div><div>mean                 3.173 s    (3.100 s .. 3.244 s)</div><div>std dev              119.6 ms   (0.0 s .. 121.9 ms)</div><div>variance introduced by outliers: 19% (moderately inflated)</div><div><br></div><div>perf log written to dist/perf-mmult.html</div><div>anatolys-MacBook:rbm anatolyy$  dist/build/proto/proto +RTS -N</div><div>benchmarking P</div><div>time                 1.717 s    (1.654 s .. 
1.830 s)</div><div>                     0.999 R²   (0.999 R² .. 1.000 R²)</div><div>mean                 1.717 s    (1.701 s .. 1.728 s)</div><div>std dev              16.64 ms   (0.0 s .. 19.20 ms)</div><div>variance introduced by outliers: 19% (moderately inflated)</div><div><br></div><div>benchmarking S</div><div>time                 3.127 s    (3.079 s .. 3.222 s)</div><div>                     1.000 R²   (1.000 R² .. 1.000 R²)</div><div>mean                 3.105 s    (3.094 s .. 3.116 s)</div><div>std dev              18.12 ms   (543.9 as .. 18.50 ms)</div><div>variance introduced by outliers: 19% (moderately inflated)</div><div><br></div><div>perf log written to dist/perf-mmult.html</div><div><br></div></span><br></div></div><br><div class="gmail_quote"><div dir="ltr">On Thu, Jan 14, 2016 at 11:22 AM Thomas Miedema <<a href="mailto:thomasmiedema@gmail.com">thomasmiedema@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">To avoid any confusion, this was a reply to the following email:</div><div dir="ltr"><br><br><div class="gmail_quote">On Fri, Mar 13, 2015 at 6:23 PM, Anatoly Yakovenko <span dir="ltr"><<a href="mailto:aeyakovenko@gmail.com" target="_blank">aeyakovenko@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><a href="https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8" rel="noreferrer" target="_blank">https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8</a><br><br><br>so I am basically seeing results with -N4 that are only as good as using<br>sequential computation on my MacBook for the matrix multiply<br>algorithm.  
Any idea why?<br><br>Thanks,<br>Anatoly<br></blockquote></div></div><div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jan 14, 2016 at 8:19 PM, Thomas Miedema <span dir="ltr"><<a href="mailto:thomasmiedema@gmail.com" target="_blank">thomasmiedema@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Anatoly: I also ran your benchmark, and cannot reproduce your findings.</div><div><br></div><div>Note that GHC does not make effective use of hyperthreads (<a href="https://ghc.haskell.org/trac/ghc/ticket/9221#comment:12" target="_blank">https://ghc.haskell.org/trac/ghc/ticket/9221#comment:12</a>). So don't use -N4 when you have only a dual-core machine. Maybe that's why you were getting bad results? I also notice a `NaN` in one of your timing results. I don't know how that is possible, or whether it affected your results. Could you try running your benchmark again, but this time with -N2?</div><div class="gmail_extra"><br><div class="gmail_quote"><span>On Sat, Mar 14, 2015 at 5:21 PM, Carter Schonwald <span dir="ltr"><<a href="mailto:carter.schonwald@gmail.com" target="_blank">carter.schonwald@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">dense matrix product is not an algorithm that makes sense in repa's execution model, </div></blockquote><div><br></div></span><div>Matrix multiplication is the first example in the first repa paper: <a href="http://benl.ouroborus.net/papers/repa/repa-icfp2010.pdf" target="_blank">http://benl.ouroborus.net/papers/repa/repa-icfp2010.pdf</a>. 
Look at figures 2 and 7.<br></div><div><br></div><div>    "we measured very good absolute speedup, ×7.2 for 8 cores, on multicore hardware"</div><div><br></div><div>Doing a quick experiment with 2 threads (my laptop doesn't have more cores):</div><div><br></div><div>$ cabal install repa-examples    # I did not bother with `-fllvm`</div><div>...</div><div><br></div><div>$ ~/.cabal/bin/repa-mmult -random 1024 1024 -random 1024 1204</div><div>elapsedTimeMS   = 6491<br></div><div><div><br></div><div>$ ~/.cabal/bin/repa-mmult -random 1024 1024 -random 1024 1204 +RTS -N2</div></div><div>elapsedTimeMS   = 3393<br></div><div><br></div><div>This is with GHC 7.10.3 and repa-3.4.0.1 (and dependencies from <a href="http://www.stackage.org/snapshot/lts-3.22" target="_blank">http://www.stackage.org/snapshot/lts-3.22</a>)</div><div><br></div><div><br></div></div></div></div>

</blockquote></div><br></div></div></blockquote></div>