<div dir="ltr"><div dir="ltr" class="gmail_msg"><div class="gmail_msg"><div class="gmail_msg">> On Tue, Dec 6, 2016 at 2:44 AM Ben Gamari <<a href="mailto:ben@smart-cactus.org">ben@smart-cactus.org</a>> wrote:</div><div class="gmail_msg">> Michal Terepeta <<a href="mailto:michal.terepeta@gmail.com">michal.terepeta@gmail.com</a>> writes:</div><div class="gmail_msg">> </div><div class="gmail_msg">> [...]</div><div class="gmail_msg">>></div><div class="gmail_msg">>> Looking at the comments on the proposal from Moritz, most people would</div><div class="gmail_msg">>> prefer to</div><div class="gmail_msg">>> extend/improve nofib or `tests/perf/compiler` tests. So I guess the main</div><div class="gmail_msg">>> question is - what would be better:</div><div class="gmail_msg">>> - Extending nofib with modules that are compile only (i.e., not</div><div class="gmail_msg">>>   runnable) and focus on stressing the compiler?</div><div class="gmail_msg">>> - Extending `tests/perf/compiler` with ability to run all the tests and do</div><div class="gmail_msg">>>   easy "before and after" comparisons?</div><div class="gmail_msg">>></div><div class="gmail_msg">>I don't have a strong opinion on which of these would be better.</div><div class="gmail_msg">>However, I would point out that currently the tests/perf/compiler tests</div><div class="gmail_msg">>are extremely labor-intensive to maintain while doing relatively little</div><div class="gmail_msg">>to catch performance regressions. There are a few issues here:</div><div class="gmail_msg">></div><div class="gmail_msg">> * some tests aren't very reproducible between runs, meaning that</div><div class="gmail_msg">>   contributors sometimes don't catch regressions in their local</div><div class="gmail_msg">>   validations</div><div class="gmail_msg">> * many tests aren't very reproducible between platforms and all tests</div><div class="gmail_msg">>   are inconsistent between differing word sizes. 
This means that we end</div><div class="gmail_msg">>   up having many sets of expected performance numbers in the testsuite.</div><div class="gmail_msg">>   In practice nearly all of these except 64-bit Linux are out-of-date.</div><div class="gmail_msg">> * our window-based acceptance criterion for performance metrics doesn't</div><div class="gmail_msg">>   catch most regressions, which typically bump allocations by a couple</div><div class="gmail_msg">>   percent or less (whereas the acceptance thresholds range from 5% to</div><div class="gmail_msg">>   20%). This means that the testsuite fails to catch many deltas, only</div><div class="gmail_msg">>   failing when some unlucky person finally pushes the number over the</div><div class="gmail_msg">>   threshold.</div><div class="gmail_msg">> </div><div class="gmail_msg">> Joachim and I discussed this issue a few months ago at Hac Phi; he had</div><div class="gmail_msg">> an interesting approach to tracking expected performance numbers which</div><div class="gmail_msg">> may both alleviate these issues and reduce the maintenance burden that</div><div class="gmail_msg">> the tests pose. I wrote down some terse notes in #12758.</div><div class="gmail_msg"><br></div><div class="gmail_msg">Thanks for mentioning the ticket!</div><div class="gmail_msg"><br></div><div class="gmail_msg">To be honest, I'm not a huge fan of having performance tests being treated the</div><div class="gmail_msg">same as any other tests. IMHO they are quite different:</div><div class="gmail_msg"><br></div><div class="gmail_msg">- They usually need a quiet environment (e.g., cannot run two different tests at</div><div class="gmail_msg">  the same time). 
But with ordinary correctness tests, I can run as many as I</div><div class="gmail_msg">  want concurrently.</div><div class="gmail_msg"><br></div><div class="gmail_msg">- The output is not really binary (correct vs incorrect) but some kind of</div><div class="gmail_msg">  number (or collection of numbers) that we want to track over time.</div><div class="gmail_msg"><br></div><div class="gmail_msg">- The decision whether to fail is harder. Since the output might be noisy, you</div><div class="gmail_msg">  either need quite relaxed bounds (and miss small regressions) or have to</div><div class="gmail_msg">  enforce tighter bounds (and suffer from the flakiness and maintenance</div><div class="gmail_msg">  overhead).</div><div class="gmail_msg"><br></div><div class="gmail_msg">So for the purpose of:</div><div class="gmail_msg">  "I have a small change and want to check its effect on compiler performance</div><div class="gmail_msg">  and expect, e.g., ~1% difference"</div><div class="gmail_msg">the model of running benchmarks separately from tests is much nicer. I can run</div><div class="gmail_msg">them when I'm not doing anything else on the computer and then easily compare</div><div class="gmail_msg">the results (that's what I usually do for nofib). For tracking the performance</div><div class="gmail_msg">over time, one could set something up to run the benchmarks when idle. 
(isn't</div><div class="gmail_msg">that what <a href="http://perf.haskell.org">perf.haskell.org</a> is doing?)</div><div class="gmail_msg"><br></div><div class="gmail_msg">Because of that, if we want to extend tests/perf/compiler to support this use case,</div><div class="gmail_msg">I think we should include benchmarks there that are *not* tests (and are not</div><div class="gmail_msg">included in ./validate), but with an easy tool to run all of them and give</div><div class="gmail_msg">you a quick comparison of what's changed.</div><div class="gmail_msg"><br></div><div class="gmail_msg">To a certain degree this would then be orthogonal to the improvements suggested</div><div class="gmail_msg">in the ticket. But we could probably reuse some things (e.g., dumping .csv files</div><div class="gmail_msg">for perf metrics?)</div><div class="gmail_msg"><br></div><div class="gmail_msg">How should we proceed? Should I open a new ticket focused on this? (Maybe we</div><div class="gmail_msg">could try to figure out all the details there?)</div><div class="gmail_msg"><br></div><div class="gmail_msg">Thanks,</div><div class="gmail_msg">Michal</div><div><br></div></div></div></div>