Measuring performance of GHC

Michal Terepeta michal.terepeta at
Tue Dec 6 19:27:13 UTC 2016

> On Tue, Dec 6, 2016 at 2:44 AM Ben Gamari <ben at> wrote:
> Michal Terepeta <michal.terepeta at> writes:
> [...]
>> Looking at the comments on the proposal from Moritz, most people would
>> prefer to
>> extend/improve nofib or `tests/perf/compiler` tests. So I guess the main
>> question is - what would be better:
>> - Extending nofib with modules that are compile only (i.e., not
>>   runnable) and focus on stressing the compiler?
>> - Extending `tests/perf/compiler` with ability to run all the tests and
>>   easy "before and after" comparisons?
>I don't have a strong opinion on which of these would be better.
>However, I would point out that currently the tests/perf/compiler tests
>are extremely labor-intensive to maintain while doing relatively little
>to catch performance regressions. There are a few issues here:
> * some tests aren't very reproducible between runs, meaning that
>   contributors sometimes don't catch regressions in their local
>   validations
> * many tests aren't very reproducible between platforms and all tests
>   are inconsistent between differing word sizes. This means that we end
>   up having many sets of expected performance numbers in the testsuite.
>   In practice nearly all of these except 64-bit Linux are out-of-date.
> * our window-based acceptance criterion for performance metrics doesn't
>   catch most regressions, which typically bump allocations by a couple
>   percent or less (whereas the acceptance thresholds range from 5% to
>   20%). This means that the testsuite fails to catch many deltas, only
>   failing when some unlucky person finally pushes the number over the
>   threshold.
> Joachim and I discussed this issue a few months ago at Hac Phi; he had
> an interesting approach to tracking expected performance numbers which
> may both alleviate these issues and reduce the maintenance burden that
> the tests pose. I wrote down some terse notes in #12758.

Thanks for mentioning the ticket!

To be honest, I'm not a huge fan of having performance tests treated the
same as any other tests. IMHO they are quite different:

- They usually need a quiet environment (e.g., you cannot run two different
  tests at the same time). But with ordinary correctness tests, I can run as
  many as I want concurrently.

- The output is not really binary (correct vs incorrect) but some kind of a
  number (or collection of numbers) that we want to track over time.

- The decision whether to fail is harder. Since the output might be noisy, you
  need either quite relaxed bounds (and miss small regressions) or stricter
  bounds (and suffer from flakiness and spurious failures).
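The trade-off in that last bullet can be made concrete. A minimal sketch (all numbers invented for illustration) of a fixed acceptance window shows how repeated small regressions slip through until one unlucky patch crosses the threshold:

```python
# Sketch: why a fixed acceptance window misses small regressions.
# All numbers are hypothetical, for illustration only.

def within_window(expected, measured, tolerance):
    """Accept if the measured value is within +/- tolerance of expected."""
    return abs(measured - expected) / expected <= tolerance

expected_allocs = 1_000_000_000  # baseline allocation count (made up)

# A 2% regression slips through a 5% window...
assert within_window(expected_allocs, 1_020_000_000, 0.05)
# ...and so does another 2% on top of it, as long as the expected
# number is never updated.  Only the "unlucky" third patch fails:
assert within_window(expected_allocs, 1_040_000_000, 0.05)
assert not within_window(expected_allocs, 1_061_000_000, 0.05)
```

This is the same accumulation problem Ben describes for the current 5%-20% thresholds.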

So for the purpose of:
  "I have a small change and want to check its effect on the compiler,
  and expect, e.g., ~1% difference"
the model of running benchmarks separately from tests is much nicer. I can run
them when I'm not doing anything else on the computer and then easily compare
the results. (That's what I usually do for nofib.) For tracking the numbers
over time, one could set something up to run the benchmarks when idle. (Maybe
something like that is already being done?)
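As a sketch of what "easily comparing the results" means when a ~1% effect is expected: the observed delta has to clearly exceed the run-to-run noise, which is why a quiet machine and repeated runs matter. The numbers below are invented, and the three-sigma rule of thumb is my own assumption, not an established GHC policy:

```python
# Sketch: deciding whether a ~1% difference is real, given run-to-run noise.
# Rule of thumb (an assumption, not GHC policy): the observed delta should
# clearly exceed the spread of repeated measurements.  Numbers are invented.
from statistics import mean, stdev

before = [10.12, 10.08, 10.15, 10.10, 10.11]  # e.g. compile times in seconds
after  = [10.25, 10.21, 10.28, 10.22, 10.24]

delta = mean(after) - mean(before)
noise = max(stdev(before), stdev(after))

# Here the shift is several times larger than the noise, so it is
# plausibly a real regression rather than measurement jitter.
assert delta > 3 * noise
```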

Due to that, if we want to extend tests/perf/compiler to support this use
case, I think we should include there benchmarks that are *not* tests (and are
not included in ./validate), but provide some easy tool to run all of them and
give you a quick comparison of what's changed.
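The "easy tool" could be quite small. A minimal sketch, assuming a hypothetical CSV dump of one metric per benchmark (the file format and benchmark names here are made up, not an existing GHC format):

```python
# Sketch of a "quick comparison" tool: read two CSV dumps of perf metrics
# (format and benchmark names are hypothetical) and print the relative
# change per benchmark.
import csv
import io

def load_metrics(text):
    """Parse 'benchmark,allocations' CSV rows into a dict."""
    return {row["benchmark"]: int(row["allocations"])
            for row in csv.DictReader(io.StringIO(text))}

before = load_metrics("benchmark,allocations\nT1234,1000000\nT5678,2000000\n")
after  = load_metrics("benchmark,allocations\nT1234,1010000\nT5678,1900000\n")

for name in sorted(before):
    change = (after[name] - before[name]) / before[name] * 100
    print(f"{name}: {change:+.1f}%")
```

In a real tool the two inputs would come from files produced before and after a change, which is also why dumping .csv from the testsuite (as mentioned below) would be useful.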

To a certain degree this would then be orthogonal to the improvements
discussed in the ticket. But we could probably reuse some things (e.g.,
dumping .csv files for perf metrics?).

How should we proceed? Should I open a new ticket focused on this? (Maybe we
could try to figure out all the details there?)


More information about the ghc-devs mailing list