Measuring performance of GHC

Ben Gamari ben at smart-cactus.org
Tue Dec 6 21:09:52 UTC 2016


Michal Terepeta <michal.terepeta at gmail.com> writes:

> On Tue, Dec 6, 2016 at 2:44 AM Ben Gamari <ben at smart-cactus.org> wrote:
>>
>> I don't have a strong opinion on which of these would be better.
>> However, I would point out that currently the tests/perf/compiler tests
>> are extremely labor-intensive to maintain while doing relatively little
>> to catch performance regressions. There are a few issues here:
>>
>> * some tests aren't very reproducible between runs, meaning that
>>   contributors sometimes don't catch regressions in their local
>>   validations
>> * many tests aren't very reproducible between platforms, and all tests
>>   are inconsistent across differing word sizes. This means that we end
>>   up having many sets of expected performance numbers in the testsuite;
>>   in practice, nearly all of them except the 64-bit Linux numbers are
>>   out of date.
>> * our window-based acceptance criterion for performance metrics doesn't
>>   catch most regressions, which typically bump allocations by a couple
>>   percent or less (whereas the acceptance thresholds range from 5% to
>>   20%). This means that the testsuite fails to catch many deltas, only
>>   failing when some unlucky person finally pushes the number over the
>>   threshold.
>>
>> Joachim and I discussed this issue a few months ago at Hac Phi; he had
>> an interesting approach to tracking expected performance numbers which
>> may both alleviate these issues and reduce the maintenance burden that
>> the tests pose. I wrote down some terse notes in #12758.
>
> Thanks for mentioning the ticket!
>
Sure!

> To be honest, I'm not a huge fan of treating performance tests the same
> as any other tests. IMHO they are quite different:
>
> - They usually need a quiet environment (e.g., you cannot run two
>   different tests at the same time), whereas with ordinary correctness
>   tests I can run as many as I want concurrently.
>
This is absolutely true; if I had a nickel for every time I saw the
testsuite fail, only to pass upon re-running, I would be able to fund a
great deal of GHC development ;)

> - The output is not really binary (correct vs. incorrect) but some kind
>   of number (or a collection of numbers) that we want to track over time.
>
Yes, and this is more or less the idea that the ticket is supposed to
capture: we would track performance numbers in git notes in the GHC
repository and have Harbormaster (or some other stable test environment)
maintain them. Exact metrics would be recorded for every commit, and we
could warn during validate if something changes suspiciously (e.g. look
at the mean and variance of the metric over the past N commits and
squawk if the commit bumps the metric by more than some number of
sigmas).
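
To make that concrete, the check itself could be as simple as the
following sketch (the window size and sigma threshold are made-up
numbers, and none of this is an existing interface; the history would
come from the git notes maintained by Harbormaster):

    # Hypothetical sketch of the "squawk if it looks suspicious" check.
    # `history` is the metric's value for each of the past N commits.
    import statistics

    def is_suspicious(history, new_value, sigmas=3.0):
        """Flag new_value if it deviates from the recent history by
        more than `sigmas` standard deviations."""
        if len(history) < 2:
            return False  # not enough data to estimate the noise
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            return new_value != mean
        return abs(new_value - mean) > sigmas * stdev

    # e.g. bytes allocated by some perf test over the last few commits:
    history = [64208112, 64310496, 64189770, 64250004]
    print(is_suspicious(history, 66900000))  # True: a ~4% jump stands out
    print(is_suspicious(history, 64300000))  # False: within normal noise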

This sort of scheme could be implemented in either the testsuite or
nofib. It's not clear that one is better than the other (although we
would want to teach the testsuite driver to run performance tests
serially).

> - Deciding whether to fail is harder. Since the output might be noisy,
>   you either need quite relaxed bounds (and miss small regressions) or
>   stricter bounds (and suffer from the flakiness and maintenance
>   overhead).
>
Yep. That is right.
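
To make the trade-off concrete, the current window-based criterion in
tests/perf/compiler effectively boils down to something like the sketch
below (a simplification, not the actual testsuite driver code; the
numbers are illustrative), which is why a 1-2% regression sails through
a 10% window:

    # Simplified model of the window-based acceptance criterion
    # (not the real driver code; the numbers are illustrative).

    def within_window(expected, actual, tolerance_pct):
        """Accept `actual` if it is within +/- tolerance_pct of `expected`."""
        deviation_pct = abs(actual - expected) / expected * 100.0
        return deviation_pct <= tolerance_pct

    expected = 100000000  # expected bytes allocated, say
    print(within_window(expected, 102000000, 10))  # True: 2% regression slips through
    print(within_window(expected, 111000000, 10))  # False: the unlucky 11% commit fails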

> So for the purpose of:
>   "I have a small change and want to check its effect on compiler
>   performance and expect, e.g., ~1% difference"
> the model of running benchmarks separately from tests is much nicer. I
> can run them when I'm not doing anything else on the computer and then
> easily compare the results. (That's what I usually do with nofib.) For
> tracking performance over time, one could set something up to run the
> benchmarks when the machine is idle. (Isn't that what perf.haskell.org
> is doing?)
>
> Because of that, if we want to extend tests/perf/compiler to support
> this use case, I think we should include benchmarks there that are *not*
> tests (and are not run by ./validate), together with some easy tool to
> run all of them and get a quick comparison of what's changed.
>
When you put it like this, it does sound like nofib is the natural
choice here.

> To a certain degree this would then be orthogonal to the improvements
> suggested in the ticket. But we could probably reuse some things
> (e.g., dumping .csv files for perf metrics?).
>
Indeed.
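
For example (purely a sketch; the test,metric,value layout below is an
assumption, not anything either tool produces today), both the
testsuite and nofib could dump their metrics as CSV, and a small script
could print the relative change between two runs:

    # Hypothetical sketch: compare two CSV dumps of perf metrics and
    # print the relative change per (test, metric).
    import csv

    def read_metrics(path):
        with open(path, newline='') as f:
            return {(row['test'], row['metric']): float(row['value'])
                    for row in csv.DictReader(f)}

    def compare(before_path, after_path):
        before = read_metrics(before_path)
        after = read_metrics(after_path)
        for key in sorted(before.keys() & after.keys()):
            old, new = before[key], after[key]
            delta = (new - old) / old * 100.0
            print("%-40s %+6.2f%%" % ("/".join(key), delta))

    # e.g. compare('baseline.csv', 'patched.csv')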

> How should we proceed? Should I open a new ticket focused on this?
> (maybe we could try to figure out all the details there?)
>
That sounds good to me.

Cheers,

- Ben