ben at smart-cactus.org
Thu Mar 18 19:50:48 UTC 2021
Simon Peyton Jones via ghc-devs <ghc-devs at haskell.org> writes:
> > We need to do something about this, and I'd advocate for just not making stats fail with marge.
> Generally I agree. One point you don’t mention is that our perf tests
> (which CI forces us to look at assiduously) are often pretty weird
> cases. So there is at least a danger that these more exotic cases will
> stand in the way of (say) a perf improvement in the typical case.
> But “not making stats fail” is a bit crude. Instead how about
To be clear, the proposal isn't to accept stats failures for merge request
validation jobs. I believe Moritz was merely suggesting that we accept
such failures in marge-bot validations (that is, the pre-merge
validation done on batches of merge requests).
In my opinion this is reasonable since we know that none of the MRs in
the batch regresses individually. While it's possible that
interactions between two or more MRs result in a qualitative change in
performance, it seems quite unlikely. What is far *more* likely (and
what we see regularly) is that the cumulative effect of a batch of
improving patches pushes the batch's overall stat change out of the
acceptance threshold. This is quite annoying as it dooms the entire
batch.
For this reason, I think we should at the very least accept stat
improvements during Marge validations (as you suggest). I agree that we
probably want a batch to fail if two patches accumulate to form a
regression, even if the two passed CI individually.
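To make the arithmetic concrete, here is a minimal sketch of that
policy in Haskell (purely illustrative: the names, the window value,
and the multiplicative composition are invented for this example, not
our actual CI logic):

-- Illustrative sketch only; names and thresholds are made up.

-- | Relative change of one metric for one MR: -0.04 means 4% better.
type RelChange = Double

-- | Per-test acceptance window as a fraction, e.g. +/-5%.
acceptWindow :: Double
acceptWindow = 0.05

-- | Should a Marge batch pass for a given metric?  Changes compose
-- multiplicatively, so three 4% improvements give 0.96^3 ~ 0.885,
-- an 11.5% swing that trips a naive symmetric window even though
-- every MR passed individually.  Under this policy cumulative
-- improvements always pass; only a cumulative regression beyond
-- the window fails.
batchPasses :: [RelChange] -> Bool
batchPasses changes = cumulative <= acceptWindow
  where
    cumulative = product (map (1 +) changes) - 1

main :: IO ()
main = do
  print (batchPasses [-0.04, -0.04, -0.04])  -- True: improvement, accepted
  print (batchPasses [0.03, 0.03])           -- False: accumulated regression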
> * We already have per-benchmark windows. If the stat falls outside
> the window, we fail. You are effectively saying “widen all windows
> to infinity”. If something makes a stat 10 times worse, I think we
> *should* fail. But 10% worse? Maybe we should accept and look later
> as you suggest. So I’d argue for widening the windows rather than
> disabling them completely.
Yes, I agree.
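Concretely, a widened-window scheme might look something like the
following sketch (the thresholds and names are hypothetical, not the
testsuite's real values): a soft window that merely records a warning,
plus a hard ceiling that still fails outright.

data StatVerdict = StatPass | StatWarn | StatFail
  deriving (Show, Eq)

-- | Classify a relative stat change against a soft window (warn and
-- record for later inspection) and a hard ceiling (fail outright).
classifyStat
  :: Double       -- ^ soft window, e.g. 0.10 for "10% worse"
  -> Double       -- ^ hard ceiling, e.g. 9.0 for "10x worse"
  -> Double       -- ^ observed relative change; +0.02 means 2% worse
  -> StatVerdict
classifyStat soft hard delta
  | delta > hard = StatFail   -- catastrophic regression: still fail CI
  | delta > soft = StatWarn   -- note it, but do not block the merge
  | otherwise    = StatPass

main :: IO ()
main = mapM_ (print . classifyStat 0.10 9.0) [0.02, 0.5, 12.0]
-- prints StatPass, StatWarn, StatFail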
> * If we did that we’d need good instrumentation to spot steps and
> drift in perf, as you say. An advantage is that since the perf
> instrumentation runs only on committed master patches, not on every
> CI, it can cost more. In particular, it could run a bunch of
> “typical” tests, including nofib and compiling Cabal or other
> libraries.
We already have the beginnings of such instrumentation.
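For illustration, a step detector over per-commit measurements on
master could look roughly like this sketch (purely hypothetical, not
what our tooling actually does): flag a commit when its metric
deviates from the mean of the previous k measurements by more than a
tolerance.

-- | Indices of suspected step changes: points whose value deviates
-- from the mean of the preceding k measurements by more than tol.
stepChanges
  :: Int        -- ^ trailing window size k, e.g. 10 commits
  -> Double     -- ^ tolerance as a fraction, e.g. 0.02 for 2%
  -> [Double]   -- ^ one measurement per master commit, oldest first
  -> [Int]
stepChanges k tol xs =
  [ i
  | (i, x) <- zip [k ..] (drop k xs)
  , let window = take k (drop (i - k) xs)
        mean   = sum window / fromIntegral k
  , abs (x - mean) / mean > tol
  ]

main :: IO ()
main = print (stepChanges 3 0.02 [100, 101, 100, 100, 110, 110, 111, 110])
-- [4,5,6]: the jump at index 4, plus its aftermath while the window
-- catches up; a real tool would report only the start of the run.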
> The big danger is that by relieving patch authors from worrying about
> perf drift, it’ll end up in the lap of the GHC HQ team. If it’s hard
> for the author of a single patch (with which she is intimately
> familiar) to work out why it’s making some test 2% worse, imagine how
> hard, and demotivating, it’d be for Ben to wonder why 50 patches (with
> which he is unfamiliar) are making some test 5% worse.
Yes, I absolutely agree with this. I would very much like to avoid
having to do this sort of post-hoc investigation any more than
necessary.