On CI

Wed Mar 17 09:26:16 UTC 2021

We need to do something about this, and I'd advocate for just not making stats fail with marge.

Generally I agree.   One point you don’t mention is that our perf tests (which CI forces us to look at assiduously) are often pretty weird cases.  So there is at least a danger that these more exotic cases will stand in the way of (say) a perf improvement in the typical case.

But “not making stats fail” is a bit crude.   Instead how about

  *   Always accept stat improvements

  *   We already have per-benchmark windows.  If the stat falls outside the window, we fail.  You are effectively saying “widen all windows to infinity”.  If something makes a stat 10 times worse, I think we *should* fail.  But 10% worse?  Maybe we should accept and look later as you suggest.   So I’d argue for widening the windows rather than disabling them completely.

  *   If we did that we’d need good instrumentation to spot steps and drift in perf, as you say.  An advantage is that since the perf instrumentation runs only on committed master patches, not on every CI, it can cost more.  In particular , it could run a bunch of “typical” tests, including nofib and compiling Cabal or other libraries.

The big danger is that by relieving patch authors from worrying about perf drift, it’ll end up in the lap of the GHC HQ team.  If it’s hard for the author of a single patch (with which she is intimately familiar) to work out why it’s making some test 2% worse, imagine how hard, and demotivating, it’d be for Ben to wonder why 50 patches (with which he is unfamiliar) are making some test 5% worse.

I’m not sure how to address this problem.   At least we should make it clear that patch authors are expected to engage *actively* in a conversation about why their patch is making something worse, even after it lands.

Simon

From: ghc-devs <ghc-devs-bounces at haskell.org> On Behalf Of Moritz Angermann
Sent: 17 March 2021 03:00
To: ghc-devs <ghc-devs at haskell.org>
Subject: On CI

Hi there!

Just a quick update on our CI situation. Ben, John, Davean and I have been
discussion on CI yesterday, and what we can do about it, as well as some
minor notes on why we are frustrated with it. This is an open invitation to anyone who in earnest wants to work on CI. Please come forward and help!
We'd be glad to have more people involved!

First the good news, over the last few weeks we've seen we *can* improve
CI performance quite substantially. And the goal is now to have MR go through
CI within at most 3hs.  There are some ideas on how to make this even faster,
especially on wide (high core count) machines; however that will take a bit more
time.

Now to the more thorny issue: Stat failures.  We do not want GHC to regress,
and I believe everyone is on board with that mission.  Yet we have just witnessed a train of marge trials all fail due to a -2% regression in a few tests. Thus we've been blocking getting stuff into master for at least another day. This is (in my opinion) not acceptable! We just had five days of nothing working because master was broken and subsequently all CI pipelines kept failing. We have thus effectively wasted a week. While we can mitigate the latter part by enforcing marge for all merges to master (and with faster pipeline turnaround times this might be more palatable than with 9-12h turnaround times -- when you need to get something done! ha!), but that won't help us with issues where marge can't find a set of buildable MRs, because she just keeps hitting a combination of MRs that somehow together increase or decrease metrics.

We have three knobs to adjust:
- Make GHC build faster / make the testsuite run faster.
  There is some rather interesting work going on about parallelizing (earlier)
  during builds. We've also seen that we've wasted enormous amounts of
  time during darwin builds in the kernel, because of a bug in the testdriver.
- Use faster hardware.
  We've seen that just this can cut windows build times from 220min to 80min.
- Reduce the amount of builds.
  We used to build two pipelines for each marge merge, and if either of both
  (see below) failed, marge's merge would fail as well. So not only did we build
  twice as much as we needed, we also increased our chances to hit bogous
  build failures by 2.

We need to do something about this, and I'd advocate for just not making stats fail with marge. Build errors of course, but stat failures, no. And then have a separate dashboard (and Ben has some old code lying around for this, which someone would need to pick up and polish, ...), that tracks GHC's Performance for each commit to master, with easy access from the dashboard to the offending commit. We will also need to consider the implications of synthetic micro benchmarks, as opposed to say building Cabal or other packages, that reflect more real-world experience of users using GHC.

I will try to provide a data driven report on GHC's CI on a bi-weekly or month (we will have to see what the costs for writing it up, and the usefulness is) going forward. And my sincere hope is that it will help us better understand our CI situation; instead of just having some vague complaints about it.

Cheers,
 Moritz
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20210317/d47f01fb/attachment.html>