Tracking intermittently failing CI jobs

Bryan Richter bryan at haskell.foundation
Tue Jul 12 11:03:29 UTC 2022


Hello again,

Thanks to everyone who pointed out spurious failures over the last few 
weeks. Here's the current state of affairs and some discussion on next 
steps.

*Dashboard*

I made a dashboard for tracking spurious failures:

https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2

I created this for three reasons:

 1. Keep tabs on new occurrences of spurious failures
 2. Understand which problems are causing the most issues
 3. Measure the effectiveness of any intervention

The dashboard still needs development, but it can already be used to 
show that the number of "Cannot connect to Docker daemon" failures has 
been reduced.

*Characterizing and Fixing Failures*

I have preliminary results on a few failure types. For instance, I used 
the "docker" type of failure to bootstrap the dashboard. Along with 
"Killed with signal 9", it seems to indicate a problem with the CI 
runner itself.
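
To make that concrete, here's a rough sketch of the sort of log-grepping 
involved. This is a hypothetical standalone script, not the process that 
actually feeds the dashboard; the project path, the signature strings, and 
the use of Python's requests package are illustrative assumptions. It asks 
the GitLab v4 API for recent failed jobs and tallies which known signatures 
appear in their traces:

    # Hypothetical sketch, NOT the actual dashboard pipeline. The project
    # path, signature strings, and pagination are illustrative assumptions;
    # a token may be needed if job traces aren't publicly readable.
    import collections
    import requests

    GITLAB = "https://gitlab.haskell.org/api/v4"
    PROJECT = "ghc%2Fghc"  # URL-encoded project path (assumed)

    # Log fragments that, so far, look like spurious (non-code) failures.
    SIGNATURES = {
        "docker": "Cannot connect to Docker daemon",
        "killed-signal-9": "Killed with signal 9",
    }

    def failed_jobs(per_page=50):
        """Fetch the most recent failed jobs for the project."""
        r = requests.get(f"{GITLAB}/projects/{PROJECT}/jobs",
                         params={"scope[]": "failed", "per_page": per_page})
        r.raise_for_status()
        return r.json()

    def classify(job_id):
        """Return the signature names found in one job's trace."""
        r = requests.get(f"{GITLAB}/projects/{PROJECT}/jobs/{job_id}/trace")
        r.raise_for_status()
        return [name for name, needle in SIGNATURES.items()
                if needle in r.text]

    if __name__ == "__main__":
        counts = collections.Counter()
        for job in failed_jobs():
            counts.update(classify(job["id"]))
        for name, n in counts.most_common():
            print(f"{name}: {n}")

Substring matching like this is crude but cheap, and it's roughly the level 
of detail needed to separate runner-level failures from genuine build or 
test failures.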

To look more deeply into these types of runner-system failures, *I will 
need more access*. If you are responsible for some runners and you're 
comfortable giving me shell access, you can find my public ssh key at 
https://gitlab.haskell.org/-/snippets/5546. (Posted as a snippet so at 
least you know the key comes from somebody who can access my GitLab 
account. Other secure means of communication are listed at 
https://keybase.io/chreekat.) Please send me a message if you do.

Besides runner problems, there are spurious failures that may have more 
to do with the CI code itself. They include some problem with 
environment variables and (probably) some issue with console buffering. 
Neither of these is being tracked on the dashboard yet. Many other 
problems have not been explored at all.


*Next Steps*

The theme for the next steps is finalizing the dashboard and 
characterizing more failures.

  * Track more failure types on the dashboard
  * Improve the process of backfilling failure data on the dashboard
  * Include more metadata (like project id!) on the dashboard so it's
    easier to zoom in on failures
  * Document the dashboard and the processes that populate it for posterity
  * Diagnose runner-system failures (if accessible)
  * Continue exploring other failure types
  * Fix failures omg!?

The list of next steps is currently heavy on finalizing the dashboard 
and light on fixing spurious failures. I know that might be frustrating. 
My justification is that CI is a complex hardware/software/human system 
under continuous operation where most of the low-hanging fruit has already 
been plucked. It's time to get serious. :) My goal is to make spurious 
failures surprising rather than commonplace. This is the best way I know 
to achieve that.

Thanks again for helping me with this goal. :)


-Bryan

P.S. If you're interested, I've been posting updates like this one on 
Discourse:

https://discourse.haskell.org/search?q=DevOps%20Weekly%20Log%20%23haskell-foundation%20order%3Alatest_topic


On 18/05/2022 13:25, Bryan wrote:
> Hi all,
>
> I'd like to get some data on weird CI failures. Before clicking 
> "retry" on a spurious failure, please paste the url for your job into 
> the spreadsheet you'll find linked at 
> https://gitlab.haskell.org/ghc/ghc/-/issues/21591.
>
> Sorry for the slight misdirection. I wanted the spreadsheet to be 
> world-writable, which means I don't want its url floating around in 
> too many places. Maybe you can bookmark it if CI is causing you too 
> much trouble. :)
>
> -Bryan
>
>