<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hello again,</p>
<p>Thanks to everyone who pointed out spurious failures over the
last few weeks. Here's the current state of affairs and some
discussion on next steps.<br>
</p>
<p><b><br>
</b></p>
<p><b>Dashboard</b></p>
<p>I made a dashboard for tracking spurious failures:</p>
<p><a moz-do-not-send="true"
href="https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2"
class="moz-txt-link-freetext">https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2</a></p>
<p>I created this for three reasons:<br>
</p>
<ol>
<li>Keep tabs on new occurrences of spurious failures</li>
<li>Understand which problems are causing the most issues</li>
<li>Measure the effectiveness of any intervention</li>
</ol>
<p>The dashboard still needs development, but it can already be used
to show that the number of "Cannot connect to Docker daemon"
failures has been reduced. <br>
</p>
<p><b><br>
</b></p>
<p><b>Characterizing and Fixing Failures</b></p>
<p>I have preliminary results on a few failure types. For instance,
I used the "docker" type of failure to bootstrap the dashboard.
Along with "Killed with signal 9", it seems to indicate a problem
with the CI runner itself.</p>
<p>To look more deeply into these types of runner-system failures, <b>I
will need more access</b>. If you are responsible for some
runners and you're comfortable giving me shell access, you can
find my public ssh key at <a moz-do-not-send="true"
href="https://gitlab.haskell.org/-/snippets/5546"
class="moz-txt-link-freetext">https://gitlab.haskell.org/-/snippets/5546</a>.
(Posted as a snippet so at least you know the key comes from
somebody who can access my GitLab account. Other secure means of
communication are listed at <a moz-do-not-send="true"
href="https://keybase.io/chreekat" class="moz-txt-link-freetext">https://keybase.io/chreekat</a>.)
Send me a message if you do so.<br>
</p>
<p>Besides runner problems, there are spurious failures that may have
more to do with the CI code itself. They include some problem with
environment variables and (probably) some issue with console
buffering. Neither of these is being tracked on the dashboard yet.
Many other problems are yet to be explored at all.</p>
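<p>As a rough illustration only (this is not the code behind the
dashboard), the two runner-related failure types mentioned above could
be recognized by searching job logs for their characteristic messages.
The labels, the exact patterns, and the function names below are all
hypothetical, and the sketch assumes the log text has already been
fetched as a string.</p>
<pre>
```python
import re
from collections import Counter

# Hypothetical labels and patterns; the real dashboard may match differently.
# The optional "the " is an assumption about how the message is worded in logs.
FAILURE_PATTERNS = {
    "docker": re.compile(r"Cannot connect to (the )?Docker daemon", re.IGNORECASE),
    "signal_9": re.compile(r"Killed with signal 9"),
}


def classify_log(log_text):
    """Return the labels of every known failure type found in one job log."""
    return [
        label
        for label, pattern in FAILURE_PATTERNS.items()
        if pattern.search(log_text)
    ]


def count_failures(logs):
    """Tally how many of the given logs contain each known failure type."""
    counts = Counter()
    for log_text in logs:
        counts.update(classify_log(log_text))
    return counts
```
</pre>
<p>A log that matches none of the patterns gets an empty list, which is
one way to spot failures that still need to be characterized.</p>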
<p><br>
</p>
<p><b>Next Steps</b></p>
<p>The theme for the next steps is finalizing the dashboard and
characterizing more failures.<br>
</p>
<ul>
<li>Track more failure types on the dashboard</li>
<li>Improve the process of backfilling failure data on the
dashboard</li>
<li>Include more metadata (like project id!) on the dashboard so
it's easier to zoom in on failures</li>
<li>Document the dashboard and the processes that populate it for
posterity<br>
</li>
<li>Diagnose runner-system failures (if accessible)<br>
</li>
<li>Continue exploring other failure types</li>
<li>Fix failures omg!?</li>
</ul>
<p>The list of next steps is currently heavy on finalizing the
dashboard and light on fixing spurious failures. I know that might
be frustrating. My justification is that CI is a complex
hardware/software/human system under continuous operation where
most of the low-hanging fruit has already been plucked. It's time to
get serious. :) My goal is to make spurious failures surprising
rather than commonplace. This is the best way I know to achieve
that.</p>
<p>Thanks again for helping me with this goal. :)</p>
<p><br>
</p>
<p>-Bryan</p>
<p>P.S. If you're interested, I've been posting updates like this
one on Discourse:</p>
<p><a moz-do-not-send="true"
href="https://discourse.haskell.org/search?q=DevOps%20Weekly%20Log%20%23haskell-foundation%20order%3Alatest_topic"
class="moz-txt-link-freetext">
https://discourse.haskell.org/search?q=DevOps%20Weekly%20Log%20%23haskell-foundation%20order%3Alatest_topic</a></p>
<p><br>
</p>
<div class="moz-cite-prefix">On 18/05/2022 13:25, Bryan wrote:<br>
</div>
<blockquote type="cite"
cite="mid:czzKVKXJAL4F1htJPSgwSH5ISqY8AtP8bFDHwoyYEwJx5k5AYviwVUE7uzymbikuf4w03da1wAHG6O0MS6EXqCt9GQtWeKR4If4hD8WhmxY=@chreekat.net">
<div style="font-family: arial; font-size: 14px;">Hi all,</div>
<div style="font-family: arial; font-size: 14px;"><br>
</div>
<div style="font-family: arial; font-size: 14px;">I'd like to get
some data on weird CI failures. Before clicking "retry" on a
spurious failure, please paste the url for your job into the
spreadsheet you'll find linked at <span><a target="_blank"
rel="noreferrer nofollow noopener"
href="https://gitlab.haskell.org/ghc/ghc/-/issues/21591"
moz-do-not-send="true" class="moz-txt-link-freetext">https://gitlab.haskell.org/ghc/ghc/-/issues/21591</a>.</span><br>
</div>
<div style="font-family: arial; font-size: 14px;"><br>
</div>
<div style="font-family: arial; font-size: 14px;">Sorry for the
slight misdirection. I wanted the spreadsheet to be
world-writable, which means I don't want its url floating around
in too many places. Maybe you can bookmark it if CI is causing
you too much trouble. :)<br>
</div>
<div style="font-family: arial; font-size: 14px;"><br>
</div>
<div style="font-family: arial; font-size: 14px;">-Bryan<br>
</div>
</blockquote>
</body>
</html>