[GHC DevOps Group] DevOps: Next steps

Boespflug, Mathieu m at tweag.io
Wed Oct 11 12:09:49 UTC 2017


I assume Ben meant to keep the list in CC on this one. Will reply shortly.
--
Mathieu Boespflug
Founder at http://tweag.io.


On 10 October 2017 at 17:23, Ben Gamari <ben at well-typed.com> wrote:
> "Boespflug, Mathieu" <m at tweag.io> writes:
>
>> Hi Ben,
>>
>> many thanks for your detailed and thoughtful reply. I won't myself
>> address your points one by one (I expect Manuel will jump in), but I
>> do want to ground the discussion with the following remarks:
>
> Oh no! Sorry for the incredibly belated response, Mathieu. I somehow
> overlooked this message.
>
>> * What are the requirements that the current Jenkins effort is building
>> towards? I seem to remember some page on the GHC wiki stating these
>> and then comparing various alternatives, but I can't find it now, so
>> maybe I dreamed it. The blog post [1] mentions alternatives but
>> doesn't evaluate them, nor does it state the requirements.
>
> The requirements are briefly listed in #13716.
>
>
>> * A key requirement, I think, is not just that this kind of
>> infrastructure should take little time to set up given scarce
>> development resources, but more importantly that none of the
>> maintenance be bottlenecked on a single person managing a custom
>> fleet of machines whose state cannot be reproduced.
>
> Of course, it goes without saying that the state of the builders should
> certainly be reproducible.
>
>> * Better yet if anyone who forks GHC (with a single click on GitHub)
>> gets a local copy of the CI by the same token, which can then be
>> modified at will.
>>
>> * If we can get very quick wins today for at least 3 of the 4 "Tier 1"
>> platforms, that's already a step forward and we can work on the rest
>> later, just like Rust has (see below).
>>
> My thought here is that a solution that doesn't allow code to be
> tested on real target hardware isn't particularly fit for testing a
> compiler. Qemu is neither fast nor bug-free; the GCC project uses qemu
> for its nightly builds and has been forced to resign itself to
> ignoring entire classes of failures attributable only to qemu bugs.
> This is something that I would like to avoid for our primary
> CI system if possible.
>
>> I'll copy here an experience report [2] from the Rust infra authors
>> from before they switched to a Travis CI backed solution:
>>
>>> * Our buildbot-based CI / release infrastructure cannot be maintained
>>> by community members, is generally bottlenecked on Alex and myself.
>>
>> Sounds like this applies equally to the current Harbormaster setup.
>> Perhaps to the Jenkins-based one also?
>>
> In the future I imagine that the devops group will also have some
> administrative authority over the CI infrastructure. But currently this
> is quite true: our CI infrastructure is very much bottlenecked on me and
> indeed can at times suffer as a consequence.
>
>>> * Our buildbot configuration has reliability issues, particularly around
>>> managing dynamic EC2 instances.
>>
>> Sounds familiar. Is any OS X automated testing happening at this
>> point? I heard some time before ICFP that one or both of the OS X
>> build bots had fallen off the edge of the Internet.
>>
> To clarify: the OS X builder (we have only one) has only been down for a
> single weekend in the roughly two years that we have been using it; the
> outage was due to scheduled network maintenance at the facility that
> housed it. It just so happens that this was the weekend before ICFP.
>
>>> * Our nightly builds sometimes fail for reasons not caught during CI and
>>> are down for multiple days.
>>
>> This matches my experience when adding CircleCI support: the tip of
>> the master branch at the time had failing tests.
>>
> Indeed, this is a real problem and something which I have been hoping to
> solve with our CI reboot. Currently we test individual differentials via
> Harbormaster and I do local integration testing when I merge them.
> However, this does not mean that things won't break on other platforms
> after merge.
>
> Ideally we would do pre-merge integration testing in all of our CI
> environments before a given commit becomes `master`. This is the sort of
> thing that Jenkins will solve.
>
>>> * Packaging Rust for distribution is overly complex, involving
>>> many systems and source repositories.
>>
>> Yup. But admittedly this is an orthogonal issue.
>>
>>> * The beta and stable branches do not run the test suite today.
>>> With the volume of beta backports each release receives this is
>>> a freightening situation.
>>
>> I assume this is not the case for us. But where would I look to
>> find a declarative description of what's going on for each branch?
>> Can each branch define its own way to perform CI?
>>
> All CI is currently performed via a single set of Harbormaster build
> plans, regardless of branch. See [1]. Indeed, the user can't easily
> change this configuration, although this will change with Jenkins,
> where the pipeline configuration lives in the repository.
>
>
> [1] https://phabricator.haskell.org/harbormaster/plan/
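>
> To give a concrete sense of what that looks like: a Jenkinsfile is
> just a pipeline description checked into the tree. A minimal sketch
> (purely illustrative; the agent label and build commands below are
> made up for the example, while the actual pipeline is the Jenkinsfile
> on the wip/jenkins branch) would be along these lines:
>
>     pipeline {
>         // Hypothetical label for a Linux/amd64 builder.
>         agent { label 'linux && amd64' }
>         stages {
>             stage('Build') {
>                 steps {
>                     sh './boot && ./configure'
>                     sh 'make -j$(nproc)'
>                 }
>             }
>             stage('Test') {
>                 steps {
>                     // Runs the testsuite via the make-based build.
>                     sh 'make test'
>                 }
>             }
>         }
>     }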
>
>>> * As certain core Rust tools mature we want to deliver them as part of
>>> the Rust distribution, and this is difficult to do within the
>>> current infrastructure / build system design. Distributing
>>> additional tools with Rust is particularly crucial for those
>>> intimately tied to compiler internals, like the RLS and clippy.
>>
>> Also a familiar situation, though again an orthogonal issue.
>>
>> So it sounds like at this crossroads we've been seeing a lot of the
>> same things the Rust team has experienced. The precedent they've
>> established here is pretty strong. If we want to address the very
>> same problems then we need:
>>
>> 1. Reproducible cloud instances that are created/destroyed on-demand,
>> and whose state doesn't drift over time. That way, no problems with
>> build bots that eventually disappear.
>>
> Indeed; but CircleCI/Travis are not the only solutions that enable
> this sort of reproducibility. The same thing can be achieved with
> Jenkins as well.
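>
> For example, a pipeline can run every build inside a throwaway Docker
> container, so that builder state is recreated from a pinned image on
> each run rather than accumulating on a long-lived machine. A sketch
> along the lines of the one above (the image name and validate
> invocation are illustrative):
>
>     pipeline {
>         // Hypothetical image name; any image we publish and pin would do.
>         agent { docker { image 'ghc-ci/build-env:2017-10' } }
>         stages {
>             stage('Validate') {
>                 steps {
>                     sh './validate --fast'
>                 }
>             }
>         }
>     }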
>
>> 2. A declarative description of the *entire infrastructure and test
>> environment*, for each target platform, so that it can be replicated
>> by anyone who wants to do so, in a single command. That way we're not
>> blocked on any single person to make changes to it.
>>
> Yes, Jenkins also provides this [2].
>
>> I believe in reusing existing managed CI solutions. But let's
>> discuss. Just know that we'd be happy to contribute towards any paid
>> subscription necessary, so that shouldn't be a barrier.
>>
> That is good to know; however, I think we should first make sure
> that the contributions we have will amount to what is needed to make
> this idea fly before taking the plunge.
>
>
> To be clear, I only grudgingly find myself advocating for Jenkins; it is
> in many ways terrible to work with. Furthermore, I'll be the first to
> admit that the administration that it requires does carry a very real
> cost. However, I think we should be careful to distinguish the
> accidental complexity imposed by Jenkins from the intrinsic complexity
> of testing a large project like GHC. For better or worse, much of the
> effort that has gone into setting up Jenkins thus far hasn't actually
> been Jenkins-specific; rather it's been adapting GHC to be amenable to
> the sort of end-to-end testing that we want and fixing bugs when I find
> them.
>
> I fear that in moving to a hosted solution in place of our own
> infrastructure we incur a different set of no-less-significant costs:
>
>  * we fragment our testing infrastructure, since now we need at least
>    CircleCI and AppVeyor
>
>  * we preclude proper testing of non-Linux/amd64 environments
>
>  * as a substitute for proper bare-metal testing of these platforms we
>    instead have to write, administer, and pay for the inefficiency of
>    emulation-based testing
>
>  * we lose the ability to prioritize jobs to use our resources
>    effectively (e.g. prioritizing patch validation over commit
>    validation)
>
> As with most things in life, this is a trade-off. I'm still quite
> undecided whether it's a worthwhile trade-off, but at the moment I
> remain a bit skeptical. However, as I said earlier, if we can
> demonstrate that it is possible to test non-Linux platforms reliably and
> efficiently, then that certainly helps convince me.
>
> Cheers,
>
> - Ben
>
>
> [2] https://github.com/bgamari/ghc/blob/wip/jenkins/Jenkinsfile

