[GHC DevOps Group] DevOps: Next steps

Wed Oct 11 12:15:48 UTC 2017

Hi Ben,

On 10 October 2017 at 17:23, Ben Gamari <ben at well-typed.com> wrote:
>
> [...]
>
>> * What are the requirements that the current Jenkins effort is building
>> towards? I seem to remember some page on the GHC wiki stating these
>> and then comparing various alternatives, but I can't find it now, so
>> maybe I dreamed it. The blog post [1] mentions alternatives but
>> doesn't evaluate them, nor does it state the requirements.
>
> The requirements are briefly listed in #13716.

Thanks for the pointer. Let's merge those with the list you provide in
another email. I think a few more regarding the following topics need
to be added to that list:

* infrastructure reproducibility (easy to reproduce build environments
and results)
* infrastructure forkability (easy for others to fork the infra, test
their changes and then submit a pull request)
* security (who has access, who can build etc)
* one you mention in your email: prioritization? (run tests for some
platforms first)
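
To make the prioritization item concrete, here is a minimal sketch in
Python of what "run some jobs first" could mean. It is not tied to
Jenkins, CircleCI or any other CI driver. The platform names, the
priority numbers and the JobQueue class are all made up for
illustration.

```python
import heapq
from itertools import count

# Hypothetical priority tables: a lower number runs first.
KIND_PRIORITY = {"patch": 0, "commit": 1}
PLATFORM_PRIORITY = {
    "linux-x86_64": 0,
    "windows-x86_64": 1,
    "macos-x86_64": 1,
    "freebsd-x86_64": 2,
}


class JobQueue:
    """Orders CI jobs by kind first, then by platform, then by arrival."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: keeps FIFO order among equal priorities.
        self._seq = count()

    def submit(self, platform, kind, ref):
        key = (
            KIND_PRIORITY[kind],
            PLATFORM_PRIORITY.get(platform, 99),  # unknown platforms go last
            next(self._seq),
        )
        heapq.heappush(self._heap, (key, platform, kind, ref))

    def next_job(self):
        _key, platform, kind, ref = heapq.heappop(self._heap)
        return platform, kind, ref

    def __len__(self):
        return len(self._heap)
```

With this ordering, a patch waiting for validation is always picked
before post-merge commit validation, and within each kind the
platforms listed first in the table are tested first.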

Most important requirement: low maintenance overhead.

>> * If we can get very quick wins today for at least 3 of the 4 "Tier 1"
>> platforms, that's already a step forward and we can work on the rest
>> later, just like Rust has (see below).
>>
> My thought here is that a solution that doesn't allow code to be
> tested on real target hardware isn't particularly fit to test a
> compiler. Qemu is neither fast nor bug-free; the GCC project uses qemu
> for their nightly builds and they have been forced to resign themselves
> to ignoring entire classes of failures which are only attributable to
> qemu bugs. This is something that I would like to avoid for our primary
> CI system if possible.

References would be appreciated. My thoughts:

- Emulation for non-x86 targets will be required anyway (unless you
have an iPhone lying around 24/7 as a build bot), if we are to include
testing on them as part of CI.
- The needs of GHC are not those of GCC (we are targeting far fewer
platforms, with a simpler NCG).
- Emulation hasn't prevented Rust (also a compiler) from being tested
successfully on far more platforms than we are going to be targeting
anytime soon. Either QEMU works just fine for them, or they're not
fiddling with the NCG on a daily basis (since they outsource that to
LLVM).

> Ideally we would do pre-merge integration testing in all of our CI
> environments before a given commit becomes `master`. This is the sort of
> thing that Jenkins will solve.

There is an important security issue here. If you build PRs from any
spontaneous contributor on the Internet (as you should), then you
should only do so in a sandboxed environment. But Jenkins does not
give you that out-of-the-box. Without any sandboxing it's not
reasonable to let users run arbitrary code on the CI server, i.e. the
very same server on which later that day or many months later, a
release binary distribution will be cut and sent out to thousands of
users to install...

It's possible to add other technologies into the mix to sandbox
each Jenkins build (we've done it, and it took us a fair amount of
time, and even then, not with the same security requirements). But by
then, you've reinvented half of TravisCI/CircleCI/Appveyor/etc.

Best to outsource this security aspect to providers that are *paid* by
thousands of companies to get it right, I think.

>>> * The beta and stable branches do not run the test suite today.
>>> With the volume of beta backports each release receives this is
>>> a frightening situation.
>>
>> I assume this is not the case for us. But it's unclear where I'd look
>> to find a declarative description of what's going on for each branch?
>> Can each branch define their own way to perform CI?
>>
> All CI is currently performed via a single set of Harbormaster
> build plans, regardless of branch. See [1]. Indeed the user can't easily
> change this configuration, although this changes in Jenkins where the
> pipeline configuration is in the repository.

Cool.

>> 1. Reproducible cloud instances that are created/destroyed on-demand,
>> and whose state doesn't drift over time. That way, no problems with
>> build bots that eventually disappear.
>>
> Indeed; but CircleCI/Travis are not the only solutions which enable this
> sort of reproducibility. This same sort of thing can be achieved in
> Jenkins as well.

True. Through mechanisms orthogonal to Jenkins. One can mitigate build
drone configuration drift and get some reproducibility using
configuration management tools (Ansible, SaltStack etc). Or via
Dockerfiles. Or via OS images. It's just more work.

> For better or worse much of the
> effort that has gone into setting up Jenkins thus far hasn't actually
> been Jenkins-specific; rather it's been adapting GHC to be amenable to
> the sort of end-to-end testing that we want and fixing bugs when I find
> them.

Great! That's as I expected: we ought to be able to reuse a lot of
existing work no matter the CI driver. :)

> As with most things in life, this is a trade-off. I'm still quite
> undecided whether it's a worthwhile trade-off, but at the moment I
> remain a bit skeptical. However, as I said earlier, if we can
> demonstrate that it is possible to test non-Linux platforms reliably and
> efficiently, then that certainly helps convince me.

Not that I think it's worth investing much time on this just yet (see
Manuel's earlier comment), but here's a screenshot of FreeBSD running
inside QEMU inside a Docker container on CircleCI:

https://imgur.com/a/3YRXs
--
Mathieu Boespflug
Founder at http://tweag.io.

On 11 October 2017 at 14:09, Boespflug, Mathieu <m at tweag.io> wrote:
> I assume Ben meant to keep the list in CC on this one. Will reply shortly.
> --
> Mathieu Boespflug
> Founder at http://tweag.io.
>
>
> On 10 October 2017 at 17:23, Ben Gamari <ben at well-typed.com> wrote:
>> "Boespflug, Mathieu" <m at tweag.io> writes:
>>
>>> Hi Ben,
>>>
>>> many thanks for your detailed and thoughtful reply. I won't myself
>>> address your points one by one (I expect Manuel will jump in), but I
>>> do want to ground the discussion with the following remarks:
>>
>> Oh no! Sorry for the incredibly belated response, Mathieu. I somehow
>> overlooked this message.
>>
>>> * What are the requirements that the current Jenkins effort is building
>>> towards? I seem to remember some page on the GHC wiki stating these
>>> and then comparing various alternatives, but I can't find it now, so
>>> maybe I dreamed it. The blog post [1] mentions alternatives but
>>> doesn't evaluate them, nor does it state the requirements.
>>
>> The requirements are briefly listed in #13716.
>>
>>
>>> * A key requirement I think is not just that this kind of
>>> infrastructure should not take time to setup given scarce development
>>> resources, but more importantly that none of the maintenance be
>>> bottlenecked on a single person managing a custom fleet of machines
>>> whose state cannot be reproduced.
>>
>> Of course, it goes without saying that the state of the builders should
>> certainly be reproducible.
>>
>>> * Better yet if anyone that forks GHC (with a single click on GitHub)
>>> gets a local copy of the CI by the same token, which can then be
>>> modified at will.
>>>
>>> * If we can get very quick wins today for at least 3 of the 4 "Tier 1"
>>> platforms, that's already a step forward and we can work on the rest
>>> later, just like Rust has (see below).
>>>
>> My thought here is that a solution that doesn't allow code to be
>> tested on real target hardware isn't particularly fit to test a
>> compiler. Qemu is neither fast nor bug-free; the GCC project uses qemu
>> for their nightly builds and they have been forced to resign themselves
>> to ignoring entire classes of failures which are only attributable to
>> qemu bugs. This is something that I would like to avoid for our primary
>> CI system if possible.
>>
>>> I'll copy here an experience report [2] from the Rust infra authors
>>> from before they switched to a Travis CI backed solution:
>>>
>>>> * Our buildbot-based CI / release infrastructure cannot be maintained
>>>> by community members, is generally bottlenecked on Alex and myself.
>>>
>>> Sounds like this applies equally to the current Harbormaster setup.
>>> Perhaps to the Jenkins based one also?
>>>
>> In the future I imagine that the devops group will also have some
>> administrative authority over the CI infrastructure. But currently this
>> is quite true: our CI infrastructure is very much bottlenecked on me and
>> indeed can at times suffer as a consequence.
>>
>>>> * Our buildbot configuration has reliability issues, particularly around
>>>> managing dynamic EC2 instances.
>>>
>>> Sounds familiar. Is any OS X automated testing happening at this
>>> point? I heard some time before ICFP that one or both of the OS X build
>>> bots had fallen off the edge of the Internet.
>>>
>> To clarify: the OS X builder (we have only one) has only been down for a
>> single weekend in the roughly two years that we have been using it; the
>> outage was due to scheduled network maintenance at the facility that
>> housed it. It just so happens that this was the weekend before ICFP.
>>
>>>> * Our nightly builds sometimes fail for reasons not caught during CI and
>>>> are down for multiple days.
>>>
>>> This matches my experience when adding CircleCI support: the tip of
>>> the master branch at the time had failing tests.
>>>
>> Indeed, this is a real problem and something which I have been hoping to
>> solve with our CI reboot. Currently we test individual differentials via
>> Harbormaster and I do local integration testing when I merge them.
>> However, this does not mean that things won't break on other platforms
>> after merge.
>>
>> Ideally we would do pre-merge integration testing in all of our CI
>> environments before a given commit becomes `master`. This is the sort of
>> thing that Jenkins will solve.
>>
>>>> * Packaging Rust for distribution is overly complex, involving
>>>> many systems and source repositories.
>>>
>>> Yup. But admittedly this is an orthogonal issue.
>>>
>>>> * The beta and stable branches do not run the test suite today.
>>>> With the volume of beta backports each release receives this is
>>>> a frightening situation.
>>>
>>> I assume this is not the case for us. But it's unclear where I'd look
>>> to find a declarative description of what's going on for each branch?
>>> Can each branch define their own way to perform CI?
>>>
>> All CI is currently performed via a single set of Harbormaster
>> build plans, regardless of branch. See [1]. Indeed the user can't easily
>> change this configuration, although this changes in Jenkins where the
>> pipeline configuration is in the repository.
>>
>>
>> [1] https://phabricator.haskell.org/harbormaster/plan/
>>
>>>> * As certain core Rust tools mature we want to deliver them as part of
>>>> the Rust distribution, and this is difficult to do within the
>>>> current infrastructure / build system design. Distributing
>>>> additional tools with Rust is particularly crucial for those
>>>> intimately tied to compiler internals, like the RLS and clippy.
>>>
>>> Also a familiar situation, though again an orthogonal issue.
>>>
>>> So it sounds like at this crossroads we've been seeing a lot of the
>>> same things the Rust team has experienced. The jurisprudence they've
>>> established here is pretty strong. If we want to address the very same
>>> problems then we need:
>>>
>>> 1. Reproducible cloud instances that are created/destroyed on-demand,
>>> and whose state doesn't drift over time. That way, no problems with
>>> build bots that eventually disappear.
>>>
>> Indeed; but CircleCI/Travis are not the only solutions which enable this
>> sort of reproducibility. This same sort of thing can be achieved in
>> Jenkins as well.
>>
>>> 2. A declarative description of the *entire infrastructure and test
>>> environment*, for each target platform, so that it can be replicated
>>> by anyone who wants to do so, in a single command. That way we're not
>>> blocked on any single person to make changes to it.
>>>
>> Yes, Jenkins also provides this [2].
>>
>>> I believe reusing existing managed CI solutions is the way to get
>>> there. But let's discuss.
>>> Just know that we'd be happy to contribute towards any paid
>>> subscription necessary. So that shouldn't be a barrier.
>>>
>> That is good to know; however I think we should first make sure that the
>> contributions that we have will amount to what is needed to make this
>> idea fly before taking the plunge.
>>
>>
>> To be clear, I only grudgingly find myself advocating for Jenkins; it is
>> in many ways terrible to work with. Furthermore, I'll be the first to
>> admit that the administration that it requires does carry a very real
>> cost. However, I think we should be careful to distinguish the
>> accidental complexity imposed by Jenkins from the intrinsic complexity
>> of testing a large project like GHC. For better or worse much of the
>> effort that has gone into setting up Jenkins thus far hasn't actually
>> been Jenkins-specific; rather it's been adapting GHC to be amenable to
>> the sort of end-to-end testing that we want and fixing bugs when I find
>> them.
>>
>> I fear that in moving to a hosted solution in place of our own
>> infrastructure we incur a different set of no-less-significant costs:
>>
>>  * we fragment our testing infrastructure since now we need at least
>>    CircleCI and Appveyor
>>
>>  * we preclude proper testing of non-Linux/amd64 environments
>>
>>  * as a substitute for proper bare-metal testing of these platforms we
>>    instead have to write, administer, and pay for the inefficiency of
>>    emulation-based testing
>>
>>  * we lose the ability to prioritize jobs to use our resources
>>    effectively (e.g. prioritizing patch validation over commit validation)
>>
>> As with most things in life, this is a trade-off. I'm still quite
>> undecided whether it's a worthwhile trade-off, but at the moment I
>> remain a bit skeptical. However, as I said earlier, if we can
>> demonstrate that it is possible to test non-Linux platforms reliably and
>> efficiently, then that certainly helps convince me.
>>
>> Cheers,
>>
>> - Ben
>>
>>
>> [2] https://github.com/bgamari/ghc/blob/wip/jenkins/Jenkinsfile