[GHC DevOps Group] DevOps: Next steps

Wed Oct 11 15:01:18 UTC 2017

"Boespflug, Mathieu" <m at tweag.io> writes:

> Hi Ben,
>
> On 10 October 2017 at 17:23, Ben Gamari <ben at well-typed.com> wrote:
>>
>> [...]
>>
>> The requirements are briefly listed in #13716.
>
> Thanks for the pointer. Let's merge those with the list you provide in
> another email. I think a few more regarding the following topics need
> to be added to that list:
>
> * infrastructure reproducibility (easy to reproduce build environments
> and results)
> * infrastructure forkability (easy for others to fork the infra, test
> the changes and then submit a pull request)
> * security (who has access, who can build etc)
> * one you mention in your email: prioritization? (run tests for some
> platforms first)
>
> Most important requirement: low maintenance overhead.

Yes, these all sound like perfectly reasonable goals.

>> My thought here is that a solution that doesn't allow code to be
>> tested on real target hardware isn't particularly fit to test a
>> compiler. Qemu is neither fast nor bug-free; the GCC project uses qemu
>> for their nightly builds and they have been forced to resign themselves
>> to ignoring entire classes of failures which are only attributable to
>> qemu bugs. This is something that I would like to avoid for our primary
>> CI system if possible.
>
> References would be appreciated. My thoughts:
>
I'm afraid I can't provide a public reference for the GCC experience;
however, I can say that the source is an ARM employee who works full
time on GCC.

Regardless, it's not hard to find infelicities in qemu's dynamic
translation layer, even in the quite "mature" x86 implementation. This
isn't surprising; faithfully emulating an entire CPU architecture,
memory model, and support peripherals is a quite nontrivial task. Just a
quick glance through the currently open tickets reveals

 * https://bugs.launchpad.net/qemu/+bug/645662
 * https://bugs.launchpad.net/qemu/+bug/1098729
 * https://bugs.launchpad.net/qemu/+bug/902413
 * https://bugs.launchpad.net/qemu/+bug/1226531

> - Emulation for non-x86 targets will be required anyways as it is
> (unless you have an iPhone lying around 24/7 as a build bot), if we
> are to include testing on them as part of CI.

There are a variety of people in the GHC community who have access to
such hardware. Furthermore, programs like the OSU OSL are actively
looking for open-source projects to support. Finally, if all else fails
this sort of hardware is easily procured via a variety of VPS providers.

> - The needs of GHC are not those of GCC (we are targeting far fewer
> platforms, with a simpler NCG).
> - Emulation hasn't prevented Rust (also a compiler) from being tested
> successfully on far more platforms than we are going to be
> targeting anytime soon. Either because possibly QEMU works just fine,
> or they're not fiddling with the NCG on a daily basis (since they
> outsource that to LLVM).
>
Well, just as GHC is not GCC, GHC is also not Rust. Rust has the
advantage of having a strong cross-compilation story, and a testsuite
which was designed to make this usage easy. GHC is behind Rust in both
of these areas.

Yesterday I discussed this with two core members of the Rust
infrastructure team, who explicitly said that (paraphrasing, albeit
closely, as I didn't ask permission to quote them at the time),

 * making CI under qemu fast is nontrivial; their testing strategy of
   running the compiler on the host and running only the tests
   themselves on the target is critical to making the approach scale

 * when issues occur, debugging them inside the emulator has proven to
   be quite difficult
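
To make the first point concrete, here is a minimal sketch of that
split, written by me rather than taken from Rust's tooling. It only
builds the two command lines: one that runs a cross-compiler natively
on the host, and one that runs just the resulting test binary under
qemu's user-mode emulator. The cross-compiler name
`aarch64-linux-gnu-ghc`, the helper names, and the file paths are all
assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def host_compile_cmd(cross_compiler: str, source: str, output: str) -> List[str]:
    """Command that runs the cross-compiler natively on the (fast) host."""
    return [cross_compiler, source, "-o", output]


def target_run_cmd(emulator: str, binary: str, args: Sequence[str] = ()) -> List[str]:
    """Command that runs only the compiled test under user-mode emulation."""
    return [emulator, binary, *args]


def plan_test(
    source: str,
    cross_compiler: str = "aarch64-linux-gnu-ghc",  # hypothetical cross-compiler name
    emulator: str = "qemu-aarch64",                 # qemu user-mode binary for AArch64
) -> Tuple[List[str], List[str]]:
    """Return (compile command, run command) for one test source file."""
    # Derive the output path by dropping the ".hs" suffix, if present.
    binary = source[:-3] if source.endswith(".hs") else source + ".out"
    return (
        host_compile_cmd(cross_compiler, source, binary),
        target_run_cmd(emulator, binary),
    )


if __name__ == "__main__":
    compile_cmd, run_cmd = plan_test("tests/T1234.hs")
    print("on host:  ", " ".join(compile_cmd))
    print("on target:", " ".join(run_cmd))
```

The point of the split is that the expensive step (the compiler itself)
never runs emulated; only the usually short-lived test binary does.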

>> Ideally we would do pre-merge integration testing in all of our CI
>> environments before a given commit becomes `master`. This is the sort of
>> thing that Jenkins will solve.
>
> There is an important security issue here. If you build PR's from any
> spontaneous contributor on the Internet (as you should), then you
> should only do so in a sandboxed environment. But Jenkins does not
> give you that out-of-the-box. Without any sandboxing it's not
> reasonable to let users run arbitrary code on the CI server, i.e. the
> very same server on which later that day or many months later, a
> release binary distribution will be cut and sent out to thousands of
> users to install...
>
Absolutely; this is indeed something I'm currently quite uncomfortable
with under the current Harbormaster scheme. You are quite right that
this problem becomes much thornier once we are building release
artifacts on these machines as well.

> It's possible to add in other technologies into the mix to sandbox
> each Jenkins build (we've done it, and it took us a fair amount of
> time, and even then, not with the same security requirements). But by
> then, you've reinvented half of TravisCI/CircleCI/Appveyor/etc.
>
> Best to outsource this security aspect to providers that are *paid* by
> thousands of companies to get it right, I think.
>
Indeed, this is a fair point. In order to keep complexity at bay, my
plan in the Jenkins infrastructure was to simply spin up new instances
for releases (using automation, of course). You are quite right that a
general solution to this problem is quite hard to get right and Jenkins
offers very little help in this area.

This is one area where hosted services win hands down.

[...]

>>> 1. Reproducible cloud instances that are created/destroyed on-demand,
>>> and whose state doesn't drift over time. That way, no problems with
>>> build bots that eventually disappear.
>>>
>> Indeed; but CircleCI/Travis are not the only solutions which enable this
>> sort of reproducibility. This same sort of thing can be achieved in
>> Jenkins as well.
>
> True. Through mechanisms orthogonal to Jenkins. One can mitigate build
> drone configuration drift and get some reproducibility using
> configuration management tools (Ansible, SaltStack etc). Or via
> Dockerfiles. Or via OS images. It's just more work.
>
Yes, it is indeed more work. However, I would argue it is the only sane
way to deploy Jenkins.

>> For better or worse much of the effort that has gone into setting up
>> Jenkins thus far hasn't actually been Jenkins-specific; rather it's
>> been adapting GHC to be amenable to the sort of end-to-end testing
>> that we want and fixing bugs when I find them.
>
> Great! That's as I expected: we ought to be able to reuse a lot of
> existing work no matter the CI driver. :)
>
Right, this work should be applicable regardless of which CI solution we
use.

>> As with most things in life, this is a trade-off. I'm still quite
>> undecided whether it's a worthwhile trade-off, but at the moment I
>> remain a bit skeptical. However, as I said earlier, if we can
>> demonstrate that it is possible to test non-Linux platforms reliably and
>> efficiently, then that certainly helps convince me.
>
> Not that I think it's worth investing much time on this just yet (see
> Manuel's earlier comment), but here's a screenshot of FreeBSD running
> inside QEMU inside a Docker container on CircleCI:
>
To be clear, I'm not claiming that it is impossible to run qemu inside
CircleCI. I'm simply worried that it will be prohibitively slow. It
would be nice to see some evidence that this is not the case before
committing to this path, but I can understand if getting a minimal
viable solution takes priority.

At this point I'm fairly close to agreeing with you that CircleCI is the
right path forward. My primary reservation continues to be non-Linux
platforms.

Cheers,

- Ben