[GHC DevOps Group] CI

Manuel M T Chakravarty manuel.chakravarty at tweag.io
Mon Oct 16 06:11:59 UTC 2017


> On 13.10.2017 at 00:18, Ben Gamari <ben at well-typed.com> wrote:
> Manuel M T Chakravarty <manuel.chakravarty at tweag.io> writes:
> 
>> As promised, I have taken a first cut at listing the requirements and
>> the pros and cons of the main contenders on a Trac page:
>> 
>>  https://ghc.haskell.org/trac/ghc/wiki/ContinuousIntegration
>> 
> I think this list is being a bit generous to the hosted option.
> 
> Other costs of this approach might include:
> 
> * Under this heterogeneous scheme we will have to maintain two or more
>   distinct CI systems, each requiring some degree of setup and
>   maintenance.

As Mathieu mentioned in an earlier post, most of the code is the same. It is essentially just the CI-specific config files that vary. Given how quickly Mathieu wrote the one for CircleCI, I doubt that this is much of an overhead. Anyway, I added a point about having to deal with two CI providers.
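To give a sense of how thin the provider-specific layer is: a minimal CircleCI 2.0 configuration for a GHC build might look roughly like the sketch below. The Docker image and build commands are illustrative only (this is not the configuration Mathieu actually wrote); everything outside the run steps is the CircleCI-specific part.

    # .circleci/config.yml -- illustrative sketch only
    version: 2
    jobs:
      build:
        docker:
          - image: haskell:8.2    # hypothetical image with a bootstrap GHC, happy, and alex
        steps:
          - checkout
          - run: ./boot && ./configure
          - run: make -j4
          - run: make test

The actual build logic stays in GHC's own boot/configure/make machinery, which a config file for any other CI provider would invoke in much the same way.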

> * Using qemu for building on/for a non-Linux/amd64 platforms requires a
>   non-negligible amount of additional complexity (see rust's CI
>   implementation [1])
> 
> * It's unclear whether testing GHC via qemu is even practical given
>   computational constraints.

This is part of the biggest disadvantage of hosted CI; it is covered by the first con listed for the hosted option.
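To make the added complexity concrete, a qemu-based cross job in such a configuration might look roughly like the sketch below. The Docker image, target triple, and test-wrapper variable are all hypothetical; the point is the extra moving parts and the slowdown from running the testsuite under emulation.

    # illustrative sketch only -- not a working configuration
    jobs:
      build-armv7:
        docker:
          - image: ghc-ci/armv7-cross    # hypothetical image with a cross GCC and qemu-user
        steps:
          - checkout
          - run: ./boot && ./configure --target=arm-linux-gnueabihf
          - run: make -j4
          - run: make test TEST_WRAPPER=qemu-arm    # hypothetical hook to run test binaries under emulation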

> * We lose the ability to prioritize jobs, requiring more hardware to
>   maintain similar build turnaround

I am not sure. Is that inherently so?

> * We are utterly dependent on our CI service(s) to behave well; for
>   instance, here are some examples that the Rust infrastructure team
>   related to me,
> 
>     * They have been struggling to keep the tail of their Travis build
>       turnaround time distribution in check, with some builds taking
>       over 8 hours to complete. They have raised the issue with Travis
>       customer support, but are still having trouble despite being a
>       paying customer.
> 
>     * They have noticed that Travis has a tendency to simply drop builds
>       in mid-flight, losing hours of work. Again, despite working with
>       upstream they haven't been able to resolve the problem.
> 
>     * They have been strongly affected by apparent instability in
>       Travis' OS X infrastructure which goes down, to quote, "*a lot*"
> 
>   Of course, all of these are picking on Travis in particular, as that
>   is the example we have available. However, in general the message
>   here is that by giving up our own infrastructure we are at the mercy
>   of the services that we use. Unfortunately, sometimes those services
>   are not accustomed to testing projects of the scale of GHC or rustc.
>   At this point you have little recourse but to minimize the damage.

I think the issues with large, long-running jobs are why Mathieu proposed CircleCI over Travis. But you are right, of course: if we outsource work, we need to trust the people we outsource to to do a good job.

On the other hand, I assume that CircleCI has a response team that jumps in when bad things happen. In contrast, I don't think we want to hand you a pager so that we can notify you when some urgent maintenance is needed in the middle of the night.

> We avoid all of this by self-hosting (at, of course, the expense of
> administration time). Furthermore, we continue to benefit from hardware
> provided by a multitude of sources including users, Rackspace (and other
> VPS providers if we wanted), and programs like OSU OSL. It is important
> to remember that until recently we were operating under the assumption
> that these were the only resources available to us for testing.
> 
> It's still quite unclear to me what a CircleCI/Appveyor solution will
> ultimately cost, but it will almost certainly not be free. Assuming there
> are users who are willing to foot that bill, this is of course fine.
> However, it's quite contrary to the assumptions we have been working
> with for much of this process.

Yes, you are right. Having sponsors for the CI costs changes the situation with respect to the previous planning. And it is precisely one of the reasons why we founded the GHC DevOps Group: to unlock new resources.

I am sorry that this comes in the middle of the existing effort; I can see how that is annoying. However, all the work on getting GHC's build into shape and on the scripts to generate artefacts is still needed.

> Lastly: If I understand the point correctly, the "the set up is not
> forkable" "con" of Jenkins is not accurate. Under Jenkins the build
> configuration resides in the repository being tested. A user can easily
> modify it and submit a PR, which will be tested just like any other
> change.

That is not what I mean by forkable, because this still requires the user to use the central infrastructure. Forkable here means that they can run CI on, e.g., their own CircleCI account. That makes things more scalable, as the user doesn't count towards our limits and doesn't put stress on our infrastructure (including cluttering things with PRs that are not really meant for integration yet, but just for testing).

A user can even experiment with varying the CI set up on their own without involving us.

With Jenkins that is much harder, because they need to recreate the CI infrastructure.

> [1] https://github.com/rust-lang/rust/tree/master/src/ci
> 
> 
>> Maybe I am biased, but is there any advantage to Jenkins other than
>> that we can run builds and tests on exotic platforms?
> 
> Some of these "exotic" platforms might also be called "the most populous
> architecture in the world” (ARM),

You keep mentioning ARM. I don’t understand. We can run Android and iOS CI on CircleCI. (Any other OS on ARM, I would categorise as exotic, though.)
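For example, CircleCI's hosted macOS executors should, in principle, let us build and smoke-test an iOS cross-compiler with a job along the following lines. The Xcode version and target triple are illustrative only; I have not tried this.

    # illustrative sketch only
    jobs:
      build-ios-cross:
        macos:
          xcode: "9.0"
        steps:
          - checkout
          - run: ./boot && ./configure --target=aarch64-apple-ios    # hypothetical target triple
          - run: make -j4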

> "the operating system that feeds a
> third of the world's Internet traffic (FreeBSD), and "the operating
> system that powers much of the world's financial system" (AIX). I'm not
> sure that the ”exotic" label really does these platforms justice.

AFAIK virtually nobody runs GHC on those; i.e., with respect to this specific discussion, these are exotic platforms.

> More importantly, all of these platforms have contributors working on
> their support in GHC. Historically, GHC HQ has tried to recognize their
> efforts by allowing porters to submit binary distributions which are
> distributed alongside GHC HQ distributions. Recently I have tried to
> pursue a different model, handling some of these binary builds myself in
> the name of consistency and reduced release overhead (as previously we
> incurred a full round-trip through binary build contributors every time
> we released).

It is nice to support all contributors, but I think we shouldn't do it at the expense of the main platforms. I think we all agree that we need properly working CI and fully automatic release builds. Putting those in place as quickly and with as little effort as possible ought to be our main goal, IMHO.

> The desire to scale our release process up to handle the breadth of
> platforms that GHC supports, with either Tier 1 or what is currently
> Tier 2 support, was one motivation for the new CI effort. While I don't
> consider testing any one of these platforms to be a primary goal, I do
> think it is important to have a viable plan by which they might be
> covered in the future for this reason.
> 
> 
> To be clear, I am supportive of the CI-as-a-service direction. However,
> I want to recognize the trade-offs where they exist and have answers to
> some of the thorny questions, including those surrounding platform
> support, before committing.

We absolutely want to make a rational choice on the basis of all the facts. However, I strongly think that some considerations have to carry more weight than others. What I have learnt about Jenkins security alone, and the amount of *your* time that a Jenkins setup appears to cost, gives me pause. I will happily incur more complexity for building and testing exotic platforms in exchange for avoiding that.

Cheers,
Manuel


