[GHC DevOps Group] CI

Mon Oct 16 08:58:40 UTC 2017

Hi Manuel,

On 16 October 2017 at 08:11, Manuel M T Chakravarty
<manuel.chakravarty at tweag.io> wrote:
>
> [...]
>
>> * We lose the ability to prioritize jobs, requiring more hardware to
>>   maintain similar build turnaround
>
> I am not sure. Is that inherently so?

It is, to some extent. That said, we do need to think carefully about
what our requirements really are and why.

CircleCI has a notion of "workflow". This is extremely powerful. But
what it means in this context is that you can always run, e.g., the
Linux 64-bit validate job before the 32-bit one, and decide to run the
latter only if the former succeeds. Currently macOS has not been
integrated into this workflow support (it's a new feature), so macOS
builds will trigger in parallel to Linux builds. We'll want to do
Windows builds on Appveyor I think. Those won't be part of any
CircleCI workflow, not without some hacking.
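
To make the gating idea concrete, here is a minimal sketch in Python of
"run the 32-bit job only if the 64-bit one succeeds". This is not
CircleCI's configuration format or API. The run_workflow helper and the
job names are hypothetical, made up purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical job type: a callable that returns True on success.
Job = Callable[[], bool]


def run_workflow(jobs: List[Tuple[str, Job]]) -> Dict[str, str]:
    """Run jobs in order; skip every job after the first failure."""
    results: Dict[str, str] = {}
    failed = False
    for name, job in jobs:
        if failed:
            # An earlier job failed, so do not spend a build slot on this one.
            results[name] = "skipped"
            continue
        ok = job()
        results[name] = "passed" if ok else "failed"
        failed = not ok
    return results


# Illustrative only: the 64-bit validate fails, so the 32-bit job never runs.
example = run_workflow([
    ("validate-linux-x86_64", lambda: False),
    ("validate-linux-i386", lambda: True),
])
print(example)
```

Jobs that live outside the workflow (macOS today, Windows on Appveyor)
would not appear in such a chain at all and would start independently.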

Does this matter? Well, why have prioritization in the first place? To
avoid tying up multiple build resources if the build is very likely to
fail anyways, and therefore run the build on just one instance first
to save on resource usage? Early feedback to the user? The user will
get to know the build failed as soon as it fails on any platform, so I
don't think prioritization of one particular platform helps.

So that leaves us with resource usage minimization. macOS has its own
quota of parallel builds, separate from Linux. And separate from
Windows as well. The exact quota depends on our plan (free plans allow
2-4 parallel builds, paid plans allow for more).

We could actually try to play games to keep failing builds from
clogging the build queue. But this is yet another case of: just throw
more money at it to get more slots and keep it simple. That way, no
prioritization necessary. KISS saves human time, hence saves money
overall.

>> * We are utterly dependent on our CI service(s) to behave well; for
>>   instance, here are two examples that the Rust infrastructure team
>>   related to me,
>>
>>     * They have been struggling to keep the tail of their Travis
>>       build turnaround time distribution in check, with some builds
>>       taking over 8 hours to complete. They are still having trouble
>>       after raising the issue with Travis customer support, despite
>>       being a paying customer.
>>
>>     * They have noticed that Travis has a tendency to simply drop builds
>>       in mid-flight, losing hours of work. Again, despite working with
>>       upstream they haven't been able to resolve the problem.
>>
>>     * They have been strongly affected by apparent instability in
>>       Travis' OS X infrastructure which goes down, to quote, "*a lot*"
>>
>>   Of course, all of these are picking on Travis in particular as that
>>   is the example we have available. However, in general the message
>>   here is that by giving up our own infrastructure we are at the mercy
>>   of the services that we use. Unfortunately, sometimes those services
>>   are not accustomed to testing projects of the scale of GHC or rustc.
>>   At this point you have little recourse but to minimize the damage.
>
> I think the issues with large, long-running jobs are why Mathieu proposed CircleCI over Travis. But you are right, of course: if we outsource work, we need to trust the people we outsource to to do a good job.
>
> On the other hand, I assume that CircleCI has a response team that jumps in when bad things happen. In contrast, I don’t think we want to hand you a pager so we can notify you if some urgent maintenance is needed in the middle of the night.

FWIW, we've had projects with 200+ builds a month (per project) on
CircleCI for some time without these kinds of issues. Our experience
with Travis CI isn't as extensive as the Rust team's. But I get the
impression Travis prioritizes non-paying open source projects lower
than projects with paid plans.

For me the main reason for CircleCI is the availability of faster
build hardware (and yes, lower queuing times). But Travis CI should
work just fine too (let's work out the math). We won't be at Rust's
scale any time soon (3 supported platforms vs. 35+).

Best,

Mathieu