[GHC DevOps Group] Fwd: DevOps: Next steps

Tue Oct 10 06:55:22 UTC 2017

[RESENT MESSAGE — see https://mail.haskell.org/pipermail/ghc-devops-group/2017-October/000004.html]
[Includes messages from Ben and me to which it responds.]

> From: "Boespflug, Mathieu" <m at tweag.io>
> Subject: Re: [GHC DevOps Group] DevOps: Next steps
> Date: 5 October 2017 at 09:29:10 GMT+11
> To: Ben Gamari <ben at well-typed.com>
> Cc: Jonas Pfenniger Chevalier <jonas.chevalier at tweag.io>, ghc-devops-group at haskell.org
> 
> Hi Ben,
> 
> Many thanks for your detailed and thoughtful reply. I won't myself
> address your points one by one (I expect Manuel will jump in), but I
> do want to ground the discussion with the following remarks:
> 
> * What are the requirements that the current Jenkins effort is building
> towards? I seem to remember some page on the GHC wiki stating these
> and then comparing various alternatives, but I can't find it now, so
> maybe I dreamed it. The blog post [1] mentions alternatives but
> doesn't evaluate them, nor does it state the requirements.
> * A key requirement I think is not just that this kind of
> infrastructure should not take time to set up given scarce development
> resources, but more importantly that none of the maintenance be
> bottlenecked on a single person managing a custom fleet of machines
> whose state cannot be reproduced.
> * Better yet if anyone that forks GHC (with a single click on GitHub)
> gets a local copy of the CI by the same token, which can then be
> modified at will.
> * If we can get very quick wins today for at least 3 of the 4 "Tier 1"
> platforms, that's already a step forward and we can work on the rest
> later, just like Rust has (see below).
> 
> I'll copy here an experience report [2] from the Rust infra authors
> from before they switched to a Travis CI backed solution:
> 
>> * Our buildbot-based CI / release infrastructure cannot be maintained
>> by community members, is generally bottlenecked on Alex and myself.
> 
> Sounds like this applies equally to the current Harbormaster setup.
> Perhaps to the Jenkins based one also?
> 
>> * Our buildbot configuration has reliability issues, particularly around
>> managing dynamic EC2 instances.
> 
> Sounds familiar. Is any OS X automated testing happening at this
> point? I heard some time before ICFP that one or both of the OS X build
> bots had fallen off the edge of the Internet.
> 
>> * Our nightly builds sometimes fail for reasons not caught during CI and
>> are down for multiple days.
> 
> This matches my experience when adding CircleCI support: the tip of
> the master branch at the time had failing tests.
> 
>> * Packaging Rust for distribution is overly complex, involving
>> many systems and source repositories.
> 
> Yup. But admittedly this is an orthogonal issue.
> 
>> * The beta and stable branches do not run the test suite today.
>> With the volume of beta backports each release receives this is
>> a frightening situation.
> 
> I assume this is not the case for us. But it's unclear where I'd look
> to find a declarative description of what's going on for each branch?
> Can each branch define their own way to perform CI?
> 
>> * As certain core Rust tools mature we want to deliver them as part of
>> the Rust distribution, and this is difficult to do within the
>> current infrastructure / build system design. Distributing
>> additional tools with Rust is particularly crucial for those
>> intimately tied to compiler internals, like the RLS and clippy.
> 
> Also a familiar situation, though again an orthogonal issue.
> 
> So it sounds like at this crossroads we've been seeing a lot of the
> same things the Rust team has experienced. The jurisprudence they've
> established here is pretty strong. If we want to address the very same
> problems then we need:
> 
> 1. Reproducible cloud instances that are created/destroyed on-demand,
> and whose state doesn't drift over time. That way, no problems with
> build bots that eventually disappear.
> 2. A declarative description of the *entire infrastructure and test
> environment*, for each target platform, so that it can be replicated
> by anyone who wants to do so, in a single command. That way we're not
> blocked on any single person to make changes to it.
> 
> I believe reusing existing managed CI solutions is the way to get
> there. But let's discuss. Just know that we'd be happy to contribute
> towards any paid subscription necessary. So that shouldn't be a barrier.
> 
> Best,
> 
> Mathieu
> 
> [1] https://ghc.haskell.org/trac/ghc/blog/jenkins-ci
> [2] https://internals.rust-lang.org/t/rust-ci-release-infrastructure-changes/4489
> --
> Mathieu Boespflug
> Founder at http://tweag.io.
> 
> 
> On 4 October 2017 at 19:30, Ben Gamari <ben at well-typed.com> wrote:
>> Manuel M T Chakravarty <manuel.chakravarty at tweag.io> writes:
>> 
>>> Hi Ben,
>>> 
>> Hi Manuel,
>> 
>> Thanks again for your help here!
>> 
>>> Since we talked last week, I have talked with Mathieu and Jonas (our
>>> resident DevOps guru) about the whole CI situation and our discussion
>>> about automating the production of build artefacts for GHC to make the
>>> release process less labour-intensive. I am adding both to CC, so that
>>> they can correct me if I am getting anything wrong.
>>> 
>>> When we talked on the phone, you mentioned that we need to be able to
>>> support all the Tier 1 platforms, and we both concluded that this
>>> implies the need for using Jenkins and we can’t, e.g., use CircleCI as
>>> they only support macOS and Linux. Mathieu and Jonas explained to me
>>> that this is actually not the case. Apparently, Rust solves this issue
>>> by building Linux and macOS artefacts on CircleCI, Windows on
>>> Appveyor, and everything else using QEMU on CircleCI (e.g., FreeBSD
>>> could be done that way and eventually ARM builds).
>>> 
>> Indeed when starting this I looked a bit at what rustc does. By my
>> recollection, they don't actually perform builds on anything but
>> Linux/amd64. Instead they build cross-compilers on x86-64, use these to
>> build their testsuite artifacts, and then run these under qemu (and in
>> some cases, e.g. FreeBSD, they don't even do this).
>> 
>> While in general I would love to be able to do everything with
>> cross-compiled binaries from Linux/amd64, our cross-compilation story
>> may be a bit lacking to pull this off at the moment. Moritz Angermann has
>> been making great strides in this area recently but it's going to be a
>> while until we can really make this work. In particular, our Template
>> Haskell story will need quite some work before we can reliably do a full
>> cross-compiled testsuite run.
>> 
>> In general I'm a bit skeptical of moving to a solution that relegates
>> non-Linux/amd64 builds to a VM. Non-Linux/amd64 platforms have
>> commercial users and do deserve first-class CI support. Furthermore,
>> without KVM or hypervisor support (which, as far as I can tell, CircleCI
>> does not provide [1]) I'm not sure that virtualisation will allow us to
>> get where we want to be in terms of test coverage and build response
>> time due to the cost of virtualisation. Without hardware support qemu
>> can be rather expensive.
>> 
>>> They convinced me that this is a worthwhile direction to consider for
>>> the following reasons:
>>> 
>>> * Jenkins is a fickle beast: apparently scaling Jenkins to work
>>> reliably when running tests against multiple PRs on distributed
>>> infrastructure is hard — we ran into significant problems in a client
>>> project recently.
>>> 
>> 
>> I agree that Jenkins is a rather fickle beast; indeed it can be
>> positively infuriating to work with. However, I've not yet noticed the
>> scaling issues you describe. What in particular did you observe?
>> 
>>> * All the custom setup and maintenance of build nodes etc. required by
>>> Jenkins disappears. (Mathieu built the CircleCI setup that he
>>> contributed recently quite quickly, so there really is little overhead
>>> in setting this up.)
>>> 
>> I'm not sure that the difference here is actually so great. Yes, in the
>> case of Jenkins you do have physical machines to administer. However,
>> this typically isn't the hard part. If you look at Rust's configuration,
>> they have roughly a dozen Docker environments which they had to set up
>> and maintain; this effort will likely far outweigh the setup cost of the
>> machines themselves. This has certainly been the case for Jenkins and I
>> suspect it would be true of CircleCI as well; this is simply the cost of
>> entry for cross-platform testing.
>> 
>> Moreover, we can't write off the cost of integrating with CircleCI. Of
>> course, if we do decide to move to GitHub then perhaps this cost shrinks
>> dramatically. However, until this decision is made it seems like we need
>> to assume that Phabricator integration will be necessary.
>> 
>>> * The problems we discussed with possibly not having enough Rackspace
>>> capacity for the transition disappear.
>>> 
>> In some sense this is true; however, it seems like we are trading one
>> commodity of finite supply for another. We currently have Rackspace
>> credit and consequently these instances can be considered to be
>> essentially free.
>> 
>> While CircleCI does offer four free containers for open source
>> projects (and perhaps a bit more in our case if we ask), I'm skeptical
>> that this will be enough; currently our four build bots give us
>> multi-day wait times which makes development remarkably painful. The
>> appeal of Jenkins is that we can shorten this timescale as well as grow
>> our test coverage with the resources that we already have.
>> 
>> Let's have a brief look at what resources we may need.
>> 
>> A quick back-of-the-envelope calculation suggests that to simply keep up
>> with our current average commit rate (around 200 commits/month) for the
>> four environments that we currently build we need a bare minimum of:
>> 
>>    200 commit/month
>>  * 4 build/commit             (Linux/i386, Linux/amd64,
>>                                OS X, Windows/amd64)
>>  * 2.5 CPU-hour/build         (approx. average across platforms
>>                                for a validate)
>>  / (2 CPU-hour/machine-hour)  (CircleCI appears to use 2 vCPU instances)
>>  / (30*24 machine-hour/month)
>>  ~ 2 machines
>> 
>> Note that this doesn't guarantee reasonable wait times but rather merely
>> ensures that we can keep up on the mean. On top of this, we see around
>> 300 differential revisions per month. This requires another 3 machines
>> to keep up.
>> 
>> So, we need at least five machines but, again, this is a minimum;
>> modelling response times is hard but I expect we would likely need to
>> add at least two more machines to keep response times in the
>> contributor-friendly range, especially considering that under CircleCI
>> we will lose the ability to prioritize jobs (by contrast, with Jenkins
>> we can prioritize pull requests as this is the response time that we
>> really care about). Now consider that we would like to add at least
>> three more platforms (FreeBSD, OpenBSD, Linux/aarch64, all of which may
>> be relatively slow to build due to virtualisation overhead) as well as a
>> few more build configurations on amd64 (LLVM, unregisterised, at least
>> one cross-compilation target) and a periodic slow validation and we may
>> be at over a dozen machines.
>> 
>> All of this appears to put us well outside CircleCI's offering to
>> open-source projects. Of course, it may be worth asking whether they are
>> willing to extend GHC a more generous offer. However, I don't think we
>> can count on this and I'm not certain that Haskell.org is currently in a
>> position to be able to shoulder such a financial burden.
>> 
>>> * We also don’t need to worry about a macOS box either.
>>> 
>> Quite true.
>> 
>>> Also, Jonas could help us get things running and, I think, his
>>> wealth of experience would be very useful. (At least, I would be very
>>> grateful for his advice.)
>>> 
>>> I think, this route has the potential to get us to where we want to be
>>> quite quickly and in a manner that is very little effort to maintain
>>> once set up. What do you think?
>>> 
>> Indeed I can see that there are many advantages to the CircleCI option.
>> The ease of bringing up a Linux/amd64 build environment which easily
>> scales and requires no administration is quite enticing. However, I am
>> skeptical that it will be as easy to get the full suite of builds that
>> we are aiming to produce. I would be quite curious to see what Jonas has
>> to say on the matter of non-Linux platforms. Seeing a simple
>> configuration which compiles and tests even a FreeBSD/amd64 build in a
>> reasonable amount of time may well be enough to convince me.
>> 
>> Thanks again for your help on this!
>> 
>> Cheers,
>> 
>> - Ben
>> 
>> 
>> [1] https://circleci.com/docs/1.0/android/
> _______________________________________________
> Ghc-devops-group mailing list
> Ghc-devops-group at haskell.org
> https://haskell.org/cgi-bin/mailman/listinfo/ghc-devops-group
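
Ben's back-of-the-envelope capacity estimate quoted above (200 commits/month,
4 builds per commit, 2.5 CPU-hours per build, 2 vCPUs per machine, 30*24
machine-hours per month) can be re-derived mechanically. The following is only
a rough sketch in Python: the function name and its parameters are made up for
illustration, and it assumes that each differential revision costs the same
four builds as a commit, which the message implies but does not state.

```python
import math

# Machine-hours that one machine provides per month (30 days * 24 hours).
HOURS_PER_MONTH = 30 * 24


def machines_needed(jobs_per_month, builds_per_job=4,
                    cpu_hours_per_build=2.5, vcpus_per_machine=2):
    """Mean number of machines needed to keep up with the given job rate.

    Hypothetical helper; the defaults are the figures from the quoted
    estimate (4 platforms, 2.5 CPU-hours per validate, 2 vCPU instances).
    """
    cpu_hours = jobs_per_month * builds_per_job * cpu_hours_per_build
    machine_hours = cpu_hours / vcpus_per_machine
    return machine_hours / HOURS_PER_MONTH


commits = machines_needed(200)  # commits landing on master
diffs = machines_needed(300)    # differential revisions (assumed same cost)

print(f"commits:   {commits:.2f} -> {math.ceil(commits)} machines")
print(f"revisions: {diffs:.2f} -> {math.ceil(diffs)} machines")
print(f"total:     {math.ceil(commits) + math.ceil(diffs)} machines")
```

Rounding each workload up separately gives 2 + 3 = 5 machines, matching the
"at least five machines" figure in the message. As the message itself stresses,
this only covers the mean load and says nothing about wait times.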
