[Haskell-cafe] Improvements to package hosting and security

Thu Apr 16 10:57:54 UTC 2015

On Thu, 2015-04-16 at 10:32 +0000, Michael Snoyman wrote:
> On Thu, Apr 16, 2015 at 1:12 PM Duncan Coutts <duncan at well-typed.com> wrote:
> 
> > On Thu, 2015-04-16 at 09:52 +0000, Michael Snoyman wrote:
> > > Thanks for responding, I intend to go read up on TUF and your blog post
> > > now. One question:
> > >
> > >       * We're incorporating an existing design for incremental updates
> > >         of the package index to significantly improve "cabal update"
> > >         times.
> > >
> > > Can you give any details about what you're planning here?
> >
> > Sure, it's partially explained in the blog post.
> >
> > > I put together a
> > > Git repo already that has all of the cabal files from Hackage and which
> > > updates every 30 minutes, and it seems that, instead of reinventing
> > > anything, simply using `git pull` would be the right solution here:
> > >
> > > https://github.com/commercialhaskell/all-cabal-files
> >
> > It's great that we can mirror to lots of different formats so
> > easily :-).
> >
> > I see that we now have two hackage mirror tools, one for mirroring to a
> > hackage-server instance and one for S3. The bit I think is missing is
> > mirroring to a simple directory based archive, e.g. to be served by a
> > normal http server.
> >
> > From the blog post:
> >
> >         The trick is that the tar format was originally designed to be
> >         append only (for tape drives) and so if the server simply
> >         updates the index in an append only way then the clients only
> >         need to download the tail (with appropriate checks and fallback
> >         to a full update). Effectively the index becomes an append only
> >         transaction log of all the package metadata changes. This is
> >         also fully backwards compatible.
> >
> > The extra detail is that we can use HTTP range requests. These are
> > supported on pretty much all dumb/passive http servers, so it's still
> > possible to host a hackage archive on a filesystem or ordinary web
> > server (this has always been a design goal of the repository format).
> >
> > We use a HTTP range request to get the tail of the tarball, so we only
> > have to download the data that has been added since the client last
> > fetched the index. This is obviously much much smaller than the whole
> > index. For safety (and indeed security) the final tarball content is
> > checked to make sure it matches up with what is expected. Resetting and
> > changing files earlier in the tarball is still possible: if the content
> > check fails then we have to revert to downloading the whole index from
> > scratch. In practice we would not expect this to happen except when
> > completely blowing away a repository and starting again.
> >
> > The advantage of this approach compared to others like rsync or git is
> > that it's fully compatible with the existing format and existing
> > clients. It's also in the typical case a smaller download than rsync and
> > probably similar or smaller than git. It also doesn't need much new from
> > the clients, they just need the same tar, zlib and HTTP features as they
> > have now (e.g. in cabal-install) and don't have to distribute
> > rsync/git/etc binaries on other platforms (e.g. windows).
> >
> > That said, I have no problem whatsoever with there being git or rsync
> > based mirrors. Indeed the central hackage server could provide an rsync
> > point for easy setup for public mirrors (including the package files).
> >
> >
> >
> I don't like this approach at all. There are many tools out there that do a
> good job of dealing with incremental updates. Instead of using any of
> those, the idea is to create a brand new approach, implement it in both
> Hackage Server and cabal-install (two projects that already have a massive
> bug deficit), and roll it out hoping for the best.

I looked at other incremental HTTP update approaches that would be
compatible with the existing format and work with passive HTTP servers.
There's one rsync-like thing over HTTP, but the update sizes for our case
would be considerably larger than with this very simple "get the tail,
check the secure hash is still right" scheme. This approach is minimally
disruptive and compatible with the existing format and clients.
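
A minimal sketch of that client logic, in Python for brevity. It is an
illustration, not the planned implementation: fetch_tail, fetch_all and
the choice of SHA-256 as the secure hash are assumptions, and the tar
end-of-archive padding is ignored for simplicity.

```python
import hashlib


def incremental_update(local, fetch_tail, fetch_all, expected_sha256):
    """Return up-to-date index bytes, downloading only the tail if possible.

    local           -- index bytes the client already has
    fetch_tail(n)   -- hypothetical: server bytes from offset n onward
    fetch_all()     -- hypothetical: the whole index
    expected_sha256 -- hex digest of the current index, obtained separately
    """
    candidate = local + fetch_tail(len(local))
    if hashlib.sha256(candidate).hexdigest() == expected_sha256:
        # Append-only case: what we had is still a valid prefix.
        return candidate

    # Earlier content changed (e.g. the repository was reset), so fall
    # back to downloading the whole index from scratch.
    full = fetch_all()
    if hashlib.sha256(full).hexdigest() != expected_sha256:
        raise ValueError("index does not match the expected hash")
    return full
```

The hash is checked over the whole resulting index, so a stale or
tampered prefix is caught in the same way as a corrupted tail.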

> There's no explanation here as to how you'll deal with things like
> cabal file revisions, which are very common these days and seem to
> necessitate redownloading the entire database in your proposal.

The tarball becomes append only. The tar format works in this way:
updated files are simply appended. (This is how incremental backups to
tape drives worked in the old days, using the tar format.) So no, cabal
file revisions will not force a full redownload. They will be handled
just fine, as will updates to other metadata. Indeed we get the full
transaction history.
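
The append behaviour can be tried out with Python's standard tarfile
module. This is only a sketch on a throwaway archive: the entry name
foo/1.0/foo.cabal and the file contents are made up for illustration
and do not reflect the real index layout.

```python
import io
import os
import tarfile
import tempfile


def append_entry(index_path, name, content):
    """Append one file to an uncompressed tar; earlier entries are untouched."""
    info = tarfile.TarInfo(name=name)
    info.size = len(content)
    with tarfile.open(index_path, mode="a") as tar:
        tar.addfile(info, io.BytesIO(content))


def latest_content(index_path, name):
    """Content of the most recently appended entry with this name."""
    with tarfile.open(index_path, mode="r") as tar:
        # getmember returns the last occurrence of a repeated name.
        return tar.extractfile(tar.getmember(name)).read()


def history(index_path, name):
    """Every version of the entry, oldest first."""
    with tarfile.open(index_path, mode="r") as tar:
        return [tar.extractfile(m).read()
                for m in tar.getmembers() if m.name == name]


def demo():
    """Append an original file and one revision, then read both views back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.tar")
        name = "foo/1.0/foo.cabal"  # hypothetical entry name
        append_entry(path, name, b"version: 1.0\n")
        append_entry(path, name, b"version: 1.0\nx-revision: 1\n")
        return latest_content(path, name), history(path, name)
```

A reader that wants the current state takes the last entry for each
name, while the earlier entries remain in the archive as the history.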

> Here's my proposal: use Git. If Git isn't available on the host, then
> revert to the current codepath and download the index. We can roll that out
> in an hour of work and everyone gets the benefits, without the detriments
> of creating a new incremental update framework.

I was not proposing to change the repository format significantly, and
what changes there are would be backwards compatible. The existing format
is pretty simple, using standard, old, well-understood formats and
protocols with wide tool support.

The incremental update is fairly unobtrusive. Passive HTTP servers don't
need to know about it, and clients that don't know about it can just
download the whole index as they do now.
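
For a client that does know about it, the HTTP side might look roughly
like the sketch below. The status codes are standard HTTP; the function
names and the convention of returning the candidate index bytes are
assumptions made for illustration. The result would still go through
the content check before being trusted.

```python
def range_header(local_size):
    """Ask for everything from byte offset local_size to the end."""
    return {"Range": "bytes=%d-" % local_size}


def merge_response(local, status, body):
    """Combine the local index with the server's reply to a range request."""
    if status == 206:
        # Partial Content: the body is only the tail we asked for.
        return local + body
    if status == 200:
        # The server ignored the Range header and sent the whole index.
        return body
    if status == 416:
        # Range Not Satisfiable: nothing past our offset (no new data, or
        # the index shrank). Keep what we have; the content check decides
        # whether a full download is needed.
        return local
    raise ValueError("unexpected HTTP status %d" % status)
```

A server with no range support simply answers 200 with the full file,
so the same client code degrades to the current behaviour.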

The security extensions for TUF are also compatible with the existing
format and clients.

-- 
Duncan Coutts, Haskell Consultant
Well-Typed LLP, http://www.well-typed.com/