[Haskell-cafe] Improvements to package hosting and security

Thu Apr 16 11:18:29 UTC 2015

On Thu, Apr 16, 2015 at 1:57 PM Duncan Coutts <duncan at well-typed.com> wrote:

> On Thu, 2015-04-16 at 10:32 +0000, Michael Snoyman wrote:
> > On Thu, Apr 16, 2015 at 1:12 PM Duncan Coutts <duncan at well-typed.com> wrote:
> >
> > > On Thu, 2015-04-16 at 09:52 +0000, Michael Snoyman wrote:
> > > > Thanks for responding, I intend to go read up on TUF and your blog
> > > > post now. One question:
> > > >
> > > >       * We're incorporating an existing design for incremental
> > > >         updates of the package index to significantly improve
> > > >         "cabal update" times.
> > > >
> > > > Can you give any details about what you're planning here?
> > >
> > > Sure, it's partially explained in the blog post.
> > >
> > > > I put together a Git repo already that has all of the cabal files
> > > > from Hackage and which updates every 30 minutes, and it seems that,
> > > > instead of reinventing anything, simply using `git pull` would be the
> > > > right solution here:
> > > >
> > > > https://github.com/commercialhaskell/all-cabal-files
> > >
> > > It's great that we can mirror to lots of different formats so
> > > easily :-).
> > >
> > > I see that we now have two hackage mirror tools, one for mirroring to a
> > > hackage-server instance and one for S3. The bit I think is missing is
> > > mirroring to a simple directory based archive, e.g. to be served by a
> > > normal http server.
> > >
> > > From the blog post:
> > >
> > >         The trick is that the tar format was originally designed to be
> > >         append only (for tape drives) and so if the server simply
> > >         updates the index in an append only way then the clients only
> > >         need to download the tail (with appropriate checks and fallback
> > >         to a full update). Effectively the index becomes an append only
> > >         transaction log of all the package metadata changes. This is
> > >         also fully backwards compatible.
> > >
> > > The extra detail is that we can use HTTP range requests. These are
> > > supported on pretty much all dumb/passive http servers, so it's still
> > > possible to host a hackage archive on a filesystem or ordinary web
> > > server (this has always been a design goal of the repository format).
> > >
> > > We use an HTTP range request to get the tail of the tarball, so we only
> > > have to download the data that has been added since the client last
> > > fetched the index. This is obviously much much smaller than the whole
> > > index. For safety (and indeed security) the final tarball content is
> > > checked to make sure it matches up with what is expected. Resetting and
> > > changing files earlier in the tarball is still possible: if the content
> > > check fails then we have to revert to downloading the whole index from
> > > scratch. In practice we would not expect this to happen except when
> > > completely blowing away a repository and starting again.
> > >
> > > The advantage of this approach compared to others like rsync or git is
> > > that it's fully compatible with the existing format and existing
> > > clients. It's also in the typical case a smaller download than rsync
> > > and probably similar or smaller than git. It also doesn't need much new
> > > from the clients, they just need the same tar, zlib and HTTP features as
> > > they have now (e.g. in cabal-install) and don't have to distribute
> > > rsync/git/etc binaries on other platforms (e.g. windows).
> > >
> > > That said, I have no problem whatsoever with there being git or rsync
> > > based mirrors. Indeed the central hackage server could provide an rsync
> > > point for easy setup for public mirrors (including the package files).
> > >
> > >
> > >
> > I don't like this approach at all. There are many tools out there that
> > do a good job of dealing with incremental updates. Instead of using any
> > of those, the idea is to create a brand new approach, implement it in
> > both Hackage Server and cabal-install (two projects that already have a
> > massive bug deficit), and roll it out hoping for the best.
>
> I looked at other incremental HTTP update approaches that would be
> compatible with the existing format and work with passive http servers.
> There's one rsync-like thing over http but the update sizes for our case
> would be considerably larger than this very simple "get the tail, check
> the secure hash is still right". This approach is minimally disruptive,
> compatible with the existing format and clients.
>
> > There's no explanation here as to how you'll deal with things like
> > cabal file revisions, which are very common these days and seem to
> > necessitate redownloading the entire database in your proposal.
>
> The tarball becomes append only. The tar format works in this way;
> updated files are simply appended. (This is how incremental backups to
> tape drives worked in the old days, using the tar format). So no, cabal
> file revisions will be handled just fine, as will other updates to other
> metadata. Indeed we get the full transaction history.
>
> > Here's my proposal: use Git. If Git isn't available on the host, then
> > revert to the current codepath and download the index. We can roll that
> > out in an hour of work and everyone gets the benefits, without the
> > detriments of creating a new incremental update framework.
>
> I was not proposing to change the repository format significantly (and
> only in a backwards compatible way). The existing format is pretty
> simple, using standard old well understood formats and protocols with
> wide tool support.
>
> The incremental update is fairly unobtrusive. Passive http servers don't
> need to know about it, and clients that don't know about it can just
> download the whole index as they do now.
>
> The security extensions for TUF are also compatible with the existing
> format and clients.
>
>
>
The theme you seem to be creating here is "compatible with current format."
You didn't say it directly, but you've strongly implied that, somehow, Git
isn't compatible with existing tooling. Let me make clear that that is, in
fact, false[1]:

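```
#!/bin/bash
# Rebuild cabal's package index from a Git checkout of all-cabal-files.
# Run from inside the clone.

set -e
set -x

DIR="$HOME/.cabal/packages/hackage.haskell.org"
TAR="$DIR/00-index.tar"
TARGZ="$TAR.gz"

# Bring the checkout up to date.
git pull
mkdir -p "$DIR"

# Remove any stale index so gzip does not refuse to overwrite it.
rm -f "$TAR" "$TARGZ"

# Export the master tree as a tarball, then compress it, keeping the .tar.
git archive --format=tar -o "$TAR" master
gzip -k "$TAR"
```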

I wrote this in 5 minutes. My official proposal is to add code to `cabal`
which does the following:

1. Check for the presence of the `git` executable. If not present, download
the current tarball
2. Check for existence of ~/.cabal/all-cabal-files (or similar). If
present, run `git pull` inside of it. If absent, clone it
3. Run the equivalent of the above shell script to produce the 00-index.tar
file (not sure if the .gz is also used by cabal)
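
As a rough sketch of those three steps (this is not actual cabal code: the
function name `update_index`, its three arguments, the return-code
convention, and archiving `HEAD` rather than `master` are all assumptions
made for illustration), the logic could look something like this:

```bash
#!/bin/bash
# Hypothetical sketch only; names, arguments and return codes are assumptions.

# update_index REPO_URL CLONE_DIR INDEX_DIR
#   returns 0 on success,
#           2 if git is not installed (caller should fall back to the
#             existing tarball download),
#           1 on any other failure.
update_index() {
    local repo_url=$1 clone_dir=$2 index_dir=$3
    local tar_file="$index_dir/00-index.tar"

    # Step 1: no git executable, so signal the caller to use the old codepath.
    if ! command -v git >/dev/null 2>&1; then
        return 2
    fi

    # Step 2: pull if the clone already exists, otherwise clone it.
    if [ -d "$clone_dir/.git" ]; then
        git -C "$clone_dir" pull --quiet --ff-only || return 1
    else
        git clone --quiet "$repo_url" "$clone_dir" || return 1
    fi

    # Step 3: regenerate 00-index.tar (and the .gz) from the checked-out tree.
    mkdir -p "$index_dir" || return 1
    rm -f "$tar_file" "$tar_file.gz"
    git -C "$clone_dir" archive --format=tar HEAD > "$tar_file" || return 1
    # gzip -c writes to stdout, so the uncompressed .tar is kept as well.
    gzip -c "$tar_file" > "$tar_file.gz"
}
```

A caller would invoke it as, for example,
`update_index https://github.com/commercialhaskell/all-cabal-files ~/.cabal/all-cabal-files ~/.cabal/packages/hackage.haskell.org`
and treat a return code of 2 as "download the tarball as we do today".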

This seems like such a drastically simpler solution than using byte ranges,
modifying Hackage to produce tarballs in an append-only manner, and setting
up cabal-install to stitch together and check various pieces of a
downloaded file.

I was actually planning on proposing this some time next week. Can you tell
me the downsides of using Git here, which seems to fit all the benefits you
touted of:

> pretty simple, using standard old well understood formats and protocols
> with wide tool support.

Unless Git at 10 years old isn't old enough yet.

Michael

[1]
https://github.com/commercialhaskell/all-cabal-files/commit/133cd026f8a1f99d719d97fcf884372ded173655