[Haskell-cafe] Improvements to package hosting and security

Duncan Coutts duncan at well-typed.com
Thu Apr 16 10:12:49 UTC 2015


On Thu, 2015-04-16 at 09:52 +0000, Michael Snoyman wrote:
> Thanks for responding, I intend to go read up on TUF and your blog post
> now. One question:
> 
>       * We're incorporating an existing design for incremental updates
>         of the package index to significantly improve "cabal update"
>         times.
> 
> Can you give any details about what you're planning here?

Sure, it's partially explained in the blog post.

> I put together a
> Git repo already that has all of the cabal files from Hackage and which
> updates every 30 minutes, and it seems that, instead of reinventing
> anything, simply using `git pull` would be the right solution here:
> 
> https://github.com/commercialhaskell/all-cabal-files

It's great that we can mirror to lots of different formats so
easily :-).

I see that we now have two hackage mirror tools, one for mirroring to a
hackage-server instance and one for S3. The bit I think is missing is
mirroring to a simple directory-based archive, e.g. one to be served by
an ordinary HTTP server.

From the blog post:

        The trick is that the tar format was originally designed to be
        append only (for tape drives) and so if the server simply
        updates the index in an append only way then the clients only
        need to download the tail (with appropriate checks and fallback
        to a full update). Effectively the index becomes an append only
        transaction log of all the package metadata changes. This is
        also fully backwards compatible.
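
On the server side that really is just a tar append. As a rough sketch
using the tar package (the file and directory names here are only
illustrative, not what hackage-server actually uses):

    import qualified Codec.Archive.Tar as Tar

    -- When a package or .cabal revision is published, append its metadata
    -- entry to the end of the existing index tarball rather than rebuilding
    -- it, so all existing content keeps its byte offsets.
    -- cabalFile is a path relative to the "index-entries" base directory.
    appendToIndex :: FilePath -> IO ()
    appendToIndex cabalFile =
      Tar.append "00-index.tar" "index-entries" [cabalFile]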

The extra detail is that we can use HTTP range requests. These are
supported on pretty much all dumb/passive http servers, so it's still
possible to host a hackage archive on a filesystem or ordinary web
server (this has always been a design goal of the repository format).
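
Concretely, a range request from Haskell looks something like this (a
sketch only, using http-client; the URL is a placeholder, not the real
index location):

    {-# LANGUAGE OverloadedStrings #-}
    import           Network.HTTP.Client
    import qualified Data.ByteString.Char8 as BS
    import qualified Data.ByteString.Lazy  as BL

    -- Ask the server for everything past the given byte offset, i.e. the
    -- part of the index appended since we last synced. Any plain HTTP
    -- server that honours Range requests can answer this.
    -- (A real client would also check for a 206 Partial Content response
    -- and fall back to a full download otherwise.)
    fetchIndexTail :: Int -> IO BL.ByteString
    fetchIndexTail offset = do
      mgr <- newManager defaultManagerSettings
      req <- parseRequest "http://example.org/packages/index.tar"
      let req' = req { requestHeaders =
                         ("Range", BS.pack ("bytes=" ++ show offset ++ "-"))
                           : requestHeaders req }
      responseBody <$> httpLbs req' mgr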

We use an HTTP range request to get the tail of the tarball, so we only
have to download the data that has been added since the client last
fetched the index. This is obviously much, much smaller than the whole
index. For safety (and indeed security) the final tarball content is
checked to make sure it matches up with what is expected. Resetting and
changing files earlier in the tarball is still possible: if the content
check fails then we have to revert to downloading the whole index from
scratch. In practice we would not expect this to happen except when
completely blowing away a repository and starting again.
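
In code the check is something like the following (again just a sketch;
the SHA-256 comparison stands in for whatever "expected" ends up meaning
in the final design, e.g. a signed hash of the full index):

    import qualified Crypto.Hash.SHA256   as SHA256   -- cryptohash-sha256
    import qualified Data.ByteString      as BS
    import qualified Data.ByteString.Lazy as BL

    -- Append the freshly fetched tail to the local index and check the
    -- result against the hash the server advertises for the full index.
    -- Nothing means the check failed and the caller must fall back to
    -- downloading the whole index from scratch.
    applyIndexTail :: BL.ByteString   -- local copy of the index
                   -> BL.ByteString   -- newly fetched tail
                   -> BS.ByteString   -- expected hash of the full index
                   -> Maybe BL.ByteString
    applyIndexTail localIndex newTail expectedHash
      | SHA256.hashlazy candidate == expectedHash = Just candidate
      | otherwise                                 = Nothing
      where
        candidate = localIndex `BL.append` newTail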

The advantage of this approach compared to others like rsync or git is
that it's fully compatible with the existing format and existing
clients. In the typical case it's also a smaller download than rsync,
and probably similar to or smaller than git. Nor does it need much new
from the clients: they just need the same tar, zlib and HTTP features
they have now (e.g. in cabal-install), and we don't have to distribute
rsync/git/etc binaries on other platforms (e.g. Windows).

That said, I have no problem whatsoever with there being git or rsync
based mirrors. Indeed the central hackage server could provide an rsync
point for easy setup for public mirrors (including the package files).

-- 
Duncan Coutts, Haskell Consultant
Well-Typed LLP, http://www.well-typed.com/


