[Haskell-cafe] Improvements to package hosting and security
Gershom B
gershomb at gmail.com
Mon May 4 13:55:34 UTC 2015
On May 4, 2015 at 4:42:05 AM, Mathieu Boespflug (mboes at tweag.net) wrote:
> - cabal-install mysteriously dropping HTTP connections and corrupting
> .cabal files: this particular firewall that I've seen is used by
> hundreds of developers in the company without it silently truncating
> requests on anything else but Cabal updates. Investigations so far
> point to a bad interaction between Network.HTTP and lazy bytestrings,
> see http://www.hawaga.org.uk/tmp/bug-cabal-zlib-http-lazy.html (no bug
> report just yet). Reusing the same download mechanism that hundreds of
> others are already using in the company means we are not at risk of a
> firewall triggering an obscure latent race condition in the way
> cabal-install retrieves HTTP responses. It means if there is a real
> problem with the firewall, it won't just be for the local Haskellian
> outpost who are trying to sell Haskell to their boss, but for
> everyone, and therefore fixed.
Yes, in this particular case, git is clearly a transport that works and HTTP is one that doesn’t. But as you note, this appears to be a problem with the firewall, not the HTTP library. You’re right that moving to a more widely used transport would help with this problem. But so, apparently, would moving to curl. In any case, as I wrote, the best way to address this is to make our transport layer more flexible in general, and the way to do that is not simply to swap the HTTP library for git, but to open up our choices more broadly. That is precisely the plan already under discussion for Cabal. Git is no magic bullet here. It is just “anything besides the current thing that happens to trigger a specific bug in a specific firewall.”
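To make "more flexible in our transport layer" concrete, here is a minimal sketch of what a pluggable transport with fallback might look like. It is an illustration only: the names `Transport`, `fetchWithFallback`, `brokenHttp` and `stubCurl` are hypothetical and are not taken from cabal-install, and the two transports are stubs rather than real network code.

```haskell
-- Hypothetical sketch of a pluggable transport layer; none of these names
-- come from cabal-install. A transport maps a URL to an error or a body.
data Transport = Transport
  { transportName :: String
  , fetch         :: String -> IO (Either String String)
  }

-- Try each transport in order. Return the first success together with the
-- name of the transport that produced it, or every failure reason if all fail.
fetchWithFallback :: [Transport] -> String -> IO (Either [String] (String, String))
fetchWithFallback transports url = go [] transports
  where
    go errs []       = pure (Left (reverse errs))
    go errs (t : ts) = do
      result <- fetch t url
      case result of
        Right body -> pure (Right (transportName t, body))
        Left err   -> go ((transportName t ++ ": " ++ err) : errs) ts

-- Stubs standing in for real transports (built-in HTTP, curl, git, ...).
brokenHttp :: Transport
brokenHttp = Transport "builtin-http" (\_ -> pure (Left "response truncated"))

stubCurl :: Transport
stubCurl = Transport "curl" (\url -> pure (Right ("index from " ++ url)))
```

With something of this shape, git would be one more entry in the list rather than a replacement for everything else, and a firewall that breaks one transport would not have to break `cabal update`.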
> - the reversing revisions issue was NOT just a display issue: it
> completely broke Stackage Nightly builds that day, which just calls
> `cabal update` under the hood:
> https://github.com/haskell/hackage-server/issues/305. Other users of
> Hackage in that time window also experienced the issue. It's an issue
> that caused massive breakage in a lot of places. Notice how PkgInfo_v2
> is a data structure that is entirely redundant with what Git would
> provide already, so need not be serialized to disk, have migrations
> written for it, etc, nor perhaps exist at all. Further, Git would have
> made it quite impossible to distribute what amounts to a rewritten and
> inadvertently tampered with history (because the clients would have
> noticed and refused to fast forward). Fewer pieces of state managed
> independently + less code = more reliable service.
Ah, I see: they were flipped in the migration, not just in the display of the data. Regardless, there will always be a layer between our data storage (be it git, acid-state, a database, or anything else) and the programmatic use we make of that data. No matter what we do to the storage layer, the intermediate layer will need to turn it into a programmatic representation, and then the frontend services will need to display or otherwise make use of it. So there is always room for such bugs. You might say “but the server couldn’t cause such a bug in this system!” That’s silly: the deserialization from the storage layer would just take place later, at each client, and the clients could cause such a bug. So yes, the literal place the bug was found is in code that would be different under a different storage layer. But there’s absolutely nothing in switching storage layers that rules out such bugs.
And furthermore, the migration you propose involves taking all our data, pushing it into an entirely new representation, rewriting the entire hackage-server to talk to this new representation at all stages, and rewriting cabal-install to do the same. I promise that this would necessarily create a _whole lot_ of bugs.
Again, there may be reasons to do this (I’m dubious) — but let’s not overstate them to sell the case.
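As a rough illustration of the point about the intermediate layer, here is a small sketch of how a reversed-revisions bug can live in the decoding step no matter where the revisions are stored. The `Revision` type and both functions are hypothetical and are not the actual hackage-server or cabal-install code.

```haskell
import Data.List (sortOn)

-- Hypothetical stand-in for one stored revision of a package's .cabal file.
data Revision = Revision
  { revNumber :: Int     -- 0 is the originally uploaded file
  , revCabal  :: String  -- contents of the .cabal file at that revision
  } deriving (Eq, Show)

-- Fragile decoder: assumes the storage layer hands back revisions
-- oldest-first, so the last element is taken to be the latest.
latestAssumingOrder :: [Revision] -> Maybe Revision
latestAssumingOrder [] = Nothing
latestAssumingOrder rs = Just (last rs)

-- More defensive decoder: orders by the explicit revision number instead of
-- trusting whatever order the storage layer produced.
latestByNumber :: [Revision] -> Maybe Revision
latestByNumber rs = case sortOn revNumber rs of
  []     -> Nothing
  sorted -> Just (last sorted)
```

If a migration, or a different backend, returns the list newest-first, `latestAssumingOrder` silently picks revision 0. That mistake can be made by the server or by every client that decodes the data, whatever the backend.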
> Hosting we don't have to
> manage ourselves is hosting we don't have to keep humming. Of course
> no service guarantees 100% uptime, so mirrors are a key additional (or
> alternative) ingredient here. Efficient, low-latency and reliable
> mirroring is certainly possible by other means, but mirroring a
> history of changes is exactly what Git was designed for, and what it
> does well. Why reinvent that?
In the last case here, you say that mirroring is easier with git? But don’t we already have mirroring now? And haven’t we had it for some time? The work underway, to my knowledge, is only to make mirroring more secure (as a related consequence of making hackage in general more secure). So this seems a silly thing to raise.
> > However, I don’t think that migrating to git will solve any of the problems you mentioned
> > above in your parenthetical. It _does_ help with the incremental fetch issue (though
> > there are other ways to do that), and it _is_ a way to tackle the index signing issue, though
> > I’m not sure that it is the best way (in particular, given the difficulty of configuring
> > git _with keys_ on windows).
>
> That's an interesting concern, though without knowing more, this is
> not an actionable issue. What difficulties? If MinGHC packaged
> Git+gpg4win, what would the issue be?
I can give you an example I already ran into with MinGHC: I had a preexisting cygwin install on my machine and tried to install MinGHC. This mixed msys paths with cygwin paths, and everything mismatched and was horrible until I ripped out those msys paths. But now, of course, my new GHC can’t find the libraries to build against when, e.g., doing a network reinstall, which was the entire point of the exercise.
By analogy, many Windows users may have an existing git, and some may have an existing gpg. These may come from Windows binaries (in a few flavors: direct, wrapped via Tortoise, etc.), from cygwin, or perhaps from another existing msys install.
Now they’re going to get multiple copies of these programs on their system, with potentially conflicting paths, settings, etc.? (The same goes for gpg, though not git, on Mac.) And since we won’t have guarantees that everyone will have git, we’ll need to maintain existing transports anyway, so this only gives us a very partial solution...
I know there are some neat ideas in what you’re pushing for. But I feel like you’re overlooking all the potential issues — and also just underestimating the amount of work it would take to cut everything over to a new storage layer, on both front and backend, while keeping the set of existing features intact.
—Gershom