hackage-server: index format

Duncan Coutts duncan.coutts at googlemail.com
Sun May 29 22:41:37 CEST 2011


On 29 May 2011 19:46, Antoine Latter <aslatter at gmail.com> wrote:
> On Sun, May 29, 2011 at 11:13 AM, Duncan Coutts
> <duncan.coutts at googlemail.com> wrote:
>> On Fri, 2010-11-19 at 11:16 -0600, Antoine Latter wrote:
>
>>
>> I'm not sure I really understand the difference. Whether there is a
>> difference in content/meaning or just a difference in the format.
>>
>
> Oh my, what an old thread. I'll try and resurrect my state of mind at the time.

Sorry :-)

> I think my main concern was, as you said, a difference in format not a
> difference in substance. I also might have thrown in a good amount of
> over-engineering as well.
>
> What it comes down to is that embedding relative URLs (or even
> absolute URLs) in a tar-file feels like an odd thing to do - I don't
> see what advantage it has over a flat text file, and I can no longer
> create/consume the tar-file with standard tools.
>
> But maybe this doesn't matter - can we re-state what goals we're
> trying to get to, and what problems we're trying to solve? Going back
> into this thread I'm not even sure what I was talking about.

Hah! :-)

I'll restate my thoughts.

> Are we trying to come up with a master plan of allowing cabal-install
> to interact with diverse sources of packages-which-may-be-installed
> data?

Yes.

> I'm imagining the following use cases:
>
> 1. hackage.haskell.org
> 2. a network share/file system path with a collection of packages
> 3. an internet url with a collection of packages

Yes.

> 4. an internet url for a single package

That we can do now, since it's a single package rather than a
collection:

cabal install http://example.com/~me/foo-1.0.tar.gz

> 5. a tarball with a collection of packages

Yes, distributing a whole bunch of packages in a single file.

> 6. a tarball with a single package

We can also do that now:

cabal install ./foo-1.0.tar.gz

> 7. an untarred folder containing a package (as in 'cabal install' in
> my dev directory)

Yes.

> With the ability to specify some of these in the .cabal/config or at
> the command line as appropriate. There's going to be some overlap
> between these cases, almost certainly.

Yes. The policy is up for grabs; the important point here is mechanism
and format.

> Am I missing any important cases? Are any of these cases unimportant?

Another important use case: the "cabal-dev" use case, a local unpacked
package together with a bunch of other local source packages, either
local dirs or local or remote tarballs. This is basically when you
want a special local package environment for this specific package.

A closely related and overlapping use case is having a project that
consists of multiple packages, e.g. gtk2hs consists of
gtk2hs-buildtools, glib, cairo, pango and gtk. Devs hacking on this
want to build them all in one batch. Technically you can do this now,
but it's not convenient. I'd have to say:

gtk2hs$ cabal install gtk2hs-buildtools/ glib/ cairo/ pango/ gtk/

What we want there is a simple index that contains them all and that
cabal-install then uses by default when we build/install in this
directory. Or something like that.
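
For instance, an index for that case could contain nothing but
directory links, one per package, using the symlink convention
sketched further down (version numbers invented here):

  gtk2hs-buildtools-0.12.0 -> gtk2hs-buildtools/
  glib-0.12.0              -> glib/
  cairo-0.12.0             -> cairo/
  pango-0.12.0             -> pango/
  gtk-0.12.0               -> gtk/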

> The next question would be how much effort do we require of the
> provider of a specific case? So for numbers 4 & 5, is the output of
> 'cabal sdist' good enough? For numbers 2 & 3, will I be able to just
> place package tgz files into a particular folder structure, or will I
> need to produce an index file?

For the single-package cases, yes: we don't need an index, and we can
already handle them.

My thought about the UI is that we always have an index, so no pure
directory collections. I'd add a "cabal index" command with
subcommands for adding, removing and listing the contents of the
collection. There would be some options when adding, to choose the
kind of entry.

> What are other folks doing? I don't know much about ruby gems.
> Microsoft's new 'NuGet' packages supports tossing packages in a
> directory and then telling Visual Studio to look there (they also
> support pointing the tools at an ATOM feed, which was interesting).

Ah, that's interesting. I've also been thinking about incremental
updates of the hackage index. I think we can do this with a tar-based
format.
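
The key property is that tar is append-friendly: the server need only
ever add entries at the end, and a client that remembers how much of
the index it has already seen could fetch and parse just the new
suffix (e.g. with an HTTP Range request). A minimal sketch of the
server side using the tar package (file names invented; Tar.append
works on uncompressed archives):

  import qualified Codec.Archive.Tar as Tar

  -- Sketch: append the .cabal files of newly uploaded packages to the
  -- end of an existing index. Tar.append overwrites the old
  -- end-of-archive marker and adds the new entries after it.
  updateIndex :: [FilePath] -> IO ()
  updateIndex newCabalFiles =
      Tar.append "index.tar" "uploads/" newCabalFiles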

We're not precluding cabal-install from supporting a pure directory
style, but having a specific collection resource is necessary in most
cases, particularly the http remote cases. If we get the UI right then
we probably don't need the pure directory style since it'd just be a
matter of "cp" vs "cabal index add".

Ok, you've mostly covered it, but to try and present it all in one go,
here's what I think we need:

We need a way to describe collections of Cabal packages. These
collections should either link to packages or include them by value.
Optionally, for improved performance, the .cabal file for each package
can be included. The format should be usable in a REST context, that
is, it should support locating packages via a URL.

For each package in the index we need:
 * A link to the package (either tarball or local directory)
   OR: the package tarball by value (rather than a link)
 * optionally a .cabal file for the package
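
To make that concrete, here is a rough sketch of the per-entry
information as a Haskell type (all names illustrative, not a proposed
cabal-install API):

  import Data.ByteString.Lazy (ByteString)
  import Distribution.Package (PackageId)
  import Network.URI (URI)

  -- The kinds of package sources an index entry can describe.
  data PackageSource
     = LinkedTarball URI        -- link to a tarball, absolute or relative
     | LinkedDir     URI        -- link to an unpacked dir (local only)
     | InlineTarball ByteString -- the package tarball included by value

  data IndexEntry = IndexEntry {
      entryPackageId :: PackageId,       -- e.g. foo-1.0
      entrySource    :: PackageSource,
      entryCabalFile :: Maybe ByteString -- optional cached .cabal file
    }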

We need a format that has forwards compatibility so that in future we
can allow other optional attributes/metadata for the package, e.g.
digital signatures, or other collection-global information.

Using proper URLs (absolute or relative to the location of the
collection itself) gives a good deal of flexibility. The current
hackage archive format has implicit links, which means the layout of
the archive is fixed and all the packages must be provided directly by
the same http server. Using URLs allows a
flexible archive layout and allows "shallow" or "mirror" archives that
redirect to other servers for all or some packages.

In addition to hackage/archive-style use cases, the other major use
case is on local machines to create special source package
environments. This is just a mapping of source package id to its
implementation as a source package. This is useful for multi-package
projects, or building some package with special local versions of
dependencies. The key distinguishing feature of these package
environments is that they are local to some project directory rather
than registered globally in the ~/.cabal/config.

The motivation for including package tarballs by value is that it
allows distributing multi-package systems/projects as a single file,
and it gives a convenient way of making snapshots of packages without
having to stash them specially in some local directory.

My suggestion to get this kind of flexible format is to reuse and
abuse the tar format. The tar format is a collection of files. We can
encode the different kinds of entries via file names and extensions.

To encode URL links my idea was to abuse the tar symlink support and
say that symlinks are really just URLs. Relative links are already
URLs, the abuse is to suggest allowing absolute URLs also, like
http://example.com/~me/foo-1.0.tar.gz. The advantage of this approach
is that each kind of entry (tarball, .cabal file, etc.) can be either
included by value as a file or included as a link. If we have to
encode links as .url files then we lose that ability.
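
A minimal sketch of producing such a URL-symlink entry with the tar
package (assuming toLinkTarget is willing to treat the URL string as
an ordinary link target, which is exactly the abuse proposed above;
partial pattern matches for brevity):

  import qualified Codec.Archive.Tar       as Tar
  import qualified Codec.Archive.Tar.Entry as Tar
  import qualified Data.ByteString.Lazy    as LBS

  main :: IO ()
  main = do
      -- an entry named foo-1.0.tar.gz whose "symlink target" is a URL
      let Right path = Tar.toTarPath False "foo-1.0.tar.gz"
          Just url   = Tar.toLinkTarget
                         "http://example.com/~me/foo-1.0.tar.gz"
          entry      = Tar.simpleEntry path (Tar.SymbolicLink url)
      LBS.writeFile "index.tar" (Tar.write [entry])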

Instead of using symlinks it is also possible to add new tar entry
types. Standard tools will either ignore custom types on unpacking or
treat them as ordinary files. Standard tools will obviously not create
custom tar entries, though they will add symlinks.

Here is an example convention for the names and meanings of tar entries:

 1. foo-1.0.tar.gz
 2. foo-1.0.cabal
 3. foo-1.0

1 & 2 can be ordinary file entries or they can be symlinks/URLs,
while 3 can only be a symlink. For example:

 * foo-1.0.tar.gz -> packages/foo/1.0/foo-1.0.tar.gz
 * foo-1.0.tar.gz -> http://code.haskell.org/~me/foo-1.0.tar.gz
 * foo-1.0 -> foo-1.0/
 * foo-1.0 -> ../deps/foo-1.0/
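
On the consuming side the reader just dispatches on the entry name
and content. A rough sketch with the tar package (API as of tar-0.5.x;
error handling elided, and the IndexItem type is illustrative):

  import qualified Codec.Archive.Tar       as Tar
  import qualified Codec.Archive.Tar.Entry as Tar
  import qualified Codec.Compression.GZip  as GZip
  import qualified Data.ByteString.Lazy    as LBS
  import Data.List (isSuffixOf)

  data IndexItem
     = TarballByValue FilePath LBS.ByteString -- 1 as a file entry
     | TarballLink    FilePath String         -- 1 as a symlink/URL
     | CabalByValue   FilePath LBS.ByteString -- 2 as a file entry
     | CabalLink      FilePath String         -- 2 as a symlink/URL
     | DirLink        FilePath String         -- 3, local dirs only

  classify :: Tar.Entry -> Maybe IndexItem
  classify entry = case Tar.entryContent entry of
      Tar.NormalFile bs _
        | ".tar.gz" `isSuffixOf` name -> Just (TarballByValue name bs)
        | ".cabal"  `isSuffixOf` name -> Just (CabalByValue   name bs)
      Tar.SymbolicLink target
        | ".tar.gz" `isSuffixOf` name -> Just (TarballLink name (link target))
        | ".cabal"  `isSuffixOf` name -> Just (CabalLink   name (link target))
        | otherwise                   -> Just (DirLink     name (link target))
      _ -> Nothing
    where
      name = Tar.entryPath entry
      link = Tar.fromLinkTarget

  readIndex :: FilePath -> IO [IndexItem]
  readIndex file = do
      bs <- LBS.readFile file
      return (Tar.foldEntries (\e is -> maybe is (:is) (classify e))
                              [] (error . show)
                              (Tar.read (GZip.decompress bs)))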

The links are interpreted as ordinary URLs, possibly relative to the
location of the collection itself. For example, if we got this
index.tar.gz from http://hackage.haskell.org/index.tar.gz then the
link packages/foo-1.0.tar.gz gives us
http://hackage.haskell.org/packages/foo-1.0.tar.gz
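
That resolution step is just standard relative URL resolution; e.g.
with Network.URI (note that relativeTo returns Maybe URI in some older
versions of the network package):

  import Network.URI (URI, parseURIReference, relativeTo)

  -- Resolve a link from the index against the index's own location.
  -- parseURIReference accepts absolute and relative links alike, and
  -- relativeTo leaves absolute ones unchanged.
  resolveLink :: String -> URI -> Maybe URI
  resolveLink link base =
      fmap (`relativeTo` base) (parseURIReference link)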

Links to directories are only valid in the local cases, since we do
not support remote unpacked packages (there's no reasonable way to
enumerate their contents).

For these relative URLs one can use standard tar tools to construct
the index. For absolute URLs it is in fact still possible, by making
broken symlinks that point to non-existent files, like:

$ ln -s http://code.haskell.org/~me/foo-1.0.tar.gz foo-1.0.tar.gz

and the tar tool will happily include such broken symlinks in the tar file.

We could instead use a custom tar entry type for URLs but we would
lose this ability.

For a user interface I was thinking of something along the lines of:

cabal index init [indexfile]
cabal index add [indexfile] [--copy] [--link] [targets]
cabal index list [indexfile]
cabal index remove [indexfile] [pkgname]

The --copy and --link flags for index add are to distinguish between
adding a snapshot copy of a tarball to the index and linking to the
local tarball, which may be updated later. We may also want to
distinguish between a volatile local tarball and a stable one; in the
latter case we can include a cached copy of the .cabal file. I'm not
sure if there's a sensible default for --copy vs --link, or whether we
should force people to choose, as in "cabal index add-copy" vs
"add-link".
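
A hypothetical session, just to illustrate (the index file name is
invented, and none of this exists yet):

  cabal index init deps.index
  cabal index add deps.index --link ../bar-0.5/
  cabal index add deps.index --copy ./foo-1.0.tar.gz
  cabal index list deps.index
  cabal index remove deps.index foo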

Duncan


