Hackage 2 and acid-state vs traditional databases
duncan.coutts at googlemail.com
Thu Sep 6 22:06:08 CEST 2012
On 6 September 2012 19:49, Ian Lynagh <ian at well-typed.com> wrote:
> Hi all,
> I've had a bit of experience with Hackage 2 and acid-state now, and I'm
> not convinced that it's the best fit for us:
> * It's slow. It takes about 5 minutes for me to stop and then start the
> server. It's actually surprising just how slow it is, so it might be
> possible/easy to get this down to seconds, but it still won't be
Yes, it probably is slower than necessary there. It should be possible
to make it as fast as reading the data from disk.
For near-instantaneous server code upgrades we would need a feature of
happs-state that currently isn't implemented in acid-state (afaik):
happs-state allowed clustering, so you could start a new server
process, let it sync up with the state from the existing process, and
then kill the old one.
> * Memory usage is high. It's currently in the 700M-1G range, and to get
> it that low I had to stop the parsed .cabal files from being held in
> memory (which presumably has an impact on performance, although I
> don't know how significant that is), and disable the reverse
> dependencies feature. It will grow at least linearly with the number
> of package/versions in Hackage.
I think this is solvable. The most costly thing is the package
metadata. We can use more compact representations (the Cabal
PackageDescription is pretty bad in this respect), and secondly we can
do a lot of sharing within the package index. Both of these changes
could be made just in the Cabal library without significantly
affecting anything else. With sharing, memory use would still grow
linearly with the number of packages, but with a much smaller constant
factor.

That said, I don't think 1GB should be considered high. If we want
good performance, then we want all commonly used data in memory
anyway, and 1GB for a central community server is not at all
unreasonable.
> * Only a single process can use the database at once. For example, if
> the admins want a tool that will make it easier for them to approve
> user requests, then that tool needs to be integrated into the Hackage
> server (or talk to it over HTTP), rather than being standalone.
> * The database is relatively opaque. While in principle tools could be
> written for browsing, modifying or querying it, currently none exist
> (as far as I know).
On both of these points, I would argue that we should simply make all
the data available via HTTP REST interfaces. Rather than the data
being available only locally to someone with direct access to the
database, it should be there in machine-readable form so that everyone
can get at it and experiment.

In the approving-user-requests example, that just needs an HTTP
PUT/POST. It doesn't need any web form, and it is totally scriptable
with standard HTTP tools.
> * The above 2 points mean that, for example, there was no easy way for
> me to find out how many packages use each top-level module hierarchy
> (Data, Control, etc). This would have been a simple SQL query if the
> data had been in a traditional database, but as it was I had to write
> a Haskell program to process all the package .tar.gz's and parse the
> .cabal files manually.
As I mentioned on IRC, the problem here is really that cabal-install
isn't currently available as a library. If it were, then loading up
the 00-index.tar file with all the .cabal files would be just a couple
of lines of code, and your query would be just a list comprehension.
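To make that concrete, here is a sketch of the query itself, assuming
you have already walked 00-index.tar (e.g. with the tar and Cabal
libraries) into a plain list; the `Package` type and `topLevelCounts`
name are hypothetical, just for illustration:

```haskell
import Data.List (sortBy)
import Data.Ord (comparing, Down(..))
import qualified Data.Map.Strict as Map

-- Hypothetical pre-parsed shape: package name plus the names of its
-- exposed modules, e.g. ("containers", ["Data.Map", "Data.Set", ...]).
type Package = (String, [String])

-- How many packages use each top-level hierarchy (Data, Control, ...),
-- most popular first.  Each hierarchy is counted at most once per
-- package.
topLevelCounts :: [Package] -> [(String, Int)]
topLevelCounts pkgs =
  sortBy (comparing (Down . snd)) $ Map.toList $
    Map.fromListWith (+)
      [ (top, 1 :: Int)
      | (_, mods) <- pkgs
      , top <- dedupe [ takeWhile (/= '.') m | m <- mods ]
      ]
  where
    dedupe = Map.keys . Map.fromList . map (\x -> (x, ()))
```

The heart of it really is the one list comprehension; everything else
is sorting and tallying.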
As another example, we should be able to make all the package data
available in other formats, e.g. JSON, which would provide plenty of
opportunity for other people to do ad-hoc queries.
> * acid-state forces us to use a server-process model, rather than having
> processes for individual requests run by apache. I don't know if we
> would have made this choice anyway, so this may or may not be an
> issue. But the current model does mean that adding a feature or fixing
> a bug means restarting the process, rather than just installing the
> new program in-place.
True. See above about quick restarts.
> Someone pointed out that one disadvantage of traditional databases is
> that they discourage you from writing as if everything was Haskell
> datastructures in memory.
> This is even more notable with the Cabal types (like PackageDescription)
> as the types and various utility functions already exist
> - although it's
> currently somewhat moot as the current acid-state backend doesn't keep
> the Cabal datastructures in memory anyway.
As I've said, I think that's fixable.
> The other issue raised is performance. I'd want to see (full-size)
> benchmarks before commenting on that.
There are basically two approaches you can take here: scaling up or
scaling out, that is, making a single server handle lots and lots of
requests, or having lots of machines. It is much simpler (both in
terms of code and infrastructure) to have a single server, and the
performance can still scale a long way if we keep all the data that
requests need in memory. Additionally, we can scale read requests a
lot further using caching proxies, without adding a great deal of
complexity.
> Has anyone else got any thoughts?
Easy deployment was also a goal. We want anyone to be able to deploy a
hackage server instance, not just a central community one, and setting
up an external database (MySQL, PostgreSQL) makes that a lot harder.
In-process databases like SQLite would be a lot slower than in-memory
data structures. Currently it is rather easy to deploy: more or less
just cabal install and go.
Johan mentions the issue of data formats, and it's a very valid point.
That is why we designed the server to do full backups in standard
external formats (CSV etc.). We do not, and should not, rely on the
acid-state binary format. As long as we do actually use the dump
feature, I am not at all concerned.
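As a sketch of what "standard external formats" buys us, CSV rendering
is small enough to write against base alone (this is an illustrative
RFC 4180 style encoder, not the server's actual backup code):

```haskell
import Data.List (intercalate)

-- Quote a field only when needed (it contains a comma, quote, or
-- newline), doubling any embedded quotes, per RFC 4180.
csvField :: String -> String
csvField s
  | any (`elem` ",\"\n") s = '"' : concatMap esc s ++ "\""
  | otherwise              = s
  where
    esc '"' = "\"\""
    esc c   = [c]

-- Join escaped fields into one CSV row.
csvRow :: [String] -> String
csvRow = intercalate "," . map csvField

-- e.g. a row from a hypothetical users table:
-- csvRow ["42", "duncan", "admin,trustee"]
--   == "42,duncan,\"admin,trustee\""
```

Because the dump is this simple and this standard, any external tool
(a spreadsheet, a SQL import, a ten-line script) can consume it, which
is exactly the portability argument above.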
More information about the cabal-devel mailing list