Hackage 2 and acid-state vs traditional databases
Ian Lynagh
ian at well-typed.com
Thu Sep 6 20:49:28 CEST 2012
Hi all,
I've had a bit of experience with Hackage 2 and acid-state now, and I'm
not convinced that it's the best fit for us:
* It's slow. It takes about 5 minutes for me to stop and then start the
server. It's actually surprising just how slow it is, so it might be
possible/easy to get this down to seconds, but it still won't be
instantaneous.
* Memory usage is high. It's currently in the 700M-1G range, and to get
it that low I had to stop the parsed .cabal files from being held in
memory (which presumably has an impact on performance, although I
don't know how significant that is), and disable the reverse
dependencies feature. It will grow at least linearly with the number
of packages and versions in Hackage.
* Only a single process can use the database at once. For example, if
the admins want a tool that will make it easier for them to approve
user requests, then that tool needs to be integrated into the Hackage
server (or talk to it over HTTP), rather than being standalone.
* The database is relatively opaque. While in principle tools could be
written for browsing, modifying or querying it, currently none exist
(as far as I know).
* The above two points mean that, for example, there was no easy way for
me to find out how many packages use each top-level module hierarchy
(Data, Control, etc.). This would have been a simple SQL query if the
data had been in a traditional database; as it was, I had to write a
Haskell program to process all the package .tar.gz files and parse the
.cabal files manually.
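(For what it's worth, the counting step itself is small in Haskell too. A
minimal sketch, using only the containers package, with a hypothetical list
of exposed module names standing in for the parsed .cabal data:

```haskell
import qualified Data.Map.Strict as Map

-- The top-level hierarchy of a module name is the part before the
-- first '.', e.g. "Data" for "Data.List".
topLevel :: String -> String
topLevel = takeWhile (/= '.')

-- Count how many modules fall under each top-level hierarchy.
countHierarchies :: [String] -> Map.Map String Int
countHierarchies ms = Map.fromListWith (+) [ (topLevel m, 1) | m <- ms ]
```

The painful part was extracting the module lists from all the tarballs,
not this fold.)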
* acid-state forces us to use a server-process model, rather than having
processes for individual requests run by apache. I don't know if we
would have made this choice anyway, so this may or may not be an
issue. But the current model does mean that adding a feature or fixing
a bug means restarting the process, rather than just installing the
new program in-place.
Someone pointed out that one disadvantage of traditional databases is
that they discourage you from writing as if everything were Haskell
data structures in memory. For example, if you have things of type
    data Foo = Foo {
        str  :: String,
        bool :: Bool,
        ints :: [Int]
      }
stored in a database then you could write either:
    foo <- getFoo 23
    print $ bool foo
or
    b <- getFooBool 23
    print b
The former is what you would more naturally write, but it would require
constructing the whole Foo from the database (including reading an
arbitrary number of Ints). The latter is thus more efficient with a
database backend, but emphasises that you aren't working with regular
Haskell data structures.
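As a concrete sketch of the trade-off, with an in-memory Map standing in
for the database table (getFoo and getFooBool are illustrative names, not
real Hackage APIs):

```haskell
import qualified Data.Map as Map

data Foo = Foo { str :: String, bool :: Bool, ints :: [Int] }

-- An in-memory table standing in for the database (illustration only).
table :: Map.Map Int Foo
table = Map.fromList [(23, Foo "twenty-three" True [1 .. 1000])]

-- Whole-record access: the natural style, but a real database backend
-- would have to materialise every column, including the whole [Int] list.
getFoo :: Int -> Maybe Foo
getFoo k = Map.lookup k table

-- Single-field access: corresponds to a cheap single-column query, but
-- no longer reads like ordinary Haskell record access.
getFooBool :: Int -> Maybe Bool
getFooBool k = bool <$> Map.lookup k table
```

With acid-state the whole value is in memory anyway, so the two styles
cost about the same; with a SQL backend the difference can be large.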
This is even more notable with the Cabal types (like PackageDescription),
as the types and various utility functions already exist - although it's
currently somewhat moot, as the current acid-state backend doesn't keep
the Cabal data structures in memory anyway.
The other issue raised is performance. I'd want to see (full-size)
benchmarks before commenting on that.
Has anyone else got any thoughts?
On a related note, I think it would be a little nicer to store blobs as
e.g.

    54/54fb24083b14b5916df11f1ffcd03b26/foo-1.0.tar.gz

rather than

    54/54fb24083b14b5916df11f1ffcd03b26
I don't think that this breaks anything, so it should be noncontentious.
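The mapping itself would be a one-liner; a hedged sketch using the
filepath package (blobPath is an illustrative name, not the actual
Hackage server function):

```haskell
import System.FilePath ((</>))

-- Build the on-disk path for a blob: the first two hex digits of the
-- hash as a directory, then the full hash, then the original file name.
blobPath :: String -> FilePath -> FilePath
blobPath hash name = take 2 hash </> hash </> name
```

So blobPath "54fb24083b14b5916df11f1ffcd03b26" "foo-1.0.tar.gz" produces
the path shown above, and the file name at the end lets tools identify
the blob without consulting the database.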
Thanks
Ian