Hackage 2 and acid-state vs traditional databases
Ian Lynagh
ian at well-typed.com
Thu Sep 6 20:49:28 CEST 2012
Hi all,
I've had a bit of experience with Hackage 2 and acid-state now, and I'm
not convinced that it's the best fit for us:
* It's slow. It takes about 5 minutes for me to stop and then start the
server. It's actually surprising just how slow it is, so it might be
possible/easy to get this down to seconds, but it still won't be
instantaneous.
* Memory usage is high. It's currently in the 700M-1G range, and to get
it that low I had to stop the parsed .cabal files from being held in
memory (which presumably has an impact on performance, although I
don't know how significant that is), and disable the reverse
dependencies feature. It will grow at least linearly with the number
of packages and versions in Hackage.
* Only a single process can use the database at once. For example, if
the admins want a tool that will make it easier for them to approve
user requests, then that tool needs to be integrated into the Hackage
server (or talk to it over HTTP), rather than being standalone.
* The database is relatively opaque. While in principle tools could be
written for browsing, modifying or querying it, currently none exist
(as far as I know).
* The above two points mean that, for example, there was no easy way for
me to find out how many packages use each top-level module hierarchy
(Data, Control, etc.). This would have been a simple SQL query if the
data had been in a traditional database; as it was, I had to write a
Haskell program to process all the package .tar.gz files and parse the
.cabal files manually.
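(For what it's worth, the counting step itself is small in Haskell too. A
minimal sketch, using only the containers package, with a hypothetical list
of exposed module names standing in for the parsed .cabal data:

```haskell
import qualified Data.Map.Strict as Map

-- The top-level hierarchy of a module name is the part before the
-- first '.', e.g. "Data" for "Data.List".
topLevel :: String -> String
topLevel = takeWhile (/= '.')

-- Count how many modules fall under each top-level hierarchy.
countHierarchies :: [String] -> Map.Map String Int
countHierarchies ms = Map.fromListWith (+) [ (topLevel m, 1) | m <- ms ]
```

The painful part was extracting the module lists from all the tarballs,
not this fold.)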
* acid-state forces us to use a server-process model, rather than having
processes for individual requests run by apache. I don't know if we
would have made this choice anyway, so this may or may not be an
issue. But the current model does mean that adding a feature or fixing
a bug means restarting the process, rather than just installing the
new program in-place.
Someone pointed out that one disadvantage of traditional databases is
that they discourage you from writing as if everything were Haskell
data structures in memory. For example, if you have things of type
    data Foo = Foo {
        str  :: String,
        bool :: Bool,
        ints :: [Int]
      }
stored in a database then you could write either:
    foo <- getFoo 23
    print $ bool foo
or
    b <- getFooBool 23
    print b
The former is what you would more naturally write, but it would require
constructing the whole Foo from the database (including reading an
arbitrary number of Ints). The latter is thus more efficient with a
database backend, but emphasises that you aren't working with regular
Haskell data structures.
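As a concrete sketch of the trade-off, with an in-memory Map standing in
for the database table (getFoo and getFooBool are illustrative names, not
real Hackage APIs):

```haskell
import qualified Data.Map as Map

data Foo = Foo { str :: String, bool :: Bool, ints :: [Int] }

-- An in-memory table standing in for the database (illustration only).
table :: Map.Map Int Foo
table = Map.fromList [(23, Foo "twenty-three" True [1 .. 1000])]

-- Whole-record access: the natural style, but a real database backend
-- would have to materialise every column, including the whole [Int] list.
getFoo :: Int -> Maybe Foo
getFoo k = Map.lookup k table

-- Single-field access: corresponds to a cheap single-column query, but
-- no longer reads like ordinary Haskell record access.
getFooBool :: Int -> Maybe Bool
getFooBool k = bool <$> Map.lookup k table
```

With acid-state the whole value is in memory anyway, so the two styles
cost about the same; with a SQL backend the difference can be large.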
This is even more notable with the Cabal types (like PackageDescription),
as the types and various utility functions already exist - although it's
currently somewhat moot, as the current acid-state backend doesn't keep
the Cabal data structures in memory anyway.
The other issue raised is performance. I'd want to see (full-size)
benchmarks before commenting on that.
Has anyone else got any thoughts?
On a related note, I think it would be a little nicer to store blobs as
e.g.

    54/54fb24083b14b5916df11f1ffcd03b26/foo-1.0.tar.gz

rather than

    54/54fb24083b14b5916df11f1ffcd03b26
I don't think that this breaks anything, so it should be noncontentious.
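The mapping itself would be a one-liner; a hedged sketch using the
filepath package (blobPath is an illustrative name, not the actual
Hackage server function):

```haskell
import System.FilePath ((</>))

-- Build the on-disk path for a blob: the first two hex digits of the
-- hash as a directory, then the full hash, then the original file name.
blobPath :: String -> FilePath -> FilePath
blobPath hash name = take 2 hash </> hash </> name
```

So blobPath "54fb24083b14b5916df11f1ffcd03b26" "foo-1.0.tar.gz" produces
the path shown above, and the file name at the end lets tools identify
the blob without consulting the database.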
Thanks
Ian