Hackage 2 and acid-state vs traditional databases

Thu Sep 6 21:38:58 CEST 2012

Hi Ian,

We used acid-state (actually happstack-state) at Silk for our session
store. We had the same problems you describe: slow shutdown/startup,
high memory usage, unable to inspect the data. We recently switched to
an SQL database. Just another data point.

Erik

On Thu, Sep 6, 2012 at 8:49 PM, Ian Lynagh <ian at well-typed.com> wrote:
>
> Hi all,
>
> I've had a bit of experience with Hackage 2 and acid-state now, and I'm
> not convinced that it's the best fit for us:
>
> * It's slow. It takes about 5 minutes for me to stop and then start the
>   server. It's actually surprising just how slow it is, so it might be
>   possible/easy to get this down to seconds, but it still won't be
>   instantaneous.
>
> * Memory usage is high. It's currently in the 700M-1G range, and to get
>   it that low I had to stop the parsed .cabal files from being held in
>   memory (which presumably has an impact on performance, although I
>   don't know how significant that is), and disable the reverse
>   dependencies feature. It will grow at least linearly with the number
>   of package/versions in Hackage.
>
> * Only a single process can use the database at once. For example, if
>   the admins want a tool that will make it easier for them to approve
>   user requests, then that tool needs to be integrated into the Hackage
>   server (or talk to it over HTTP), rather than being standalone.
>
> * The database is relatively opaque. While in principle tools could be
>   written for browsing, modifying or querying it, currently none exist
>   (as far as I know).
>
> * The above two points mean that, for example, there was no easy way
>   for me to find out how many packages use each top-level module
>   hierarchy (Data, Control, etc). This would have been a simple SQL
>   query if the data had been in a traditional database, but as it was I
>   had to write a Haskell program to process all the package .tar.gz
>   files and parse the .cabal files manually.
>
> * acid-state forces us to use a server-process model, rather than having
>   processes for individual requests run by apache. I don't know if we
>   would have made this choice anyway, so this may or may not be an
>   issue. But the current model does mean that adding a feature or fixing
>   a bug means restarting the process, rather than just installing the
>   new program in-place.
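
For the module-hierarchy question in the list above, the counting step itself is short once the module names are available. A minimal sketch, assuming a hypothetical list of (package name, module names) pairs; the sample data is made up, and extracting the names from the .cabal files is not shown:

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Hypothetical input: a package name paired with its module names.
type Package = (String, [String])

-- Top-level hierarchy of a module name, e.g. "Data.Map" -> "Data".
topLevel :: String -> String
topLevel = takeWhile (/= '.')

-- Count how many packages use each top-level hierarchy.
-- A package counts once per hierarchy, however many modules it has there.
hierarchyCounts :: [Package] -> Map.Map String Int
hierarchyCounts pkgs = Map.fromListWith (+)
    [ (h, 1)
    | (_, mods) <- pkgs
    , h <- Set.toList (Set.fromList (map topLevel mods))
    ]

-- Made-up sample data, for illustration only.
samplePackages :: [Package]
samplePackages =
    [ ("foo", ["Data.Foo", "Data.Foo.Internal", "Control.Foo"])
    , ("bar", ["Data.Bar"])
    ]

main :: IO ()
main = mapM_ print (Map.toList (hierarchyCounts samplePackages))
```

With the data in a relational store, the same grouping would presumably be a single GROUP BY query, which is the contrast the list item draws.
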
>
> Someone pointed out that one disadvantage of traditional databases is
> that they discourage you from writing as if everything were Haskell
> data structures in memory. For example, if you have things of type
>     data Foo = Foo {
>         str :: String,
>         bool :: Bool,
>         ints :: [Int]
>     }
> stored in a database then you could write either:
>     foo <- getFoo 23
>     print $ bool foo
> or
>     b <- getFooBool 23
>     print b
>
> The former is what you would more naturally write, but would require
> constructing the whole Foo from the database (including reading an
> arbitrary number of Ints). The latter is thus more efficient with the
> database backend, but emphasises that you aren't working with regular
> Haskell data structures.
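
The difference between the two access styles can be made concrete with a toy model. This is only a sketch under stated assumptions: the `Db` type below is a hypothetical stand-in for a database that stores one column per field, and the lookups take it as an explicit argument and are pure, whereas a real backend would run in IO as in the quoted snippets.

```haskell
import qualified Data.Map as Map

data Foo = Foo
    { str  :: String
    , bool :: Bool
    , ints :: [Int]
    } deriving (Show, Eq)

-- Hypothetical stand-in for a database: one "column" per field, keyed by id.
data Db = Db
    { strCol  :: Map.Map Int String
    , boolCol :: Map.Map Int Bool
    , intsCol :: Map.Map Int [Int]
    }

-- Rebuilds the whole record, so every column is consulted.
getFoo :: Db -> Int -> Maybe Foo
getFoo db k = Foo <$> Map.lookup k (strCol db)
                  <*> Map.lookup k (boolCol db)
                  <*> Map.lookup k (intsCol db)

-- Consults only the one column that is needed.
getFooBool :: Db -> Int -> Maybe Bool
getFooBool db k = Map.lookup k (boolCol db)

-- Made-up sample data, for illustration only.
sampleDb :: Db
sampleDb = Db (Map.fromList [(23, "hello")])
              (Map.fromList [(23, True)])
              (Map.fromList [(23, [1, 2, 3])])

main :: IO ()
main = do
    print (bool <$> getFoo sampleDb 23)   -- natural style, reads all fields
    print (getFooBool sampleDb 23)        -- field-specific style
```

Both calls give the same answer; the second simply never touches the `ints` column, which is where the cost would lie for a large list.
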
>
> This is even more notable with the Cabal types (like PackageDescription)
> as the types and various utility functions already exist - although it's
> currently somewhat moot as the current acid-state backend doesn't keep
> the Cabal data structures in memory anyway.
>
>
> The other issue raised is performance. I'd want to see (full-size)
> benchmarks before commenting on that.
>
>
> Has anyone else got any thoughts?
>
>
>
> On a related note, I think it would be a little nicer to store blobs as
> e.g.
>     54/54fb24083b14b5916df11f1ffcd03b26/foo-1.0.tar.gz
> rather than
>     54/54fb24083b14b5916df11f1ffcd03b26
>
> I don't think that this breaks anything, so it should be noncontentious.
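
The two blob layouts can be written as small path-building functions. This is a sketch only: the function names are hypothetical, and it assumes (from the example paths alone) that the directory is the first two characters of the hash.

```haskell
-- Current layout: <first two hash characters>/<full hash>
blobPathOld :: String -> FilePath
blobPathOld hash = take 2 hash ++ "/" ++ hash

-- Proposed layout: the same directories, with the original file name appended.
blobPathNew :: String -> String -> FilePath
blobPathNew hash name = blobPathOld hash ++ "/" ++ name

main :: IO ()
main = do
    let hash = "54fb24083b14b5916df11f1ffcd03b26"
    putStrLn (blobPathOld hash)
    putStrLn (blobPathNew hash "foo-1.0.tar.gz")
```

Under the proposed layout the hash becomes a directory rather than a file, so existing blobs would need to be moved once.
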
>
>
> Thanks
> Ian