[Haskell-cafe] A distributed and replicating native Haskell database
Paul Johnson
paul at cogito.org.uk
Fri Feb 2 10:06:53 EST 2007
Joel Reymont wrote:
> Folks,
>
> Allegro Common Lisp has AllegroCache [1], a database built on B-Trees
> that lets one store Lisp objects of any type. You can designate
> certain slots (object fields) as key and use them for lookup. ACL used
> to come bundled with the ObjectStore OODBMS for the same purpose but
> then adopted a native solution.
>
> AllegroCache is not distributed or replicating but supports automatic
> versioning. You can redefine a class and new code will store more (or
> less) data in the database while code that uses the old schema will
> merrily chug along.
That implies being able to put persistent code into the database. Easy
enough in Lisp, less easy in Haskell. How do you serialize it?
As a rule, storing functions along with data is a can of worms. Either
you actually store the code as a BLOB or you store a pointer to the
function in memory. Either way you run into problems when you upgrade
your software and expect the stored functions to work in the new context.
> Erlang [2] has Mnesia [3] which lets you store any Erlang term
> ("object"). It stores records (tuples, actually) and you can also
> designate key fields and use them for lookup. I haven't looked into
> this deeply but Mnesia is built on top of DETS (Disk-based Term
> Storage) which most likely also uses a form of B-Trees.
Erlang also has a very disciplined approach to code updates, which
presumably helps a lot when functions are stored.
>
> Mnesia is distributed and replicated in real-time. There's no
> automatic versioning with Mnesia but user code can be run to read old
> records and write new ones.
>
> Would it make sense to build a similar type of a database for Haskell?
> I can immediately see how versioning would be much harder as Haskell
> is statically typed. I would love to extend recent gains in binary
> serialization, though, to add indexing of records based on a
> designated key, distribution and real-time replication.
I very much admire Mnesia, even though I'm not an Erlang programmer. It
would indeed be really cool to have something like that. But Mnesia is
built on the Erlang OTP middleware. I would suggest that Haskell needs a
middleware with the same sort of capabilities first. Then we can build a
database on top of it.
> What do you think?
>
> To stimulate discussion I would like to ask a couple of pointed
> questions:
>
> - How would you "designate" a key for a Haskell data structure?
I haven't tried compiling it, but something like this (it needs the
MultiParamTypeClasses and FunctionalDependencies extensions):
    class (Ord k) => DataKey a k | a -> k where
        keyValue :: a -> k
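To make the key idea concrete, here is a minimal sketch of how such a class might drive an in-memory index. The Person type and the buildIndex and lookupByKey helpers are hypothetical names invented for illustration, and Data.Map stands in for whatever B-Tree a real database would use.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}

import qualified Data.Map.Strict as Map

-- The class from the message: each record type a determines its key type k.
class Ord k => DataKey a k | a -> k where
  keyValue :: a -> k

-- Hypothetical record type, used only for illustration.
data Person = Person { personId :: Int, personName :: String }
  deriving (Eq, Show)

-- Designate personId as the key for Person.
instance DataKey Person Int where
  keyValue = personId

-- Build an in-memory index from the designated key to the record.
-- If two records share a key, the later one in the list wins.
buildIndex :: DataKey a k => [a] -> Map.Map k a
buildIndex = Map.fromList . map (\x -> (keyValue x, x))

-- Look a record up by its designated key.
lookupByKey :: DataKey a k => k -> Map.Map k a -> Maybe a
lookupByKey = Map.lookup
```

The functional dependency is what lets the compiler work out that an index of Person values is keyed by Int without an annotation at each use.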
> - Is the concept of a schema applicable to Haskell?
The real headache is type safety. Erlang is entirely dynamically typed,
so untyped schemas with column values looked up by name at run-time fit
right in, and it's up to the programmer to manage schema and code
evolution to prevent errors. Doing all this in a statically type safe
way is another layer of complexity and checking.
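For comparison, this is roughly what the untyped, look-up-by-name style looks like when transcribed into Haskell. It is only a sketch: the Value and Row types and the getInt helper are made up for illustration and are not taken from Mnesia or any existing library.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical dynamically typed column value.
data Value = VInt Int | VText String
  deriving (Eq, Show)

-- An untyped row: column names mapped to values, checked only at run time.
type Row = Map.Map String Value

-- Reading an integer column can fail in two ways the compiler cannot see:
-- the column may be missing, or it may hold a value of another type.
getInt :: String -> Row -> Either String Int
getInt col row = case Map.lookup col row of
  Nothing       -> Left ("no such column: " ++ col)
  Just (VInt n) -> Right n
  Just _        -> Left ("column is not an integer: " ++ col)
```

Every Left here is an error that a statically checked schema would have to rule out at compile time, which is where the extra layer of complexity comes from.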
Actually this is just another special case of the middleware problem.
If we have two processes, A and B, that need to communicate then they
need to agree on a protocol. Part of that protocol is the data types. If
B is a database then this reduces to the schema problem. So let's look at
the more general problem first and see if we can solve that.
There are roughly two ways for A and B to agree on the protocol. One is
to implement the protocol separately in A and B. If it is done correctly
then they will work together. But this is not statically checkable
(ignoring state machines and model checking for now). This is the Erlang
approach, because dynamic checking is the Erlang philosophy.
Alternatively the protocol can be defined in a special purpose protocol
module P, and A and B then import P. This is the approach taken by CORBA
with IDL. However what happens if P is updated to P'? Does this mean
that both A and B need to be recompiled and restarted simultaneously?
Requiring this is a Bad Thing; imagine if every bank in the world had to
upgrade and restart its computers simultaneously in order to upgrade a
common protocol. (This protocol versioning problem was one of the major
headaches with CORBA.) We would have to have P and P' live
simultaneously, and have processes negotiate the latest version of the
protocol that they both support when they start talking. That way the
introduction of P' does not need to be simultaneous with the withdrawal
of P.
There is still the possibility of a run-time failure at the protocol
negotiation stage of course, if it transpires that the two processes have
no common protocol.
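The negotiation step could be sketched along these lines, assuming each process simply advertises the set of protocol versions it supports. The Version type and the negotiate function are hypothetical, and a real middleware would exchange these sets over the wire rather than pass them as arguments.

```haskell
import qualified Data.Set as Set

-- Hypothetical protocol version number; P might be 1 and P' might be 2.
newtype Version = Version Int
  deriving (Eq, Ord, Show)

-- Each process advertises the set of versions it can speak.
-- The result is the newest version both sides support, or Nothing
-- when there is no common version (the run-time failure case).
negotiate :: Set.Set Version -> Set.Set Version -> Maybe Version
negotiate mine theirs = Set.lookupMax (Set.intersection mine theirs)
```

A process that has been upgraded to speak both P and P' can then keep talking to one that only knows P, so the two ends do not have to be restarted together.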
So we need a DSL which allows the definition of data types and abstract
protocols (i.e. who sends what to whom when) that can be imported by the
two processes (do we need N-way protocols?) on each end of the link. If
we could embed this in Haskell directly then so much the better, but
something that needs preprocessing would be fine too.
However there is a wrinkle here: what about "pass through" processes
which don't interpret the data but just store and forward it? Various
forms of protocol adapter fit this scenario, as does the database you
originally asked about. We want to be able to have these things talk in
a type-safe manner without needing to be compiled with every data
structure they transmit. You could describe these things using type
variables, so that for instance if a database table is created to store
a datatype D then any process reading or writing the data must also use
D, even though the database itself knows nothing more of D than the
name. Similarly a gateway that sets up a channel for datatype D would
not need to know anything more than the name.
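One way to read the type-variable idea is as a phantom type parameter on a table handle. The sketch below assumes that reading. Table, Store, Storable, Account and the helper functions are all hypothetical, and show/read stand in for a real binary serialization.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical typed handle: the phantom parameter d records which datatype
-- the table holds, but the handle itself carries only the table's name.
newtype Table d = Table String

-- The pass-through store sees table names and opaque encoded rows only.
type Store = Map.Map String [String]

-- Stand-in for a real serialization class.
class Storable d where
  encode :: d -> String
  decode :: String -> Maybe d

-- Writers must supply a value of the type the handle was declared with.
-- New rows are added at the front of the table's row list.
insertRow :: Storable d => Table d -> d -> Store -> Store
insertRow (Table name) row = Map.insertWith (++) name [encode row]

-- Readers get values back at the handle's type; a row that fails to
-- decode shows up as Nothing.
readRows :: Storable d => Table d -> Store -> [Maybe d]
readRows (Table name) = map decode . Map.findWithDefault [] name

-- Hypothetical client-side datatype D.
data Account = Account { owner :: String, balance :: Int }
  deriving (Eq, Show, Read)

instance Storable Account where
  encode = show
  decode s = case reads s of
    [(x, "")] -> Just x
    _         -> Nothing

accounts :: Table Account
accounts = Table "accounts"
```

Note that nothing in this sketch stops another process from declaring a handle with the same name at a different type, so a real system would still need a run-time check that the name and the datatype agree when the table is opened.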
Paul.