[Haskell-cafe] Where is Data.Atom ?

Sun Jul 2 18:34:03 EDT 2006

Brian Hulley wrote:
> So perhaps my original spec is impossible to implement, though it is
> an open question whether some very clever encoding (with corresponding
> implementation of <) could be found which would lead to a better
> average performance (whatever that means).
>
> An alternative design for an atom module could be:
>
>      create :: MonadIO m => String -> m Atom
>      toString :: Atom -> String
>
>      instance Eq Atom          -- O(1)
>      instance Ord Atom        -- O(1) but depends on creation order
>
> but here the < would not be lexicographic, so although it would be
> useful for implementing symbol tables, environments etc it's not
> ideal for GUI use (eg when displaying a tree of modules where
> everything should be listed alphabetically).

Eureka! There *is* a way to get O(1) lexicographic comparisons of Atoms :-) 
although at the expense of O(n + k) but possibly amortized O(n + ?) creation 
time where n is the length of the string and k is the number of Atoms in 
existence at any given time. The previous reason for needing to be 
clairvoyant in the choice of Unique (or Integer or Int) to represent an Atom 
was to allow expressions such as

     atom "a" < atom "b"

because when the code has determined the representation for the first atom 
the creation of the second atom mustn't be allowed to change it.

However with monadic creation, such expressions can't be written, and 
therefore it would be possible to represent atoms as follows:

   newtype Atom = Atom (IORef Int) -- or IORef Integer

and when new atoms are created, we could simply update the Int's referenced 
by other atoms to ensure that the relative ordering is still lexicographic. 
If we assume that it's very unlikely we'd need more than 2^31 atoms to exist 
at any given time (if this assumption is wrong we could use Integer instead 
of Int) and that usually we'd have a lot less, the Int's could be allocated 
with gaps between them so that we'd only occasionally need to adjust the 
Int's of the atoms to the left or right when a new atom is inserted in the 
table, and in all cases it's extremely unlikely that we'd need to adjust the 
representation of all the other Atoms (hence amortized O(n) hopefully).

The comparison would then be:

      compare (Atom l) (Atom r) = unsafePerformIO $ do
            li <- readIORef l
            ri <- readIORef r
            return (compare li ri)

which is safe as long as creation of atoms is not allowed inside 
unsafePerformIO (it would be nice if there was a way to tell the typechecker 
that a specific action is not allowed in unsafe IO)

Still there would be some tricky details to work out eg how to ensure that 
the average number of Integer's that need to be updated on creation of each 
new atom is as small as possible, so I'll leave this as an exercise for the 
reader!!! :-)

Essentially it depends on how well the following can be implemented:

      data OrderingToken = ???
      instance Eq OrderingToken
      instance Ord OrderingToken

      create :: MonadIO m => m OrderingToken

      -- x < y |- createBetween x y == z where x < z and z < y
      createBetween :: MonadIO m => OrderingToken -> OrderingToken -> m 
(OrderingToken)

It would be interesting to investigate what the theoretical tradeoffs would 
be between the complexities of the above functions (assuming we want (<) to 
be O(1)) and whether or not they require to be monadic.

Anyway apologies for rambling on,

Brian.

-- 
Logic empowers us and Love gives us purpose.
Yet still phantoms restless for eras long past,
congealed in the present in unthought forms,
strive mightily unseen to destroy us.

http://www.metamilk.com