[Haskell-cafe] Re: Where is Data.Atom?

Tue Jul 4 18:37:13 EDT 2006

Iain Alexander wrote:
> Another suggestion:
>
> Put your strings in an ordered binary tree (other data structures
> might also work here).
>
> Make your Atom an encoding of the structure of the tree (resp. other
> structure).  This is logically a sequence of bits, 0 for left (less
> than), 1 for right (greater than) - if you terminate this
> variable-length sequence with 1, I think it all works out OK.  You
> then encode this sequence of bits in a Word32 (or some other
> convenient item), with implicit trailing zeros.
>
> root ~> 1[000...]
> root.left ~> 01[000...]
> root.right ~> 11[000...]
> ...
>
> There are clearly some questions such as how unbalanced is the tree
> likely to get (since you can't conveniently rebalance it), and what
> to do if your Atom "overflows", but this might give you something to
> work with.

Thanks for the suggestion. However I think that just using a binary tree 
would be equivalent to just using the representation of the original string 
as a sequence of bits (except for "early out" eg representing 0b00001111 as 
just 1111 etc), and any other systematic encoding of strings that preserved 
all the info necessary for lexicographic ordering (that still works when new 
atoms are created later) would also end up using at least the same number of 
bit comparisons on average (since a new information preserving 
representation is just a permutation of the original at the bit level - if 
some strings are represented with fewer bits others would need more bits so 
it would all balance out).

The only way I can see of getting truly O(1) lexicographic comparisons (by 
this I mean O(log P) where P is the maximum number of atoms that can exist 
simultaneously at any point during program execution) would be to have a 
"floating" representation (accessed via an IORef) where the concrete values 
of existing atoms are changed as necessary behind the scenes as new atoms 
are created, but as Chris pointed out this would require a locking operation 
if the atoms were used in a multi-threaded program, so all in all, it may 
well be much faster to just use ByteStrings especially if the Atoms are just 
to be used to represent fairly short strings such as module names or natural 
language words etc.

Still I think this would be an interesting project - Prolog style Atoms with 
O(1) lexicographic ordering and O(n + f k) creation time where k is the size 
of the existing population and f is some function which minimizes the 
amortized time...

Regards, Brian.
-- 
Logic empowers us and Love gives us purpose.
Yet still phantoms restless for eras long past,
congealed in the present in unthought forms,
strive mightily unseen to destroy us.

http://www.metamilk.com