Generic tries (long)

Fri Jun 20 06:51:40 EDT 2008

Hello apfelmus,

Thanks for taking the time to look at this. As what Jamie has posted is
largely based on my own initial efforts I can offer some insight about
what's going on here and why the class API looks like it does.

The first thing to note is that what Jamie has posted is the proposed
class methods *only*. It's not the complete user level map API.
There's also a whole lot of more convenient and sensible looking
functions that are just regular overloaded functions. Unfortunately
these are not visible in the API posted. Among them are things like
this..

size :: GT map k => map a -> Int
size mp = ASINT(addSize mp L(0))
(ASINT and L are a cpp macros)

elemsAscending :: GT map k => map a -> [a]
elemsAscending mp = foldrElemsAscending (:) mp []

assocsAscending :: GT map k => map a -> [(k,a)]
assocsAscending mp = foldrAssocsAscending (\k a assocs -> (k,a):assocs) 
mp []

keysAscending :: GT map k => map a -> [k]
keysAscending mp = foldrKeysAscending (:) mp []

The second thing to note is that the class API has been designed with
the *implementation* of generalised tries in mind. It's not necessary
that any instance is actually implemented as any kind of trie, but it
is necessary that the resulting API contains what's needed to enable
its use in other instances that are based on generalised tries.

So the actual class methods chosen are designed to be convenience for
generalised trie implementation, not typical map users. The types
are chosen to reflect how they will likely be used in other generalised
trie implementations and the functionality provided is what it seems
is actually needed to do this efficiently.

What the class API contains has kind of evolved empirically with stuff
being added as and when it was discovered it was needed (by implementing
a few common instancance by hand).

Here's my thoughtS about your specific observations..

apfelmus wrote:
> 1) Some terminology seems queer to me. In particular, I don't like the 
> terms "assoc" and "association" in the documentation. The Data.Map 
> documentation uses "key/value pair" or just "key" and "value"/"element" 
> instead, which I think is much better. For instance:
> 
>   "Insert a new association in the map"     :(
>   "Insert a new key and value in the map."  :)
> 
> Also, I like  fromList  better than  fromAssocs .

Well this is one of those bike shed arguments :-) I'm easy about it so
if Jamie agrees with you that's fine. Use of the term "association"
seems quite common. I've seen numereous uses of the term "association
list", never seen anyone talk about "key/value pair list".

> 2) Don't use Int# , looks like a premature optimization to me. 
> Furthermore, I'd change the queer  addSize  to simply
> 
>   size :: map a -> Int

This does exist (see above), but it's not a class method. One could
argue that is was unnecessary to make unboxing explicit. It's something
I got into the habit of doing because it's way easier to do that than
is inspecting ghc's output to make it's done it on its own (and figuring
out what to do about it if it hasn't). Also, because of the nested
nature generalised tries addSize is more convenient for implementors
than size IMO.

> 3)  insert  is strange. Why not use the convention in Data.Map, name it 
>  insertWith  and have
> 
>   insert :: k -> a -> map a -> map a
> instead? 

Yes, this this should probably be added to the user API. As a choice of
primitive class method, the current form seems more appropriate (though
perhaps not it's name). All the common user level variants can easily
be implemented with this primitive.

 > Similar for  union .

Actually I don't like Data.Maps union much. It would be deprecated if I
ruled the world. I think users should be always be made to specify
how overlapping values are to be combined (or discarded or whatever).
So the name union is now free we don't need to qualify the function that
is explicit about this using a "With" suffix.

> 4) Most functions, in particular the  ..Maybe  variants have fixed 
> default definitions. I wouldn't include them in the class. How about a 
> minimal class interface along the lines of
> 
>   class GMap map k | map -> k where
>      empty    :: map a
>      null     :: map a -> Bool
> 
>      lookup   :: k -> map a -> Maybe a
>      alter    :: (Maybe a -> Maybe a) -> k -> map a -> map a
>      merge    :: (k -> Maybe a -> Maybe b -> Maybe c) -> map a -> map b 
> -> map c
> 
>      fromList :: [(k,a)] -> map a
>      toList   :: map a -> [(k,a)]
> 
>   instance Functor map where ...
> 
> and implementing the various variants as normal polymorphic functions? Like
> 
>   insert k a    = alter (const $ Just a) k
>   singleton k a = insert k a empty
> 
> IMHO, having a few "flag" functions like
> 
>   difference :: k -> Maybe a -> Maybe b -> Maybe b
>   union      :: k -> Maybe a -> Maybe a -> Maybe a
>   intersect  :: k -> Maybe a -> Maybe a -> Maybe a
> 
> that can be plugged into  merge  is much nicer than having to 
> remember/lookup  all the  ..Maybe  and whatnot'' variants.

I've never been very keen on alter myself (on efficiency grounds) and
was wondering whether of not to include it. If the "altering" results
in an unchanged map it would be nice to just return the unchanged
map (rather than duplicate all nodes on the search path). There are
other possible alternatives to alter that are more efficient in this
respect.

The merge function looks like an interesting idea, but it's not clear
to me that it can always (or ever even :-) be implemented as efficiently
as the more specialised versions. Maybe as and experiment we could
implement it and if it turns out that union,intersection etc can be
implemeted using it with liitle or no extra cost then we could put them
in the convinience API instead (not as class methods).

> 5) I think the following functions are queer
> 
>   lookupCont f k m = lookup k m >>= f

Or perhaps..
  lookup = lookupCont Just
Yes, there's probably unwanted duplication in the class methods there.
My vote would be to keep lookupCont and a class method and have lookup
as regular overloaded function.

>   pair k1 k2       = if k1 == k2 then Nothing else
>      Just $ \a1 a2 -> insert k1 a1 (insert k2 a2 empty)
> 
> What use is  pair  ? I'd definitively put them outside the class if not 
> remove them entirely.

Again, pair is one of those things that a typical Map user wouldn't use,
but is definitely needed to implement an efficient trie for Lists and
probably product types in general. The reason for it's existance is that
singleton maps need special treatment (you want to avoid making long
chains of them).

Unfortunately the above definition is inefficient. You've already done
most of the work requred to evaluate pair in the first equality test.
This would all have to be repeated in the second insertion. Have a look
at the ListGT module to see how pair is defined and used for the gory
details.

> 6)  status  is an interesting idea with a meaningless name. Don't you 
> have a name that is more to the point, that expresses the nub (pun 
> intended) of what  status  does?

If you're suggesting it should be called nub, that seems confusing
considering that name is already used in Data.List to mean something
quite different. Perhaps you can suggest some other name?

> 7) I'm overwhelmed by the many  foldr...Asc/Descending  functions. They 
> all can be expressed in terms of  toListAsc  and  toListDesc  and I 
> don't think that "marginally faster" is a reason not to do so.

..or vice-versa, but without the need for list deforestation to get the
same efficiency.  Also I think we need to distinguish between the
variants that require key reconstruction from those that don't (keys
generally aren't stored in a trie).

 > I'd throw away the  from...L  functions, too.

The L versions can improve performance in some map implementations where
length is needed and is already known as by product of other stuff
that's going on. In such cases it seems to make sense to pass it as an
argument rather than incur the cost of evaluating it again from scratch.

Anway, I don't think we should get to worried about the precise details
of class methods right now. I think the main concern should be getting
the class hierarchy right, I'm not sure that it at present. e.g. There
may be specialised representations that support lookup very efficiently
but not much else. Should we have a separate class for them? Should
we have separate classes for implementations that store keys (hence no
key reconstruction cost) and those that must reconstruct keys? Then
there's the ordering can of worms too. This is the sort of thing that
really needs to be got right from the start IMHO.

Regards
--
Adrian Hey