RFC: Data.StringMap

Axel Simon A.Simon at kent.ac.uk
Thu Aug 18 08:50:58 EDT 2005

On Thu, 2005-08-18 at 12:17 +0100, Adrian Hey wrote:
> On Thursday 18 Aug 2005 8:27 am, Axel Simon wrote:
> > On Thu, 2005-08-18 at 08:25 +0100, Adrian Hey wrote:
> > > Also, I'd be grateful someone could tell me what the
> > > data structure I've used is actually called :-)
> >
> > AFAIK, what you've implemented is called a trie, pronounced "try" from
> > retrieval.
> Thanks, I guessed it is some form of trie but I'm a bit confused about
> the precise definitions of various very similar data structures. I was
> looking at..
>  http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/

Ah! I thought it was a benign/naive question, but it really was a trap
and I fell for it...

> I'm not entirly clear about the difference between "tries", "compact tries"
> (depends what you call compact AFAICS) and "PATRICIA tries", and is a PATRICIA
> trie the same thing as a "PATRICIA tree"?
> The TString type is basically a linked list which may or may not be terminated
> with an N way fork (in the form of an AVL tree, N >=2), but then strings are
> linked lists anyway in Haskell. So is this compact or not I wonder?

I guess in that sense there are no compact tries in Haskell. Unless you
use PackedString.

> It's compact in the sense that it doesn't require an AVL tree at every
> character (most of which would be singletons in practice). I guess you
> could achieve the same thing with this definition..
> data TString v = TString {-# UNPACK #-} !Char String v !(AVL (TString v))
> .. which might look more compact, but isn't really AFAICS (well not while
> Strings are linked lists of boxed Chars).

There is still a difference: If you look up a word, the suffix of the
word will never be evaluated. In particular, the chain of TString nodes
containing a character each will never be built. This could avoid
potential space leaks. Furthermore, a compiler might do some
optimisation magic on lists, which doesn't apply to your data type.

> Replacing the String with a packed string might be better, but I'm not
> sure how efficient those are in reality (and if I was to go that route
> it might be a lot simpler and more efficient to use a completely different
> approach, such as hashing and a pure AVL tree say).

PackedStrings are certainly not the right thing to put into a general-
purpose data structure. The problem is that if you request one character
from a thunk that contains a conversion to PackedString, the whole
packed string will be built before this single character can be
extracted again. Hence the library would have a rather weird evaluation
behaviour. I once fiddled with Olaf Chitil's pretty printing library and
tried to improve it by using PackedStrings. It blew right up into my
face since it forced stuff which was nicely delayed before, rendering
all benefits of the concise representation useless.

My 5p,

More information about the Libraries mailing list