[Haskell-cafe] Layered maps

Sun Oct 10 02:09:07 EDT 2010

On 10/9/10 2:02 AM, Alex Rozenshteyn wrote:
> This came up as I was doing homework for natural language processing.
>
> I'm constructing a trigram model from training data, but I also need the
> bigram and unigram counts.  I could, for each triple of characters, add the
> 3 tuple to a trigram map and increment its count (I know I'm not actually
> mutating anything), add the last two letters to a bigram map, and add the
> last character to a unigram map.
>
> But this looks very much like a tree...  each node contains the character
> (the key) and list of characters that can appear after it.  (I think this is
> a trie, but I don't know very much about tries.)  The problem is that lookup
> is more horribly inefficient than I can stand, if I use lists.

Yep, that's a trie :)

For what it's worth, in the HMM tagger I'm working on now, I'm using 
something to the effect of:

     type IntTrie a = IntMap (a, IntTrie a)

Of course that's not a valid type synonym and I'm not really using 
infinite depth (though there's no reason not to allow it), but that's 
the idea. The Ints are gotten by interning the word/tag strings and I 
actually do a lot of newtype wrapping to ensure that my IDs for words, 
tags, etc and their maps are all kept distinct. In order to update your 
counts during training you can use things like:

     succZ z = alter (Just . succ . maybe 0 id) z

     succYZ y z =
         alter (Just . (succ *** succZ z) . maybe (0,empty) id) y

     succYZ x y z =
         alter (Just . (succ *** succYZ y z) . maybe (0,empty) id) x

And to look things up you can use things like:

     lookupZ       z = maybe 0 id . fst . lookup z
     lookupYZ    y z = lookupZ    z . snd <=< lookup y
     lookupXYZ x y z = lookupYZ y z . snd <=< lookup x

Clearly there's a lot of boilerplate there, but it works and it's very 
efficient. Ironically, combining the unigram, bigram, and trigram tables 
actually decreases performance (slightly) on my benchmarks; but I'm 
keeping them combined in order to reduce memory pressure.

You can also take a look at bytestring-trie[1] which is very similar and 
provides a nicer interface. Unfortunately, there's currently a big 
performance hit for using bytestring-trie, which I'm in the process of 
tracking down. Most of it is probably because it's Word8-based instead 
of Int-based, but I'd like to be sure. bytestring-trie does expose the 
trie-ness (more to come in the HEAD version). Of course, for HMM stuff, 
you don't really take advantage of the trie structure, so there's 
nothing special about using a trie instead of a Map NGram where,

     data NGram
         = Unigram {-# UNPACK #-} !Int
         | Bigram  {-# UNPACK #-} !Int {-# UNPACK #-} !Int
         | Trigram {-# UNPACK #-} !Int {-# UNPACK #-} !Int
                   {-# UNPACK #-} !Int
         deriving (Eq, Ord, Read, Show)

Expanding that out to three different IntMaps or the nested IntMaps is 
just a performance improvement, which I'm sure won't be important for 
your homework (i.e., it's not asymptotically significant).

[1] http://hackage.haskell.org/package/bytestring-trie

-- 
Live well,
~wren