[Haskell-cafe] Layered maps
wren ng thornton
wren at freegeek.org
Sun Oct 10 02:09:07 EDT 2010
On 10/9/10 2:02 AM, Alex Rozenshteyn wrote:
> This came up as I was doing homework for natural language processing.
>
> I'm constructing a trigram model from training data, but I also need the
> bigram and unigram counts. I could, for each triple of characters, add the
> 3 tuple to a trigram map and increment its count (I know I'm not actually
> mutating anything), add the last two letters to a bigram map, and add the
> last character to a unigram map.
>
> But this looks very much like a tree... each node contains the character
> (the key) and list of characters that can appear after it. (I think this is
> a trie, but I don't know very much about tries.) The problem is that lookup
> is more horribly inefficient than I can stand, if I use lists.
Yep, that's a trie :)
For what it's worth, in the HMM tagger I'm working on now, I'm using
something to the effect of:
type IntTrie a = IntMap (a, IntTrie a)
Of course that's not a valid type synonym and I'm not really using
infinite depth (though there's no reason not to allow it), but that's
the idea. The Ints are gotten by interning the word/tag strings and I
actually do a lot of newtype wrapping to ensure that my IDs for words,
tags, etc and their maps are all kept distinct. In order to update your
counts during training you can use things like:
succZ z = alter (Just . succ . maybe 0 id) z
succYZ y z =
alter (Just . (succ *** succZ z) . maybe (0,empty) id) y
succYZ x y z =
alter (Just . (succ *** succYZ y z) . maybe (0,empty) id) x
And to look things up you can use things like:
lookupZ z = maybe 0 id . fst . lookup z
lookupYZ y z = lookupZ z . snd <=< lookup y
lookupXYZ x y z = lookupYZ y z . snd <=< lookup x
Clearly there's a lot of boilerplate there, but it works and it's very
efficient. Ironically, combining the unigram, bigram, and trigram tables
actually decreases performance (slightly) on my benchmarks; but I'm
keeping them combined in order to reduce memory pressure.
You can also take a look at bytestring-trie[1] which is very similar and
provides a nicer interface. Unfortunately, there's currently a big
performance hit for using bytestring-trie, which I'm in the process of
tracking down. Most of it is probably because it's Word8-based instead
of Int-based, but I'd like to be sure. bytestring-trie does expose the
trie-ness (more to come in the HEAD version). Of course, for HMM stuff,
you don't really take advantage of the trie structure, so there's
nothing special about using a trie instead of a Map NGram where,
data NGram
= Unigram {-# UNPACK #-} !Int
| Bigram {-# UNPACK #-} !Int {-# UNPACK #-} !Int
| Trigram {-# UNPACK #-} !Int {-# UNPACK #-} !Int
{-# UNPACK #-} !Int
deriving (Eq, Ord, Read, Show)
Expanding that out to three different IntMaps or the nested IntMaps is
just a performance improvement, which I'm sure won't be important for
your homework (i.e., it's not asymptotically significant).
[1] http://hackage.haskell.org/package/bytestring-trie
--
Live well,
~wren
More information about the Haskell-Cafe
mailing list