[Haskell-cafe] Handling a large database (of ngrams)

Sun May 22 14:40:38 CEST 2011

Hi Wren,

First of all, thanks for your elaborate answer! Your input is very much
appreciated!

On Sat, May 21, 2011 at 10:42:57PM -0400, wren ng thornton wrote:
> I've been working on some n-gram stuff lately, which I'm hoping to
> put up on Hackage sometime this summer (i.e., in a month or so,
> after polishing off a few rough edges).

Unfortunately, my current time constraints don't allow me to wait for your code
:-( However, research on that system might continue well into the summer (if I
can manage to get paid for it…) so I'll keep an eye out for your announcement!

> Unless you really need to get your hands dirty, the first thing you
> should do is check out SRILM and see if that can handle it. Assuming
> SRILM can do it, you'd just need to write some FFI code to call into
> it from Haskell. This is relatively straightforward, and if it
> hasn't been done already it'd be a great contribution to the Haskell
> NLP community. One great thing about this approach is that it's
> pretty easy to have the SRILM server running on a dedicated machine,
> so that the memory requirements of the model don't interfere with
> the memory requirements of your program. (I can give you pointers to
> Java code doing all this, if you'd like.) One big caveat to bear in
> mind is that SRILM is not threadsafe, so that'll get in the way of
> any parallelism or distributed computing on the client side of
> things. Also, SRILM is hell to install.

I don't need distributed computing, but I haven't used SRILM based on the fact
that I'd have to write a wrapper around it, and because I need to do some custom
language model stuff for which I need to write functions that operate on ngram
probabilities directly, and I want to try out different machine learning methods
to compare their merits for my research. I'll see if SRILM is flexible enough,
to do this though.

> If you have too much trouble trying to get SRILM to work, there's
> also the Berkeley LM which is easier to install. I'm not familiar
> with its inner workings, but it should offer pretty much the same
> sorts of operations.

Do you know how BerkeleyLM compares to, say MongoDB and PostgresQL for large
data sets? Maybe this is also the wrong list to ask for this kind of question.

> But, with that in mind I can offer a few pointers for doing it in
> Haskell. The first one is interning. Assuming the data is already
> tokenized (or can be made to be), then it should be straightforward
> to write a streaming program that converts every unique token into
> an integer (storing the conversion table somewhere for later use).

This is what I meant by using the flyweight pattern. There's of course also the
possibility of computing a hash of every string, but I don't want to deal with
hash collisions. While they are largely avoidable (using, say, SHA1) but since
that would force a multi-byte index, I don't know if that would help too much,
seeing as the average word length isn't dramatically big.

> It doesn't even need to be complete morphological
> segmentation, just break on the obvious boundaries in order to get
> the average word size down to something reasonable.

I will do my best to avoid having to do this. My current research target is
German, which is richer in morphology than English, but not as much as Turkic,
Slavic or Uralic languages. In case of a strong inflectional language or even an
agglutinating or polysynthetic one, the need for a smarter morphological
analysis would arise. Unfortunately, this would have to be in the form of
language-specific morphological analysers.

Since this is something I would want to do in general anyway, I might write a
morphological analyser for German that breaks words down to case and tense
markings as well as lemmas, but this isn't the focus of my project, so I'll
first try to do without.

> For regular projects, that integerization would be enough, but for
> your task you'll probably want to spend some time tweaking the
> codes. In particular, you'll probably have enough word types to
> overflow the space of Int32/Word32 or even Int64/Word64.

I will, most likely, because of the long tail of word frequencies, run into
problems with integer space; not only that, but these ultra-rare words aren't
really going to be of much use for me. I'm debating using a trie for common
words while keeping a backlog of rare words in a mongoDB instance. Rare words
could graduate from there if they occur frequently enough, and get accepted into
the trie for easier access. Everything left over in the long tail end would just
be mapped to RARE<TAG> in the n-grams, where TAG refers to the part of speech
tag (since my corpus is pos-tagged.) What number constitutes "rare" and "common"
will have to be subject to experimentation.

A bloom filter might guard the trie, too. Since BFs can tell you if a certain
element is certainly *not* in a collection, I could cut down search operations
on the trie itself for all the rare words. AFAICR there's a chapter in RWH on
building Bloom filters.

There are other ways to cut down the variance of ngrams in large corpora:
parsing dates and other cardinalities and subsuming them under one category;
using named-entity recognition and subsuming all NEs under one category, etc.

> (If not,
> then skip this section and go boldly forward!) Consequently, you
> won't be able to fit an ID into a machine register, so you'll want
> to find a good dynamic representation. If you collect unigram counts
> along the way (or can estimate their frequencies reliably from a
> subcorpus), then you can use a Huffman code, Shannon--Fano code, or
> arithmetic code in order to bring the average codeword length down.
> The vast majority of word tokens belong to very few word types, so
> this coding will enable you to fit the vast majority of tokens into
> a single machine word, saving a single bit or two to indicate that
> you need to fail-over to passing around an array of bits identifying
> the original token.

That's kinda similar to what I had in mind (in general :-)

> The next question is what sort of trie representation to use.
> Something based on big-endian patricia trees (e.g.,
> containers:Data.IntMap or bytestring-trie:Data.Trie) is a good place
> to start. Though, you'll want to make them wider than
> binary-branching in order to reduce the overhead of the pointers et
> al. If you're still running into memory issues for storing the whole
> trained model, then you should look into cache-oblivious tree
> structures that you can mmap in.

Thanks, this will help me greatly. I didn't know which data structures to use.
I will also have to think about (de)serialization because I want to make these
translation tables persistent.

> If all that still isn't enough, then that's where you start breaking
> into doing research :)

Indeed I will. I'm going to document this implementation in a fairly technical
paper, so if anyone's interested in the result, I can post this on my github or
so. I'll probably try to split of the general analysis part and make a hackage
package out of it and/or post to haskell-nlp.

Regards,
Aleks
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 490 bytes
Desc: Digital signature
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20110522/c56d16a0/attachment.pgp>