[Haskell-cafe] Handling a large database (of ngrams)
wren ng thornton
wren at freegeek.org
Sun May 22 04:42:57 CEST 2011
On 5/21/11 3:56 PM, Aleksandar Dimitrov wrote:
> Hi Haskellers,
>
> I'm unaware of a "good method" or "default way" of handling large datasets to
> which I need non-sequential (i.e. random) access in Haskell.
>
> My use case is linguistic analysis of a ~30GB corpus — the most basic form of
> quantitative analysis here are ngram based HMMs, which aren't difficult to pull
> off in Haskell.
I've been working on some n-gram stuff lately, which I'm hoping to put
up on Hackage sometime this summer (i.e., in a month or so, after
polishing off a few rough edges). The project aims to be suitable for
industrial-scale problems, though I'm not sure if it'll make it quite so
far as 30GB.
Unless you really need to get your hands dirty, the first thing you
should do is check out SRILM and see if that can handle it. Assuming
SRILM can do it, you'd just need to write some FFI code to call into it
from Haskell. This is relatively straightforward, and if it hasn't been
done already it'd be a great contribution to the Haskell NLP community.
One great thing about this approach is that it's pretty easy to have the
SRILM server running on a dedicated machine, so that the memory
requirements of the model don't interfere with the memory requirements
of your program. (I can give you pointers to Java code doing all this,
if you'd like.) One big caveat to bear in mind is that SRILM is not
thread-safe, so that'll get in the way of any parallelism or distributed
computing on the client side of things. Also, SRILM is hell to install.
If you have too much trouble trying to get SRILM to work, there's also
the Berkeley LM which is easier to install. I'm not familiar with its
inner workings, but it should offer pretty much the same sorts of
operations.
But, with that in mind I can offer a few pointers for doing it in
Haskell. The first one is interning. Assuming the data is already
tokenized (or can be made to be), then it should be straightforward to
write a streaming program that converts every unique token into an
integer (storing the conversion table somewhere for later use). This
will reduce the effective corpus size significantly for any natural
language, and dramatically so for morphologically poor languages like
English. For morphologically rich languages the benefits are greatest if
you can do morphological segmentation before interning, though whether
that's viable depends on your NLP task. It needn't be complete
morphological segmentation; just breaking on the obvious boundaries is
enough to get the average word size down to something reasonable.
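A minimal sketch of such an interning pass (the names here are mine, not
from any existing package; a production version would use ByteString
keys, a hash table, and stream the corpus rather than hold a list):

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as M

-- Intern table: maps each seen token to its ID, plus the next fresh ID.
data Interner = Interner !(M.Map String Int) !Int

emptyInterner :: Interner
emptyInterner = Interner M.empty 0

-- Look up a token, allocating a fresh ID the first time we see it.
intern :: Interner -> String -> (Interner, Int)
intern itr@(Interner tbl next) tok =
    case M.lookup tok tbl of
      Just i  -> (itr, i)
      Nothing -> (Interner (M.insert tok next tbl) (next + 1), next)

-- Intern a whole token stream, threading the table along.
internAll :: [String] -> (Interner, [Int])
internAll = mapAccumL intern emptyInterner
```

The final Interner is the conversion table you'd serialize for later
use when decoding IDs back to tokens.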
For regular projects, that integerization would be enough, but for your
task you'll probably want to spend some time tweaking the codes. In
particular, you'll probably have enough word types to overflow the space
of Int32/Word32 or even Int64/Word64. (If not, then skip this section
and go boldly forward!) Consequently, you won't be able to fit an ID
into a machine register, so you'll want to find a good dynamic
representation. If you collect unigram counts along the way (or can
estimate their frequencies reliably from a subcorpus), then you can use
a Huffman code, Shannon-Fano code, or arithmetic code in order to bring
the average codeword length down. The vast majority of word tokens
belong to very few word types, so such a coding will let you fit the
vast majority of tokens into a single machine word, reserving a bit or
two to indicate a fail-over to passing around an array of bits
identifying the original token.
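To make that concrete, here's a toy Huffman coder over (word ID, count)
pairs. It's the textbook algorithm, not tuned for a huge vocabulary, and
it emits codewords as '0'/'1' Strings for readability where a real
implementation would pack bits:

```haskell
import Data.List (insertBy, sortBy)
import Data.Ord (comparing)

-- Leaves carry a word ID and its unigram count.
data HTree = Leaf Int Integer | Node HTree HTree Integer

weight :: HTree -> Integer
weight (Leaf _ w)   = w
weight (Node _ _ w) = w

-- Standard Huffman construction: repeatedly merge the two lightest trees.
buildHuffman :: [(Int, Integer)] -> HTree
buildHuffman xs = go (sortBy (comparing weight) [Leaf i w | (i, w) <- xs])
  where
    go []         = error "buildHuffman: no symbols"
    go [t]        = t
    go (a:b:rest) =
        go (insertBy (comparing weight)
                     (Node a b (weight a + weight b)) rest)

-- Read the codewords off the tree: left = '0', right = '1'.
codebook :: HTree -> [(Int, String)]
codebook t = go t []
  where
    go (Leaf i _)   acc = [(i, reverse acc)]
    go (Node l r _) acc = go l ('0':acc) ++ go r ('1':acc)
```

Frequent word types end up near the root and so get the short codewords,
which is exactly the property you want for packing common tokens into a
machine word.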
If you can get away with not doing those codeword compression tricks,
then you can also gather all your n-gram statistics along the way (this
is what my program does). Or, if you can partition the corpus into a
handful of subcorpora, each of which can get by without the compression
tricks, then you can run your training code separately on each subcorpus
and unify the ID spaces when folding the results together.
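One way to sketch that unification step (names are illustrative): give
each subcorpus's local table, indexed by local ID, and fold them into a
single global table plus a per-subcorpus remapping from local to global
IDs, which you can then apply to that subcorpus's n-gram counts:

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as M

-- Each subcorpus hands back its token table, indexed by local ID.
-- Produce one global token->ID table and, per subcorpus, the vector
-- taking local IDs to global IDs.
unifyIds :: [[String]] -> (M.Map String Int, [[Int]])
unifyIds = mapAccumL remapOne M.empty
  where
    remapOne glob = mapAccumL assign glob
    assign glob tok =
        case M.lookup tok glob of
          Just g  -> (glob, g)
          Nothing -> let g = M.size glob
                     in (M.insert tok g glob, g)
```

Once you have the remapping vectors, folding the per-subcorpus n-gram
counts together is just an index translation followed by a merge.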
Other than interning, another major way to save on space is to use
tries, as you mentioned. In particular, every tree for n-grams fits into
a tree for (n+1)-grams, and since you're storing the (n+1)-gram tree
already you might as well have the n-gram tree reuse the overhead. In my
experience, unifying all the different n-gram trees has at most a
trivial speed cost compared to giving them all dedicated representations.
The next question is what sort of trie representation to use. Something
based on big-endian patricia trees (e.g., containers:Data.IntMap or
bytestring-trie:Data.Trie) is a good place to start. Though, you'll want
to make them wider than binary-branching in order to reduce the overhead
of the pointers et al. If you're still running into memory issues for
storing the whole trained model, then you should look into
cache-oblivious tree structures that you can mmap in.
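As a sketch of that unified representation (built on Data.IntMap here,
so still binary-branching under the hood; a production version would
want the wider nodes just mentioned): one trie over interned IDs with a
count at every node serves all n-gram orders at once, since inserting an
n-gram bumps the count of each of its prefixes along the way.

```haskell
import Data.Maybe (fromMaybe)
import qualified Data.IntMap.Strict as IM

-- One trie over interned token IDs. Since every n-gram's prefix is an
-- (n-1)-gram, a count at each node gives the lower-order statistics
-- for free (as prefix counts).
data NGramTrie = NGramTrie
    { nodeCount :: !Int
    , children  :: !(IM.IntMap NGramTrie)
    }

emptyTrie :: NGramTrie
emptyTrie = NGramTrie 0 IM.empty

-- Insert one n-gram, bumping the count of every prefix on the path.
insertNGram :: [Int] -> NGramTrie -> NGramTrie
insertNGram []     (NGramTrie c ch) = NGramTrie (c + 1) ch
insertNGram (w:ws) (NGramTrie c ch) =
    NGramTrie (c + 1)
              (IM.alter (Just . insertNGram ws . fromMaybe emptyTrie) w ch)

-- Count of a given n-gram prefix (0 if unseen).
lookupNGram :: [Int] -> NGramTrie -> Int
lookupNGram []     t = nodeCount t
lookupNGram (w:ws) t =
    maybe 0 (lookupNGram ws) (IM.lookup w (children t))
```

Feeding in only the highest-order n-grams populates every lower order
as a side effect, which is the "reuse the overhead" point above.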
If all that still isn't enough, then that's where you start breaking
into doing research :)
--
Live well,
~wren