[Haskell-cafe] Handling a large database (of ngrams)
wren ng thornton
wren at freegeek.org
Mon May 23 11:01:33 CEST 2011
On 5/22/11 8:40 AM, Aleksandar Dimitrov wrote:
>> If you have too much trouble trying to get SRILM to work, there's
>> also the Berkeley LM which is easier to install. I'm not familiar
>> with its inner workings, but it should offer pretty much the same
>> sorts of operations.
> Do you know how BerkeleyLM compares to, say MongoDB and PostgresQL for large
> data sets? Maybe this is also the wrong list to ask for this kind of question.
Well, BerkeleyLM is specifically for n-gram language modeling; it's not
a general database. According to the paper I mentioned off-list, the
entire Google Web1T corpus (approx. 1 trillion word tokens, 4 billion
n-gram types) fits into 10GB of memory, which is much smaller than
what SRILM can manage.
Databases aren't really my area so I couldn't give a good comparison.
Though for this scale of data you're going to want to use something
specialized for storing n-grams, rather than a general database. There's
a lot of redundant structure in n-gram counts and you'll want to take
advantage of that.
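To give a rough idea of what exploiting that redundancy can look like, here is a minimal sketch of a count trie over integerized words, where n-grams sharing a prefix share the nodes for that prefix. This is only an illustration under my own assumptions; it is not how BerkeleyLM or SRILM lay out their data, and the names (Trie, insertNGram, lookupCount, fromCounts) are made up for the example.

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- | A count trie. Each node holds the count of the n-gram spelled out by
-- the path from the root, plus its children keyed by the next word ID.
-- (Hypothetical type for illustration only.)
data Trie = Trie
  { nodeCount :: !Int
  , children  :: !(M.Map Int Trie)
  }

emptyTrie :: Trie
emptyTrie = Trie 0 M.empty

-- | Add c to the count of an n-gram given as a list of word IDs.
-- Nodes along a shared prefix are reused rather than duplicated.
insertNGram :: [Int] -> Int -> Trie -> Trie
insertNGram []     c (Trie n kids) = Trie (n + c) kids
insertNGram (w:ws) c (Trie n kids) =
  let child = M.findWithDefault emptyTrie w kids
  in  Trie n (M.insert w (insertNGram ws c child) kids)

-- | Look up the count of an n-gram; 0 if it was never inserted.
lookupCount :: [Int] -> Trie -> Int
lookupCount []     t = nodeCount t
lookupCount (w:ws) t = maybe 0 (lookupCount ws) (M.lookup w (children t))

-- | Build a trie from (n-gram, count) pairs.
fromCounts :: [([Int], Int)] -> Trie
fromCounts = foldl' (\t (ws, c) -> insertNGram ws c t) emptyTrie
```

With this layout the bigram [1,2] and the trigram [1,2,3] store the path 1, 2 only once. A boxed Data.Map trie like this is far too heavy for Web1T-sized data; the specialized toolkits get their savings from much more compact encodings of the same shared-prefix idea.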
>> For regular projects, that integerization would be enough, but for
>> your task you'll probably want to spend some time tweaking the
>> codes. In particular, you'll probably have enough word types to
>> overflow the space of Int32/Word32 or even Int64/Word64.
Again according to Pauls & Klein (2011), Google Web1T has 13.5M word
types, which easily fit into 24 bits (2^24 is about 16.8M). That's for English, so
morphologically rich languages will be different. I wouldn't expect too
many problems for German, unless you have a lot of technical text with a
prodigious number of unique compound nouns. Even then I'd be surprised
if you went over 2^64 (that'd be reserved for languages like Japanese,
Hungarian, Inuit,... if even they'd ever get that bad).
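For concreteness, here is a small sketch of the integerization step and of the bit-width arithmetic behind the 24-bit figure. It assumes words arrive as plain Strings and that IDs are handed out in order of first appearance; the names (Vocab, intern, integerize, bitsNeeded) are hypothetical and not taken from any of the toolkits mentioned.

```haskell
import qualified Data.Map.Strict as M
import Data.List (mapAccumL)
import Data.Word (Word32)

-- | Vocabulary mapping each word type to its integer code.
type Vocab = M.Map String Word32

-- | Return the code for a word, assigning the next free code if the
-- word has not been seen before. Assumes fewer than 2^32 word types.
intern :: Vocab -> String -> (Vocab, Word32)
intern v w = case M.lookup w v of
  Just i  -> (v, i)
  Nothing -> let i = fromIntegral (M.size v)
             in  (M.insert w i v, i)

-- | Replace every token by its code, threading the vocabulary through.
integerize :: [String] -> (Vocab, [Word32])
integerize = mapAccumL intern M.empty

-- | Smallest number of bits that gives each of n word types its own code.
bitsNeeded :: Integer -> Int
bitsNeeded n = length (takeWhile (< n) (iterate (* 2) 1))
```

Here bitsNeeded 13500000 evaluates to 24, and snd (integerize ["the", "cat", "the"]) evaluates to [0,1,0]. For a real corpus you would want ByteString or Text keys and a more compact table than Data.Map, but the bookkeeping is the same.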