[Haskell-cafe] Handling a large database (of ngrams)

Mon May 23 11:01:33 CEST 2011

On 5/22/11 8:40 AM, Aleksandar Dimitrov wrote:
>> If you have too much trouble trying to get SRILM to work, there's
>> also the Berkeley LM which is easier to install. I'm not familiar
>> with its inner workings, but it should offer pretty much the same
>> sorts of operations.
>
> Do you know how BerkeleyLM compares to, say MongoDB and PostgresQL for large
> data sets? Maybe this is also the wrong list to ask for this kind of question.

Well, BerkeleyLM is specifically for n-gram language modeling; it's not 
a general database. According to the paper I mentioned off-list, the 
entire Google Web1T corpus (approx 1 trillion word tokens, 4 billion 
n-gram types) fits into 10GB of memory, which is much less than SRILM 
needs.

Databases aren't really my area, so I couldn't give a good comparison. 
For this scale of data, though, you're going to want something 
specialized for storing n-grams rather than a general database. There's 
a lot of redundant structure in n-gram counts (every prefix of an n-gram 
is itself a shorter n-gram, for instance) and you'll want to take 
advantage of that.
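
To make the redundancy point concrete, here's a minimal sketch of a trie 
over integer word codes, where n-grams that share a prefix store that 
prefix only once. This is just an illustration, not how BerkeleyLM or 
SRILM actually lay things out; the names (Trie, insertNgram, 
lookupCount, storedCodes, fromList) are made up for the example, and a 
Data.Map per node is far too heavy for Web1T-sized data.

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- | One trie node: the count of the n-gram that ends here (0 if that
-- n-gram was never inserted), plus children keyed by the next word's
-- integer code. All names here are hypothetical, for illustration only.
data Trie = Trie !Int (M.Map Int Trie)

emptyTrie :: Trie
emptyTrie = Trie 0 M.empty

-- | Record a count for an n-gram given as a list of word codes.
insertNgram :: [Int] -> Int -> Trie -> Trie
insertNgram []     c (Trie _ kids) = Trie c kids
insertNgram (w:ws) c (Trie n kids) = Trie n (M.insert w child kids)
  where child = insertNgram ws c (M.findWithDefault emptyTrie w kids)

-- | Look up a count; n-grams that were never inserted come back as 0.
lookupCount :: [Int] -> Trie -> Int
lookupCount []     (Trie n _)    = n
lookupCount (w:ws) (Trie _ kids) = maybe 0 (lookupCount ws) (M.lookup w kids)

-- | Number of nodes below the root, i.e. how many word codes are stored.
storedCodes :: Trie -> Int
storedCodes (Trie _ kids) = sum [1 + storedCodes k | k <- M.elems kids]

-- | Build a trie from (n-gram, count) pairs.
fromList :: [([Int], Int)] -> Trie
fromList = foldl' (\t (ws, c) -> insertNgram ws c t) emptyTrie
```

With the three n-grams [1,2], [1,2,3] and [1,2,4], a flat table stores 
eight word codes, while the trie stores four, because the shared prefix 
[1,2] appears once.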

>> For regular projects, that integerization would be enough, but for
>> your task you'll probably want to spend some time tweaking the
>> codes. In particular, you'll probably have enough word types to
>> overflow the space of Int32/Word32 or even Int64/Word64.

Again according to Pauls & Klein (2011), Google Web1T has 13.5M word 
types, which easily fits into 24 bits (2^24 is about 16.8M). That's for 
English, so morphologically rich languages will be different. I wouldn't 
expect too many problems for German, unless you have a lot of technical 
text with a prodigious number of unique compound nouns. Even then I'd be 
surprised if you went over 2^64 (that'd be reserved for languages like 
Japanese, Hungarian, Inuit, ... if even they'd ever get that bad).
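
If you want to check the arithmetic for your own vocabulary, here's a 
rough sketch: bitsNeeded says how many bits it takes to give each of n 
word types its own code, and internAll does the naive integerization 
with a Data.Map. Both names are invented for this example, and Word32 
codes are an assumption that only holds while the vocabulary stays 
under 2^32 types.

```haskell
import qualified Data.Map.Strict as M
import Data.Bits (countLeadingZeros, finiteBitSize)
import Data.List (mapAccumL)
import Data.Word (Word32, Word64)

-- | Bits needed to give each of n distinct word types its own code
-- (codes 0 .. n-1). Hypothetical helper, for illustration only.
bitsNeeded :: Word64 -> Int
bitsNeeded n
  | n <= 1    = 0
  | otherwise = finiteBitSize n - countLeadingZeros (n - 1)

-- | Vocabulary: word type to integer code. Word32 is an assumption;
-- it silently wraps if you ever exceed 2^32 types.
type Vocab = M.Map String Word32

-- | Return the code for a word, assigning the next free code if the
-- word has not been seen before.
intern :: Vocab -> String -> (Vocab, Word32)
intern v w = case M.lookup w v of
  Just c  -> (v, c)
  Nothing -> let c = fromIntegral (M.size v)
             in (M.insert w c v, c)

-- | Integerize a token stream, threading the vocabulary through.
internAll :: [String] -> (Vocab, [Word32])
internAll = mapAccumL intern M.empty
```

For example, bitsNeeded 13500000 gives 24, and internAll (words "the 
cat saw the dog") assigns the codes [0,1,2,0,3], with both occurrences 
of "the" sharing code 0.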

-- 
Live well,
~wren