[Haskell-cafe] Handling a large database (of ngrams)
wren ng thornton
wren at freegeek.org
Sun May 22 04:42:57 CEST 2011
On 5/21/11 3:56 PM, Aleksandar Dimitrov wrote:
> Hi Haskellers,
>
> I'm unaware of a "good method" or "default way" of handling large datasets to
> which I need non-sequential (i.e. random) access in Haskell.
>
> My use case is linguistic analysis of a ~30GB corpus — the most basic form of
> quantitative analysis here are ngram based HMMs, which aren't difficult to pull
> off in Haskell.
I've been working on some n-gram stuff lately, which I'm hoping to put
up on Hackage sometime this summer (i.e., in a month or so, after
polishing off a few rough edges). The project aims to be suitable for
industrial-scale problems, though I'm not sure if it'll make it quite so
far as 30GB.
Unless you really need to get your hands dirty, the first thing you
should do is check out SRILM and see if that can handle it. Assuming
SRILM can do it, you'd just need to write some FFI code to call into it
from Haskell. This is relatively straightforward, and if it hasn't been
done already it'd be a great contribution to the Haskell NLP community.
One great thing about this approach is that it's pretty easy to have the
SRILM server running on a dedicated machine, so that the memory
requirements of the model don't interfere with the memory requirements
of your program. (I can give you pointers to Java code doing all this,
if you'd like.) One big caveat to bear in mind is that SRILM is not
thread-safe, so that'll get in the way of any parallelism or distributed
computing on the client side of things. Also, SRILM is hell to install.
If you have too much trouble trying to get SRILM to work, there's also
the Berkeley LM which is easier to install. I'm not familiar with its
inner workings, but it should offer pretty much the same sorts of
operations.
But, with that in mind I can offer a few pointers for doing it in
Haskell. The first one is interning. Assuming the data is already
tokenized (or can be made to be), then it should be straightforward to
write a streaming program that converts every unique token into an
integer (storing the conversion table somewhere for later use). This
will reduce the effective corpus size significantly for any natural
language, and dramatically so for morphologically poor languages like
English. For morphologically rich languages the benefits are greatest if
you can do morphological segmentation before interning, though whether
that's viable depends on your NLP task. It needn't be complete
morphological segmentation; just breaking on the obvious boundaries is
enough to get the average word size down to something reasonable.
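A minimal sketch of such an interning pass (the names here are mine, not
from any existing package; a production version would use ByteString
keys, a hash table, and stream the corpus rather than hold a list):

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as M

-- Intern table: maps each seen token to its ID, plus the next fresh ID.
data Interner = Interner !(M.Map String Int) !Int

emptyInterner :: Interner
emptyInterner = Interner M.empty 0

-- Look up a token, allocating a fresh ID the first time we see it.
intern :: Interner -> String -> (Interner, Int)
intern itr@(Interner tbl next) tok =
    case M.lookup tok tbl of
      Just i  -> (itr, i)
      Nothing -> (Interner (M.insert tok next tbl) (next + 1), next)

-- Intern a whole token stream, threading the table along.
internAll :: [String] -> (Interner, [Int])
internAll = mapAccumL intern emptyInterner
```

The final Interner is the conversion table you'd serialize for later
use when decoding IDs back to tokens.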
For regular projects, that integerization would be enough, but for your
task you'll probably want to spend some time tweaking the codes. In
particular, you'll probably have enough word types to overflow the space
of Int32/Word32 or even Int64/Word64. (If not, then skip this section
and go boldly forward!) Consequently, you won't be able to fit an ID
into a machine register, so you'll want to find a good dynamic
representation. If you collect unigram counts along the way (or can
estimate their frequencies reliably from a subcorpus), then you can use
a Huffman code, Shannon-Fano code, or arithmetic code in order to bring
the average codeword length down. The vast majority of word tokens
belong to very few word types, so such a coding will let you fit the
vast majority of tokens into a single machine word, reserving a bit or
two to indicate a fail-over to passing around an array of bits
identifying the original token.
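To make that concrete, here's a toy Huffman coder over (word ID, count)
pairs. It's the textbook algorithm, not tuned for a huge vocabulary, and
it emits codewords as '0'/'1' Strings for readability where a real
implementation would pack bits:

```haskell
import Data.List (insertBy, sortBy)
import Data.Ord (comparing)

-- Leaves carry a word ID and its unigram count.
data HTree = Leaf Int Integer | Node HTree HTree Integer

weight :: HTree -> Integer
weight (Leaf _ w)   = w
weight (Node _ _ w) = w

-- Standard Huffman construction: repeatedly merge the two lightest trees.
buildHuffman :: [(Int, Integer)] -> HTree
buildHuffman xs = go (sortBy (comparing weight) [Leaf i w | (i, w) <- xs])
  where
    go []         = error "buildHuffman: no symbols"
    go [t]        = t
    go (a:b:rest) =
        go (insertBy (comparing weight)
                     (Node a b (weight a + weight b)) rest)

-- Read the codewords off the tree: left = '0', right = '1'.
codebook :: HTree -> [(Int, String)]
codebook t = go t []
  where
    go (Leaf i _)   acc = [(i, reverse acc)]
    go (Node l r _) acc = go l ('0':acc) ++ go r ('1':acc)
```

Frequent word types end up near the root and so get the short codewords,
which is exactly the property you want for packing common tokens into a
machine word.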
If you can get away with not doing those codeword compression tricks,
then you can also gather all your n-gram statistics along the way (this
is what my program does). Or, if you can partition the corpus into a
handful of subcorpora, each of which can get by without the compression
tricks, then you can run your training code separately on each subcorpus
and unify the ID spaces when folding the results together.
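One way to sketch that unification step (names are illustrative): give
each subcorpus's local table, indexed by local ID, and fold them into a
single global table plus a per-subcorpus remapping from local to global
IDs, which you can then apply to that subcorpus's n-gram counts:

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as M

-- Each subcorpus hands back its token table, indexed by local ID.
-- Produce one global token->ID table and, per subcorpus, the vector
-- taking local IDs to global IDs.
unifyIds :: [[String]] -> (M.Map String Int, [[Int]])
unifyIds = mapAccumL remapOne M.empty
  where
    remapOne glob = mapAccumL assign glob
    assign glob tok =
        case M.lookup tok glob of
          Just g  -> (glob, g)
          Nothing -> let g = M.size glob
                     in (M.insert tok g glob, g)
```

Once you have the remapping vectors, folding the per-subcorpus n-gram
counts together is just an index translation followed by a merge.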
Other than interning, another major way to save on space is to use
tries, as you mentioned. In particular, every tree for n-grams fits into
a tree for (n+1)-grams, and since you're storing the (n+1)-gram tree
already you might as well have the n-gram tree reuse the overhead. In my
experience, unifying all the different n-gram trees has at most a
trivial speed cost compared to giving them all dedicated representations.
The next question is what sort of trie representation to use. Something
based on big-endian patricia trees (e.g., containers:Data.IntMap or
bytestring-trie:Data.Trie) is a good place to start. Though, you'll want
to make them wider than binary-branching in order to reduce the overhead
of the pointers et al. If you're still running into memory issues for
storing the whole trained model, then you should look into
cache-oblivious tree structures that you can mmap in.
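As a sketch of that unified representation (built on Data.IntMap here,
so still binary-branching under the hood; a production version would
want the wider nodes just mentioned): one trie over interned IDs with a
count at every node serves all n-gram orders at once, since inserting an
n-gram bumps the count of each of its prefixes along the way.

```haskell
import Data.Maybe (fromMaybe)
import qualified Data.IntMap.Strict as IM

-- One trie over interned token IDs. Since every n-gram's prefix is an
-- (n-1)-gram, a count at each node gives the lower-order statistics
-- for free (as prefix counts).
data NGramTrie = NGramTrie
    { nodeCount :: !Int
    , children  :: !(IM.IntMap NGramTrie)
    }

emptyTrie :: NGramTrie
emptyTrie = NGramTrie 0 IM.empty

-- Insert one n-gram, bumping the count of every prefix on the path.
insertNGram :: [Int] -> NGramTrie -> NGramTrie
insertNGram []     (NGramTrie c ch) = NGramTrie (c + 1) ch
insertNGram (w:ws) (NGramTrie c ch) =
    NGramTrie (c + 1)
              (IM.alter (Just . insertNGram ws . fromMaybe emptyTrie) w ch)

-- Count of a given n-gram prefix (0 if unseen).
lookupNGram :: [Int] -> NGramTrie -> Int
lookupNGram []     t = nodeCount t
lookupNGram (w:ws) t =
    maybe 0 (lookupNGram ws) (IM.lookup w (children t))
```

Feeding in only the highest-order n-grams populates every lower order
as a side effect, which is the "reuse the overhead" point above.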
If all that still isn't enough, then that's where you start breaking
into doing research :)
--
Live well,
~wren