[Haskell-cafe] Parsing unstructured data
Reinier Lamers
reinier.lamers at phil.uu.nl
Thu Nov 29 05:31:36 EST 2007
Olivier Boudry wrote:
> On 11/28/07, Grzegorz Chrupala <grzegorz.chrupala at computing.dcu.ie> wrote:
>
> You may have better luck checking out methods used in parsing natural
> language. To use statistical parsing techniques such as Probabilistic
> Context Free Grammars ([1], [2]), the standard approach is to extract
> rule probabilities from an annotated corpus, that is, a collection of
> strings with associated parse trees. Maybe you could use the 2/3 of
> your addresses that you know are correctly parsed as training material.
>
> A PCFG parser can output all (or the n-best) parses ordered by
> probability, so that would seem to fit your requirements.
> [1] http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
> [2] http://www.cs.colorado.edu/~martin/slp2.html#Chapter14
>
>
> Wow, Natural Language Processing looks quite complex! But it also
> seems to be closely related to my problem. If someone finds an "NLP for
> dummies" article or book, I'm interested. ;-)
Especially in fuzzy cases like this one, NLP often turns to machine
learning models. One could train a hidden Markov model or a support
vector machine to label each part of the string as "name", "street",
"number", "city", etc. These techniques work very well for
part-of-speech tagging in natural language, and this problem seems
similar. However, you need a manually annotated set of examples to train
the models. If you really have a lot of data and this looks like a good
fit, you could use an off-the-shelf part-of-speech tagger such as SVMTool
(http://www.lsi.upc.edu/~nlp/SVMTool/) to do it.
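
To make the hidden Markov model idea a bit more concrete, here is a
minimal sketch in Python. It is not how SVMTool works, just an
illustration of the general approach under some loud assumptions: the
training addresses are invented, the names feature, train, log_prob and
viterbi are hypothetical helpers, probabilities are estimated by simple
counting with add-one smoothing, and all-digit tokens are collapsed into
a single symbol so that unseen house numbers can still be recognised.

```python
import math
from collections import Counter, defaultdict

# Invented, hand-annotated training data: one list of (token, label) per address.
TRAIN = [
    [("John", "name"), ("Smith", "name"), ("12", "number"),
     ("Main", "street"), ("Street", "street"), ("Utrecht", "city")],
    [("Anna", "name"), ("Jones", "name"), ("7", "number"),
     ("High", "street"), ("Street", "street"), ("Dublin", "city")],
    [("Paul", "name"), ("Martin", "name"), ("221", "number"),
     ("Park", "street"), ("Road", "street"), ("Geneva", "city")],
]

START = "<s>"  # pseudo-label that precedes the first token


def feature(token):
    """Map a token to its observation symbol (all-digit tokens share one symbol)."""
    return "<num>" if token.isdigit() else token.lower()


def train(examples):
    """Count label-to-label transitions and label-to-feature emissions."""
    trans = defaultdict(Counter)  # trans[previous_label][label]
    emit = defaultdict(Counter)   # emit[label][feature]
    for example in examples:
        prev = START
        for token, label in example:
            trans[prev][label] += 1
            emit[label][feature(token)] += 1
            prev = label
    labels = sorted(emit)
    vocab = {f for counts in emit.values() for f in counts}
    return trans, emit, labels, vocab


def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log probability of key under the given counts."""
    return math.log((counts[key] + 1) / (sum(counts.values()) + n_outcomes))


def viterbi(tokens, model):
    """Return the most probable label sequence as a list of (token, label)."""
    if not tokens:
        return []
    trans, emit, labels, vocab = model
    n_labels = len(labels)
    n_feats = len(vocab) + 1  # one extra outcome for unseen features
    feats = [feature(t) for t in tokens]

    # score[l] = best log probability of any path that ends in label l
    score = {l: log_prob(trans[START], l, n_labels)
                + log_prob(emit[l], feats[0], n_feats) for l in labels}
    back = []  # back[i][l] = best predecessor of label l at position i + 1
    for f in feats[1:]:
        new_score, pointers = {}, {}
        for l in labels:
            prev = max(labels,
                       key=lambda p: score[p] + log_prob(trans[p], l, n_labels))
            new_score[l] = (score[prev] + log_prob(trans[prev], l, n_labels)
                            + log_prob(emit[l], f, n_feats))
            pointers[l] = prev
        score = new_score
        back.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(labels, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(zip(tokens, reversed(path)))


if __name__ == "__main__":
    model = train(TRAIN)
    for token, label in viterbi("Anna Smith 99 Park Road Dublin".split(), model):
        print(token, label)
```

With only three training addresses this is a toy, of course. The point
is that the annotated examples are the only address-specific input, so
the correctly parsed part of the data could play that role.
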
Reinier