[Haskell-cafe] Parsing unstructured data

Grzegorz Chrupala grzegorz.chrupala at computing.dcu.ie
Wed Nov 28 16:53:36 EST 2007

Olivier Boudry wrote:
> Hi all,
> This e-mail may be a bit off-topic. My question is more about methods
> and algorithms than Haskell. I'm looking for links to methods or tools
> for parsing unstructured data.
> I'm currently working on data cleaning of a Customer Addresses
> database. Addresses are stored as 3 lines of text without any
> structure, and people used lots of imagination to create the data (20
> years of data with no rules at all). Postal code, street, city, state,
> region, country and other details such as suite, building, dock,
> doors, PO box, etc. are all stored in free form in those 3 lines.
> I already wrote a Haskell program to do the job. It correctly parses
> about 2/3 of the addresses, and parses much of the rest but with
> unrecognized parts left over.
> The current program works by trying to recognize words used to tag
> parts, such as STE, SUITE, BLDG, street words (STR, AVE, CIRCLE,
> etc.), and countries from a list (including typos). It uses regular
> expressions to recognize variations of those words, lookup tables for
> countries and street words, regular expression rules for postal codes,
> etc. The most difficult task is splitting the address parts. There is
> no clearly defined separator for the fields. It can be a dot, space,
> comma, dash, slash, or anything you can imagine using as a separator,
> and the separator can of course also be found inside an address part.
> In the current application, once part of an address is recognized it
> is not parsed again by the other rules. A system that tries all rules
> and tags the results with probabilities would probably give better
> results.
> Any links to tools or methods that could help me with this task would
> be greatly appreciated. I have already searched for fuzzy,
> probabilistic, and statistical parsing, but without much success.
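
[For concreteness, a minimal Haskell sketch of the word-tagging
approach described above might look like this; the Tag type, the word
lists, and the postal-code rule are illustrative assumptions, not the
original program:]

import Data.Char (toUpper, isDigit)
import qualified Data.Set as Set

-- Possible tags for a token in an address line.
data Tag = StreetWord | SuiteWord | Country | PostalCode | Unknown
  deriving (Show, Eq)

-- Illustrative lookup tables; a real program would load much larger
-- lists (including common typos) from files.
streetWords, suiteWords, countries :: Set.Set String
streetWords = Set.fromList ["STR", "AVE", "CIRCLE", "BLVD"]
suiteWords  = Set.fromList ["STE", "SUITE", "BLDG"]
countries   = Set.fromList ["USA", "FRANCE", "GERMANY"]

-- Crude postal-code test: a digit run of plausible length.
looksLikePostalCode :: String -> Bool
looksLikePostalCode s = all isDigit s && length s `elem` [4, 5, 6]

-- Tag one token by consulting the tables and rules in order; the
-- first rule that fires wins, which is exactly the limitation the
-- original poster describes.
tagToken :: String -> Tag
tagToken tok
  | t `Set.member` streetWords = StreetWord
  | t `Set.member` suiteWords  = SuiteWord
  | t `Set.member` countries   = Country
  | looksLikePostalCode t      = PostalCode
  | otherwise                  = Unknown
  where t = map toUpper tok

-- Tag every whitespace-separated token in an address line, e.g.
-- tagLine "1 MAIN STR STE 21224" ==
--   [("1",Unknown),("MAIN",Unknown),("STR",StreetWord),
--    ("STE",SuiteWord),("21224",PostalCode)]
tagLine :: String -> [(String, Tag)]
tagLine = map (\w -> (w, tagToken w)) . words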

You may have better luck checking out methods used in parsing natural
language. To use statistical parsing techniques such as Probabilistic
Context-Free Grammars ([1], [2]), the standard approach is to extract
rule probabilities from an annotated corpus, that is, a collection of
strings with associated parse trees. Maybe you could use the 2/3 of
your addresses that you know are correctly parsed as your training
material.
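
To make that concrete, here is a minimal sketch of the estimation
step, assuming hypothetical Tree and Rule types for the annotated
addresses; the maximum-likelihood estimate is
P(A -> beta) = count(A -> beta) / count(A):

import qualified Data.Map as Map

-- A parse tree: a nonterminal over children, or a terminal word.
data Tree = Node String [Tree] | Leaf String

-- A CFG rule: left-hand side nonterminal, right-hand side symbols.
data Rule = Rule String [String] deriving (Eq, Ord, Show)

-- Collect every rule instance used in a tree.
rules :: Tree -> [Rule]
rules (Leaf _) = []
rules (Node lhs kids) =
  Rule lhs (map label kids) : concatMap rules kids
  where
    label (Node s _) = s
    label (Leaf w)   = w

-- Relative-frequency estimate of each rule's probability from a
-- treebank of correctly parsed addresses.
ruleProbs :: [Tree] -> Map.Map Rule Double
ruleProbs treebank = Map.mapWithKey prob ruleCounts
  where
    allRules   = concatMap rules treebank
    ruleCounts = Map.fromListWith (+) [(r, 1) | r <- allRules]
    lhsCounts  = Map.fromListWith (+) [(lhs, 1) | Rule lhs _ <- allRules]
    prob (Rule lhs _) c = c / (lhsCounts Map.! lhs)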

A PCFG parser can output all parses (or the n best) ordered by
probability, so that would seem to fit your requirements.
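
Continuing with the hypothetical types from the sketch above, ranking
candidate parses then amounts to scoring each one by the product of
the probabilities of the rules it uses and sorting:

import Data.List (sortBy)
import Data.Ord (comparing)

-- Score a parse as the product of its rule probabilities; unseen
-- rules get probability 0 here (a real parser would smooth).
scoreTree :: Map.Map Rule Double -> Tree -> Double
scoreTree probs t = product [Map.findWithDefault 0 r probs | r <- rules t]

-- Candidate parses ordered from most to least probable; take n of
-- these for an n-best list.
rankParses :: Map.Map Rule Double -> [Tree] -> [Tree]
rankParses probs = sortBy (comparing (negate . scoreTree probs))
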
[1] http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
[2] http://www.cs.colorado.edu/~martin/slp2.html#Chapter14