[Haskell-cafe] Parsing unstructured data

Grzegorz Chrupala grzegorz.chrupala at computing.dcu.ie
Wed Nov 28 16:53:36 EST 2007



Olivier Boudry wrote:
> 
> Hi all,
> 
> This e-mail may be a bit off topic. My question is more about methods and
> algorithms than Haskell. I'm looking for links to methods or tools for
> parsing unstructured data.
> 
> I'm currently working on data cleaning of a Customer Addresses database.
> Addresses are stored as 3 lines of text without any structure, and people
> used a lot of imagination to create the data (20 years of data with no
> rules at all). Postal code, street, city, state, region, country and
> other details such as suite, building, dock, doors, PO box, etc. are all
> stored in free form in those 3 lines.
> 
> I already wrote a Haskell program to do the job. It correctly parses
> about 2/3 of the addresses and parses much of the rest, but with
> unrecognized parts left over. The current program works by trying to
> recognize words used to tag parts, like STE, SUITE, BLDG, street words
> (STR, AVE, CIRCLE, etc.) and countries from a list (including typos). It
> uses regular expressions to recognize variations of those words, lookup
> tables for countries and street words, regular expression rules for
> postal codes, etc. The most difficult task is splitting the address
> parts. There is no clearly defined separator between the fields. It can
> be a dot, space, comma, dash, slash, or anything you can imagine using as
> a separator, and this separator can of course also be found inside an
> address part.
> 
> In the current application, once part of an address is recognized it
> will not be parsed again by the other rules. A system trying all rules
> and tagging the results with probabilities would probably give better
> results.
> 
> Any link to tools or methods that could help me with that task would be
> greatly appreciated. I have already searched for fuzzy, probabilistic or
> statistical parsing, but without much success.
> 
> 
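The word-tagging approach described above can be sketched in Haskell roughly as follows. This is a minimal illustration, not the actual program: the tag names, word lists, and example address are invented stand-ins, and the real program uses regular expressions for typo variants rather than plain lookup.

```haskell
import Data.Char (isAlphaNum, toUpper)

-- Hypothetical token categories; the real program's tags may differ.
data Tag = Suite | Building | StreetWord | Country | Unknown
  deriving (Show, Eq)

-- Tiny lookup tables standing in for the full ones (with typo variants).
suiteWords, bldgWords, streetWords, countries :: [String]
suiteWords  = ["STE", "SUITE", "SUIT"]
bldgWords   = ["BLDG", "BUILDING"]
streetWords = ["STR", "ST", "AVE", "AVENUE", "CIRCLE", "RD", "ROAD"]
countries   = ["USA", "CANADA", "FRANCE", "GERMANY"]

-- Treat anything non-alphanumeric as a separator: dot, space, comma,
-- dash, slash all split the line into tokens.
tokenize :: String -> [String]
tokenize = words . map (\c -> if isAlphaNum c then c else ' ')

-- Classify a single token, case-insensitively.
classify :: String -> Tag
classify w0
  | w `elem` suiteWords  = Suite
  | w `elem` bldgWords   = Building
  | w `elem` streetWords = StreetWord
  | w `elem` countries   = Country
  | otherwise            = Unknown
  where w = map toUpper w0

-- Tag every token of an address line.
tagAddress :: String -> [(String, Tag)]
tagAddress s = [ (t, classify t) | t <- tokenize s ]

main :: IO ()
main = mapM_ print (tagAddress "123 MAIN STR, STE 4B / USA")
```

Note that this greedy, first-match tagging is exactly what the post says falls short: a token gets one tag and is never reconsidered, which is what motivates the probabilistic alternative discussed below.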

You may have better luck checking out methods used in parsing natural
language. In order to use statistical parsing techniques such as
Probabilistic Context-Free Grammars ([1], [2]), the standard approach is to
extract rule probabilities from an annotated corpus, that is, a collection
of strings with associated parse trees. Maybe you could use the 2/3 of
addresses that you know are correctly parsed as your training material.

A PCFG parser can output all (or the n-best) parses ordered by
probability, so that would seem to fit your requirements.
[1] http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
[2] http://www.cs.colorado.edu/~martin/slp2.html#Chapter14
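To make the scoring idea concrete, here is a minimal Haskell sketch that ranks alternative parses of an address by the product of their rule probabilities. The grammar, nonterminal names, and probability values are all invented for illustration; in a real system the probabilities would be relative frequencies estimated from the correctly parsed addresses, and the candidate trees would come from a chart parser such as CKY rather than being written by hand.

```haskell
import Data.List (sortBy)
import Data.Ord  (comparing, Down(..))

-- Hypothetical nonterminals for a toy address grammar.
data Sym = Addr | Street | City | Ctry | Num | Name
  deriving (Show, Eq)

-- A rule LHS -> RHS with a probability (made-up numbers; in practice,
-- relative frequencies taken from the annotated training material).
type Rule = (Sym, [Sym], Double)

rules :: [Rule]
rules =
  [ (Addr,   [Street, City, Ctry], 0.7)
  , (Addr,   [Street, City],       0.3)
  , (Street, [Num, Name],          1.0)
  ]

-- A parse tree: internal nodes expand a nonterminal, leaves cover words.
data Tree = Node Sym [Tree] | Leaf Sym String
  deriving Show

root :: Tree -> Sym
root (Node s _) = s
root (Leaf s _) = s

-- Probability of a rule; 0 if the grammar does not contain it.
ruleProb :: Sym -> [Sym] -> Double
ruleProb lhs rhs = head ([p | (l, r, p) <- rules, l == lhs, r == rhs] ++ [0])

-- Probability of a whole tree: the product of the rules it uses
-- (lexical probabilities are omitted for simplicity).
treeProb :: Tree -> Double
treeProb (Leaf _ _)    = 1
treeProb (Node s kids) =
  ruleProb s (map root kids) * product (map treeProb kids)

-- Rank alternative parses of the same address, most probable first.
rank :: [Tree] -> [Tree]
rank = sortBy (comparing (Down . treeProb))

main :: IO ()
main = do
  let street = Node Street [Leaf Num "123", Leaf Name "Main"]
      t1 = Node Addr [street, Leaf City "Dublin", Leaf Ctry "Ireland"]
      t2 = Node Addr [street, Leaf City "Dublin"]
  mapM_ (print . treeProb) (rank [t2, t1])
```

Because every candidate parse gets a score instead of the first match winning outright, ambiguous separators and overlapping rules stop being fatal: you keep all readings and pick the most probable one.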
--
Best,
Grzegorz


