[Haskell-cafe] Parsing unstructured data

Wed Nov 28 13:17:06 EST 2007

On Wed, 2007-11-28 at 12:58 -0500, Olivier Boudry wrote:
> Hi all,
> 
> This e-mail may be a bit off topic. My question is more about methods
> and algorithms than Haskell. I'm looking for links to methods or tools
> for parsing unstructured data.
> 
> I'm currently working on cleaning the data in a customer address
> database. Addresses are stored as 3 lines of text without any
> structure, and people used a lot of imagination to create the data
> (20 years of data entry with no rules at all). Postal code, street, city,
> state, region, country and other details such as suite, building, dock,
> doors, PO box, etc. are all stored in free form in those 3 lines.
> 
> I already wrote a Haskell program to do the job. It correctly parses
> about 2/3 of the addresses and parses much of the rest, but with
> unrecognized parts left. The current program works by trying to recognize words
> used to tag parts like STE, SUITE, BLDG, street words (STR, AVE,
> CIRCLE, etc...) and countries from a list (including typos). It uses
> regular expressions to recognize variations of those words, lookup
> tables for countries, street words, regular expression rules for
> postal codes, etc... The most difficult task is splitting the address
> parts. There is no clearly defined separator for the fields. It can be
> dot, space, comma, dash, slash, or anything you can imagine using as a
> separator and this separator can of course also be found inside an
> address part.
> 
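The keyword-and-lookup-table approach described above can be sketched roughly as follows. This is only an illustration, not the attached program: the Tag type, the contents of keywordTable, and the 5-digit postal code rule are all assumptions made up for the example. Note that splitting on every non-alphanumeric character also breaks up parts that legitimately contain a separator, which is exactly the difficulty mentioned above.

```haskell
import Data.Char (isAlphaNum, isDigit, toUpper)

-- Hypothetical tag set; the types used by the original program are not shown.
data Tag = SuiteWord | BuildingWord | StreetWord | PostalCode | Unknown
  deriving (Eq, Show)

-- Tiny illustrative lookup table; a real one would also list known typos.
keywordTable :: [(String, Tag)]
keywordTable =
  [ ("STE", SuiteWord), ("SUITE", SuiteWord)
  , ("BLDG", BuildingWord)
  , ("STR", StreetWord), ("AVE", StreetWord), ("CIRCLE", StreetWord)
  ]

-- Split on any run of characters that are not letters or digits
-- (dot, space, comma, dash, slash, ...).
tokenize :: String -> [String]
tokenize s = case dropWhile (not . isAlphaNum) s of
  "" -> []
  s' -> let (tok, rest) = span isAlphaNum s' in tok : tokenize rest

-- Tag one token: keyword lookup first (case-insensitive), then a postal
-- code check. The 5-digit format is an assumption for this sketch only.
tagToken :: String -> (String, Tag)
tagToken tok
  | Just t <- lookup upper keywordTable = (tok, t)
  | length tok == 5 && all isDigit tok  = (tok, PostalCode)
  | otherwise                           = (tok, Unknown)
  where upper = map toUpper tok

-- Tag every token of one address line.
tagLine :: String -> [(String, Tag)]
tagLine = map tagToken . tokenize
```

For example, tagLine "12 Main Ave., Ste 4/B - 90210" tags "Ave" as StreetWord, "Ste" as SuiteWord and "90210" as PostalCode, and leaves the other tokens as Unknown.
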
> In the current application when part of an address is recognized it
> will not be parsed again by the other rules. A system trying all rules
> and tagging them with probabilities would probably give better
> results.
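The "try all rules and tag them with probabilities" idea could look something like the sketch below. It is a minimal illustration under stated assumptions: the Label type, the four rules and their scores are invented for the example, and the scores are hand-picked confidence values, not probabilities estimated from data. The point is only that no rule consumes a token; every rule sees every token and the candidates are ranked afterwards.

```haskell
import Data.Char (isDigit, toUpper)
import Data.List (sortBy)
import Data.Maybe (mapMaybe)
import Data.Ord (Down (..), comparing)

-- Hypothetical labels for address parts.
data Label = Street | Suite | Postal | HouseNumber
  deriving (Eq, Show)

-- A rule inspects a token and may propose a label with a confidence
-- score in [0,1]. The scores below are made up for illustration.
type Rule = String -> Maybe (Label, Double)

streetRule, suiteRule, postalRule, numberRule :: Rule
streetRule tok
  | map toUpper tok `elem` ["STR", "AVE", "CIRCLE"] = Just (Street, 0.9)
  | otherwise = Nothing
suiteRule tok
  | map toUpper tok `elem` ["STE", "SUITE"] = Just (Suite, 0.9)
  | otherwise = Nothing
postalRule tok
  | length tok == 5 && all isDigit tok = Just (Postal, 0.7)
  | otherwise = Nothing
numberRule tok
  | not (null tok) && all isDigit tok = Just (HouseNumber, 0.4)
  | otherwise = Nothing

rules :: [Rule]
rules = [streetRule, suiteRule, postalRule, numberRule]

-- Apply every rule to the token and keep all proposals,
-- highest confidence first.
candidates :: String -> [(Label, Double)]
candidates tok = sortBy (comparing (Down . snd)) (mapMaybe ($ tok) rules)
```

Here candidates "90210" yields both a Postal and a HouseNumber proposal, with Postal ranked first; a later pass could then pick the combination of labels that is most plausible for the whole address.
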
Have you looked at the Java Rule Engine (I believe JSR 94) and in
particular Jess?
http://herzberg.ca.sandia.gov/

I have no experience with it myself, though; I've just heard of it.

Regards,

Hans van Thiel
> Any link to tools or methods that could help me in that task would be
> greatly appreciated. I already searched for fuzzy, probabilistic or
> statistical parsing but without much success.
> 
> Thanks,
> 
> Olivier. 
> 
> PS: Just in case someone's interested, I attached the code and partial
> data to this e-mail.