[Haskell-cafe] parsing machine-generated natural text

Bjorn Bringert bringert at cs.chalmers.se
Sun May 21 04:36:41 EDT 2006


On May 19, 2006, at 6:35 PM, Evan Martin wrote:

> For a toy project I want to parse the output of a program.  The
> program runs on someone else's machine and mails me the results, so I
> only have access to the output it generates.
>
> Unfortunately, the output is intended to be human-readable, and this
> makes parsing it a bit of a pain.  Here are some sample lines from its
> output:
>
> France: Army Marseilles SUPPORT Army Paris -> Burgundy.
> Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia.
> England:     4 Supply centers,  3 Units:  Builds   1 unit.
> The next phase of 'dip' will be Movement for Fall of 1901.
>
> I've been using Parsec and it's felt rather complicated.  For example,
> a "location" is a series of words and possibly parentheses, except if
> the word is SUPPORT.  And that "Supply centers" line ends up being
> code filled with stuff like "char ':'; skipMany space".
>
> I actually have a separate parser that's Javascript with a bunch of
> regular expressions and it's far shorter than my Haskell one, which
> makes sense as munging this sort of text feels to me more like a
> regexp job than a careful parsing job.
>
> I'm considering writing a preprocessing stage in Ruby or Perl that
> munges those output lines into something a bit more
> "machine-readable", but before I did that I thought I'd ask here if
> anyone had any pointers, hints, or better ideas.

Hi Evan,

If the text you want to parse is actually similar to natural language
(some posters have suggested that it is much simpler), you may want
to have a look at grammar formalisms designed for natural languages.
Grammatical Framework (GF) [1] is one such formalism, in which the
grammars are functional programs. The GF implementation is written in
Haskell, and it has an interactive mode and a Haskell API.
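
If the format does turn out to be much simpler, the "location" rule from
the quoted message (a run of words, an optional parenthesised qualifier,
stopping at SUPPORT) can be sketched with plain parser combinators. This
is only an illustration and is not GF: it uses ReadP from base, the
names keywords, nameWord, qualifier, manyGreedy and location are made up
for the example, and the keyword list is a guess from the sample lines.

```haskell
import Data.Char (isAlpha)
import Text.ParserCombinators.ReadP

-- Words that terminate a location. Assumption: only SUPPORT, as in the
-- sample lines; a real parser would likely need more entries.
keywords :: [String]
keywords = ["SUPPORT"]

-- One alphabetic word that is not a keyword.
nameWord :: ReadP String
nameWord = do
  w <- munch1 isAlpha
  if w `elem` keywords then pfail else return w

-- Optional trailing qualifier such as "(south coast)".
qualifier :: ReadP String
qualifier = do
  skipSpaces
  q <- between (char '(') (char ')') (munch1 (/= ')'))
  return (" (" ++ q ++ ")")

-- Greedy repetition: take as many p as possible, with no shorter
-- alternatives (ReadP's own 'many' returns every prefix).
manyGreedy :: ReadP a -> ReadP [a]
manyGreedy p = (do x <- p; xs <- manyGreedy p; return (x : xs)) <++ return []

-- A location: one or more name words, then an optional qualifier.
location :: ReadP String
location = do
  w  <- nameWord
  ws <- manyGreedy (munch1 (== ' ') >> nameWord)
  q  <- qualifier <++ return ""
  return (unwords (w : ws) ++ q)

main :: IO ()
main = do
  print (readP_to_S location "Marseilles SUPPORT Army Paris")
  print (readP_to_S location "St Petersburg (south coast) -> Gulf of Bothnia.")
```

The left-biased choice (<++) keeps the parser deterministic, so each
call yields a single parse plus the unconsumed rest of the line. Names
containing hyphens or other punctuation are not handled by this sketch.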

[Disclaimer: I participate in the development of GF]

/Björn

[1] http://www.cs.chalmers.se/~aarne/GF/


