[Haskell-beginners] Parsing Revisited

Tillmann Rendel rendel at daimi.au.dk
Sun Nov 9 22:33:54 EST 2008


Jeffrey Drake wrote:
> Given a set of combinators that parse specific parts of a document, how
> can you string them together so that you can get the whole document
> parsed?

The general idea is to build parser for complex formats out of parsers 
for simple formats. The structure of the parser often more or less 
follows the structure of the data.

For example, the following data type could be a first approach to 
capture the lexical structure of TeX:

   data TeX
     = Letter Char             -- for example: A
     | Command String          -- for example: \begin
     | Group [TeX]             -- for example: {abc\something{...}}

The idea is to parse "a\test {bc}" into the following list of TeX values:

   [Letter 'a', Command "test", Group [Letter 'c', Letter 'd']]

Note how the use of lists of TeX values allows to actually represent 
whole documents; and how the Group data constructor allows to capture 
the recursive structure of TeX programs.

Let start by writing the parser for a single TeX value. The datatype 
definition shows that a such a value can be a letter, a command or a 
list of TeX values enclosed in braces. We can capture the fact that we 
have three choices directly in parsers:

   tex :: Parser TeX
   tex = texLetter <|> texCommand <|> texGroup

Note how the combinator <|> corresponds to the | syntax in the datatype 
declaration.

Given this parser for TeX values, we can write the parser for a list of 
such values using the many combinator:

   texList :: Parser [TeX]
   texList = many tex

Note how the many combinator corresponds to the list type constructor.

Now we have to define the parser for the three data constructors. 
texLetter is easy:

   texLetter :: Parser TeX
   texLetter = do l <- letter
                  return (Letter l)

Note how the fact that texLetter just wraps letter corresponds to the 
fact that Letter just wraps Char.

Commands are more interesting, because they eat all spaces after the 
name of the control sequence.

   texCommand :: Parser TeX
   texCommand = do char '\\'
                   name <- many letter
                   many (char ' ')
                   return (Command name)

By implementing the space eating feature of commands as part of the 
texCommand parser, we can be sure that spaces not following commands 
will not be eaten.

Finally, I would consider the parser for groups the most interesting. 
The inside of a group looks looks just like the whole TeX document 
itself. Fortunately, we have already implemented a parser for whole TeX 
documents, namely texList, which we use for the texGroup parser as follows:

   texGroup :: Parser TeX
   texGroup = do char '{'
                 content <- texList
                 char '}'

Note how the mutual recursion between texList and texGroup corresponds 
to the recursion in the TeX data type.

Of course, the examples in this messages are not meant to be production 
code. Actually, they are not tested at all. But I hope that they help 
you get started with Parsec.

   Tillmann


More information about the Beginners mailing list