[Haskell-beginners] Parsing Revisited

Sun Nov 9 23:35:38 EST 2008

This helps a lot, and I can go over this in the morning. The only final
question I have is what you would use to apply all this to an arbitrary
string.

For example, in the following partially fictitious code: (based on
something I saw)

main = getContents >>= ...

What would ... be so that you can turn the [Char] into [TeX]? Or can I
specifically use the combinators only?

- Jeff

On Mon, 2008-11-10 at 04:33 +0100, Tillmann Rendel wrote:
> Jeffrey Drake wrote:
> > Given a set of combinators that parse specific parts of a document, how
> > can you string them together so that you can get the whole document
> > parsed?
> 
> The general idea is to build parser for complex formats out of parsers 
> for simple formats. The structure of the parser often more or less 
> follows the structure of the data.
> 
> For example, the following data type could be a first approach to 
> capture the lexical structure of TeX:
> 
>    data TeX
>      = Letter Char             -- for example: A
>      | Command String          -- for example: \begin
>      | Group [TeX]             -- for example: {abc\something{...}}
> 
> The idea is to parse "a\test {bc}" into the following list of TeX values:
> 
>    [Letter 'a', Command "test", Group [Letter 'c', Letter 'd']]
> 
> Note how the use of lists of TeX values allows to actually represent 
> whole documents; and how the Group data constructor allows to capture 
> the recursive structure of TeX programs.
> 
> Let start by writing the parser for a single TeX value. The datatype 
> definition shows that a such a value can be a letter, a command or a 
> list of TeX values enclosed in braces. We can capture the fact that we 
> have three choices directly in parsers:
> 
>    tex :: Parser TeX
>    tex = texLetter <|> texCommand <|> texGroup
> 
> Note how the combinator <|> corresponds to the | syntax in the datatype 
> declaration.
> 
> Given this parser for TeX values, we can write the parser for a list of 
> such values using the many combinator:
> 
>    texList :: Parser [TeX]
>    texList = many tex
> 
> Note how the many combinator corresponds to the list type constructor.
> 
> Now we have to define the parser for the three data constructors. 
> texLetter is easy:
> 
>    texLetter :: Parser TeX
>    texLetter = do l <- letter
>                   return (Letter l)
> 
> Note how the fact that texLetter just wraps letter corresponds to the 
> fact that Letter just wraps Char.
> 
> Commands are more interesting, because they eat all spaces after the 
> name of the control sequence.
> 
>    texCommand :: Parser TeX
>    texCommand = do char '\\'
>                    name <- many letter
>                    many (char ' ')
>                    return (Command name)
> 
> By implementing the space eating feature of commands as part of the 
> texCommand parser, we can be sure that spaces not following commands 
> will not be eaten.
> 
> Finally, I would consider the parser for groups the most interesting. 
> The inside of a group looks looks just like the whole TeX document 
> itself. Fortunately, we have already implemented a parser for whole TeX 
> documents, namely texList, which we use for the texGroup parser as follows:
> 
>    texGroup :: Parser TeX
>    texGroup = do char '{'
>                  content <- texList
>                  char '}'
> 
> Note how the mutual recursion between texList and texGroup corresponds 
> to the recursion in the TeX data type.
> 
> Of course, the examples in this messages are not meant to be production 
> code. Actually, they are not tested at all. But I hope that they help 
> you get started with Parsec.
> 
>    Tillmann