[Haskell-cafe] Techniques for ensuring parser correctness?

Mon Jul 26 04:08:38 EDT 2010

On Mon, Jul 26, 2010 at 12:03 AM, S. Doaitse Swierstra <
doaitse at swierstra.net> wrote:

>
> On 26 jul 2010, at 03:51, Jason Dagit wrote:
>
> Hello,
>
> I find that parser correctness is often hard to verify.  Therefore, I'm
> interested in techniques that others have used successfully, especially with
> Haskell.
>
>
> It seems to me that you are not so much trying to verify parsers, but more
> specifically Parsec parsers. Since in Parsec-based parsers you control the
> backtracking explicitly such parsers can get very complicated semantics. Now
> the question arises: what does it mean for a (Parsec) parser to be correct?
> Do you have another description of the  language which is to be recognised,
> e.g. a context-free grammar. Only then can you give meaning to the word
> "correctness".
>
> In general I think that the more your parser combinators deviate from
> context-free grammars in terms of expressiveness, the more problems you will
> encounter. If you make heavy use of the monadic part, you will not only have
> to prove the correctness of static parsers, but even of parsers which are
> generated dynamically. If you use the backtrack-controlling features, your
> proofs will become even more complicated, since it is unlikely that your
> more abstract formalism in which you have specified your language does not
> have a similar construct: here comes in about 50 years on research on
> parsing techniques and grammar analysis. If your grammar is e.g. LL(1) then
> you can verify that some of the back-tracking-controlling features in your
> Parser parser have been used in a sound way, i.e., that you will be able to
> parse any sentence that your grammar describes.
>
> If you have a context-free grammar, and you want to be relatively sure that
> the parser is correct and you do not want to go through large verification
> efforts I suggest you use the uu-parsinglib; the only restriction there is
> is that your grammar should fulfill certain modest "well-formedness"
> criteria, such as being non-left-recursive and non-ambiguous. Then  the
> semantics of the combinators are exactly what you want, i.e. your parsers
> and your grammars are isomorphic. If you have however an "incorrect formal
> specification", i.e., a specification which contains ambiguous non-terminals
> like p* constructs where p can reduce to an empty string   things break. The
> first problem one is not recognised and will lead to a non-terminating
> parser, whereas the second problem is detected by the grammars analysing
> themselves while being used, and leading to a run-time error message once
> you reach that part of the grammar during parsing.
>

Thanks for the reply.

I think the grammar is fairly simple, although I'm not confident classifying
it.  I know it can be parsed with just a simple pass over the data.  The
only uses of backtracking are just to figure out what is next, like a peek
at the next token.  Let me give you some samples of what the input looks
like.

Here are three entries from the "inventory" they correspond to PatchInfos:
[TAG 2.4
Reinier Lamers <tux_rocker at reinier.de>**20100226180900
 Ignore-this: 36ce0456c214345f55a7bc5fc142e985
]
hash:
0000000560-c6bb2c4334a557826cb1a662a8d57ccb9a78390833fab2f1d65e190939f283a3
[Make record store patch metadata in UTF-8
Reinier Lamers <tux_rocker at reinier.de>**20090917165301
 Ignore-this: 6640e121987d6a76479e46d9cc14413b
]
hash:
0000008496-b0170277eee44adc98f553bfbdadae1fb440cb3aaa4988ea19fbcad9d65e31b0
[Add test for UTF-8 metadata
Reinier Lamers <tux_rocker at reinier.de>**20090917165327
 Ignore-this: 3e81237e8af61a45d64ac60269e1fe90 UTF-8
]
hash:
0000004693-d258a7f56c4ed067d219b540ca6b0ce2e2d66bb5fa9e86799a17504f6ebfce38

The brackets delimit the PatchInfos.  The first line is the short
description, or name, of the patch.  The next line, up to the first *, is
the author.  The second * could also be a -, these are followed by the
date/timestamp.  All the lines between the date and closing bracket must
start with a space that gets dropped by the parser.  These lines constitute
the long patch description.  I think the lines here start with a leading
space so that brackets appearing in the description do not need to be
escaped.  Most patches have no long description so his part could be empty,
although modern darcs inserts headers here in ever patch, such as the
"Ignore-this:" field.

The line "hash: ...", gives the file name of the patch that corresponds to
the PatchInfo immediately before it.  The hash line is optional because the
initial versions of darcs did not have this feature.  The parser for
inventories reads as many PatchInfos as it finds.  I think there is always
at least one, but I'm not certain of that.  The corner cases would be newly
created repository, or immediately after tagging a repository.  Inventories
are split at tags and new repositories have no patches.

The format of patches themselves is similarly linear.  Here is an example,
taking the top several lines of the second patch listed above:
[Make record store patch metadata in UTF-8
Reinier Lamers <tux_rocker at reinier.de>**20090917165301
 Ignore-this: 6640e121987d6a76479e46d9cc14413b
] hunk ./src/ByteStringUtils.hs 49
-        intercalate
+        intercalate,
+        getArgsLocale,
+        decodeLocale,
+        encodeLocale,
+        encodeLatin1,
+        decodeString,
+        utf8ToLocale
hunk ./src/ByteStringUtils.hs 70
+import System.Environment       ( getArgs )
+import System.Console.Haskeline ( runInputT, defaultSettings )
+import System.Console.Haskeline.Encoding ( decode, encode )
hunk ./src/ByteStringUtils.hs 103
+import Codec.Binary.UTF8.Generic ( toString )
+
hunk ./src/ByteStringUtils.hs 500
+
+-- | A locale-aware version of getArgs

Notice that the PatchInfo is repeated, followed by a line describing which
type of patch it is.  In this case they are all "hunk" patches.  It could be
other things though, like "addfile", "adddir", "rmdir", and a few others.
 That first line varies between the patch types, but it always specifies the
patch type, usually mentions a directory or filename, and maybe some other
information.  In the case of hunk patches, it mentions which line of the
file to start modifying at.  The lines starting with - or + are lines to be
removed or added.  As you can see from the second hunk, "hunk
./src/ByteStringUtils.hs 70", that either set of lines can be omitted.  In
this case, there are no lines to be removed, so no lines starting with -.
 The lines starting with +, when present, always come after the lines
starting with -.  This input continues this way until all the patches are
enumerated.  That is, there is no special marker at the end, this parser
simple reads to the end of input returning a list of the patches read.

The parser for patches uses 'try' to figure out if the next token is
"addfile", "hunk", and so on but otherwise it just does a straight pass over
the data without any significant backtracking.

I strongly suspect that as far as grammars are concerned, this one is very
simple.  The trick is to parse it correctly, robustly, and efficiently while
accounting for backwards compatibility.  The format has evolved slightly
over the years.

Now that you've seen some examples, what would you recommend?

Jason
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20100726/c2a6cf9b/attachment-0001.html