[Haskell-cafe] How to do reversible parsing?

Sat Jan 2 23:59:20 UTC 2021

Hi. How to do deterministic reversible parsing in Haskell? Please, point me to some libs or tools.
Or give some advice on how to write such lib or tool. (Also, I am interested in solutions for C or C++.)

Now let me explain my task in detail.

I am interested in formal math. I am going to develop various proof checkers. In particular, I want
to develop my PTS checker ( https://en.wikipedia.org/wiki/Pure_type_system ). Here is example of
existing PTS checker: http://web.archive.org/web/20180924133846/http://www.cs.kun.nl/~janz/yarrow/ .
Here is example of input this checker accepts (taken from manual):

Var (+) : Nat -> Nat -> Nat
Def double := \x:Nat. x+x
Def S := (+) one

Also, I want to develop something similar to Isabelle ( https://isabelle.in.tum.de/ ). Example of
Isabelle input: https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html .

Also, I am going to write converters between various languages. For example, between input for some
existing PTS checker and my PTS checker.

In short, I want to write lot of parsers and pretty-printers for various languages. And thus I need
some parsing solution.

The solution should have the following features:

1. The parsing should be divided into lexing and parsing. I. e. usual lex + yacc model. Parsing should
be done using usual context-free grammars (CFGs). Left-recursive grammars should be supported (this
leaves out parser combinator libraries). All languages I am going to deal with fit this model. I. e.
I will deal with simple to parse languages, such as Pascal. Not complex ones, like C or C++.

2. Parsing should be deterministic. I. e. parser should check that CFG is unambiguous. Yes, I know
that this task is impossible to do on Turing machine. But at very least some partial solution will
go. For example, some algorithm, which will check that some reasonable set of CFGs is deterministic.
Or even brute force up to some limit will go. (Yes, this will not give absolutely correct results,
but this will go.) This is example of grammar for PTS in one of my checkers (start symbol is t0):

t1000 = id
t1000 = "(" t0")"
t999 = t999 t1000
t3 = t4 "::" t3
t3 = "%" id "::" t0 "." t3
t0 = "!!" id "::" t0 "." t0
t1 = t2 "==>" t1
t0 = t1
t1 = t2
t2 = t3
t3 = t4
t4 = t999
t999 = t1000

If I remember correctly, this grammar is deterministic (checked using brute force up to a limit).
I want solution which will verify that this grammar is deterministic.

All stages of parsing should be proved as deterministic (i. e. scanner too).

3. Parsing should be reversible. I. e. pretty-printing should be supported, too. And the tool should
guarantee that parsing and pretty-printing match. Now let me give more precise definition. Let's
assume we have these functions:

parse :: String -> Maybe X
print :: X -> Maybe String

As you can see, not every internal representation is printable. This is OK. For example, if internal
representation contains variable names that happen to equal reserved words, then, of course, this is
unprintable representation.

The tool should guarantee that:
a. Result of parsing should be printable, but printing not necessary should give same string. I. e.
if "parse s == Just x", then "print x /= Nothing". (But not necessary "print x == Just s".)
b. Parsing of result of printing should give same representation. I. e. if "print x == Just s", then
"parse s == Just x"

4. Not only parsing should generate parse tree (i. e. exact representation of derivation), but also
AST. AST differs from parse tree. For example, AST should not contain braces. I. e. "2 + (3 * 4)" and
"2 + 3 * 4" should give same AST. And the tool should be able parse string to AST and pretty print
AST back to string.

5. It should be possible to use CFG not known in compile time. Consider Isabelle language. Here is
again example of Isabelle text: https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html . As you can
see the language is divided into inner language and outer language. Strings in inner language are
quoted. Outer language can add productions to inner language. Inner language cannot change itself.
"notation" and "translations" commands add productions to inner language. So inner language is not
known beforehand, it is created dynamically. There is no need to dynamically create scanner, i. e.
it should be possible to dynamically create parser (for inner language), but there is no such
requirement for scanner. Also, parser will not change in the middle of parsing given string.

Again: let's assume I write parser for my Isabelle analog. Scanner and parser for outer language is
fixed. Scanner for inner language is fixed, too. But parser for inner language is not. I read input
document and add productions to CFG of inner language on the fly. Every time I add production,
resulting CFG should be checked for ambiguity. And when I find inner language string in input, I
process it using just constructed parser for inner language.

--

Points 2 and 3 are the most important in this list. If there is solution which contains them alone,
this already will be very good.

Now let me list considered solutions.

* Usual combinator libraries (such as parsec) will not go. They don't work with left-recursive CFGs.
They don't check grammars for ambiguities. They don't do reversible parsing. But they allow creating
language on the fly, and this is good thing.

* Happy and alex will not go. As well as I know Happy can prove unambiguity of some very simple grammars
(i. e. for some set of simple grammars it will not report any conflicts and thus will prove they unambiguity).
It seems this is LR(1) set. But, as well as I know, Happy will report conflicts on PTS grammar given
above, i. e. Happy cannot prove its unambiguity. Also, Happy and Alex cannot pretty-print. And Happy
and Alex don't allow supplying CFG on the fly.

* Package "earley". Handle arbitrary CFGs, including left-recursive. This is good thing. It seems CFG
can be supplied in runtime. This is good thing, too. Gives all parse trees, thus we can verify that
count of parse trees equals to 1. This is good thing, too. Unfortunately, earley cannot verify CFS's
unambiguity before supplying input. Also, earley doesn't support pretty-printing.

* PEG libraries. Will not go, I want CFGs.

* I searched on Hackage reversible parsing libraries. Unfortunately, none of them guarantee
reversibility. I. e. it is very easy to fool these libraries and create parsers and printers not
satisfying equations I gave above. Also, this libraries doesn't have other features I mentioned

* http://augeas.net/ is very cool project. Unlike nearly everything I found in internet, Augeas
provides guaranteed reversible parsing and pretty-printing (!!!!!). Unfortunately, set of supported
target languages is very limited. Regular languages are supported. And some very limited set of CFGs
is supported. For example, it is possible to construct parser for JSON. But parser for Pascal seems
impossible. Augeas not designed for parsing programming languages. Only simple configuration
languages like JSON, XML, etc. Mutually recursive nonterminals are not supported. Also, Augeas
provides C API, not Haskell one.

* https://www.brics.dk/xsugar.html is very good. It provides guaranteed reversible translation
between arbitrary language and XML. And XSugar even verifies unambiguity of the CFG using some
sophisticated algorithm. Unfortunately, this algorithm still cannot prove unambiguity of PTS grammar
given above. Also, XSugar is not Haskell library. It is standalone Java app. So, I need to convert
some language to XML, then load this XML to my Haskell program and parse. Also, there is similar lib
https://www.brics.dk/Xact.html from same author, but it is for Java, not for Haskell.

I was able to find only 2 projects, which do guaranteed reversible parsing: mentioned Augeas and
XSugar/Xact. Unfortunately, both are not for Haskell. And both cannot prove unambiguity for my PTS
grammar.

So, please tell me about solution. Or give advice how to write one.

Also, let me describe my experience with various algorithms for checking ambiguity of grammar.

1. Try to construct LR(1) table. If there is no conflicts, then the grammar is unambiguous. It seems
this is algorithm used by bison and happy. Unfortunately it is too weak and cannot prove unambiguity
of my PTS grammar.
2. Try to construct LR(k) table. This algorithm is stronger than LR(1). I was unable to find
implementation of this algorithm in any language. I don't want to write such algorithm.
3. Schmitz's algorithm. http://www.lsv.fr/~schmitz/pub/expamb.pdf . Even stronger than LR(k). I
downloaded his bison fork. Unfortunately this bison fork can prove unambiguity for same set of
grammars original bison can (i. e. LR(1)). I. e. it seems Schmitz did not implemented his algorithm.
I don't want to implement his algorithm either.
4. Horizontal and vertical ambiguity checking. https://www.brics.dk/grammar/kruse.pdf . This is
algorithm used in XSugar. Unfortunately, it doesn't handle my PTS grammar.
5. Brute force up to some limit. Doesn't give unambiguity guarantee. But I like it, and I will use
it if I don't find anything else. It is the only algorithm which can handle my PTS grammar (and
give enough amount of guarantee), not counting LR(k) and Schmitz's algorithm (I was not able to find
implementations of both).

I will try to write tool I want, I will write about my progress to this list.

==
Askar Safin
https://github.com/safinaskar