[Haskell-cafe] How to do reversible parsing?

Mon Jan 4 21:26:08 UTC 2021

You may be interested in the profunctor technique of bidirectional
parsing/serialisation, a basic version of which is implemented in the
‘codec’ package: <https://hackage.haskell.org/package/codec>. It was
originally intended for binary and textual data formats, but could be used
for a conventional textual language as well. I have also used this
technique at work to develop a similar library, and found it quite
effective & performant once you get the (somewhat delicate) internals
correct.

As you suggest, it’s generally not feasible to have “encode linearity”
(‘decode . encode = id’; printing omits no information from output) and
“decode linearity” (‘encode . decode = id’; parsing discards no info from
input). However, it’s worth verifying both “encode preservation” (‘encode .
decode . encode = encode’; parsing preserves all info that printing does)
and “decode preservation” (‘decode . encode . decode = decode’; printing
preserves all info that parsing does).

On Sat, Jan 2, 2021 at 4:02 PM Askar Safin via Haskell-Cafe <
haskell-cafe at haskell.org> wrote:

> Hi. How to do deterministic reversible parsing in Haskell? Please, point
> me to some libs or tools.
> Or give some advice on how to write such lib or tool. (Also, I am
> interested in solutions for C or C++.)
>
> Now let me explain my task in detail.
>
> I am interested in formal math. I am going to develop various proof
> checkers. In particular, I want
> to develop my PTS checker ( https://en.wikipedia.org/wiki/Pure_type_system
> ). Here is example of
> existing PTS checker:
> http://web.archive.org/web/20180924133846/http://www.cs.kun.nl/~janz/yarrow/
> .
> Here is example of input this checker accepts (taken from manual):
>
> Var (+) : Nat -> Nat -> Nat
> Def double := \x:Nat. x+x
> Def S := (+) one
>
> Also, I want to develop something similar to Isabelle (
> https://isabelle.in.tum.de/ ). Example of
> Isabelle input: https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html .
>
> Also, I am going to write converters between various languages. For
> example, between input for some
> existing PTS checker and my PTS checker.
>
> In short, I want to write lot of parsers and pretty-printers for various
> languages. And thus I need
> some parsing solution.
>
> The solution should have the following features:
>
> 1. The parsing should be divided into lexing and parsing. I. e. usual lex
> + yacc model. Parsing should
> be done using usual context-free grammars (CFGs). Left-recursive grammars
> should be supported (this
> leaves out parser combinator libraries). All languages I am going to deal
> with fit this model. I. e.
> I will deal with simple to parse languages, such as Pascal. Not complex
> ones, like C or C++.
>
> 2. Parsing should be deterministic. I. e. parser should check that CFG is
> unambiguous. Yes, I know
> that this task is impossible to do on Turing machine. But at very least
> some partial solution will
> go. For example, some algorithm, which will check that some reasonable set
> of CFGs is deterministic.
> Or even brute force up to some limit will go. (Yes, this will not give
> absolutely correct results,
> but this will go.) This is example of grammar for PTS in one of my
> checkers (start symbol is t0):
>
> t1000 = id
> t1000 = "(" t0")"
> t999 = t999 t1000
> t3 = t4 "::" t3
> t3 = "%" id "::" t0 "." t3
> t0 = "!!" id "::" t0 "." t0
> t1 = t2 "==>" t1
> t0 = t1
> t1 = t2
> t2 = t3
> t3 = t4
> t4 = t999
> t999 = t1000
>
> If I remember correctly, this grammar is deterministic (checked using
> brute force up to a limit).
> I want solution which will verify that this grammar is deterministic.
>
> All stages of parsing should be proved as deterministic (i. e. scanner
> too).
>
> 3. Parsing should be reversible. I. e. pretty-printing should be
> supported, too. And the tool should
> guarantee that parsing and pretty-printing match. Now let me give more
> precise definition. Let's
> assume we have these functions:
>
> parse :: String -> Maybe X
> print :: X -> Maybe String
>
> As you can see, not every internal representation is printable. This is
> OK. For example, if internal
> representation contains variable names that happen to equal reserved
> words, then, of course, this is
> unprintable representation.
>
> The tool should guarantee that:
> a. Result of parsing should be printable, but printing not necessary
> should give same string. I. e.
> if "parse s == Just x", then "print x /= Nothing". (But not necessary
> "print x == Just s".)
> b. Parsing of result of printing should give same representation. I. e. if
> "print x == Just s", then
> "parse s == Just x"
>
> 4. Not only parsing should generate parse tree (i. e. exact representation
> of derivation), but also
> AST. AST differs from parse tree. For example, AST should not contain
> braces. I. e. "2 + (3 * 4)" and
> "2 + 3 * 4" should give same AST. And the tool should be able parse string
> to AST and pretty print
> AST back to string.
>
> 5. It should be possible to use CFG not known in compile time. Consider
> Isabelle language. Here is
> again example of Isabelle text:
> https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html . As you can
> see the language is divided into inner language and outer language.
> Strings in inner language are
> quoted. Outer language can add productions to inner language. Inner
> language cannot change itself.
> "notation" and "translations" commands add productions to inner language.
> So inner language is not
> known beforehand, it is created dynamically. There is no need to
> dynamically create scanner, i. e.
> it should be possible to dynamically create parser (for inner language),
> but there is no such
> requirement for scanner. Also, parser will not change in the middle of
> parsing given string.
>
> Again: let's assume I write parser for my Isabelle analog. Scanner and
> parser for outer language is
> fixed. Scanner for inner language is fixed, too. But parser for inner
> language is not. I read input
> document and add productions to CFG of inner language on the fly. Every
> time I add production,
> resulting CFG should be checked for ambiguity. And when I find inner
> language string in input, I
> process it using just constructed parser for inner language.
>
> --
>
> Points 2 and 3 are the most important in this list. If there is solution
> which contains them alone,
> this already will be very good.
>
> Now let me list considered solutions.
>
> * Usual combinator libraries (such as parsec) will not go. They don't work
> with left-recursive CFGs.
> They don't check grammars for ambiguities. They don't do reversible
> parsing. But they allow creating
> language on the fly, and this is good thing.
>
> * Happy and alex will not go. As well as I know Happy can prove
> unambiguity of some very simple grammars
> (i. e. for some set of simple grammars it will not report any conflicts
> and thus will prove they unambiguity).
> It seems this is LR(1) set. But, as well as I know, Happy will report
> conflicts on PTS grammar given
> above, i. e. Happy cannot prove its unambiguity. Also, Happy and Alex
> cannot pretty-print. And Happy
> and Alex don't allow supplying CFG on the fly.
>
> * Package "earley". Handle arbitrary CFGs, including left-recursive. This
> is good thing. It seems CFG
> can be supplied in runtime. This is good thing, too. Gives all parse
> trees, thus we can verify that
> count of parse trees equals to 1. This is good thing, too. Unfortunately,
> earley cannot verify CFS's
> unambiguity before supplying input. Also, earley doesn't support
> pretty-printing.
>
> * PEG libraries. Will not go, I want CFGs.
>
> * I searched on Hackage reversible parsing libraries. Unfortunately, none
> of them guarantee
> reversibility. I. e. it is very easy to fool these libraries and create
> parsers and printers not
> satisfying equations I gave above. Also, this libraries doesn't have other
> features I mentioned
>
> * http://augeas.net/ is very cool project. Unlike nearly everything I
> found in internet, Augeas
> provides guaranteed reversible parsing and pretty-printing (!!!!!).
> Unfortunately, set of supported
> target languages is very limited. Regular languages are supported. And
> some very limited set of CFGs
> is supported. For example, it is possible to construct parser for JSON.
> But parser for Pascal seems
> impossible. Augeas not designed for parsing programming languages. Only
> simple configuration
> languages like JSON, XML, etc. Mutually recursive nonterminals are not
> supported. Also, Augeas
> provides C API, not Haskell one.
>
> * https://www.brics.dk/xsugar.html is very good. It provides guaranteed
> reversible translation
> between arbitrary language and XML. And XSugar even verifies unambiguity
> of the CFG using some
> sophisticated algorithm. Unfortunately, this algorithm still cannot prove
> unambiguity of PTS grammar
> given above. Also, XSugar is not Haskell library. It is standalone Java
> app. So, I need to convert
> some language to XML, then load this XML to my Haskell program and parse.
> Also, there is similar lib
> https://www.brics.dk/Xact.html from same author, but it is for Java, not
> for Haskell.
>
> I was able to find only 2 projects, which do guaranteed reversible
> parsing: mentioned Augeas and
> XSugar/Xact. Unfortunately, both are not for Haskell. And both cannot
> prove unambiguity for my PTS
> grammar.
>
> So, please tell me about solution. Or give advice how to write one.
>
> Also, let me describe my experience with various algorithms for checking
> ambiguity of grammar.
>
> 1. Try to construct LR(1) table. If there is no conflicts, then the
> grammar is unambiguous. It seems
> this is algorithm used by bison and happy. Unfortunately it is too weak
> and cannot prove unambiguity
> of my PTS grammar.
> 2. Try to construct LR(k) table. This algorithm is stronger than LR(1). I
> was unable to find
> implementation of this algorithm in any language. I don't want to write
> such algorithm.
> 3. Schmitz's algorithm. http://www.lsv.fr/~schmitz/pub/expamb.pdf . Even
> stronger than LR(k). I
> downloaded his bison fork. Unfortunately this bison fork can prove
> unambiguity for same set of
> grammars original bison can (i. e. LR(1)). I. e. it seems Schmitz did not
> implemented his algorithm.
> I don't want to implement his algorithm either.
> 4. Horizontal and vertical ambiguity checking.
> https://www.brics.dk/grammar/kruse.pdf . This is
> algorithm used in XSugar. Unfortunately, it doesn't handle my PTS grammar.
> 5. Brute force up to some limit. Doesn't give unambiguity guarantee. But I
> like it, and I will use
> it if I don't find anything else. It is the only algorithm which can
> handle my PTS grammar (and
> give enough amount of guarantee), not counting LR(k) and Schmitz's
> algorithm (I was not able to find
> implementations of both).
>
> I will try to write tool I want, I will write about my progress to this
> list.
>
> ==
> Askar Safin
> https://github.com/safinaskar
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20210104/10d636c1/attachment.html>