[Haskell-cafe] [ANN] parser-unbiased-choice-monad-embedding - the best parsing library; it is based on arrows (was: Pearl! I just proved theorem about impossibility of monad transformer for parsing with (1) unbiased choice and (2) ambiguity checking before running embedded monadic action (also, I THREAT I will create another parsing lib))

Fri Jul 9 07:03:59 UTC 2021

You might also want to check out 'uu-parsinglib' [1]. It is also unbiased and 
it has some features not mentioned on your comparison table, most notably 
error-correction and online (lazy) results. Especially the lazy results can 
allow you to write parsers that use constant memory [2]. The "Combinator 
Parsing: A Short Tutorial" by Doaitse Swierstra describes the ideas and 
implementation (the advanced features start in section 4) [3]. Unfortunately, I 
think it is not maintained anymore.

There are also some other parsing libraries that also can deal with 
left-recursion besides Earley, namely 'gll' [4] and 'grammatical-parsers' [5]. 
It might be worth adding them to the comparison.

Cheers,

Jaro

[1] https://hackage.haskell.org/package/uu-parsinglib
[2] 
https://discourse.haskell.org/t/memory-usage-for-backtracking-in-infinite-stream-parsing/1384/12?u=jaror
[3] http://www.cs.uu.nl/research/techreps/repo/CS-2008/2008-044.pdf
[4] https://hackage.haskell.org/package/gll
[5] https://hackage.haskell.org/package/grammatical-parsers

On 09-07-2021 04:18, Askar Safin via Haskell-Cafe wrote:
> Hi.
> 
> I announce my parsing library https://hackage.haskell.org/package/parser-unbiased-choice-monad-embedding .
> I think it is best parsing library, and you should always use it instead of other solutions. I will tell
> you why. You may check comparison table: https://paste.debian.net/1203863/ (if you don't
> understand the table, don't worry; come back to the table after reading this mail).
> 
> My library is solution to problem described in this e-mail thread, so read it for motivation:
> https://mail.haskell.org/pipermail/haskell-cafe/2021-June/134094.html .
> 
> Now let me describe my solution in detail.
> 
> I will distinguish parser errors (i. e. "no parse" or "ambiguous parse") and semantic errors
> (i. e. "division by zero", "undefined identifier", "type mismatch", etc).
> 
> So, now I will show you parser with unbiased choice, which allows monad embedding.
> 
> As a very good introduction to arrows I recommend this:
> https://ocharles.org.uk/guest-posts/2014-12-21-arrows.html .
> 
> We start from classic parser with this type:
> 
> newtype ParserClassic t a = ParserClassic ([t] -> [(a, [t])])
> 
> You can make it be instance of Functor, Applicative, Monad and Alternative.
> 
> This type is similar to ReadS from base
> ( https://hackage.haskell.org/package/base-4.15.0.0/docs/Text-ParserCombinators-ReadP.html#t:ReadS )
> and to "type Parser = StateT String []" from example here:
> https://hackage.haskell.org/package/transformers-0.5.5.0/docs/Control-Monad-Trans-Class.html .
> I will not give more information, feel free to find it in internet.
> 
> Now let's replace "a" with Kleisli arrow "b -> m c". We will get this:
> 
> newtype Parser1 t m b c = Parser1 ([t] -> [(b -> m c, [t])])
> 
> Here is resulting parsing library and example: https://godbolt.org/z/qsrdKefjT (backup:
> https://paste.debian.net/1203861/ ). We can use this parser in Applicative style. And when we
> need to lift something to embedded monad, we resort to Arrow style. I didn't test this code much.
> 
> Parser1 cannot be Monad (I proved this in previous letters).
> 
> At this point someone may ask: why we need arrows? Can we achieve same effect using
> Applicative only? Yes, we can. Here is result: https://godbolt.org/z/ocY3csWjs (backup:
> https://paste.debian.net/1203862/ ), I will call such method "anti-arrow". But we have two
> problems here: first, I think this method makes parser code uglier. Second, this method is
> limited, and I will show you why later.
> 
> Okey, now back to arrows.
> 
> Still, we have this problems (I'm about our arrow code):
> - Handling left-recursive grammars is tricky (it is possible, but not as simple as right-recursive
> ones)
> - Parsing errors (as opposed to semantic error messages) are not great. I. e. if there is no valid
> parses, we don't get any additional information, i. e. we don't know which token caused error
> - We don't track locations, so semantic errors are not great, too (i. e. we want to have location
> info to embed it into semantic error messages)
> - I suspect this parsing library will have very bad speed, possibly O(exp(input size))
> 
> So, let's combine our ideas with package Earley ( https://hackage.haskell.org/package/Earley ).
> Earley has type "Prod r e t a". Let's replace "a" with Kleisli arrow "b -> m c".
> Also let's wrap this Kleisli arrow to "L" from srcloc ( https://hackage.haskell.org/package/srcloc )
> to get location info. Also, we wrap "t" to "L", too.
> 
> Thus we get this type:
> 
> newtype ArrowProd r e t m b c = ArrowProd (Prod r e (L t) (L (b -> m c)))
> 
> Here is resulting library with example: https://hackage.haskell.org/package/parser-unbiased-choice-monad-embedding .
> (I didn't test much.) (I recommend to read docs to Earley first.) (My library is designed to be used
> with some external lexer, lexer-applicative is recommended.)
> 
> The example also uses lexer-applicative, because lexer-applicative automatically wraps tokens into "L".
> 
> So, what we got? We solved original goals, i. e.:
> - We have combinator parser with unbiased choice. Unfortunately, it is not monadic, but it is
> Applicative and Arrow
> - We can embed monad, for example, to handle semantic errors
> - We can test parsing errors before executing embedded monadic action
> 
> Additionally we solved 4 remaining problems mentioned above, i. e.:
> - Handling left-recursive grammars is as simple as right-recursive ones (thanks to Earley's
> RecursiveDo)
> - Parsing errors are ok
> - We track locations and we can embed them into semantic errors
> - We have relatively good speed thanks to Earley
> 
> What else? We can test grammar for ambiguity using
> https://hackage.haskell.org/package/Earley-0.13.0.1/docs/Text-Earley.html#v:upTo (I didn't wrap
> this function, but this can easily be done).
> 
> Personally I think that my parsing library is simply best. :) And that one should always use it
> instead of all others. Why I think this? Because:
> - We all know that CFG is better than PEG, i. e. unbiased choice is better that biased
> - I don't want to merely produce AST, I want to process information while I parse, and this
> processing will uncover some semantic errors
> - So I want unbiased choice with handling semantic errors
> - The only existing solution, which can do this is my library (and also "happy")
> - But "happy" doesn't automatically track locations (as well as I know it tracks lines, but not columns)
> - So the best parsing solution is my library :)
> 
> My library has another advantage over happy: by extensive use of Alternative's "many" (and
> arrow banana brackets) you can write things, which are impossible to write in happy. Consider
> this artificial example (subset of Pascal):
> ---
> var
>    a: integer; b: integer;
> begin
>    a := b + c;
> end.
> ---
> This is its parser, which uses my library (completely untested):
> ---
> arrowRule (proc () -> do {
>    sym TVar -< ();
>    declated <- many (ident <* sym TColon <* sym TInteger <* sym TSemicolon) -< ();
>    sym TBegin -< ();
> 
>    -- Banana brackets
>    (|many (do {
>      x <- ident -< ();
>      lift -< when (x `notElem` declated) $ Left "Undeclated identifier";
>      sym TAssign -< ();
> 
>      -- My library doesn't have "sepBy", but it can be easily created
>      (|sepBy (do {
>        y <- ident -< ();
>        lift -< when (y `notElem` declated) $ Left "Undeclated identifier";
>        returnA -< ();
>      })|) [TPlus];
>      sym TSemicolon -< ();
>      returnA -< ();
>    |);
>    sym TEnd -< ();
>    sym TDot -< ();
>    returnA -< ();
> })
> ---
> This is impossible to write similar code using "happy" with similar ergonomics, because in
> happy we would not have access to "declared" inside production for sum. The only way we
> can access "declared" is using state monad (for example, StateT) and put this "declared"
> into this state monad. But in my library you don't have to use any StateT!
> 
> Now let me say why mentioned "anti-arrow" style is not enough. Let's try to rewrite this
> Pascal example using anti-arrow style:
> ---
> antiArrowLift $ do { -- ApplicativeDo
>    sym TVar;
>    declated <- many (ident <* sym TColon <* sym TInteger <* sym TSemicolon);
>    sym TBegin;
>    many $ antiArrowLift $ do {
>      x <- ident;
>      sym TAssign;
> 
>      -- My library doesn't have "sepBy", but it can be easily created
>      sepBy [TPlus] $ antiArrowLift $ do {
>        y <- ident;
>        pure $ when (y `notElem` declated) $ Left "Undeclated identifier"; -- Oops
>      };
>      sym TSemicolon;
>      pure $ when (x `notElem` declated) $ Left "Undeclated identifier"; -- Oops
>    };
>    sym TEnd;
>    sym TDot;
>    pure ();
> }
> ---
> Looks good. But there is a huge problem here: look at lines marked as "Oops". They refer to
> "declared", but they cannot refer to it, because outer "do" is ApplicativeDo. So, yes, merely
> Applicative is not enough.
> 
> Does my library have disadvantages? Of course, it has!
> - It is not monadic
> - It cannot statically check that grammar is element of LR(1) set (as well as I understand,
> happy can do this)
> - My library has relatively good speed asymptotic (same as Earley), but it is still not fastest
> - My library will freeze on infinitely ambiguous grammars. Attempting to check such grammar
> for ambiguity using Earley's "upTo" will cause freezing, too. See also: https://github.com/ollef/Earley/issues/54
> - My library is based on unbiased choice and CFG (as opposed to biased choice and PEG).
> I consider this as advantage, but my library will not go if you want to parse language defined by some PEG
> 
> My library is unfinished. The following things are needed:
> - We need combinator similar to Alternative's "many", but every item should have access to
> already parsed part of list. Such combinator should be made to be used by banana brackets
> - We need combinators similar to parsec's chainl and chainr (my library already supports left
> and right recursion thanks to Earley, but still such combinators would be useful)
> - Already mentioned "sepBy"
> - I didn't wrap all Earley functionality, for example, <?> is left unwrapped
> 
> I don't have motivation for fix this things, because I decided to switch to Rust as my main language.
> 
> Final notes
> - It is quite possible that I actually need attribute grammars or syntax-directed translation. I didn't explore this
> - I suspect that my parser is arrow transformer (whatever this means)
> 
> Side note: I really want some pastebin for reproducible shell scripts (possibly dockerfiles), do you know such?
> 
> Answer me if you have any questions.
> 
> ==
> Askar Safin
> http://safinaskar.com
> https://sr.ht/~safinaskar
> https://github.com/safinaskar
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
>