[Haskell-cafe] ANNOUNCE: Megaparsec – an improved and actively maintained fork of Parsec

Artyom yom at artyom.me
Tue Sep 29 18:42:55 UTC 2015


Hello!

I'd like to announce the release of Megaparsec, a fork of Parsec that has
been in the works for the past 2 months. There are a lot of improvements and
bugfixes under the hood – as well as a new test suite with 128 QuickCheck
tests covering 80% of the code (Parsec has only 3 tests, by the way) – but
first I'd like to explain why a fork was needed, since forking a popular
library is a pretty drastic measure and should be accompanied by an
explanation. (A disclaimer: I've been given permission to announce the
library, but I'm neither the author nor an expert on parsing.)

     Hackage:
     https://hackage.haskell.org/package/megaparsec

     Changelog (including a list of differences from Parsec):
     https://hackage.haskell.org/package/megaparsec/changelog

     Github:
     https://github.com/mrkkrp/megaparsec

If you ever had any ideas about what Parsec should've done differently, or
what amazing new combinators it should include, etc., post your ideas here:

     https://github.com/mrkkrp/megaparsec/issues

Why fork Parsec instead of writing a new library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are many parsing libraries on Hackage; Parsec was one of the first to
appear, and it's still the one that is recommended to beginners when they ask
about doing parsing in Haskell. Most other libraries aren't exactly trying to
compete with Parsec – instead they explore new directions.

Like it or not, it remains a fact that a lot of people are being recommended
Parsec, a lot of people are using Parsec, and a lot of people will probably
continue to use Parsec since there's no clear alternative to it. Writing a
new and *different* library probably won't change that. Even new and
*similar* libraries (trifecta and attoparsec are similar enough) haven't
removed the need for Parsec, only mitigated it somewhat. Perhaps one day
trifecta or something else will completely replace Parsec, but right now we
still have to put up with Parsec. So, what we need (or at least what would
be nice to have) is simply a better Parsec.

What's wrong with original Parsec
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Parsec's development stagnated long ago; you'd have a hard time getting
your pull request merged, even if the only thing it does is fix a typo
(I'm not even talking about anything more controversial than that). Parsec
isn't perfect or bug-free, and there *is* a need for those pull requests –
just look at its GitHub page, where you'll find 8 unmerged PRs (and 9 open
issues):

     https://github.com/aslatter/parsec

In addition to things that could be fixed but simply aren't, there are some
inconveniences (and bugs!) that are hard to fix without breaking backwards
compatibility:

* “notFollowedBy eof” will just silently not do what you expect it to do (a
   bug old enough to be considered an undocumented feature)

* “<|>” and “many” are redefined, so importing Text.Parsec clashes with
   Control.Applicative (this can't be trivially fixed in Parsec because its
   “<|>” has a different precedence from Control.Applicative's “<|>”); see
   the snippet after this list

* you can't wrap Parser into monad transformers (I'm talking about things
   like “WriterT [String] Parser a” – if this was possible, there'd be no
   need for “user state” baked into ParsecT)

* “Text.Parsec.Token” is not flexible enough for many needs; if you depend
   on it, one day you may find that you have to copy the whole module to get
   the behavior you want (look at https://github.com/aslatter/parsec/issues/15
   and https://github.com/aslatter/parsec/issues/24, for example).
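
For the record, here is roughly what the “<|>”/“many” clash from the second
item looks like in practice – the usual workaround with Parsec is to hide
the clashing names:

     -- without the "hiding" clause, GHC reports ambiguous occurrences of
     -- "<|>" and "many" at their use sites, because both modules export them
     import Control.Applicative hiding ((<|>), many)
     import Text.Parsec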

In short, there may be a lot of value in improving Parsec – and that's where
Megaparsec comes in.

What is Megaparsec and how is it different
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Megaparsec “reboots” the development of Parsec. It's not backwards
compatible, but it *is* compatible enough to avoid having to rewrite all
Parsec tutorials, and code written for Parsec can be converted to use
Megaparsec pretty mechanically. Given that, I would recommend using
Megaparsec instead of Parsec from now on, unless you need compatibility with
GHCs older than 7.10.

Here is a detailed account of some of the bigger changes:

=== Error reporting ===

Megaparsec's error messages are significantly more accurate than Parsec's in
some cases, as the following demonstration shows.

Parsec:

     > parseTest (try (string "let") <|> string "lexical") "le"
     parse error at (line 1, column 1):
     unexpected end of input
     expecting "lexical"

Megaparsec:

     > parseTest (try (string "let") <|> string "lexical") "le"
     parse error at line 1, column 1:
     unexpected "le"
     expecting "let" or "lexical"

And here's another one, showing off how sometimes “try a <|> b” is a bit
less harmful than usual. Just in case, I'm talking about this blog post:

http://blog.ezyang.com/2014/05/parsec-try-a-or-b-considered-harmful/

If you don't want to read it in full, the gist of it is that if you use “try”
too liberally, Parsec's error messages become much worse than they could be.
In this example, we're trying to parse a simple version of Haskell's import
statements, which can look like either “import Foo” or “import qualified Foo
as B”. If you write a parser like this:

     try pQualifiedImport <|> pImport

you will get an uninformative error message on the following input:

     import qualified Foo s B

Specifically, the “try pQualifiedImport” branch will fail, and then the
“pImport” branch will stumble upon “qualified” – while the actual error
isn't there but in the misspelled “as”.

The advice given in the blog post (“The scope of backtracking try should be
minimized”) is good, and applies to Megaparsec just as well as it applies to
Parsec. However, a curious thing is that in this particular case Parsec
*should* be able to tell you that the error is in misspelled “as”, because
Parsec implements the longest match rule – errors that occur later in text
are given precedence over errors that occur earlier. The code is there, but
somehow it doesn't do what it's supposed to do (and Megaparsec fixes that).

Here's the code from the article, rewritten for Megaparsec:

     import Text.Megaparsec
     import qualified Text.Megaparsec.Lexer as L

     data Stmt = QualifiedImport String String | Import String
       deriving (Show)

     pStmt = try pQualifiedImport <|> pImport

     pImport = do
       keyword "import"
       Import <$> upperCaseIdentifier

     pQualifiedImport = do
       keyword "import"
       keyword "qualified"
       QualifiedImport
         <$> upperCaseIdentifier
         <*> (keyword "as" *> upperCaseIdentifier)

     upperCaseIdentifier = lexeme $
       (:) <$> upperChar <*> many (alphaNumChar <|> oneOf "_.")

     lexeme  = L.lexeme (hidden space)
     keyword = L.symbol (hidden space)

And here are the error messages it produces. Megaparsec:

     > parseTest (pStmt >> eof) "import qualified Foo s B"
     parse error at line 1, column 22:
     unexpected 's'
     expecting "as"

Parsec (with minor modifications):

     > parseTest (pStmt >> eof) "import qualified Foo s B"
     parse error at (line 1, column 8):
     unexpected "q"
     expecting uppercase letter

=== Integration with monad transformers ===

The key type of Parsec is “ParsecT” (others, such as “Parsec” and “Parser”,
are just type synonyms). It lets you use parsers with other monads – for
instance, if you use “ParsecT String () IO”, you can have IO in your
parsers. This is the reason, by the way, why “char” has the type

     char :: Stream s m Char => Char -> ParsecT s u m Char

instead of a simpler

     char :: Char -> Parser Char

(If it were the latter, you wouldn't be able to use “char” and IO in the
same parser.)
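
For instance, here is a small made-up example of a parser that does IO while
it parses (the “pLogged” name and the logging are only for illustration):

     import Text.Parsec
     import Control.Monad.IO.Class (liftIO)

     -- a parser that prints every identifier it reads, just to show that
     -- the underlying monad (here IO) is usable inside the parser
     pLogged :: ParsecT String () IO String
     pLogged = do
       ident <- many1 letter
       liftIO (putStrLn ("saw identifier: " ++ ident))
       return ident

     -- runParserT pLogged () "" "hello" prints the identifier and
     -- returns Right "hello"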

However, even this type isn't general enough for all things you might want
to do. Imagine that you want some parsers to generate warnings when they
are run, and later you want to collect those warnings and do something with
them. This sounds like what Writer was invented for, so you try to use it.
Now all your parsers have this type:

     ParsecT String () (Writer [Warning])

(Where “Warning” is, say, a synonym for String.)

Unfortunately, this is not how you should've composed ParsecT and Writer,
because this way you don't get backtracking – in other words, if you try
to do

     optional (try someParser)

and someParser generates warnings but then fails, “optional” won't be able
to make them disappear – they will still be recorded. Same with State – if
you do

     x <* notFollowedBy y

and “y” changes the state, this change will be recorded even though it's not
what you want most of the time. (You can use Parsec's internal state and
it *will* work, but it doesn't help you when you want to use Writer or
something else instead of State.)
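
To make this concrete, here is a minimal sketch of the problematic
composition (the “warn” helper and the sample output are only for
illustration):

     import Control.Monad.Writer
     import Text.Parsec

     -- ParsecT layered *over* Writer: effects in the underlying monad
     -- cannot be rolled back when a branch fails
     type P = ParsecT String () (Writer [String])

     warn :: String -> P ()
     warn w = lift (tell [w])

     pOptional :: P ()
     pOptional = optional (try (warn "inside the branch" >> string "x"))

     -- runWriter (runParserT pOptional () "" "y") should give something like
     --   (Right (), ["inside the branch"])
     -- i.e. the warning survives even though the branch it came from failed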

What you want in such situations is

     WriterT [Warning] Parser ()
     StateT YourState Parser ()
     ...

but you can't get it because “char” and all other primitive parsers simply
don't have those types. With Parsec, the only solution is to apply “lift” to
all parsers you want to use, which is pretty annoying. Megaparsec solves
this by introducing an mtl-style “MonadParsec” class, making primitive
parsers members of this class, and providing instances of MonadParsec for
various monad transformers. (If you have ever used the ‘parsers’ library,
you may recognise this approach.) I think being able to get backtracking
behavior without relying on inelegant ways like “Parsec user state” is
pretty neat, even if it's not something every Parsec user needs.
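
For example, with MonadParsec the warnings example above can be layered the
other way around – WriterT on top of the parser – and then backtracking
works as you'd expect. A rough sketch (“Warning” and the parser's name are,
again, made up for illustration):

     import Control.Monad.Writer
     import Text.Megaparsec
     import Text.Megaparsec.String (Parser)

     type Warning = String

     -- WriterT on top of the parser; primitives like "string" still work
     -- here because WriterT w Parser has a MonadParsec instance
     pDeprecated :: WriterT [Warning] Parser String
     pDeprecated = do
       s <- string "goto"
       tell ["goto is deprecated"]
       return s

     -- runWriterT pDeprecated :: Parser (String, [Warning])
     -- warnings produced inside a failed "try" branch are discarded
     -- together with the rest of the backtracked state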

=== Lexing ===

Parsec has a lexing module (Text.Parsec.Token):

http://hackage.haskell.org/package/parsec/docs/Text-Parsec-Token.html

If you're not familiar with lexing, the idea is as follows. When you are
parsing a programming language, you often have to solve the same set of
problems – parsing numbers, string literals (with all those escaping rules),
identifiers/operators/keywords, comments, making all parsers handle
whitespace (since there can be whitespace between pretty much any 2 tokens),
and so on. With “Text.Parsec.Token” you could just specify what counts as
whitespace, as a comment, as an identifier character, etc. and get a set of
parsers “for free”.

Parsec achieves this by defining a huge structure called “GenLanguageDef”
that contains the specification of your language:

     data GenLanguageDef s u m =
       LanguageDef {
         commentStart    :: String,
         commentEnd      :: String,
         commentLine     :: String,
         nestedComments  :: Bool,
         identStart      :: ParsecT s u m Char,
         identLetter     :: ParsecT s u m Char,
         opStart         :: ParsecT s u m Char,
         opLetter        :: ParsecT s u m Char,
         reservedNames   :: [String],
         reservedOpNames :: [String],
         caseSensitive   :: Bool }

Then you use “makeTokenParser” on it to generate another huge structure
containing lots of useful parsers:

     makeTokenParser :: GenLanguageDef ... -> GenTokenParser ...

     data GenTokenParser s u m =
       TokenParser {
         identifier    :: ParsecT s u m String,
         operator      :: ParsecT s u m String,
         charLiteral   :: ParsecT s u m Char,
         stringLiteral :: ParsecT s u m String,
         natural       :: ParsecT s u m Integer,
         integer       :: ParsecT s u m Integer,
         float         :: ParsecT s u m Double,
         lexeme        :: forall a. ParsecT s u m a -> ParsecT s u m a,
         parens        :: forall a. ParsecT s u m a -> ParsecT s u m a,
         braces        :: forall a. ParsecT s u m a -> ParsecT s u m a,
         comma         :: ParsecT s u m String,
         colon         :: ParsecT s u m String,
         ... }
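
In practice, the workflow looks roughly like this (using the “emptyDef”
defaults from Text.Parsec.Language as a starting point):

     import Text.Parsec.String (Parser)
     import Text.Parsec.Language (emptyDef)
     import qualified Text.Parsec.Token as P

     -- fill in (part of) a GenLanguageDef and generate the token parsers
     lexer :: P.TokenParser ()
     lexer = P.makeTokenParser emptyDef { P.commentLine = "--" }

     -- then pick the parsers you need out of the generated record
     identifier :: Parser String
     identifier = P.identifier lexer

     integer :: Parser Integer
     integer = P.integer lexer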

What's the problem with this approach? It's very inflexible – the moment
you want to change something that wasn't supposed to be changed, you're on
your own. Do you want to special-case “-- |” comments to use them as doc
strings? The easiest solution is to copy the whole module into your own
project:

     https://github.com/aslatter/parsec/issues/15

Do you want to handle newlines by yourself (for instance, to allow them to
be expression separators)? The easiest solution is to fork Parsec:

     https://github.com/aslatter/parsec/issues/24

(Even if those turn out not to be the easiest solutions, it's still
somewhat telling that they were what the authors ended up with.)

Megaparsec uses a simpler, more flexible approach. See the docs here:

https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Lexer.html

The linked module provides three categories of functions:

   * “integer”, “decimal”, “float”, “charLiteral”, etc. are generic parsers
     that you can use to parse... well, the things they are named after. Some
     of those – like “integer” – are occasionally useful even if you're not
     parsing any languages (how often did you have to write “read <$> many1
     digit”? Now you don't have to).

   * “skipLineComment” and “skipBlockComment” generate comment parsers, and
     “space” combines them together. You can make comment parsers
     arbitrarily complex before passing them to “space”, or you can write
     your own space-skipping combinator.

   * “lexeme”, “symbol”, and “symbol'” make lexemes out of things; a lexeme,
     by convention, expects no leading whitespace and skips all trailing
     whitespace.

So, instead of having parsers passed to each other under the hood, you now
have to pass them around yourself – except that you don't actually have to
do much passing, because you can just write

     import qualified Text.Megaparsec.Lexer as L

     lexeme = L.lexeme space
     symbol = L.symbol space

and off you go:

     keyword = label "keyword" . symbol
     parens = between (symbol "(") (symbol ")")
     ... etc ...
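
And if your language has comments, the whole space consumer is only a few
lines. A sketch (the “--”/“{- -}” comment syntax is just an example):

     import Control.Monad (void)
     import Text.Megaparsec
     import Text.Megaparsec.String (Parser)
     import qualified Text.Megaparsec.Lexer as L

     -- whitespace consumer that also skips line and block comments
     sc :: Parser ()
     sc = L.space (void spaceChar)
                  (L.skipLineComment "--")
                  (L.skipBlockComment "{-" "-}")

     lexeme :: Parser a -> Parser a
     lexeme = L.lexeme sc

     symbol :: String -> Parser String
     symbol = L.symbol sc

     -- e.g. integers as lexemes, instead of "read <$> many1 digit"
     integer :: Parser Integer
     integer = lexeme L.integer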

How you can help
~~~~~~~~~~~~~~~~

* Take a project, modify it to use Megaparsec, and file an issue if you
   encounter any difficulties.

* Take a Parsec tutorial and rewrite it for Megaparsec. (And then please
   send an email so that a link to your tutorial can be added to the
   README file.)

* As I've already mentioned, if you have any ideas about what could be
   changed/improved in Parsec, they likely apply to Megaparsec as well –
   propose them on the issue tracker.

* Report typos in documentation (same goes for mistakes, unclear phrasing,
   etc.). If you're shy (like me) and don't like opening issues, just ping me
   on IRC (I'm “indiagreen” on Freenode) if you spot anything.
