[Haskell-beginners] Attoparsec and multiline comments

Yitzchak Gale gale at sefer.org
Thu Sep 15 11:38:08 CEST 2011


Alexander.Vladislav.Popov wrote:
>> Help me to parse multiline comments (like /* ... */ or even like /*- ...
>> -*/) using attoparsec.

Daniel Fischer wrote:
> The docs for manyTill at
> http://hackage.haskell.org/packages/archive/attoparsec/0.9.1.2/doc/html/Data-
> Attoparsec-Combinator.html
> have the example
>
> simpleComment   = string "<!--" *> manyTill anyChar (try (string "-->"))

Unfortunately, many of the examples in the documentation for Attoparsec
are just lifted directly from the Parsec documentation, so they
need to be modified somewhat to work in Attoparsed.
This is one of those cases.

First of all, the string function in Attoparsec has type

string :: ByteString -> Parser ByteString

So you can't use a String as its argument directly; you need to
wrap the String with Data.ByteString.Char8.pack.

If you are parsing source code text, you anyway would be better
off using Text instead of String and attoparsec-text instead of
attoparsec. That, together with the OverloadedStrings extension,
will solve that problem and more. Then you don't even need
the "string" function. Just use string literals to parse strings,
and the compiler will automatically insert calls to "string".

The next issue is that you don't need "try" - unlike in Parsec,
the string combinator in Attoparsec does *not* consume any
input when it fails.

Finally, if you are bothering to use attoparsec, it is presumably
because you want the super speed it is able to achieve using
fusion. But the documentation points out that you don't get that
fusion for combinators like manyTill, only for "byte-oriented"
combinators like takeTill. So you should factor the parser to
parse the comment (quickly!) into chunks beginning with
'*', then use manyTill only on the chunks.

Here is an untested example with the above modifications:

comment = "/*" .*> (emptyComment <|> T.concat <$> commentChunks)
emptyComment = "*/" .*> pure T.empty
commentChunks = manyTill (takeWhile1 (/= '*')) (string "*/")

I did need to use the string keyword with manyTill, because
manyTill is too polymorphic for the compiler to be able to
deduce the type of the string literal. Since manyTill is so
common, I often define a type-specialized version of it
for convenience:

manyTillS :: Parser a -> Parser Text -> Parser [a]
manyTillS = manyTill

That is analogous to ".*>", attoparsec-text's type-specialized
version of  "*>" from Control.Applicative.

Then you can write:

commentChunks = takeWhile1 (/= '*') `manyTillS` "*/"

Regards,
Yitz



More information about the Beginners mailing list