[Haskell-cafe] attoparsec and backtracking

Fri Mar 15 20:29:53 CET 2013

I have a couple of problems with attoparsec which I think are related
to its always backtrack nature, but maybe there's some other way to
solve the same problems.

The first is that it's hard to get the right error msg out.  For
instance, I have a parser that tries to parse a number with an
optional type suffix.  It's an error if the suffix is unrecognized:

p_num :: A.Parser Score.TypedVal
p_num = do
    num <- p_untyped_num
    suffix <- A.option "" ((:"") <$> A.letter_ascii)
    case Score.code_to_type suffix of
        Nothing -> fail $ "p_num expected suffix in [cdsr]: " ++ show suffix
        Just typ -> return $ Score.Typed typ num

However, which error msg shows up depends on the order of the (<|>)
alternatives, and in general the global structure of the entire
parser, because I think it just backtracks and then picks the last
failing backtrack.  Even after carefully rearranging all the parsers
it seems impossible to get this particular error to bubble up to the
top.  The thing is, as soon as I see an unexpected suffix I know I can
fail entirely right there, with exactly that error msg, but since
there's no way to turn off backtracking I think there's no way to do
that.

The second thing is that I want to lex a single token.  I thought I
could just parse a term and then see how far I got and then use that
index to splitAt the input.  But attoparsec doesn't keep track of the
current index, so I wrote ((,) <$> p_term <*> A.takeByteString), and
then I can use the length of the left over bytestring.  But due to
backtracking, that changes what p_term will parse.  Since
takeByteString always succeeds, it will cause p_term to backtrack
until it finds some prefix that will match.  The result is that
instead of failing to parse, "1." will lex as ("1", ".").  Since I
integrate lexing and parsing (as is common with combinator parsers),
and since it seems I can't keep track of byte position with
attoparsec, I think I'm out of luck trying to do this the easy way.  I
think I have to write a separate lexer that tries to have the same
behaviour as the parser, but is all separate code.

I know attoparsec was designed for speed, and error reporting and
source positions are secondary.  Am I simply asking too much of it?  I
originally used parsec, but parsing speed is my main bottleneck, so I
don't want to give ground there.  Is there a clever way to get
attoparsec to do what I want?  Or a ByteString or Text parser out
there which is fast, but can not backtrack, or keep track of input
position?  I've heard some good things about traditional alex+happy...
of course it would mean a complete rewrite but might be interesting.
Has anyone compared the performance of attoparsec vs. alex+happy?