[Haskell-cafe] parsec2 vs. parsec3... again

Thu Dec 23 04:01:30 CET 2010

Yeah, I know this has been discussed a number of times, but I have
some concrete questions I haven't seen asked before.  And the "parsec
3 is now as fast as parsec 2" thing I've seen around doesn't seem to
be true for me.

I have an app that does a lot of parsing of small expressions.  It's
currently parsec2 operating on lots of little Texts (after an unpack,
of course).  A few parsing functions show up near the top of the
profile output, so I thought an obvious improvement would be to parse
Text directly and avoid the overhead and garbage of unpacking.  Since
parsec3 is now supposed to be as fast as parsec2 I thought I would
give it a try.  Parsec3 is 3.1.0, parsec 2 is 2.1.0.1:

parsec2, String:
        total time  =       10.66 secs   (533 ticks @ 20 ms)
        total alloc = 2,340,113,404 bytes  (excludes profiling overheads)

parsec3, String: (this is just after upgrading the library and editing
it to fix breakage from Parser being a type alias now)
        total time  =       13.76 secs   (688 ticks @ 20 ms)
        total alloc = 2,706,625,256 bytes  (excludes profiling overheads)

parsec3, Text: (wrote a Text instance similar to the one for
ByteString, updated imports, no longer unpacking to String)
        total time  =       15.96 secs   (798 ticks @ 20 ms)
        total alloc = 3,338,005,896 bytes  (excludes profiling overheads)

This is not very encouraging!  Especially strange is how Text
generates *more* allocation... I'd expect less since it doesn't unpack
all the Texts.  The parsing functions are no longer at the top of the
profile, but there are new 'unParser' and 'parsecMap' and 'parserBind'
up at or near the top.  'unParser' just looks like it's unwrapping the
Parsec newtype, so I don't fully understand how it's the most
expensive, but it's called on every bind so it does get called a lot.
There are no obvious super expensive ones, just lots and lots of them
that add up.  Parsec 3's unParser covers up the parsing function I
wrote, so it's now hard to tell what the expensive parsing function
actually is.

I've seen a few remarks that you can't just throw together parsers and
expect them to be fast, you have to profile them, but nothing on how
to actually interpret the results of profiling.

For instance, here's one of the main expensive parsers:

p_unsigned_float :: P.CharParser st Double
p_unsigned_float = do
    i <- P.many P.digit
    f <- P.option "" (P.char '.' >> P.many1 P.digit)
    if (null i && null f) then P.pzero else do
    let int = List.foldl'
            (\total c -> 10 * total + fromIntegral (Char.digitToInt c)) 0 i
        frac = foldr
            (\c total -> (total + fromIntegral (Char.digitToInt c)) / 10) 0 f
    return (int + frac)

There's an obvious problem where I get the digits as a String and then
parse that with list functions, but I can't see any way to get parsec
to return a chunk of Text.  This is roughly how parsec itself parses
numbers, in Text.Parsec.Token.

So, my current options are either figure out some way to speed up
parsec3+Text, revert to parsec2+String and give up, or try an entirely
different parsing library.  I've heard attoparsec is fast but I'd have
to switch to utf8 bytestring which is a big change, and Text seems
like the more correct choice anyway.

Any ideas or experience?