[Haskell-cafe] Announcing binary-parsers

Sun Oct 9 05:56:55 UTC 2016

On Sun, Oct 2, 2016 at 3:17 AM, 韩冬(基础平台部) <handongwinter at didichuxing.com> wrote:
> Hi wren!
>
> Yes, I noticed that attoparsec's numeric parsers are slow. I have a benchmark suite that compares attoparsec and binary-parsers on different sample JSON files; it's on GitHub: https://github.com/winterland1989/binary-parsers.
>
> I'm pretty sure bytestring-lexing helped a lot: the average decoding speed improvement is around 20%, but the numeric-only benchmarks (integers and numbers) improved by 30%!

So still some substantial gains for non-numeric stuff, nice!

> Parsing is just one part of JSON decoding; lots of time is spent on unescaping, etc. So the parser's improvement is quite large IMHO.
>
> BTW, can you provide a version of the lexer which doesn't check whether a Word is a digit? In binary-parsers I use something like `takeWhile isDigit` to extract the input ByteString, so there's no need to verify this in the lexer again. Maybe we can get another performance improvement.

I suppose I could, but then it wouldn't be guaranteed to return
correct answers. The way things are set up now, the intended workflow
is that wherever you're expecting a number, you should just hand the
ByteString over to bytestring-lexing (i.e., not bother
scanning/pre-lexing via `takeWhile isDigit`) and it'll give back the
answer together with the remainder of the input. This ensures that you
don't need to do two passes over the characters. So, for Attoparsec
itself you'd wrap it up with something like:

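    -- Sketch only: `get` is internal to attoparsec (see [1]);
    -- `readDecimal` is the bytestring-lexing lexer.
    decimal :: Integral a => Parser a
    decimal =
        get >>= \bs ->
        case readDecimal bs of
            Nothing       -> fail "error message"
            Just (a, bs') -> put bs' >> return a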

Alas `get` isn't exported[1], but you get the idea. Of course, for
absolute performance you may want to inline all the combinators to see
if there's stuff you can get rid of.
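
To make the "answer together with the remainder" shape concrete outside of any parser library, here is a minimal sketch that depends only on the bytestring package. `readDecimalSketch` and `readPairSketch` are hypothetical names; the former is a hand-rolled stand-in with the same result shape as bytestring-lexing's `readDecimal`, not its actual implementation. It checks and accumulates each digit in a single loop, so a separate `takeWhile isDigit` pass beforehand would only repeat work.

```haskell
{-# LANGUAGE BangPatterns #-}

import           Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BC
import           Data.Char (isDigit, ord)

-- Hypothetical stand-in for a decimal lexer: one pass over the input,
-- returning the value and the unconsumed remainder, or Nothing when the
-- input does not start with a digit. Overflow of Int is not checked.
readDecimalSketch :: ByteString -> Maybe (Int, ByteString)
readDecimalSketch bs = go 0 0
  where
    len = BC.length bs

    go :: Int -> Int -> Maybe (Int, ByteString)
    go !acc !i
        | i < len, isDigit c = go (acc * 10 + (ord c - ord '0')) (i + 1)
        | i == 0             = Nothing
        | otherwise          = Just (acc, BC.drop i bs)
      where
        -- Only forced after the bounds check above has succeeded.
        c = BC.index bs i

-- Chaining on the remainder: two decimals separated by a comma.
-- A failed pattern match in the Maybe monad yields Nothing.
readPairSketch :: ByteString -> Maybe (Int, Int)
readPairSketch bs0 = do
    (x, bs1)   <- readDecimalSketch bs0
    (',', bs2) <- BC.uncons bs1
    (y, _)     <- readDecimalSketch bs2
    return (x, y)
```

For instance, `readPairSketch (BC.pack "12,34")` evaluates to `Just (12,34)`, while `readPairSketch (BC.pack "12;34")` evaluates to `Nothing`.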

The only reason for scanning ahead is in case you're dealing with lazy
bytestrings and so need to glue them together in order to use
bytestring-lexing. Older versions of the library did have support for
lazy bytestrings, but I removed it because it was bitrotten and
unused. But if you really need it, I can add new variants of the
lexers for dealing with the possibility of requesting new data when
the input runs out.
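
For the lazy case, the scan-ahead-and-glue approach might look roughly like the sketch below. It is only an illustration under stated assumptions: `readDecimalLazySketch` is a hypothetical name, and `readInt` from the bytestring package stands in for a strict lexer such as bytestring-lexing's `readDecimal`, purely to keep the example free of extra dependencies.

```haskell
import qualified Data.ByteString.Char8      as BC
import qualified Data.ByteString.Lazy.Char8 as BL
import           Data.Char (isDigit)

-- Hypothetical helper: scan ahead over a lazy ByteString to find the run
-- of digits (which may span several chunks), glue that run into a single
-- strict ByteString, and hand it to a strict reader.
readDecimalLazySketch :: BL.ByteString -> Maybe (Int, BL.ByteString)
readDecimalLazySketch lbs = do
    let (digits, rest) = BL.span isDigit lbs
    -- toStrict copies the digit run into one contiguous buffer.
    -- readInt returns Nothing on empty input, so a missing number fails here.
    (n, _) <- BC.readInt (BL.toStrict digits)
    return (n, rest)
```

With the input `BL.fromChunks [BC.pack "12", BC.pack "34,rest"]`, the digits straddle a chunk boundary, and the sketch returns `1234` together with the lazy remainder `",rest"`. The cost is the second pass and the copy that the strict workflow avoids.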

[1] <http://hackage.haskell.org/package/attoparsec-0.13.1.0/docs/src/Data-Attoparsec-ByteString-Internal.html#get>

-- 
Live well,
~wren