[Haskell-cafe] Correct parsers for bounded integral values

Sun Jul 20 14:52:20 UTC 2025

Hi Stefan

Isn't Victor's suggestion of using Data.ByteString.Char8.readWord8 a
workaround for the issue with read?

λ>  import Data.ByteString.Char8
λ> readWord8 (pack "298")
Nothing

As a relatively naive Haskell user I agree that

> read "298" :: Word8
42
it :: Word8

seems like a bug.

Before implementing a fix / enhancement it is probably worth filing a bug
to get confirmation that it is. Proposing a change to the behavior of the
base library is, IMHO based on experience, unlikely to be accepted.

It reminds me of

λ> maxBound :: Int
9223372036854775807
λ> maxBound + 1 :: Int
-9223372036854775808

My understanding of the explanation for that is that checking for overflow
is too expensive from a performance point of view.

Cheers
George

On Sun, Jul 13, 2025 at 8:34 AM Stefan Klinger <haskell at stefan-klinger.de>
wrote:

> Hello everybody =)
>
> I've been bugged by the silent overflowing of integer parsers as
> provided by `base`, `attoparsec`, and others.  I'd go so far as to
> call it a bug when the user types `298` and the parser says `Right
> 42`.
>
> Unfortunately, all parsing libraries I've looked at get this wrong.  A
> [solution][4] is proposed below.
>
> I'm asking for feedback on how to continue from here.
>
> Kind regards
> Stefan
>
>
>
>
> The following examples can be reproduced with
>
>     $ git clone 'https://github.com/s5k6/robust-int.git'
>     $ cd robust-int
>     $ cabal repl robust-int:demo
>
>
> Situation
> ---------
>
> This is the current situation with [read][1] from base:
>
>     > read "298" :: Word8
>     42
>
> And with [decimal][2] from attoparsec:
>
>     > A.parseOnly (A.decimal :: A.Parser Word8) $ pack "298"
>     Right 42
>
> And the solution [usually suggested][5] for Parsec (which relies on
> `read`):
>
>     parsecWord8 :: P.Parser Word8
>     parsecWord8 = read <$> P.many1 P.digit
>
>     > P.runParser parsecWord8 () "" "298"
>     Right 42
>
> Even worse, the latter would rather exhaust memory than realise its
> input is way out of bounds:
>
>     > P.runParser parsecWord8 () "" $ repeat '1'
>     ⊥
>
> Also, some 3rd-party libraries get this wrong, e.g.,
> [parsec3-numbers][6]:
>
>     > P.runParser (PN.decimal :: P.Parser Word8)  () "" "298"
>     Right 42
>
> And [megaparsec][8], which is at least nice enough to warn about this
> in its documentation:
>
>     > M.parseMaybe (M.decimal :: M.Parsec () String Word8) "298"
>     Just 42
>
> I find this misses the point of a parser validating its input.
>
>
> Solution
> --------
>
> It is [possible to implement][7] parsers for bounded integral types
> which verify the bounds of the parsed value *while* parsing, and even
> doing this without the use of a “bigger” type.
>
> The idea is as follows:
>
> As usual, we parse digits left to right, and collect the resulting
> value in an accumulator `acc`, i.e., for each new digit `d`, the
> accumulator is updated to
>
>     base * acc + d
>
> Nothing new up to here.  However, before we start parsing, calculate
>
>     (lim, m) = upper_bound `divMod` base
>
> and before updating the accumulator with another digit `d`, verify
> that
>
>     acc < lim || (acc == lim && d <= m)
>
> which exactly guarantees that the accumulator will not overflow.  The
> reason why this works is is easily seen by doing the example for
> `Word16` in base 10:
>
>     > (maxBound :: Word16) `divMod` 10
>     (6553,5)
>     > 10 * fst it + snd it
>     65535
>
> Complexity: This adds a modulo operation and two comparisons for every
> literal being parsed, plus one comparison for every digit.  In order
> to limit memory consumption, some comparison has to take place during
> parsing, at least for limiting the number of digits consumed.  In
> total, this does not look too expensive.
>
> I have [implemented][4] this idea for `parsec` and `attoparsec` to
> demonstrate the idea (only for decimal values).
>
>
> What now?
> ---------
>
> Obviously, this *should not be a another library*, trying to fix some
> aspect of some other libraries.  My code is rather intended for
> demonstration.  I'd prefer to help this idea migrate to the libraries
> (`base`, `parsec`, `attoparsec`, …), where the correct parsers should
> be.
>
> Unfortunately, I got a bit lost when trying to track down the code of
> `read` in the `base` package.  And I think I may have overengineered
> my solution for attoparsec to accommodate different stream types.
>
> Also, I get the impression that Haskell *library* code seems to be
> written with a different mindset, a deeper understanding of GHC than
> mine, i.e., more tailored to what the compiler will *actually do* when
> using the code, trying not to spoil opportunities for optimisation.
> And I'm not sure I'm up to that task.
>
> So I'm asking for feedback on the proposed algorithm, my
> implementation, and hints on where and how to get this into
> established libraries.
>
>
> Build instructions
> ==================
>
>     $ cabal build
>     $ cabal run demo
>
>     $ cabal test
>     $ cabal haddock
>
>
> [1]:
> https://hackage.haskell.org/package/base-4.21.0.0/docs/Prelude.html#v:read
> [2]:
> https://hackage.haskell.org/package/attoparsec-0.14.4/docs/Data-Attoparsec-ByteString-Char8.html#v:decimal
> [3]:
> https://hackage.haskell.org/package/parsec-3.1.18.0/docs/Text-Parsec-Token.html#v:decimal
> [4]: https://github.com/s5k6/robust-int
> [5]:
> https://stackoverflow.com/questions/24171005/how-to-parse-an-integer-with-parsec
> [6]: https://hackage.haskell.org/package/parsec3-numbers
> [7]:
> https://github.com/s5k6/robust-int/blob/master/src/Data/RobustInt/Parsec.hs#L32-L52
> [8]:
> https://hackage.haskell.org/package/megaparsec-9.7.0/docs/Text-Megaparsec-Char-Lexer.html#v:decimal
>
>
> --
> Stefan Klinger, Ph.D. -- computer scientist              o/X
> http://stefan-klinger.de                                 /\/
> https://github.com/s5k6                                    \
> I prefer receiving plain text messages, not exceeding 32kB.
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20250720/b847864a/attachment.html>