[Haskell-cafe] Correct parsers for bounded integral values
Jeff Clites
jclites at mac.com
Wed Jul 16 23:13:31 UTC 2025
I agree the current behavior is probably not what you'd usually want. A couple of potentially interesting considerations though:
1) Considering `read` separately from parser libraries: I believe that `read` is supposed to let you interpret number strings the way the compiler would, and the compiler accepts `298 :: Word8` as 42, albeit with a warning that can be enabled:
ghci> 300 :: Word8
<interactive>:25:1: warning: [-Woverflowed-literals]
Literal 300 is out of the Word8 range 0..255
44
But you'll get no warning (due to the polymorphism of literals) under usage patterns such as the following, which are probably more typical:
ghci> (let x = 300 in x) :: Word8
44
Since `read` doesn't have a way to indicate failure other than an exception, the current behavior is probably the least bad.
2) For parser libraries, failure is part of the story, so things can be better. Speaking generally, when you parse a number out of a string like "29A", you get 29 with "A" leftover for further parsing. So in terms of consistency, you might expect parsing a `Word8` out of "298" to give you 29, with "8" leftover. This is probably not what anybody would want, but it is hinting at some more general considerations. For the still somewhat restricted case of number, a result type like `Word8` implies a bound, but you might just as reasonably want to parse into an `Int` but with a bound of, say, 150. I had a case in which I was parsing into an `Int` but with a restriction of only allowing up to 3 digits, failing if there were more digits rather than stopping at 3.
So you could imagine a more general parsing primitive, not specific to numbers, where you could accept characters and decide when to stop and whether to fail. Attoparsec has something close but not quite there [1]:
scan :: s -> (s -> Char -> Maybe s) -> Parser Text
runScanner :: s -> (s -> Char -> Maybe s) -> Parser (Text, s)
This is of course very much like a fold. But these never fail; I think what you'd want instead of `Maybe` is a return value which lets you decide between: continue (call again with another Char's), fail (and backtrack), and stop with success and yielding a final accumulator value (with two variants: either consuming or not consuming the last-supplied value). This reminds me of the `Iteratee` type, somewhat. This would let you implement everything discussed above: fail if there are too many digits or instead stop short, and use whatever bounding criteria you might want. And of course this could be used for parsing things other than numbers. I think this would be easy to implement for Attoparsec, and probably for other libraries too.
A couple of other notes:
> parsecWord8 = read <$> P.many1 P.digit
> ...
> Even worse, the latter would rather exhaust memory than realise its
> input is way out of bounds
True, but any use of combinators like `many` will run into problems with unbounded input, so this case isn't categorically worse. (I'm thinking of the common cases where you are planning to keep the "many" things parsed.) I think unbounded input requires a different sort of approach (along the lines of Conduit or Pipes). So in practice this is probably fine for many parsing scenarios. (But yes conceptually it seems sloppy.)
> However, before we start parsing, calculate
>
> (lim, m) = upper_bound `divMod` base
>
> and before updating the accumulator with another digit `d`, verify
> that
>
> acc < lim || (acc == lim && d <= m)
There's another approach, which might save a few computations. For the example of `Word16`, whose `maxBound` is 65535, the following would work:
1) Read up to 6 digits (since the bound has 5 digits).
2) If you got 6 digits, fail.
3) If you got 4 digits, you are within bounds, you can can use whatever typical conversion routine
4) If you got 5 digits:
a) Convert the first 4 to a number as usual
b) If that number is < 6553, accumulating the final digit will be within bounds
c) If that number is > 6553, fail
d) If that number is == 6553, do further logic involving the final digit.
This isn't prettier, but might save some checks. The main point here is that counting the number of digits gets you most of the way, and it's only for the final digit of a max-digits-sized number that you need to go into more detailed logic. (You would need a typeclass to cache the pre-analysis of maxBound, probably.)
Anyway, just some thoughts, since I've previously run into this tension between stopping versus failing when parsing, and it's interesting to think about how best to be consistent while still doing what you actually want in different cases.
Jeff Clites
[1]: https://hackage.haskell.org/package/attoparsec-0.14.4/docs/Data-Attoparsec-Text.html#v:scan
> On Jul 13, 2025, at 4:34 AM, Stefan Klinger <haskell at stefan-klinger.de> wrote:
>
> Hello everybody =)
>
> I've been bugged by the silent overflowing of integer parsers as
> provided by `base`, `attoparsec`, and others. I'd go so far as to
> call it a bug when the user types `298` and the parser says `Right
> 42`.
>
> Unfortunately, all parsing libraries I've looked at get this wrong. A
> [solution][4] is proposed below.
>
> I'm asking for feedback on how to continue from here.
>
> Kind regards
> Stefan
>
>
>
>
> The following examples can be reproduced with
>
> $ git clone 'https://github.com/s5k6/robust-int.git'
> $ cd robust-int
> $ cabal repl robust-int:demo
>
>
> Situation
> ---------
>
> This is the current situation with [read][1] from base:
>
>> read "298" :: Word8
> 42
>
> And with [decimal][2] from attoparsec:
>
>> A.parseOnly (A.decimal :: A.Parser Word8) $ pack "298"
> Right 42
>
> And the solution [usually suggested][5] for Parsec (which relies on
> `read`):
>
> parsecWord8 :: P.Parser Word8
> parsecWord8 = read <$> P.many1 P.digit
>
>> P.runParser parsecWord8 () "" "298"
> Right 42
>
> Even worse, the latter would rather exhaust memory than realise its
> input is way out of bounds:
>
>> P.runParser parsecWord8 () "" $ repeat '1'
> ⊥
>
> Also, some 3rd-party libraries get this wrong, e.g.,
> [parsec3-numbers][6]:
>
>> P.runParser (PN.decimal :: P.Parser Word8) () "" "298"
> Right 42
>
> And [megaparsec][8], which is at least nice enough to warn about this
> in its documentation:
>
>> M.parseMaybe (M.decimal :: M.Parsec () String Word8) "298"
> Just 42
>
> I find this misses the point of a parser validating its input.
>
>
> Solution
> --------
>
> It is [possible to implement][7] parsers for bounded integral types
> which verify the bounds of the parsed value *while* parsing, and even
> doing this without the use of a “bigger” type.
>
> The idea is as follows:
>
> As usual, we parse digits left to right, and collect the resulting
> value in an accumulator `acc`, i.e., for each new digit `d`, the
> accumulator is updated to
>
> base * acc + d
>
> Nothing new up to here. However, before we start parsing, calculate
>
> (lim, m) = upper_bound `divMod` base
>
> and before updating the accumulator with another digit `d`, verify
> that
>
> acc < lim || (acc == lim && d <= m)
>
> which exactly guarantees that the accumulator will not overflow. The
> reason why this works is is easily seen by doing the example for
> `Word16` in base 10:
>
>> (maxBound :: Word16) `divMod` 10
> (6553,5)
>> 10 * fst it + snd it
> 65535
>
> Complexity: This adds a modulo operation and two comparisons for every
> literal being parsed, plus one comparison for every digit. In order
> to limit memory consumption, some comparison has to take place during
> parsing, at least for limiting the number of digits consumed. In
> total, this does not look too expensive.
>
> I have [implemented][4] this idea for `parsec` and `attoparsec` to
> demonstrate the idea (only for decimal values).
>
>
> What now?
> ---------
>
> Obviously, this *should not be a another library*, trying to fix some
> aspect of some other libraries. My code is rather intended for
> demonstration. I'd prefer to help this idea migrate to the libraries
> (`base`, `parsec`, `attoparsec`, …), where the correct parsers should
> be.
>
> Unfortunately, I got a bit lost when trying to track down the code of
> `read` in the `base` package. And I think I may have overengineered
> my solution for attoparsec to accommodate different stream types.
>
> Also, I get the impression that Haskell *library* code seems to be
> written with a different mindset, a deeper understanding of GHC than
> mine, i.e., more tailored to what the compiler will *actually do* when
> using the code, trying not to spoil opportunities for optimisation.
> And I'm not sure I'm up to that task.
>
> So I'm asking for feedback on the proposed algorithm, my
> implementation, and hints on where and how to get this into
> established libraries.
>
>
> Build instructions
> ==================
>
> $ cabal build
> $ cabal run demo
>
> $ cabal test
> $ cabal haddock
>
>
> [1]: https://hackage.haskell.org/package/base-4.21.0.0/docs/Prelude.html#v:read
> [2]: https://hackage.haskell.org/package/attoparsec-0.14.4/docs/Data-Attoparsec-ByteString-Char8.html#v:decimal
> [3]: https://hackage.haskell.org/package/parsec-3.1.18.0/docs/Text-Parsec-Token.html#v:decimal
> [4]: https://github.com/s5k6/robust-int
> [5]: https://stackoverflow.com/questions/24171005/how-to-parse-an-integer-with-parsec
> [6]: https://hackage.haskell.org/package/parsec3-numbers
> [7]: https://github.com/s5k6/robust-int/blob/master/src/Data/RobustInt/Parsec.hs#L32-L52
> [8]: https://hackage.haskell.org/package/megaparsec-9.7.0/docs/Text-Megaparsec-Char-Lexer.html#v:decimal
>
>
> --
> Stefan Klinger, Ph.D. -- computer scientist o/X
> http://stefan-klinger.de /\/
> https://github.com/s5k6 \
> I prefer receiving plain text messages, not exceeding 32kB.
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
More information about the Haskell-Cafe
mailing list