Correct parsers for bounded integral values
Stefan Klinger
haskell at stefan-klinger.de
Mon Jul 21 15:56:46 UTC 2025
Thanks for the encouragement Rodrigo! I'll follow the process and
hope to open a ticket soon.
Viktor Dukhovni (2025-Jul-21, excerpt):
> It is also fair to point out that once an Int or other bounded integral
> type is read, arithmetic with that type (addition, subtraction and
> multiplication) silently overflows. And so silent overflow in `read`
> is not inconsistent with the type's semantics.
I see parsing as a boundary between an outside world (throwing text at
me) and an inside world, where I have programmed some algorithm. As
the programmer, it is my responsibility to ensure that the types are
chosen so that the algorithm works correctly, ideally on any accepted
input, i.e., I have to guarantee that no inadvertent overflow happens
in this inside world. However, calculating away on misinterpreted
input will lead to invalid results.
Viktor Dukhovni (2025-Jul-21, excerpt):
> That said, if various middleware libraries hide overflows, because under
> the covers they're using `read`, that could be a problem, so we do want
> the ecosystem at large to make sensible choices about when silent
> overflow may or may not be appropriate. Perhaps that means having
> both wrapping and overflow-checked implementations available, and
> clear docs with each about its behaviour and the corresponding
> alternative.
I did not realise this clearly enough before, but have elaborated a
bit on Haskell-cafe [1]. We do have unbounded `read :: String ->
Integer` and silently overflowing `fromInteger :: Integer -> Word8`,
which can be combined if overflow is desired. This follows the idea
to be explicit about dangerous things. In addition, we have `read ::
String -> Word8` and company, which I'd like to fix.
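A minimal sketch of that combination, using only `base`. The names
`readWrapping` and `readChecked` are hypothetical, chosen for
illustration; they are not part of `base` or of [2]. The first makes
the wrap-around explicit, the second shows roughly the behaviour I
would like a bounded parser to have:

```haskell
import Data.Word (Word8)
import Text.Read (readMaybe)

-- Explicit wrap-around: parse as unbounded Integer, then narrow on purpose.
readWrapping :: String -> Maybe Word8
readWrapping s = fromInteger <$> (readMaybe s :: Maybe Integer)

-- Checked variant (hypothetical): reject values outside the target range.
readChecked :: String -> Maybe Word8
readChecked s = do
  n <- readMaybe s :: Maybe Integer
  if n >= toInteger (minBound :: Word8) && n <= toInteger (maxBound :: Word8)
    then Just (fromInteger n)
    else Nothing
```

Here `readWrapping "300"` wraps to `Just 44`, while `readChecked "300"`
gives `Nothing`. Note that this sketch still reads the complete
number into an `Integer` first, so it does not address space usage on
very long inputs.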
> A few quick observations about [2]:
Thank you =)
> - It disallows explicit leading "+" (just like "read", but perhaps
> that should be tolerated).
Yes, it probably should not be that strict. For my own projects I
assumed it easier to make it more forgiving later, than the other way
round. There really should be consensus on whether or not leading `+`
or `0` should be allowed. But these are fixes to make towards the
end, I guess.
> - It disallows multiple leading zeros, perhaps these should be
> tolerated.
>
> - It disallows "-0", perhaps these should be tolerated, as well
> as "-0000", "-000001", ... (With lazy ByteStrings, which might
> never terminate, there is a generous, but sensible limit on
> the number of leading zeros allowed).
I ruled this out because I wanted a simple guarantee for termination.
Your idea of “generous, but sensible” sounds compelling: the leading
`0`s can be consumed in constant space, since we need not keep them.
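A rough sketch of what such a limit could look like on a plain
`String`. The helper `skipZeros` and its `limit` parameter are
hypothetical, not taken from [2] or from any ByteString parser:

```haskell
-- Drop leading zeros, remembering only how many were seen.
-- Gives Nothing as soon as more than `limit` zeros have been consumed,
-- so it terminates even on an endless stream of '0'.
skipZeros :: Int -> String -> Maybe (Int, String)
skipZeros limit = go 0
  where
    go n ('0' : rest)
      | n < limit = go (n + 1) rest
      | otherwise = Nothing
    go n rest = Just (n, rest)
```

For example, `skipZeros 3 "0007"` yields `Just (3, "7")`, and
`skipZeros 3 "00007"` yields `Nothing`. A caller would still have to
treat an input consisting only of zeros as the value `0`.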
> - One way to avoid difficulties with handling negative minBound is
> to parse signed values via the corresponding unsigned type, which
> can accommodate `-minBound` as a positive value, and then negate
> the final result. This makes it possible to share the low-level
> digit-by-digit code between the positive and negative cases.
How do you mean? I did not get this “accommodate `-minBound` as a
positive value” right. My initial approach of using

    char '-' >> negate <$> parseUnsigned (negate minBound)

fails, exactly because the negation of the lower bound may not be
(read: is usually not) within the upper bound, and thus wraps around,
e.g., `negate (minBound :: Int8)` incorrectly yields `-128`, because
the upper bound is `127`.
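One possible reading of the suggestion, sketched for `Int8` and
`Word8` only: the magnitude `128` does not fit `Int8`, but it does
fit `Word8`, so the bound check can happen on the unsigned side
before negating. The names `minMagnitude` and `negateMagnitude` are
hypothetical, and this may well not be what Viktor had in mind:

```haskell
import Data.Int (Int8)
import Data.Word (Word8)

-- Magnitude of the lower bound, computed via Integer to avoid wrap-around.
minMagnitude :: Word8
minMagnitude = fromInteger (negate (toInteger (minBound :: Int8)))

-- Turn an already parsed magnitude into a negative Int8, checking the range.
negateMagnitude :: Word8 -> Maybe Int8
negateMagnitude m
  | m <= minMagnitude = Just (fromInteger (negate (toInteger m)))
  | otherwise = Nothing
```

With this, `negateMagnitude 128` gives `Just (-128)` and
`negateMagnitude 129` gives `Nothing`, whereas
`negate (minBound :: Int8)` still evaluates to `minBound` itself.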
Viktor Dukhovni (2025-Jul-21, excerpt):
> If parsing of Integer and Natural is also in scope […]
No, not at all. I have no reservations against `read` for the
unbounded types. That should be left alone.
Cheers
Stefan
[1]: https://mail.haskell.org/pipermail/haskell-cafe/2025-July/137162.html
[2]: https://github.com/s5k6/robust-int
--
Stefan Klinger, Ph.D. -- computer scientist o/X
http://stefan-klinger.de /\/
https://github.com/s5k6 \
I prefer receiving plain text messages, not exceeding 32kB.