[Haskell-cafe] Correct parsers for bounded integral values

Sun Jul 20 10:43:24 UTC 2025

Hello,

and thanks for the detailed feedback!  Unfortunately, I'm unlikely to
find time to respond between weekends.  So here it is in one block.

Jeff Clites via Haskell-Cafe (2025-Jul-16, excerpt):
> I believe that `read` is supposed to let you interpret number
> strings the way the compiler would, and the compiler accepts `298 ::
> Word8` as 42,

Maybe there is a more general perspective on parsing, in which
silently wrapping around is a necessity.  But I'm not aware of it and
hence, please excuse my ignorance, I maintain the standpoint that it's
a bug that should be eradicated.  Even if that's how GHC (currently)
does it.

AFAIK, GHC is tightly coupled with the `base` package (although I do
not understand the details), so maybe fixing this in `base` would fix
it for `read` and GHC itself?

Tom Smeding (2025-Jul-13, excerpt):
> It is implemented in terms of a ReadP parser.

Tom, thanks for pointing me to [4].  Yes, that looks feasible.  How
would I go about modifying the `base` library?  Sorry if that's a
stupid question, but I don't know where to start, `base` just always
happened to be there already.  I.e., could someone please help me to
get a simple setup (my usual modus operandi is the shell, cabal, and
an editor) where the base library is accessible for modification and
testing?

Viktor Dukhovni (2025-Jul-13, excerpt):
> There were more libraries to look at.

True, my horizon is limited — I mean this in all honesty, no sarcasm
implied!  I'd still like to help get a fix on the buggy ones =)

Jeff, thanks for your very elaborate answer.  I'll discuss to some of
these ideas, just slightly out of order.

Jeff Clites via Haskell-Cafe (2025-Jul-16, excerpt):
> Since `read` doesn't have a way to indicate failure other than an
> exception, the current behavior is probably the least bad.

Hmmmmmm.  I'd like to oppose that: I find it hard to imagine any
scenario in which I'd rather have a program continue with a
wrapped-around number than crash (obviously, I'm a follower of the
fail-fast cult).

Are there other opinions on that?  Is wrap-around needed anywhere?

Jeff Clites via Haskell-Cafe (2025-Jul-16, excerpt):
> 2) For parser libraries, failure is part of the story, so things can
> be better. Speaking generally, when you parse a number out of a
> string like "29A", you get 29 with "A" leftover for further
> parsing. So in terms of consistency, you might expect parsing a
> `Word8` out of "298" to give you 29, with "8" leftover. This is
> probably not what anybody would want, but it is hinting at some more
> general considerations. For the still somewhat restricted case of
> number, a result type like `Word8` implies a bound, but you might
> just as reasonably want to parse into an `Int` but with a bound of,
> say, 150. I had a case in which I was parsing into an `Int` but with
> a restriction of only allowing up to 3 digits, failing if there were
> more digits rather than stopping at 3.
>
> So you could imagine a more general parsing primitive, not specific
> to numbers, where you could accept characters and decide when to
> stop and whether to fail. Attoparsec has something close but not
> quite there [1]:
>
>     scan :: s -> (s -> Char -> Maybe s) -> Parser Text
>     runScanner :: s -> (s -> Char -> Maybe s) -> Parser (Text, s)

Ok, I see three different aspects here: (1) parsing a predetermined
number of digits, (2) parsing with a different limit than implied by
the datatype, and (3) having a more general approach not limited to
numbers.  I think none of them opposes fixing the current behaviour of
`read`.

On (1), a use case for a known number of digits is “take the year out
of timestamp `20250720-105659`”.  A valid solution would be

     parseYear :: Word16
     parseYear = read <$> P.count 4 digit

while the following would be a programming error:

     parseYear :: Word8
     parseYear = read <$> P.count 4 digit

In the current situation, this error would not be spotted until the
date is actually used, while an improved `read` would likely fail when
testing the parser on any expectable date.

On (2), the method I've implemented does adapt to different upper
limits.  In fact, the `bounded` parsers I provide are instances [3] of
more primitive parsers allowing for different limits.

But that's not even the core issue.

If I had a separate limit checking parser combinator

    chkLimit :: (Integral a, Bounded a) -> a -> a -> Parser a -> Parser a
    chkLimit min max p = … -- fail if p results in out of bounds value

and used it (with a parser `parseWord` that wraps around) to parse a
value in the range 0..23, like so

    hour :: Parser Word8
    hour = chkLimit 0 23 parseWord

then this parser would only catch overflow up to input `"255"`, but
not beyond.

The issue is, that overflow occurs *before* the limit is checked,
obscuring the error.

As *I* understand point (4), I don't think it is a valid argument: The
idea of providing parsers for numbers is precisely to accommodate for
that special case (base, digits) and its peculiarities (limits,
overflow).  Of course one can build something more general using a
parsing library, but the number-parsers are there and are (IMO)
incorrect.

Jeff Clites via Haskell-Cafe (2025-Jul-16, excerpt):
> There's another approach, which might save a few computations. For
> the example of `Word16`, whose `maxBound` is 65535, the following
> would work:
>
> 1) Read up to 6 digits (since the bound has 5 digits).
> 2) If you got 6 digits, fail.
> 3) If you got 4 digits, you are within bounds, you can can use
>    whatever typical conversion routine
> 4) If you got 5 digits:
>     a) Convert the first 4 to a number as usual
>     b) If that number is < 6553, accumulating the final digit will be
>        within bounds
>     c) If that number is > 6553, fail
>     d) If that number is == 6553, do further logic involving the final
>        digit.
>
> This isn't prettier, but might save some checks. The main point here
> is that counting the number of digits gets you most of the way, and
> it's only for the final digit of a max-digits-sized number that you
> need to go into more detailed logic. (You would need a typeclass to
> cache the pre-analysis of maxBound, probably.)

Yes, I have thought about just limiting the digits instead.  The
surprising thing is, there seems to be no gain: Both approaches
require one comparison per digit read (the number-counting approach
needs to count the digits and test whether the limit of this counter
has been reached, the “modulo” approach needs to check whether `lim`
has been reached).  And both approaches need to accumulate the digits
read so far (digit-counting collects the digits and converts them to a
value just before reading the final digit, “modulo” accumulates the
final value on the go).

When comparing both approaches, I find they have almost the same
structure, only different perspective.  What you're doing in steps
4.b–4.d is almost exactly in [this line of code][2].  I'd go so far as
to say it's the same idea, I've just written it down more nicely ;-)

Are there more ideas to this?  Is this needed, i.e., is there any
point for me working on this?

Kind regards
Stefan

[1]: https://hackage.haskell.org/package/attoparsec-0.14.4/docs/Data-Attoparsec-Text.html#v:runScanner
[2]: https://github.com/s5k6/robust-int/blob/master/src/Data/RobustInt/Parsec.hs#L48
[3]: https://github.com/s5k6/robust-int/blob/master/src/Data/RobustInt/Parsec.hs#L116-L126
[4]: https://hackage.haskell.org/package/ghc-internal-9.1201.0/docs/src/GHC.Internal.Read.html#line-586

-- 
Stefan Klinger, Ph.D. -- computer scientist              o/X
http://stefan-klinger.de                                 /\/
https://github.com/s5k6                                    \
I prefer receiving plain text messages, not exceeding 32kB.