[Haskell-cafe] Correct parsers for bounded integral values

Mon Jul 21 00:16:30 UTC 2025

> On Jul 20, 2025, at 3:43 AM, Stefan Klinger <haskell at stefan-klinger.de> wrote:
> 
> Jeff Clites via Haskell-Cafe (2025-Jul-16, excerpt):
>> I believe that `read` is supposed to let you interpret number
>> strings the way the compiler would, and the compiler accepts `298 ::
>> Word8` as 42,
> 
> Maybe there is a more general perspective on parsing, in which
> silently wrapping around is a necessity.  But I'm not aware of it and
> hence, please excuse my ignorance, I maintain the standpoint that it's
> a bug that should be eradicated.  Even if that's how GHC (currently)
> does it.

GHC's treatment of integer-style numeric literals is part of the Haskell spec--it's not just a GHC implementation detail. Specifically, it treats them as unbounded precision Integer's, subsequently converted to other types via `fromInteger`. (This is described in the Haskell Report.) You can see this in play with expressions such as `x = 123456` -- Haskell doesn't assume a particular number type when it parses this:

    ghci> x = 123456
    ghci> x :: Int
    123456
    ghci> x :: Word8
    64
    ghci> :type x
    x :: Num a => a

You could argue that `fromInteger` should complain, but that's not the language design choice that was made.

My claim isn't that this is what you'd want from a general parsing library. Rather, my claim is that `read` isn't intended to be a building block for use in general parsing, but instead (in the number case) it's to let you interpret a number string the way Haskell would in source code. So it's just the wrong tool for the job when it comes to parsing in general.

> Jeff, thanks for your very elaborate answer.

You are welcome!

> Jeff Clites via Haskell-Cafe (2025-Jul-16, excerpt):
>> 2) For parser libraries, failure is part of the story, so things can
>> be better. Speaking generally, when you parse a number out of a
>> string like "29A", you get 29 with "A" leftover for further
>> parsing. So in terms of consistency, you might expect parsing a
>> `Word8` out of "298" to give you 29, with "8" leftover. This is
>> probably not what anybody would want, but it is hinting at some more
>> general considerations. For the still somewhat restricted case of
>> number, a result type like `Word8` implies a bound, but you might
>> just as reasonably want to parse into an `Int` but with a bound of,
>> say, 150. I had a case in which I was parsing into an `Int` but with
>> a restriction of only allowing up to 3 digits, failing if there were
>> more digits rather than stopping at 3.
>> 
>> So you could imagine a more general parsing primitive, not specific
>> to numbers, where you could accept characters and decide when to
>> stop and whether to fail. Attoparsec has something close but not
>> quite there [1]:
>> 
>>    scan :: s -> (s -> Char -> Maybe s) -> Parser Text
>>    runScanner :: s -> (s -> Char -> Maybe s) -> Parser (Text, s)
> 
> Ok, I see three different aspects here: (1) parsing a predetermined
> number of digits, (2) parsing with a different limit than implied by
> the datatype, and (3) having a more general approach not limited to
> numbers.

Items (1) and (2) were just examples. (There are many others, such as only accepting values from a fixed list, such as HTTP result codes.) This was really just meant to show my thought process, which was to realize that parsing a number with a limit implied by a datatype was just a specific case of a more general operation, and once you have that more general operation you can implement number-limited-by-a-datatype in terms of that.

For instance:

    data Result s r = Continue s | Fail | Succeed r | SucceedWithoutLast r
    scan2 :: s -> (s -> Maybe Char -> Result s r) -> Parser (Text, r)

Given this, you can implement the logic you described inside of the callback (the second parameter). The callback doesn't have to know the internal details of the parsing library (e.g., how it keeps track of the current position, etc.), so it's easy to write different callbacks to get various behaviors.

My point is just, if someone is going to enhance Attoparsec (for example), it may as well be enhanced in a more general way.

> On (1), a use case for a known number of digits is “take the year out
> of timestamp `20250720-105659`”.  A valid solution would be
> 
>     parseYear :: Word16
>     parseYear = read <$> P.count 4 digit

Yes that's one sort of case, but the one I was describing was actually where you want to fail unless you got exactly 3 digits, so "123 abc" would succeed, "12 abc" would fail, and "1234 abc" would also fail. That's more involved to handle.

This was just to illustrate that there are many different behaviors you might want, not only in terms of bounds but also in terms of when you want the parse to stop. Ideally you'd want to be able to implement all those behaviors with similar ease.

> If I had a separate limit checking parser combinator
> 
>    chkLimit :: (Integral a, Bounded a) -> a -> a -> Parser a -> Parser a
>    chkLimit min max p = … -- fail if p results in out of bounds value
> 
> and used it (with a parser `parseWord` that wraps around) to parse a
> value in the range 0..23, like so
> 
>    hour :: Parser Word8
>    hour = chkLimit 0 23 parseWord
> 
> then this parser would only catch overflow up to input `"255"`, but
> not beyond.
> 
> The issue is, that overflow occurs *before* the limit is checked,
> obscuring the error.

Right...so you wouldn't implement it that way. I wasn't advising that. I think I'm missing what you are getting at here.

> As *I* understand point (4), I don't think it is a valid argument: The
> idea of providing parsers for numbers is precisely to accommodate for
> that special case (base, digits) and its peculiarities (limits,
> overflow).  Of course one can build something more general using a
> parsing library, but the number-parsers are there and are (IMO)
> incorrect.

You didn't identify what point 4 was :). So I can't comment on that.

> Jeff Clites via Haskell-Cafe (2025-Jul-16, excerpt):
>> There's another approach, which might save a few computations. For
>> the example of `Word16`, whose `maxBound` is 65535, the following
>> would work:
>> 
>> 1) Read up to 6 digits (since the bound has 5 digits).
>> 2) If you got 6 digits, fail.
>> 3) If you got 4 digits, you are within bounds, you can can use
>>   whatever typical conversion routine
>> 4) If you got 5 digits:
>>    a) Convert the first 4 to a number as usual
>>    b) If that number is < 6553, accumulating the final digit will be
>>       within bounds
>>    c) If that number is > 6553, fail
>>    d) If that number is == 6553, do further logic involving the final
>>       digit.
>> 
>> This isn't prettier, but might save some checks. The main point here
>> is that counting the number of digits gets you most of the way, and
>> it's only for the final digit of a max-digits-sized number that you
>> need to go into more detailed logic. (You would need a typeclass to
>> cache the pre-analysis of maxBound, probably.)
> 
> Yes, I have thought about just limiting the digits instead.  The
> surprising thing is, there seems to be no gain: Both approaches
> require one comparison per digit read (the number-counting approach
> needs to count the digits and test whether the limit of this counter
> has been reached, the “modulo” approach needs to check whether `lim`
> has been reached).

But `divMod` is not a single-instruction operation.

Thinking about it more, I think you actually only need:

1) Read up to 6 digits (since the bound has 5 digits).
2) If you got 6 digits, fail.
3) Otherwise, convert via the current method. The result is correct unless it's negative. (This is because, you can only fail here if you get a 5-digit number and the final digit is too big, which means you overflow by at most 9, so you can't wrap back to positive, unless you really want to handle 3-bit numbers or something.)

Jeff Clites