<div dir="ltr"><div class="gmail_default" style="font-family:times new roman,serif;font-size:large">Hi Stefan</div><div class="gmail_default" style="font-family:times new roman,serif;font-size:large"><br></div><div class="gmail_default"><font face="times new roman, serif" style="font-size:large">Isn't Victor's suggestion of using </font><font face="times new roman, serif" size="4">Data.ByteString.Char8.readWord8 a workaround for the issue with read?</font></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div class="gmail_default"><br>λ>  import Data.ByteString.Char8<br>λ> readWord8 (pack "298")<br>Nothing<br><br><font face="times new roman, serif" size="4"></font></div></blockquote><div class="gmail_default" style="font-family:times new roman,serif;font-size:large"><br></div><div class="gmail_default" style="font-family:times new roman,serif;font-size:large">As a relatively naive Haskell user I agree that </div><div class="gmail_default" style="font-family:times new roman,serif;font-size:large"><br></div><div class="gmail_default"><blockquote style="font-family:"times new roman",serif;font-size:large;margin:0px 0px 0px 40px;border:none;padding:0px"><div class="gmail_default" style="font-family:times new roman,serif;font-size:large">> read "298" :: Word8<br>42<br>it :: Word8</div><div class="gmail_default" style="font-family:times new roman,serif;font-size:large"><br></div></blockquote><font face="times new roman, serif" size="4">seems like a bug. </font></div><div class="gmail_default"><font face="times new roman, serif" size="4"><br></font></div><div class="gmail_default"><font face="times new roman, serif" size="4">Before implementing a fix / enhancement it is probably worth filing a bug to get confirmation that it is. Proposing a change to the behavior of the base library is, IMHO based on experience, unlikely to be accepted.</font></div><div class="gmail_default"><br></div><div class="gmail_default"><font face="times new roman, serif" size="4">It reminds me of </font></div><div class="gmail_default"><font face="times new roman, serif" size="4"><br></font></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div class="gmail_default"><font face="times new roman, serif" size="4">λ> maxBound :: Int<br>9223372036854775807<br>λ> maxBound + 1 :: Int<br>-9223372036854775808<br><br></font><font face="times new roman, serif" size="4"></font></div></blockquote><font face="times new roman, serif" size="4"><span class="gmail_default" style="font-family:"times new roman",serif;font-size:large">My understanding of the explanation for that is that checking for overflow is too expensive from a performance point of view.</span></font><div><font face="times new roman, serif" size="4"><br></font></div><div><div class="gmail_default"><font face="times new roman, serif" size="4">Cheers</font></div><div class="gmail_default"><font face="times new roman, serif" size="4">George</font></div><div class="gmail_default"><br></div><div class="gmail_default"><font face="times new roman, serif" size="4"><br></font></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jul 13, 2025 at 8:34 AM Stefan Klinger <<a href="mailto:haskell@stefan-klinger.de" target="_blank">haskell@stefan-klinger.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello everybody =)<br>

<br>

I've been bugged by the silent overflowing of integer parsers as<br>

provided by `base`, `attoparsec`, and others.  I'd go so far as to<br>

call it a bug when the user types `298` and the parser says `Right<br>

42`.<br>

<br>

Unfortunately, all parsing libraries I've looked at get this wrong.  A<br>

[solution][4] is proposed below.<br>

<br>

I'm asking for feedback on how to continue from here.<br>

<br>

Kind regards<br>

Stefan<br>

<br>

<br>

<br>

<br>

The following examples can be reproduced with<br>

<br>

    $ git clone '<a href="https://github.com/s5k6/robust-int.git" rel="noreferrer" target="_blank">https://github.com/s5k6/robust-int.git</a>'<br>

    $ cd robust-int<br>

    $ cabal repl robust-int:demo<br>

<br>

<br>

Situation<br>

---------<br>

<br>

This is the current situation with [read][1] from base:<br>

<br>

    > read "298" :: Word8<br>

    42<br>

<br>

And with [decimal][2] from attoparsec:<br>

<br>

    > A.parseOnly (A.decimal :: A.Parser Word8) $ pack "298"<br>

    Right 42<br>

<br>

And the solution [usually suggested][5] for Parsec (which relies on<br>

`read`):<br>

<br>

    parsecWord8 :: P.Parser Word8<br>

    parsecWord8 = read <$> P.many1 P.digit<br>

<br>

    > P.runParser parsecWord8 () "" "298"<br>

    Right 42<br>

<br>

Even worse, the latter would rather exhaust memory than realise its<br>

input is way out of bounds:<br>

<br>

    > P.runParser parsecWord8 () "" $ repeat '1'<br>

    ⊥<br>

<br>

Also, some 3rd-party libraries get this wrong, e.g.,<br>

[parsec3-numbers][6]:<br>

<br>

    > P.runParser (PN.decimal :: P.Parser Word8)  () "" "298"<br>

    Right 42<br>

<br>

And [megaparsec][8], which is at least nice enough to warn about this<br>

in its documentation:<br>

<br>

    > M.parseMaybe (M.decimal :: M.Parsec () String Word8) "298"<br>

    Just 42<br>

<br>

I find this misses the point of a parser validating its input.<br>

<br>

<br>

Solution<br>

--------<br>

<br>

It is [possible to implement][7] parsers for bounded integral types<br>

which verify the bounds of the parsed value *while* parsing, and even<br>

doing this without the use of a “bigger” type.<br>

<br>

The idea is as follows:<br>

<br>

As usual, we parse digits left to right, and collect the resulting<br>

value in an accumulator `acc`, i.e., for each new digit `d`, the<br>

accumulator is updated to<br>

<br>

    base * acc + d<br>

<br>

Nothing new up to here.  However, before we start parsing, calculate<br>

<br>

    (lim, m) = upper_bound `divMod` base<br>

<br>

and before updating the accumulator with another digit `d`, verify<br>

that<br>

<br>

    acc < lim || (acc == lim && d <= m)<br>

<br>

which exactly guarantees that the accumulator will not overflow.  The<br>

reason why this works is is easily seen by doing the example for<br>

`Word16` in base 10:<br>

<br>

    > (maxBound :: Word16) `divMod` 10<br>

    (6553,5)<br>

    > 10 * fst it + snd it<br>

    65535<br>

<br>

Complexity: This adds a modulo operation and two comparisons for every<br>

literal being parsed, plus one comparison for every digit.  In order<br>

to limit memory consumption, some comparison has to take place during<br>

parsing, at least for limiting the number of digits consumed.  In<br>

total, this does not look too expensive.<br>

<br>

I have [implemented][4] this idea for `parsec` and `attoparsec` to<br>

demonstrate the idea (only for decimal values).<br>

<br>

<br>

What now?<br>

---------<br>

<br>

Obviously, this *should not be a another library*, trying to fix some<br>

aspect of some other libraries.  My code is rather intended for<br>

demonstration.  I'd prefer to help this idea migrate to the libraries<br>

(`base`, `parsec`, `attoparsec`, …), where the correct parsers should<br>

be.<br>

<br>

Unfortunately, I got a bit lost when trying to track down the code of<br>

`read` in the `base` package.  And I think I may have overengineered<br>

my solution for attoparsec to accommodate different stream types.<br>

<br>

Also, I get the impression that Haskell *library* code seems to be<br>

written with a different mindset, a deeper understanding of GHC than<br>

mine, i.e., more tailored to what the compiler will *actually do* when<br>

using the code, trying not to spoil opportunities for optimisation.<br>

And I'm not sure I'm up to that task.<br>

<br>

So I'm asking for feedback on the proposed algorithm, my<br>

implementation, and hints on where and how to get this into<br>

established libraries.<br>

<br>

<br>

Build instructions<br>

==================<br>

<br>

    $ cabal build<br>

    $ cabal run demo<br>

<br>

    $ cabal test<br>

    $ cabal haddock<br>

<br>

<br>

[1]: <a href="https://hackage.haskell.org/package/base-4.21.0.0/docs/Prelude.html#v:read" rel="noreferrer" target="_blank">https://hackage.haskell.org/package/base-4.21.0.0/docs/Prelude.html#v:read</a><br>

[2]: <a href="https://hackage.haskell.org/package/attoparsec-0.14.4/docs/Data-Attoparsec-ByteString-Char8.html#v:decimal" rel="noreferrer" target="_blank">https://hackage.haskell.org/package/attoparsec-0.14.4/docs/Data-Attoparsec-ByteString-Char8.html#v:decimal</a><br>

[3]: <a href="https://hackage.haskell.org/package/parsec-3.1.18.0/docs/Text-Parsec-Token.html#v:decimal" rel="noreferrer" target="_blank">https://hackage.haskell.org/package/parsec-3.1.18.0/docs/Text-Parsec-Token.html#v:decimal</a><br>

[4]: <a href="https://github.com/s5k6/robust-int" rel="noreferrer" target="_blank">https://github.com/s5k6/robust-int</a><br>

[5]: <a href="https://stackoverflow.com/questions/24171005/how-to-parse-an-integer-with-parsec" rel="noreferrer" target="_blank">https://stackoverflow.com/questions/24171005/how-to-parse-an-integer-with-parsec</a><br>

[6]: <a href="https://hackage.haskell.org/package/parsec3-numbers" rel="noreferrer" target="_blank">https://hackage.haskell.org/package/parsec3-numbers</a><br>

[7]: <a href="https://github.com/s5k6/robust-int/blob/master/src/Data/RobustInt/Parsec.hs#L32-L52" rel="noreferrer" target="_blank">https://github.com/s5k6/robust-int/blob/master/src/Data/RobustInt/Parsec.hs#L32-L52</a><br>

[8]: <a href="https://hackage.haskell.org/package/megaparsec-9.7.0/docs/Text-Megaparsec-Char-Lexer.html#v:decimal" rel="noreferrer" target="_blank">https://hackage.haskell.org/package/megaparsec-9.7.0/docs/Text-Megaparsec-Char-Lexer.html#v:decimal</a><br>

<br>

<br>

--<br>

Stefan Klinger, Ph.D. -- computer scientist              o/X<br>

<a href="http://stefan-klinger.de" rel="noreferrer" target="_blank">http://stefan-klinger.de</a>                                 /\/<br>

<a href="https://github.com/s5k6" rel="noreferrer" target="_blank">https://github.com/s5k6</a>                                    \<br>

I prefer receiving plain text messages, not exceeding 32kB.<br>

_______________________________________________<br>

Haskell-Cafe mailing list<br>

To (un)subscribe, modify options or view archives go to:<br>

<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" rel="noreferrer" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>

Only members subscribed via the mailman list are allowed to post.</blockquote></div>