[GHC] #12665: Make Read instances faster, and make them fail fast

Wed Oct 5 20:15:40 UTC 2016

#12665: Make Read instances faster, and make them fail fast
-------------------------------------+-------------------------------------
        Reporter:  dfeuer            |                Owner:  dfeuer
            Type:  feature request   |               Status:  new
        Priority:  high              |            Milestone:  8.2.1
       Component:  Core Libraries    |              Version:  8.0.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by dfeuer:

@@ -4,8 +4,13 @@
- pattern of lexing and then parsing. Lexing looks at the string and
- attempts to identify the next token, whatever it may be. But for `Read`,
- we don't need that. Thanks to the types, we have a very clear sense of
- what characters we expect to encounter. I've started to sketch out an
- improvement, which reads `Int` and `Word` around seven times as fast (9.5
- times as fast with parentheses and negation), and will fail immediately on
- something like `read (fix ('a':)) :: Int` rather than going into an
- infinite loop.
+ pattern of lexing and then parsing. The first problem with that is that
+ while we follow that model, we don't actually get the big benefit of that
+ model; the class methods don't mention or respect token boundaries, so
+ lexing first doesn't prevent backtracking. More to the point, perhaps,
+ lexing looks at the string and attempts to identify the next token,
+ whatever it may be. But for `Read`, we don't need that. Thanks to the
+ types, we have a very clear sense of what characters we expect to
+ encounter.
+
+ I've started to sketch out an improvement, which reads `Int` and `Word`
+ around seven times as fast (9.5 times as fast with parentheses and
+ negation), and will fail immediately on something like `read (fix ('a':))
+ :: Int` rather than going into an infinite loop.

New description:

 At present, the `Read` instances for standard types are generally written
 ''as though'' `Read` were a typical parser for a programming language.
 However, it most assuredly is not. Specifically, we currently follow the
 pattern of lexing and then parsing. The first problem with that is that
 while we follow that model, we don't actually get the big benefit of that
 model; the class methods don't mention or respect token boundaries, so
 lexing first doesn't prevent backtracking. More to the point, perhaps,
 lexing looks at the string and attempts to identify the next token,
 whatever it may be. But for `Read`, we don't need that. Thanks to the
 types, we have a very clear sense of what characters we expect to
 encounter.

 I've started to sketch out an improvement, which reads `Int` and `Word`
 around seven times as fast (9.5 times as fast with parentheses and
 negation), and will fail immediately on something like `read (fix ('a':))
 :: Int` rather than going into an infinite loop.

 To begin with, modify the definition of `paren` thus:

 {{{#!hs
 expectP' :: Char -> ReadPrec ()
 expectP' c = lift (expect' c)
 {-# INLINE expectP' #-}

 expect' :: Char -> ReadP ()
 expect' c = do
   ReadP.skipSpaces
   thing <- ReadP.get
   if thing == c
     then pure ()
     else ReadP.pfail
 {-# INLINE expect' #-}


 paren :: ReadPrec a -> ReadPrec a
 -- ^ @(paren p)@ parses \"(P0)\"
 --      where @p@ parses \"P0\" in precedence context zero
 paren p = do expectP' '('
              x <- reset p
              expectP' ')'
              return x
 }}}

 This allows fast failure when looking for parentheses, so we don't have to
 scan to the end of the first token (whatever it may be) before concluding
 that it is not `'('`.

 Now we can parse `Word` and `Int` very efficiently. I had to specialize
 earlier than I wanted to convince GHC that I don't want to convert through
 `Integer`. I'm not sure why the `fromIntegral` rule doesn't fire reliably
 around here. The code below (temporarily) uses the current definition for
 base 16, because that's a bit fussy; I'll rewrite it soon.

 {{{#!hs
 charDiff :: Char -> Char -> Word
 charDiff c1 c2 = fromIntegral (ord c1 - ord c2)
 {-# INLINE charDiff #-}

 readHexOct :: ReadP Word
 readHexOct = do
   _ <- ReadP.char '0'
   baseId <- lexBaseChar
   case baseId of
     Oct -> readBaseP 8
     Hex -> L.readHexP
 {-# INLINE readHexOct #-}

 data BaseId = Oct | Hex

 lexBaseChar :: ReadP BaseId
 lexBaseChar = do { c <- ReadP.get;
                    case c of
                      'o' -> pure Oct
                      'O' -> pure Oct
                      'x' -> pure Hex
                      'X' -> pure Hex
                      _ -> ReadP.pfail }

 readWord :: ReadP Word
 readWord = readNumber (readHexOct ReadP.<++ readBaseP 10)

 readInt :: ReadP Int
 readInt = fromIntegral <$> readWord

 readBaseP :: Integral a => Word -> ReadP a
 readBaseP !base = do
   c <- ReadP.get
   let diff = charDiff c '0'
   if diff < base
     then readBaseP' base (fromIntegral diff)
     else ReadP.pfail
 {-# INLINE readBaseP #-}

 readBaseP' :: Integral a => Word -> a -> ReadP a
 readBaseP' !base !acc0 = ReadP.look >>= go acc0
   where
     go !acc (c:cs) | diff < base = ReadP.get *> go (fromIntegral base *
 acc + fromIntegral diff) cs
       where diff = charDiff c '0'
     go !acc _ = pure acc
 {-# INLINE readBaseP' #-}

 readNumber :: Num a => ReadP.ReadP a -> ReadPrec a
 readNumber p = parens $ do
   cs <- lift skipSpaces *> look
   case cs of
     ('-': _) -> get *> lift (skipSpaces *> (negate <$> p))
     _ -> lift $ skipSpaces *> p
 {-# INLINE readNumber #-}
 }}}

 Other `Word` and `Int`-like types can be built on top of that foundation.
 I haven't yet attempted to deal with other instances, but I think there
 are probably a lot of opportunities for similar improvements.

 WARNING: I know very little about parsing, and less about `ReadPrec`. It's
 conceivable that I've made some semantic errors here, although I don't
 think I have.

--

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12665#comment:4>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler