[GHC] #14035: Weird performance results.

Thu Jul 27 20:41:03 UTC 2017

#14035: Weird performance results.
-------------------------------------+-------------------------------------
        Reporter:  danilo2           |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  high              |            Milestone:
       Component:  Compiler          |              Version:  8.0.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by danilo2:

Old description:

> Hi! I was recently testing performance of a critical code in a product we
> are shipping and I'm getting really weird results.
>
> **The code is compiled with `-XStrict` enabled globally. The full source
> code for this ticket is attached.**
> The code is a pseudo-parser implementation. It consumes any char in a
> loop and fails on empty input in the end.
>
> Everything was compiled with following options (and many variations):
> `"-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100
> -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-
> factor=1000 -flate-dmd-anal -fspecialise-aggressively"`.
>
> \\
>
> == Helpers
>
> Let's define 2 helpers:
> {{{#!hs
> (.) :: (b -> c) -> (a -> b) -> a -> c
> (.) f g = \x -> f (g x) ; {-# INLINE (.) #-}
>
> dotl :: (b -> c) -> (a -> b) -> a -> c
> dotl ~f ~g = \ ~x -> f (g x) ; {-# INLINE dotl #-}
> }}}
>
> So whenever we see `.` in code it is strict in all of its arguments.
>
> \\
>
> == Strict StateT performance improvement
>
> Let's consider following code:
> {{{#!hs
> import qualified Control.Monad.State.Strict as S
>
> newtype StateT s m a = StateT { fromStateT :: S.StateT s m a } deriving
> (Applicative, Functor, Monad, MonadTrans)
>
> class MonadState s m | m -> s where
>     get :: m s
>     put :: s -> m ()
>
> runStateT  :: forall s m a.              StateT s m a -> s -> m (a, s)
> evalStateT :: forall s m a. Functor m => StateT s m a -> s -> m a
> runStateT  m s = S.runStateT  (fromStateT m) s ; {-# INLINE runStateT
> #-}
> evalStateT m   = fmap fst . runStateT m        ; {-# INLINE evalStateT
> #-}
>
> instance Monad m => MonadState s (StateT s m) where
>     get = StateT S.get   ; {-# INLINE get #-}
>     put = StateT . S.put ; {-# INLINE put #-}
> }}}
>
> There are few non-obvious things to note here:
>   1. This wrapper performs about **15 TIMES better** than
> `Control.Monad.State.Strict.StateT` (in the provided examples) and if we
> create a loop in pure code imitating a parser, this `StateT` gets
> completely optimized away, while the `mtl`'s version does not.
>
>   2. If we replace the following functions with lazy composition, we get
> the same, high performance:
> {{{#!hs
> runStateT    = S.runStateT `dotl` fromStateT ; {-# INLINE runStateT  #-}
> evalStateT m = fmap fst `dotl` runStateT m   ; {-# INLINE evalStateT #-}
> }}}
>
>   3. However, if we slightly change the `evalStateT`, we've got the bad
> performance, equals to the `mtl`'s `StateT` version (15 times slower):
> {{{#!hs
> evalStateT m a = fmap fst $ runStateT m a ; {-# INLINE evalStateT #-}
> }}}
>
> It's a very strange result, especially that `evalStateT` is used only
> once in the code while running the tests.
>
> \\
>
> == Strict Either & EitherT
>
> The code contains a very minimalistic implementation of `Either` and
> `EitherT` in order to make their definitions and utils strict. These
> definitions are copy-pasted and simplified (removed unused code) from:
> https://hackage.haskell.org/package/base-4.10.0.0/docs/src/Data.Either.html
> https://hackage.haskell.org/package/either-4.4.1.1/docs/src/Control.Monad.Trans.Either.html
>
> \\
>
> == Strict Bool and tuple
>
> Moreover we define strict Bool `or` operation and 2-element tuple with
> strict arguments:
>
> {{{#!hs
> data T a b = T !a !b deriving (Generic, Show, Functor)
>
> data XBool = XTrue | XFalse deriving (Show, Generic)
>
> (|||) :: XBool -> XBool -> XBool
> (|||) !a !b = case a of
>     XTrue  -> a
>     XFalse -> b
> {-# INLINE (|||) #-}
> }}}
>
> \\
>
> == Parser implementation
>
> All the above declarations were simple helpers compiled with `-XStrict`,
> because available libraries do not provide them for us. This code is a
> "real" use case and shows the weird performance results.
>
> The parser implementation is simple:
> {{{#!hs
> newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T
> XBool a) } deriving (Functor)
>
> instance Monad m => Applicative (FailParser m) where
>     pure  = undefined
>     (<*>) = undefined
>
> instance Monad m => Monad (FailParser m) where
>     return a = FailParser $ pure $ (T XFalse a) ; {-# INLINE return #-}
>     FailParser ma >>= f = FailParser $ do
>         T !b  a  <- ma
>         T !b' a' <- fromFailParser $ f a
>         return $! T (b ||| b') a'
>     {-# INLINE (>>=) #-}
>     _ >> _ = undefined ; {-# INLINE (>>) #-}
>
> instance MonadTrans (FailParser) where
>     lift m = FailParser $! lift $ fmap (T XFalse) m ; {-# INLINE lift #-}
> }}}
>
> We use `undefined` for non-important functions. The parser is `EitherT`
> wrapper: Left happens when we failed parsing input, while Right
> otherwise. The `XBool` denotes if we made any progress (so after
> consuming a letter it is set to `XTrue`). There are some additional util
> functions, like `returnProgress` which behaves just like return, but also
> sets the `XBool` value to `XTrue`:
>
> {{{#!hs
> instance Monad m => MonadProgressParser (FailParser m) where
>     returnProgress a = FailParser $! pure (T XFalse a) ; {-# INLINE
> returnProgress #-} -- In correct code it should be XTrue as described
> below.
> }}}
>
> In the provided code, there is `XFalse` used instead of `XTrue` because
> of some interesting observations:
>
> 1. The `XBool` value is used ONLY in the expression `return $! T (b |||
> b') a'` above, so it does NOT affect the way the program logically
> executes it's body.
>
> 2. Both `b` and `b'` are strict and fully evaluated.
>
> 3. If both `b` and `b'` are `XFalse` (as in the provided code, they
> always are `XFalse`) we get a good performance. In order to test it, the
> above code contains `XFalse` instead of `XTrue`.
>
> 4. If we use the correct version of `returnProgress` as described just
> before point 1 above, we get 15 times slower performance (the same or
> very similar to the one when used altered `evalStateT` definition). We
> could try to explain it: maybe Haskell was able to optimize code if it
> discovered, there always were only `XFalse` values used and after the
> change there are both `XTrue` and `XFalse`, so it really has to run the
> `(|||)` operator. This way of thinking fails as fast as we check that
> changing `XFalse` to `XTrue` **everywhere** in the code give us again bad
> performance.
>
> 5. If we replace `return $! T (b ||| b') a'` with `return $! T b' a'` we
> get good performance, while replacing it with `return $! T b a'` give us
> bad performance. It does not make any sense, because both `b` and `b'`
> are strict and fully evaluated. Moreover, it is the only place in code
> where they are used.
>
> 6. However replacing `return $! T (b ||| b') a'` with `return $! T (b'
> ||| b) a'` does NOT change the performance (we are getting the good one).
>

> == Final notes
>
> We've been talking with some people - both in the company I'm working in
> as well as on IRC and we do not see any reason why this code behaves in
> this way and why it is so sensitive to the changes. In fact we started to
> be worried a lot about how we can use Haskell for high-performance parts
> at all, if it is not obvious if a very simple change do not affect
> performance a lot. A good example is changing `evalStateT m = fmap fst .
> runStateT m` to `evalStateT m a = fmap fst $ runStateT m a`, which gives
> 15 times slowdown. This situation makes the source code both very fragile
> to any changes as well as makes it unmaintainable. Tracking performance
> in a very small program (like the attached one) is possible, while
> tracking it in bigger one, taking in consideration the described
> problems, make it almost impossible. I'm writing this because I'm worried
> about where these problems originate from and I would really like to
> solve them / know why they appear, and be sure we can continue to use
> Haskell for our high-performance demanding applications.

New description:

 Hi! I was recently testing performance of a critical code in a product we
 are shipping and I'm getting really weird results.

 **The code is compiled with `-XStrict` enabled globally. The full source
 code for this ticket is attached, while the exposed code below uses `...`
 to hide some non-important implementations.**

 To get desired results, we use following GHC flags: `-O2 -funfolding-use-
 threshold=10000`.

 Let's consider the following program. It is just a pseudo-parser
 implementation. It consumes 'a' chars in a loop and fails on empty input
 in the end:

 {{{#!hs

 -- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 -- | WARNING: -XStrict enabled in this file !!!
 -- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

 module Main where

 imports ... (full source attached to this ticket)

 ------------------------
 -- === Primitives === --
 ------------------------

 -- === Strict Either === --

 data    Either  e   a = Left e | Right a deriving (Eq, Generic, Ord, Read,
 Show, Functor)
 newtype EitherT e m a = EitherT { runEitherT :: m (Either e a) }

 instance Monad m => Functor (EitherT e m) where ...
 instance Monad m => Applicative (EitherT e m) where ...
 instance Monad m => Monad (EitherT e m) where ...

 -- === Strict Bool === --

 data XBool = XTrue | XFalse deriving (Show, Generic)

 (|||) :: XBool -> XBool -> XBool
 (|||) !a !b = case a of
     XTrue  -> a
     XFalse -> b
 {-# INLINE (|||) #-}

 -- === Strict Tuple === --

 data T a b = T !a !b deriving (Generic, Show, Functor)

 ------------------------
 -- === FailParser === --
 ------------------------

 -- === Definition === --
 -- | It is just like EitherT, but also contains progress indicator - a
 field of type XBool
 --   which tells us if we've already parsed a char or not yet. In this
 snippet code however,
 --   it does not do anything valuable - it just stores the value.

 newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T
 XBool a) } deriving (Functor)

 instance Monad m => Applicative (FailParser m) where
     pure  = undefined
     (<*>) = undefined

 instance Monad m => Monad (FailParser m) where
     return a = FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}
     FailParser ma >>= f = FailParser $ do
         T !b  !a  <- ma
         T !b' !a' <- fromFailParser $ f a
         return $ T (b ||| b') a'
     {-# INLINE (>>=) #-}
     _ >> _ = undefined ; {-# INLINE (>>) #-}

 -- === Running === --

 failParser :: m (Either () (T XBool a)) -> FailParser m a
 failParser a = FailParser $ EitherT a ; {-# INLINE failParser #-}

 runFailParser :: forall m a. FailParser m a -> m (Either () (T XBool a))
 runFailParser f = runEitherT $ fromFailParser f ; {-# INLINE runFailParser
 #-}

 -- === MonadFailedParser === --
 -- | Behaves just like "left" - lifts until it hits MonadFailedParser

 class Monad m => MonadFailedParser m where
     failed :: m a

 instance {-# OVERLAPPABLE #-} (MonadFailedParser m, MonadTrans t, Monad (t
 m))
       => MonadFailedParser (t m) where
     failed = lift failed ; {-# INLINE failed #-}

 instance Monad m => MonadFailedParser (FailParser m) where
     failed = failParser $ return $ Left () ; {-# INLINE failed #-}

 -----------------------
 -- === Main loop === --
 -----------------------

 parserLoop :: StateT Text (FailParser Identity) Bool
 parserLoop = parserStep >> parserLoop

 parserStep :: StateT Text (FailParser Identity) Char
 parserStep = get >>= \s -> case Text.uncons s of
     Just (!t, !s') -> if t == 'a' then put s' >> return t else failed
     Nothing        -> failed
 {-# INLINE parserStep #-}

 -- === Criterion === --

 instance NFData XBool
 instance (NFData l, NFData r) => NFData (Either l r)
 instance (NFData a, NFData b) => NFData (T a b)

 genText :: Int -> Text
 genText i = fromString $ replicate i 'a' ; {-# INLINE genText #-}

 a_parsing_main :: IO ()
 a_parsing_main = do
     defaultMain
         [ env (return $ genText $ 10^6) $ bench "a*" . nf (runIdentity .
 runFailParser . evalStateT parserLoop)
         ]

 main = a_parsing_main

 }}}

 The most important part is the `bind` implementation of `FailParser`:

 {{{#!hs
 FailParser ma >>= f = FailParser $ do
     T b  a  <- ma
     T b' a' <- fromFailParser $ f a
     return $ T (b ||| b') a'
 }}}

 There are several performance related observations and problems:

 1. **INFO:** Everything is compiled with `-XStrict` and every field in
 this code is fully evaluated, in particular `b` and `b'` are fully
 evaluated, strict values of type `XBool`.

 2. **INFO:** Neither `b` nor `b'` are used anywhere else in the code. They
 are just fields in `FailParser` which should be used to store information
 if we did consume a letter or we did not.

 3. **PROBLEM:** When provided with `10^6` characters this code works in
 1ms. If we replace `(b ||| b')` with `(b' ||| b)` or with `(b')` the time
 do NOT change. However, if we replace it with `(b)`, we've got **15
 times** slowdown. Moreover, the resulting core is changed drastically in
 some places.

 4. **PROBLEM:** Another interesting observation is that the value of
 `XBool` is created only in one place in the code, namely in:
 `FailParser`'s `Monad` implementation, in `return` function: `return a =
 FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}`. We never
 change the XFalse, so this is the only value that could appear in this
 code. If we change it to `XTrue` in this implementation however, we again
 get **15 times** slowdown.

 5. **INFO:** The order of `case` expressions in definition of `(|||)` or
 the order of constructor defintions of any datatype does not affect the
 above results.

--

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14035#comment:4>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler