[GHC] #14035: Weird performance results.

Thu Jul 27 00:49:40 UTC 2017

#14035: Weird performance results.
-------------------------------------+-------------------------------------
        Reporter:  danilo2           |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  high              |            Milestone:
       Component:  Compiler          |              Version:  8.0.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by danilo2:

Old description:

> Hi! I was recently testing performance of a critical code in a product we
> are shipping and I'm getting really weird results.
>
> **The code is compiled with `-XStrict` enabled globally. The full source
> code for this ticket is attached.**
> The code is a pseudo-parser implementation. It consumes any char in a
> loop and fails on empty input in the end.
>
> Everything was compiled with following options (and many variations):
> `"-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100
> -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-
> factor=1000 -flate-dmd-anal -fspecialise-aggressively"`.
>
> \\
>
> == Helpers
>
> Let's define 2 helpers:
> {{{#!hs
> (.) :: (b -> c) -> (a -> b) -> a -> c
> (.) f g = \x -> f (g x) ; {-# INLINE (.) #-}
>
> dotl :: (b -> c) -> (a -> b) -> a -> c
> dotl ~f ~g = \ ~x -> f (g x) ; {-# INLINE dotl #-}
> }}}
>
> So whenever we see `.` in code it is strict in all of its arguments.
>
> \\
>
> == Strict StateT performance improvement
>
> Let's consider following code:
> {{{#!hs
> import qualified Control.Monad.State.Strict as S
>
> newtype StateT s m a = StateT { fromStateT :: S.StateT s m a } deriving
> (Applicative, Functor, Monad, MonadTrans)
>
> class MonadState s m | m -> s where
>     get :: m s
>     put :: s -> m ()
>
> runStateT  :: forall s m a.              StateT s m a -> s -> m (a, s)
> evalStateT :: forall s m a. Functor m => StateT s m a -> s -> m a
> runStateT  m s = S.runStateT  (fromStateT m) s ; {-# INLINE runStateT
> #-}
> evalStateT m   = fmap fst . runStateT m        ; {-# INLINE evalStateT
> #-}
>
> instance Monad m => MonadState s (StateT s m) where
>     get = StateT S.get   ; {-# INLINE get #-}
>     put = StateT . S.put ; {-# INLINE put #-}
> }}}
>
> There are few non-obvious things to note here:
>   1. This wrapper performs about **15 TIMES better** than
> `Control.Monad.State.Strict.StateT` (in the provided examples) and if we
> create a loop in pure code imitating a parser, this `StateT` gets
> completely optimized away, while the `mtl`'s version does not.
>
>   2. If we replace the following functions with lazy composition, we get
> the same, high performance:
> {{{#!hs
> runStateT    = S.runStateT `dotl` fromStateT ; {-# INLINE runStateT  #-}
> evalStateT m = fmap fst `dotl` runStateT m   ; {-# INLINE evalStateT #-}
> }}}
>
>   3. However, if we slightly change the `evalStateT`, we've got the bad
> performance, equals to the `mtl`'s `StateT` version (15 times slower):
> {{{#!hs
> evalStateT m a = fmap fst $ runStateT m a ; {-# INLINE evalStateT #-}
> }}}
>
> It's a very strange result, especially that `evalStateT` is used only
> once in the code while running the tests.
>
> \\
>
> == Strict Either & EitherT
>
> The code contains a very minimalistic implementation of `Either` and
> `EitherT` in order to make their definitions and utils strict. These
> definitions are copy-pasted and simplified (removed unused code) from:
> https://hackage.haskell.org/package/base-4.10.0.0/docs/src/Data.Either.html
> https://hackage.haskell.org/package/either-4.4.1.1/docs/src/Control.Monad.Trans.Either.html
>
> \\
>
> == Strict Bool and tuple
>
> Moreover we define strict Bool `or` operation and 2-element tuple with
> strict arguments:
>
> {{{#!hs
> data T a b = T !a !b deriving (Generic, Show, Functor)
>
> data XBool = XTrue | XFalse deriving (Show, Generic)
>
> (|||) :: XBool -> XBool -> XBool
> (|||) !a !b = case a of
>     XTrue  -> a
>     XFalse -> b
> {-# INLINE (|||) #-}
> }}}
>
> \\
>
> == Parser implementation
>
> All the above declarations were simple helpers compiled with `-XStrict`,
> because available libraries do not provide them for us. This code is a
> "real" use case and shows the weird performance results.
>
> The parser implementation is simple:
> {{{#!hs
> newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T
> XBool a) } deriving (Functor)
>
> instance Monad m => Applicative (FailParser m) where
>     pure  = undefined
>     (<*>) = undefined
>
> instance Monad m => Monad (FailParser m) where
>     return a = FailParser $ pure $ (T XFalse a) ; {-# INLINE return #-}
>     FailParser ma >>= f = FailParser $ do
>         T !b  a  <- ma
>         T !b' a' <- fromFailParser $ f a
>         return $! T (b ||| b') a'
>     {-# INLINE (>>=) #-}
>     _ >> _ = undefined ; {-# INLINE (>>) #-}
>
> instance MonadTrans (FailParser) where
>     lift m = FailParser $! lift $ fmap (T XFalse) m ; {-# INLINE lift #-}
> }}}
>
> We use `undefined` for non-important functions. The parser is `EitherT`
> wrapper: Left happens when we failed parsing input, while Right
> otherwise. The `XBool` denotes if we made any progress (so after
> consuming a letter it is set to `XTrue`). There are some additional util
> functions, like `returnProgress` which behaves just like return, but also
> sets the `XBool` value to `XTrue`:
>
> {{{#!hs
> instance Monad m => MonadProgressParser (FailParser m) where
>     returnProgress a = FailParser $! pure (T XFalse a) ; {-# INLINE
> returnProgress #-} -- In correct code it should be XTrue as described
> below.
> }}}
>
> In the provided code, there is `XFalse` used instead of `XTrue` because
> of some interesting observations:
>
> 1. The `XBool` value is used ONLY in the expression `return $! T (b |||
> b') a'` above, so it does NOT affect the way the program logically
> executes it's body.
>
> 2. Both `b` and `b'` are strict and fully evaluated.
>
> 3. If both `b` and `b'` are `XFalse` (as in the provided code, they
> always are `XFalse`) we get a good performance. In order to test it, the
> above code contains `XFalse` instead of `XTrue`.
>
> 4. If we use the correct version of `returnProgress` as described just
> before point 1 above, we get 15 times slower performance (the same or
> very similar to the one when used altered `evalStateT` definition). We
> could try to explain it: maybe Haskell was able to optimize code if it
> discovered, there always were only `XFalse` values used and after the
> change there are both `XTrue` and `XFalse`, so it really has to run the
> `(|||)` operator. This way of thinking fails as fast as we check that
> changing `XFalse` to `XTrue` **everywhere** in the code give us again bad
> performance.
>
> 5. If we replace `return $! T (b ||| b') a'` with `return $! T b' a'` we
> get good performance, while replacing it with `return $! T b a'` give us
> bad performance. It does not make any sense, because both `b` and `b'`
> are strict and fully evaluated. Moreover, it is the only place in code
> where they are used.
>
> 6. However replacing `return $! T (b ||| b') a'` with `return $! T (b'
> ||| b) a'` does NOT change the performance (we are getting the good one).
>

> == Final notes
>
> We've been talking with some people - both in the company I'm working in
> as well as on IRC and we do not see any reason why this code behaves in
> this way and why it is so sensitive to the changes. In fact we started to
> be worried a lot about how we can use Haskell for high-performance parts
> at all if it is not obvious if a very simple changes do not affect
> performance so much, like changing `evalStateT m = fmap fst . runStateT
> m` to `evalStateT m a = fmap fst $ runStateT m a`, which gives 15 times
> slowdown. It makes the code both very fragile to any changes as well as
> makes it unmaintainable. Tracking performance in a very small program
> (like the attached one) is possible, while tracking it in bigger one,
> taking in considerating the described problems, make it almost
> impossible.  I'm writing this because I'm worried about where these
> problems originate from and I would really like to solve them / know why
> they appear and be sure we can continue to use Haskell for our high-
> performance demanding applications.

New description:

 Hi! I was recently testing performance of a critical code in a product we
 are shipping and I'm getting really weird results.

 **The code is compiled with `-XStrict` enabled globally. The full source
 code for this ticket is attached.**
 The code is a pseudo-parser implementation. It consumes any char in a loop
 and fails on empty input in the end.

 Everything was compiled with following options (and many variations):
 `"-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100
 -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-
 factor=1000 -flate-dmd-anal -fspecialise-aggressively"`.

 \\

 == Helpers

 Let's define 2 helpers:
 {{{#!hs
 (.) :: (b -> c) -> (a -> b) -> a -> c
 (.) f g = \x -> f (g x) ; {-# INLINE (.) #-}

 dotl :: (b -> c) -> (a -> b) -> a -> c
 dotl ~f ~g = \ ~x -> f (g x) ; {-# INLINE dotl #-}
 }}}

 So whenever we see `.` in code it is strict in all of its arguments.

 \\

 == Strict StateT performance improvement

 Let's consider following code:
 {{{#!hs
 import qualified Control.Monad.State.Strict as S

 newtype StateT s m a = StateT { fromStateT :: S.StateT s m a } deriving
 (Applicative, Functor, Monad, MonadTrans)

 class MonadState s m | m -> s where
     get :: m s
     put :: s -> m ()

 runStateT  :: forall s m a.              StateT s m a -> s -> m (a, s)
 evalStateT :: forall s m a. Functor m => StateT s m a -> s -> m a
 runStateT  m s = S.runStateT  (fromStateT m) s ; {-# INLINE runStateT  #-}
 evalStateT m   = fmap fst . runStateT m        ; {-# INLINE evalStateT #-}

 instance Monad m => MonadState s (StateT s m) where
     get = StateT S.get   ; {-# INLINE get #-}
     put = StateT . S.put ; {-# INLINE put #-}
 }}}

 There are few non-obvious things to note here:
   1. This wrapper performs about **15 TIMES better** than
 `Control.Monad.State.Strict.StateT` (in the provided examples) and if we
 create a loop in pure code imitating a parser, this `StateT` gets
 completely optimized away, while the `mtl`'s version does not.

   2. If we replace the following functions with lazy composition, we get
 the same, high performance:
 {{{#!hs
 runStateT    = S.runStateT `dotl` fromStateT ; {-# INLINE runStateT  #-}
 evalStateT m = fmap fst `dotl` runStateT m   ; {-# INLINE evalStateT #-}
 }}}

   3. However, if we slightly change the `evalStateT`, we've got the bad
 performance, equals to the `mtl`'s `StateT` version (15 times slower):
 {{{#!hs
 evalStateT m a = fmap fst $ runStateT m a ; {-# INLINE evalStateT #-}
 }}}

 It's a very strange result, especially that `evalStateT` is used only once
 in the code while running the tests.

 \\

 == Strict Either & EitherT

 The code contains a very minimalistic implementation of `Either` and
 `EitherT` in order to make their definitions and utils strict. These
 definitions are copy-pasted and simplified (removed unused code) from:
 https://hackage.haskell.org/package/base-4.10.0.0/docs/src/Data.Either.html
 https://hackage.haskell.org/package/either-4.4.1.1/docs/src/Control.Monad.Trans.Either.html

 \\

 == Strict Bool and tuple

 Moreover we define strict Bool `or` operation and 2-element tuple with
 strict arguments:

 {{{#!hs
 data T a b = T !a !b deriving (Generic, Show, Functor)

 data XBool = XTrue | XFalse deriving (Show, Generic)

 (|||) :: XBool -> XBool -> XBool
 (|||) !a !b = case a of
     XTrue  -> a
     XFalse -> b
 {-# INLINE (|||) #-}
 }}}

 \\

 == Parser implementation

 All the above declarations were simple helpers compiled with `-XStrict`,
 because available libraries do not provide them for us. This code is a
 "real" use case and shows the weird performance results.

 The parser implementation is simple:
 {{{#!hs
 newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T
 XBool a) } deriving (Functor)

 instance Monad m => Applicative (FailParser m) where
     pure  = undefined
     (<*>) = undefined

 instance Monad m => Monad (FailParser m) where
     return a = FailParser $ pure $ (T XFalse a) ; {-# INLINE return #-}
     FailParser ma >>= f = FailParser $ do
         T !b  a  <- ma
         T !b' a' <- fromFailParser $ f a
         return $! T (b ||| b') a'
     {-# INLINE (>>=) #-}
     _ >> _ = undefined ; {-# INLINE (>>) #-}

 instance MonadTrans (FailParser) where
     lift m = FailParser $! lift $ fmap (T XFalse) m ; {-# INLINE lift #-}
 }}}

 We use `undefined` for non-important functions. The parser is `EitherT`
 wrapper: Left happens when we failed parsing input, while Right otherwise.
 The `XBool` denotes if we made any progress (so after consuming a letter
 it is set to `XTrue`). There are some additional util functions, like
 `returnProgress` which behaves just like return, but also sets the `XBool`
 value to `XTrue`:

 {{{#!hs
 instance Monad m => MonadProgressParser (FailParser m) where
     returnProgress a = FailParser $! pure (T XFalse a) ; {-# INLINE
 returnProgress #-} -- In correct code it should be XTrue as described
 below.
 }}}

 In the provided code, there is `XFalse` used instead of `XTrue` because of
 some interesting observations:

 1. The `XBool` value is used ONLY in the expression `return $! T (b |||
 b') a'` above, so it does NOT affect the way the program logically
 executes it's body.

 2. Both `b` and `b'` are strict and fully evaluated.

 3. If both `b` and `b'` are `XFalse` (as in the provided code, they always
 are `XFalse`) we get a good performance. In order to test it, the above
 code contains `XFalse` instead of `XTrue`.

 4. If we use the correct version of `returnProgress` as described just
 before point 1 above, we get 15 times slower performance (the same or very
 similar to the one when used altered `evalStateT` definition). We could
 try to explain it: maybe Haskell was able to optimize code if it
 discovered, there always were only `XFalse` values used and after the
 change there are both `XTrue` and `XFalse`, so it really has to run the
 `(|||)` operator. This way of thinking fails as fast as we check that
 changing `XFalse` to `XTrue` **everywhere** in the code give us again bad
 performance.

 5. If we replace `return $! T (b ||| b') a'` with `return $! T b' a'` we
 get good performance, while replacing it with `return $! T b a'` give us
 bad performance. It does not make any sense, because both `b` and `b'` are
 strict and fully evaluated. Moreover, it is the only place in code where
 they are used.

 6. However replacing `return $! T (b ||| b') a'` with `return $! T (b' |||
 b) a'` does NOT change the performance (we are getting the good one).

 == Final notes

 We've been talking with some people - both in the company I'm working in
 as well as on IRC and we do not see any reason why this code behaves in
 this way and why it is so sensitive to the changes. In fact we started to
 be worried a lot about how we can use Haskell for high-performance parts
 at all, if it is not obvious if a very simple change do not affect
 performance a lot. A good example is changing `evalStateT m = fmap fst .
 runStateT m` to `evalStateT m a = fmap fst $ runStateT m a`, which gives
 15 times slowdown. This situation makes the source code both very fragile
 to any changes as well as makes it unmaintainable. Tracking performance in
 a very small program (like the attached one) is possible, while tracking
 it in bigger one, taking in consideration the described problems, make it
 almost impossible. I'm writing this because I'm worried about where these
 problems originate from and I would really like to solve them / know why
 they appear, and be sure we can continue to use Haskell for our high-
 performance demanding applications.

--

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14035#comment:3>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler