#14035: Weird performance results.
Thu Jul 27 00:35:18 UTC 2017
Description changed by danilo2:
New description:
Hi! I was recently testing the performance of critical code in a product we
are shipping, and I'm getting really weird results. I've described the first
"part" of it here: https://ghc.haskell.org/trac/ghc/ticket/14013 (and I
will be referring to that issue later); however, there are some other
problems that do not seem directly connected to ticket 14013.
**The code is compiled with `-XStrict` enabled globally. The full source
code for this ticket is attached.**
The code is a pseudo-parser implementation. It consumes any char in a loop
and fails on empty input at the end.
Everything was compiled with the following options (and many variations):
`"-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100
-funfolding-use-threshold=10000 -fexpose-all-unfoldings
-fsimpl-tick-factor=1000 -flate-dmd-anal -fspecialise-aggressively"`.
== Helpers
Let's define 2 helpers:
(.) :: (b -> c) -> (a -> b) -> a -> c
(.) f g = \x -> f (g x) ; {-# INLINE (.) #-}
dotl :: (b -> c) -> (a -> b) -> a -> c
dotl ~f ~g = \ ~x -> f (g x) ; {-# INLINE dotl #-}
So whenever we see `.` in code it is strict in all of its arguments.
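To make the difference between the two helpers concrete, here is a minimal sketch, assuming `-XStrict` is enabled through a pragma. The strict composition is renamed `dots` (a hypothetical name, chosen only to avoid clashing with the Prelude's `.`), and `throwsWhenForced` is a helper invented for this illustration.

```haskell
{-# LANGUAGE Strict #-}

import Control.Exception (SomeException, evaluate, try)

-- Strict composition: under Strict, f, g and x all get implicit bang patterns.
-- (Named `dots` here only to avoid clashing with the Prelude's (.).)
dots :: (b -> c) -> (a -> b) -> a -> c
dots f g = \x -> f (g x)
{-# INLINE dots #-}

-- Lazy composition: the ~ patterns opt out of the implicit bangs.
dotl :: (b -> c) -> (a -> b) -> a -> c
dotl ~f ~g = \ ~x -> f (g x)
{-# INLINE dotl #-}

-- Hypothetical helper: does forcing the argument to WHNF throw?
-- The ~ keeps Strict from forcing the argument before `try` is in place.
throwsWhenForced :: a -> IO Bool
throwsWhenForced ~v = do
  r <- try (evaluate (v `seq` ())) :: IO (Either SomeException ())
  pure (either (const True) (const False) r)
```

Here `throwsWhenForced (dots undefined id)` should report `True`, because merely building the strict composition forces `f`, while `throwsWhenForced (dotl undefined undefined)` should report `False`, because the lazy version only builds a lambda.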
== Strict StateT performance improvement
Let's consider the following code:
import qualified Control.Monad.State.Strict as S
newtype StateT s m a = StateT { fromStateT :: S.StateT s m a }
    deriving (Applicative, Functor, Monad, MonadTrans)
class MonadState s m | m -> s where
    get :: m s
    put :: s -> m ()
runStateT :: forall s m a. StateT s m a -> s -> m (a, s)
evalStateT :: forall s m a. Functor m => StateT s m a -> s -> m a
runStateT m s = S.runStateT (fromStateT m) s ; {-# INLINE runStateT #-}
evalStateT m = fmap fst . runStateT m ; {-# INLINE evalStateT #-}
instance Monad m => MonadState s (StateT s m) where
    get = StateT S.get ; {-# INLINE get #-}
    put = StateT . S.put ; {-# INLINE put #-}
There are a few non-obvious things to note here:
1. This wrapper performs about **15 TIMES better** than
`Control.Monad.State.Strict.StateT` (in the provided examples), and if we
create a loop in pure code imitating a parser, this `StateT` gets
completely optimized away, while the `mtl` version does not.
2. If we replace the following functions with lazy composition, we get
the same high performance:
runStateT = S.runStateT `dotl` fromStateT ; {-# INLINE runStateT #-}
evalStateT m = fmap fst `dotl` runStateT m ; {-# INLINE evalStateT #-}
3. However, if we slightly change `evalStateT`, we get bad
performance, equal to that of the `mtl` `StateT` version (15 times slower):
evalStateT m a = fmap fst $ runStateT m a ; {-# INLINE evalStateT #-}
It's a very strange result, especially since `evalStateT` is used only once
in the code while running the tests.
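For reference, the two `evalStateT` definitions compared in points 1 and 3 are extensionally equal, so any difference can only come from how GHC optimizes them. The following minimal sketch checks the equality on a small counting loop. It is an illustration under assumptions: it uses `Control.Monad.Trans.State.Strict` from `transformers` directly instead of the newtype wrapper, it uses the Prelude's ordinary `.` instead of the strict helper, and `evalPointFree`, `evalApplied` and `countSteps` are names made up here. It is not meant to reproduce the slowdown.

```haskell
import qualified Control.Monad.Trans.State.Strict as S
import Data.Functor.Identity (runIdentity)

-- Point-free variant, shaped like the fast definition in the ticket.
evalPointFree :: Functor m => S.StateT s m a -> s -> m a
evalPointFree m = fmap fst . S.runStateT m
{-# INLINE evalPointFree #-}

-- Fully applied variant, shaped like the slow definition in the ticket.
evalApplied :: Functor m => S.StateT s m a -> s -> m a
evalApplied m a = fmap fst $ S.runStateT m a
{-# INLINE evalApplied #-}

-- Hypothetical loop: decrement the state until it reaches zero,
-- counting the number of steps taken in an accumulator.
countSteps :: Monad m => Int -> S.StateT Int m Int
countSteps acc = do
  n <- S.get
  if n <= 0
    then pure acc
    else S.put (n - 1) >> countSteps (acc + 1)

-- Run the loop with either variant from a given start state.
viaPointFree, viaApplied :: Int -> Int
viaPointFree = runIdentity . evalPointFree (countSteps 0)
viaApplied   = runIdentity . evalApplied (countSteps 0)
```

Both `viaPointFree 1000` and `viaApplied 1000` should evaluate to `1000`; comparing the Core produced for the two (for example with `-ddump-simpl`) is one way to see where the optimizer treats them differently.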
== Strict Either & EitherT
The code contains a very minimalistic implementation of `Either` and
`EitherT` in order to make their definitions and utils strict. These
definitions are copy-pasted and simplified (removed unused code) from:
== Strict Bool and tuple
Moreover, we define a strict Bool `or` operation and a 2-element tuple with
strict fields:
data T a b = T !a !b deriving (Generic, Show, Functor)
data XBool = XTrue | XFalse deriving (Show, Generic)
(|||) :: XBool -> XBool -> XBool
(|||) !a !b = case a of
    XTrue  -> a
    XFalse -> b
{-# INLINE (|||) #-}
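As a quick sanity check of these definitions, the sketch below repeats them in a self-contained form and tabulates `(|||)`. The `LANGUAGE` pragmas are what these declarations need when compiled on their own, and `truthTable` and `bumped` are hypothetical names added only for illustration.

```haskell
{-# LANGUAGE BangPatterns  #-}
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics (Generic)

data T a b = T !a !b deriving (Generic, Show, Functor)
data XBool = XTrue | XFalse deriving (Show, Generic)

(|||) :: XBool -> XBool -> XBool
(|||) !a !b = case a of
  XTrue  -> a
  XFalse -> b
{-# INLINE (|||) #-}

-- Hypothetical helper: the full truth table of (|||), rendered with Show.
truthTable :: [String]
truthTable =
  [ show a ++ " ||| " ++ show b ++ " = " ++ show (a ||| b)
  | a <- [XTrue, XFalse], b <- [XTrue, XFalse] ]

-- The derived Functor instance maps over the second field only.
bumped :: T XBool Int
bumped = fmap (+ 1) (T XFalse 41)
```

The table shows that `(|||)` is an ordinary `or`: the result is `XFalse` only when both arguments are `XFalse`.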
== Parser implementation
All the above declarations are simple helpers compiled with `-XStrict`,
because the available libraries do not provide strict versions of them. The
code below is the "real" use case and shows the weird performance results.
The parser implementation is simple:
newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T XBool a) }
    deriving (Functor)
instance Monad m => Applicative (FailParser m) where
    pure = undefined
    (<*>) = undefined
instance Monad m => Monad (FailParser m) where
    return a = FailParser $ pure $ (T XFalse a) ; {-# INLINE return #-}
    FailParser ma >>= f = FailParser $ do
        T !b a <- ma
        T !b' a' <- fromFailParser $ f a
        return $! T (b ||| b') a'
    {-# INLINE (>>=) #-}
    _ >> _ = undefined ; {-# INLINE (>>) #-}
instance MonadTrans (FailParser) where
    lift m = FailParser $! lift $ fmap (T XFalse) m ; {-# INLINE lift #-}
We use `undefined` for unimportant functions. The parser is an `EitherT`
wrapper: `Left` means that parsing the input failed, `Right` that it succeeded.
The `XBool` denotes whether we made any progress (so after consuming a letter
it is set to `XTrue`). There are some additional utility functions, like
`returnProgress`, which behaves just like `return` but also sets the `XBool`
value to `XTrue`:
instance Monad m => MonadProgressParser (FailParser m) where
    -- In correct code it should be XTrue as described
    returnProgress a = FailParser $! pure (T XFalse a) ; {-# INLINE returnProgress #-}
The provided code uses `XFalse` instead of `XTrue` because of
some interesting observations:
1. The `XBool` value is used ONLY in the expression
`return $! T (b ||| b') a'` above, so it does NOT affect the way the program
logically executes its body.
2. Both `b` and `b'` are strict and fully evaluated.
3. If both `b` and `b'` are `XFalse` (as in the provided code, where they
always are `XFalse`), we get good performance. In order to test it, the above
code contains `XFalse` instead of `XTrue`.
4. If we use the correct version of `returnProgress`, as described just
before point 1 above, we get 15 times slower performance (the same as or very
similar to the one with the altered `evalStateT` definition). We could
try to explain it: maybe Haskell was able to optimize the code because it
discovered that only `XFalse` values were ever used, and after the
change there are both `XTrue` and `XFalse`, so it really has to run the
`(|||)` operator. This explanation fails as soon as we check that
changing `XFalse` to `XTrue` **everywhere** in the code again gives us bad
performance.
5. If we replace `return $! T (b ||| b') a'` with `return $! T b' a'`, we
get good performance, while replacing it with `return $! T b a'` gives us
bad performance. It does not make any sense, because both `b` and `b'` are
strict and fully evaluated. Moreover, it is the only place in the code where
they are used.
6. However, replacing `return $! T (b ||| b') a'` with
`return $! T (b' ||| b) a'` does NOT change the performance (we still get the
good one).
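The way the progress flag flows through `>>=` can be sketched in a self-contained form. This is a simplified illustration, not the attached code: `ExceptT` from `transformers` stands in for the ticket's strict `EitherT`, the input lives in a `StateT String` layer, `returnProgress` uses the intended `XTrue`, and `liftP`, `anyChar`, `consumeAll` and `runP` are hypothetical names. It shows the semantics only and makes no claim about performance.

```haskell
{-# LANGUAGE BangPatterns #-}

import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Except (ExceptT, runExceptT, throwE)
import qualified Control.Monad.Trans.State.Strict as S
import Data.Functor.Identity (Identity, runIdentity)

data T a b = T !a !b deriving (Eq, Show)
data XBool = XTrue | XFalse deriving (Eq, Show)

(|||) :: XBool -> XBool -> XBool
(|||) !a !b = case a of
  XTrue  -> a
  XFalse -> b

-- ExceptT stands in for the ticket's strict EitherT (an assumption).
newtype FailParser m a = FailParser { fromFailParser :: ExceptT () m (T XBool a) }

instance Monad m => Functor (FailParser m) where
  fmap f (FailParser m) = FailParser (fmap (\(T b a) -> T b (f a)) m)

instance Monad m => Applicative (FailParser m) where
  pure a    = FailParser (pure (T XFalse a))
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad m => Monad (FailParser m) where
  FailParser ma >>= f = FailParser $ do
    T !b a   <- ma
    T !b' a' <- fromFailParser (f a)
    return $! T (b ||| b') a'   -- progress flags are OR-ed together

-- Like pure, but records that input was consumed.
returnProgress :: Monad m => a -> FailParser m a
returnProgress a = FailParser (pure (T XTrue a))

-- Lift an action of the underlying monad without marking progress.
liftP :: Monad m => m a -> FailParser m a
liftP m = FailParser (lift (fmap (T XFalse) m))

-- Hypothetical primitive: consume one character, fail on empty input.
anyChar :: Monad m => FailParser (S.StateT String m) Char
anyChar = do
  s <- liftP S.get
  case s of
    []     -> FailParser (throwE ())
    c : cs -> liftP (S.put cs) >> returnProgress c

-- Consume characters in a loop; it can only end by failing on empty input.
consumeAll :: Monad m => FailParser (S.StateT String m) a
consumeAll = anyChar >> consumeAll

-- Hypothetical runner over plain String input.
runP :: FailParser (S.StateT String Identity) a -> String -> (Either () (T XBool a), String)
runP p = runIdentity . S.runStateT (runExceptT (fromFailParser p))
```

With these definitions, `runP anyChar "ab"` should give `(Right (T XTrue 'a'), "b")`, and `runP consumeAll "abc"` should end in `(Left (), "")`. The flag never influences control flow, which matches observation 1.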
== Final notes
We've been talking with some people, both in the company I'm working in
and on IRC, and we do not see any reason why this code behaves this way
or why it is so sensitive to changes. In fact, we have started to
worry a lot about how we can use Haskell for high-performance parts
at all if very simple changes can affect performance this much, like
changing `evalStateT m = fmap fst . runStateT m`
to `evalStateT m a = fmap fst $ runStateT m a`, which gives a 15 times
slowdown. It makes the code very fragile to any changes and
unmaintainable. Tracking performance in a very small program
(like the attached one) is possible, while tracking it in a bigger one,
taking into consideration the described problems, is almost impossible.
I'm writing this because I'm worried about where these problems originate
from and I would really like to solve them / know why they appear and be
sure we can continue to use Haskell for our high-performance demanding
