[GHC] #14035: Weird performance results.
GHC
ghc-devs at haskell.org
Thu Jul 27 20:41:03 UTC 2017
#14035: Weird performance results.
-------------------------------------+-------------------------------------
Reporter: danilo2 | Owner: (none)
Type: bug | Status: new
Priority: high | Milestone:
Component: Compiler | Version: 8.0.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Description changed by danilo2:
Old description:
> Hi! I was recently testing performance of a critical code in a product we
> are shipping and I'm getting really weird results.
>
> **The code is compiled with `-XStrict` enabled globally. The full source
> code for this ticket is attached.**
> The code is a pseudo-parser implementation. It consumes any char in a
> loop and fails on empty input in the end.
>
> Everything was compiled with following options (and many variations):
> `"-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100
> -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-
> factor=1000 -flate-dmd-anal -fspecialise-aggressively"`.
>
> \\
>
> == Helpers
>
> Let's define 2 helpers:
> {{{#!hs
> (.) :: (b -> c) -> (a -> b) -> a -> c
> (.) f g = \x -> f (g x) ; {-# INLINE (.) #-}
>
> dotl :: (b -> c) -> (a -> b) -> a -> c
> dotl ~f ~g = \ ~x -> f (g x) ; {-# INLINE dotl #-}
> }}}
>
> So whenever we see `.` in code it is strict in all of its arguments.
>
> \\
>
> == Strict StateT performance improvement
>
> Let's consider following code:
> {{{#!hs
> import qualified Control.Monad.State.Strict as S
>
> newtype StateT s m a = StateT { fromStateT :: S.StateT s m a } deriving
> (Applicative, Functor, Monad, MonadTrans)
>
> class MonadState s m | m -> s where
> get :: m s
> put :: s -> m ()
>
> runStateT :: forall s m a. StateT s m a -> s -> m (a, s)
> evalStateT :: forall s m a. Functor m => StateT s m a -> s -> m a
> runStateT m s = S.runStateT (fromStateT m) s ; {-# INLINE runStateT
> #-}
> evalStateT m = fmap fst . runStateT m ; {-# INLINE evalStateT
> #-}
>
> instance Monad m => MonadState s (StateT s m) where
> get = StateT S.get ; {-# INLINE get #-}
> put = StateT . S.put ; {-# INLINE put #-}
> }}}
>
> There are few non-obvious things to note here:
> 1. This wrapper performs about **15 TIMES better** than
> `Control.Monad.State.Strict.StateT` (in the provided examples) and if we
> create a loop in pure code imitating a parser, this `StateT` gets
> completely optimized away, while the `mtl`'s version does not.
>
> 2. If we replace the following functions with lazy composition, we get
> the same, high performance:
> {{{#!hs
> runStateT = S.runStateT `dotl` fromStateT ; {-# INLINE runStateT #-}
> evalStateT m = fmap fst `dotl` runStateT m ; {-# INLINE evalStateT #-}
> }}}
>
> 3. However, if we slightly change the `evalStateT`, we've got the bad
> performance, equals to the `mtl`'s `StateT` version (15 times slower):
> {{{#!hs
> evalStateT m a = fmap fst $ runStateT m a ; {-# INLINE evalStateT #-}
> }}}
>
> It's a very strange result, especially that `evalStateT` is used only
> once in the code while running the tests.
>
> \\
>
> == Strict Either & EitherT
>
> The code contains a very minimalistic implementation of `Either` and
> `EitherT` in order to make their definitions and utils strict. These
> definitions are copy-pasted and simplified (removed unused code) from:
> https://hackage.haskell.org/package/base-4.10.0.0/docs/src/Data.Either.html
> https://hackage.haskell.org/package/either-4.4.1.1/docs/src/Control.Monad.Trans.Either.html
>
> \\
>
> == Strict Bool and tuple
>
> Moreover we define strict Bool `or` operation and 2-element tuple with
> strict arguments:
>
> {{{#!hs
> data T a b = T !a !b deriving (Generic, Show, Functor)
>
> data XBool = XTrue | XFalse deriving (Show, Generic)
>
> (|||) :: XBool -> XBool -> XBool
> (|||) !a !b = case a of
> XTrue -> a
> XFalse -> b
> {-# INLINE (|||) #-}
> }}}
>
> \\
>
> == Parser implementation
>
> All the above declarations were simple helpers compiled with `-XStrict`,
> because available libraries do not provide them for us. This code is a
> "real" use case and shows the weird performance results.
>
> The parser implementation is simple:
> {{{#!hs
> newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T
> XBool a) } deriving (Functor)
>
> instance Monad m => Applicative (FailParser m) where
> pure = undefined
> (<*>) = undefined
>
> instance Monad m => Monad (FailParser m) where
> return a = FailParser $ pure $ (T XFalse a) ; {-# INLINE return #-}
> FailParser ma >>= f = FailParser $ do
> T !b a <- ma
> T !b' a' <- fromFailParser $ f a
> return $! T (b ||| b') a'
> {-# INLINE (>>=) #-}
> _ >> _ = undefined ; {-# INLINE (>>) #-}
>
> instance MonadTrans (FailParser) where
> lift m = FailParser $! lift $ fmap (T XFalse) m ; {-# INLINE lift #-}
> }}}
>
> We use `undefined` for non-important functions. The parser is `EitherT`
> wrapper: Left happens when we failed parsing input, while Right
> otherwise. The `XBool` denotes if we made any progress (so after
> consuming a letter it is set to `XTrue`). There are some additional util
> functions, like `returnProgress` which behaves just like return, but also
> sets the `XBool` value to `XTrue`:
>
> {{{#!hs
> instance Monad m => MonadProgressParser (FailParser m) where
> returnProgress a = FailParser $! pure (T XFalse a) ; {-# INLINE
> returnProgress #-} -- In correct code it should be XTrue as described
> below.
> }}}
>
> In the provided code, there is `XFalse` used instead of `XTrue` because
> of some interesting observations:
>
> 1. The `XBool` value is used ONLY in the expression `return $! T (b |||
> b') a'` above, so it does NOT affect the way the program logically
> executes it's body.
>
> 2. Both `b` and `b'` are strict and fully evaluated.
>
> 3. If both `b` and `b'` are `XFalse` (as in the provided code, they
> always are `XFalse`) we get a good performance. In order to test it, the
> above code contains `XFalse` instead of `XTrue`.
>
> 4. If we use the correct version of `returnProgress` as described just
> before point 1 above, we get 15 times slower performance (the same or
> very similar to the one when used altered `evalStateT` definition). We
> could try to explain it: maybe Haskell was able to optimize code if it
> discovered, there always were only `XFalse` values used and after the
> change there are both `XTrue` and `XFalse`, so it really has to run the
> `(|||)` operator. This way of thinking fails as fast as we check that
> changing `XFalse` to `XTrue` **everywhere** in the code give us again bad
> performance.
>
> 5. If we replace `return $! T (b ||| b') a'` with `return $! T b' a'` we
> get good performance, while replacing it with `return $! T b a'` give us
> bad performance. It does not make any sense, because both `b` and `b'`
> are strict and fully evaluated. Moreover, it is the only place in code
> where they are used.
>
> 6. However replacing `return $! T (b ||| b') a'` with `return $! T (b'
> ||| b) a'` does NOT change the performance (we are getting the good one).
>
> == Final notes
>
> We've been talking with some people - both in the company I'm working in
> as well as on IRC and we do not see any reason why this code behaves in
> this way and why it is so sensitive to the changes. In fact we started to
> be worried a lot about how we can use Haskell for high-performance parts
> at all, if it is not obvious if a very simple change do not affect
> performance a lot. A good example is changing `evalStateT m = fmap fst .
> runStateT m` to `evalStateT m a = fmap fst $ runStateT m a`, which gives
> 15 times slowdown. This situation makes the source code both very fragile
> to any changes as well as makes it unmaintainable. Tracking performance
> in a very small program (like the attached one) is possible, while
> tracking it in bigger one, taking in consideration the described
> problems, make it almost impossible. I'm writing this because I'm worried
> about where these problems originate from and I would really like to
> solve them / know why they appear, and be sure we can continue to use
> Haskell for our high-performance demanding applications.
New description:
Hi! I was recently testing performance of a critical code in a product we
are shipping and I'm getting really weird results.
**The code is compiled with `-XStrict` enabled globally. The full source
code for this ticket is attached, while the exposed code below uses `...`
to hide some non-important implementations.**
To get desired results, we use following GHC flags: `-O2 -funfolding-use-
threshold=10000`.
Let's consider the following program. It is just a pseudo-parser
implementation. It consumes 'a' chars in a loop and fails on empty input
in the end:
{{{#!hs
-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-- | WARNING: -XStrict enabled in this file !!!
-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
module Main where
imports ... (full source attached to this ticket)
------------------------
-- === Primitives === --
------------------------
-- === Strict Either === --
data Either e a = Left e | Right a deriving (Eq, Generic, Ord, Read,
Show, Functor)
newtype EitherT e m a = EitherT { runEitherT :: m (Either e a) }
instance Monad m => Functor (EitherT e m) where ...
instance Monad m => Applicative (EitherT e m) where ...
instance Monad m => Monad (EitherT e m) where ...
-- === Strict Bool === --
data XBool = XTrue | XFalse deriving (Show, Generic)
(|||) :: XBool -> XBool -> XBool
(|||) !a !b = case a of
XTrue -> a
XFalse -> b
{-# INLINE (|||) #-}
-- === Strict Tuple === --
data T a b = T !a !b deriving (Generic, Show, Functor)
------------------------
-- === FailParser === --
------------------------
-- === Definition === --
-- | It is just like EitherT, but also contains progress indicator - a
field of type XBool
-- which tells us if we've already parsed a char or not yet. In this
snippet code however,
-- it does not do anything valuable - it just stores the value.
newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T
XBool a) } deriving (Functor)
instance Monad m => Applicative (FailParser m) where
pure = undefined
(<*>) = undefined
instance Monad m => Monad (FailParser m) where
return a = FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}
FailParser ma >>= f = FailParser $ do
T !b !a <- ma
T !b' !a' <- fromFailParser $ f a
return $ T (b ||| b') a'
{-# INLINE (>>=) #-}
_ >> _ = undefined ; {-# INLINE (>>) #-}
-- === Running === --
failParser :: m (Either () (T XBool a)) -> FailParser m a
failParser a = FailParser $ EitherT a ; {-# INLINE failParser #-}
runFailParser :: forall m a. FailParser m a -> m (Either () (T XBool a))
runFailParser f = runEitherT $ fromFailParser f ; {-# INLINE runFailParser
#-}
-- === MonadFailedParser === --
-- | Behaves just like "left" - lifts until it hits MonadFailedParser
class Monad m => MonadFailedParser m where
failed :: m a
instance {-# OVERLAPPABLE #-} (MonadFailedParser m, MonadTrans t, Monad (t
m))
=> MonadFailedParser (t m) where
failed = lift failed ; {-# INLINE failed #-}
instance Monad m => MonadFailedParser (FailParser m) where
failed = failParser $ return $ Left () ; {-# INLINE failed #-}
-----------------------
-- === Main loop === --
-----------------------
parserLoop :: StateT Text (FailParser Identity) Bool
parserLoop = parserStep >> parserLoop
parserStep :: StateT Text (FailParser Identity) Char
parserStep = get >>= \s -> case Text.uncons s of
Just (!t, !s') -> if t == 'a' then put s' >> return t else failed
Nothing -> failed
{-# INLINE parserStep #-}
-- === Criterion === --
instance NFData XBool
instance (NFData l, NFData r) => NFData (Either l r)
instance (NFData a, NFData b) => NFData (T a b)
genText :: Int -> Text
genText i = fromString $ replicate i 'a' ; {-# INLINE genText #-}
a_parsing_main :: IO ()
a_parsing_main = do
defaultMain
[ env (return $ genText $ 10^6) $ bench "a*" . nf (runIdentity .
runFailParser . evalStateT parserLoop)
]
main = a_parsing_main
}}}
The most important part is the `bind` implementation of `FailParser`:
{{{#!hs
FailParser ma >>= f = FailParser $ do
T b a <- ma
T b' a' <- fromFailParser $ f a
return $ T (b ||| b') a'
}}}
There are several performance related observations and problems:
1. **INFO:** Everything is compiled with `-XStrict` and every field in
this code is fully evaluated, in particular `b` and `b'` are fully
evaluated, strict values of type `XBool`.
2. **INFO:** Neither `b` nor `b'` are used anywhere else in the code. They
are just fields in `FailParser` which should be used to store information
if we did consume a letter or we did not.
3. **PROBLEM:** When provided with `10^6` characters this code works in
1ms. If we replace `(b ||| b')` with `(b' ||| b)` or with `(b')` the time
do NOT change. However, if we replace it with `(b)`, we've got **15
times** slowdown. Moreover, the resulting core is changed drastically in
some places.
4. **PROBLEM:** Another interesting observation is that the value of
`XBool` is created only in one place in the code, namely in:
`FailParser`'s `Monad` implementation, in `return` function: `return a =
FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}`. We never
change the XFalse, so this is the only value that could appear in this
code. If we change it to `XTrue` in this implementation however, we again
get **15 times** slowdown.
5. **INFO:** The order of `case` expressions in definition of `(|||)` or
the order of constructor defintions of any datatype does not affect the
above results.
--
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14035#comment:4>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list