[GHC] #14013: Bad monads performance

Sat Jul 22 22:02:03 UTC 2017

#14013: Bad monads performance
-------------------------------------+-------------------------------------
        Reporter:  danilo2           |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  high              |            Milestone:
       Component:  Compiler          |              Version:  8.2.1-rc3
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by danilo2:

Old description:

> Hi! We've been struggling with a very strange GHC behavior on IRC today.
> Let's consider the following code (needs mtl and criterion to be
> compiled):
>
> {{{
> module Main where
>
> import Prelude
> import Criterion.Main
> import qualified Control.Monad.State.Strict as Strict
> import qualified Control.Monad.State.Class  as State
> import Control.DeepSeq (NFData, rnf, force)
> import GHC.IO          (evaluate)
> import Data.Monoid
>

> -----------------------------
> -- === Criterion utils === --
> -----------------------------
>
> eval :: NFData a => a -> IO a
> eval = evaluate . force ; {-# INLINE eval #-}
>
> liftExp :: (Int -> a) -> (Int -> a)
> liftExp f = f . (10^) ; {-# INLINE liftExp #-}
>
> expCodeGen :: NFData a => (Int -> a) -> (Int -> IO a)
> expCodeGen f i = do
>     putStrLn $ "generating input code (10e" <> show i <> " chars)"
>     out <- eval $ liftExp f i
>     putStrLn "code generated sucessfully"
>     return out
> {-# INLINE expCodeGen #-}
>
> expCodeGenBench :: (NFData a, NFData b) => (Int -> a) -> (a -> b) -> Int
> -> Benchmark
> expCodeGenBench f p i = env (expCodeGen f i) $ bench ("10e" <> show i) .
> nf p ; {-# INLINE expCodeGenBench #-}
>

> -------------------------------
> -- === (a*) list parsing === --
> -------------------------------
>
> genList_a :: Int -> [Char]
> genList_a i = replicate i 'a' ; {-# INLINE genList_a #-}
>
> pureListParser_a :: [Char] -> Bool
> pureListParser_a = \case
>     'a':s -> pureListParser_a s
>     []    -> True
>     _     -> False
> {-# INLINE pureListParser_a #-}
>
> mtlStateListParser_a :: State.MonadState [Char] m => m Bool
> mtlStateListParser_a = State.get >>= \case
>     'a':s -> State.put s >> mtlStateListParser_a
>     []    -> return True
>     _     -> return False
> {-# INLINE mtlStateListParser_a #-}
>
> mtlStateListParser_a_typed :: Strict.State [Char] Bool
> mtlStateListParser_a_typed = State.get >>= \case
>     'a':s -> State.put s >> mtlStateListParser_a_typed
>     []    -> return True
>     _     -> return False
> {-# INLINE mtlStateListParser_a_typed #-}
>
> mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool
> mtlStateListParser_a_let = go where
>     go = Strict.get >>= \case
>         'a':s -> Strict.put s >> go
>         []    -> return True
>         _     -> return False
> {-# INLINE mtlStateListParser_a_let #-}
>

> {-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-}
> {-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-}
>

> main = do
>     defaultMain
>         [ bgroup "a*" $
>             [ bgroup "pure"                    $ expCodeGenBench
> genList_a pureListParser_a                              <$> [6..6]
>             , bgroup "mtl.State.Strict"        $ expCodeGenBench
> genList_a (Strict.evalState mtlStateListParser_a)       <$> [6..6]
>             , bgroup "mtl.State.Strict typed"  $ expCodeGenBench
> genList_a (Strict.evalState mtlStateListParser_a_typed) <$> [6..6]
>             , bgroup "mtl.State.Strict let"    $ expCodeGenBench
> genList_a (Strict.evalState mtlStateListParser_a_let)   <$> [6..6]
>             ]
>         ]
>
> }}}
>
> The code was compiled with following options (and many other variations):
> `-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100
> -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-
> factor=1000 -flate-dmd-anal`
>
> Everything in this code has `INLINE` pragma. The important part we should
> focus on are these two functions:
>
> {{{
>
> pureListParser_a :: [Char] -> Bool
> pureListParser_a = \case
>     'a':s -> pureListParser_a s
>     []    -> True
>     _     -> False
> {-# INLINE pureListParser_a #-}
>
> mtlStateListParser_a :: State.MonadState [Char] m => m Bool
> mtlStateListParser_a = State.get >>= \case
>     'a':s -> State.put s >> mtlStateListParser_a
>     []    -> return True
>     _     -> return False
> {-# INLINE mtlStateListParser_a #-}
> }}}
>
> Which are just "parsers" accepting strings containing only 'a'
> characters. The former is pure one, while the later uses `State` to keep
> the remaining input. The following list contains performance related
> observations:
>
> 0. For the rest of the points, let's call the performance of
> `pureListParser_a` a "good" one and everything worse a "bad" one.
>
> 1. The performance of `mtlStateListParser_a` is bad, it runs 10 times
> slower than `pureListParser_a`. Inspecting CORE we can observe that GHC
> jumps between `(# a,b #)` and `(a,b)` representations all the time.
>
> 2. If we add a specialize pragma `{-# SPECIALIZE mtlStateListParser_a ::
> Strict.State [Char] Bool #-}`, the performance of `mtlStateListParser_a`
> is good (exactly the same as `pureListParser_a`).
>
> 3. If we do NOT use specialize pragma, but we use explicite, non-
> polymorphic type signature `mtlStateListParser_a_typed :: Strict.State
> [Char] Bool`, the performance is bad (!), identical to the polymorphic
> version without specialization.
>
> 4. If we use SPECIALIZE pragma together with explicite, non-polymorphic
> type, so we use BOTH `mtlStateListParser_a_typed :: Strict.State [Char]
> Bool` AND `{-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State
> [Char] Bool #-}` we get the good performance.
>
> 5. If we transform `pureListParser_a` to
>
> {{{
> mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool
> mtlStateListParser_a_let = go where
>     go = Strict.get >>= \case
>         'a':s -> Strict.put s >> go
>         []    -> return True
>         _     -> return False
> {-# INLINE mtlStateListParser_a_let #-}
> }}}
>
> we again get the good performance without the need to use `SPECIALIZE`
> pragmas.
>
> 6. The performance of all the functions that are not optimized as good as
> `pureListParser_a` is a lot worse in GHC 8.2.1-rc3 than in 8.0.2.
>
> 7. The not-yet documented flag `-fspecialise-aggressively` does NOT
> affect the results.
>

> The above points raise the following questions:
>
> 1. Why GHC does not optimize `mtlStateListParser_a` the same way as
> `pureListParser_a` and where the jumping between `(# a,b #)` and `(a,b)`
> comes from?
>
> 2. Is there any way to tell GHC to automatically insert `SPECIALIZE`
> pragmas, especially in performance critical code?
>
> 3. Why providing very-explicite type signature
> `mtlStateListParser_a_typed :: Strict.State [Char] Bool` does not solve
> the problem unless we use `SPECIALIZE` pragma that tells the same as the
> signature?
>
> 4. Why the trick to alias the body of recursive function to a local
> variable `go` affects the performance in any way, especially when it does
> NOT bring any variable to the local let scope?
>

> We've been testing this behavior in GHC 8.0.2 and 8.2.1-rc3 and several
> people reported exactly the same observations.

New description:

 Hi! We've been struggling with a very strange GHC behavior on IRC today.
 Let's consider the following code (needs mtl and criterion to be
 compiled):

 {{{
 module Main where

 import Prelude
 import Criterion.Main
 import qualified Control.Monad.State.Strict as Strict
 import qualified Control.Monad.State.Class  as State
 import Control.DeepSeq (NFData, rnf, force)
 import GHC.IO          (evaluate)
 import Data.Monoid

 -----------------------------
 -- === Criterion utils === --
 -----------------------------

 eval :: NFData a => a -> IO a
 eval = evaluate . force ; {-# INLINE eval #-}

 liftExp :: (Int -> a) -> (Int -> a)
 liftExp f = f . (10^) ; {-# INLINE liftExp #-}

 expCodeGen :: NFData a => (Int -> a) -> (Int -> IO a)
 expCodeGen f i = do
     putStrLn $ "generating input code (10e" <> show i <> " chars)"
     out <- eval $ liftExp f i
     putStrLn "code generated sucessfully"
     return out
 {-# INLINE expCodeGen #-}

 expCodeGenBench :: (NFData a, NFData b) => (Int -> a) -> (a -> b) -> Int
 -> Benchmark
 expCodeGenBench f p i = env (expCodeGen f i) $ bench ("10e" <> show i) .
 nf p ; {-# INLINE expCodeGenBench #-}

 -------------------------------
 -- === (a*) list parsing === --
 -------------------------------

 genList_a :: Int -> [Char]
 genList_a i = replicate i 'a' ; {-# INLINE genList_a #-}

 pureListParser_a :: [Char] -> Bool
 pureListParser_a = \case
     'a':s -> pureListParser_a s
     []    -> True
     _     -> False
 {-# INLINE pureListParser_a #-}

 mtlStateListParser_a :: State.MonadState [Char] m => m Bool
 mtlStateListParser_a = State.get >>= \case
     'a':s -> State.put s >> mtlStateListParser_a
     []    -> return True
     _     -> return False
 {-# INLINE mtlStateListParser_a #-}

 mtlStateListParser_a_typed :: Strict.State [Char] Bool
 mtlStateListParser_a_typed = State.get >>= \case
     'a':s -> State.put s >> mtlStateListParser_a_typed
     []    -> return True
     _     -> return False
 {-# INLINE mtlStateListParser_a_typed #-}

 mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool
 mtlStateListParser_a_let = go where
     go = Strict.get >>= \case
         'a':s -> Strict.put s >> go
         []    -> return True
         _     -> return False
 {-# INLINE mtlStateListParser_a_let #-}

 {-# SPECIALIZE mtlStateListParser_a :: Strict.State [Char] Bool #-}
 {-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State [Char] Bool #-}

 main = do
     defaultMain
         [ bgroup "a*" $
             [ bgroup "pure"                    $ expCodeGenBench genList_a
 pureListParser_a                              <$> [6..6]
             , bgroup "mtl.State.Strict"        $ expCodeGenBench genList_a
 (Strict.evalState mtlStateListParser_a)       <$> [6..6]
             , bgroup "mtl.State.Strict typed"  $ expCodeGenBench genList_a
 (Strict.evalState mtlStateListParser_a_typed) <$> [6..6]
             , bgroup "mtl.State.Strict let"    $ expCodeGenBench genList_a
 (Strict.evalState mtlStateListParser_a_let)   <$> [6..6]
             ]
         ]

 }}}

 The code was compiled with following options (and many other variations):
 `-threaded -funbox-strict-fields -O2 -fconstraint-solver-iterations=100
 -funfolding-use-threshold=10000 -fexpose-all-unfoldings -fsimpl-tick-
 factor=1000 -flate-dmd-anal`

 Everything in this code has `INLINE` pragma. The important part we should
 focus on are these two functions:

 {{{

 pureListParser_a :: [Char] -> Bool
 pureListParser_a = \case
     'a':s -> pureListParser_a s
     []    -> True
     _     -> False
 {-# INLINE pureListParser_a #-}

 mtlStateListParser_a :: State.MonadState [Char] m => m Bool
 mtlStateListParser_a = State.get >>= \case
     'a':s -> State.put s >> mtlStateListParser_a
     []    -> return True
     _     -> return False
 {-# INLINE mtlStateListParser_a #-}
 }}}

 Which are just "parsers" accepting strings containing only 'a' characters.
 The former is pure one, while the later uses `State` to keep the remaining
 input. The following list contains performance related observations:

 0. For the rest of the points, let's call the performance of
 `pureListParser_a` a "good" one and everything worse a "bad" one.

 1. The performance of `mtlStateListParser_a` is bad, it runs 10 times
 slower than `pureListParser_a`. Inspecting CORE we can observe that GHC
 jumps between `(# a,b #)` and `(a,b)` representations all the time.

 2. If we add a specialize pragma `{-# SPECIALIZE mtlStateListParser_a ::
 Strict.State [Char] Bool #-}`, the performance of `mtlStateListParser_a`
 is good (exactly the same as `pureListParser_a`).

 3. If we do NOT use specialize pragma, but we use explicite, non-
 polymorphic type signature `mtlStateListParser_a_typed :: Strict.State
 [Char] Bool`, the performance is bad (!), identical to the polymorphic
 version without specialization.

 4. If we use SPECIALIZE pragma together with explicite, non-polymorphic
 type, so we use BOTH `mtlStateListParser_a_typed :: Strict.State [Char]
 Bool` AND `{-# SPECIALIZE mtlStateListParser_a_typed :: Strict.State
 [Char] Bool #-}` we get the good performance.

 5. If we transform `pureListParser_a` to

 {{{
 mtlStateListParser_a_let :: Strict.MonadState [Char] m => m Bool
 mtlStateListParser_a_let = go where
     go = Strict.get >>= \case
         'a':s -> Strict.put s >> go
         []    -> return True
         _     -> return False
 {-# INLINE mtlStateListParser_a_let #-}
 }}}

 we again get the good performance without the need to use `SPECIALIZE`
 pragmas.

 6. The performance of all the functions that are not optimized as good as
 `pureListParser_a` is a lot worse in GHC 8.2.1-rc3 than in 8.0.2.

 7. The not-yet documented flag `-fspecialise-aggressively` does NOT affect
 the results.

 The above points raise the following questions:

 1. Why GHC does not optimize `mtlStateListParser_a` the same way as
 `pureListParser_a` and where the jumping between `(# a,b #)` and `(a,b)`
 comes from?

 2. Is there any way to tell GHC to automatically insert `SPECIALIZE`
 pragmas, especially in performance critical code?

 3. Why providing very-explicite type signature `mtlStateListParser_a_typed
 :: Strict.State [Char] Bool` does not solve the problem unless we use
 `SPECIALIZE` pragma that tells the same as the signature? (GHC even warns:
 `SPECIALISE pragma for non-overloaded function
 ‘mtlStateListParser_a_typed’` but it affects performance.)

 4. Why the trick to alias the body of recursive function to a local
 variable `go` affects the performance in any way, especially when it does
 NOT bring any variable to the local let scope?

 We've been testing this behavior in GHC 8.0.2 and 8.2.1-rc3 and several
 people reported exactly the same observations.

--

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14013#comment:4>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler