[GHC] #14208: Performance with O0 is much better than the default or with -O2, runghc performs the best

Tue Mar 27 21:57:12 UTC 2018

#14208: Performance with O0 is much better than the default or with -O2, runghc
performs the best
-------------------------------------+-------------------------------------
        Reporter:  harendra          |                Owner:  osa1
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.2.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by harendra):

 In case you need another data point, my original streaming library that
 made me file this issue still exhibits the same behavior. GHCi is 6x
 faster than my regular compiled code. I tried even compiling everything
 including all dependencies with exactly the same optimization flags to
 make sure there is no funny business due to mixing of opt flags. You can
 see the behavior in the tree available on github here:
 https://github.com/composewell/streamly/tree/199e20dd4b62ac2dafea0a40dc2ce3d97c307542

 You can clone the repo and run the experiment like this:

 {{{
 $ stack bench
 benchmarked streaming ops
 time                 34.39 ms   (32.99 ms .. 35.67 ms)
                      0.995 R²   (0.991 R² .. 0.998 R²)
 mean                 33.97 ms   (33.24 ms .. 35.43 ms)

 $ stack runghc benchmark/Main.hs
 benchmarked streaming ops
 time                 6.215 ms   (5.684 ms .. 6.860 ms)
                      0.945 R²   (0.896 R² .. 0.978 R²)
 mean                 6.610 ms   (6.333 ms .. 6.991 ms)
 }}}

 If I change the optimization flags to -O0 for benchmark stanza in cabal
 file I can get close to ghci performance. The code that I am benchmarking
 is like this:

 {{{
 {-# INLINE streamlyOp #-}
 streamlyOp :: IO Int
 streamlyOp = do
     xs <- S.toList $ S.serially $
           S.each [1..100000 :: Int]
         & fmap (+1)
         & fmap (+1)
         & fmap (+1)
         & fmap (+1)
     return (Prelude.length xs)
 }}}

 It seems the problem is with the `fmap` operation (I may be wrong), it is
 6 times slower in case of GHC, and every other fmap I add, the benchmark
 timings increase but the ratio remains the same. I tried using an INLINE
 on fmap, I also tried to SPECIALIZE it to IO and INT type but no change.

 The `fmap` op is defined in `src/Streamly/Streams.hs` file like this:

 {{{
 instance Monad m => Functor (StreamT m) where
     fmap f (StreamT (Stream m)) = StreamT $ Stream $ \_ stp yld ->
         let yield a Nothing  = yld (f a) Nothing
             yield a (Just r) = yld (f a)
                                (Just (getStreamT (fmap f (StreamT r))))
         in m Nothing stp yield
 }}}

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14208#comment:31>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler