[GHC] #14208: Performance with O0 is much better than the default or with -O2, runghc performs the best
GHC
ghc-devs at haskell.org
Tue Mar 27 07:10:32 UTC 2018
#14208: Performance with O0 is much better than the default or with -O2, runghc
performs the best
-------------------------------------+-------------------------------------
Reporter: harendra | Owner: osa1
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 8.2.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by osa1):
I can somewhat reproduce this with HEAD.
I'm currently focusing on the compiled code issues, ignoring GHCi.
My setup: I have two files
Main.hs:
{{{#!haskell
{-# LANGUAGE CPP #-}
module Main where
import Criterion.Main (defaultMain, bench, nfIO)
-- Uncomment this to have all the code in one module
-- #define SINGLE_MODULE
#ifndef SINGLE_MODULE
import List
#else
import Control.Monad (liftM)
data List a = Stop | Yield a (List a)
instance Semigroup (List a) where
x <> y =
case x of
Stop -> y
Yield a r -> Yield a (mappend r y)
instance Monoid (List a) where
-- {-# INLINE mempty #-}
mempty = Stop
-- {-# INLINE mappend #-}
mappend = (<>)
-- {-# NOINLINE toList #-}
toList :: Monad m => List a -> m [a]
toList m =
case m of
Stop -> return []
Yield a r -> liftM (a :) (toList r)
#endif
{-# NOINLINE len #-}
len :: IO Int
len = do
xs <- toList $ (foldr mappend mempty $ map (\x -> Yield x Stop)
[1..100000 :: Int])
return (length xs)
main :: IO ()
main = defaultMain [ bench "len" $ nfIO len ]
}}}
When I'm measuring allocations I remove criterion imports and use this
main:
{{{
main = len >>= print
}}}
Note that I have a `NOINLINE` on `len` to avoid optimising it in the
benchmark site. The original report does not have this.
List.hs:
{{{#!haskell
module List where
import Control.Monad (liftM)
data List a = Stop | Yield a (List a)
instance Semigroup (List a) where
x <> y =
case x of
Stop -> y
Yield a r -> Yield a (mappend r y)
instance Monoid (List a) where
mempty = Stop
mappend = (<>)
toList :: Monad m => List a -> m [a]
toList m =
case m of
Stop -> return []
Yield a r -> liftM (a :) (toList r)
}}}
I have three configurations:
- -O0
- -O1
- -O2
- -O0 -DSINGLE_MODULE
- -O1 -DSINGLE_MODULE
- -O2 -DSINGLE_MODULE
I first run all these with `+RTS -s` using `main = len >>= print` as the
main function.
{{{
============ -O0
===============================================================
49,723,096 bytes allocated in the heap
25,729,264 bytes copied during GC
6,576,744 bytes maximum residency (5 sample(s))
29,152 bytes maximum slop
13 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 41 colls, 0 par 0.011s 0.011s 0.0003s
0.0008s
Gen 1 5 colls, 0 par 0.010s 0.010s 0.0020s
0.0047s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.011s ( 0.012s elapsed)
GC time 0.021s ( 0.021s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.032s ( 0.033s elapsed)
%GC time 64.0% (63.8% elapsed)
Alloc rate 4,366,732,069 bytes per MUT second
Productivity 35.6% of total user, 35.9% of total elapsed
============ -O1
===============================================================
28,922,528 bytes allocated in the heap
18,195,344 bytes copied during GC
4,066,200 bytes maximum residency (5 sample(s))
562,280 bytes maximum slop
13 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 22 colls, 0 par 0.008s 0.008s 0.0004s
0.0016s
Gen 1 5 colls, 0 par 0.008s 0.008s 0.0016s
0.0029s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.009s ( 0.009s elapsed)
GC time 0.016s ( 0.016s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.025s ( 0.025s elapsed)
%GC time 63.8% (63.9% elapsed)
Alloc rate 3,262,174,222 bytes per MUT second
Productivity 35.3% of total user, 35.3% of total elapsed
============ -O2
===============================================================
28,922,528 bytes allocated in the heap
18,195,344 bytes copied during GC
4,066,200 bytes maximum residency (5 sample(s))
562,280 bytes maximum slop
13 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 22 colls, 0 par 0.008s 0.008s 0.0003s
0.0008s
Gen 1 5 colls, 0 par 0.008s 0.008s 0.0017s
0.0029s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.008s ( 0.008s elapsed)
GC time 0.016s ( 0.016s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.024s ( 0.024s elapsed)
%GC time 66.6% (66.6% elapsed)
Alloc rate 3,714,684,268 bytes per MUT second
Productivity 32.7% of total user, 32.7% of total elapsed
============ -O0 -DSINGLE_MODULE
===============================================
49,723,032 bytes allocated in the heap
25,729,184 bytes copied during GC
6,576,728 bytes maximum residency (5 sample(s))
29,152 bytes maximum slop
13 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 41 colls, 0 par 0.010s 0.010s 0.0003s
0.0008s
Gen 1 5 colls, 0 par 0.010s 0.010s 0.0019s
0.0042s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.011s ( 0.011s elapsed)
GC time 0.020s ( 0.020s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.031s ( 0.031s elapsed)
%GC time 65.0% (65.0% elapsed)
Alloc rate 4,609,752,610 bytes per MUT second
Productivity 34.8% of total user, 34.8% of total elapsed
============ -O1 -DSINGLE_MODULE
===============================================
16,122,496 bytes allocated in the heap
7,392,664 bytes copied during GC
3,438,424 bytes maximum residency (4 sample(s))
55,464 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 10 colls, 0 par 0.004s 0.004s 0.0004s
0.0008s
Gen 1 4 colls, 0 par 0.005s 0.005s 0.0012s
0.0019s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.004s ( 0.004s elapsed)
GC time 0.009s ( 0.009s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.014s ( 0.014s elapsed)
%GC time 66.5% (66.6% elapsed)
Alloc rate 3,663,260,346 bytes per MUT second
Productivity 32.5% of total user, 32.5% of total elapsed
============ -O2 -DSINGLE_MODULE
===============================================
13,722,496 bytes allocated in the heap
6,798,640 bytes copied during GC
2,158,376 bytes maximum residency (3 sample(s))
33,248 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 9 colls, 0 par 0.007s 0.007s 0.0008s
0.0021s
Gen 1 3 colls, 0 par 0.004s 0.005s 0.0015s
0.0030s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.004s ( 0.004s elapsed)
GC time 0.012s ( 0.012s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.016s ( 0.016s elapsed)
%GC time 74.2% (74.3% elapsed)
Alloc rate 3,479,572,009 bytes per MUT second
Productivity 25.2% of total user, 25.2% of total elapsed
}}}
Summary: allocations consistently reduce as optimisation level increases.
Secondly I run criterion benchmark to measure runtime, using the same
configurations:
{{{
============ -O0
===============================================================
benchmarking len
time 13.50 ms (13.23 ms .. 13.71 ms)
0.998 R² (0.997 R² .. 0.999 R²)
mean 13.55 ms (13.35 ms .. 13.81 ms)
std dev 613.5 μs (424.7 μs .. 918.2 μs)
variance introduced by outliers: 18% (moderately inflated)
============ -O1
===============================================================
benchmarking len
time 15.83 ms (15.62 ms .. 16.02 ms)
0.999 R² (0.998 R² .. 0.999 R²)
mean 15.92 ms (15.75 ms .. 16.10 ms)
std dev 463.5 μs (340.2 μs .. 669.1 μs)
============ -O2
===============================================================
benchmarking len
time 15.70 ms (15.51 ms .. 15.90 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 15.74 ms (15.59 ms .. 15.87 ms)
std dev 355.2 μs (271.2 μs .. 470.7 μs)
============ -O0 -DSINGLE_MODULE
===============================================
benchmarking len
time 14.85 ms (13.81 ms .. 16.06 ms)
0.976 R² (0.959 R² .. 0.997 R²)
mean 13.60 ms (13.22 ms .. 14.14 ms)
std dev 1.152 ms (773.1 μs .. 1.614 ms)
variance introduced by outliers: 41% (moderately inflated)
============ -O1 -DSINGLE_MODULE
===============================================
benchmarking len
time 6.802 ms (6.702 ms .. 6.922 ms)
0.997 R² (0.994 R² .. 0.999 R²)
mean 6.845 ms (6.765 ms .. 6.945 ms)
std dev 261.8 μs (201.3 μs .. 336.8 μs)
variance introduced by outliers: 18% (moderately inflated)
============ -O2 -DSINGLE_MODULE
===============================================
benchmarking len
time 6.614 ms (6.501 ms .. 6.712 ms)
0.998 R² (0.997 R² .. 0.999 R²)
mean 6.399 ms (6.317 ms .. 6.472 ms)
std dev 239.1 μs (201.7 μs .. 292.5 μs)
variance introduced by outliers: 18% (moderately inflated)
}}}
So;
- Everything works as expected in single module case. Both runtime and
allocations get lower as optimisation level increases.
- In multi-module -O1 and -O2 produce identical outputs, runtime
difference is just noise.
- In multi-module we get better allocations with -O1 vs. -O0, but runtime
gets somewhat worse. This is what we should investigate.
To see why we allocate less in multi-module with -O1 I compared the STG
outputs (multi-module -O0 vs. multi-module -O1), the answer is fusion
kicking in with -O1. We have an intermediate function application for
`foldr mappend mempty` in -O0 output which disappears with -O1.
Why does the runtime get worse? I don't know but I suspect it's just
noise. Really the code is better (as in, it does less work) with -O1 than
with -O0.
I also compared single-module -O1 with multi-module -O1, the reason why
single module is better is becuase the `toList` function is not inlined
cross-module but it's inlined within the module.
So I think in the compiled case there are no problems. Only remaining
question is why GHCi is faster than compiled code.
I've attached a tarball with my setup + outputs. It includes Core/STG
outputs of all 6 configurations and criterion and +RTS -s outputs as well.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14208#comment:27>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list