[GHC] #14964: performance regressions from 8.0.2 to 8.4.1

Fri Mar 23 08:18:34 UTC 2018

#14964: performance regressions from 8.0.2 to 8.4.1
-------------------------------------+-------------------------------------
        Reporter:  elaforge          |                Owner:  (none)
            Type:  task              |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.4.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:  Unknown/Multiple
 Type of failure:  Runtime           |            Test Case:
  performance bug                    |
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by simonpj:

New description:

 == Short version:

 Between 8.0.2 and 8.4.1, compile time without optimization got faster.
 Compile time with optimization got slightly slower.

 Performance of generated (optimized) code got significantly slower, and GC
 productivity went down, despite allocation being about the same.

 I made this a "task", not a "bug", because there's a ways to go to figure
 out what is causing this.

 == Long version, copy and pasted from email to glasgow-haskell-users:

 I just upgraded from 8.0.2 to 8.4.1, and I took the opportunity to do a few
 informal compile time and run time tests.  There's been a lot of talk about
 compile time regressions, so maybe these will be of interest, informal as
 they are.

 I wound up skipping 8.2.1 due to
 https://ghc.haskell.org/trac/ghc/ticket/13604,
 but I could still test with 8.2 perfectly well.  Just haven't done it yet.

 In this context, RunTests is more code with no optimization (and -fhpc, if
 it matters).  debug/seq and opt/seq are the same code but with no
 optimization and -O respectively.  I found that -O2 hurt compile time but
 didn't improve run time, but it's been a long time so I should run that
 experiment again.

 ------------------------------
 == Compile time performance:

 OS X, MacBook Pro:

 {{{
 RunTests      549.10s user 118.45s system 343% cpu 3:14.53 total  8.0.2
 RunTests      548.71s user 117.10s system 347% cpu 3:11.78 total  8.0.2
 RunTests      450.92s user 109.63s system 343% cpu 2:43.13 total  8.4.1
 RunTests      445.48s user 107.99s system 341% cpu 2:42.19 total  8.4.1

 debug/seq     284.47s user 55.95s system 345% cpu 1:38.58 total   8.0.2
 debug/seq     283.33s user 55.27s system 343% cpu 1:38.53 total   8.0.2
 debug/seq     220.92s user 50.21s system 337% cpu 1:20.32 total   8.4.1
 debug/seq     218.39s user 49.20s system 345% cpu 1:17.47 total   8.4.1

 opt/seq       732.63s user 70.86s system 338% cpu 3:57.30 total   8.0.2
 opt/seq       735.21s user 71.48s system 327% cpu 4:06.31 total   8.0.2
 opt/seq       785.12s user 65.42s system 327% cpu 4:19.84 total   8.4.1
 opt/seq       765.52s user 64.01s system 321% cpu 4:18.29 total   8.4.1
 }}}

 Linux, PC:

 {{{
 RunTests    781.31s user 58.21s system 363% cpu 3:50.70 total   8.0.2
 RunTests    613.11s user 49.84s system 357% cpu 3:05.52 total   8.4.1

 debug/seq   429.44s user 31.34s system 362% cpu 2:07.03 total   8.0.2
 debug/seq   329.67s user 23.86s system 352% cpu 1:40.38 total   8.4.1

 opt/seq     1277.20s user 45.85s system 358% cpu 6:08.68 total  8.0.2
 opt/seq     1339.73s user 39.87s system 341% cpu 6:43.50 total  8.4.1
 }}}

 So it looks like non-optimized compile times have gotten significantly
 better since 8.0.2.  Optimized compile times have gotten a little worse,
 but not much.
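
 To put rough numbers on that, here is a small sketch that converts the
 wall-clock "total" column into seconds and prints the relative change.  The
 helper names `wall_seconds` and `percent_change` are made up for this
 example; the inputs are copied from the Linux table above.

```python
def wall_seconds(total: str) -> float:
    """Convert a `time` total such as '3:14.53' (m:ss.ss) into seconds."""
    seconds = 0.0
    for part in total.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new in percent (negative means faster)."""
    return (new - old) / old * 100.0


# Wall-clock totals (8.0.2, 8.4.1) copied from the Linux compile-time table.
linux_totals = {
    "RunTests": ("3:50.70", "3:05.52"),
    "debug/seq": ("2:07.03", "1:40.38"),
    "opt/seq": ("6:08.68", "6:43.50"),
}

for target, (ghc802, ghc841) in linux_totals.items():
    change = percent_change(wall_seconds(ghc802), wall_seconds(ghc841))
    print(f"{target:10s} {change:+.1f}%")
```

 On the Linux numbers this comes out to roughly 20% less wall-clock time for
 the two unoptimized builds and roughly 9% more for opt/seq.  These are
 single runs, so treat the percentages as indicative only.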

 -----------------------
 == Runtime performance

 The run-time performance numbers are a bit more disappointing.  At first it
 appeared that allocation went down in 8.4.1 while overall time went up
 significantly.  However, the 8.4.1 build used newer dependencies, so to try
 to control for those, I tested 8.0.2 again after using a cabal freeze from
 the 8.4.1 test.  Of course I had to remove the packages distributed with
 GHC, like 'containers' and 'ghc' itself, but the rest of the deps should be
 the same.  Those runs have the 'libs' suffix on Linux.

 From that, it looks like the improved memory in 8.4.1 was due to external
 dependencies, and in fact 8.4.1 bumped memory usage up again.  Ow.

 In the tables below, 'score' is just the input file.  'max mb', 'total mb'
 and 'prd' come from the post-run GC report, specifically the '* bytes
 maximum residency', '* bytes allocated in the heap', and productivity
 fields.  'derive', 'lily' and 'perform' are just different kinds of
 processes.  Their columns give the CPU time bracketing the specific action,
 after initialization, and the range is min and max over 6 runs, so no fancy
 criterion-like analysis.  Each run is a separate process, so they should be
 independent.
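
 For anyone who wants to collect the same columns, here is a minimal sketch
 of pulling those three fields out of a GC report and formatting a min~max
 range.  It is not the script used for this ticket.  It assumes the report
 text looks like the usual `+RTS -s` summary; `parse_gc_report`,
 `time_range` and `SAMPLE_REPORT` are names invented here, the sample values
 are placeholders rather than measurements, and the choice of 2^20 bytes per
 MB is also an assumption.

```python
import re

# Placeholder report text for illustration only (not real measurements).
SAMPLE_REPORT = """
   1,048,576,000 bytes allocated in the heap
      52,428,800 bytes maximum residency (10 sample(s))
  Productivity  90.0% of total user, 89.5% of total elapsed
"""

BYTES_PER_MB = 1024 * 1024  # assumption: the tables may use 10**6 instead


def parse_gc_report(report: str) -> dict:
    """Extract 'max mb', 'total mb' and 'prd' from a GC summary."""

    def grab(pattern: str) -> str:
        match = re.search(pattern, report)
        if match is None:
            raise ValueError(f"field not found: {pattern}")
        return match.group(1)

    max_bytes = int(grab(r"([\d,]+) bytes maximum residency").replace(",", ""))
    total_bytes = int(grab(r"([\d,]+) bytes allocated in the heap").replace(",", ""))
    productivity = float(grab(r"Productivity\s+([\d.]+)% of total user"))
    return {
        "max mb": max_bytes / BYTES_PER_MB,
        "total mb": total_bytes / BYTES_PER_MB,
        "prd": productivity / 100.0,
    }


def time_range(samples: list) -> str:
    """Format CPU times from several runs as 'min~max', as in the tables."""
    return f"{min(samples):.2f}~{max(samples):.2f}"


stats = parse_gc_report(SAMPLE_REPORT)
print(stats)
print(time_range([0.84, 0.79, 0.81]))
```

 Each run being a separate process means the report has to be captured once
 per run; the sketch only covers the parsing step.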

 I was hoping for some gains due to the join points stuff, but it kind of
 looks like things get worse across the board.  I don't know why
 productivity goes down so much, and I don't know why the effect seems so
 much worse on OS X.
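
 One rough way to relate the productivity drop to the timings is sketched
 here.  If productivity is mutator time divided by total time, and if the
 mutator time were unchanged between compilers (an assumption, not something
 measured here), then total time would scale by old productivity over new
 productivity.  The function names are made up for this example, and the
 inputs are the score '6' row of the OS X table below.  Note that 'prd' is
 for the whole process while 'derive' brackets one action, so the comparison
 is only approximate.

```python
def expected_slowdown(prd_old: float, prd_new: float) -> float:
    """Slowdown factor if mutator time is unchanged and only GC time grows.

    With productivity = mutator / total, total = mutator / productivity,
    so total_new / total_old = prd_old / prd_new.
    """
    return prd_old / prd_new


def observed_slowdown(old_range: tuple, new_range: tuple) -> float:
    """Ratio of the midpoints of two (min, max) CPU time ranges."""
    old_mid = (old_range[0] + old_range[1]) / 2
    new_mid = (new_range[0] + new_range[1]) / 2
    return new_mid / old_mid


# Score '6' on OS X: prd 0.88 -> 0.58, derive 0.79~0.84 -> 1.45~1.59.
print(f"expected from prd alone: {expected_slowdown(0.88, 0.58):.2f}x")
print(f"observed for derive:     {observed_slowdown((0.79, 0.84), (1.45, 1.59)):.2f}x")
```

 Comparing the two factors per row gives a quick hint of whether extra GC
 time could account for the whole slowdown or whether the mutator itself
 also got slower, which would be worth confirming with a profile.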

 Of course the obvious next step is to see where 8.2.1 lies, but I thought
 I'd see if there's interest before going to the trouble.  Of course, I
 should track down the regressions for my own purposes, but it's a bit of a
 daunting task.  The step of reducing to a minimal example seems a lot
 harder for performance than for a bug!  Probably some old-fashioned SCC
 annotations await me, but that can be a long and confusing process.

 OS X, MacBook Pro:

 {{{
 score           max mb  total mb  prd    derive     lily       perform    ghc
 6               72.26   3279.22   0.88   0.79~0.84  0.70~0.74  0.31~0.33  8.0.2
 6               76.63   3419.20   0.58   1.45~1.59  1.05~1.07  0.33~0.36  8.4.1

 bloom           70.69   2456.14   0.89   1.32~1.36             0.15~0.16  8.0.2
 bloom           67.86   2589.97   0.62   1.94~1.99             0.20~0.22  8.4.1

 cerucuk-punyah  138.01  10080.55  0.93   6.98~7.16             1.24~1.30  8.0.2
 cerucuk-punyah  130.78  9617.35   0.68   8.91~9.22             1.57~1.68  8.4.1

 hex             32.86   2120.95   0.91   0.76~0.88             0.16~0.19  8.0.2
 hex             32.67   2194.82   0.66   1.09~1.16             0.28~0.30  8.4.1

 p1              67.01   4039.82   0.92   2.63~2.73             0.47~0.50  8.0.2
 p1              64.80   3899.85   0.68   3.35~3.43             0.58~0.59  8.4.1

 viola-sonata    69.32   6083.65   0.92   2.48~2.56  2.07~2.13  0.25~0.26  8.0.2
 viola-sonata    66.76   6120.26   0.68   3.32~3.43  2.90~2.93  0.32~0.34  8.4.1
 }}}

 Linux, PC:

 {{{
 score           max mb  total mb  prd   derive     lily       perform    ghc

 6               79.76   3310.69   0.89  0.88~0.89  0.73~0.75  0.27~0.27  8.0.2
 6               72.21   3421.45   0.90  0.87~0.87  0.72~0.79  0.28~0.28  8.0.2 libs
 6               76.56   3419.05   0.77  1.16~1.17  0.87~0.93  0.33~0.33  8.4.1

 bloom           69.82   2461.95   0.89  1.35~1.36             0.17~0.17  8.0.2
 bloom           63.45   2554.89   0.90  1.33~1.35             0.18~0.18  8.0.2 libs
 bloom           67.79   2589.85   0.79  1.64~1.65             0.20~0.20  8.4.1

 cerucuk-punyah  137.05  10113.41  0.94  7.44~7.50             1.31~1.33  8.0.2
 cerucuk-punyah  128.09  10278.03  0.94  7.50~7.55             1.37~1.38  8.0.2 libs
 cerucuk-punyah  131.20  9617.22   0.84  7.35~7.40             1.49~1.50  8.4.1

 hex             32.02   2096.87   0.92  0.73~0.74             0.18~0.18  8.0.2
 hex             32.05   2200.30   0.91  0.73~0.80             0.19~0.19  8.0.2 libs
 hex             32.46   2144.22   0.83  0.89~0.90             0.20~0.20  8.4.1

 p1              65.88   4054.66   0.93  2.84~2.87             0.49~0.50  8.0.2
 p1              62.60   4127.68   0.94  2.83~2.92             0.51~0.51  8.0.2 libs
 p1              64.72   3899.72   0.81  2.80~2.81             0.54~0.55  8.4.1

 viola-sonata    68.68   6086.49   0.93  2.55~2.56  2.10~2.12  0.27~0.27  8.0.2
 viola-sonata    65.05   6212.57   0.93  2.52~2.55  2.07~2.16  0.28~0.28  8.0.2 libs
 viola-sonata    66.85   6120.15   0.83  2.91~2.92  2.48~2.51  0.30~0.31  8.4.1
 }}}

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14964#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler