[GHC] #15936: Rethink Choice of Baseline Commit for Performance Tests

Mon Nov 26 17:54:03 UTC 2018

#15936: Rethink Choice of Baseline Commit for Performance Tests
-------------------------------------+-------------------------------------
        Reporter:  davide            |                Owner:  davide
            Type:  task              |               Status:  new
        Priority:  normal            |            Milestone:  8.6.3
       Component:  Test Suite        |              Version:  8.6.2
      Resolution:                    |             Keywords:  performance
                                     |  tests git notes
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by davide:

Old description:

> = Intro =
>
> Currently we always use the previous commit when running performance
> tests. This works well in CI where we fully test each commit in sequence
> (and hence always have test results for the previous commit). Remember,
> test results are stored in git notes and are not by default shared
> between repositories (i.e. your local repo will only have performance
> results when they were run locally on your machine). This is by design:
> we want to avoid comparing results form different machines.
>
> Unfortunately This is not so effective when testing locally. The
> programmer may have only run a subset of performance tests on the
> previous commit, and often have not run the tests at all (this is notably
> true after performing a rebase: the previous commit has changed). We need
> to rethink how we pick a baseline commit.
>
> = Goals =
>
> * In all cases, do something sensible.
> * Giving a warning if conditions are not idea. Provide clear and simple
> instructions on how to get to the ideal case.
> * Give control over the baseline commit to the programmer via command
> line options.
>   * Could make it a baseline branch where we still do git merge-base.
> That would be useful if you are branching from a different branch than
> master.
> * Give control over the baseline of local or ci to the programmer via
> command line options.
> * In general, performance tests should just work! No extra knowledge
> needed by the programmer.
> * If tests pass without warning now, then they should pass without
> warning later.
>
> = Proposed Solution =
>
> * When running performance tests, results will be compared to a baseline
> commit that is the merge base with master (most recent commit from
> master). If HEAD is already in master, then the previous commit is used
> instead.
> * If any locally generated performance results exist, they are used
> exclusively for the baseline.
> * Else if any CI generated performance results exist (and have been
> fetched), they are used exclusively for the baseline.
> * Else performance tests trivially pass, and a warning is given to the
> user.
>
> To find the baseline commit:
> {{{
> mergeBase = merge-base master HEAD
> baselineCommit = if mergeBase == HEAD
>              then HEAD^
>              else mergeBase
> }}}
>
> == Reasoning ==
>
> * We want each commit in master not to introduce a significant change in
> performance: hence we compare commits in mater to the previous commit.
> * If not on master (1 or more ahead and 0 or more commits behind master).
> We assume that the intention is to create a patch where all new commits
> will ultimately be squashed and placed on top of master as a single
> commit. On the other hand we don't want to consider changes in master
> from after we branched. Instead of using master HEAD as the baseline, we
> use the commit from which we branched from master (i.e. the merge base).
> In other words we are concerned only with the change in performance
> introduced by the newly crated commits.
>
> = Handling Expected changes =
>
> TODO this is the complicated part. What happens when the programmer is
> not planning on squashing commits and has many commits with expected
> changes :-(
>
> See
> [https://ghc.haskell.org/trac/ghc/wiki/Performance/Tests#ExpectedPerformanceChanges
> expected performance changes].
>
> If on master or an ancestor commit, the baseline is the previous commit
> and we can simply allow performance changes as specified in the current
> commit's message (this is already the behaviour of the test driver).
>
> If we have branched from master, then we may have multiple commits from
> the baseline commit to HEAD, each of which may have, possibly
> contradictory, expected performance changes. If any expected changes
> exist, aggregate them. We introduce an explicit "Metric Unchanged" option
> and aggregate per test where newer allowed changes overwrite older
> allowed changes. "Metric Unchanged" is necessary in the case that a new
> commit undoes a performance change such that a metric returns to the
> baseline value. The aggregate version should be output so that the
> programmer knows what to put in the commit message after squashing the
> commits. A warning should be given if expected changes appear in any
> commit inbetween HEAD and the baseline commit. In that warning Suggest
> e.g. "--baseline HEAD^ if not planning on squashing this commit"
>
> == Reasoning ==
>
> creating new commits with expected changes is an interactive process. The
> programmer adds a 1 or more commits, runs the tests, then adds expected
> performance changes to a commit message. It would be too inconvenient to
> force the programmer to change old commit messages, and too
> verbose/annoying to have them enter a full list of expected changes in
> each commit. Hence we must aggregate the expected changes.
>
> This is of a bit risky as it is a context sensitive change in the
> semantics of expected changes. If we e.g. intend not to squash the
> commits, then all the sudden the expected changes mean something very
> different (change to the previous commit, not some distant baseline
> commit). Perhaps we just show a warning in this case.
>
> We must figure out what commit messages will be used in GitLab on merge.
> Does the programmer have to remember to sort out expected changes before
> merge some how?
>
> = Use cases =
>
> * We do not distinguish between full/partial performance results being
> available for the baseline commit: that would require checking out the
> baseline commit and extracting the full list of tests.
> *
>
> || BaselinelocalResults || BaselineCIResults || Infos || Warnings ||
>
> ||||||= Case =||||||= Behaviour =||
> ||= Platform =||= Baseline local Results? =||= Baseline CI Results? =||=
> Baseline local/CI =||= Infos =||= Warnings =||
> || Local || Yes/Partial || - || Local || If HEAD tests is not subset or
> eq to Baseline tests: "If relevant tests exist on baseline, checkout
> baseline and running those tests OR fetch notes and use --baseline-ci ||
> ||
> || Local || No || Yes || CI || "Using CI numbers, suggest running locally
> for more accurate results" ||  ||
> || Local || No || No || - ||  || "No baseline results, tests trivially
> pass" + suggest fetch notes or locally run tests on baseline ||
> || CI |||| Yes || CI || || ||
> || CI |||| No || - || || "No results. CI is not yet finished, or CI has
> failed for the baseline commit or CI hasn’t fetched" ||
>
> = When to automatically fetch CI results? =
>
> Fetch if:
> * Testing locally (not a ci run) AND
> * Baseline commit doesn't have local nor CI results (before fetch) AND
> * Baseline commit is an ancestor of master.
> If fetching, suggest a command line option: --no-fetch.
> This is most convenient for local testing avoids fetch on ci, but may
> result in unwanted/wasted fetches

New description:

 = Intro =

 Currently we always use the previous commit when running performance
 tests. This works well in CI where we fully test each commit in sequence
 (and hence always have test results for the previous commit). Remember,
 test results are stored in git notes and are not by default shared between
 repositories (i.e. your local repo will only have performance results when
 they were run locally on your machine). This is by design: we want to
 avoid comparing results form different machines.

 Unfortunately This is not so effective when testing locally. The
 programmer may have only run a subset of performance tests on the previous
 commit, and often have not run the tests at all (this is notably true
 after performing a rebase: the previous commit has changed). We need to
 rethink how we pick a baseline commit.

 = Goals =

 * In all cases, do something sensible.
 * Giving a warning if conditions are not idea. Provide clear and simple
 instructions on how to get to the ideal case.
 * Give control over the baseline commit to the programmer via command line
 options.
   * Could make it a baseline branch where we still do git merge-base. That
 would be useful if you are branching from a different branch than master.
 * Give control over the baseline of local or ci to the programmer via
 command line options.
 * In general, performance tests should just work! No extra knowledge
 needed by the programmer.
 * If tests pass without warning now, then they should pass without warning
 later.

 = Proposed Solution 1 =

 * Per test use baseline of closest ancestor commit's metric (favour local
 as opposed to CI metrics). aggregate expected changes, but skip tests with
 both expected increase and decrease.
 * If anything other than local metrics from previous commit were used,
 show a warning suggesting running tests on previous commit.
 * Stop search if a CI commit is found (these are assumed to be complete)
 or if searched an arbitrary number of commits e.g. 100 (this is important
 as we'll end up searching whole history if a new test was added).

 = Proposed Solution 2 =

 * When running performance tests, results will be compared to a baseline
 commit that is the merge base with master (most recent commit from
 master). If HEAD is already in master, then the previous commit is used
 instead.
 * If any locally generated performance results exist, they are used
 exclusively for the baseline.
 * Else if any CI generated performance results exist (and have been
 fetched), they are used exclusively for the baseline.
 * Else performance tests trivially pass, and a warning is given to the
 user.

 To find the baseline commit:
 {{{
 mergeBase = merge-base master HEAD
 baselineCommit = if mergeBase == HEAD
              then HEAD^
              else mergeBase
 }}}

 == Reasoning ==

 * We want each commit in master not to introduce a significant change in
 performance: hence we compare commits in mater to the previous commit.
 * If not on master (1 or more ahead and 0 or more commits behind master).
 We assume that the intention is to create a patch where all new commits
 will ultimately be squashed and placed on top of master as a single
 commit. On the other hand we don't want to consider changes in master from
 after we branched. Instead of using master HEAD as the baseline, we use
 the commit from which we branched from master (i.e. the merge base). In
 other words we are concerned only with the change in performance
 introduced by the newly crated commits.

 = Handling Expected changes =

 TODO this is the complicated part. What happens when the programmer is not
 planning on squashing commits and has many commits with expected changes
 :-(

 See
 [https://ghc.haskell.org/trac/ghc/wiki/Performance/Tests#ExpectedPerformanceChanges
 expected performance changes].

 If on master or an ancestor commit, the baseline is the previous commit
 and we can simply allow performance changes as specified in the current
 commit's message (this is already the behaviour of the test driver).

 If we have branched from master, then we may have multiple commits from
 the baseline commit to HEAD, each of which may have, possibly
 contradictory, expected performance changes. If any expected changes
 exist, aggregate them. We introduce an explicit "Metric Unchanged" option
 and aggregate per test where newer allowed changes overwrite older allowed
 changes. "Metric Unchanged" is necessary in the case that a new commit
 undoes a performance change such that a metric returns to the baseline
 value. The aggregate version should be output so that the programmer knows
 what to put in the commit message after squashing the commits. A warning
 should be given if expected changes appear in any commit inbetween HEAD
 and the baseline commit. In that warning Suggest e.g. "--baseline HEAD^ if
 not planning on squashing this commit"

 == Reasoning ==

 creating new commits with expected changes is an interactive process. The
 programmer adds a 1 or more commits, runs the tests, then adds expected
 performance changes to a commit message. It would be too inconvenient to
 force the programmer to change old commit messages, and too
 verbose/annoying to have them enter a full list of expected changes in
 each commit. Hence we must aggregate the expected changes.

 This is of a bit risky as it is a context sensitive change in the
 semantics of expected changes. If we e.g. intend not to squash the
 commits, then all the sudden the expected changes mean something very
 different (change to the previous commit, not some distant baseline
 commit). Perhaps we just show a warning in this case.

 We must figure out what commit messages will be used in GitLab on merge.
 Does the programmer have to remember to sort out expected changes before
 merge some how?

 = Use cases =

 * We do not distinguish between full/partial performance results being
 available for the baseline commit: that would require checking out the
 baseline commit and extracting the full list of tests.
 *

 || BaselinelocalResults || BaselineCIResults || Infos || Warnings ||

 ||||||= Case =||||||= Behaviour =||
 ||= Platform =||= Baseline local Results? =||= Baseline CI Results? =||=
 Baseline local/CI =||= Infos =||= Warnings =||
 || Local || Yes/Partial || - || Local || If HEAD tests is not subset or eq
 to Baseline tests: "If relevant tests exist on baseline, checkout baseline
 and running those tests OR fetch notes and use --baseline-ci ||  ||
 || Local || No || Yes || CI || "Using CI numbers, suggest running locally
 for more accurate results" ||  ||
 || Local || No || No || - ||  || "No baseline results, tests trivially
 pass" + suggest fetch notes or locally run tests on baseline ||
 || CI |||| Yes || CI || || ||
 || CI |||| No || - || || "No results. CI is not yet finished, or CI has
 failed for the baseline commit or CI hasn’t fetched" ||

 = When to automatically fetch CI results? =

 Fetch if:
 * Testing locally (not a ci run) AND
 * Baseline commit doesn't have local nor CI results (before fetch) AND
 * Baseline commit is an ancestor of master.
 If fetching, suggest a command line option: --no-fetch.
 This is most convenient for local testing avoids fetch on ci, but may
 result in unwanted/wasted fetches

--

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/15936#comment:5>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler