nofib comparisons between 7.0.4, 7.4.2, 7.6.1, and 7.6.2

Andy Georges itkovian at
Wed Feb 6 23:26:08 CET 2013

Hi Johan,

On 06 Feb 2013, at 17:04, Johan Tibell <johan.tibell at> wrote:

> On Wed, Feb 6, 2013 at 2:09 AM, Simon Marlow <marlowsd at> wrote:
> This is slightly off topic, but I wanted to plant this thought in people's brains: we shouldn't place much significance in the average of a bunch of benchmarks (even the geometric mean), because it assumes that the benchmarks have a sensible distribution, and we have no reason to expect that to be the case.  For example, in the results above, we wouldn't expect a 14.7% reduction in runtime to be seen in a typical program.
> Using the median might be slightly more useful, which here would be something around 0% for runtime, though still technically dodgy.  When I get around to it I'll modify nofib-analyse to report medians instead of GMs.


> Using the geometric mean as a way to summarize the results isn't that bad. See "How not to lie with statistics: the correct way to summarize benchmark results" (

I would argue the exact opposite. The geometric mean has absolutely no meaning whatsoever. 

See e.g., 

Computer Architecture Performance Evaluation Methods. L. Eeckhout Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, June 2010.
Quantifying performance changes with effect size confidence intervals - Tomas Kalibera and Richard Jones, 2012 (tech report)
Measuring Computer Performance: A Practitioner's Guide - Lilja, DJ, 2005
The Art of Computer Systems Performance Analysis: techniques for experimental design, measurement, simulation, and modelling - Jain, R.  1991

> That being said, I think the most useful thing to do is to look at the big losers, as they're often regressions. Making some class of programs much worse is but improving the geometric mean overall is often worse than changing nothing at all.


-- Andy Georges

PS. I wrote this a while back for the Evaluate collaborator

# The Mean Is Not A Simple Average

## Application domain

Aggregating measurements. The anti-pattern discusses a single example, though for all uses of an average it is important to consider the right mean to use. Examples of applicable means are: (weighed) arithmetic, (weighed) harmonic, each with respect to the proper weighing factors.

## Premise

You have a set of benchmarks and you wish to quantify your Shiny New Idea (SNI). To make it easy to grok the results, you decide to aggregate the impact of your Shiny New Idea with a single performance number: the mean of the various
measurements. This means that readers of your paper can easily compare single numbers: those for a baseline system, those for other enhancements you compare with and of course, the number for your SNI.

## Description

You have implemented your SNI and you wish to conduct a comparison study to show that your SNI outperforms existing work and improves execution time (or other metrics such as energy consumption, ...) with X% compared to a baseline
system. You design an experiment with a set of benchmarks from an applicable benchmark suite and you assemble performance numbers for each benchmark and for each scenario (baseline, your SNI, other work, ...). For example, you assemble executions times (the metric of choice for single-program workloads) and you wish to assess the speedup.

Since people prefer single numbers they can compare to see which one is bigger, you must aggregate you data into an average value. While this contains less information that the original data set, it is an easy way to see if your SNI improves things or not and to prove it to your readers or users.

You should choose a mean that allows you to: (i) directly compare the alternatives to each other by canceling out the baseline, (ii) make sure (relevant) outliers do not influence your average too much. Clearly, the geometric mean is perfectly suited for this purpose. Without further ado, determine the per-benchmark speedup for each scenario and you compute the various geometric means.

The resulting average values immediately allow you to see if your SNI improves the other scenarios derived from existing work. It also allows you to see how much you improve over these scenarios by dividing them by the geometric mean of
your SNI. Do not worry, the formula for the geometric mean makes sure that the baseline values are canceled out and you effectively get the average speedup of your SNI compared to existing work.

Now go ahead and publish these numbers that support your SNI.

## Why this is a bad idea

There may be specific circumstances where the use of a geometric mean is warranted, yet producing the average over some benchmark suite for a performance metric of your choice is not one of them. Typically, the geometric mean can be used when the final aggregate performance number results from multiplying individual numbers. For example, when making several enhancements to a system, the average improvement per enhancement can be expressed as the geometric mean of the speedups resulting from the individual improvements. However, for any benchmark suite (regardless of the relative importance one attaches to each benchmark in the suite), the aggregate results from adding the individual results, as is the case for, e.g., overall speedup. In practically all cases using either the (weighed) arithmetic mean of the (weighed) harmonic mean is the correct way to compute and report an average.

While it is true that the geometric mean sustains a smaller impact from outliers in the measurements compared to the other means, one should always investigate outliers and disregard them if there is an indication that the data is wrong. Otherwise, they can provide valuable insight. Moreover, by adding appropriate weights, one can easily reduce the impact of outliers.

## Example

Suppose you have 5 benchmarks, B1 ... B5. The baseline system has the following measurements: 10, 15, 7, 12, and 16, which yields a total execution time of 60. Hence, the aggregate score is the sum of the individual scores. Suppose now
you wish to compare two difference enhancements. The first enhancement yields the measurements 8, 10, 6, 11, 12 -- adding up to 47; the second enhancement yields the measurements 7, 12, 5, 10, 14 -- adding up to 48.

If we take a look at the global improvement achieved, then that is 60/47 = 1.2766 and 48/60 = 1.25 for enhancement 1 and enhancement 2, respectively. Therefore we conclude that by a small margin, enhancement 1 outperforms enhancement 2, for this particular set of benchmarks.

However, the geometric means are 1.2604 and 1.2794. From these numbers we would conclude the opposite, namely that enhancement 2 outperforms enhancement 1.

Which mean then yields the correct result? The answer is dependent on which system to weigh against. If we weigh against the enhanced system, giving the benchmarks the weights that correspond to their relative execution time
compared to the execution time of the complete suite (on the same configuration), then we need to use a weighed arithmetic mean. If we weigh against the baseline system, the correct answer is that we need to use the weighted harmonic mean.

Of course, the use of weights is often disregarded. If we assume all benchmarks are of equal importance, then we likely will not weigh them. In that case, all three means yield the same conclusion, but none of them accurately reflect the
true speedup that is achieved over the entire suite.

## Why is this pattern relevant

The geometric mean is still widely used and accepted by researchers. It can be found in papers published at top venues, such as OOPSLA, PLDI, CGO, etc. It is commonly used by e.g., VMMark, SPEC CPU, ... On multiple occasions the argument
regarding the impact of outliers is brought forth, even though there are other ways to deal with outliers.

	• [[1]] J.E., Smith. Characterizing computer performance with a single number. CACM 31(10), 1988.
	• [[2]] D.A., Patterson; J.L., Hennessy. Computer Organization and Design: The Hardware/Software Approach, Morgan Kaufman.

More information about the ghc-devs mailing list