nofib comparisons between 7.0.4, 7.4.2, 7.6.1, and 7.6.2
Simon Marlow
marlowsd at gmail.com
Wed Feb 6 21:50:16 CET 2013
On 06/02/13 16:04, Johan Tibell wrote:
> On Wed, Feb 6, 2013 at 2:09 AM, Simon Marlow <marlowsd at gmail.com> wrote:
>
> This is slightly off topic, but I wanted to plant this thought in
> people's brains: we shouldn't place much significance in the average
> of a bunch of benchmarks (even the geometric mean), because it
> assumes that the benchmarks have a sensible distribution, and we
> have no reason to expect that to be the case. For example, in the
> results above, we wouldn't expect a 14.7% reduction in runtime to be
> seen in a typical program.
>
>     Using the median might be slightly more useful - here it would be
>     around 0% for runtime - though it's still technically dodgy.
> When I get around to it I'll modify nofib-analyse to report
> medians instead of GMs.
>
>
> Using the geometric mean as a way to summarize the results isn't that
> bad. See "How not to lie with statistics: the correct way to summarize
> benchmark results"
> (http://ece.uprm.edu/~nayda/Courses/Icom6115F06/Papers/paper4.pdf).
Yes - our current usage of GM is because we read that paper :) I've
reported GMs of nofib programs in several papers. I'm not saying the
paper is wrong - the GM is definitely more correct than the AM for
averaging normalised results.
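
To make the distinction concrete, here is a minimal sketch (illustrative
only, not the actual code in nofib-analyse) of summarising per-benchmark
runtime ratios (new/old) with the geometric mean:

  geometricMean :: [Double] -> Double
  geometricMean xs = exp (sum (map log xs) / fromIntegral (length xs))

  -- e.g. geometricMean [0.5, 1.0, 2.0] == 1.0, whereas the arithmetic
  -- mean of the same ratios would misleadingly report ~1.17.
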
The problem is that we're attributing equal weight to all of our
benchmarks, without any reason to expect that they are representative.
We collect as many benchmarks as we can and hope they are
representative, but in practice they rarely are: often a particular
optimisation or regression will hit just one or two benchmarks. So all
I'm saying is that we shouldn't expect the GM to be representative.
Often there's no sensible mean at all - saying "some programs get a lot
better but most don't change" is far more informative than "on average
programs got faster by 1.2%".
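
As a hedged illustration (the ratios below are made up), here is how the
two summaries can diverge on exactly that kind of distribution:

  import Data.List (sort)

  -- Hypothetical runtime ratios (new/old): two benchmarks improve a
  -- lot, the other eighteen are unchanged.
  ratios :: [Double]
  ratios = [0.4, 0.6] ++ replicate 18 1.0

  geometricMean, median :: [Double] -> Double
  geometricMean xs = exp (sum (map log xs) / fromIntegral (length xs))
  median xs
    | odd n     = sorted !! mid
    | otherwise = (sorted !! (mid - 1) + sorted !! mid) / 2
    where
      sorted = sort xs
      n      = length xs
      mid    = n `div` 2

  -- geometricMean ratios ~ 0.93 ("programs got ~7% faster on average"),
  -- while median ratios == 1.0 ("the typical program didn't change").
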
> That being said, I think the most useful thing to do is to look at the
> big losers, as they're often regressions. Making some class of programs
> much worse but improving the geometric mean overall is often worse
> than changing nothing at all.
Absolutely.
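
A minimal sketch of that kind of report (the names, ratios, and 5%
threshold below are illustrative, not anything nofib-analyse does today):

  -- List the benchmarks whose runtime ratio (new/old) regressed past a
  -- given threshold, instead of trusting a single summary figure.
  bigLosers :: Double -> [(String, Double)] -> [(String, Double)]
  bigLosers threshold results =
    [ (name, ratio) | (name, ratio) <- results, ratio > 1 + threshold ]

  -- e.g. bigLosers 0.05 [("wheel-sieve1", 1.12), ("fibheaps", 0.98)]
  --        == [("wheel-sieve1", 1.12)]
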
Cheers,
Simon