potential for GHC benchmarks w.r.t. optimisations being incorrect

I write this out of curiosity, as well as concern, over how this may affect GHC.

Our performance measurements are pretty non-scientific. For many
decades, developers just ran our benchmark suite (nofib) before and
after their change, hopefully on a cleanly built working copy, and
pasted the most interesting numbers in the commit logs. Maybe some went
for coffee to have an otherwise relatively quiet machine (or have some
remote setup), maybe not.

In the end, the run-time performance numbers are often ignored and we
focus on comparing the effects of *dynamic heap allocations*, which
are much more stable across different environments, and which we
believe are a good proxy for actual performance, at least for the kind
of high-level optimizations that we work on in the core-to-core
pipeline. But this assumption is folklore, and not scientifically
validated.

About two years ago we started collecting performance numbers for
every commit to the GHC repository, and I wrote a tool to print
comparisons: https://perf.haskell.org/ghc/
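
To give an idea of what such a per-commit comparison involves, here is a
rough sketch. It is not the code behind perf.haskell.org; the function
name compare_commits, the benchmark names, the numbers and the 1%
threshold are all made up for illustration.

```python
def compare_commits(before, after, threshold=0.01):
    """Return (name, relative change) for every benchmark whose metric
    moved by more than `threshold` between two commits."""
    report = []
    # Only benchmarks present in both runs can be compared.
    for name in sorted(before.keys() & after.keys()):
        old, new = before[name], after[name]
        if old == 0:
            continue  # relative change is undefined for a zero baseline
        change = (new - old) / old
        if abs(change) > threshold:
            report.append((name, change))
    return report


# Hypothetical per-benchmark numbers (e.g. bytes allocated) for two commits.
before = {"bench-a": 1000000, "bench-b": 500000, "bench-c": 250000}
after = {"bench-a": 1000500, "bench-b": 530000, "bench-c": 240000}

for name, change in compare_commits(before, after):
    print(f"{name}: {change:+.2%}")
```

With these invented inputs, bench-a stays below the threshold and is
not reported, while bench-b and bench-c are. How well this works
depends entirely on how stable the underlying metric is.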

This runs on a dedicated physical machine, and still the run-time
numbers were varying too widely and gave us many false warnings (and
probably reported many false improvements which we of course were happy
to believe). I have since switched to measuring only dynamic
instruction counts with valgrind. This means that we cannot detect
improvements or regressions due to certain low-level stuff, but we gain
the ability to reliably measure *something* that we expect to change
when we improve (or accidentally worsen) the high-level
optimizations.
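
The trade-off can be illustrated with a small sketch that compares the
spread of repeated measurements with the size of the change one hopes
to detect. All sample values below are invented, and the "twice the
relative spread" rule is an arbitrary rule of thumb chosen for the
example, not what our tooling does.

```python
import statistics


def relative_spread(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def can_detect(samples, effect):
    """Crude check: a relative change of size `effect` is treated as
    visible only if it exceeds twice the relative spread of the samples."""
    return effect > 2 * relative_spread(samples)


# Invented repeated measurements of one benchmark on one commit.
wall_clock_seconds = [1.02, 0.97, 1.10, 0.95, 1.06]
instruction_counts = [5000000123, 5000000098, 5000000110,
                      5000000131, 5000000102]

for label, samples in [("wall clock", wall_clock_seconds),
                       ("instruction count", instruction_counts)]:
    print(label, "detects a 2% change:", can_detect(samples, 0.02))
```

With numbers like these, a 2% change drowns in the wall-clock noise but
stands out clearly in the instruction counts, which is the property
that motivated the switch.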

I wish there were a better way of getting a reliable, stable number
that reflects the actual performance.


