potential for GHC benchmarks w.r.t. optimisations being incorrect
klebinger.andreas at gmx.at
Sun May 6 14:41:06 UTC 2018
Joachim Breitner wrote:
> This runs on a dedicated physical machine, and still the run-time
> numbers were varying too widely and gave us many false warnings (and
> probably reported many false improvements which we of course were happy
> to believe). I have since switched to measuring only dynamic
> instruction counts with valgrind. This means that we cannot detect
> improvement or regressions due to certain low-level stuff, but we gain
> the ability to reliably measure *something* that we expect to change
> when we improve (or accidentally worsen) the high-level
While this matches my experience with the default settings, I had good
results by tuning the number of measurements nofib does.
With a high number of NoFibRuns (30+), frequency scaling disabled,
background tasks stopped, and walking away from the computer
until it was done, I got noise down to differences of about +/-0.2% for
This doesn't eliminate alignment bias and the like, but at least it gives
fairly reproducible results.
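To make "noise" concrete, here is a minimal sketch of how the spread over repeated runs of one benchmark could be summarised. The function name and the timings are made up for illustration; this is not nofib's own analysis code, just the kind of arithmetic I mean.

```python
import statistics

def summarize_runs(times):
    """Summarise repeated wall-clock timings of a single benchmark.

    Returns (mean, half_spread_percent, stderr_percent), where
    half_spread is half the min-to-max range relative to the mean and
    stderr is the standard error of the mean relative to the mean.
    """
    mean = statistics.mean(times)
    half_spread = (max(times) - min(times)) / 2 / mean * 100
    stderr = statistics.stdev(times) / len(times) ** 0.5 / mean * 100
    return mean, half_spread, stderr

# Hypothetical timings in seconds, not real nofib output.
samples = [1.002, 0.998, 1.001, 0.999, 1.000, 1.003, 0.997, 1.000]
mean, half_spread, stderr = summarize_runs(samples)
print(f"mean={mean:.4f}s  spread=+/-{half_spread:.2f}%  stderr=+/-{stderr:.2f}%")
```

The standard error shrinks roughly with the square root of the number of runs, which is why a high run count helps against random noise but does nothing against a systematic bias such as alignment.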
Sven Panne wrote:
> 4% is far from being "big", look e.g. at
> where changing just the alignment of the code led to a 10%
> difference. :-/ The code itself or its layout wasn't changed at all.
> The "Producing Wrong Data Without Doing Anything Obviously Wrong!"
> paper gives more funny examples.
> I'm not saying that code layout has no impact, quite the opposite. The
> main point is: Do we really have a benchmarking machinery in place
> which can tell you if you've improved the real run time or made it
> worse? I doubt that, at least at the scale of a few percent. To reach
> just that simple yes/no conclusion, you would need quite a heavy
> machinery involving randomized linking order, varying environments (in
> the sense of "number and contents of environment variables"), various
> CPU models etc. If you do not do that, modern HW will leave you with a
> lot of "WTF?!" moments and wrong conclusions.
You raise good points. The example in the blog seems a bit
contrived, with the whole loop fitting in a cache line, but the principle is
a real concern.
I've hit alignment issues and WTF moments plenty of times in the past
when looking at microbenchmarks.
However, on the scale of nofib I haven't really seen this happen so far.
It's good to be aware of the chance for a whole suite to give
wrong results though.
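One cheap way to probe for this kind of bias is the environment variation suggested in the quote above: rerun the same binary with the environment padded to different sizes and see whether the timings move. The following is only a rough sketch; the variable name BENCH_PADDING, the pad sizes and the placeholder command are assumptions for illustration, not part of nofib.

```python
import os
import subprocess
import sys
import time

def padded_envs(pad_sizes, var_name="BENCH_PADDING"):
    """Build one environment per pad size.

    The environments differ only in the length of a dummy variable
    (var_name is a hypothetical name), which shifts where the initial
    stack of the child process ends up.
    """
    base = dict(os.environ)
    return [{**base, var_name: "x" * n} for n in pad_sizes]

def time_command(cmd, env):
    """Wall-clock time in seconds of one run of cmd under env."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Placeholder command; substitute the benchmark binary under test.
    cmd = [sys.executable, "-c", "pass"]
    for env in padded_envs([0, 64, 256, 1024, 4096]):
        print(len(env["BENCH_PADDING"]), f"{time_command(cmd, env):.4f}s")
```

If the differences between pad sizes are larger than the run-to-run noise, a speedup of similar size elsewhere should be treated with suspicion.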
I wonder if this effect is limited by GHC's tendency to use 8-byte
alignment for all code (at least with tables next to code)?
If we only consider 16-byte (DSB buffer) and 32-byte (cache line)
boundaries relevant, this reduces the possibilities by a lot after all.
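The counting argument can be spelled out with a small sketch (the helper name is made up): with 8-byte aligned code there are only a handful of positions a block can take relative to a 16- or 32-byte boundary, compared with byte-aligned code.

```python
def distinct_offsets(code_alignment, boundary):
    """Offsets inside a boundary-sized window at which code aligned to
    code_alignment bytes can start."""
    addresses = range(0, boundary * code_alignment, code_alignment)
    return sorted({addr % boundary for addr in addresses})

for boundary in (16, 32):
    aligned = distinct_offsets(8, boundary)
    unaligned = distinct_offsets(1, boundary)
    print(f"{boundary}-byte boundary: {len(aligned)} offsets {aligned} "
          f"with 8-byte alignment, {len(unaligned)} with byte alignment")
```

So relative to a 32-byte boundary an 8-byte aligned block has 4 possible starting offsets instead of 32, and relative to a 16-byte boundary 2 instead of 16.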
In the particular example I've hit, however, it's pretty obvious that
alignment is not the issue (and I still verified that).
In the end, how big the impact of a better layout would be in general is
hard to quantify. Hence the question of whether anyone has
pointers to good literature that looks into this.