potential for GHC benchmarks w.r.t. optimisations being incorrect

Sun May 6 14:41:06 UTC 2018

Joachim Breitner wrote:
> This runs on a dedicated physical machine, and still the run-time
> numbers were varying too widely and gave us many false warnings (and
> probably reported many false improvements which we of course were happy
> to believe). I have since switched to measuring only dynamic
> instruction counts with valgrind. This means that we cannot detect
> improvement or regressions due to certain low-level stuff, but we gain
> the ability to reliably measure *something* that we expect to change
> when we improve (or accidentally worsen) the high-level
> transformations.
While this matches my experience with the default settings, I had good
results by tuning the number of measurements nofib does.
With a high number of NoFibRuns (30+), frequency scaling disabled,
background tasks stopped, and me walking away from the computer
until it was done, I got the noise down to differences of about +/-0.2%
between subsequent runs.

This doesn't eliminate alignment bias and the like, but at least it gives
fairly reproducible results.
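
For illustration only, here is a minimal Python sketch of how one might
check whether a difference between two builds is larger than the
run-to-run noise. This is not part of nofib or nofib-analyse; the
function names and the min-to-max definition of noise are my own
assumptions, and the inputs are whatever per-run totals you collected.

```python
import statistics


def relative_spread(samples):
    """Half the min-to-max range, as a percentage of the mean.

    A result of 0.2 corresponds to runs that differ by about +/-0.2%.
    """
    mean = statistics.mean(samples)
    return 100.0 * (max(samples) - min(samples)) / (2.0 * mean)


def percent_change(baseline, candidate):
    """Change of the candidate's mean relative to the baseline's mean, in percent."""
    base = statistics.mean(baseline)
    return 100.0 * (statistics.mean(candidate) - base) / base


def exceeds_noise(baseline, candidate):
    """True if the change is larger than the spread of either set of runs.

    Assumption: the larger of the two spreads is used as the noise floor.
    """
    noise = max(relative_spread(baseline), relative_spread(candidate))
    return abs(percent_change(baseline, candidate)) > noise
```

A check like this only covers run-to-run variation on one machine and
one binary layout; it cannot detect a systematic bias such as alignment.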

Sven Panne wrote:
> 4% is far from being "big", look e.g. at 
> https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues 
> where changing just the alignment of the code led to a 10%
> difference. :-/ The code itself or its layout wasn't changed at all. 
> The "Producing Wrong Data Without Doing Anything Obviously Wrong!" 
> paper gives more funny examples.
>
> I'm not saying that code layout has no impact, quite the opposite. The 
> main point is: Do we really have a benchmarking machinery in place 
> which can tell you if you've improved the real run time or made it 
> worse? I doubt that, at least at the scale of a few percent. To reach 
> just that simple yes/no conclusion, you would need quite a heavy 
> machinery involving randomized linking order, varying environments (in 
> the sense of "number and contents of environment variables"), various 
> CPU models etc. If you do not do that, modern HW will leave you with a 
> lot of "WTF?!" moments and wrong conclusions.
You raise good points. The example in the blog seems a bit
constructed, with the whole loop fitting in a cache line, but the
principle is a real concern.
I've hit alignment issues and WTF moments plenty of times in the past
when looking at micro benchmarks.

However, at the scale of nofib I haven't really seen this happen so far.
It's good to be aware that a whole suite can give wrong results, though.
I wonder if this effect is limited by GHC's tendency to use 8-byte
alignment for all code (at least with tables next to code)?
If we consider only 16-byte (DSB buffer) and 32-byte (cache line)
boundaries relevant, this reduces the possibilities by a lot, after all.
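
To put rough numbers on that: assuming every code block starts at a
multiple of 8 bytes, it can begin at only 2 distinct positions within a
16-byte window and 4 within a 32-byte window, versus 16 and 32 with no
alignment constraint. A small Python sketch of that counting follows (the
function name is mine; it counts start offsets only and says nothing
about how much of the code straddles a boundary).

```python
def distinct_offsets(alignment, window):
    """Offsets inside a `window`-byte block at which an
    `alignment`-aligned address can start."""
    # Walking alignment * window bytes covers every reachable offset.
    addresses = range(0, alignment * window, alignment)
    return sorted({addr % window for addr in addresses})


for window in (16, 32):
    aligned = distinct_offsets(8, window)
    unaligned = distinct_offsets(1, window)
    print(f"{window}-byte window: {len(aligned)} offsets with 8-byte "
          f"alignment, {len(unaligned)} without")
```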

In the particular example I've hit, however, it's pretty obvious that
alignment is not the issue (and I still verified that).
In the end, how big the impact of a better layout would be in general is
hard to quantify. Hence the question of whether anyone has
pointers to good literature that looks into this.

Cheers
Andreas
