potential for GHC benchmarks w.r.t. optimisations being incorrect
Joachim Breitner
mail at joachim-breitner.de
Sun May 6 14:59:22 UTC 2018
Hi,
On Sunday, 06.05.2018 at 16:41 +0200, Andreas Klebinger wrote:
> With a high number of NoFibRuns (30+), disabling frequency scaling,
> stopping background tasks, and walking away from the computer
> till it was done, I got the noise down to differences of about +/-0.2%
> between subsequent runs.
>
> This doesn't eliminate alignment bias and the like but at least it
> gives fairly reproducible results.
That’s true, but it still leaves the alignment bias. This bit me in my
work on Call Arity, as I describe in my thesis:
Initially, I attempted to use the actual run time measurements, but it
turned out to be a mostly pointless endeavour. For example, the knights
benchmark would become 9% slower when enabling Call Arity (i.e. when
comparing (A) to (B)), a completely unexpected result, given that the
changes to the GHC Core code were reasonable. Further investigation
using performance data obtained from the CPU indicated that with the
changed code, the CPU’s instruction decoder was idling for more cycles,
hinting at cache effects and/or bad program layout.
Indeed: When I compiled the code with the compiler flag -g, which
includes debugging information in the resulting binary, but should otherwise
not affect the relative performance characteristics much, the unexpected
difference vanished. I conclude that non-local changes to the
Haskell or Core code will change the layout of the generated program
code in unpredictable ways and render such run time measurements
mostly meaningless.
This conclusion has been drawn before [MDHS09], and recently, tools
to mitigate this effect, e.g. by randomising the code layout [CB13], were
created. Unfortunately, these currently target specific C compilers, so I
could not use them here.
In the following measurements, I avoid this problem by not measuring
program execution time, but simply by counting the number of instructions performed.
This way, the variability in execution time due to code
layout does not affect the results. To obtain the instruction counts, I employ
valgrind [NS07], which runs the benchmarks on a virtual CPU and
thus produces more reliable and reproducible measurements.
Unpleasant experience.
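
In case it is useful, here is a rough, standalone sketch (in Haskell) of
that instruction-counting approach: run the benchmark binary under
valgrind's cachegrind tool and pull the total instruction count (the
"I refs" line) out of the summary it prints to stderr. The binary names
are placeholders, and the parsing assumes cachegrind's usual summary
layout, so take it as an illustration of the idea rather than what I
actually used:

import Data.Char (isDigit)
import System.Process (readProcessWithExitCode)

-- Instruction count of a single run, taken from cachegrind's summary on
-- stderr (a line of the form "==PID== I   refs:   27,742,716").
-- Note: cachegrind also leaves a cachegrind.out.<pid> file behind.
instructionCount :: FilePath -> [String] -> IO (Maybe Integer)
instructionCount bin args = do
  (_exit, _out, err) <-
    readProcessWithExitCode "valgrind" ("--tool=cachegrind" : bin : args) ""
  pure $ case [ ws | l <- lines err
                   , let ws = words l
                   , take 2 (drop 1 ws) == ["I", "refs:"] ] of
    ((_:_:_:n:_):_) -> Just (read (filter isDigit n))  -- drop the commas
    _               -> Nothing

main :: IO ()
main = do
  -- Placeholder binaries: the same benchmark built without and with the change.
  withoutCA <- instructionCount "./knights-baseline"  []
  withCA    <- instructionCount "./knights-callarity" []
  print (withoutCA, withCA)

Since the virtual CPU is deterministic, two such runs of the same binary
should agree almost exactly, which is what makes the numbers comparable
across builds.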
Cheers,
Joachim
--
Joachim Breitner
mail at joachim-breitner.de
http://www.joachim-breitner.de/