Measuring performance of GHC

Moritz Angermann moritz at lichtzwerge.de
Tue Dec 6 04:07:23 UTC 2016


Hi,

I see the following challenges here, which have been partially touched
on in the discussion of the mentioned proposal.

- The tests we are looking at might be quite time intensive (lots of
  modules that take substantial time to compile).  Is it practical to
  run them when people execute nofib locally to get *some* idea of the
  performance implications of a change?  What is the acceptable
  threshold for the total execution time of a nofib run?

- One of the core issues I see in day-to-day programming (even though
  not necessarily with Haskell right now) is that the spare time I have
  to file bug reports, boil down performance regressions, etc. and
  report them to open source projects is not paid for and hence minimal.
  So whenever the tools I use make it really easy to file a bug or a
  performance regression, or to fix something, with the least time
  spent, the chances of me being able to help out increase greatly.
  This was one of the ideas behind using just pull requests.
  E.g. this code seems to be really slow, or has subjectively regressed
  in compilation time; I also feel confident I can legally share this
  code snippet.  So I just create a quick pull request with a short
  description, and then carry on with whatever pressing task I’m trying
  to solve right now.

- Making sure that measurements are reliable (e.g. running on a dedicated
  machine with no other applications interfering).  I assume Joachim has
  quite some experience here.  A rough sketch of the kind of before/after
  timing harness I have in mind follows this list.
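
To make that, and the "before and after" comparisons discussed in the
quoted thread below, a bit more concrete, here is a minimal sketch of
the kind of harness I mean.  The compiler paths, the test module and
the run count are made up for illustration; a real tool would also
want to record allocations (e.g. via the compiler's RTS statistics)
and pin the process to a core.

module Main (main) where

import Control.Monad (replicateM)
import Data.List (sort)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Process (callProcess)
import Text.Printf (printf)

-- Compile one module with the given GHC binary and return the
-- wall-clock time in seconds.  -fforce-recomp makes repeated runs
-- actually recompile instead of reusing the .o/.hi files.
timeCompile :: FilePath -> FilePath -> IO Double
timeCompile ghc srcFile = do
  start <- getCurrentTime
  callProcess ghc ["-O", "-c", "-fforce-recomp", srcFile]
  end <- getCurrentTime
  pure (realToFrac (diffUTCTime end start))

-- Median of the samples; less sensitive to a single noisy run
-- than the mean.
median :: [Double] -> Double
median xs = sort xs !! (length xs `div` 2)

main :: IO ()
main = do
  let runs    = 5                        -- made-up sample count
      srcFile = "Stress.hs"              -- hypothetical test module
      before  = "ghc-8.0.1"              -- hypothetical baseline compiler
      after   = "inplace/bin/ghc-stage2" -- hypothetical patched compiler
  old <- replicateM runs (timeCompile before srcFile)
  new <- replicateM runs (timeCompile after  srcFile)
  printf "before: %.2fs  after: %.2fs  delta: %+.1f%%\n"
         (median old) (median new)
         ((median new / median old - 1) * 100)

Whether something like this ends up in nofib or next to
tests/perf/compiler matters less to me than having one obvious command
that contributors can run before and after a patch.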

Thanks.

Cheers,
 Moritz


> On Dec 6, 2016, at 9:44 AM, Ben Gamari <ben at smart-cactus.org> wrote:
> 
> Michal Terepeta <michal.terepeta at gmail.com> writes:
> 
>> Interesting! I must have missed this proposal.  It seems that it didn't meet
>> with much enthusiasm though (but it also proposes to have a completely
>> separate repo on github).
>> 
>> Personally, I'd be happy with something more modest:
>> - A collection of modules/programs that are more representative of real
>>  Haskell programs and stress various aspects of the compiler.
>>  (this seems to be a weakness of nofib, where >90% of modules compile
>>  in less than 0.4s)
> 
> This would be great.
> 
>> - A way to compile all of those and do "before and after" comparisons
>>  easily. To measure the time, we should probably try to compile each
>>  module at least a few times. (it seems that this is not currently
>>  possible with `tests/perf/compiler` and
>>  nofib only compiles the programs once AFAICS)
>> 
>> Looking at the comments on the proposal from Moritz, most people would
>> prefer to extend/improve nofib or `tests/perf/compiler` tests.  So I guess
>> the main question is - what would be better:
>> - Extending nofib with modules that are compile only (i.e., not
>>  runnable) and focus on stressing the compiler?
>> - Extending `tests/perf/compiler` with ability to run all the tests and do
>>  easy "before and after" comparisons?
>> 
> I don't have a strong opinion on which of these would be better.
> However, I would point out that currently the tests/perf/compiler tests
> are extremely labor-intensive to maintain while doing relatively little
> to catch performance regressions. There are a few issues here:
> 
> * some tests aren't very reproducible between runs, meaning that
>   contributors sometimes don't catch regressions in their local
>   validations
> * many tests aren't very reproducible between platforms and all tests
>   are inconsistent between differing word sizes. This means that we end
>   up having many sets of expected performance numbers in the testsuite.
>   In practice nearly all of these except 64-bit Linux are out-of-date.
> * our window-based acceptance criterion for performance metrics doesn't
>   catch most regressions, which typically bump allocations by a couple
>   percent or less (whereas the acceptance thresholds range from 5% to
>   20%). This means that the testsuite fails to catch many deltas, only
>   failing when some unlucky person finally pushes the number over the
>   threshold.
> 
> Joachim and I discussed this issue a few months ago at Hac Phi; he had
> an interesting approach to tracking expected performance numbers which
> may both alleviate these issues and reduce the maintenance burden that
> the tests pose. I wrote down some terse notes in #12758.
> 
> Cheers,
> 
> - Ben
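
Regarding the acceptance-window point above, and purely as an
illustration (this is not the approach from #12758; the tolerance and
the numbers are invented): a check that compares each metric against
the last accepted baseline and fails on any increase beyond a small
noise allowance would catch the one-or-two-percent creep that a 5-20%
window lets through, and the baseline would then be bumped
deliberately rather than the regression being silently absorbed.
Sketch:

module Main (main) where

import System.Exit (exitFailure)
import Text.Printf (printf)

-- 1% noise allowance; invented for this sketch, and only honest if
-- the measurements themselves are stable (dedicated machine etc.).
tolerance :: Double
tolerance = 0.01

-- Compare a measured metric (e.g. bytes allocated while compiling one
-- test) against the last accepted baseline.
checkMetric :: String -> Integer -> Integer -> Either String String
checkMetric name baseline current
  | delta > tolerance =
      Left  (printf "%s: %+.2f%% over accepted baseline" name (delta * 100))
  | otherwise =
      Right (printf "%s: %+.2f%% (within noise)" name (delta * 100))
  where
    delta :: Double
    delta = fromIntegral (current - baseline) / fromIntegral baseline

main :: IO ()
main =
  -- Hypothetical numbers: a 2% allocation increase that a 5% window
  -- would accept, but this check flags.
  case checkMetric "T1234 (bytes allocated)" 9000000000 9180000000 of
    Right ok  -> putStrLn ok
    Left  err -> putStrLn err >> exitFailure

The hard part, of course, is making the numbers stable enough that such
a tight tolerance does not just produce noise failures, which brings us
back to the dedicated machine.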


