[GHC] #15999: Stabilise nofib runtime measurements

GHC ghc-devs at haskell.org
Thu Dec 13 09:47:05 UTC 2018


#15999: Stabilise nofib runtime measurements
-------------------------------------+-------------------------------------
        Reporter:  sgraf             |                Owner:  (none)
            Type:  task              |               Status:  new
        Priority:  normal            |            Milestone:  ⊥
       Component:  NoFib benchmark   |              Version:  8.6.2
  suite                              |
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #5793 #9476       |  Differential Rev(s):  Phab:D5438
  #15333 #15357                      |
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by sgraf):

 Replying to [comment:2 osa1]:
 > - I wonder if it'd be better to run the process multiple times, instead
 >   of running the `main` function multiple times in the program. Why?
 >   That way we know GHC won't fuse or somehow optimize the
 >   `replicateM_ 100` call in the program, and we properly reset all the
 >   resources/global state (both the program's and the runtime system's,
 >   e.g. weak pointers, threads, stable names). It just seems more
 >   reliable.

 The whole point of this patch is that we iterate ''within'' the process,
 so that GC parameterisation doesn't affect performance (counted
 instructions, even) on the same program in 'unforeseen' ways, like the
 discontinuous, non-monotonic paraffins example. There we would expect that
 increasing the nursery size always leads to higher productivity, because
 the GC has to run less often. That clearly isn't the case, due to effects
 outlined in https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38.
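
 To make the shape of the change concrete, here is a minimal sketch (not
 the actual patch; `oldMain` is a hypothetical stand-in for a benchmark's
 original `main`) of what iterating within the process looks like:

 {{{
 module Main (main) where

 import Control.Monad (replicateM_)
 import System.Environment (getArgs)

 -- Hypothetical stand-in for the benchmark's original main: read the
 -- problem size from the command line, do the work, print the result.
 oldMain :: IO ()
 oldMain = do
   [n] <- getArgs
   print (sum [1 .. read n :: Int])

 -- The patched benchmark runs the same logic 100 times within a single
 -- process, so every iteration starts from a different heap/GC state.
 main :: IO ()
 main = replicateM_ 100 oldMain
 }}}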

 My hypothesis here is that fixing the productivity curve above has the
 following effect on benchmarking a patch that changes allocations:

 Basically, we fix the GC parameterisation and vary allocations instead of
 fixing allocations and varying GC parameters. For a fixed GC
 parameterisation (which is always the case when running NoFib), we would
 expect an optimisation that reduces allocations by 10% to lead to less GC
 time, thus higher productivity. Yet, in
 https://ghc.haskell.org/trac/ghc/ticket/9476#comment:55, there is a
 regression in total allocations by 18.6%, while counted instructions
 improve by 11.7% (similar results when measuring actual runtime). As the
 thread reveals, I had to do quite some digging to find out why
 (https://ghc.haskell.org/trac/ghc/ticket/9476#comment:71): The program
 produces more garbage, leading to more favorable points at which GC is
 done. We don't want that to dilute our comparison!

 This is also the reason why iterating the same program by restarting the
 whole process doesn't help: GC parameterisation is deterministic (rightly
 so, IMO), so restarting the process would only measure the same dilution
 over and over.

 On the other hand, iterating the logic ''n'' times from within the
 program leads to different GC states at which the actual benchmark logic
 starts, and thus to a more uniform distribution of the points in the
 program at which GC happens.

 For every benchmark in the patch, I made sure that `replicateM_ 100` is
 actually a sensible thing to do. Compare that to the current version of
 [https://github.com/ghc/nofib/blob/f87d446b4e361cc82f219cf78917db9681af69b3/spectral/awards/Main.hs#L68
 awards], where that is not the case: GHC will float out the actual
 computation and all that is measured is `IO`. In paraffins I used
 `replicateM_ 100` (or `forM_ [1..100] $ const`, rather, TODO), because the
 inner action has a call to `getArgs`, the result of which is used to build
 up the input data. GHC can't know that the result of `getArgs` doesn't
 change, so it can't memoize the benchmark result, and this measures what
 we want. In other cases I actually had to use `forM_` and make use of the
 counter somewhere in generating the input data, as in the sketch below.
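
 To make the awards-style pitfall and the counter-based fix concrete, here
 is a simplified sketch (not the code of either benchmark):

 {{{
 module Main (main, badMain) where

 import Control.Monad (forM_, replicateM_)

 -- Pitfall: the work is a closed, pure expression. The thunk is shared
 -- across iterations (and GHC can float it to the top level), so it is
 -- evaluated once and the remaining 99 iterations only measure printing
 -- an already-computed value.
 badMain :: IO ()
 badMain = replicateM_ 100 (print (sum [1 .. 1000000 :: Int]))

 -- One fix when there is no `getArgs` dependence: feed the loop counter
 -- into the input data, so the result differs per iteration and cannot
 -- be shared.
 main :: IO ()
 main = forM_ [1 .. 100 :: Int] $ \i ->
   print (sum [i .. 1000000 + i])
 }}}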

 > - You say "GC wibbles", but I'm not sure if these are actually GC
 >   wibbles. I just checked paraffins: it doesn't do any IO (other than
 >   printing the results), and it's not even threaded (does not use the
 >   threaded runtime, does not do `forkIO`). So I think it should be quite
 >   deterministic, and I think any wibbles are due to the OS side of
 >   things. In other words, if we could have an OS that only runs
 >   `paraffins` and nothing else, I think the results would be quite
 >   deterministic.
 >
 >   Of course this doesn't change the fact that we're getting
 >   non-deterministic results and we should do something about it; I'm
 >   just trying to understand the root cause here.

 The numbers are deterministic, but they are off in the sense above. By GC
 wibble, I mean that varying tiny parameters of the program or the GC has a
 huge, non-monotonic, 'discontinuous' impact on the perceived performance,
 which makes the benchmark suite unsuited for evaluating any optimisation
 that changes allocations.

 Replying to [comment:3 osa1]:
 > I'm also wondering if having a fixed iteration number is a good idea.
 > What if, in 5 years, 100 iterations for paraffins is not enough to get
 > reliable numbers? I think `criterion` also has a solution for this (it
 > does a different number of repetitions depending on results).

 The point of iterating 100 times is not to adjust runtime so that we
 measure something other than system overhead (type 1 in
 https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38). It's solely to
 get smoother results in the above sense. There's still the regular
 fast/norm/slow mechanism, which is there to tune for specific runtimes,
 which are quite likely to vary in the future.

 Of course, we could just as well increase the 100 to 243 for even smoother
 results instead of modifying the actual input problem. But having it fixed
 means that the curve for fast is as smooth as the one for slow, which is a
 good thing, I think. It means we can measure counted instructions on fast
 without having to worry too much about said dilutions.
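
 As a sketch of that division of labour (assuming, as for most nofib
 benchmarks, that the mode-specific problem size arrives as a command-line
 argument): the iteration count stays fixed in the source, while
 fast/norm/slow tune total runtime by passing a different size.

 {{{
 module Main (main) where

 import Control.Monad (forM_)
 import System.Environment (getArgs)

 -- Fixed in the source: only there to smooth out GC effects, identical
 -- for fast, norm and slow.
 iterations :: Int
 iterations = 100

 main :: IO ()
 main = forM_ [1 .. iterations] $ \_ -> do
   -- getArgs is re-read in every iteration; GHC cannot know its result
   -- is unchanged, so the work below is not shared across iterations.
   [size] <- map read <$> getArgs
   print (length (show (product [1 .. size :: Integer])))
 }}}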

 Replying to [comment:4 osa1]:
 > Here's a library that can be useful for this purpose:
 http://hackage.haskell.org/package/bench

 Ah, yes. That could indeed be worthwhile as a replacement for the current
 runner, but only if it supported measuring more than just time.

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/15999#comment:5>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler

