Feedback request regarding HSOC project (bringing sanity to the GHC performance test-suite)

Jared Weakly jweakly at pdx.edu
Mon Sep 11 16:43:38 UTC 2017


Hi Phyx,

Sorry for the late reply. I have already implemented a wait-and-retry
approach, since it seemed the most sensible option.
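
Roughly, the retry logic amounts to something like the sketch below
(illustrative only -- the helper name, retry count, and delay are
placeholders, not the actual implementation):

    import subprocess, time

    def append_perf_note(payload, max_retries=5, delay=1.0):
        # Append the collected metrics to the 'perf' notes namespace
        # on HEAD, retrying a bounded number of times if git is busy
        # (e.g. the repo is locked by an IDE running git in the
        # background).
        cmd = ['git', 'notes', '--ref=perf', 'append', '-m', payload, 'HEAD']
        for _ in range(max_retries):
            if subprocess.run(cmd).returncode == 0:
                return True
            time.sleep(delay)
        return False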

Let me see if I can clear this up a bit.

Suppose you want to check out a new branch and work on a feature.
You check out that branch and run the testsuite once (using
ONLY_PERFORMANCE_TESTS=YES if you just want to populate the
performance comparison data). Ideally you would then run the
testsuite again on every commit, although you don't have to. Say you
finish the feature in 10 commits and you ran the testsuite on each
commit: you can use the comparison tool across all 10 commits to
track how performance changed over the development of the feature.
If you only ran the performance testsuite once before you started
coding, you can run it again after you're finished and then use the
comparison tool with the first and last commits as data points. And
if you forgot to run the performance testsuite at all, you can always
check out the first commit, run it there, check the branch back out,
and compare against HEAD.
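
To make that concrete: comparing two commits amounts to reading the
perf notes attached to each and diffing the values. The snippet below
is only a rough sketch of that idea in Python -- the function names
are made up and it is not the actual perf_notes.py code -- using the
tab-separated note format described later in this thread:

    import subprocess

    def read_perf_note(commit):
        # Each note line is: test environment, test name, test way,
        # metric, value (tab-separated).
        out = subprocess.run(['git', 'notes', '--ref=perf', 'show', commit],
                             capture_output=True, text=True).stdout
        metrics = {}
        for line in out.splitlines():
            if not line.strip():
                continue
            env, test, way, metric, value = line.split('\t')
            metrics[(env, test, way, metric)] = float(value)
        return metrics

    def compare(old_commit, new_commit):
        old = read_perf_note(old_commit)
        new = read_perf_note(new_commit)
        for key in sorted(old.keys() & new.keys()):
            if old[key] == 0:
                continue
            change = 100.0 * (new[key] - old[key]) / old[key]
            print('%-60s %+7.1f%%' % (' '.join(key), change))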

While the notes will not be pushed, if you create a branch from your
branch the notes will carry over, and they are never deleted. (They
are tied to their respective commit hashes, so a rebase will appear
to "remove" them because it changes the commit hashes; however, you
can manually copy the notes from the old hash to the new hash if you
want to--the data is never lost.) As such, long-time GHC developers
will eventually build up a long and respectable data-set, almost as
comprehensive as the one the CI builders will develop.
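
For the rebase case, the manual copy is a single git-notes operation
per commit; something along these lines (a minimal sketch, with
placeholder commit hashes):

    import subprocess

    def copy_perf_note(old_hash, new_hash):
        # Attach the perf note of the pre-rebase commit to the
        # corresponding rebased commit.
        subprocess.run(['git', 'notes', '--ref=perf', 'copy',
                        old_hash, new_hash], check=True)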

Your local history of performance numbers will be as complete as you
want it to be, since it doesn't really matter for the CI story. Does
that clarify things regarding the local story?

--------------

Now, on the CI side of things...  You are right that performance is
very platform-specific. In fact, because performance testing is so
platform-specific, it is genuinely hard to come up with a "perfect"
way to normalize scores and push the numbers somewhere central to be
referenced. So, as far as I can see, the only sensible approach is to
compare numbers gathered on a given machine only against other
numbers gathered on that same machine (performance is inherently
relative to some point of reference). This means that, yes, if there
is no CI running tests for an OS, then automated performance
regression testing for that OS will suffer a bit; however, if there
is no CI running tests for that OS, then that OS is not being tested
*at all*, and the lack of accurate performance regression testing is
the least of GHC's concerns.

The way I see it, performance can only truthfully be tracked on
platforms it's actually measured on. Further, *correctness* of the
compiler can only truthfully be tracked on platforms it's actually
being tested on. If no CI or builder exists for OSX or Windows at
all, then the entire concept of measuring performance there is
already moot: how can one evaluate a comparison that never happened?

This change should make things much easier, since regressions can be
very tightly tracked on every platform that the testsuite is
regularly run on. Right now, Linux is the only platform the tests are
regularly run on; it is a lie to continue to pretend that the
testsuite is fully functional on Windows (it is not) or on OSX (it is
not)--in an environment where "clean test run" means "the same number
of broken tests as there used to be", any performance metric would
just be a lie. I'd even go so far as to say that if any builder is
broken at all, performance has ceased to be measured in any
meaningful way, and the numbers that exist inside the current
testsuite are basically meaningless for OSX and Windows and only
*barely* useful for Linux.

On the other hand, the information gathered from Linux can now be
used to tightly control performance regressions in GHC on Linux;
since Linux is the only platform being tested regularly right now,
that alone is a vast overall improvement--any tight regression
control is better than none. Once regular CI builders are set up for
OSX and for Windows, the situation will improve by leaps and
bounds--entirely for free.

I look forward to your thoughts,
Jared

On Wed, Sep 6, 2017 at 11:44 PM, Phyx <lonetiger at gmail.com> wrote:
> Hi Jared,
>
>
> On Tue, Sep 5, 2017, 19:39 Jared Weakly <jweakly at pdx.edu> wrote:
>>
>> Hi Tamar,
>>
>> That framework failure is due to a somewhat embarrassing error that I
>> thought I had caught earlier; line 298 shouldn't have existed (it was
>> a small mistake from converting the all.T file from using the old
>> function to using the new collect_stats function). I have fixed this
>> and it will be pushed by the time you read this email. That being
>> said, the individual tests or units are very isolated, and a framework
>> failure simply means that unit didn't get run; in this case the
>> entire all.T file didn't get run, since the error was in reading the
>> file, but the rest of the files should've had their performance
>> tests recorded properly.
>>
>> The .git/refs/notes/perf path is an implementation detail. Git notes
>> have the concept of namespaces; so, in order to avoid cluttering up a
>> global namespace in git notes with data only the performance tests
>> will use, all the performance metrics are stored in the namespace
>> 'perf'.
>>
>> The format of the git notes is mentioned in the code for the
>> testsuite, but I will make it more visible in the README and other
>> documentation. Each note line has the form:
>> $test-env $test_name $test_way $metric_measured $value_collected
>> (fields separated by tabs)
>>
>> The maximum deviation each test allows is specified in its respective
>> all.T file; additionally, if you set the verbosity level of the
>> test-suite to a value >= 4, you will see the expected value, the
>> allowed deviation, the lower bound, the upper bound, the actual
>> measured value, and (if the test fell outside the bounds) how much
>> the actual value deviated from the expected value. This information
>> is also printed whenever a test falls outside of the allowed bounds.
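>>
>> In other words, the check boils down to something like the following
>> sketch (the names are illustrative and this is not the driver's actual
>> code; the 20% default allowed deviation is the one mentioned elsewhere
>> in this thread):
>>
>>     def within_bounds(expected, actual, allowed_dev=0.20):
>>         # A test passes if the measured value lies inside
>>         # [expected * (1 - allowed_dev), expected * (1 + allowed_dev)].
>>         lower = expected * (1 - allowed_dev)
>>         upper = expected * (1 + allowed_dev)
>>         deviation = 100.0 * (actual - expected) / expected
>>         return lower <= actual <= upper, deviation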
>>
>> perf_notes.py exists as both an internal library and a measurement
>> tool (hopefully useful to developers). You can give the tool several
>> commits and it will give you a comparison of the union of all the
>> tests in those commits, with an output very similar in style to
>> nofib. I imagine this will be useful mostly to people who want to
>> improve the performance of the compiler, so they can see which tests
>> have regressed the most over time (or which have improved the most);
>> but since it works over commits, it can also be useful for a
>> developer wanting to know whether they've made a measurable
>> difference with their patch.
>>
>> The notes are updated every time the testsuite is run. However, they
>> are updated only at the very end of the testsuite's execution, in a
>> single command (the information is collected in a Python data
>> structure, which is turned into a string and given to git notes).
>> This behavior means that if the testsuite is run more than once
>> between commits, 'duplicate values' will exist in the git notes. I'm
>> not quite sure how to deal with this yet; I am considering just
>> grabbing the latest value if multiples exist. It also means you can
>> run just one test, then run other tests, and have those values added
>> into the git notes without losing your older values, which is why
>> the behavior is kept this way. (I will make sure this is more
>> prominent in the docs somewhere.)
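>>
>> Since entries are appended in order, "grabbing the latest value" would
>> just mean letting later lines win; a rough sketch (illustrative only,
>> not the final implementation):
>>
>>     def latest_values(note_lines):
>>         latest = {}
>>         for line in note_lines:
>>             # Later entries overwrite earlier ones for the same
>>             # test/way/metric combination.
>>             env, test, way, metric, value = line.split('\t')
>>             latest[(env, test, way, metric)] = float(value)
>>         return latest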
>>
>> The note update is done using Python's subprocess library. I have no
>> idea how resilient that is to git failures; I'd imagine that if git
>> were busy it would just silently fail to update. Fortunately, the
>> update process is as close to atomic as one can get. I'll see if I
>> can figure out a way to force a repo lock to test this out. I'm open
>> to suggestions on how to deal with this better, and I'll also google
>> around and see if anyone has a good solution.
>
>
> I think you can just wait and retry, and do so a specified max number of
> times. That should be good enough for most cases. Unless the process on
> the other end has died, but then the user needs to clean up the lock
> first.
>
>>
>> Platform discrepancies are completely sidestepped because of the way
>> git notes work. The performance metrics are entirely local and stay on
>> your computer; they won't be pushed or shared with any other users.
>> That means the performance numbers are completely tailored to your
>> platform, so there is effectively a 0% margin of "OS-related" error
>> that needs to be accounted for. The collect_stats function is very
>> much designed to be declarative and "set it and forget it". As such,
>> the need to record expected values at all goes away (one of the main
>> motivations for this project in the first place).
>
>
> This I don't quite understand. If I get this right, it means I will now
> always have to run the full performance benchmark suite twice for each
> change I have? Before and after, locally?
>
> I had thought the notes would be pushed, as I would find it useful to have
> the perfect history locally if I wanted to. Performance is by its very
> nature very platform-specific; I feel that this change makes it harder to
> run benchmarks on the platforms we don't have a CI for. So basically only
> Linux is covered.
>
> This would be unfortunate, as it would mean we would effectively stop
> tracking performance on e.g. Windows and Mac OS, since the current
> implementation doesn't allow the data to live together in the same repo.
>
>>
>>
>> Hopefully this answers some questions; I'll make sure this sort of
>> information is available somewhere so that later users can find these
>> answers again. Thanks for your thoughts! They were very helpful.
>>
>> Regards,
>> Jared
>>
>> On Mon, Sep 4, 2017 at 10:02 PM, Phyx <lonetiger at gmail.com> wrote:
>> > Hi Jared,
>> >
>> > First off, thanks for all the hard work on this. I checked out your
>> > branch and did a run, and I noticed at the end it had
>> >
>> > Framework failures:
>> > . ./perf/compiler/all.T [] (unexpected indent (<string>, line 298))
>> >
>> > so I assume none of the perf tests were run?
>> >
>> > Though I do see a .git/refs/notes/perf, so I assume your ref is perf?
>> >
>> > Doing a git notes --ref perf show, I see some things were collected at
>> > some point
>> >
>> > local T3924 normal bytes allocated 47064
>> > local haddock.base normal bytes allocated 18427047160
>> > local haddock.Cabal normal bytes allocated 15863910848
>> > local haddock.compiler normal bytes allocated 50656428952
>> >
>> > which brings me to my first question: I'm guessing the number here is
>> > the number of bytes allocated for the test? Is there a way for me to
>> > see what the maximum deviation the test allows is and how far I am
>> > from it? Do I just get the information like before, only when a test
>> > fails? What does that look like? Same as before?
>> >
>> > It's also not entirely clear to me what perf_notes.py can be used
>> > for. Is it just an infrastructure tool, or is it something you
>> > foresee as useful for a developer?
>> >
>> > Lastly, how often do you update notes? It's probably too late for
>> > this now, but git, and especially msys git, can be quite slow, so I
>> > would have liked the notes to be updated in batches so as not to
>> > slow down the testsuite run on Windows.
>> >
>> > Which brings me to my next question: how resilient are you to
>> > failures updating git? Some IDEs/environments like vscode
>> > automatically issue git operations in the background, so git may be
>> > busy when you try to update and the operation would fail, saying the
>> > repo is locked. Does your new system recover from such failures?
>> >
>> > Also, how do you deal with platform discrepancies? In the past we've
>> > had tests that behave radically differently on different platforms,
>> > so we've also historically had the ability to record a
>> > platform-specific value.
>> >
>> > Thanks,
>> > Tamar
>> >
>> > On Fri, Sep 1, 2017, 05:01 Jared Weakly <jweakly at pdx.edu> wrote:
>> >>
>> >> Hey y'all,
>> >>
>> >> A quick ToC before I dive right in:
>> >>
>> >> * What my HSOC project is on
>> >> * My progress so far
>> >> * Feedback welcome
>> >> * What I have left to do
>> >> * Theoretical potential improvements
>> >>
>> >> -----------
>> >>
>> >> My HSOC project was on bringing sanity to the GHC performance
>> >> test-suite.
>> >> My blog post on this is here:
>> >> https://jaredweakly.com/blog/haskell-summer-of-code/
>> >> The Trac ticket that corresponds to this is here:
>> >> https://ghc.haskell.org/trac/ghc/ticket/12758
>> >> The Phabricator ticket for this patch:
>> >> https://phabricator.haskell.org/D3758
>> >>
>> >> The tl;dr of my HSOC project is that GHC's performance tests currently
>> >> require the programmer to add in expected numbers manually, update
>> >> them, handhold the testsuite, etc. This is a bit absurd, and my
>> >> project's overall aim is to reduce the effort required of the
>> >> programmer to as close to zero as possible, while simultaneously
>> >> increasing the testsuite's ability to catch regressions as much as
>> >> possible.
>> >>
>> >> ------------
>> >>
>> >> My progress so far:
>> >>  - I have a few comparison tools in perf_notes.py. These allow people
>> >> to compare performance numbers of tests across commits
>> >>  - I have all the performance numbers generated by running the tests
>> >> automatically stored in git notes and referenced by both the
>> >> comparison tool and the testsuite
>> >>  - I have refactored the testsuite to use my new code, which pulls
>> >> the expected number automatically from git notes (trivially passing
>> >> if the note does not yet exist for that test) and then compares that
>> >> expected number with the number obtained from running the testsuite
>> >> on the latest commit. The comparison passes if the new number is
>> >> within a certain deviation of the expected one (20% by default, but
>> >> this can be customized by the programmer).
>> >>  - I have refactored all of the all.T files to use the new comparison
>> >> functions for the performance tests and ensured that this doesn't
>> >> break any existing tests.
>> >>
>> >> ------------
>> >>
>> >>
>> >> Anyone who wants to check out the wip/perf-testsuite branch and try
>> >> this out is more than welcome. Feedback on anything is welcome;
>> >> comments are appreciated; discussion is welcome, etc.
>> >>
>> >> -------------
>> >>
>> >>
>> >> What I have left to do is:
>> >>
>> >> 1. Finish writing up the documentation
>> >> 2. Update the wiki in all the relevant places concerning
>> >> additions/modifications to the testsuite and test driver
>> >> 3. Make sure everyone is happy with the change (and make small changes
>> >> as necessary)
>> >>
>> >> --------------
>> >>
>> >> Possible features and improvements I am thinking about adding in:
>> >> * As a stopgap to full integration with performance tracking tools
>> >> (such as Gipedia), optionally emitting a test warning with the test
>> >> summary if there is any regression detected whatsoever (even if the
>> >> number falls within the allowed deviation)
>> >> * Some tests, such as T7702, have a somewhat nonsensical regression
>> >> percentage. Ideally the testsuite could handle those better. I could
>> >> potentially build in multiple ways to determine a regression
>> >> (percentage, 'above a certain value', 'taking longer than X amount of
>> >> time', as potential examples)
>> >> * Currently some tests require installing some Haskell packages; they
>> >> are skipped if the packages are not installed. I could try to build in
>> >> a way to automatically attempt to install all necessary Haskell
>> >> packages if someone attempts to run a test that requires them.
>> >> (Perhaps using a command such as 'make test exhaustive')
>> >> * The performance metric 'peak_megabytes' is sometimes not accurate
>> >> enough; I could see if adding something like `RTS -h -i0.01`
>> >> automatically to tests that use 'peak_megabytes' would resolve that.
>> >> Currently it is a manual debugging step.
>> >>
>> >> Any thoughts? Comments? Questions?
>> >>
>> >> Regards,
>> >> Jared Weakly
>> >> _______________________________________________
>> >> ghc-devs mailing list
>> >> ghc-devs at haskell.org
>> >> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

