WebUI for GHC/Haskell tooling (eventlog)

Mon Sep 14 00:38:15 UTC 2020

Hi, I read your post back when you posted it and meant to respond, but got
distracted!  Anyway, I think the profiling tools in ghc could definitely use
some attention, so I'm glad you're looking into this!

The below is going to seem like a rant, and maybe it is in some parts, but
I mean it to be a constructive attempt to chart the gaps in documentation or
tools.  It has been observed many times that the haskell performance story is
scattered about, and many people have suggested some kind of consolidation,
which of course is always The Problem, especially for open source.  So here
I am observing that again, but there does seem to be promising movement
as people get more interested in performance, and your efforts are
encouraging.

### documentation

It would be really nice to get more complete and detailed documentation of
what the options are, and gather it into one place.  This is a disorganized
list of my own experiences:

The time units in all the profiles seem mysterious.  There's a "total time"
in the .prof file.  There's a time axis on the heap profile.  There are times
in the GC summary (INIT, MUT, ..., Total).  None of these times seem to
correspond with each other.  What do they mean?  Similarly, the "total bytes"
in the prof file doesn't seem to correspond to anything in the GC summary.
Long ago (maybe around 10 years ago) I think I intuited that the heap profile
time is CPU time, which is what foiled my attempts to separate program phases
with sleeps so I could see them.  I resorted to live profiling with ekg, and
more recently I have tried to use the eventlog and custom events for that
(eventlog2html does draw the event positions, but the feature to show the
event text didn't work for me).  Anyway, there are many tools and techniques,
but I haven't seen documentation tying them together, along with advice and
experience reports and all that good stuff.
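
For reference, the custom-event approach can be sketched roughly like this. It is a minimal sketch, not what my own code does: `withPhase` and `examplePhases` are hypothetical names, and the only library call it relies on is `traceMarkerIO` from `Debug.Trace` in base.

```haskell
import Control.Exception (evaluate, finally)
import Debug.Trace (traceMarkerIO)

-- | Run an action between a pair of eventlog markers.
-- 'withPhase' is a hypothetical helper, not a library function.
withPhase :: String -> IO a -> IO a
withPhase name action = do
    traceMarkerIO ("start " ++ name)
    action `finally` traceMarkerIO ("end " ++ name)

-- | Two made-up phases; the forced computations stand in for real work.
-- 'evaluate' forces each result inside its phase, so laziness does not
-- push the cost past the closing marker.
examplePhases :: IO (Int, Int)
examplePhases = do
    a <- withPhase "load" (evaluate (sum [1 .. 100000]))
    b <- withPhase "process" (evaluate (length (show a)))
    return (a, b)
```

The markers are no-ops unless the program runs on an eventlog-enabled runtime with `+RTS -l`, so the wrapper can stay in place in normal builds.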

So I improvise.  Here is my latest attempt, for ad-hoc profile exploration:
https://github.com/elaforge/karya/blob/work/tools/run_profile.py
It fiddles with all the flags I can never remember, collects and archives the
results in a dated directory, runs all the various tools I can never remember
(ghc-prof-flamegraph has been ostensibly the most useful, but see below about
SCCs), and tries to extract a summary of the somewhat more stable numbers
(GC stats and top profile cost centers) so I can diff them.

Then there is a completely different attempt to get historical performance by
running the optimized non-profiling binary on known inputs, extracting the
actual runtimes of various phases, and putting them in a database to query later:
https://github.com/elaforge/karya/tree/work/tools/timing  This is because
I don't trust profile-built binaries to be ground truth, even if it's just
-prof and the eventlog runtime, no SCCs.

I did some work to convert event logs to the chrome tracing format:
https://github.com/elaforge/karya/blob/work/App/ConvertEventLog.hs In the end,
I didn't use the graphical tracing, but just did ad-hoc analysis of the
timestamps to see who was most expensive.  The event format is another place
where documentation would be nice: as you can see from the file, I just
copy-pasted the definition out of ghc and guessed what the types mean from their
names.  This was in the ghc 8.0 era I think, and I recall that the eventlog
acquired heap data after that.  I did get it working as a replacement for
ThreadScope, and I think in general reusing a general framework that other
people maintain will work better than a custom GTK app when the maintainer
count is in the 0 to 1 range, though I recall chrome consumes JSON and trying
to get that much data through JSON hit a wall eventually.  I guess JSON
should be theoretically capable of arbitrary sizes, so presumably it was that
chrome is not optimized for large data... which might undercut the idea that
it's better to use someone else's tool.
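
To give an idea of what the conversion amounts to, here is a minimal sketch of rendering spans as chrome "complete" events (`"ph":"X"`) in the JSON array form.  It is not the code in ConvertEventLog.hs: the `Span` type and all function names are hypothetical, it assumes span names are plain ASCII without control characters (so `show` happens to produce valid JSON strings), and it assumes eventlog timestamps in nanoseconds against chrome's microseconds.

```haskell
import Data.List (intercalate)
import Data.Word (Word64)

-- | A simplified span, as might be extracted from an eventlog.
-- Hypothetical type; this is not the eventlog's own event type.
data Span = Span
    { spanName :: String     -- assumed plain ASCII, no control characters
    , spanThread :: Int
    , spanStartNs :: Word64  -- nanoseconds
    , spanEndNs :: Word64    -- nanoseconds, assumed >= spanStartNs
    }

-- | Render one span as a "complete" event with ts and dur in microseconds.
spanToJson :: Span -> String
spanToJson s = "{" ++ intercalate "," fields ++ "}"
  where
    fields =
        [ field "name" (show (spanName s))
        , field "ph" (show "X")
        , field "pid" "1"
        , field "tid" (show (spanThread s))
        , field "ts" (show (nsToUs (spanStartNs s)))
        , field "dur" (show (nsToUs (spanEndNs s - spanStartNs s)))
        ]
    field key value = show key ++ ":" ++ value
    nsToUs ns = ns `div` 1000

-- | Render a whole trace as a JSON array of events.
traceToJson :: [Span] -> String
traceToJson spans = "[" ++ intercalate "," (map spanToJson spans) ++ "]"
```

Building one big String like this is exactly the kind of thing that stops scaling for large logs, so treat it as an illustration of the format rather than a converter.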

Despite all of this, over the last 10 or so years, I have never managed to get
predictable or consistent numbers, e.g. after a ghc version change they get
dramatically worse in theory, but seem to be about the same wall clock time.
Or they steadily creep down or up over long periods where no changes should
have affected them, or there is no apparent improvement after a change that
eliminated a top SCC entry, etc. etc.  And this is without the confounding
factor of changing hardware, since I do have hardware that's unchanged from 10
years back (I'm lazy about upgrading, ok?)... though of course hardware is
confounding in general, and I haven't seen any techniques for how to control
for that.  Even on the same hardware, CPUs and OSes are quite
non-deterministic, but the best approach I've seen there is criterion-style
analysis for short benchmarks, and for long ones I just run them multiple
times and hope they are below the noise floor.
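
The "run it several times" approach can at least be made a bit less hand-wavy.  This is a crude sketch, not a statistical test and not what criterion does: all names are hypothetical, and the rule of thumb (a change only counts if it exceeds the run-to-run spread of either set of runs) is my own assumption.

```haskell
import Data.List (sort)

-- | Median of a non-empty list of timings (in any consistent unit).
median :: [Double] -> Double
median xs
    | null xs = error "median: empty list"
    | odd n = sorted !! mid
    | otherwise = (sorted !! (mid - 1) + sorted !! mid) / 2
  where
    sorted = sort xs
    n = length xs
    mid = n `div` 2

-- | Spread of the runs relative to their median: (max - min) / median.
relativeSpread :: [Double] -> Double
relativeSpread xs = (maximum xs - minimum xs) / median xs

-- | Crude check: is the relative change in the median larger than the
-- run-to-run spread of either set of runs?
looksSignificant :: [Double] -> [Double] -> Bool
looksSignificant before after =
    abs change > max (relativeSpread before) (relativeSpread after)
  where
    change = (median after - median before) / median before
```

For example, `looksSignificant [10, 10.2, 9.9] [8, 8.1, 7.9]` compares a 20% change in the median against a spread of about 3%.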

Anyway, I know all this stuff goes beyond just haskell and ghc, and is part of
the general theme that profiling and benchmarking are hard and no one
really seems to know how to do it satisfactorily.  For example, here is a fun
blog post on how even the mainstream VM world has apparently failed to get
useful benchmarks on JIT:
https://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_1.html
which reminds me of how it seems only recently did people realize, in the
context of mtl vs. various free monads, that the key to mtl performance is
monomorphic inlining.  But despite all that, surely we can do better than just
banging around blindly alone, as I've done.  I think other people have had
better success than I have, and I would love to learn from their examples.

There is also a whole battery of language-agnostic low level tools from Intel
and whatnot that the HPC or video games people use, and while ghc haskell
can be a bit far from that, it doesn't mean they're useless... I've seen
references to them used even for python.  After just a little bit of time
lurking on a rust-oriented chat, it seems like they think about performance
(both throughput and latency) in a more rigorous and systematic way, and more
connected to the broader performance-oriented community.  Maybe similar to the
way haskell has traditionally been more rigorous and systematic about
abstractions and correctness, and more connected to the broader math-oriented
community.

The whole thing about SCCs could also use some documentation and advice.  Due
to some of the experiences above, I don't trust -fprof-auto-* flags,
and I have seen some blog posts supporting that.  The basic problem as
I understand it is that SCCs prevent inlining, and inlining is the way
important optimizations happen.  But there they are, very tempting, and there
is even a new one, -fprof-auto-exported, that seems to want to solve the
inlining problem, but it can't really, because what you really want is SCCs on
non-inlined functions, and I gather that's awkward given the order of the ghc
pipeline.  But cabal compiles all external libraries with "exported-functions"
by default (which you have to look up to figure out that it's
-fprof-auto-exported... can we pick consistent names?), with the result
that (I think!) the SCCs stymie the inlining and specialization of (>>=),
which has been documented (by which I mean the usual blog and reddit posts)
to completely alter the performance of mtl style monadic code.  So the
first step is to set 'profiling-detail: none' and recompile the whole world,
which used to be a lot more hassle, but I think cabal V2 has improved matters.
But, all that said, I also understand why the auto-scc stuff is so tempting,
just to give an overview before you try to zoom in manually with SCCs, because
there are zillions of functions to annotate.  What approach to use when?  Has
anyone come up with satisfying guidance?
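
For completeness, the manual alternative looks roughly like this.  It is a minimal sketch with a made-up function: the SCC pragma is standard GHC syntax, but `summarize` and the cost centre names are hypothetical, and the parentheses are there only so it is unambiguous which expression each annotation covers.

```haskell
import Data.List (foldl')

-- | Hypothetical function with two hand-placed cost centres.
-- Without -prof the SCC pragmas are ignored, so they can stay in the source.
summarize :: [Int] -> (Int, Int)
summarize xs = (total, biggest)
  where
    total = {-# SCC "summarize.total" #-} (foldl' (+) 0 xs)
    biggest = {-# SCC "summarize.biggest" #-} (foldl' max minBound xs)
```

The idea would be to build with -prof but without any -fprof-auto-* flag (and with 'profiling-detail: none' for dependencies), so that only the hand-placed cost centres show up in the .prof output from `+RTS -p`.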

Then there are fascinating experiments like
https://github.com/Petrosz007/haskell-profile-highlight , which is something
I dreamed about from the beginning, except that it relies on pervasive SCCs
so... is it ok to build on that foundation?

Oh and speaking of SCCs, there's a bug (?) where the entries count is sometimes 0.
Every once in a while someone posts somewhere asking about that and no one
seems to know.

Every time I do biographical profiling I have to remind myself what exactly
LAG, DRAG, INHERENT_USE, and VOID are.  So I search my gmail box, because
the only documentation I have is a very helpful response Simon Marlow sent
me when I asked those very questions 10 years back, and the original
1996 biographical profiling paper
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.1219) that he
mentions.  To this day, if I search the mailing list archives for INHERENT_USE
that is the only message that comes up!  The paper is still
relevant, because it seems the implementation hasn't changed much since 1996
either, but dummies like me need lots of examples and case studies for things
to stick.
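
As a memory aid, here is a minimal sketch of the stages as they are usually described: lag is from creation to first use, use is from first to last use, drag is from last use until the object is freed, and void is the whole lifetime of an object that is never used.  The `Lifetime` type and `stageAt` are hypothetical and exist only for illustration; the RTS exposes nothing like them, and INHERENT_USE (objects the profiler simply assumes are in use) is not modelled.

```haskell
-- | Simplified lifetime of one heap object, in arbitrary time units.
-- Hypothetical type for illustration only.
data Lifetime = Lifetime
    { created :: Int
    , uses :: Maybe (Int, Int)  -- (first use, last use); Nothing if never used
    , freed :: Int
    }

data Stage = Lag | Use | Drag | Void
    deriving (Eq, Show)

-- | Which biographical stage the object is in at time t, if it is alive.
stageAt :: Lifetime -> Int -> Maybe Stage
stageAt l t
    | t < created l || t >= freed l = Nothing
    | otherwise = Just stage
  where
    stage = case uses l of
        Nothing -> Void
        Just (first, final)
            | t < first -> Lag
            | t <= final -> Use
            | otherwise -> Drag
```

So for an object created at 0, used from 3 to 5, and freed at 10, times 0 to 2 are lag, 3 to 5 are use, and 6 to 9 are drag.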

Also INHERENT_USE seems to date from pre-bytestring days when no one had
significant data in ByteArrays so it was ok to just handwave it away.  That
isn't the case anymore, and it means that often most data is not tracked.
There is a ghc ticket to improve the situation:
https://gitlab.haskell.org/ghc/ghc/-/issues/7275 There has been consistent
interest over its 7 years, looks like it just lacks a volunteer!

Then there is folk knowledge about what ARR_WORDS is.  I recently stumbled
across a very helpful post by Ben Gamari:
https://bgamari.github.io/posts/2016-03-30-what-is-this-array.html  There are
a bunch of other internal closure types though, which as far as I know require
knowing ghc internals to understand.

And then there is a whole zoo of ad-hoc techniques scattered across blog posts
over the last 10 years or so: Neil Mitchell's stack-limiting leak-finder,
Simon Marlow's weak pointer leak finder, and an absolutely heroic post about
using gdb to directly inspect ghc data structures and find leaks:
https://lukelau.me/haskell/posts/leak/
Here's a recent one about memory fragmentation, that might also be the answer
to my bytes discrepancy questions above:
https://www.well-typed.com/blog/2020/08/memory-fragmentation/

And of course various ghc pragmas which are actually pretty well documented,
but the advice on how to use them is still scattered around in blog posts:
INLINE vs. INLINABLE vs. SPECIALIZE, rewrite rules, etc.
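
A minimal side-by-side of the three pragmas, as a sketch only: the functions are made up, and the comments summarize my understanding of the documented behaviour, not advice on when each one pays off.

```haskell
-- INLINABLE: keeps the definition available to other modules, so that
-- GHC can specialise it at the types its callers use, without making
-- GHC any keener to inline it.
sumSquares :: Num a => [a] -> a
sumSquares = foldr (\x acc -> x * x + acc) 0
{-# INLINABLE sumSquares #-}

-- SPECIALIZE: asks for a monomorphic copy at one specific type, here Int,
-- which GHC uses in place of the overloaded version where the types match.
{-# SPECIALIZE sumSquares :: [Int] -> Int #-}

-- INLINE: makes GHC very keen to inline the definition at call sites
-- where it is applied to both arguments.
scale :: Num a => a -> a -> a
scale k x = k * x
{-# INLINE scale #-}
```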

And then there's folk knowledge about libraries and data structures, e.g.
Writer being inefficient so use strict StateT instead, but someone also put up
writer-cps-mtl, but hey it says it was merged into 'transformers' so maybe
that's all obsolete now?  And Either/ExceptT is also inefficient but in
theory CPS transform fixes that too... but still no except-cps-mtl?  I wrote
my own by hand, which seemed to be what everyone was doing at the time, but as
usual I couldn't demonstrate an actual performance improvement from it.  By
the way, I assume that is the answer to the attoparsec question on
https://www.reddit.com/r/haskell/comments/ir3hmr/compiling_systems_haskell_resourcesexamples/

And the existence of short-text and short-bytestring, and of course the
famousest folk knowledge, which is difference lists, but actually sometimes
they hurt more than help, and no one seems to mention that.  Or the AppendList
(called OrdList in ghc source) which never seemed to gain significant
popularity... including with me, since I couldn't get it to demonstrate
a performance improvement over [] and (++)... but ghc does use it and maybe
you just have to use it right?
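
For reference, the difference list trick itself fits in a few lines.  This is a minimal sketch with hypothetical names, not the dlist package's API.  The usual argument is that it helps when appends nest to the left, since each append is just function composition and the real (++) calls end up nested to the right; it buys nothing when appends were already right-nested, and converting back with `toList` between appends throws the benefit away.

```haskell
-- | A list represented as a function that prepends it to another list.
newtype DList a = DList ([a] -> [a])

fromList :: [a] -> DList a
fromList xs = DList (xs ++)

toList :: DList a -> [a]
toList (DList f) = f []

-- | Appending is function composition, so it is constant time no matter
-- how long the left operand is.
append :: DList a -> DList a -> DList a
append (DList f) (DList g) = DList (f . g)

-- | Left-nested appends, the case where plain (++) would repeatedly
-- re-traverse the accumulated prefix.
leftNested :: Int -> [Int]
leftNested n = toList (foldl append (fromList []) [fromList [i] | i <- [1 .. n]])
```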

There's also some folk wisdom about LLVM-for-your-loops and vectorization...
e.g. I noticed that a foreign call to a C function that does a nested loop
to sum buffers of floats is an order of magnitude faster than a loop in ST with
unsafeWrite, which is another order of magnitude faster than the high-level
Unboxed.Vector.zipWith stuff, and I assume auto vectorization might be to
blame.  Of course it seems no one knows how to do it without one of flaky
automatic optimization, or grungy explicit intrinsics calls, or an entirely
new DSL or language, though I suppose haskell does have entries such as
'accelerate' and 'repa'.  But anyway, that's getting into low level
performance and numerics, which is a whole specialized field on its
own, and it seems hard to port its solutions into general purpose code.