Introducing GHC whole program compiler (GHC-WPC)

Csaba Hruska csaba.hruska at gmail.com
Sun Jun 21 13:40:47 UTC 2020


Hello,

All the questions you have asked are important, but before I answer them
I'd like to emphasize that, as a research platform, WPC/WPA is worth
supporting even with worst-case memory and runtime properties. It would
allow researchers to observe Haskell programs with arbitrary precision.
Not giving access to the whole-program IR simply rules out certain
research directions. IMO this is an important issue, because some
experiment might reveal something that is applicable in practice or could
be implemented in the incremental compilation pipeline as well. I find it
cool to allow developers to use GHC in unexpected or unintended ways that
may lead to something new and valuable.
It is common sense that whole-program (or large-chunk) analysis and
compilation is infeasible. I believe that this common sense needs to be
re-evaluated from time to time, because the conditions may change. E.g.
would the existence of huge amounts of RAM, multicore CPUs, GPUs, or
quantum computers change the situation? Not to mention the new research
results in the static analysis field.

Questions that come to mind (to be understood in the context of the above
> enthusiasm):
>
>    - If you compile a program that depends on (say) lens, you get a lot
>    of code.  Dead-code elim will drop lots, perhaps, but you start with
>    everything.  So what do memory footprints and compile times look like when
>    you do WPC?   Remembering that people often complain about GHC’s footprint
>    when compiling a *single* module.
>    Also, WPC means that instead of just linking to precompiled libraries,
>    you have to recompile (parts of) them.  What does that do to compile times?
>
I deliberately split the WPC pipeline into IR export and import/codegen
pieces. GHC-WPC does only the STG IR export, and the External STG compiler
could be seen as a separate project that uses GHC-WPC as a frontend and GHC
as a backend. It is important to note that the compilation pipeline design
is not fixed in this setting. It could implement the regular per-module
incremental compilation, or it could batch-compile multiple modules or even
the whole program. The IR export part is the same in each case; only the
backend driver differs.
While it is true that the GHC-WPC project currently implements the
whole-program compiler scenario, I also plan to extend it to allow
incremental compilation with arbitrarily sized program clusters. So it
would not be tied to module or package boundaries derived from the source
code structure; instead, the compilation unit granularity could be changed
at will, possibly driven by some static or runtime analysis. It would be a
free parameter that can be adjusted to get an acceptable compilation time,
and as software and hardware advance, the practical compilation unit size
would increase.
It is also possible to implement incremental whole-program compilation.
This technique is already used in the Visual C++ compiler: Incremental
Whole Program Optimization and Compilation
<https://dl.acm.org/doi/pdf/10.5555/3049832.3049857>

The current whole-program compilation pipeline first compiles the project
with GHC-WPC. GHC-WPC compiles the target project through the standard GHC
backend and generates an executable; in addition, it exports the Ext-STG
IR. Then gen-exe is executed, and the whole STG program is compiled in the
following steps:

   1. extracting liveness Datalog facts from each project module
   2. running the whole-program liveness analysis
   3. per-module link-time codegen for the whole project, cutting out the
   dead top-level functions using the liveness analysis result

I designed the pipeline to have a light memory footprint. I intentionally
implemented the static analyses in Datalog using the Souffle Datalog
compiler. Souffle generates small and efficient parallel (OpenMP) C++ code
that stores the in-memory database in specialised data structures
(trie/B-tree). Another important design choice was to collect the Datalog
facts incrementally for the static analysis, instead of loading the whole
program IR into memory. The insight is that it is not the complete IR that
is needed for whole-program analysis, but just a projection of the IR. It
is perfectly OK to calculate that projection on a per-module basis, then
run the static analysis on the collected data. Furthermore, it is even
possible to store the collected facts in files, separately for each
module, for later reuse.
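
To make the per-module fact collection concrete, here is a minimal sketch,
assuming a hypothetical projection type and a tab-separated fact layout
(the names below are illustrative, not the actual external-stg API; the
real project emits Souffle fact files per module):

    module FactGen where

    import System.IO (IOMode (AppendMode), hPutStrLn, withFile)

    -- Hypothetical projection of a module's Ext-STG IR: for the liveness
    -- analysis we only need to know which top-level binder references
    -- which other top-level binders.
    data TopBinding = TopBinding
      { tbName :: String
      , tbRefs :: [String]
      }

    -- Append one tab-separated "reference" fact per edge to the module's
    -- fact file, in the plain TSV format Souffle reads for input relations.
    emitLivenessFacts :: FilePath -> [TopBinding] -> IO ()
    emitLivenessFacts factFile bindings =
      withFile factFile AppendMode $ \h ->
        mapM_ (\b -> mapM_ (\r -> hPutStrLn h (tbName b ++ "\t" ++ r))
                           (tbRefs b))
              bindings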

I measured the compilation of pandoc and a simple (hello-world-like)
project. I grouped the collected data by compilation stage into bullet
points, with notes:

   - ext-stg export
   Currently the serializer uses binary via the generic instance, which is
   not the fastest method. The store package via TH would be the fastest way
   of exporting the IR, but it is not among the foundation libraries. At the
   current stage of the project this is not an issue (binary is good
   enough); see the serializer sketch after this list.
   - liveness datalog fact generator
   simple: 2.107545216s
   pandoc: 18.431660236s
   - liveness datalog facts size:
   simple: 11 MB
   pandoc: 186 MB
   - liveness analysis (CSV) result size:
   simple: 92 KB
   pandoc: 14 MB
   - liveness analysis memory usage:
   simple: Maximum resident set size: 15 MB
   pandoc: Maximum resident set size: 109 MB
   - liveness analysis run time:
   simple: 0.118736736s
   pandoc: 2.322970868s
   - link time codegen via GHC:
   simple: 15.683492995s
   pandoc: 790.070061268s
   memory usage: around 100 MB (typical GHC module compilation footprint)
   - gen-exe is the whole stg program compiler driver:
   simple:
   Maximum resident set size: 186 MB
   Elapsed (wall clock) time: 18.63 sec
   pandoc:
   Maximum resident set size: 401 MB
   Elapsed (wall clock) time: 13 min 35 sec
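
For reference, a minimal sketch of the binary-via-Generic serialization
approach mentioned in the ext-stg export bullet above; the Module type is
a hypothetical, heavily simplified stand-in for the real Ext-STG AST:

    {-# LANGUAGE DeriveGeneric #-}
    module ExtStgSerialize where

    import Data.Binary (Binary, decodeFile, encodeFile)
    import GHC.Generics (Generic)

    -- Hypothetical stand-in for the Ext-STG module type; the real
    -- external-stg AST is much richer.
    data Module = Module
      { modName     :: String
      , topBindings :: [(String, [String])]
      } deriving (Show, Generic)

    -- The methods default to the Generic-based implementation, which is
    -- convenient but not the fastest option, as noted above.
    instance Binary Module

    writeModule :: FilePath -> Module -> IO ()
    writeModule = encodeFile

    readModule :: FilePath -> IO Module
    readModule = decodeFile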

GHC-WPC Ext-STG IR serializer

   - without ext-stg export -O0 from scratch with deps (full world)
   simple:  5.455 sec
   pandoc:  8 min 54.363 sec
   - with ext-stg export -O0 from scratch with deps (full world)
   simple:  5.576 sec
   pandoc:  12 min 50.101 sec



>    - I love the 25% reduction in binary size. In fact I’m surprised it
>    isn’t bigger.
>
I used a simple and fast method for the link-time dead code elimination.
It computes the references of the modules' top-level functions
transitively, and only the reachable ones are passed to the codegen. So it
eliminates neither local closures nor dead case alternatives.
Check the analysis source code to see how simple the implementation is:
ext-stg-liveness.dl
<https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/master/external-stg-compiler/datalog/ext-stg-liveness.dl>
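
The same reachability idea, expressed as a minimal Haskell sketch with
hypothetical types (the real analysis is the Datalog program linked
above):

    module LiveTopLevel where

    import qualified Data.Map.Strict as Map
    import qualified Data.Set as Set

    type Name = String

    -- For each top-level binding, the set of top-level names it references.
    type RefGraph = Map.Map Name (Set.Set Name)

    -- Transitive reachability from the exported roots; only the bindings
    -- in the result set are handed to the code generator.
    liveTopLevel :: RefGraph -> Set.Set Name -> Set.Set Name
    liveTopLevel refs roots = go Set.empty (Set.toList roots)
      where
        go seen [] = seen
        go seen (n:ns)
          | n `Set.member` seen = go seen ns
          | otherwise =
              let next = Map.findWithDefault Set.empty n refs
              in  go (Set.insert n seen) (Set.toList next ++ ns)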


I also implemented a much more sophisticated control flow analysis, which
could eliminate much more code, e.g.:

   - constructor data fields
   - case alternatives
   - function parameters
   - local closures

It is future work to integrate this precise CFA into the external STG
compiler pipeline. My initial goal was to create the simplest and fastest
working whole-program compiler pipeline for GHC, so I prioritised my plans
and tasks accordingly.
I'd like to mention that the current pipeline design is the result of
iterative development. The first version was written entirely in Haskell.
It comprised a points-to analysis and the simple liveness analysis. The
Haskell version of these analyses worked fine for small programs, but the
memory and runtime properties were bad. Of course I tried to add some
strictness annotations, which helped a bit, but it was not good enough.
Then I rewrote the points-to analysis in C++, which resulted in a much
better memory footprint, but my fixed-point evaluator was still naive, so
it was not fast enough. Also, the C++ implementation increased the
development and maintenance complexity, which I did not like. Luckily I
found the Souffle Datalog compiler, which fits my needs exactly. It is
super easy to write static analyses in Datalog, and it does not compromise
on performance either. I'd like to write all these parts in Haskell, but
IMO that is not possible yet. Hopefully this will change in the future;
maybe LoCal can make a big difference. (http://recurial.com/pldi19main.pdf)


>    - Why are you using STG?  It’s mostly untyped – or at least much less
>    strongly typed than Core.  It has lots of restrictions (like ANF) that Core
>    does not.   Indeed I think of STG as a little lily-pad to alight on in the
>    hop from Core to Cmm.   Maybe your entire setup would work equally well
>    with Core, provided you can serialise and deserialise it.
>
To me every IR layer is an API for the compiler. The long-term existence
of the different IRs must have a reason. As I understand it, GHC Core is
about polymorphic reasoning, while the STG IR is about operational
semantics, with an explicit notion of evaluation, allocation of
closures/data constructors, and pattern matching. Thirdly, Cmm is the
language of the runtime system, expressing the lowest-level memory
operations.

I chose STG because I am interested in analysing the operational
semantics of Haskell programs. I see the STG IR as an entry/exit point for
the GHC pipeline. The actual IR data type does not matter that much,
because most researchers or developers will have their own needs regarding
the IR conventions. E.g. I'd like to write analyses in Souffle/Datalog,
but both GHC Core and GHC STG are unsuitable for direct Datalog
translation, because not every expression has a name, let alone a unique
one. But this is totally fine. It is impossible to design the perfect IR;
instead, everyone should do a customised conversion for their project.
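
To illustrate what such a customised conversion might look like, here is a
sketch over a toy expression type (not GHC's Core or STG) that gives every
sub-expression a unique identifier, so that Datalog facts can refer to it:

    module NameExprs where

    import Control.Monad.State (State, evalState, get, put)

    -- Toy stand-in for an IR expression type.
    data Expr = Var String | App Expr Expr | Lam String Expr

    -- The same expression with a unique identifier on every node.
    data NamedExpr = NamedExpr Int NamedNode
    data NamedNode
      = NVar String
      | NApp NamedExpr NamedExpr
      | NLam String NamedExpr

    -- Assign fresh identifiers in a pre-order traversal.
    nameExpr :: Expr -> State Int NamedExpr
    nameExpr e = do
      i <- get
      put (i + 1)
      node <- case e of
        Var v   -> pure (NVar v)
        App f x -> NApp <$> nameExpr f <*> nameExpr x
        Lam v b -> NLam v <$> nameExpr b
      pure (NamedExpr i node)

    -- Example: every node of (\x -> x) y now has a stable, unique key.
    example :: NamedExpr
    example = evalState (nameExpr (App (Lam "x" (Var "x")) (Var "y"))) 0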

IMO STG is implicitly typed; its type system is the RuntimeRep kind
parameter of TYPE, which describes a value's runtime representation. And
this is exactly what I'm interested in, because the Core type system is
not expressive enough for the optimizations that I'd like to implement. In
fact, GHC has the unarise pass that flattens unboxed tuples, and it is
implemented on STG because it would not type check in Core. There are also
Cmm and LLVM transformations that are semantically valid but not typeable
in GHC Core.
Ideally it would be great to have a single IR that could express both Core
and Cmm properties in a typed way, but that would require a much stronger
(probably dependent) type system. I find this idea a promising research
direction.
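
For readers unfamiliar with unarise, a small source-level illustration of
the kind of function it rewrites; conceptually, after unarise the
unboxed-tuple argument below is passed as two separate Int# arguments
(this is an illustration of the idea, not GHC's actual STG output):

    {-# LANGUAGE MagicHash, UnboxedTuples #-}
    module UnariseDemo where

    import GHC.Exts

    -- At the Core level this function takes a single unboxed-tuple
    -- argument; after the unarise pass the STG version takes the two
    -- Int# components as separate arguments.
    sumPair :: (# Int#, Int# #) -> Int#
    sumPair (# x, y #) = x +# y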

Indeed, GHC-WPC as a whole could support Core as well; it would just be an
additional export operation at the same place in the GHC and Cabal source
code where STG gets exported. The same would be true for the Cmm IR.


>    - Moreover, we **already** have a fast serialiser and deserialiser for
>    Core – the stuff we use for interface files.  So maybe you could re-use
>    that … no need for pretty-print and parse.
>
Ideally there would be a binary import/export facility and a pretty
printer for every IR layer. The speed might matter for the main use cases,
but for the experimental cases it is not an issue at all. Serializers
would allow us to use GHC's internal components at a finer granularity and
from arbitrary programming languages. Having a binary IR (de)serializer
without a specification or stability guarantees is still much better than
having nothing.


>    - You say “That would mean a significant conceptual shift in the GHC
>    compiler pipeline, because heavy optimizations would be introduced at the
>    low level IRs beside GHC Core.”  Fair enough, but what I’m missing is the
>    *rationale* for doing heavy opts on STG rather than Core.
>
The Core type system is not expressive enough to reason about memory
layout. Of course STG's RuntimeRep type system cannot describe the shape
of the heap either, but at least it can be used as a starting point in an
experiment.
Haskell and pure functional programming formed my views on how to automate
programming tasks and how to utilise the type system for specific domains.
Haskell as a pure FP language also shaped how I see compilers. The
traditional view is that at the top of the compilation pipeline there is
the high-level language with a rich and rigorous type system, and during
the compilation process (lowering) every lower-level IR has a weaker type
system that encodes slightly less semantic information. At the level of
assembly language or machine code, nothing is left of the original types
and semantic information. Instead of this traditional approach, I can
imagine a pipeline that preserves high-level semantic information during
the lowering process. But for this the low-level IRs would require
increasingly more expressive type systems. I'd like to explore this idea
further, and GHC would be a good platform for such an experiment. I also
believe that this direction is the future of compilers and programming
languages.


>
>    - Apart from (a) dead code, and (b) GRIN, do you have ideas in mind
>    for what we could do with WPC?
>
I have lots of ideas regarding the usage of the whole program IR.

   - GHC-WPC's exported STG IR could be useful for any backend or codegen
   project.
   Existing cross-compilation projects like Asterius and GHCJS could use
   the External STG as their input IR. The benefit of this approach is that
   alternative Haskell compiler backends would not need to be supported by
   Cabal directly; instead, they would rely entirely on the External STG IR
   and the exported linker metadata file.

   - External STG could allow using GHC as a frontend and backend, with
   additional support for custom IR transformations.
   E.g. maybe the Gibbon <http://iu-parfunc.github.io/gibbon/> language
   project could compile real-world Haskell programs via Ext-STG. Or maybe
   Purescript or Idris would add GHC with its RTS as a new target. STG's
   weaker type system is actually beneficial for programming languages that
   differ from Haskell in evaluation model or type system.
   Maybe I'll add external Core or external Cmm support as well.

   - The optimizer part of the Intel Haskell Research Compiler is called the
   Functional Language Research Compiler (FLRC
   <https://github.com/IntelLabs/flrc#the-functional-language-research-compiler->),
   and its IR is called MIL. Its source is on GitHub and it still compiles.
   I'd like to plug the FLRC vectorising and memory layout optimizer into GHC.
   FLRC's RTS was not optimized for laziness; however, it would be interesting
   to see how it would perform with GHC's fine-tuned GC and RTS.

   - I also plan to implement an STG interpreter with FFI support that
   could run any Haskell program.
   With such an interpreter I could implement runtime control flow tracing,
   or a stop-the-world debugger that could observe and visualize the runtime
   heap values. I'd like to track the source origin and lifetime of run-time
   values, and use that information to build a better intuition of a Haskell
   program's runtime behaviour. I hope that it will lead to important
   insights for improving performance. I'm optimistic about this idea
   because AFAIK it has not been done yet. Haskell's GC removes unreachable
   values from memory, but those values might have been dead for a long time
   before becoming unreachable. According to the As Static As Possible
   Memory Management thesis, reachability is too conservative an
   approximation of the actual liveness.

   - I’d like to do partial defunctionalization on the STG IR.
   Regarding the engineering complexity of an optimizing compiler I believe
   it is the easiest to start with a whole program analysis and compilation.
   It is easier to construct a static analysis as a data flow analysis, then
   later with the insights the same analysis could be formulated as a type
   system. Which might enable compositional analysis, and therefore
   incremental compilation.

I see GHC-WPC as a framework that would allow us to explore these ideas.

I often hear from researchers and Haskell experts that whole-program
compilation does not work. They seem very confident in this claim, and I'd
like to understand why. I have read several papers on the topic and also
checked quite a few compilers' source code. LLVM, GCC, and Visual C++ all
do LTO. The Intel Haskell Research Compiler results, MLton, and Boquist's
GRIN all support the idea of whole-program analysis and compilation. At
the least, it seems reasonable to continue the research in this direction.
I wonder whether those who concluded against WPA read the same papers and
projects.

Do you think that I am on the wrong track?
Am I chasing unimportant or impossible things?

Important papers:

   - The Intel Labs Haskell Research Compiler
   http://www.leafpetersen.com/leaf/publications/hs2013/hrc-paper.pdf
   - Measuring the Haskell Gap
   http://www.leafpetersen.com/leaf/publications/ifl2013/haskell-gap.pdf
   - LoCal: A Language for Programs Operating on Serialized Data
   http://recurial.com/pldi19main.pdf
   - On fast large-scale program analysis in Datalog
   https://souffle-lang.github.io/pdf/cc.pdf
   - ASAP: As Static As Possible memory management
   https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-908.pdf
   - Pushdown Control-Flow Analysis for Free
   https://arxiv.org/abs/1507.03137

Regards,
Csaba Hruska

On Mon, Jun 15, 2020 at 11:34 AM Simon Peyton Jones <simonpj at microsoft.com>
wrote:

> I’ve always thought that whole-program compilation has the possibility of
> doing optimisations that are simply inaccessible without the whole program,
> but been daunted by the engineering challenges of making WPC actually
> work.  So it’s fantastic that you’ve made progress on this.  Well done!
>
>
>
> Questions that come to mind (to be understood in the context of the above
> enthusiasm):
>
>    - If you compile a program that depends on (say) lens, you get a lot
>    of code.  Dead-code elim will drop lots, perhaps, but you start with
>    everything.  So what do memory footprints and compile times look like when
>    you do WPC?   Remembering that people often complain about GHC’s footprint
>    when compiling a *single* module.
>
>
>
> Also, WPC means that instead of just linking to precompiled libraries, you
> have to recompile (parts of) them.  What does that do to compile times?
>
>
>
>    - I love the 25% reduction in binary size. In fact I’m surprised it
>    isn’t bigger.
>
>
>
>    - Why are you using STG?  It’s mostly untyped – or at least much less
>    strongly typed than Core.  It has lots of restrictions (like ANF) that Core
>    does not.   Indeed I think of STG as a little lily-pad to alight on in the
>    hop from Core to Cmm.   Maybe your entire setup would work equally well
>    with Core, provided you can serialise and deserialise it.
>
>
>
>    - Moreover, we **already** have a fast serialiser and deserialiser for
>    Core – the stuff we use for interface files.  So maybe you could re-use
>    that … no need for pretty-print and parse.
>
>
>
>    - You say “That would mean a significant conceptual shift in the GHC
>    compiler pipeline, because heavy optimizations would be introduced at the
>    low level IRs beside GHC Core.”  Fair enough, but what I’m missing is the
>    *rationale* for doing heavy opts on STG rather than Core.
>
>
>
>    - Apart from (a) dead code, and (b) GRIN, do you have ideas in mind
>    for what we could do with WPC?
>
>
>
> Thanks
>
>
>
> Simon
>
>
>
>
>
>
>
>
>
> *From:* ghc-devs <ghc-devs-bounces at haskell.org> *On Behalf Of *Csaba
> Hruska
> *Sent:* 14 June 2020 13:46
> *To:* Alexis King <lexi.lambda at gmail.com>
> *Cc:* GHC developers <ghc-devs at haskell.org>
> *Subject:* Re: Introducing GHC whole program compiler (GHC-WPC)
>
>
>
> Hi,
>
>
>
> I thought about the GHC-LTO project name before, but it would not be an
> accurate description though. The GHC-WPC in its current state is about
> exporting STG + linker info for later processing, either feed it back to
> GHC backend or to a third party pipeline. It depends what the
> user/researcher wants, the point is that GHC-WPC solves the IR export part
> of the issue. It is the external stg compiler that implements a (simple)
> whole program dead function elimination pass that I implemented as a proof
> of concept to show the new possibilities GHC-WPC opens up. But I plan to do
> much more optimization with sophisticated dataflow analyses. I.e. I have a
> fast and working implementation of control flow analysis in souffle/datalog
> that I plan to use to do more accurate dead code elimination and partial
> program defunctionalization on the whole program STG IR. In theory I could
> implement all GRIN optimizations on STG. That would mean a significant
> conceptual shift in the GHC compiler pipeline, because heavy optimizations
> would be introduced at the low level IRs beside GHC Core. I'd like to go
> even further with experimentation. I can imagine a dependently typed Cmm
> with a similar type system that ATS (
> http://www.ats-lang.org/MYDATA/VsTsVTs-2018-10-28.pdf)
> has. I definitely would like to make an experiment in the future, to come
> up with an Idirs2 EDSL for GHC RTS heap operations where the type system
> would ensure the correctness of pointer arithmetic and heap object
> manipulation. The purpose of GHC-WPC in this story is to deliver the IR for
> these stuff.
>
>
>
> Beside exporting STG IR, the external STG compiler can compile STG via
> GHC's standard code generator. This makes GHC codegen/RTS available as a
> backend for programming language developers. I.e. Idris, Agda, Purescript
> could use GHC/STG/RTS as a backend with all of its cool features.
>
>
>
> So these are the key parts of my vision about the purpose and development
> of GHC-WPC. It is meant to be more than a link time optimizer.
>
>
>
> Cheers,
>
> Csaba
>
>
>
> On Sat, Jun 13, 2020 at 10:26 PM Alexis King <lexi.lambda at gmail.com>
> wrote:
>
> Hi Csaba,
>
>
>
> I originally posted this comment on /r/haskell
> <https://www.reddit.com/r/haskell/comments/h7t8wr/introducing_ghc_whole_program_compiler_ghcwpc/fuqdnye/> before
> I saw you also sent this to ghc-devs. I’ve decided to reproduce my comment
> here as well, since this list probably has a more relevant audience:
>
>
>
> I want to start by saying that I think this sounds totally awesome, and I
> think it’s a fantastic idea. I’m really interested in seeing how this
> progresses!
>
>
> I do wonder if people might find the name a little misleading. “Whole
> program compilation” usually implies “whole program optimization,” but most
> of GHC’s key optimizations happen at the Core level, before STG is even
> generated. (Of course, I’m sure you’re well aware of that, I’m just stating
> it for the sake of others who might be reading who aren’t aware.)
>
>
> This seems much closer in spirit to “link-time optimization” (LTO) as
> performed by Clang and GCC than whole program compilation. For example,
> Clang’s LTO works by “linking” LLVM bitcode files instead of fully-compiled
> native objects. STG is not quite analogous to LLVM IR—GHC’s analog would be
> Cmm, not STG—but I think that difference is not that significant here: the
> STG-to-Cmm pass is quite mechanical, and STG is mostly just easier to
> manipulate.
>
>
> tl;dr: Have you considered naming this project GHC-LTO instead of GHC-WPC?
>
>
>
> Alexis
>
>
>
> On Jun 12, 2020, at 16:16, Csaba Hruska <csaba.hruska at gmail.com> wrote:
>
>
>
> Hello,
>
>
>
> I've created a whole program compilation pipeline for GHC via STG.
>
> Please read my blog post for the details:
>
> Introducing GHC whole program compiler (GHC-WPC)
> <https://www.patreon.com/posts/introducing-ghc-38173710>
>
>
>
> Regards,
>
> Csaba Hruska
>
>