inside the GHC code generator

Thu Feb 23 06:58:43 EST 2006

Hello

last days i studied GHC code generator, reading -ddumps of all sorts,
GHC sources, various papers and John's appeals :)

what i carried from this investigation:

GHC has great high-level optimization. moreover, GHC now allows to
program STG almost directly. when i look at STG code i don't see
anything that can be done better - at least for these simple loops i
tried to compile. i see just unboxed variables, primitive operations
and simple loops represented as tail recursion. fine.

then STG compiled to the simplified C (called abstract C earlier and
quasi C-- now), what is next can be translated:

* to real C and then compiled by gcc
* to assembler by build-in simple C-- compiler
* to assembler by some external C-- compiler (at least it is
theoretically possible)

that is the "simplified C"? it's really generalized abstract assembler
of modern cpus. its most important feature for us is that all
parameters of STG functions are passed here via the stack. so, for
example, the following STG code:

factorial n r  =  if n=1  then r  else (factorial (n - 1) (n*r))

translated to something like this

    factorial:
        _s1BD = *(Sp + 0);
        if (_s1BD != 1) goto c1C4;     // n!=1 ?
        R1 = *(Sp + 4);
        Sp = Sp + 8;
        return (*(Sp + 0))();  // return from factorial
    c1C4:
        _s1BI = _s1BD * (*(Sp + 4));   // n*r
        _s1BF = _s1BD - 1;             // n-1
        *(Sp + 4) = _s1BI;
        *(Sp + 0) = _s1BF;
        goto factorial;

i rewrote for clarity STG code to its Haskell equivalent, the C-- to
the C, plus made a few comments.

is this code good? it depends :)  if we see this as the assembler
code, it is really bad - for just 2 instructions that perform real
work, there are 4 memory accesses, two jumps and comparison.
unfortunately, this translation is what GHC really does when it is
called in "-fasm" mode. now you should realize why even strict Haskell
code compiled to such slow executables.

well, there is also gcc path ("-fvia-C", auto-enabled by "-O"). but
it's only several percents better (as said in GHC docs). why this
shining C compiler can't optimize such trivial code?

1) because it just not asked. you can enable gcc optimization by
adding "-optc-O6" to the cmdline, but this leads to compilation errors
on part of modules. it seems that gcc optimization is not compatible
with "evil mangler" that is the ghc's own optimization trick.

2) moreover, enabling gcc optimization seems to make noticeable
improvements only on tight loops like this factorial calculation. and
even in this case asm code produced is far worse than for the normal C
program that uses explicit loop (see enclosed fac.c and fac-c.s, which
contains factorial functions manually written in C using tail
recursion and explicit loop):

factorial:
        movl    (%ebp), %edx
        cmpl    $1, %edx
        je      L6
        movl    4(%ebp), %eax
        imull   %edx, %eax
        decl    %edx
        movl    %edx, (%ebp)
        movl    %eax, 4(%ebp)
        jmp factorial

as you can see here, gcc isn't unrolled the loop and retained all the
stack<->register moves. why? the short answer is what gcc just don't
know that data pointed by the Sp pointer is non-volatile - they can't
be changed in middle of the program by some async computation or
exception and don't required to be updated immediately. may be there
is some way to tell gcc this? this would immediately SUBSTANTIALLY
improve all ghc code generation

the long answer is: are you ever heard promises that gcc is best
cpu-independent assembler? no? and you know why? because gcc is not
cpu-independent assembler. gcc was strongly optimized to make
efficient asm from the code usually written by the C programmers. but
code generated by ghc has nothing common with it. so we are stay with
all these register-memory moves, non-unrolled loops and all other
results of naive compilation. gcc is just don't know how to
efficiently manage such code!

so now we have rather strange situation: ghc generates not-so-bad
"abstract asm" code, but there is just no tool to compile this code to
efficient "concrete asm"! although that is perfectly possible and i
can imagine the translation that uses registers to store these two
internal variables, unrolls the loop and schedule instructions to
maximally load the modern superscalar cpu.

at this moment i recalled existence of C-- and now, i think,
understand the whole history:

- long ago in the 1992 Simon PJ has developed this abstract asm and
rules for translation of STG to this abstract asm (you can find
details in spineless-tagless-gmachine.ps.gz). this "portable asm" was
represented at practice by low-level C code.

- then he stand waiting for efficient compilation of ghc-generated C
code by gcc or some other C compiler

- in 1999 he realized that this mountain will not walk to him and
started C-- project. its goal was exact to implement portable
assembler and in particular make at last efficient compilation of
ghc-generated code

- but this project fails. at least enough time is gone and i don't
hear that C-- can now be used as better alternative to gcc.

so now we have what i stated above: ghc produces good portable
assembler code. it is so good that there is no way to compile it
efficiently :)))

now what can be done in this direction:

we can either improve "STG -> portable asm" or the "portable asm ->
asm" translation. second alternative can be implemented either as

* improving GHC's "-fasm" code generator. It is the easiest way, but
we soon will turn to limited number of registers and requirement to do
global code analysis. moreover, this way don't build fastest
executables at this moment, so we will need some efforts just to
overtake gcc. and going this way we will reimplement such standard
thing as optimizing compiler just for GHC, that is useless waste of
time. it may be better to make more widely used optimizing compiler
and this decision turns us to second alternative:

* adding appropriate optimizations to the existing C-- compiler,
namely Quick C--. it should be not much more complicated than previous
way, as far as we will try to optimize only typical ghc-generated
code. that way we may also help other Quick C-- -based language
implementations, such as Lua. but on other side this way rises two
questions: first, C-- and especially "portable asm" generated by the
GHC is rather low-level language. as you should know, lower level of
language makes it hard or even impossible to optimize it as good as
higher level one. This makes a serious barrier on ultimate
optimization level possible on this way, but at this moment it is more
theoretical question. The more practical question is how long we will
wait for implementing in QC-- all the optimizations that are already
present in gcc, such as optimization of registers usage, loop
unrolling, instruction scheduling, global inlining, sse support,
run-time selection between Centrino-optimized and Pentium4-optimized
code and more? Using C-- as portable assembler looking good on the
paper, but can we really compete with gcc?? Moreover, even gcc itself
can't compete with Intel C compiler! so, going this way we will never
reach the speed of C programs. and this conclusion turns us to the 3rd
alternative:

* use the FULL C/C++ language as "portable asm". it means - map STG to
the natural C++ language idioms instead of the current translation.
just try to compare code generated by gcc for the above-mentioned C
code, fWXAXDfMainXDfac function in the included fac.c, and fac
function in the same file. first is the current ghc translation,
second is jhc's translation and third is my manual translation of the
same STG function to idiomatic C. as we can see there is a big
difference - the more close our intermediate code to idiomatic C the
better resulting code generated by the gcc.

so i propose the following, rather straightforward STG->C++
translation for the "unboxed" functions:

* C++ calling stack should be used as much as possible
* parameters are passed in C++ way: "factorial(int n, int r)"
* multiple results can be returned via C++ pair<a,b> template, if this is
efficiently implemented in gcc
* recursive definitions translated into the for/while loops if possible
* groups of mutually recursive definitions should be also "naturally"
translated as much as possible
* functions marked INLINE in Haskell code should be marked the same in
C++ code

next source of inefficiency is the polymorphic and higher order
functions. it will be great if such functions will be
inlined/specialized more aggressively than other functions to reach
"concrete" code as often as possible. this way Haskell's high-level
features will cost us nothing in most cases, like templates in C++,
which are free in terms of execution time. It should also help us to
drop most of the INLINE pragmas in libs: small functions will be
inlined anyway, large functions will be specialized to concrete code,
and if user wants to maximally optimize some concrete function in his
program, it is better to request optimization of this concrete
function (20% of our code performs 80% of work, but only application
programmer knows what invocation of "sort" need to be inlined and what
used just to order 2 values)

next source of inefficiency is the lazy evaluation. i propose:

* if some function has strict parameters, then its worker part should
accept this parameters in unboxed form (it is already implemented) and
wrappers for this worker should be inlined at call places. it is the
same requirement as for polymorphic functions

* maximally speed up using of already evaluated values. instead of
unconditional jumping to the closure-computation function just test
this field and continue computation if it contains NULL (indicating
that value is already in WHNF). This change will make time penalty of
boxing data structures significantly, at least 10 times, smaller

* evaluate boxed values that are not in WHNF using C (hardware) stack
if possible. this will allow us to free the ebp register. if this is
impossible, then "two stacks" scheme can be used - hardware stack and
"natural" C recursion used for unboxed values and ebp-based explicitly
managed stack used for boxed values

* at this moment using of I/O exceptions in ghc leads to generation of
boxed datastructures. it will be great to use C/C++ exception handling
abilities (setjmp or try) to make I/O exceptions "free" in terms of
speed

* in addition to ability to define strict fields, allow to define
strict data structures (say, lists) and strict types/structures of
functions parameters and results. that is a long-standing goal, but
the "![!Int] -> !Int" function can be used inside higher-order code a
lot faster than "[Int] -> Int" one. the ultimate goal is to allow
Haskell to be as fast as ML dialects in every situation and this means
that laziness should be optional in every place. this will also
allow to make efficient emulation of C++ virtual methods

Seymore Cray once said that in order to develop fast vector processor
you should start with making the fast scalar one. the same applies to
Haskell - although it's lazy language, to make its fast implementation
we need first to optimize strict, "unboxed" code. This will allow us:

* to compete with C - every algorithm that can be expressed in C, can
be expressed in Haskell with the same complexity and run with the same
speed. this includes all the shootout-type competition and any other
strict algorithms

* to rewrite speed-critical parts of any Haskell program in the strict
manner and get maximal possible speed without using FFI and
marshalling data structures between C and Haskell

* to make solid backend for the ghc's "strictifying" optimizations

I personally can work on the first part of my proposals - making
efficient translation from STG to "natural C/C++". this should give
significant improvements for small strict loops. i also think that the
following things can be implemented rather easily and should give
significant speedup: 
* fix problems with using "-optc-O6"
* mark Sp variable in some way to avoid excessive moves between stack
and registers
* make evaluation of boxed values that are already in WHNF as fast as
possible

Full implementation of these proposals will lead to the following:

* for the simple imperative algorithms the C-quality code will be
generated. this means that such things as matrix libraries or game
physics engines written in pure Haskell will have just the same speed
as their C++ equivalents. is not this that we wait for many years? :)

* "real-world programs" like the darcs or pugs that mostly use strict
computations, will get significant raise in speed, and become close to
speed of the similar C++ projects that use templates and classes to
express their complex functionality

* "high-school algorithms" that involve large number of lazy
computations, should also get significant speedup. even in such
algorithms most computations, imho, use data in WHNF and therefore
will go faster

-- 
Best regards,
 Bulat                          mailto:Bulat.Ziganshin at gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fac-c.s
Type: application/octet-stream
Size: 1387 bytes
Desc: not available
Url : http://www.haskell.org//pipermail/glasgow-haskell-users/attachments/20060223/441bee34/fac-c.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fac.c
Type: application/octet-stream
Size: 590 bytes
Desc: not available
Url : http://www.haskell.org//pipermail/glasgow-haskell-users/attachments/20060223/441bee34/fac.obj