jhc vs ghc and the surprising result involving ghc generated
assembly.
Jan-Willem Maessen
jmaessen at alum.mit.edu
Wed Oct 26 12:24:14 EDT 2005
Nice analysis. I indeed found with phc that shadow stack references
absolutely killed performance, and I aggressively cached stack
locations in locals, spilling to stack only when GC information
needed to be accurate. [There was a giant infrastructure to save
only live data to stack, but we won't go into that now as it was the
source of almost all the codegen bugs...]
On Oct 26, 2005, at 5:43 AM, John Meacham wrote:
> here is the C code that jhc generates. (As an aside, I am very
> proud of how
> readable and how much structure the jhc generated C code preserves
> of the
> original haskell. it's a small thing, perhaps only implementors
> appreciate it,
> but I am glad I spent the time needed to do so.)
>
This makes a big difference. The phc compiler even put comments in
the code so that I could figure out what came from where.
> v99 = fWXAXDfMainXDfac(v97, v98);
> return v99;
> ...
> notice that besides being a bit verbose and using a tailcall,
>
I'm impressed that gcc found this. It's definitely living a bit
dangerously, and your suggestions below for self tail call handling
are the ones I found most effective. (They also allowed me to bypass
some prologue garbage, since phc used a one-C-function-per-Haskell-
function model with internal resumption points.) Non-self tail calls
I was careful to compile to:
return f(...);
I expect from the above that gcc does better at spotting tail calls now.
> furthermore gotos and labels are very problematic for gcc to
> optimize around.
> for various tiresome reasons gcc cannot perform (most) code motion
> optimizations across explicit labels and gotos, especially when
> they deal with
> the global register variables and memory stores and loads. ...
>
> there are a couple of things we can do to mitigate these problems:
>
> get rid of indirect jumps whenever possible.
>
> use C control constructs rather than gotos.
>
"for" loop introduction would be especially nice, but a bit tricky in
practice I fear (requiring a game of "spot the induction variable").
> A couple simple rules seem to help greatly.
>
> * turn anything of the form JMP_((W_)&self) where self is oneself
> into a goto
> that gotos a label at the beginning of the function.
>
Or better yet, wrap the whole function in
do {
} while (1);
and replace "JMP_" by "continue". This avoids the troubles with goto
which John mentioned above. It made a difference for phc, at least.
Of course, if you can introduce loops elsewhere you might get
yourself into trouble with this solution.
> * do simple pattern matthing on the basic blocks to recognize where
> C control
> constructs can be placed.
>
> for instance turn
>
> if (x) { goto y; }
> blah..
> baz..
> JMP_(foo)
>
> into
>
> if (x) { goto y; } else {
> blah..
> baz..
> JMP_(foo)
> }
>
> extending the else to after the next jump or goto.
>
I'm surprised this actually helps, I must admit.
> * getting stack dereferences out of your loops.
>
I recommend caching stack references in C locals where possible, but
it's tricky to get this right if loop bodies include embedded
function calls. Last I checked this wasn't an issue for GHC, since
function calls were CPS-converted and only tight call-free loops
ended up in a single function anyway.
> in order to get rid of the unessesary memory accesess, we need to
> either
>
> 1. fully convert it to use C control constructs, so gcc will do it
> for us.
> (code motion and loop invarient being inhibited again by gotos)
>
As I recall, the "right" solution here is to compute dominator trees,
and coalesce functions which are only tail called from their
dominator into a single function. Alas, I've forgotten where I saw
this written up, but there are indeed papers on it. Because it takes
a bunch of effort on the part of the implementor, it'd be nice to see
if its benefits are quantified.
> These should be straightforward to implement in the C code
> generator. it also
> suggests we might want to try to use the native C calling
> convention on leaf
> nodes that deal with unboxed values (so we get register passing and
> return
> values for free) or taking groups of mutually recursive functions
> and turning
> them all into one function with explicit jumps between them.
>
Making sure things are marked "static" and occur in an appropriate
dependency order helps a bit here. It might even be worth marking
some stuff "inline" in the code generator, though that's shaky ground.
I actually considered making everything static and putting outwardly-
visible functionality in an extern wrapper---effectively carrying
worker-wrapper down to the C level.
> some random notes:
>
> the 3x-7x factor was tested on an i386, on x86_64 the margin is
> much much
> greater for reasons that are still unclear.
>
Does x86-64 use a register-based calling convention by default? If
you compiled the i386 code using __regparm(2), would you see the same
speed difference?
-Jan-Willem Maessen
More information about the Glasgow-haskell-users
mailing list