jhc vs ghc and the surprising result involving ghc generated assembly.

Jan-Willem Maessen jmaessen at alum.mit.edu
Wed Oct 26 12:24:14 EDT 2005


Nice analysis.  I indeed found with phc that shadow stack references  
absolutely killed performance, and I aggressively cached stack  
locations in locals, spilling to stack only when GC information  
needed to be accurate.  [There was a giant infrastructure to save  
only live data to stack, but we won't go into that now as it was the  
source of almost all the codegen bugs...]

On Oct 26, 2005, at 5:43 AM, John Meacham wrote:

> here is the C code that jhc generates. (As an aside, I am very  
> proud of how
> readable and how much structure the jhc generated C code preserves  
> of the
> original haskell. it's a small thing, perhaps only implementors  
> appreciate it,
> but I am glad I spent the time needed to do so.)
>

This makes a big difference.  The phc compiler even put comments in  
the code so that I could figure out what came from where.


>             v99 = fWXAXDfMainXDfac(v97, v98);
>             return v99;
> ...
> notice that besides being a bit verbose and using a tailcall,
>

I'm impressed that gcc found this.  It's definitely living a bit  
dangerously, and your suggestions below for self tail call handling  
are the ones I found most effective.  (They also allowed me to bypass  
some prologue garbage, since phc used a one-C-function-per-Haskell- 
function model with internal resumption points.)  Non-self tail calls  
I was careful to compile to:
   return f(...);
I expect from the above that gcc does better at spotting tail calls now.


> furthermore gotos and labels are very problematic for gcc to  
> optimize around.
> for various tiresome reasons gcc cannot perform (most) code motion
> optimizations across explicit labels and gotos, especially when  
> they deal with
> the global register variables and memory stores and loads. ...
>
> there are a couple of things we can do to mitigate these problems:
>
> get rid of indirect jumps whenever possible.
>
> use C control constructs rather than gotos.
>

"for" loop introduction would be especially nice, but a bit tricky in  
practice I fear (requiring a game of "spot the induction variable").


> A couple simple rules seem to help greatly.
>
> * turn anything of the form JMP_((W_)&self) where self is oneself  
> into a goto
> that gotos a label at the beginning of the function.
>

Or better yet, wrap the whole function in

do {
} while (1);

and replace "JMP_" by "continue".  This avoids the troubles with goto  
which John mentioned above.  It made a difference for phc, at least.   
Of course, if you can introduce loops elsewhere you might get  
yourself into trouble with this solution.


> * do simple pattern matthing on the basic blocks to recognize where  
> C control
> constructs can be placed.
>
> for instance turn
>
> if (x) { goto  y; }
> blah..
> baz..
> JMP_(foo)
>
> into
>
> if (x) { goto  y; } else {
> blah..
> baz..
> JMP_(foo)
> }
>
> extending the else to after the next jump or goto.
>

I'm surprised this actually helps, I must admit.


> * getting stack dereferences out of your loops.
>

I recommend caching stack references in C locals where possible, but  
it's tricky to get this right if loop bodies include embedded  
function calls.  Last I checked this wasn't an issue for GHC, since  
function calls were CPS-converted and only tight call-free loops  
ended up in a single function anyway.


> in order to get rid of the unessesary memory accesess, we need to  
> either
>
> 1. fully convert it to use C control constructs, so gcc will do it  
> for us.
> (code motion and loop invarient being inhibited again by gotos)
>

As I recall, the "right" solution here is to compute dominator trees,  
and coalesce functions which are only tail called from their  
dominator into a single function.  Alas, I've forgotten where I saw  
this written up, but there are indeed papers on it.  Because it takes  
a bunch of effort on the part of the implementor, it'd be nice to see  
if its benefits are quantified.


> These should be straightforward to implement in the C code  
> generator. it also
> suggests we might want to try to use the native C calling  
> convention on leaf
> nodes that deal with unboxed values (so we get register passing and  
> return
> values for free) or taking groups of mutually recursive functions  
> and turning
> them all into one function with explicit jumps between them.
>

Making sure things are marked "static" and occur in an appropriate  
dependency order helps a bit here.  It might even be worth marking  
some stuff "inline" in the code generator, though that's shaky ground.

I actually considered making everything static and putting outwardly- 
visible functionality in an extern wrapper---effectively carrying  
worker-wrapper down to the C level.


> some random notes:
>
> the 3x-7x factor was tested on an i386, on x86_64 the margin is  
> much much
> greater for reasons that are still unclear.
>

Does x86-64 use a register-based calling convention by default?  If  
you compiled the i386 code using __regparm(2), would you see the same  
speed difference?

-Jan-Willem Maessen




More information about the Glasgow-haskell-users mailing list