jhc vs ghc and the surprising result involving ghc generated assembly.

Tue Nov 1 16:16:14 EST 2005

On Tue, 2005-11-01 at 19:03 +0100, Florian Weimer wrote:

> > Felix generates C with gotos. The result is FASTER
> > than native C using gcc 4.0 on x86_64.
> 
> Coincidence. 8-)

Possibly :)

> > Felix generated C(++) code -- compiled with same options:
> >
> > int FLX_REGPARM _i1860_f1301_ack( 
> > int _i1864_v1303_x, int _i1865_v1304_y)
> > {
> >   _us2 _i1867_v1799_ack_mv_74;
> >   _us2 _i1868_v1821_ack_mv_84;
> 
> _us2 is unsigned, correct?

No, actually it means 'unit sum', in this case the
sum of two units (hence us2). Felix has a special representation
for unit sums .. it uses an int. And of course, 

	1 + 1 = 2

is just another name for 'bool'. These two variables are
'mv's or 'match variables' -- the arguments of the two matches
which are generated by the 'if then elif else endif' sugar.
Match expressions are stuck into variables so they can be
destructed by pattern matching (even though in this case
there is none :)

> BTW, you shouldn't generate identifiers with leading underscores
> because they are reserved for the implementation.

I AM the implementation :)

Generated Identifiers start with underscores,
so they don't clash with arbitrary C code.

> > I have no real idea why the Felix generated C is faster.
> > Two guesses:
> >
> > (a) the two 'mv' variables declared at the top are optimised
> > away, so the Felix version is only using 3 words of stack.
> >
> > (b) the "parallel assigment in tail calls optimisation"
> > is saving one word on the stack (evaluating y before x
> > saves a temporary across the non-tail recursion).
> 
> Both variants do not use the stack.

This code evaluates y first, then x:

  _i1865_v1304_y = _i1860_f1301_ack(_i1864_v1303_x, _i1865_v1304_y-1 );
  _i1864_v1303_x = _i1864_v1303_x-1 ;

If you change the order you get this:

  {
     int tmp = _i1864_v1303_x;
    _i1864_v1303_x = _i1864_v1303_x-1 ;
    _i1865_v1304_y = _i1860_f1301_ack(tmp, _i1865_v1304_y-1 );
  }

which as written uses one extra word on the stack for 'tmp'
up until the closing } .. which means an extra word during
the recursive call. 

> It seems that the goto-based version leads to different static branch
> prediction results, which happen to be favorable. 

It has nothing to do with branch prediction. I know 
it is determined ENTIRELY by stack use. I rewrote the
assembler version many ways to try to find out how
to improve my compiler.. nothing made the slightest
difference except stack usage.

The performance relativities of various generated codes
depend only on the difference in stack pointers at the same place 
in the code, as a function of recursion depth.

This is obvious when you think about it: the CPU operations
(add, subtract, branch, etc) are orders of magnitude faster
than memory accesses -- including cache accesses -- and the
calculations in ackermann are trivial (add 1, subtract 1.. :)

Accessing the cache costs. Accessing RAM costs even more.
And I even drove this function so hard on my old i386
I filled all of RAM and pushed it into disk paging.

hehe .. just before publishing the latest Felix I had a look
at the performance graph and .. shock horror .. gcc and ocaml
were creaming it. I had accidentally disabled the optimisation
which replaced the creation of an C++ class object with
a C function. The latter not only has no 'this' pointer,
it also takes arguments 'one by one' instead of a 
reference to a tuple which is unpacked in the apply
method of the class .. thats a lot of extra words
on the stack. The graph you see, of course, has that
little problem fixed :)

Ackermann eats memory .. and doesn't do any other serious
work. That's the point of this function as a microbenchmark:
everything depends on efficient stack usage (and nothing
else).

I actually know how to optimise this function MUCH more:
all the way down to ONE word of stack. The first optimisation
eliminates x by unrolling -- x starts at 3 and is never incremented
(in the Shootout test, initial x is fixed at 3).

That eliminates one word. The second optimisation is secret <g>
but also eliminates one word at the cost of one register.

I guess the bottom line is: CPU performance is hard to predict,
and what optimisations gcc will do are also hard to predict.

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net