penalty for proper tail recursion

Jan-Willem Maessen
Mon, 5 Nov 2001 11:43:34 -0500

Richard Uhtenwoldt <> asks:
> What is the penalty that compilers pay these days for allowing 
> "proper" tail-recursion (that does not grow the stack)?  

These days it's often little or none.  Here's what I know from working
very intensively on compiling (Eager) Haskell to GCC.

1) GCC has always done a fairly good job with self-tail-recursion.
   Oddly, I don't exploit this as I use a few tricks to do a tiny bit

2) GCC's latest version does a pretty good job with sibling call
   optimization (==tail-recursive calls to other C procedures).  I've
   worked exclusively with gcc 3.0 prereleases for the past 9 months
   and it has simplified implementation tremendously.

3) Sibling call optimization is not infallible (yet?), a fact I
   discovered mainly by poring over the assembly code for portions of
   my run-time system.  It seems to be good enough in practice, but
   there's a fallback mechanism in my implementation just in case.  I
   use the same mechanism to handle various other more common
   conditions, though, so I didn't consider this a great burden in
   practice.  It doesn't appear to actually kick in due to missed
   sibling call opportunities.

4) Writing your own GC means maintaining a "shadow stack": An explicit
   data structure which allows the pointers on the stack to be found.
   There are a number of ways of doing this, but all of them pretty
   much require that the *Haskell* compiler do the work of saving and
   restoring pointer variables around calls.  This is a drawback of
   implementing precise garbage collection on top of C.  Haskell
   compilers do enough allocation that conservative garbage collection
   isn't very attractive.

   I mention shadow stacks because they are the other (I would say
   dominant) source of overhead when compiling to C.  It subverts
   register and frame slot allocation, two things the C compiler
   should do well.

5) In GCC, shadow stacks can be efficient to pull off---compile without a
   frame pointer, and declare a register global to hold the shadow
   stack pointer.  Net change in number of available registers: none.
   But the loss in register/frame allocation is still annoying.

I've stuck with an implementation that could be turned into
bone-standard C99 code with a few days' work.  The performance
advantages of gcc (register globals, sibling call optimization, and
per-procedure control over x86 calling conventions being the big ones)
are such that I have no plans to put in the effort.  I would get a
working, but positively lousy, implementation.

My understanding is that ghc -fvia-C also depends on GCC for properly
tail-recursive output (otherwise they can use a trampoline as
described in your message).  They post-process the assembly code
output in order to replace tail-recursive calls with jumps.  This
approach is less portable but reliably eliminates tail calls.
The code is in GHC and wasn't hard to read when last I looked.  Note
that they *also* post-process the assembly code output to move info
tables into the code segment.  That code *is* a bit complicated, and
messages over the years suggest that people looking for one trick
often find themselves lost in the code for the other, as one program
does them both.

-Jan-Willem Maessen
Eager Haskell project