inside the GHC code generator

Simon Marlow simonmarhaskell at gmail.com
Tue Feb 28 04:56:00 EST 2006


Bulat Ziganshin wrote:

>>>* C++ calling stack should be used as much as possible
>>>* parameters are passed in C++ way: "factorial(int n, int r)"
> 
> 
> SM> This is unworkable, I'm afraid.  Go and read Simon's paper on the STG:
> 
> SM>    http://citeseer.ist.psu.edu/peytonjones92implementing.html
> 
> SM> there are two main reasons we don't use the C stack: garbage collection
> SM> and tail calls.  What do you plan to do about GC?
> 
> The minimalistic solution I see: just use two stacks. Unboxed values
> are passed using the C conventions, and the C compiler will be able to
> optimize use of this stack; boxed values are passed using the current
> Sp[] machinery, so the GC will not need to change.

How do you propose to do thread switching?

Incidentally, GHC had two stacks a long time ago (before 4.00); I rewrote 
the runtime and code generator to replace them with one stack.  It was 
about six months' work, if I remember correctly.  I think you misused the 
words "just" and "minimalistic" :-)

I know gcc does (some) tail-call optimisation these days.  You don't get 
any *guarantee* that a call will be implemented as a tail call, though, 
and the conditions are quite complex.  The gcc folks treat it as an 
optimisation rather than a fundamental part of the operational 
semantics, which is what you need for Haskell.

> SM> This is called semi-tagging, it was tried a long time ago.  Certainly 
> SM> worth trying again, but I very much doubt the gains would be anything 
> SM> like you claim, and it increases code size.
> 
> We will replace two unpredictable jumps with one that will not even be
> executed if the value is in WHNF. At the least, I'm very displeased by
> the slowness of lazy evaluation in GHC: for example, my binary
> serialization package writes 60 MB/s working with unboxed arrays but
> only 20 MB/s working with lists.

You need to *try* these things and measure the difference.  Quite often 
I'm surprised by the actual results I get from an optimisation that 
looks worthwhile on paper.  Today's processors, with their 3-level 
caches, are complicated beasts.

Semi-tagging increases code size, so unless you can make some 
significant gains to make up for it, you've lost already.

Let me give you another example: GHC puts info tables next to the entry 
code for a closure.  On paper this looks like it might be a bad idea: it 
pollutes the instruction cache with data that is rarely accessed during 
execution.  It's hard to arrange (this is one thing the mangler does), 
so we'd like to avoid it if possible.  However, doing this eliminates an 
indirection when entering a closure, reduces the size of info tables by 
one word, and reduces code size.  These effects together are more 
beneficial than the code-cache-pollution effect; last time I measured 
it, the difference was ~10%.  (Turn off TABLES_NEXT_TO_CODE and try it 
yourself.)

> SM> There are other schemes, including this one that we particularly like:
> SM> use a spare bit in a pointer to indicate "points to an evaluated value".
> SM>   Since there are two spare bits, you can go further and use the 4 
> SM> values to mean:
> 
> SM>    0: possibly unevaluated
> SM>    1: evaluated, tag 0
> SM>    2: evaluated, tag 1
> SM>    3: evaluated, tag 2
> 
> SM> I believe Robert Ennals tried this as part of his spec eval 
> SM> implementation, but IIRC he didn't get separate measurements for it (or
> SM> it didn't improve things much).  Note that this increases code size too.
> 
> To be exact, we would have only 2^30-1 valid pointers; all the other
> values are ready to use! :)

True, but I'm assuming you often want to store a pointer in addition to 
the tag (e.g. for a cons cell).  For enumerations, you're quite right 
that the whole word can be used for the tag, as long as it is 
distinguished from a pointer.

Cheers,
	Simon


More information about the Glasgow-haskell-users mailing list