High-level Cmm code and stack allocation

Fri Jan 10 17:00:29 UTC 2014

So stack areas are still a great abstraction, the only change is that 
they now overlap.  It's not just about stack getting too big, I've 
copied the notes I made about it below (which I will paste into the code 
in due course).  The nice property that we can generate well-defined Cmm 
without knowing explicit stack offsets is intact.

What is different is that there used to be an intermediate state where 
live variables were saved to abstract stack areas across calls, but Sp 
was still not manifest.  This intermediate state doesn't exist any more, 
the stack layout algorithm does it all in one pass.  To me this was far 
simpler, and I think it ended up being fewer lines of code than the old 
multi-phase stack layout algorithm (it's also much faster).

Of course you can always change this.  My goal was to get code that was 
at least as good as the old code generator and in a reasonable amount of 
time, and this was the shortest path I could find to that goal.

Cheers,
Simon

e.g. if we had

     x = Sp[old + 8]
     y = Sp[old + 16]

     Sp[young(L) + 8]  = L
     Sp[young(L) + 16] = y
     Sp[young(L) + 24] = x
     call f() returns to L

if areas semantically do not overlap, then we might optimise this to

     Sp[young(L) + 8]  = L
     Sp[young(L) + 16] = Sp[old + 8]
     Sp[young(L) + 24] = Sp[old + 16]
     call f() returns to L

and now young(L) cannot be allocated at the same place as old, and we
are doomed to use more stack.

   - old+8  conflicts with young(L)+8
   - old+16 conflicts with young(L)+16 and young(L)+8

so young(L)+8 == old+24 and we get

     Sp[-8]  = L
     Sp[-16] = Sp[8]
     Sp[-24] = Sp[0]
     Sp -= 24
     call f() returns to L

However, if areas are defined to be "possibly overlapping" in the
semantics, then we cannot commute any loads/stores of old with
young(L), and we will be able to re-use both old+8 and old+16 for
young(L).

     x = Sp[8]
     y = Sp[0]

     Sp[8] = L
     Sp[0] = y
     Sp[-8] = x
     Sp = Sp - 8
     call f() returns to L

Now, the assignments of y go away,

     x = Sp[8]
     Sp[8] = L
     Sp[-8] = x
     Sp = Sp - 8
     call f() returns to L

Conclusion:

   - T[old+N] aliases with U[young(L)+M] for all T, U, L, N and M
   - T[old+N] aliases with U[old+M] only if the areas actually overlap

this ensures that we will not commute any accesses to old with
young(L) or young(L) with young(L'), and the stack allocator will get
the maximum opportunity to overlap these areas, keeping the stack use
to a minimum and possibly avoiding some assignments.

On 10/01/2014 16:35, Simon Peyton Jones wrote:
> Oh, ok.  Alas, a good chunk of my model of Cmm has just gone out of the window.  I thought that areas were such a lovely, well-behaved abstraction.  I was thrilled when we came up with them, and I'm very sorry to see them go.
>
> There are no many things that I no longer understand.  I now have no idea how we save live variables over a call, or how multiple returned values from one call (returned on the stack) stay right where they are if they are live across the next call.
>
> What was the actual problem?  That functions used too much stack, so the stack was getting too big?  But a one slot area corresponds exactly to a live variable, so I don't see how the area abstraction could possibly increase stack size.  And is stack size a crucial issue anyway?
>
> Apart from anything else, areas would have given a lovely solution to the problem this thread started with!
>
> I guess we can talk about this when you next visit?  But some documentation would be welcome.
>
> Simon
>
> | -----Original Message-----
> | From: Simon Marlow [mailto:marlowsd at gmail.com]
> | Sent: 10 January 2014 16:24
> | To: Simon Peyton Jones; Herbert Valerio Riedel
> | Cc: ghc-devs at haskell.org
> | Subject: Re: High-level Cmm code and stack allocation
> |
> | There are no one-slot areas any more, I ditched those when I rewrote the
> | stack allocator.  There is only ever one live area: either the old area
> | or the young area for a call we are about to make or have just made.
> | (see the data type: I removed the one-slot areas)
> |
> | I struggled for a long time with this.  The problem is that with the
> | semantics of non-overlapping areas, code motion optimisations would tend
> | to increase the stack requirements of the function by overlapping the
> | live ranges of the areas.  I concluded that actually what we wanted was
> | areas that really do overlap, and optimisations that respect that, so
> | that we get more efficient stack usage.
> |
> | Cheers,
> | 	Simon
> |
> | On 10/01/2014 15:22, Simon Peyton Jones wrote:
> | > That documentation would be good, yes!  I don't know what it means to
> | say "we don't really have a general concept of areas any more".  We did
> | before, and I didn't know that it had gone away.  Urk!  We can have lots
> | of live areas, notably the old area (for the current call/return
> | parameters, the call area for a call we are preparing, and the one-slot
> | areas for variables we are saving on the stack.
> | >
> | > Here's he current story
> | > https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/StackAreas
> | >
> | > I agree that we have no concrete syntax for talking about areas, but
> | that is something we could fix.  But I'm worried that they may not mean
> | what they used to mean.
> | >
> | > Simon
> | >
> | > | -----Original Message-----
> | > | From: Simon Marlow [mailto:marlowsd at gmail.com]
> | > | Sent: 09 January 2014 08:39
> | > | To: Simon Peyton Jones; Herbert Valerio Riedel
> | > | Cc: ghc-devs at haskell.org
> | > | Subject: Re: High-level Cmm code and stack allocation
> | > |
> | > | On 08/01/2014 10:07, Simon Peyton Jones wrote:
> | > | > | > Can't we just allocate a Cmm "area"? The address of an area is
> | > | > | > a
> | > | > | perfectly well-defined Cmm value.
> | > | >
> | > | > What about this idea?
> | > |
> | > | We don't really have a general concept of areas (any more), and
> | > | areas aren't exposed in the concrete Cmm syntax at all.  The current
> | > | semantics is that areas may overlap with each other, so there should
> | > | only be one active area at any point.  I found that this was
> | > | important to ensure that we could generate good code from the stack
> | > | layout algorithm, otherwise it had to make pessimistic assumptions
> | and use too much stack.
> | > |
> | > | You're going to ask me where this is documented, and I think I have
> | > | to admit to slacking off, sorry :-)  We did discuss it at the time,
> | > | and I made copious notes, but I didn't transfer those to the code.
> | > | I'll add a Note.
> | > |
> | > | Cheers,
> | > | Simon
> | > |
> | > |
> | > | > Simon
> | > | >
> | > | > | -----Original Message-----
> | > | > | From: Simon Marlow [mailto:marlowsd at gmail.com]
> | > | > | Sent: 08 January 2014 09:26
> | > | > | To: Simon Peyton Jones; Herbert Valerio Riedel
> | > | > | Cc: ghc-devs at haskell.org
> | > | > | Subject: Re: High-level Cmm code and stack allocation
> | > | > |
> | > | > | On 07/01/14 22:53, Simon Peyton Jones wrote:
> | > | > | > | Yes, this is technically wrong but luckily works.  I'd very
> | > | > | > | much like to have a better solution, preferably one that
> | > | > | > | doesn't add any extra overhead.
> | > | > | >
> | > | > | > | __decodeFloat_Int is a C function, so it will not touch the
> | > | > | > | Haskell stack.
> | > | > | >
> | > | > | > This all seems terribly fragile to me.  At least it ought to
> | > | > | > be
> | > | > | surrounded with massive comments pointing out how terribly
> | > | > | fragile it is, breaking all the rules that we carefully document
> | elsewhere.
> | > | > | >
> | > | > | > Can't we just allocate a Cmm "area"? The address of an area is
> | > | > | > a
> | > | > | perfectly well-defined Cmm value.
> | > | > |
> | > | > | It is fragile, yes.  We can't use static memory because it needs
> | > | > | to be thread-local.  This particular hack has gone through
> | > | > | several iterations over the years: first we had static memory,
> | > | > | which broke when we did the parallel runtime, then we had
> | > | > | special storage in the Capability, which we gave up when GMP was
> | > | > | split out into a separate library, because it didn't seem right
> | > | > | to have magic fields in the Capability for one library.
> | > | > |
> | > | > | I'm looking into whether we can do temporary allocation on the
> | > | > | heap for this instead.
> | > | > |
> | > | > | Cheers,
> | > | > | Simon
> | > | > |
> | > | > |
> | > | > | > Simon
> | > | > | >
> | > | > | > | -----Original Message-----
> | > | > | > | From: ghc-devs [mailto:ghc-devs-bounces at haskell.org] On
> | > | > | > | Behalf Of Simon Marlow
> | > | > | > | Sent: 07 January 2014 16:05
> | > | > | > | To: Herbert Valerio Riedel; ghc-devs at haskell.org
> | > | > | > | Subject: Re: High-level Cmm code and stack allocation
> | > | > | > |
> | > | > | > | On 04/01/2014 23:26, Herbert Valerio Riedel wrote:
> | > | > | > | > Hello,
> | > | > | > | >
> | > | > | > | > According to Note [Syntax of .cmm files],
> | > | > | > | >
> | > | > | > | > | There are two ways to write .cmm code:
> | > | > | > | > |
> | > | > | > | > |  (1) High-level Cmm code delegates the stack handling to
> | > | > | > | > | GHC,
> | > | > | and
> | > | > | > | > |      never explicitly mentions Sp or registers.
> | > | > | > | > |
> | > | > | > | > |  (2) Low-level Cmm manages the stack itself, and must
> | > | > | > | > | know
> | > | about
> | > | > | > | > |      calling conventions.
> | > | > | > | > |
> | > | > | > | > | Whether you want high-level or low-level Cmm is
> | > | > | > | > | indicated by the presence of an argument list on a
> | procedure.
> | > | > | > | >
> | > | > | > | > However, while working on integer-gmp I've been noticing
> | > | > | > | > in integer-gmp/cbits/gmp-wrappers.cmm that even though all
> | > | > | > | > Cmm
> | > | > | > | procedures
> | > | > | > | > have been converted to high-level Cmm, they still
> | > | > | > | > reference the
> | > | > | 'Sp'
> | > | > | > | > register, e.g.
> | > | > | > | >
> | > | > | > | >
> | > | > | > | >      #define GMP_TAKE1_RET1(name,mp_fun)       \
> | > | > | > | >      name (W_ ws1, P_ d1)                      \
> | > | > | > | >      {                                         \
> | > | > | > | >        W_ mp_tmp1;                             \
> | > | > | > | >        W_ mp_result1;                          \
> | > | > | > | >                                                \
> | > | > | > | >      again:                                    \
> | > | > | > | >        STK_CHK_GEN_N (2 * SIZEOF_MP_INT);      \
> | > | > | > | >        MAYBE_GC(again);                        \
> | > | > | > | >                                                \
> | > | > | > | >        mp_tmp1    = Sp - 1 * SIZEOF_MP_INT;    \
> | > | > | > | >        mp_result1 = Sp - 2 * SIZEOF_MP_INT;    \
> | > | > | > | >        ...                                     \
> | > | > | > | >
> | > | > | > | >
> | > | > | > | > So is this valid high-level Cmm code? What's the proper
> | > | > | > | > way to
> | > | > | > | allocate
> | > | > | > | > Stack (and/or Heap) memory from high-level Cmm code?
> | > | > | > |
> | > | > | > | Yes, this is technically wrong but luckily works.  I'd very
> | > | > | > | much like to have a better solution, preferably one that
> | > | > | > | doesn't add any extra overhead.
> | > | > | > |
> | > | > | > | The problem here is that we need to allocate a couple of
> | > | > | > | temporary words and take their address; that's an unusual
> | > | > | > | thing to do in Cmm, so it only occurs in a few places
> | > | > | > | (mainly
> | > | interacting with gmp).
> | > | > | > | Usually if you want some temporary storage you can use local
> | > | > | > | variables or some heap-allocated memory.
> | > | > | > |
> | > | > | > | Cheers,
> | > | > | > | Simon
> | > | > | > | _______________________________________________
> | > | > | > | ghc-devs mailing list
> | > | > | > | ghc-devs at haskell.org
> | > | > | > | http://www.haskell.org/mailman/listinfo/ghc-devs
> | > | > | >
> | > | >
> | >
>