High-level Cmm code and stack allocation

Mon Jan 13 17:13:58 UTC 2014

Using more stack generally (but not always) implies extra memory 
traffic.  I noticed it happening a lot, but I didn't make measurements - 
we never had a way to generate code with just this one thing changed, 
because the new code generator had lots of issues with bad code, and 
this was just one.

We can think of stack allocation as a black box: it takes Cmm in which 
(a) variables live across calls and (b) stack references are to [Old+n] 
or [Young+n], and returns Cmm in which (a) variables do not live across 
calls, and (b) all stack references are explicit offsets from Sp.

The internals of this box are what has changed. Most users of Cmm don't 
need to care, because you can write optimisations on both the 
pre-stack-allocated Cmm and the post-stack-allocated Cmm without knowing 
anything about how stack allocation works. Indeed CmmSink (now) works on 
both forms.

The stack area idea exposed some of the internals of this box; I don't 
think that's necessarily a good thing.  There was *another* form of Cmm, 
in which (a) variables do not live across calls, and (b) stack 
references are to [Old+n], [Young+n] or [Sp(var)].  There was a 
(beautifully simple) spill pass using Hoopl that inserted spills at the 
definition site; unfortunately to generate good code you often have to 
move the spills somewhere else.  And that's really hard, because code 
motion interacts in complex ways with stack layout: making a bad code 
motion decision can increase your stack requirements.  This is a pretty 
good summary of what I was finding difficult here.  It was not possible 
to generate good code without doing some optimisation on this 
intermediate stage, yet by doing stack allocation in a different way it 
was much easier to get good code.  So the new stack allocator just walks 
through the code spilling, reloading, and allocating stack frames as it 
goes and making intelligent decisions about not spilling things if 
they're already on the stack.  This does a really good job, and it was 
easy to add a couple of important special cases for common things.

There's plenty of room to do something better.  However, what we have 
now generates good code from the kind of things that the code generator 
generates (since that's what I tuned it for, by peering at lots of Cmm 
and tweaking things), so any improvements won't see much benefit for 
typical Haskell code.

I have some more docs for the stack layout code that I'll push shortly.

Cheers,
Simon

On 13/01/2014 15:20, Simon Peyton Jones wrote:
> Thanks.  Reading what you write below, I can see two possible motivations.
>
> 1.  Reduce stack sizes.
> 2.  Eliminate memory moves
>
> For (1) do we have any data to show that the non-overlap of areas was giving rise to unacceptably big stacks?
>
> For (2) that is indeed clever, but it's pretty serendipitous: it relies on the overlap being just so, so that coincidentally y gets stored in the same place as it was loaded from.  I imagine that you don't plan the stack layout to cause that to happen; it's just a coincidence.  Do we have any data to show that the coincidence happens with any frequency?
>
> Also, as you note, we lose the opportunity for certain sorts of code motion, perhaps increasing register pressure a lot.  So there is a downside too.
>
> You seldom do things without a very good reason, so I feel I must be missing something.
>
> Simon
>
> | -----Original Message-----
> | From: Simon Marlow [mailto:marlowsd at gmail.com]
> | Sent: 10 January 2014 17:00
> | To: Simon Peyton Jones; Herbert Valerio Riedel
> | Cc: ghc-devs at haskell.org
> | Subject: Re: High-level Cmm code and stack allocation
> |
> | So stack areas are still a great abstraction, the only change is that
> | they now overlap.  It's not just about stack getting too big, I've
> | copied the notes I made about it below (which I will paste into the code
> | in due course).  The nice property that we can generate well-defined Cmm
> | without knowing explicit stack offsets is intact.
> |
> | What is different is that there used to be an intermediate state where
> | live variables were saved to abstract stack areas across calls, but Sp
> | was still not manifest.  This intermediate state doesn't exist any more,
> | the stack layout algorithm does it all in one pass.  To me this was far
> | simpler, and I think it ended up being fewer lines of code than the old
> | multi-phase stack layout algorithm (it's also much faster).
> |
> | Of course you can always change this.  My goal was to get code that was
> | at least as good as the old code generator and in a reasonable amount of
> | time, and this was the shortest path I could find to that goal.
> |
> | Cheers,
> | Simon
> |
> | e.g. if we had
> |
> |      x = Sp[old + 8]
> |      y = Sp[old + 16]
> |
> |      Sp[young(L) + 8]  = L
> |      Sp[young(L) + 16] = y
> |      Sp[young(L) + 24] = x
> |      call f() returns to L
> |
> | if areas semantically do not overlap, then we might optimise this to
> |
> |      Sp[young(L) + 8]  = L
> |      Sp[young(L) + 16] = Sp[old + 8]
> |      Sp[young(L) + 24] = Sp[old + 16]
> |      call f() returns to L
> |
> | and now young(L) cannot be allocated at the same place as old, and we
> | are doomed to use more stack.
> |
> |    - old+8  conflicts with young(L)+8
> |    - old+16 conflicts with young(L)+16 and young(L)+8
> |
> | so young(L)+8 == old+24 and we get
> |
> |      Sp[-8]  = L
> |      Sp[-16] = Sp[8]
> |      Sp[-24] = Sp[0]
> |      Sp -= 24
> |      call f() returns to L
> |
> | However, if areas are defined to be "possibly overlapping" in the
> | semantics, then we cannot commute any loads/stores of old with young(L),
> | and we will be able to re-use both old+8 and old+16 for young(L).
> |
> |      x = Sp[8]
> |      y = Sp[0]
> |
> |      Sp[8] = L
> |      Sp[0] = y
> |      Sp[-8] = x
> |      Sp = Sp - 8
> |      call f() returns to L
> |
> | Now, the assignments of y go away,
> |
> |      x = Sp[8]
> |      Sp[8] = L
> |      Sp[-8] = x
> |      Sp = Sp - 8
> |      call f() returns to L
> |
> |
> | Conclusion:
> |
> |    - T[old+N] aliases with U[young(L)+M] for all T, U, L, N and M
> |    - T[old+N] aliases with U[old+M] only if the areas actually overlap
> |
> | this ensures that we will not commute any accesses to old with
> | young(L) or young(L) with young(L'), and the stack allocator will get
> | the maximum opportunity to overlap these areas, keeping the stack use to
> | a minimum and possibly avoiding some assignments.
> |
> |
> |
> | On 10/01/2014 16:35, Simon Peyton Jones wrote:
> | > Oh, ok.  Alas, a good chunk of my model of Cmm has just gone out of
> | the window.  I thought that areas were such a lovely, well-behaved
> | abstraction.  I was thrilled when we came up with them, and I'm very
> | sorry to see them go.
> | >
> | > There are no many things that I no longer understand.  I now have no
> | idea how we save live variables over a call, or how multiple returned
> | values from one call (returned on the stack) stay right where they are
> | if they are live across the next call.
> | >
> | > What was the actual problem?  That functions used too much stack, so
> | the stack was getting too big?  But a one slot area corresponds exactly
> | to a live variable, so I don't see how the area abstraction could
> | possibly increase stack size.  And is stack size a crucial issue anyway?
> | >
> | > Apart from anything else, areas would have given a lovely solution to
> | the problem this thread started with!
> | >
> | > I guess we can talk about this when you next visit?  But some
> | documentation would be welcome.
> | >
> | > Simon
> | >
> | > | -----Original Message-----
> | > | From: Simon Marlow [mailto:marlowsd at gmail.com]
> | > | Sent: 10 January 2014 16:24
> | > | To: Simon Peyton Jones; Herbert Valerio Riedel
> | > | Cc: ghc-devs at haskell.org
> | > | Subject: Re: High-level Cmm code and stack allocation
> | > |
> | > | There are no one-slot areas any more, I ditched those when I rewrote
> | > | the stack allocator.  There is only ever one live area: either the
> | > | old area or the young area for a call we are about to make or have
> | just made.
> | > | (see the data type: I removed the one-slot areas)
> | > |
> | > | I struggled for a long time with this.  The problem is that with the
> | > | semantics of non-overlapping areas, code motion optimisations would
> | > | tend to increase the stack requirements of the function by
> | > | overlapping the live ranges of the areas.  I concluded that actually
> | > | what we wanted was areas that really do overlap, and optimisations
> | > | that respect that, so that we get more efficient stack usage.
> | > |
> | > | Cheers,
> | > | 	Simon
> | > |
> | > | On 10/01/2014 15:22, Simon Peyton Jones wrote:
> | > | > That documentation would be good, yes!  I don't know what it means
> | > | > to
> | > | say "we don't really have a general concept of areas any more".  We
> | > | did before, and I didn't know that it had gone away.  Urk!  We can
> | > | have lots of live areas, notably the old area (for the current
> | > | call/return parameters, the call area for a call we are preparing,
> | > | and the one-slot areas for variables we are saving on the stack.
> | > | >
> | > | > Here's he current story
> | > | > https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/StackAre
> | > | > as
> | > | >
> | > | > I agree that we have no concrete syntax for talking about areas,
> | > | > but
> | > | that is something we could fix.  But I'm worried that they may not
> | > | mean what they used to mean.
> | > | >
> | > | > Simon
> | > | >
> | > | > | -----Original Message-----
> | > | > | From: Simon Marlow [mailto:marlowsd at gmail.com]
> | > | > | Sent: 09 January 2014 08:39
> | > | > | To: Simon Peyton Jones; Herbert Valerio Riedel
> | > | > | Cc: ghc-devs at haskell.org
> | > | > | Subject: Re: High-level Cmm code and stack allocation
> | > | > |
> | > | > | On 08/01/2014 10:07, Simon Peyton Jones wrote:
> | > | > | > | > Can't we just allocate a Cmm "area"? The address of an
> | > | > | > | > area is a
> | > | > | > | perfectly well-defined Cmm value.
> | > | > | >
> | > | > | > What about this idea?
> | > | > |
> | > | > | We don't really have a general concept of areas (any more), and
> | > | > | areas aren't exposed in the concrete Cmm syntax at all.  The
> | > | > | current semantics is that areas may overlap with each other, so
> | > | > | there should only be one active area at any point.  I found that
> | > | > | this was important to ensure that we could generate good code
> | > | > | from the stack layout algorithm, otherwise it had to make
> | > | > | pessimistic assumptions
> | > | and use too much stack.
> | > | > |
> | > | > | You're going to ask me where this is documented, and I think I
> | > | > | have to admit to slacking off, sorry :-)  We did discuss it at
> | > | > | the time, and I made copious notes, but I didn't transfer those
> | to the code.
> | > | > | I'll add a Note.
> | > | > |
> | > | > | Cheers,
> | > | > | Simon
> | > | > |
> | > | > |
> | > | > | > Simon
> | > | > | >
> | > | > | > | -----Original Message-----
> | > | > | > | From: Simon Marlow [mailto:marlowsd at gmail.com]
> | > | > | > | Sent: 08 January 2014 09:26
> | > | > | > | To: Simon Peyton Jones; Herbert Valerio Riedel
> | > | > | > | Cc: ghc-devs at haskell.org
> | > | > | > | Subject: Re: High-level Cmm code and stack allocation
> | > | > | > |
> | > | > | > | On 07/01/14 22:53, Simon Peyton Jones wrote:
> | > | > | > | > | Yes, this is technically wrong but luckily works.  I'd
> | > | > | > | > | very much like to have a better solution, preferably one
> | > | > | > | > | that doesn't add any extra overhead.
> | > | > | > | >
> | > | > | > | > | __decodeFloat_Int is a C function, so it will not touch
> | > | > | > | > | the Haskell stack.
> | > | > | > | >
> | > | > | > | > This all seems terribly fragile to me.  At least it ought
> | > | > | > | > to be
> | > | > | > | surrounded with massive comments pointing out how terribly
> | > | > | > | fragile it is, breaking all the rules that we carefully
> | > | > | > | document
> | > | elsewhere.
> | > | > | > | >
> | > | > | > | > Can't we just allocate a Cmm "area"? The address of an
> | > | > | > | > area is a
> | > | > | > | perfectly well-defined Cmm value.
> | > | > | > |
> | > | > | > | It is fragile, yes.  We can't use static memory because it
> | > | > | > | needs to be thread-local.  This particular hack has gone
> | > | > | > | through several iterations over the years: first we had
> | > | > | > | static memory, which broke when we did the parallel runtime,
> | > | > | > | then we had special storage in the Capability, which we gave
> | > | > | > | up when GMP was split out into a separate library, because
> | > | > | > | it didn't seem right to have magic fields in the Capability
> | for one library.
> | > | > | > |
> | > | > | > | I'm looking into whether we can do temporary allocation on
> | > | > | > | the heap for this instead.
> | > | > | > |
> | > | > | > | Cheers,
> | > | > | > | Simon
> | > | > | > |
> | > | > | > |
> | > | > | > | > Simon
> | > | > | > | >
> | > | > | > | > | -----Original Message-----
> | > | > | > | > | From: ghc-devs [mailto:ghc-devs-bounces at haskell.org] On
> | > | > | > | > | Behalf Of Simon Marlow
> | > | > | > | > | Sent: 07 January 2014 16:05
> | > | > | > | > | To: Herbert Valerio Riedel; ghc-devs at haskell.org
> | > | > | > | > | Subject: Re: High-level Cmm code and stack allocation
> | > | > | > | > |
> | > | > | > | > | On 04/01/2014 23:26, Herbert Valerio Riedel wrote:
> | > | > | > | > | > Hello,
> | > | > | > | > | >
> | > | > | > | > | > According to Note [Syntax of .cmm files],
> | > | > | > | > | >
> | > | > | > | > | > | There are two ways to write .cmm code:
> | > | > | > | > | > |
> | > | > | > | > | > |  (1) High-level Cmm code delegates the stack
> | > | > | > | > | > | handling to GHC,
> | > | > | > | and
> | > | > | > | > | > |      never explicitly mentions Sp or registers.
> | > | > | > | > | > |
> | > | > | > | > | > |  (2) Low-level Cmm manages the stack itself, and
> | > | > | > | > | > | must know
> | > | > | about
> | > | > | > | > | > |      calling conventions.
> | > | > | > | > | > |
> | > | > | > | > | > | Whether you want high-level or low-level Cmm is
> | > | > | > | > | > | indicated by the presence of an argument list on a
> | > | procedure.
> | > | > | > | > | >
> | > | > | > | > | > However, while working on integer-gmp I've been
> | > | > | > | > | > noticing in integer-gmp/cbits/gmp-wrappers.cmm that
> | > | > | > | > | > even though all Cmm
> | > | > | > | > | procedures
> | > | > | > | > | > have been converted to high-level Cmm, they still
> | > | > | > | > | > reference the
> | > | > | > | 'Sp'
> | > | > | > | > | > register, e.g.
> | > | > | > | > | >
> | > | > | > | > | >
> | > | > | > | > | >      #define GMP_TAKE1_RET1(name,mp_fun)       \
> | > | > | > | > | >      name (W_ ws1, P_ d1)                      \
> | > | > | > | > | >      {                                         \
> | > | > | > | > | >        W_ mp_tmp1;                             \
> | > | > | > | > | >        W_ mp_result1;                          \
> | > | > | > | > | >                                                \
> | > | > | > | > | >      again:                                    \
> | > | > | > | > | >        STK_CHK_GEN_N (2 * SIZEOF_MP_INT);      \
> | > | > | > | > | >        MAYBE_GC(again);                        \
> | > | > | > | > | >                                                \
> | > | > | > | > | >        mp_tmp1    = Sp - 1 * SIZEOF_MP_INT;    \
> | > | > | > | > | >        mp_result1 = Sp - 2 * SIZEOF_MP_INT;    \
> | > | > | > | > | >        ...                                     \
> | > | > | > | > | >
> | > | > | > | > | >
> | > | > | > | > | > So is this valid high-level Cmm code? What's the
> | > | > | > | > | > proper way to
> | > | > | > | > | allocate
> | > | > | > | > | > Stack (and/or Heap) memory from high-level Cmm code?
> | > | > | > | > |
> | > | > | > | > | Yes, this is technically wrong but luckily works.  I'd
> | > | > | > | > | very much like to have a better solution, preferably one
> | > | > | > | > | that doesn't add any extra overhead.
> | > | > | > | > |
> | > | > | > | > | The problem here is that we need to allocate a couple of
> | > | > | > | > | temporary words and take their address; that's an
> | > | > | > | > | unusual thing to do in Cmm, so it only occurs in a few
> | > | > | > | > | places (mainly
> | > | > | interacting with gmp).
> | > | > | > | > | Usually if you want some temporary storage you can use
> | > | > | > | > | local variables or some heap-allocated memory.
> | > | > | > | > |
> | > | > | > | > | Cheers,
> | > | > | > | > | Simon
> | > | > | > | > | _______________________________________________
> | > | > | > | > | ghc-devs mailing list
> | > | > | > | > | ghc-devs at haskell.org
> | > | > | > | > | http://www.haskell.org/mailman/listinfo/ghc-devs
> | > | > | > | >
> | > | > | >
> | > | >
> | >
>