[commit: ghc] master: Documentation on the stack layout algorithm (78a506a)

Thu Jan 16 11:11:46 UTC 2014

Repository : ssh://git@git.haskell.org/ghc

On branch  : master
Link       : http://ghc.haskell.org/trac/ghc/changeset/78a506a914f18399c8d5032fe72547b6c2abfc7a/ghc

>---------------------------------------------------------------

commit 78a506a914f18399c8d5032fe72547b6c2abfc7a
Author: Simon Marlow <marlowsd at gmail.com>
Date:   Mon Jan 13 20:36:48 2014 +0000

    Documentation on the stack layout algorithm


>---------------------------------------------------------------

78a506a914f18399c8d5032fe72547b6c2abfc7a
 compiler/cmm/CmmLayoutStack.hs |   99 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 94 insertions(+), 5 deletions(-)

diff --git a/compiler/cmm/CmmLayoutStack.hs b/compiler/cmm/CmmLayoutStack.hs
index 5b881d8..0c49033 100644
--- a/compiler/cmm/CmmLayoutStack.hs
+++ b/compiler/cmm/CmmLayoutStack.hs
@@ -35,13 +35,95 @@ import Control.Monad (liftM)
 
 #include "HsVersions.h"
 
+{- Note [Stack Layout]
 
-data StackSlot = Occupied | Empty
-     -- Occupied: a return address or part of an update frame
+The job of this pass is to
+
+ - replace references to abstract stack Areas with fixed offsets from Sp.
+
+ - replace the CmmHighStackMark constant used in the stack check with
+   the maximum stack usage of the proc.
+
+ - save any variables that are live across a calll, and reload them as
+   necessary.
+
+Before stack allocation, local variables remain live across native
+calls (CmmCall{ cmm_cont = Just _ }), and after stack allocation local
+variables are clobbered by native calls.
+
+We want to do stack allocation so that as far as possible
+ - stack use is minimized, and
+ - unnecessary stack saves and loads are avoided.
+
+The algorithm we use is a variant of linear-scan register allocation,
+where the stack is our register file.
+
+ - First, we do a liveness analysis, which annotates every block with
+   the variables live on entry to the block.
+
+ - We traverse blocks in reverse postorder DFS; that is, we visit at
+   least one predecessor of a block before the block itself.  The
+   stack layout flowing from the predecessor of the block will
+   determine the stack layout on entry to the block.
+
+ - We maintain a data structure
+
+     Map Label StackMap
+
+   which describes the contents of the stack and the stack pointer on
+   entry to each block that is a successor of a block that we have
+   visited.
+
+ - For each block we visit:
+
+    - Look up the StackMap for this block.
+
+    - If this block is a proc point (or a call continuation, if we
+      aren't splitting proc points), emit instructions to reload all
+      the live variables from the stack, according to the StackMap.
+
+    - Walk forwards through the instructions:
+      - At an assignment  x = Sp[loc]
+        - Record the fact that Sp[loc] contains x, so that we won't
+          need to save x if it ever needs to be spilled.
+      - At an assignment  x = E
+        - If x was previously on the stack, it isn't any more
+      - At the last node, if it is a call or a jump to a proc point
+        - Lay out the stack frame for the call (see setupStackFrame)
+        - emit instructions to save all the live variables
+        - Remember the StackMaps for all the successors
+        - emit an instruction to adjust Sp
+      - If the last node is a branch, then the current StackMap is the
+        StackMap for the successors.
+
+    - Manifest Sp: replace references to stack areas in this block
+      with real Sp offsets. We cannot do this until we have laid out
+      the stack area for the successors above.
+
+      In this phase we also eliminate redundant stores to the stack;
+      see elimStackStores.
+
+  - There is one important gotcha: sometimes we'll encounter a control
+    transfer to a block that we've already processed (a join point),
+    and in that case we might need to rearrange the stack to match
+    what the block is expecting. (exactly the same as in linear-scan
+    register allocation, except here we have the luxury of an infinite
+    supply of temporary variables).
+
+  - Finally, we update the magic CmmHighStackMark constant with the
+    stack usage of the function, and eliminate the whole stack check
+    if there was no stack use. (in fact this is done as part of the
+    main traversal, by feeding the high-water-mark output back in as
+    an input. I hate cyclic programming, but it's just too convenient
+    sometimes.)
+
+There are plenty of tricky details: update frames, proc points, return
+addresses, foreign calls, and some ad-hoc optimisations that are
+convenient to do here and effective in common cases.  Comments in the
+code below explain these.
+
+-}
 
-instance Outputable StackSlot where
-  ppr Occupied = ptext (sLit "XXX")
-  ppr Empty    = ptext (sLit "---")
 
 -- All stack locations are expressed as positive byte offsets from the
 -- "base", which is defined to be the address above the return address
@@ -996,6 +1078,13 @@ callResumeThread new_base id =
 plusW :: DynFlags -> ByteOff -> WordOff -> ByteOff
 plusW dflags b w = b + w * wORD_SIZE dflags
 
+data StackSlot = Occupied | Empty
+     -- Occupied: a return address or part of an update frame
+
+instance Outputable StackSlot where
+  ppr Occupied = ptext (sLit "XXX")
+  ppr Empty    = ptext (sLit "---")
+
 dropEmpty :: WordOff -> [StackSlot] -> Maybe [StackSlot]
 dropEmpty 0 ss           = Just ss
 dropEmpty n (Empty : ss) = dropEmpty (n-1) ss