[GHC] #8326: Place heap checks common in case alternatives before the case

Mon Jan 7 08:45:24 UTC 2019

#8326: Place heap checks common in case alternatives before the case
-------------------------------------+-------------------------------------
        Reporter:  jstolarek         |                Owner:  (none)
            Type:  task              |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  7.7
      Resolution:                    |             Keywords:  CodeGen
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:  8317
 Related Tickets:  #1498             |  Differential Rev(s):  Phab:D343
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by simonpj:

Old description:

> We would like to have functions that check whether an `Int#` is a valid
> tag to represent `Bool` (see Note [Optimizing isTrue#] in ghc-prim):
>
> {{{
> isTrue# :: Int# -> Bool
> isTrue# 1# = True
> isTrue# _  = False
>
> isFalse# :: Int# -> Bool
> isFalse# 0# = True
> isFalse# _  = False
> }}}
>
> We could use them with comparison primops like this:
>
> {{{
> f :: Int# -> Int
> f x | isTrue# (x ># 0#) = I# x
>     | otherwise         = -(I# x)
> }}}
>
> `isTrue#` is optimized away at the Core level:
>
> {{{
> A.f =
>   \ (x_aqM :: GHC.Prim.Int#) ->
>     case GHC.Prim.># x_aqM 0 of _ {
>       __DEFAULT -> GHC.Types.I# (GHC.Prim.negateInt# x_aqM);
>       1 -> GHC.Types.I# x_aqM
>     }
> }}}
>
> but the code genrator produces very bad Cmm code, because it pushes heap
> checks into case alternatives:
>
> {{{
>      {offset
>        cFd: // stack check
>            if ((Sp + -16) < SpLim) goto cFr; else goto cFs;
>        cFr: // not enough place on the stack, call GC
>            R2 = R2;
>            R1 = A.f_closure;
>            call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
>        cFs: // scrutinize (x ># 0#)
>            _sEU::I64 = R2;
>            _sEV::I64 = %MO_S_Gt_W64(R2, 0);
>            if (_sEV::I64 != 1) goto cFg; else goto cFo;
>        cFg: // False branch
>            Hp = Hp + 16;
>            if (Hp > HpLim) goto cFy; else goto cFx;
>        cFy: // not enough heap, call GC
>            HpAlloc = 16;
>            I64[Sp - 16] = cFf;
>            R1 = _sEV::I64;
>            I64[Sp - 8] = _sEU::I64;
>            Sp = Sp - 16;
>            call stg_gc_unbx_r1(R1) returns to cFf, args: 8, res: 8, upd:
> 8;
>        cFf: // re-do the False branch
>            _sEU::I64 = I64[Sp + 8];
>            Sp = Sp + 16;
>            _sEV::I64 = R1;
>            goto cFg;
>        cFx: // RHS of False branch
>            I64[Hp - 8] = GHC.Types.I#_con_info;
>            I64[Hp] = -_sEU::I64;
>            R1 = Hp - 7;
>            call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
>        cFo: // True branch
>            Hp = Hp + 16;
>            if (Hp > HpLim) goto cFv; else goto cFu;
>        cFv: // not enough heap, call GC
>            HpAlloc = 16;
>            I64[Sp - 16] = cFn;
>            R1 = _sEV::I64;
>            I64[Sp - 8] = _sEU::I64;
>            Sp = Sp - 16;
>            call stg_gc_unbx_r1(R1) returns to cFn, args: 8, res: 8, upd:
> 8;
>        cFn: // re-do the True branch
>            _sEU::I64 = I64[Sp + 8];
>            Sp = Sp + 16;
>            _sEV::I64 = R1;
>            goto cFo;
>        cFu: // RHS of True branch
>            I64[Hp - 8] = GHC.Types.I#_con_info;
>            I64[Hp] = _sEU::I64;
>            R1 = Hp - 7;
>            call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
>      }
> }}}
>
> This results in average 2.5% increase in binary size. By contrast, if we
> use `tagToEnum#` instead of `isTrue#` heap check will be placed before
> `case` expression and the code will be significantly shorter (this is
> done by a special case-on-bool optimization in the code generator - see
> #8317). What we would like to do here is:
>
>   1. compile case alternatives without placing heap checks inside them
>   2. each compiled alternative should return amount of heap it needs to
> allocate
>   3. code generator inspects amounts of heap needed by each alternative
> and either adds heap checks in alternatives or puts a single check before
> the case expression.
>
> Getting this right might be a bit tricky.
>   1. if all branches allocate some heap then we can just put a common
> heap check before the case. Note that we must allocate the higgest amount
> required by any of the alternatives and then alternatives that use less
> heap must retract the heap pointer accordingly.
>   2. if we have two alternatives, one of which allocates heap and the
> other does not, we should place the heap check only in the alternative
> that allocates the stack. This will solve #1498.
>   3. it is not clear to me what to do if we have combination of the above
> (more than one branch that allocates heap and at least one branch that
> does not). If we place heap check before the `case` expression we lose
> optimization of recursive functions and face the problem described in
> #1498. If we push heap checks into branches that allocate heap then we
> get code duplication, i.e. the problem that we're addressing in this
> ticket. I guess the only way to make correct decission here is to try
> different aproaches and measure their performance.
>
> This ticket is mentioned
> [http://ghc.haskell.org/trac/ghc/wiki/PrimBool#Implementationdetails on
> this wiki page] and in the source code in Note [Optimizing isTrue#] in
> ghc-prim. Once this ticket is resolved we need to update these places
> accordingly.

New description:

 We would like to have functions that check whether an `Int#` is a valid
 tag to represent `Bool` (see Note [Optimizing isTrue#] in ghc-prim):

 {{{
 isTrue# :: Int# -> Bool
 isTrue# 1# = True
 isTrue# _  = False

 isFalse# :: Int# -> Bool
 isFalse# 0# = True
 isFalse# _  = False
 }}}

 We could use them with comparison primops like this:

 {{{
 f :: Int# -> Int
 f x | isTrue# (x ># 0#) = I# x
     | otherwise         = -(I# x)
 }}}

 `isTrue#` is optimized away at the Core level:

 {{{
 A.f =
   \ (x_aqM :: GHC.Prim.Int#) ->
     case GHC.Prim.># x_aqM 0 of _ {
       __DEFAULT -> GHC.Types.I# (GHC.Prim.negateInt# x_aqM);
       1 -> GHC.Types.I# x_aqM
     }
 }}}

 but the code genrator produces very bad Cmm code, because it pushes heap
 checks into case alternatives:

 {{{
      {offset
        cFd: // stack check
            if ((Sp + -16) < SpLim) goto cFr; else goto cFs;
        cFr: // not enough place on the stack, call GC
            R2 = R2;
            R1 = A.f_closure;
            call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
        cFs: // scrutinize (x ># 0#)
            _sEU::I64 = R2;
            _sEV::I64 = %MO_S_Gt_W64(R2, 0);
            if (_sEV::I64 != 1) goto cFg; else goto cFo;
        cFg: // False branch
            Hp = Hp + 16;
            if (Hp > HpLim) goto cFy; else goto cFx;
        cFy: // not enough heap, call GC
            HpAlloc = 16;
            I64[Sp - 16] = cFf;
            R1 = _sEV::I64;
            I64[Sp - 8] = _sEU::I64;
            Sp = Sp - 16;
            call stg_gc_unbx_r1(R1) returns to cFf, args: 8, res: 8, upd:
 8;
        cFf: // re-do the False branch
            _sEU::I64 = I64[Sp + 8];
            Sp = Sp + 16;
            _sEV::I64 = R1;
            goto cFg;
        cFx: // RHS of False branch
            I64[Hp - 8] = GHC.Types.I#_con_info;
            I64[Hp] = -_sEU::I64;
            R1 = Hp - 7;
            call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
        cFo: // True branch
            Hp = Hp + 16;
            if (Hp > HpLim) goto cFv; else goto cFu;
        cFv: // not enough heap, call GC
            HpAlloc = 16;
            I64[Sp - 16] = cFn;
            R1 = _sEV::I64;
            I64[Sp - 8] = _sEU::I64;
            Sp = Sp - 16;
            call stg_gc_unbx_r1(R1) returns to cFn, args: 8, res: 8, upd:
 8;
        cFn: // re-do the True branch
            _sEU::I64 = I64[Sp + 8];
            Sp = Sp + 16;
            _sEV::I64 = R1;
            goto cFo;
        cFu: // RHS of True branch
            I64[Hp - 8] = GHC.Types.I#_con_info;
            I64[Hp] = _sEU::I64;
            R1 = Hp - 7;
            call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
      }
 }}}

 This results in average 2.5% increase in binary size. By contrast, if we
 use `tagToEnum#` instead of `isTrue#` heap check will be placed before
 `case` expression and the code will be significantly shorter (this is done
 by a special case-on-bool optimization in the code generator - see #8317).
 What we would like to do here is:

   1. compile case alternatives without placing heap checks inside them
   2. each compiled alternative should return amount of heap it needs to
 allocate
   3. code generator inspects amounts of heap needed by each alternative
 and either adds heap checks in alternatives or puts a single check before
 the case expression.

 Getting this right might be a bit tricky.
   1. if all branches allocate some heap then we can just put a common heap
 check before the case. Note that we must allocate the higgest amount
 required by any of the alternatives and then alternatives that use less
 heap must retract the heap pointer accordingly.
   2. if we have two alternatives, one of which allocates heap and the
 other does not, we should place the heap check only in the alternative
 that allocates the stack. This will solve #1498.
   3. it is not clear to me what to do if we have combination of the above
 (more than one branch that allocates heap and at least one branch that
 does not). If we place heap check before the `case` expression we lose
 optimization of recursive functions and face the problem described in
 #1498. If we push heap checks into branches that allocate heap then we get
 code duplication, i.e. the problem that we're addressing in this ticket. I
 guess the only way to make correct decission here is to try different
 aproaches and measure their performance.

 This ticket is mentioned
 * [http://ghc.haskell.org/trac/ghc/wiki/PrimBool#Implementationdetails on
 this wiki page]
 * in the source code in Note [Optimizing isTrue#] in ghc-prim.
 * In `Simplify.hs`, `Note [Optimising tagToEnum#]`
 Once this ticket is resolved we need to update these places accordingly.

--

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8326#comment:46>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler