[GHC] #12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code Flow not lifted outside the loop...
GHC
ghc-devs at haskell.org
Fri Dec 9 11:23:57 UTC 2016
#12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code
Flow not lifted outside the loop...
-------------------------------------+-------------------------------------
Reporter: GordonBGood | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 8.0.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Runtime performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by GordonBGood):
It seems that loop-invariant code not being lifted out of loops is not
limited to primitive operations (including Addr#), but is the general case
for any code that is not purely strict; thus anything involving boxed
thunks does not seem to be optimized properly.
The following code, a simple naive Sieve of Eratosthenes implementation in
which the composite-number culling is run many times in a loop for more
reliable timing, demonstrates the problem:
{{{
{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O3 -rtsopts -keep-s-files -ddump-stg -ddump-cmm
                -ddump-opt-cmm -ddump-to-file -dumpdir . #-}
-- or: -O2 -fllvm -v -dcore-lint -ddump-asm

import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )

twos = listArray (0, 31) [ 1 `shiftL` i | i <- [0 .. 31]] :: UArray Int Word32

eos :: Int -> [Int]
eos top = [fromIntegral i | (i, False) <- assocs cmpsts] where
  cmpsts = runSTUArray $ do
    cmpstsb <- newArray (0, top) False :: ST s (STUArray s Int Bool)
    cmpstsw <- (castSTUArray :: STUArray s Int Bool
                             -> ST s (STUArray s Int Word32)) cmpstsb
    unsafeWrite cmpstsw 0 3 -- precull 0 and 1
    let loop i =
          if i <= 0 then return cmpstsb else
            let nxtp p =
                  let s = p * p in
                  if s > top then loop (i - 1) else do
                    v <- unsafeRead cmpstsw (p `shiftR` 5)
                    if v .&. unsafeAt twos (p .&. 31) /= 0 then nxtp (p + 1)
                      else
                        let nxtc c =
                              if c > top then return () else do
                                let w = c `shiftR` 5
                                v <- unsafeRead cmpstsw w
                                unsafeWrite cmpstsw w
                                  (v .|. unsafeAt twos (c .&. 31))
                                nxtc (c + p)
                        in do { nxtc s; nxtp (p + 1) }
            in twos `seq` nxtp 2
    loop (10000 :: Int)

main = print $ length $ eos 131071
}}}
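(For reference, the packing scheme: candidate `c` lives at bit `c .&. 31` of
the 32-bit word at index `c `shiftR` 5`, and `twos` merely memoises the 32
single-bit masks. A minimal sketch of that indexing; the helper names
`wordIx`, `bitIx` and `mask` are mine, not anything in the code above:
{{{
import Data.Bits ((.&.), shiftL, shiftR)
import Data.Word (Word32)

-- hypothetical helpers spelling out the indexing the sieve above uses
wordIx, bitIx :: Int -> Int
wordIx c = c `shiftR` 5   -- c `div` 32: which Word32 holds candidate c
bitIx  c = c .&. 31       -- c `mod` 32: which bit within that word

mask :: Int -> Word32     -- the mask that `unsafeAt twos (c .&. 31)` looks up
mask c = 1 `shiftL` bitIx c
}}}
)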
When compiled with the -fllvm (LLVM back end) flag, it produces the
following STG code for the inner loop (found by searching for "nxtc";
massively indented as displayed here):
{{{
let {
$wnxtc_s7Ru [InlPrag=[0],
Occ=LoopBreaker]
:: GHC.Prim.Int#
-> GHC.Prim.State#
GHC.Prim.RealWorld
-> (# GHC.Prim.State#
GHC.Prim.RealWorld,
() #)
[LclId,
Arity=2,
Str=DmdType <S,U><S,U>,
Unf=OtherCon []] =
sat-only \r srt:SRT:[] [ww1_s7Rv
w1_s7Rw]
case
># [ww1_s7Rv
131071#]
of
sat_s7Rx
{ __DEFAULT ->
case
tagToEnum# [sat_s7Rx]
of
_ [Occ=Dead]
{ GHC.Types.False ->
case
uncheckedIShiftRA# [ww1_s7Rv
5#]
of
i#_s7Rz [Dmd=<S,U>]
{ __DEFAULT ->
case
readWord32Array# [ipv1_s7R1
i#_s7Rz
w1_s7Rw]
of
_ [Occ=Dead]
{ (#,#) ipv8_s7RB [Occ=Once]
ipv9_s7RC [Occ=Once] ->
case
andI# [ww1_s7Rv
31#]
of
sat_s7RD
{ __DEFAULT ->
case
indexWord32Array# [ipv5_s7Rf
sat_s7RD]
of
wild5_s7RE
{ __DEFAULT ->
case
or# [ipv9_s7RC
wild5_s7RE]
of
sat_s7RF
{ __DEFAULT ->
case
writeWord32Array#
[ipv1_s7R1
i#_s7Rz
sat_s7RF
ipv8_s7RB]
of
s2#1_s7RG [OS=OneShot]
{ __DEFAULT ->
case
+# [ww1_s7Rv
ww_s7Rh]
of
sat_s7RH
{ __DEFAULT ->
$wnxtc_s7Ru
sat_s7RH
s2#1_s7RG;
};
};
};
};
};
};
};
GHC.Types.True ->
(#,#) [w1_s7Rw
GHC.Tuple.()];
};
};
} in
}}}
This, in turn, produces the following CMM code:
{{{
c8oB:
_s7R1::P64 = P64[_s7Ru::P64 + 6];
_s7Rf::P64 = P64[_s7Ru::P64 + 14];
_s7Rh::I64 = I64[_s7Ru::P64 + 22];
_c8oE::I64 = %MO_S_Gt_W64(_s7Rv::I64, 131071);
_s7Rx::I64 = _c8oE::I64;
switch [0 .. 1] _s7Rx::I64 {
case 0 : goto c8oM;
case 1 : goto c8oN;
}
c8oN:
R1 = GHC.Tuple.()_closure+1;
call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
c8oM:
_c8oP::I64 = %MO_S_Shr_W64(_s7Rv::I64, 5);
_s7Rz::I64 = _c8oP::I64;
_s7RC::I64 = %MO_UU_Conv_W32_W64(I32[(_s7R1::P64 + 16) +
(_s7Rz::I64 << 2)]);
_s7RC::I64 = _s7RC::I64;
_c8oS::I64 = _s7Rv::I64 & 31;
_s7RD::I64 = _c8oS::I64;
_c8oV::I64 = %MO_UU_Conv_W32_W64(I32[(_s7Rf::P64 + 16) +
(_s7RD::I64 << 2)]);
_s7RE::I64 = _c8oV::I64;
_c8oY::I64 = _s7RC::I64 | _s7RE::I64;
_s7RF::I64 = _c8oY::I64;
I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)] =
%MO_UU_Conv_W64_W32(_s7RF::I64);
_c8p3::I64 = _s7Rv::I64 + _s7Rh::I64;
_s7RH::I64 = _c8p3::I64;
_s7Rv::I64 = _s7RH::I64;
goto c8oB;
}}}
which is reduced to the following CMM code after many optimization passes:
{{{
c8oB:
switch [0 .. 1] (%MO_S_Gt_W64(_s7Rv::I64, 131071)) {
case 0 : goto c8oM;
case 1 : goto c8oN;
}
c8oN:
R1 = GHC.Tuple.()_closure+1;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
c8oM:
_s7R1::P64 = P64[_s7Ru::P64 + 6];
_s7Rh::I64 = I64[_s7Ru::P64 + 22];
_s7Rz::I64 = %MO_S_Shr_W64(_s7Rv::I64, 5);
I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)] =
%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7R1::P64 + 16) +
(_s7Rz::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[P64[_s7Ru::P64 + 14] +
((_s7Rv::I64 & 31 << 2) + 16)]));
_s7Rv::I64 = _s7Rv::I64 + _s7Rh::I64;
goto c8oB;
}}}
and finally the following assembly code:
{{{
.align 16, 0x90
.LBB29_1: # %c8oM
# =>This Inner Loop Header:
Depth=1
movq %r14, %rax
sarq $5, %rax
movq 6(%rbx), %rcx
movq 14(%rbx), %rdx
movl %r14d, %esi
andl $31, %esi
movl 16(%rdx,%rsi,4), %edx
addq 22(%rbx), %r14
orl %edx, 16(%rcx,%rax,4)
cmpq $131072, %r14 # imm = 0x20000
jl .LBB29_1
}}}
where one can clearly see the multiple reloads of loop-invariant values
from the closure inside the inner loop. This code runs at almost four CPU
clock cycles per loop on Intel Skylake.
It is easy to see that this code is partially non-strict by running it with
the `+RTS -s` command line option and observing that heap use is much
higher than it should be, although not so high that it causes a
significant amount of GC or cost in execution time. The extra execution
time is almost entirely due to the register reloads seen above inside the
inner loop.
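(For anyone wanting to reproduce the measurement, a typical invocation,
assuming the source above is saved as `Sieve.hs`; the `-rtsopts` flag is
already supplied by the `OPTIONS_GHC` pragma:
{{{
$ ghc -O2 -fllvm Sieve.hs
$ ./Sieve +RTS -s
}}}
The heap-allocation figures in the `-s` summary are where the excess
shows up.)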
'''The Workaround'''
By merely changing the inner loop as follows, the non-strictness goes away
(as seen in the amount of heap used, which drops from tens of megabytes to
a few tens of kilobytes):
{{{
let nxtc c =
      if c > top then nxtp (p + 1) else do
        let w = c `shiftR` 5
        v <- unsafeRead cmpstsw w
        unsafeWrite cmpstsw w (v .|. unsafeAt twos (c .&. 31))
        nxtc (c + p)
in nxtc s in twos `seq` nxtp 2
}}}
The modified code produces the following STG (note that `$wnxtc` is now a
`let-no-escape` binding, i.e. a join point, so its free variables need not
be refetched from a closure on each iteration); again massively indented
for display:
{{{
lvl21_s7Rl [Dmd=<S,U>]
{ __DEFAULT ->
let-no-escape {
$wnxtc_s7Rm [InlPrag=[0],
Occ=LoopBreaker]
:: GHC.Prim.Int#
-> GHC.Prim.State#
GHC.Prim.RealWorld
-> (# GHC.Prim.State#
GHC.Prim.RealWorld,
Data.Array.Base.STUArray
GHC.Prim.RealWorld
GHC.Types.Int
GHC.Types.Bool #)
[LclId,
Arity=2,
Str=DmdType <S,U><S,U>,
Unf=OtherCon []] =
sat-only \r srt:SRT:[] [ww3_s7Rn
w3_s7Ro]
case
># [ww3_s7Rn
131071#]
of
sat_s7Rp
{ __DEFAULT ->
case
tagToEnum# [sat_s7Rp]
of
_ [Occ=Dead]
{ GHC.Types.False ->
case
uncheckedIShiftRA# [ww3_s7Rn
5#]
of
i#_s7Rr [Dmd=<S,U>]
{ __DEFAULT ->
case
readWord32Array# [ipv1_s7Qj
i#_s7Rr
w3_s7Ro]
of
_ [Occ=Dead]
{ (#,#) ipv8_s7Rt [Occ=Once]
ipv9_s7Ru [Occ=Once] ->
case
andI# [ww3_s7Rn
31#]
of
sat_s7Rv
{ __DEFAULT ->
case
indexWord32Array#
[ipv5_s7Qx
sat_s7Rv]
of
wild7_s7Rw
{ __DEFAULT ->
case
or# [ipv9_s7Ru
wild7_s7Rw]
of
sat_s7Rx
{ __DEFAULT ->
case
writeWord32Array# [ipv1_s7Qj
i#_s7Rr
sat_s7Rx
ipv8_s7Rt]
of
s2#1_s7Ry
[OS=OneShot]
{ __DEFAULT ->
case
+#
[ww3_s7Rn
ww2_s7R8]
of
sat_s7Rz
{
__DEFAULT ->
$wnxtc_s7Rm
sat_s7Rz
s2#1_s7Ry;
};
};
};
};
};
};
};
GHC.Types.True ->
$wnxtp1_s7R7
lvl21_s7Rl
w3_s7Ro;
};
};
} in
$wnxtc_s7Rm
x1_s7Ra
ipv6_s7Rf;
};
};
}}}
which is converted to the following initial CMM code:
{{{
c8o0:
switch [0 .. 1] (%MO_S_Gt_W64(_s7QO::I64, 131071)) {
case 0 : goto c8o8;
case 1 : goto c8o9;
}
c8o9:
_s7Qz::I64 = _s7QM::I64;
goto c8ni;
c8o8:
_s7QS::I64 = %MO_S_Shr_W64(_s7QO::I64, 5);
I32[(_s7Qj::P64 + 16) + (_s7QS::I64 << 2)] =
%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7Qj::P64 + 16) +
(_s7QS::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[(_s7Qx::P64 + 16) +
(_s7QO::I64 & 31 << 2)]));
_s7QO::I64 = _s7QO::I64 + _s7Qz::I64;
goto c8o0;
}}}
and the following optimized CMM code:
{{{
c8p9:
switch [0 .. 1] (%MO_S_Gt_W64(_s7Rn::I64, 131071)) {
case 0 : goto c8ph;
case 1 : goto c8pi;
}
c8pi:
_s7R8::I64 = _s7Rl::I64;
goto c8ou;
c8ph:
_s7Rr::I64 = %MO_S_Shr_W64(_s7Rn::I64, 5);
I32[(_s7Qj::P64 + 16) + (_s7Rr::I64 << 2)] =
%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7Qj::P64 + 16) +
(_s7Rr::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[(_s7Qx::P64 + 16) +
(_s7Rn::I64 & 31 << 2)]));
_s7Rn::I64 = _s7Rn::I64 + _s7R8::I64;
goto c8p9;
}}}
to produce the following almost ideal assembly code (this particular code
doesn't seem to manifest the symptoms of ticket #12798):
{{{
.LBB29_10: # %c8ph
# Parent Loop BB29_7 Depth=1
# Parent Loop BB29_8 Depth=2
# => This Inner Loop Header:
Depth=3
movq %rsi, %rdx
sarq $5, %rdx
movl %esi, %edi
andl $31, %edi
movl 16(%rcx,%rdi,4), %edi
orl %edi, 16(%r10,%rdx,4)
addq %rax, %rsi
cmpq $131071, %rsi # imm = 0x1FFFF
jle .LBB29_10
}}}
which one can see has no reloads inside the loop and is almost ideal in
speed for the purpose: it runs at about 3.09 CPU clock cycles per loop,
whereas I have seen slightly reordered code produced by Clang/Rust/LLVM
that runs at about 3.00 clock cycles.
In order to fix the previous code using primitive Addr# operations, for
which this ticket was originally opened, one just has to convince the
compiler that it is to be evaluated strictly; although this is not so
easy, as one quickly runs into the error message about mixing lifted and
unlifted types.
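For what it's worth, here is a distilled sketch of the shape change the
workaround above makes, with hypothetical names (`escaping`/`joinable`)
and my own reading of why it helps; an illustration only, not verified to
reproduce the exact codegen shown above:
{{{
{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (ST)

-- Shape analogous to the original: the inner loop returns () and the
-- caller sequences the next step afterwards, so `go` escapes into the
-- enclosing `do` block; per the STG above, GHC then compiles it as a
-- closure whose free variables are refetched on every iteration.
escaping :: (Int -> ST s ()) -> ST s () -> Int -> Int -> ST s ()
escaping body after step top = do { go 0; after }
  where go !c | c > top   = return ()
              | otherwise = body c >> go (c + step)

-- Shape analogous to the workaround: the exit branch calls the
-- continuation directly, which lets GHC compile `go` as a
-- let-no-escape (join point) whose free variables stay in registers.
joinable :: (Int -> ST s ()) -> ST s () -> Int -> Int -> ST s ()
joinable body after step top = go 0
  where go !c | c > top   = after
              | otherwise = body c >> go (c + step)
}}}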
However, there is likely a whole range of programs where executing
entirely strictly is either not possible or not desired. I don't see why
non-strict boxed code (for Haskell, likely the majority of code) cannot be
optimized just as effectively.
'''In conclusion:''' this is a very serious performance bug that can cost
up to about half again the execution time (a 50% increase), occurs in many
use cases with a typical performance cost of about 30% (for instance, in
highly recursive code using list-based tail calls), and, I believe, has a
great deal to do with the general perception that (GHC) Haskell is very
much slower than the C-family languages (C/C++/Rust, etc.).
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12808#comment:12>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler