[GHC] #12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code Flow not lifted outside the loop...
GHC
ghc-devs at haskell.org
Fri Dec 9 11:23:57 UTC 2016
#12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code
Flow not lifted outside the loop...
-------------------------------------+-------------------------------------
Reporter: GordonBGood | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 8.0.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Runtime performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by GordonBGood):
It seems that loop-invariant code not being lifted out of loops is not
limited to primitive operations (including Addr#), but is the general case
for any code that is not purely strict; thus anything involving boxed
thunks does not seem to be optimized properly.
The following code, a simple naive Sieve of Eratosthenes implementation in
which the composite-number culling is run many times in a loop for more
reliable timing, demonstrates the problem:
{{{
{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O3 -rtsopts -keep-s-files -ddump-stg -ddump-cmm
                -ddump-opt-cmm -ddump-to-file -dumpdir . #-}
-- or: -O2 -fllvm -v -dcore-lint -ddump-asm

import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )

twos = listArray (0, 31) [ 1 `shiftL` i | i <- [0 .. 31]] :: UArray Int Word32

eos :: Int -> [Int]
eos top = [fromIntegral i | (i, False) <- assocs cmpsts] where
  cmpsts = runSTUArray $ do
    cmpstsb <- newArray (0, top) False :: ST s (STUArray s Int Bool)
    cmpstsw <- (castSTUArray :: STUArray s Int Bool
                             -> ST s (STUArray s Int Word32)) cmpstsb
    unsafeWrite cmpstsw 0 3 -- precull 0 and 1
    let loop i =
          if i <= 0 then return cmpstsb else
            let nxtp p =
                  let s = p * p in
                  if s > top then loop (i - 1) else do
                    v <- unsafeRead cmpstsw (p `shiftR` 5)
                    if v .&. unsafeAt twos (p .&. 31) /= 0 then nxtp (p + 1)
                      else
                        let nxtc c =
                              if c > top then return () else do
                                let w = c `shiftR` 5
                                v <- unsafeRead cmpstsw w
                                unsafeWrite cmpstsw w
                                  (v .|. unsafeAt twos (c .&. 31))
                                nxtc (c + p)
                        in do { nxtc s; nxtp (p + 1) }
            in twos `seq` nxtp 2
    loop (10000 :: Int)

main = print $ length $ eos 131071
}}}
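(For reference, the packing scheme: candidate `c` lives at bit `c .&. 31` of
the 32-bit word at index `c `shiftR` 5`, and `twos` merely memoises the 32
single-bit masks. A minimal sketch of that indexing; the helper names
`wordIx`, `bitIx` and `mask` are mine, not anything in the code above:
{{{
import Data.Bits ((.&.), shiftL, shiftR)
import Data.Word (Word32)

-- hypothetical helpers spelling out the indexing the sieve above uses
wordIx, bitIx :: Int -> Int
wordIx c = c `shiftR` 5   -- c `div` 32: which Word32 holds candidate c
bitIx  c = c .&. 31       -- c `mod` 32: which bit within that word

mask :: Int -> Word32     -- the mask that `unsafeAt twos (c .&. 31)` looks up
mask c = 1 `shiftL` bitIx c
}}}
)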
When compiled with the -fllvm (LLVM back end) flag, it produces the
following STG code for the inner loop (found by searching for "nxtc";
massively indented as displayed here):
{{{
let {
$wnxtc_s7Ru [InlPrag=[0],
Occ=LoopBreaker]
:: GHC.Prim.Int#
-> GHC.Prim.State#
GHC.Prim.RealWorld
-> (# GHC.Prim.State#
GHC.Prim.RealWorld,
() #)
[LclId,
Arity=2,
Str=DmdType <S,U><S,U>,
Unf=OtherCon []] =
sat-only \r srt:SRT:[] [ww1_s7Rv
w1_s7Rw]
case
># [ww1_s7Rv
131071#]
of
sat_s7Rx
{ __DEFAULT ->
case
tagToEnum# [sat_s7Rx]
of
_ [Occ=Dead]
{ GHC.Types.False ->
case
uncheckedIShiftRA# [ww1_s7Rv
5#]
of
i#_s7Rz [Dmd=<S,U>]
{ __DEFAULT ->
case
readWord32Array# [ipv1_s7R1
i#_s7Rz
w1_s7Rw]
of
_ [Occ=Dead]
{ (#,#) ipv8_s7RB [Occ=Once]
ipv9_s7RC [Occ=Once] ->
case
andI# [ww1_s7Rv
31#]
of
sat_s7RD
{ __DEFAULT ->
case
indexWord32Array# [ipv5_s7Rf
sat_s7RD]
of
wild5_s7RE
{ __DEFAULT ->
case
or# [ipv9_s7RC
wild5_s7RE]
of
sat_s7RF
{ __DEFAULT ->
case
writeWord32Array#
[ipv1_s7R1
i#_s7Rz
sat_s7RF
ipv8_s7RB]
of
s2#1_s7RG [OS=OneShot]
{ __DEFAULT ->
case
+# [ww1_s7Rv
ww_s7Rh]
of
sat_s7RH
{ __DEFAULT ->
$wnxtc_s7Ru
sat_s7RH
s2#1_s7RG;
};
};
};
};
};
};
};
GHC.Types.True ->
(#,#) [w1_s7Rw
GHC.Tuple.()];
};
};
} in
}}}
This, in turn, produces the following CMM code:
{{{
c8oB:
_s7R1::P64 = P64[_s7Ru::P64 + 6];
_s7Rf::P64 = P64[_s7Ru::P64 + 14];
_s7Rh::I64 = I64[_s7Ru::P64 + 22];
_c8oE::I64 = %MO_S_Gt_W64(_s7Rv::I64, 131071);
_s7Rx::I64 = _c8oE::I64;
switch [0 .. 1] _s7Rx::I64 {
case 0 : goto c8oM;
case 1 : goto c8oN;
}
c8oN:
R1 = GHC.Tuple.()_closure+1;
call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
c8oM:
_c8oP::I64 = %MO_S_Shr_W64(_s7Rv::I64, 5);
_s7Rz::I64 = _c8oP::I64;
_s7RC::I64 = %MO_UU_Conv_W32_W64(I32[(_s7R1::P64 + 16) +
(_s7Rz::I64 << 2)]);
_s7RC::I64 = _s7RC::I64;
_c8oS::I64 = _s7Rv::I64 & 31;
_s7RD::I64 = _c8oS::I64;
_c8oV::I64 = %MO_UU_Conv_W32_W64(I32[(_s7Rf::P64 + 16) +
(_s7RD::I64 << 2)]);
_s7RE::I64 = _c8oV::I64;
_c8oY::I64 = _s7RC::I64 | _s7RE::I64;
_s7RF::I64 = _c8oY::I64;
I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)] =
%MO_UU_Conv_W64_W32(_s7RF::I64);
_c8p3::I64 = _s7Rv::I64 + _s7Rh::I64;
_s7RH::I64 = _c8p3::I64;
_s7Rv::I64 = _s7RH::I64;
goto c8oB;
}}}
which is reduced to the following CMM code after many optimization passes:
{{{
c8oB:
switch [0 .. 1] (%MO_S_Gt_W64(_s7Rv::I64, 131071)) {
case 0 : goto c8oM;
case 1 : goto c8oN;
}
c8oN:
R1 = GHC.Tuple.()_closure+1;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
c8oM:
_s7R1::P64 = P64[_s7Ru::P64 + 6];
_s7Rh::I64 = I64[_s7Ru::P64 + 22];
_s7Rz::I64 = %MO_S_Shr_W64(_s7Rv::I64, 5);
I32[(_s7R1::P64 + 16) + (_s7Rz::I64 << 2)] =
%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7R1::P64 + 16) +
(_s7Rz::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[P64[_s7Ru::P64 + 14] +
((_s7Rv::I64 & 31 << 2) + 16)]));
_s7Rv::I64 = _s7Rv::I64 + _s7Rh::I64;
goto c8oB;
}}}
and finally the following assembly code:
{{{
.align 16, 0x90
.LBB29_1: # %c8oM
# =>This Inner Loop Header:
Depth=1
movq %r14, %rax
sarq $5, %rax
movq 6(%rbx), %rcx
movq 14(%rbx), %rdx
movl %r14d, %esi
andl $31, %esi
movl 16(%rdx,%rsi,4), %edx
addq 22(%rbx), %r14
orl %edx, 16(%rcx,%rax,4)
cmpq $131072, %r14 # imm = 0x20000
jl .LBB29_1
}}}
where one can clearly see the multiple reloads of loop-invariant values
from the closure inside the inner loop. This code runs at almost four CPU
clock cycles per loop on Intel Skylake.
It is easy to see that this code is partially non-strict by running it with
the `+RTS -s` command line option and observing that heap use is much
higher than it should be, although not so high that it causes a
significant amount of GC or cost in execution time. The extra execution
time is almost entirely due to the register reloads seen above inside the
inner loop.
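(For anyone wanting to reproduce the measurement, a typical invocation,
assuming the source above is saved as `Sieve.hs`; the `-rtsopts` flag is
already supplied by the `OPTIONS_GHC` pragma:
{{{
$ ghc -O2 -fllvm Sieve.hs
$ ./Sieve +RTS -s
}}}
The heap-allocation figures in the `-s` summary are where the excess
shows up.)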
'''The Workaround'''
By merely changing the inner loop as follows, the non-strictness goes away
(as seen in the amount of heap used, which drops from tens of megabytes to
a few tens of kilobytes):
{{{
let nxtc c =
      if c > top then nxtp (p + 1) else do
        let w = c `shiftR` 5
        v <- unsafeRead cmpstsw w
        unsafeWrite cmpstsw w (v .|. unsafeAt twos (c .&. 31))
        nxtc (c + p)
in nxtc s in twos `seq` nxtp 2
}}}
The modified code produces the following STG (note that `$wnxtc` is now a
`let-no-escape` binding, i.e. a join point, so its free variables need not
be refetched from a closure on each iteration); again massively indented
for display:
{{{
lvl21_s7Rl [Dmd=<S,U>]
{ __DEFAULT ->
let-no-escape {
$wnxtc_s7Rm [InlPrag=[0],
Occ=LoopBreaker]
:: GHC.Prim.Int#
-> GHC.Prim.State#
GHC.Prim.RealWorld
-> (# GHC.Prim.State#
GHC.Prim.RealWorld,
Data.Array.Base.STUArray
GHC.Prim.RealWorld
GHC.Types.Int
GHC.Types.Bool #)
[LclId,
Arity=2,
Str=DmdType <S,U><S,U>,
Unf=OtherCon []] =
sat-only \r srt:SRT:[] [ww3_s7Rn
w3_s7Ro]
case
># [ww3_s7Rn
131071#]
of
sat_s7Rp
{ __DEFAULT ->
case
tagToEnum# [sat_s7Rp]
of
_ [Occ=Dead]
{ GHC.Types.False ->
case
uncheckedIShiftRA# [ww3_s7Rn
5#]
of
i#_s7Rr [Dmd=<S,U>]
{ __DEFAULT ->
case
readWord32Array# [ipv1_s7Qj
i#_s7Rr
w3_s7Ro]
of
_ [Occ=Dead]
{ (#,#) ipv8_s7Rt [Occ=Once]
ipv9_s7Ru [Occ=Once] ->
case
andI# [ww3_s7Rn
31#]
of
sat_s7Rv
{ __DEFAULT ->
case
indexWord32Array#
[ipv5_s7Qx
sat_s7Rv]
of
wild7_s7Rw
{ __DEFAULT ->
case
or# [ipv9_s7Ru
wild7_s7Rw]
of
sat_s7Rx
{ __DEFAULT ->
case
writeWord32Array# [ipv1_s7Qj
i#_s7Rr
sat_s7Rx
ipv8_s7Rt]
of
s2#1_s7Ry
[OS=OneShot]
{ __DEFAULT ->
case
+#
[ww3_s7Rn
ww2_s7R8]
of
sat_s7Rz
{
__DEFAULT ->
$wnxtc_s7Rm
sat_s7Rz
s2#1_s7Ry;
};
};
};
};
};
};
};
GHC.Types.True ->
$wnxtp1_s7R7
lvl21_s7Rl
w3_s7Ro;
};
};
} in
$wnxtc_s7Rm
x1_s7Ra
ipv6_s7Rf;
};
};
}}}
which is converted to the following initial CMM code:
{{{
c8o0:
switch [0 .. 1] (%MO_S_Gt_W64(_s7QO::I64, 131071)) {
case 0 : goto c8o8;
case 1 : goto c8o9;
}
c8o9:
_s7Qz::I64 = _s7QM::I64;
goto c8ni;
c8o8:
_s7QS::I64 = %MO_S_Shr_W64(_s7QO::I64, 5);
I32[(_s7Qj::P64 + 16) + (_s7QS::I64 << 2)] =
%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7Qj::P64 + 16) +
(_s7QS::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[(_s7Qx::P64 + 16) +
(_s7QO::I64 & 31 << 2)]));
_s7QO::I64 = _s7QO::I64 + _s7Qz::I64;
goto c8o0;
}}}
and the following optimized CMM code:
{{{
c8p9:
switch [0 .. 1] (%MO_S_Gt_W64(_s7Rn::I64, 131071)) {
case 0 : goto c8ph;
case 1 : goto c8pi;
}
c8pi:
_s7R8::I64 = _s7Rl::I64;
goto c8ou;
c8ph:
_s7Rr::I64 = %MO_S_Shr_W64(_s7Rn::I64, 5);
I32[(_s7Qj::P64 + 16) + (_s7Rr::I64 << 2)] =
%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(I32[(_s7Qj::P64 + 16) +
(_s7Rr::I64 << 2)]) | %MO_UU_Conv_W32_W64(I32[(_s7Qx::P64 + 16) +
(_s7Rn::I64 & 31 << 2)]));
_s7Rn::I64 = _s7Rn::I64 + _s7R8::I64;
goto c8p9;
}}}
to produce the following almost ideal assembly code (this particular code
doesn't seem to manifest the symptoms of ticket #12798):
{{{
.LBB29_10: # %c8ph
# Parent Loop BB29_7 Depth=1
# Parent Loop BB29_8 Depth=2
# => This Inner Loop Header:
Depth=3
movq %rsi, %rdx
sarq $5, %rdx
movl %esi, %edi
andl $31, %edi
movl 16(%rcx,%rdi,4), %edi
orl %edi, 16(%r10,%rdx,4)
addq %rax, %rsi
cmpq $131071, %rsi # imm = 0x1FFFF
jle .LBB29_10
}}}
which one can see has no reloads inside the loop and is almost ideal in
speed for the purpose: it runs at about 3.09 CPU clock cycles per loop,
whereas I have seen slightly reordered code produced by Clang/Rust/LLVM
that runs at about 3.00 clock cycles.
In order to fix the previous code using primitive Addr# operations, for
which this ticket was originally opened, one just has to convince the
compiler that it is to be evaluated strictly; although this is not so
easy, as one quickly runs into the error message about mixing lifted and
unlifted types.
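For what it's worth, here is a distilled sketch of the shape change the
workaround above makes, with hypothetical names (`escaping`/`joinable`)
and my own reading of why it helps; an illustration only, not verified to
reproduce the exact codegen shown above:
{{{
{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (ST)

-- Shape analogous to the original: the inner loop returns () and the
-- caller sequences the next step afterwards, so `go` escapes into the
-- enclosing `do` block; per the STG above, GHC then compiles it as a
-- closure whose free variables are refetched on every iteration.
escaping :: (Int -> ST s ()) -> ST s () -> Int -> Int -> ST s ()
escaping body after step top = do { go 0; after }
  where go !c | c > top   = return ()
              | otherwise = body c >> go (c + step)

-- Shape analogous to the workaround: the exit branch calls the
-- continuation directly, which lets GHC compile `go` as a
-- let-no-escape (join point) whose free variables stay in registers.
joinable :: (Int -> ST s ()) -> ST s () -> Int -> Int -> ST s ()
joinable body after step top = go 0
  where go !c | c > top   = after
              | otherwise = body c >> go (c + step)
}}}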
However, there is likely a whole range of programs where executing
entirely strictly is either not possible or not desired. I don't see why
non-strict boxed code (for Haskell, likely the majority of code) cannot be
optimized just as effectively.
'''In conclusion:''' this is a very serious performance bug that can cost
up to about half again the execution time (a 50% increase), occurs in many
use cases with a typical performance cost of about 30% (for instance, in
highly recursive code using list-based tail calls), and, I believe, has a
great deal to do with the general perception that (GHC) Haskell is very
much slower than the C-family languages (C/C++/Rust, etc.).
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12808#comment:12>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler