[GHC] #12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code Flow not lifted outside the loop...
GHC
ghc-devs at haskell.org
Fri Dec 9 11:30:42 UTC 2016
#12808: For non-strict code including primitive (Addr#) code, Loop Invariant Code
Flow not lifted outside the loop...
-------------------------------------+-------------------------------------
Reporter: GordonBGood | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 8.0.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Description changed by GordonBGood:
@@ -12,11 +12,15 @@
- '''Description of test code:''' Essentially, this method involves
- extreme loop unrolling with very imperative code although coded
- functionally; in the case of the following code it means that, recognizing
- that for all odd primes (which they all are other than two), and that all
- word sizes are of an even number of bits, there will be a repeating
- pattern of composite number culls that repeats every word size number of
- bits. Thus for a word size of one eight-bit byte, we can unroll to eight
- composite culls in the body of one loop, with loops cases for the primes
- modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions (b0..b7)
- meaning there are four times eight is 32 loop cases. When there are no
- longer a full eight culls left, the culling reverts to conventional
+ '''Shortest possible test code that clearly shows non-strict code not
+ being optimized, but optimized when made strict:''' Please refer directly
+ to comment 12https://ghc.haskell.org/trac/ghc/ticket/12808#comment:12,
+
+ '''A version of test code that triggered this ticket:''' Essentially,
+ this method involves extreme loop unrolling with very imperative code
+ although coded functionally; in the case of the following code it means
+ that, recognizing that for all odd primes (which they all are other than
+ two), and that all word sizes are of an even number of bits, there will be
+ a repeating pattern of composite number culls that repeats every word size
+ number of bits. Thus for a word size of one eight-bit byte, we can unroll
+ to eight composite culls in the body of one loop, with loops cases for the
+ primes modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions
+ (b0..b7) meaning there are four times eight is 32 loop cases. When there
+ are no longer a full eight culls left, the culling reverts to conventional
New description:
'''Background:''' I've been intrigued investigating whether GHC can
produce code "as fast as Cee (C/C++/Rust/etc.)" by-any-means-possible, and
have been using the very tight inner composite culling loops (purely
integer operations) of a basic Sieve of Eratosthenes implementation as a
test vehicle.
'''Synopsis:''' This is a follow-on of the work leading to finding the
efficiency problem described in ticket #12798, but involves pushing the
speed even further as per the method described for "primesieve" as per
[http://primesieve.org/] in the "Highly optimized inner loop" section.
'''Shortest possible test code that clearly shows non-strict code not
being optimized, but optimized when made strict:''' Please refer directly
to comment 12https://ghc.haskell.org/trac/ghc/ticket/12808#comment:12,
'''A version of test code that triggered this ticket:''' Essentially,
this method involves extreme loop unrolling with very imperative code
although coded functionally; in the case of the following code it means
that, recognizing that for all odd primes (which they all are other than
two), and that all word sizes are of an even number of bits, there will be
a repeating pattern of composite number culls that repeats every word size
number of bits. Thus for a word size of one eight-bit byte, we can unroll
to eight composite culls in the body of one loop, with loops cases for the
primes modulo 8 of 1, 3, 5, and 7, and for the eight bit start positions
(b0..b7) meaning there are four times eight is 32 loop cases. When there
are no longer a full eight culls left, the culling reverts to conventional
single-cull-per-loop as per the test program of ticket #12798.
To do this using GHC we need pointer arithmetic, and the only way to
implement pointer arithmetic in GHC is to use the Addr# primitive.
GHC/Haskell has one other slight overhead over Cee languages in that we
need to move the culling array to a pinned array to avoid having it moved
while the culling is going on and then move it back when finished but this
takes a negligible amount of time (one percent or so) as compared to the
culling. As usual for test programs, the culling operations are repeated
in a loop for a number of times to give more accurate timing not
influenced by execution not related to the culling. All of this is
included in the following code (truncated as to loop coses for inclusion
here):
{{{#!hs
-- EfficiencyBug.hs
-- showing that there is a register loop invariant bug in generation of
assembler code...
-- LLVM shows the bug clearer since the code is generally a little
faster...
{-# LANGUAGE FlexibleContexts, BangPatterns, MagicHash, UnboxedTuples #-}
{-# OPTIONS_GHC -O2 -rtsopts -keep-s-files #-} -- or -O2 -fllvm
import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
import GHC.ST ( ST(..) )
import GHC.Exts
numLOOPS = 10000 :: Int
-- Uses a very simple Sieve of Eratosthenes for fixed 2 ^ 18 range (so one
L1 cache size) to prove it.
twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]
soep1 :: () -> [Word32]
soep1() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
bufb = runSTUArray $ do
let bfBts = (256 * 1024) `div` 2 -- to 2^18 + 2 is 128 KBits = 16 KBytes
bf <- newArray (0, bfBts - 1) False :: ST s (STUArray s Int Bool)
cullb bf
cullb bf@(STUArray l u n marr#) = ST $ \s0# ->
case getSizeofMutableByteArray# marr# s0# of { (# s1#, n# #) ->
let loop t mr# s0# = -- cull a number of times to test timing
if t <= 0 then (# s0#, STUArray l u n mr# #) else
case getSizeofMutableByteArray# mr# s0# of { (# s1#, n# #) ->
case newPinnedByteArray# n# s1# of { (# s2#, marr'# #) ->
case copyMutableByteArray# mr# 0# marr'# 0# n# s2# of { s3# ->
case unsafeFreezeByteArray# marr'# s3# of { (# s4#, arr# #) -> --
must do this
case byteArrayContents# arr# of { adr# -> -- to obtain
the addr# here
let cullp i@(I# i#) sp# =
let !p@(I# p#) = i + i + 3 in
let !s@(I# s#) = (p * p - 3) `div` 2 in
if s >= n then
case copyMutableByteArray# marr'# 0# mr# 0# n# sp# of
so# -> (# so#, mr# #) else
let !(UArray _ _ _ tarr#) = twos in
case readWord64Array# marr# (i# `uncheckedIShiftRL#` 6#)
sp# of { (# sp0#, v0# #) ->
case (v0# `and#` ((int2Word# 1#) `uncheckedShiftL#` (i#
`andI#` 63#))) `eqWord#` (int2Word# 0#) of
0# -> cullp (i + 1) sp0# -- not prime
_ -> -- is prime
-- most program execution time spent in the following
tight loops.
-- the following code implments extream loop
unrolling...
let !pi@(I# pi#) = fromIntegral p in
let !sw@(I# sw#) = s `shiftR` 3 in let !sb@(I# sb#) =
s .&. 7 in
let p1 = sb + pi in let !(I# r1#) = p1 `shiftR` 3 in
let p2 = p1 + pi in let !(I# r2#) = p2 `shiftR` 3 in
let p3 = p2 + pi in let !(I# r3#) = p3 `shiftR` 3 in
let p4 = p3 + pi in let !(I# r4#) = p4 `shiftR` 3 in
let p5 = p4 + pi in let !(I# r5#) = p5 `shiftR` 3 in
let p6 = p5 + pi in let !(I# r6#) = p6 `shiftR` 3 in
let p7 = p6 + pi in let !(I# r7#) = p7 `shiftR` 3 in
let !lmt@(I# lmt#) = (fromIntegral n `shiftR` 3) - p7
in
let !lmt1# = plusAddr# adr# lmt# in
let !strt# = plusAddr# adr# sw# in
let !(I# n#) = n in
let (# !so#, !sco# #) = case ((((p - 1) `div` 2) .&.
3) `shiftL` 3) + sb of {
0 ->
let cull c# sp# =
case c# `ltAddr#` lmt1# of
0# -> (# c#, sp# #)
_ ->
case readWord8OffAddr# c# 0# sp# of { (#
sp0#, v0# #) ->
case writeWord8OffAddr# c# 0# (v0# `or#`
(int2Word# 1#)) sp0# of { sp1# ->
case readWord8OffAddr# c# r1# sp1# of {
(# sp2#, v1# #) ->
case writeWord8OffAddr# c# r1# (v1#
`or#` (int2Word# 2#)) sp2# of { sp3# ->
case readWord8OffAddr# c# r2# sp3# of {
(# sp4#, v2# #) ->
case writeWord8OffAddr# c# r2# (v2#
`or#` (int2Word# 4#)) sp4# of { sp5# ->
case readWord8OffAddr# c# r3# sp5# of {
(# sp6#, v3# #) ->
case writeWord8OffAddr# c# r3# (v3#
`or#` (int2Word# 8#)) sp6# of { sp7# ->
case readWord8OffAddr# c# r4# sp7# of {
(# sp8#, v4# #) ->
case writeWord8OffAddr# c# r4# (v4#
`or#` (int2Word# 16#)) sp8# of { sp9# ->
case readWord8OffAddr# c# r5# sp9# of {
(# sp10#, v5# #) ->
case writeWord8OffAddr# c# r5# (v5#
`or#` (int2Word# 32#)) sp10# of { sp11# ->
case readWord8OffAddr# c# r6# sp11# of {
(# sp12#, v6# #) ->
case writeWord8OffAddr# c# r6# (v6#
`or#` (int2Word# 64#)) sp12# of { sp13# ->
case readWord8OffAddr# c# r7# sp13# of {
(# sp14#, v7# #) ->
case writeWord8OffAddr# c# r7# (v7#
`or#` (int2Word# 128#)) sp14# of { sp15# ->
cull (plusAddr# c# pi#) sp15#
}}}}}}}}}}}}}}}} in
cull strt# sp0#;
1 ->
let cull c# sp# =
case c# `ltAddr#` lmt1# of
0# -> (# c#, sp# #)
_ ->
case readWord8OffAddr# c# 0# sp# of { (#
sp0#, v0# #) ->
case writeWord8OffAddr# c# 0# (v0# `or#`
(int2Word# 2#)) sp0# of { sp1# ->
case readWord8OffAddr# c# r1# sp1# of {
(# sp2#, v1# #) ->
case writeWord8OffAddr# c# r1# (v1#
`or#` (int2Word# 4#)) sp2# of { sp3# ->
case readWord8OffAddr# c# r2# sp3# of {
(# sp4#, v2# #) ->
case writeWord8OffAddr# c# r2# (v2#
`or#` (int2Word# 8#)) sp4# of { sp5# ->
case readWord8OffAddr# c# r3# sp5# of {
(# sp6#, v3# #) ->
case writeWord8OffAddr# c# r3# (v3#
`or#` (int2Word# 16#)) sp6# of { sp7# ->
case readWord8OffAddr# c# r4# sp7# of {
(# sp8#, v4# #) ->
case writeWord8OffAddr# c# r4# (v4#
`or#` (int2Word# 32#)) sp8# of { sp9# ->
case readWord8OffAddr# c# r5# sp9# of {
(# sp10#, v5# #) ->
case writeWord8OffAddr# c# r5# (v5#
`or#` (int2Word# 64#)) sp10# of { sp11# ->
case readWord8OffAddr# c# r6# sp11# of {
(# sp12#, v6# #) ->
case writeWord8OffAddr# c# r6# (v6#
`or#` (int2Word# 128#)) sp12# of { sp13# ->
case readWord8OffAddr# c# r7# sp13# of {
(# sp14#, v7# #) ->
case writeWord8OffAddr# c# r7# (v7#
`or#` (int2Word# 1#)) sp14# of { sp15# ->
cull (plusAddr# c# pi#) sp15#
}}}}}}}}}}}}}}}} in
cull strt# sp0#;
-- and so on for 30 more cases...
_ -> (# strt#, sp0# #) {- normally never taken case,
all cases covered -} } in
let !ns# = ((minusAddr# so# adr#) `uncheckedIShiftL#`
3#) +# sb# in
-- extreme loop unrolling ends here; remaining primes
culled conventionally...
let cull j# sc# = -- very tight inner loop
case j# <# n# of
0# -> cullp (i + 1) sc#
_ -> let i# = j# `andI#` 31# in
let !sh# = indexWord32Array# tarr# i# in
-- (1 `shiftL` (j .&. 31)))
let w# = j# `uncheckedIShiftRL#` 5# in
case readWord32Array# marr'# w# sc# of {
(# sc0#, ov# #) ->
case writeWord32Array# marr'# w# (ov#
`or#` sh#) sc0# of { sc1# ->
cull (j# +# pi#) sc1# }} in
cull ns# sp0# } in
case cullp 0 s4# of (# sp#, mrp# #) -> loop (t - 1) mrp# sp# }}}}}
in loop numLOOPS marr# s1# }
main = print $ length $ soep1()
}}}
'''The problem:''' The problem is in the innermost loop of the cases, for
which case "0" the following assembly code (using -fllvm) is produced:
{{{
seGU_info$def:
# BB#0: # %cgRL
cmpq %r14, 70(%rbx)
jbe .LBB35_1
.align 16, 0x90
.LBB35_2: # %cgRJ
# =>This Inner Loop Header:
Depth=1
movq 14(%rbx), %rcx
movq 22(%rbx), %rdx
movq 30(%rbx), %rsi
movq 38(%rbx), %rdi
movq 46(%rbx), %r8
movq 54(%rbx), %r9
movq 62(%rbx), %r10
movq 6(%rbx), %rax
addq %r14, %rax
orb $1, (%r14)
orb $2, (%rcx,%r14)
orb $4, (%rdx,%r14)
orb $8, (%rsi,%r14)
orb $16, (%rdi,%r14)
orb $32, (%r8,%r14)
orb $64, (%r9,%r14)
orb $-128, (%r10,%r14)
cmpq 70(%rbx), %rax
movq %rax, %r14
jb .LBB35_2
jmp .LBB35_3
.LBB35_1:
movq %r14, %rax
.LBB35_3: # %cgRK
movq (%rbp), %rcx
movq %rax, %rbx
rex64 jmpq *%rcx # TAILCALL
}}}
One can readily see that the compiler is not lifting the Loop Invariant
Code Flow as in initializing the registers to outside the inner loop,
meaning there are many register loads from memory which are not necessary.
'''Desired results:''' The desired assembly code is something like the
following, which is similar to what is produced by Cee (C/C++/Rust/etc.):
{{{
seGU_info$def:
# BB#0: # %cgRL
movq 14(%rbx), %rcx
movq 22(%rbx), %rdx
movq 30(%rbx), %rsi
movq 38(%rbx), %rdi
movq 46(%rbx), %r8
movq 54(%rbx), %r9
movq 62(%rbx), %r10
movq 6(%rbx), %rax
movq 70(%rbx), %rbx
cmpq %r14, %rbx # rbx clobbered here, but old
value
jbe .LBB35_1 # never used again until replaced
after loop
.align 16, 0x90
.LBB35_2: # %cgRJ
# =>This Inner Loop Header:
Depth=1
orb $1, (%r14)
orb $2, (%rcx,%r14)
orb $4, (%rdx,%r14)
orb $8, (%rsi,%r14)
orb $16, (%rdi,%r14)
orb $32, (%r8,%r14)
orb $64, (%r9,%r14)
orb $-128, (%r10,%r14)
addq %rax, %r14
cmpq %rbx, %r14
jb .LBB35_2
jmp .LBB35_3
.LBB35_1:
movq %r14, %rax
.LBB35_3: # %cgRK
movq (%rbp), %rcx
movq %rax, %rbx # rbx clobbered here anyway
rex64 jmpq *%rcx # TAILCALL
}}}
'''Full testing:''' The actual unrolled loop code including all the case
loops is too long to post here, but to verify the result is correct
(23000) and the performance, the full actual file is attached here. Due
to the magic of modern CPU instruction fusion and Out Of Order (OOE)
execution, the code is not as slow as it would indicate by the number of
increased instructions, but while it is about twice as fast as when culled
conventionally (Intel Skylake), it is about half again as slow as Cee. On
an Intel Sky Lake i5-6500 (running at 3.5 GHz for single threading), this
takes about one second, about two seconds culled conventionally, but only
about 0.6 seconds for Rust/LLVM (with the assembly code output essentially
identical to the "desired" code).
'''Other back ends and targets:''' Although the code generated by the
native NCG has other problems (not moving the loop test to the end of the
loop to avoid one jump, and not combining the read and modify and store
instructions into the single available instruction), it exhibits the same
problem as to not lifting the Loop Invariant Code Flow register
initialization.
Although this code is x86_64, the problem also applies to x86 code even
though the x86 architecture doesn't have enough registers to do this in
one loop and needs to be split into two loops culling only four composites
per loop, but there still is a significant gain in speed. Although not
tested, it probably also applies to other targets such as ARM (which has
many general purpose registers).
'''Conclusion:''' The use of Addr# primitives is probably not a frequent
use case, but as shown here that when one needs their use, they should be
efficient.
I considered that GHC may intentionally limit the performance of these
unsafe primitives to limit their use unless absolutely necessary as in
marshalling, something as C# does for the use of unsafe pointers, but
surely GHC would not do that as the target programmers are different.
'''If this and ticket #12798 were fixed, for this use case the GHC code
would be within a percent or two of the performance of Cee.'''
--
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12808#comment:13>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list