[GHC] #12798: LLVM seeming to over optimize, producing inefficient assembly code...
GHC
ghc-devs at haskell.org
Wed Nov 2 04:22:21 UTC 2016
#12798: LLVM seeming to over optimize, producing inefficient assembly code...
-------------------------------------+-------------------------------------
Reporter: GordonBGood | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler | Version: 8.0.1
(LLVM) |
Keywords: | Operating System: Unknown/Multiple
Architecture: | Type of failure: Runtime
Unknown/Multiple | performance bug
Test Case: | Blocked By:
Blocking: | Related Tickets:
Differential Rev(s): | Wiki Page:
-------------------------------------+-------------------------------------
Since in many cases, the use of the LLVM backend is the only way to avoid
the NCG's poor register allocation (ticket #8971), this is a concern that
using "-fllvm" is producing overly complex code through a (seeming) failed
attempt to optimize.
The following code uses a very simple "odds-only" implementation of the
Sieve of Eratosthenes with a very tight inner culling loop limited to
using a 16 Kilobyte buffer (<= the size of most modern CPU L1 data cache
size) to reproduce the problem; it uses a "twos" Look Up Table (LUT) for
better speed than using a variable shift left operation for setting the
composite bits in the buffer array as it (should) take the same number of
registers and the array look-up instruction is easier for the CPU to fuse
than a variable shift left:
{{{#!hs
-- GHC_EfficiencyBug.hs
{-# LANGUAGE FlexibleContexts #-}
{-# OPTIONS_GHC -O3 -fllvm -rtsopts -keep-s-files #-} -- or -O2
import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base
numLOOPS = 10000 :: Int
twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]
soe :: () -> [Word32]
soe() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
bufb = runSTUArray $ do
let bfLmt = (256 * 1024) `div` 2 - 1 -- to 2^18 + 2 is 128 KBits - 1 =
16 KBytes
cmpstsb <- newArray (0, bfLmt) False :: ST s (STUArray s Int Bool)
cmpstsw <- (castSTUArray :: STUArray s Int Bool -> ST s (STUArray s Int
Word32)) cmpstsb
let loop n = -- cull a number of times to test timing
if n <= 0 then return cmpstsb else
let cullp i =
let p = i + i + 3 in
let s = (p * p - 3) `div` 2 in
if s > bfLmt then loop (n - 1) else do
isCmpst <- unsafeRead cmpstsb i
if isCmpst then cullp (i + 1) else -- is Prime
let cull j = -- very tight inner loop where all the time
is spent
if j > bfLmt then cullp (i + 1) else do
let sh = unsafeAt twos (j .&. 31) -- (1 `shiftL`
(j .&. 31)))
let w = j `shiftR` 5
ov <- unsafeRead cmpstsw w
unsafeWrite cmpstsw w (ov .|. sh)
cull (j + p) in cull s in cullp 0
loop numLOOPS
main = print $ length $ soe()
}}}
The main culling is repeated "numLOOPS" times to get a reasonable
execution time for accurate timing and to make the time required to use
the List comprehension to determine the number of found primes (the
answer) a negligible part of the execution time. Timing results can be
produced by running "./GHC_EfficiencyBug +RTS -s".
The desired assembly code result for the tight inner loop is as for the
Rust/LLVM compiler, in this case for x86_64 64-bit code:
{{{
.p2align 4, 0x90
.LBB10_27:
movq %rcx, %rdx
shrq $5, %rdx
movl %ecx, %esi
andl $31, %esi
movl (%rbp,%rsi,4), %esi
orl %esi, (%r14,%rdx,4)
addq %rax, %rcx
.LBB10_26:
cmpq %r13, %rcx
jb .LBB10_27
}}}
The above code is extremely efficient on a CPU that is not cache bottle
necked (such as the AMD Bulldozer series are) and takes just about three
clock cycles per inner composite culling loop on Intel Sky Lake; it is
just as efficient for x86 code since there are only seven registers used
in this inner loop.
Due to this attempt at "over-optimization", the GHC/LLVM backend produces
the following x86_64 64-bit code:
{{{
.align 16, 0x90
.LBB34_2: # %c8T2
# =>This Inner Loop Header:
Depth=1
movq %rcx, %rsi
sarq $5, %rsi
movl %r8d, %edi
andl $124, %edi
movl 16(%rax,%rdi), %edi
orl %edi, 16(%r11,%rsi,4)
addq %r14, %rcx
addq %rdx, %r8
cmpq %r10, %rcx
jle .LBB34_2
}}}
As can be seen, instead of just masking the "twos" index register by 31
(0x1F), the code is using two extra separate registers to contain "(j *
4)" increment and the accumulated index, which increment is added to the
"twos" index register per loop and masked by 124 (0x7C or 0x1F times 4),
requiring an extra two registers and an extra instruction for the extra
addition. This isn't a problem as to the number of registers for x86_64
code which has more than enough, but it adds the extra instruction
execution time of one third of a CPU clock cycle (I know, only one ninth
extra time).
However, for 32-bit x86 code with barely enough registers previously, the
use of the extra registers triggers a chain of three register reloads as
can be seen in the following assembly code:
{{{
.align 16, 0x90
LBB33_2: # %c8Wb
# =>This Inner Loop Header:
Depth=1
movl %ebx, %ebp
sarl $5, %ebp
movl %edi, %ecx
andl $124, %ecx
movl %esi, %edx
movl %eax, %esi
movl 36(%esp), %eax # 4-byte Reload
movl 8(%eax,%ecx), %ecx
movl %esi, %eax
movl %edx, %esi
orl %ecx, 8(%esi,%ebp,4)
addl 32(%esp), %ebx # 4-byte Folded Reload
addl 28(%esp), %edi # 4-byte Folded Reload
cmpl %eax, %ebx
jle LBB33_2
}}}
'''The above code runs about 25% slower than it should on Intel Sky Lake
for this 32-bit code.'''
This was tested for GHC version 8.0.1 under both Windows and Linux for
both 32-bit and 64-bit code with identical results for each native code
width.
The code was also tested for 32 and 64 bit code produced by the NCG; for
this specific problem, NCG takes the simple approach and does not waste
the extra register. However, due to the inefficient allocation of
registers as per ticket #8971, not moving the loop completion check to the
end of the loop and thus requiring an extra jump instruction, and not
combining the read/modify/write into a single instruction, it is still
slower (much slower for 32-bit code) than the LLVM produced code. As its
problems are known, I have not documented the NCG code.
Conclusion: This may seem like a nit picky type of bug as in some use
cases the execution time cost is very small, but it may be an indication
of problems in other use cases that cause more serious effects on
execution speed. It is my feeling that for such low level somewhat
imperative types of code, GHC should really produce code that is as fast
as C/C++/Rust.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12798>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list