[GHC] #12798: LLVM seeming to over optimize, producing inefficient assembly code...
GHC
ghc-devs at haskell.org
Wed Feb 22 01:03:52 UTC 2017
#12798: LLVM seeming to over optimize, producing inefficient assembly code...
-------------------------------------+-------------------------------------
Reporter: GordonBGood | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler (LLVM) | Version: 8.0.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Runtime performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #12808 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Description changed by bgamari:
New description:

Since in many cases the use of the LLVM backend is the only way to avoid
the NCG's poor register allocation (ticket #8971), it is a concern that
"-fllvm" produces overly complex code through a (seemingly) failed attempt
to optimize.

The following code reproduces the problem with a very simple "odds-only"
implementation of the Sieve of Eratosthenes whose very tight inner culling
loop is limited to a 16 KByte buffer (no larger than most modern CPUs' L1
data cache). It uses a "twos" Look Up Table (LUT) for better speed than a
variable shift-left operation when setting the composite bits in the
buffer array, as the LUT (should) take the same number of registers and
the array look-up instruction is easier for the CPU to fuse than a
variable shift left:
{{{#!hs
-- GHC_EfficiencyBug.hs
{-# LANGUAGE FlexibleContexts #-}
{-# OPTIONS_GHC -O3 -fllvm -rtsopts -keep-s-files #-} -- or -O2

import Control.Monad.ST
import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base

numLOOPS = 10000 :: Int

twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

soe :: () -> [Word32]
soe() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
  bufb = runSTUArray $ do
    -- to 2^18 + 2: 128 Kbits - 1 = 16 KBytes
    let bfLmt = (256 * 1024) `div` 2 - 1
    cmpstsb <- newArray (0, bfLmt) False :: ST s (STUArray s Int Bool)
    cmpstsw <- (castSTUArray :: STUArray s Int Bool
                             -> ST s (STUArray s Int Word32)) cmpstsb
    let loop n = -- cull a number of times to test timing
          if n <= 0 then return cmpstsb else
          let cullp i =
                let p = i + i + 3 in
                let s = (p * p - 3) `div` 2 in
                if s > bfLmt then loop (n - 1) else do
                  isCmpst <- unsafeRead cmpstsb i
                  if isCmpst then cullp (i + 1) else -- is Prime
                    -- very tight inner loop where all the time is spent
                    let cull j =
                          if j > bfLmt then cullp (i + 1) else do
                            let sh = unsafeAt twos (j .&. 31) -- 1 `shiftL` (j .&. 31)
                            let w = j `shiftR` 5
                            ov <- unsafeRead cmpstsw w
                            unsafeWrite cmpstsw w (ov .|. sh)
                            cull (j + p)
                    in cull s
          in cullp 0
    loop numLOOPS

main = print $ length $ soe()
}}}
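
As an aside on the LUT design choice above: each table entry is
semantically just the variable shift it replaces; only the machine-level
cost differs. A minimal standalone sketch (my illustration, not part of
the benchmark) checking that equivalence:
{{{#!hs
-- LutShiftCheck.hs -- illustrative only
import Data.Bits (shiftL, (.&.))
import Data.Word (Word32)
import Data.Array.Base (unsafeAt)
import Data.Array.Unboxed (UArray, listArray)

twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

main :: IO ()
main = print $ all (\j -> unsafeAt twos (j .&. 31) == 1 `shiftL` (j .&. 31))
                   [0 .. 100000 :: Int]
}}}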

The main culling is repeated "numLOOPS" times so that the execution time
is long enough for accurate timing and so that the time spent in the list
comprehension that counts the found primes (the answer) is a negligible
part of the total. Timing results can be produced by running
"./GHC_EfficiencyBug +RTS -s".

The desired assembly for the tight inner loop is what the Rust/LLVM
compiler produces, in this case for x86_64 64-bit code:
{{{
        .p2align 4, 0x90
.LBB10_27:                          # inferred roles: %rcx=j, %rax=p,
                                    #   %rbp=twos, %r14=buffer, %r13=limit
        movq    %rcx, %rdx
        shrq    $5, %rdx            # word index = j >> 5
        movl    %ecx, %esi
        andl    $31, %esi           # bit index = j .&. 31
        movl    (%rbp,%rsi,4), %esi # load LUT entry twos[j .&. 31]
        orl     %esi, (%r14,%rdx,4) # buffer[word] |= mask (read/modify/write)
        addq    %rax, %rcx          # j += p
.LBB10_26:
        cmpq    %r13, %rcx
        jb      .LBB10_27           # loop while j < limit
}}}

The above code is extremely efficient on a CPU that is not
cache-bottlenecked (as the AMD Bulldozer series is) and takes just about
three clock cycles per inner composite-culling loop on Intel Sky Lake; it
is just as efficient as 32-bit x86 code, since only seven registers are
used in this inner loop.

Due to its attempt at "over-optimization", the GHC/LLVM backend instead
produces the following x86_64 64-bit code:
{{{
        .align  16, 0x90
.LBB34_2:                             # %c8T2
                                      # =>This Inner Loop Header: Depth=1
                                      # inferred roles: %rcx=j, %r8=4*j,
                                      #   %r14=p, %rdx=4*p, %r10=limit
        movq    %rcx, %rsi
        sarq    $5, %rsi              # word index = j >> 5
        movl    %r8d, %edi
        andl    $124, %edi            # byte offset = (4*j) .&. 124
        movl    16(%rax,%rdi), %edi   # load LUT entry
        orl     %edi, 16(%r11,%rsi,4) # buffer[word] |= mask
        addq    %r14, %rcx            # j += p
        addq    %rdx, %r8             # 4*j += 4*p (the extra addition)
        cmpq    %r10, %rcx
        jle     .LBB34_2              # loop while j <= limit
}}}

As can be seen, instead of simply masking the "twos" index register with
31 (0x1F), the code keeps two extra registers: one holding the "(j * 4)"
increment and one holding the accumulated pre-scaled index, which is
advanced by that increment on every loop and masked with 124 (0x7C, or
0x1F times 4). That costs two extra registers and one extra addition per
iteration. This is not a problem as to register count for x86_64 code,
which has more than enough, but the extra instruction adds about one third
of a CPU clock cycle per iteration (I know, only one ninth extra time on a
three-cycle loop).
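
To make the index arithmetic concrete, here is a minimal sketch (my
illustration with made-up names, not compiler output) contrasting the two
ways of computing the LUT byte offset; they agree for every non-negative
j, but the second form needs an extra register carrying the pre-scaled
counter and an extra addition to keep it updated:
{{{#!hs
-- OffsetSketch.hs -- illustrative only
import Data.Bits ((.&.))

-- mask then scale: the times-4 scaling folds into the x86 addressing
-- mode for free, so only one AND is executed per iteration
directOffset :: Int -> Int
directOffset j = 4 * (j .&. 31)

-- what the strength-reduced code computes: a pre-scaled counter 4*j
-- (kept in its own register and bumped by 4*p each iteration) masked
-- with 124 = 4 * 31
reducedOffset :: Int -> Int
reducedOffset j = (4 * j) .&. 124

main :: IO ()
main = print $ all (\j -> directOffset j == reducedOffset j) [0 .. 1000000]
}}}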

However, for 32-bit x86 code, which previously had barely enough
registers, the use of the extra registers triggers a chain of three
register reloads, as can be seen in the following assembly code:
{{{
        .align  16, 0x90
LBB33_2:                             # %c8Wb
                                     # =>This Inner Loop Header: Depth=1
                                     # inferred roles: %ebx=j, %edi=4*j
        movl    %ebx, %ebp
        sarl    $5, %ebp             # word index = j >> 5
        movl    %edi, %ecx
        andl    $124, %ecx           # byte offset = (4*j) .&. 124
        movl    %esi, %edx           # shuffle registers to free one for
        movl    %eax, %esi           #   the reloaded LUT base pointer...
        movl    36(%esp), %eax       # 4-byte Reload
        movl    8(%eax,%ecx), %ecx   # load LUT entry
        movl    %esi, %eax           # ...and shuffle them back
        movl    %edx, %esi
        orl     %ecx, 8(%esi,%ebp,4) # buffer[word] |= mask
        addl    32(%esp), %ebx       # 4-byte Folded Reload: j += p
        addl    28(%esp), %edi       # 4-byte Folded Reload: 4*j += 4*p
        cmpl    %eax, %ebx
        jle     LBB33_2              # loop while j <= limit
}}}
'''The above code runs about 25% slower than it should on Intel Sky Lake
for this 32-bit code.'''

This was tested with GHC version 8.0.1 under both Windows and Linux, for
both 32-bit and 64-bit code, with identical results for each native code
width.

The code was also tested for 32- and 64-bit code produced by the NCG; for
this specific problem the NCG takes the simple approach and does not waste
the extra registers. However, due to its inefficient allocation of
registers as per ticket #8971, its failure to move the loop-completion
check to the end of the loop (which costs an extra jump instruction), and
its failure to combine the read/modify/write into a single instruction, it
is still slower (much slower for 32-bit code) than the LLVM-produced code.
As its problems are already known, I have not documented the NCG output.

Conclusion: this may seem like a nit-picky kind of bug, as in some use
cases the execution-time cost is very small, but it may be an indication
of problems that cause more serious effects on execution speed in other
use cases. It is my feeling that for such low-level, somewhat imperative
code, GHC should really produce code that is as fast as C/C++/Rust.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12798#comment:6>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler