[GHC] #12798: LLVM seeming to over optimize, producing inefficient assembly code...

Sun Nov 6 16:53:28 UTC 2016

#12798: LLVM seeming to over optimize, producing inefficient assembly code...
-------------------------------------+-------------------------------------
        Reporter:  GordonBGood       |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler (LLVM)   |              Version:  8.0.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by AlexET):

 On current HEAD with llvm 3.9.0, the follwing code

 {{{
 import Data.Word
 import Data.Bits
 import Data.Array.ST (runSTUArray)
 import Data.Array.Base

 import Control.Monad.ST

 numLOOPS = 10000 :: Int

 twos :: UArray Int Word32
 twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

 soe :: () -> [Word32]
 soe() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
  bufb = runSTUArray $ do
   let bfLmt = (256 * 1024) `div` 2 - 1 -- to 2^18 + 2 is 128 KBits - 1 =
 16 KBytes
   cmpstsb <- newArray (0, bfLmt) False :: ST s (STUArray s Int Bool)
   cmpstsw <- (castSTUArray :: STUArray s Int Bool -> ST s (STUArray s Int
 Word32)) cmpstsb
   return $! twos -- force evaluation of twos outside the loop.
   let loop n = -- cull a number of times to test timing
         if n <= 0 then return cmpstsb else
         let cullp i =
               let p = i + i + 3 in
               let s = (p * p - 3) `div` 2 in
               if s > bfLmt then loop (n - 1) else do
                 isCmpst <- unsafeRead cmpstsb i
                 if isCmpst then cullp (i + 1) else -- is Prime
                   let cull j = -- very tight inner loop where all the time
 is spent
                         if j > bfLmt then cullp (i + 1) else do
                           let sh = unsafeAt twos (j .&. 31) -- (1 `shiftL`
 (j .&. 31)))
                           let w = j `shiftR` 5
                           ov <- unsafeRead cmpstsw w
                           unsafeWrite cmpstsw w (ov .|. sh)
                           cull (j + p) in cull s in cullp 0
   loop numLOOPS

 main = print $ length $ soe()

 }}}

 Gives the inner loop, which is almost the same as rust.
 {{{
 .LBB28_7:
         movq    %rcx, %rdx
         sarq    $5, %rdx
         movl    %ecx, %edi
         andl    $31, %edi
         movl    16(%r9,%rdi,4), %edi
         orl     %edi, 16(%rax,%rdx,4)
         addq    %rsi, %rcx
 .LBB28_5:
         cmpq    $131071, %rcx
         jle     .LBB28_7
 }}}
 The CMM and initial llvm code is the same as your code for 8.0.1, so it
 seems the difference is due to the fact that rust ships with its own more
 recent llvm than ghc 8.0.1 supports.

 The difference in the code between my version and your original version is
 that we force `twos` early which is needed to prevent the evaluation of
 that within the loop, an optimisation which seems to have been missed by
 HEAD.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12798#comment:1>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler