[GHC] #13763: Performance regression (~34%) in 8.2.1, poor register allocation(?) in an inner loop over an array

Sun May 28 17:59:34 UTC 2017

#13763: Performance regression (~34%) in 8.2.1, poor register allocation(?) in an
inner loop over an array
-------------------------------------+-------------------------------------
        Reporter:  jberryman         |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler (NCG)    |              Version:  8.2.1-rc2
      Resolution:                    |             Keywords:  JoinPoints
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by jberryman):

 Yeah, I quickly gave up trying to core or downstream output between the
 two. And I'm aware that we should expect register allocation to be fiddly:
 the algorithm is simple and suboptimal, so it's likely that from very
 different core we'd get randomly better or worse generated code,
 especially if we're pushing our use of registers (I'm not sure if this is
 really the case here).

 Still I thought I should report because:

 - this represents a real regression in my code, and perhaps something like
 this should be included in ghc benchmarks to motivate improvements to
 register allocation, and

 - it might be that something else is going on and we're getting register
 spilling when we really shouldn't

 In any case I hope reporting this is not unproductive.

 Here's a bit more information:

 I found that the following change (which is more sensible source in any
 case) results in significantly faster code in both cases and starts to
 narrow but does not eliminate the regression (we are still using stack in
 the 8.2 version). A test case with this version is probably better to work
 with:

 {{{#!hs
 clean4xWord16ChunkLE :: Word64 -> Word64
 {-# INLINE clean4xWord16ChunkLE #-}
 clean4xWord16ChunkLE w64Dirty =
     -- This improves things significantly (and is an improvement in
 8.0.1), but
     -- still regresses:
     --   For "ByteArray 1000000":
     --     8.0.1:   2.663 ms - 2.261 ms
     --     8.2 rc:  3.371 ms - 2.728 ms
     let !w64 = byteSwap64 w64Dirty
      in
       ((w64 `unsafeShiftR` 8) .&. 0x00FF00FF00FF00FF)
       .|.
       ((w64 `unsafeShiftL` 8) .&. 0xFF00FF00FF00FF00)
 }}}

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/13763#comment:4>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler