[GHC] #8279: bad alignment in code gen yields substantial perf issue

Thu Sep 12 22:20:21 CEST 2013

#8279: bad alignment in code gen yields substantial perf issue
------------------------------+--------------------------------------------
       Reporter:  carter      |             Owner:
           Type:  bug         |            Status:  new
       Priority:  highest     |         Milestone:
      Component:  Compiler    |           Version:  7.7
       Keywords:              |  Operating System:  Unknown/Multiple
   Architecture:              |   Type of failure:  Runtime performance bug
  Unknown/Multiple            |         Test Case:
     Difficulty:  Unknown     |          Blocking:
     Blocked By:              |
Related Tickets:              |
------------------------------+--------------------------------------------
 Several people have independently noticed that GHC currently has a number
 of memory-alignment-related performance problems, each of which can have a
 >= 10% performance impact!

 Nicolas Frisby notes

 {{{
 On my laptop, a program showed a consistent slowdown with -fdicts-strict

 I didn't find any obvious causes in the Core differences, so I turned to
 Intel's Performance Counter Monitor for measurements. After trying a few
 counters, I eventually saw that there are about an order of magnitude more
 misaligned memory loads with -fdicts-strict than without, so I think that
 may be a significant part of the slowdown. I'm not sure if these are code
 or data reads.

 Can anyone suggest how to validate this hypothesis about misaligned reads?

 A subsequent commit has changed the behavior I was seeing, so I'm not
 interested in alternative means to determine if -fdicts-strict is somehow
 at fault; I'm just asking specifically about data/code memory alignment
 in GHC and how to diagnose/experiment with it.

 }}}

 Reid Barton has independently noted
 {{{

 so I did a nofib run with llvm libraries, ghc quickbuild

 so there's this really simple benchmark tak,
 https://github.com/ghc/nofib/blob/master/imaginary/tak/Main.hs
 it doesn't use any libraries at all in the main loop because the Ints all
 get unboxed
 but it's still 8% slower with quick-llvm (vs -fasm)
 weird right?

 [14:36:30] <carter>      could you post the asm it generates for that
 function?
 [14:36:49] <rwbarton>    well it's identical between the two versions
 <rwbarton>       but they get linked at different offsets because some
 llvm sections are different sizes
 <rwbarton>       if I add a 128-byte symbol to the .text section to move
 it to the same address... then the llvm libs version is just as fast
 <rwbarton>       well, apparently 404000 is good and 403f70 is bad
  <rwbarton>      I guess I can test other alignments easily enough
 <rwbarton>       I imagine it wants to start on a cache line
  <rwbarton>      but I don't know if it's just a coincidence that it
 worked with the ncg libraries
  <rwbarton>      that it got a good location

 <rwbarton>       for this program every 32-byte aligned address is 10+%
 faster than any merely 16-byte aligned address

  <rwbarton>      and by alignment I mean alignment of the start of the
 Haskell code section
  <carter>        haswell, sandybridge, ivy bridge, other?
  <rwbarton>      dunno
  <rwbarton>      I have similar results on Intel(R) Core(TM)2 Duo CPU
 T7300  @ 2.00GHz
  <rwbarton>      and on Quad-Core AMD Opteron(tm) Processor 2374 HE
  <carter>        ok
  <rwbarton>      trying a patch now that aligns all *_entry symbols to 32
 bytes

 }}}
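
 The two addresses quoted in the log can be checked with a few lines of
 Python. This is only a sketch: it assumes 404000 and 403f70 are hexadecimal
 addresses of the start of the Haskell code section, and the helper name
 max_pow2_alignment is made up for illustration.

```python
def max_pow2_alignment(addr: int) -> int:
    """Return the largest power of two that divides addr (addr must be nonzero)."""
    # addr & -addr isolates the lowest set bit of addr.
    return addr & -addr


# Addresses taken from the IRC log, assumed to be hexadecimal.
for addr in (0x404000, 0x403F70):
    print(
        f"{addr:#x}: aligned to {max_pow2_alignment(addr)} bytes, "
        f"offset within its 32-byte block = {addr % 32}"
    )
```

 Under that assumption, 403f70 is 16-byte aligned but sits 16 bytes past a
 32-byte boundary, while 404000 is aligned to 32 bytes (and well beyond),
 which is consistent with the 32-byte versus 16-byte observation above.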

 The key point in there is that on the tak benchmark, better alignment of
 the code made a 10% performance difference on Core 2 and Opteron CPUs!

 Benjamin Scarlet and Luite are speculating that this may additionally be
 caused by Tables Next to Code (TNC) accidentally creating bad alignment,
 so that there is cache line pollution / conflicts between the L1
 instruction cache and the data caches.
 So one experiment would be to have the TNC transform pad after the table
 so that the function entry point starts on the next cache line.
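
 The arithmetic behind that padding experiment can be sketched as follows.
 Everything here is an assumption for illustration: the 64-byte cache line,
 the 24-byte table size in the usage note, and the helper names align_up and
 padding_after_table are not taken from GHC. The sketch only computes a
 padding size; it says nothing about how the code generator would keep the
 table reachable from the entry point.

```python
CACHE_LINE = 64  # assumed cache line size in bytes; not stated in the ticket


def align_up(addr: int, alignment: int) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def padding_after_table(table_start: int, table_size: int,
                        alignment: int = CACHE_LINE) -> int:
    """Bytes of padding needed after a table of table_size bytes starting at
    table_start so that the code following it begins on an alignment boundary."""
    table_end = table_start + table_size
    return align_up(table_end, alignment) - table_end
```

 For example, a hypothetical 24-byte table starting at 0x403f70 ends at
 0x403f88, so padding_after_table(0x403F70, 24) yields 56 bytes and the
 entry point would land on 0x403fc0.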

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8279>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler