[GHC] #8279: bad alignment in code gen yields substantial perf issue

Mon Jul 14 06:47:01 UTC 2014

#8279: bad alignment in code gen  yields substantial perf issue
--------------------------------------------+------------------------------
        Reporter:  carter                   |            Owner:
            Type:  bug                      |           Status:  new
        Priority:  high                     |        Milestone:  7.10.1
       Component:  Compiler                 |          Version:  7.7
      Resolution:                           |         Keywords:
Operating System:  Unknown/Multiple         |     Architecture:
 Type of failure:  Runtime performance bug  |  Unknown/Multiple
       Test Case:                           |       Difficulty:  Unknown
        Blocking:                           |       Blocked By:
                                            |  Related Tickets:  #8082
--------------------------------------------+------------------------------
Description changed by jstolarek:

Old description:

> independently, a number of folks have noticed that in various ways, GHC
> currently has quite a few different memory alignment related performance
> problems that can have >= 10% perf impact!
>
> Nicolas Frisby notes
>
> {{{
> On my laptop, a program showed a consistent slowdown with -fdicts-strict
>
> I didn't find any obvious causes in the Core differences, so I turned to
> Intel's Performance Counter Monitor for measurements. After trying a few
> counters, I eventuall saw that there are about an order of magnitude more
> misaligned memory loads with -fdicts-strict than without, so I think that
> may be a significant part of the slowdown. I'm not sure if these are code
> or data reads.
>
> Can anyone suggest how to validate this hypothesis about misaligned
> reads?
>
> A subsequent commit has changed the behavior I was seeing, so I'm not
> interested in alternatives means to determine if -fdicts-strict is
> somehow at fault — I'm just asking specifically about data/code memory
> alignment in GHC and how to diagnose/experiment with it.
>
> }}}
>

> Reid Barton has independently noted
> {{{
>
> so I did a nofib run with llvm libraries, ghc quickbuild
>
> so there's this really simple benchmark tak,
> https://github.com/ghc/nofib/blob/master/imaginary/tak/Main.hs
> it doesn't use any libraries at all in the main loop because the Ints all
> get unboxed
> but it's still 8% slower with quick-llvm (vs -fasm)
> weird right?
>
> [14:36:30] <carter>      could you post the asm it generates for that
> function?
> [14:36:49] <rwbarton>    well it's identical between the two versions
> <rwbarton>       but they get linked at different offsets because some
> llvm sections are different sizes
> <rwbarton>       if I add a 128-byte symbol to the .text section to move
> it to the same address... then the llvm libs version is just as fast
> <rwbarton>       well, apparently 404000 is good and 403f70 is bad
>  <rwbarton>      I guess I can test other alignments easily enough
> <rwbarton>       I imagine it wants to start on a cache line
>  <rwbarton>      but I don't know if it's just a coincidence that it
> worked with the ncg libraries
>  <rwbarton>      that it got a good location
>
> <rwbarton>       for this program every 32-byte aligned address is 10+%
> faster than any merely 16-byte aligned address
>
>  <rwbarton>      and by alignment I mean alignment of the start of the
> Haskell code section
>  <carter>        haswell, sandybridge, ivy bridge, other?
>  <rwbarton>      dunno
>  <rwbarton>      I have similar results on Intel(R) Core(TM)2 Duo CPU
> T7300  @ 2.00GHz
>  <rwbarton>      and on Quad-Core AMD Opteron(tm) Processor 2374 HE
>  <carter>        ok
>  <rwbarton>      trying a patch now that aligns all *_entry symbols to 32
> bytes
>
> }}}
>
> the key point in there is that on the tak benchmark, better alignment for
> the code made a 10% perf differnce on TAk on Core2 and opteron cpus!
>

> benjamin scarlet and Luite are speculating that this may be further
> induced by Tables next to code (TNC) accidentally creating bad alignment
> so theres cache line pollution / conflicts between the L1 Instruction-
> cache and data-caches.
> So one experiment would be to have the TNC transform pad after the table
> so the function entry point starts on the next cacheline?

New description:

 Several people have independently noticed that GHC currently has a number
 of memory-alignment-related performance problems, which can have a >= 10%
 performance impact.

 Nicolas Frisby notes

 {{{
 On my laptop, a program showed a consistent slowdown with -fdicts-strict

 I didn't find any obvious causes in the Core differences, so I turned to
 Intel's Performance Counter Monitor for measurements. After trying a few
 counters, I eventually saw that there are about an order of magnitude more
 misaligned memory loads with -fdicts-strict than without, so I think that
 may be a significant part of the slowdown. I'm not sure if these are code
 or data reads.

 Can anyone suggest how to validate this hypothesis about misaligned reads?

 A subsequent commit has changed the behavior I was seeing, so I'm not
 interested in alternative means to determine if -fdicts-strict is somehow
 at fault; I'm just asking specifically about data/code memory alignment in
 GHC and how to diagnose/experiment with it.

 }}}

 Reid Barton has independently noted
 {{{

 so I did a nofib run with llvm libraries, ghc quickbuild

 so there's this really simple benchmark tak,
 https://github.com/ghc/nofib/blob/master/imaginary/tak/Main.hs
 it doesn't use any libraries at all in the main loop because the Ints all
 get unboxed
 but it's still 8% slower with quick-llvm (vs -fasm)
 weird right?

 [14:36:30] <carter>      could you post the asm it generates for that
 function?
 [14:36:49] <rwbarton>    well it's identical between the two versions
 <rwbarton>       but they get linked at different offsets because some
 llvm sections are different sizes
 <rwbarton>       if I add a 128-byte symbol to the .text section to move
 it to the same address... then the llvm libs version is just as fast
 <rwbarton>       well, apparently 404000 is good and 403f70 is bad
  <rwbarton>      I guess I can test other alignments easily enough
 <rwbarton>       I imagine it wants to start on a cache line
  <rwbarton>      but I don't know if it's just a coincidence that it
 worked with the ncg libraries
  <rwbarton>      that it got a good location

 <rwbarton>       for this program every 32-byte aligned address is 10+%
 faster than any merely 16-byte aligned address

  <rwbarton>      and by alignment I mean alignment of the start of the
 Haskell code section
  <carter>        haswell, sandybridge, ivy bridge, other?
  <rwbarton>      dunno
  <rwbarton>      I have similar results on Intel(R) Core(TM)2 Duo CPU
 T7300  @ 2.00GHz
  <rwbarton>      and on Quad-Core AMD Opteron(tm) Processor 2374 HE
  <carter>        ok
  <rwbarton>      trying a patch now that aligns all *_entry symbols to 32
 bytes

 }}}

 The key point is that on the tak benchmark, better code alignment made a
 10% performance difference on Core 2 and Opteron CPUs.
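
 To make the two addresses from the log concrete, here is a small sketch of
 the alignment arithmetic. It assumes that "404000" and "403f70" are
 hexadecimal addresses; the helper name `alignment` is made up for this
 illustration and is not part of GHC or any tool mentioned above.

```python
def alignment(addr: int) -> int:
    """Return the largest power of two that divides addr (addr must be > 0)."""
    if addr <= 0:
        raise ValueError("addr must be positive")
    # In two's complement, addr & -addr isolates the lowest set bit.
    return addr & -addr


# Assumption: the values quoted in the IRC log are hexadecimal addresses.
GOOD = 0x404000
BAD = 0x403F70

for addr in (GOOD, BAD):
    print(
        f"{addr:#x}: aligned to {alignment(addr)} bytes, "
        f"32-byte aligned: {addr % 32 == 0}, "
        f"64-byte aligned: {addr % 64 == 0}"
    )
```

 Under that assumption, 0x403f70 is 16-byte aligned but not 32-byte aligned,
 while 0x404000 is aligned to 32 bytes and beyond, which is consistent with
 the 32-byte versus 16-byte observation in the log.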

 Benjamin Scarlet and Luite speculate that this may be made worse by
 Tables Next to Code (TNC) accidentally creating bad alignment, so that
 there is cache line pollution / conflicts between the L1 instruction
 cache and data cache.
 One experiment would be to have the TNC transform pad after the table so
 that the function entry point starts on the next cache line.
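
 The padding such an experiment would need is simple arithmetic, sketched
 below. This is only an illustration, not GHC's code generator: the function
 name `padding_to_boundary`, the 64-byte cache line size, and the example
 offset are all assumptions made here. Whether padding can be placed between
 the table and the entry code without disturbing how the table is addressed
 is a separate question that this sketch does not cover.

```python
def padding_to_boundary(offset: int, boundary: int = 64) -> int:
    """Bytes to add so that offset becomes a multiple of boundary.

    boundary must be a power of two (64 is an assumed cache line size).
    """
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    # -offset & (boundary - 1) is the distance to the next multiple,
    # or 0 if offset is already on a boundary.
    return -offset & (boundary - 1)


# Hypothetical example: suppose a table ends at this offset.
table_end = 0x403F70
pad = padding_to_boundary(table_end, 64)
entry = table_end + pad
print(f"pad {pad} bytes, entry at {entry:#x}")
```

 The same helper with a boundary of 32 gives the padding for the 32-byte
 alignment that the log reports as sufficient for tak.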

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8279#comment:16>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler