[GHC] #8279: bad alignment in code gen yields substantial perf issue
GHC
ghc-devs at haskell.org
Thu Sep 12 22:20:21 CEST 2013
#8279: bad alignment in code gen yields substantial perf issue
------------------------------+--------------------------------------------
Reporter: carter | Owner:
Type: bug | Status: new
Priority: highest | Milestone:
Component: Compiler | Version: 7.7
Keywords: | Operating System: Unknown/Multiple
Architecture: Unknown/Multiple | Type of failure: Runtime performance bug
 | Test Case:
Difficulty: Unknown | Blocking:
Blocked By: |
Related Tickets: |
------------------------------+--------------------------------------------
Independently, several people have noticed that GHC currently has a number
of memory-alignment-related performance problems, which can have a
performance impact of 10% or more.
Nicolas Frisby notes:
{{{
On my laptop, a program showed a consistent slowdown with -fdicts-strict.
I didn't find any obvious causes in the Core differences, so I turned to
Intel's Performance Counter Monitor for measurements. After trying a few
counters, I eventually saw that there are about an order of magnitude more
misaligned memory loads with -fdicts-strict than without, so I think that
may be a significant part of the slowdown. I'm not sure if these are code
or data reads.
Can anyone suggest how to validate this hypothesis about misaligned reads?
A subsequent commit has changed the behavior I was seeing, so I'm not
interested in alternative means to determine if -fdicts-strict is somehow
at fault — I'm just asking specifically about data/code memory alignment
in GHC and how to diagnose/experiment with it.
}}}
Reid Barton has independently noted:
{{{
so I did a nofib run with llvm libraries, ghc quickbuild
so there's this really simple benchmark tak,
https://github.com/ghc/nofib/blob/master/imaginary/tak/Main.hs
it doesn't use any libraries at all in the main loop because the Ints all
get unboxed
but it's still 8% slower with quick-llvm (vs -fasm)
weird right?
[14:36:30] <carter> could you post the asm it generates for that
function?
[14:36:49] <rwbarton> well it's identical between the two versions
<rwbarton> but they get linked at different offsets because some
llvm sections are different sizes
<rwbarton> if I add a 128-byte symbol to the .text section to move
it to the same address... then the llvm libs version is just as fast
<rwbarton> well, apparently 404000 is good and 403f70 is bad
<rwbarton> I guess I can test other alignments easily enough
<rwbarton> I imagine it wants to start on a cache line
<rwbarton> but I don't know if it's just a coincidence that it
worked with the ncg libraries
<rwbarton> that it got a good location
<rwbarton> for this program every 32-byte aligned address is 10+%
faster than any merely 16-byte aligned address
<rwbarton> and by alignment I mean alignment of the start of the
Haskell code section
<carter> haswell, sandybridge, ivy bridge, other?
<rwbarton> dunno
<rwbarton> I have similar results on Intel(R) Core(TM)2 Duo CPU
T7300 @ 2.00GHz
<rwbarton> and on Quad-Core AMD Opteron(tm) Processor 2374 HE
<carter> ok
<rwbarton> trying a patch now that aligns all *_entry symbols to 32
bytes
}}}
The key point is that on the tak benchmark, better alignment of the code
made a 10% performance difference on both Core 2 and Opteron CPUs.
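
To make the "good" and "bad" addresses from the log concrete, here is a
minimal sketch that reports the alignment of each one. It assumes the two
addresses quoted above (404000 and 403f70) are hexadecimal; the helper names
`alignment` and `is_aligned` are made up for illustration and are not part
of GHC or any tool mentioned in this ticket.

```python
def alignment(addr: int) -> int:
    """Largest power of two that divides a nonzero address."""
    # addr & -addr isolates the lowest set bit of addr.
    return addr & -addr


def is_aligned(addr: int, boundary: int) -> bool:
    """True if addr is a multiple of boundary (in bytes)."""
    return addr % boundary == 0


# Addresses taken from the IRC log, assumed to be hexadecimal.
for addr in (0x404000, 0x403F70):
    print(
        hex(addr),
        "alignment:", alignment(addr),
        "16-byte aligned:", is_aligned(addr, 16),
        "32-byte aligned:", is_aligned(addr, 32),
    )
```

Under that assumption, 403f70 is 16-byte aligned but not 32-byte aligned,
while 404000 is aligned to 32 bytes (and to much larger boundaries), which
matches the "32-byte aligned vs. merely 16-byte aligned" observation above.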
Benjamin Scarlet and Luite are speculating that this may be made worse by
Tables Next to Code (TNC) accidentally creating bad alignment, so that
there is cache line pollution and conflict between the L1 instruction
cache and the data cache.
So one experiment would be to have the TNC transform add padding after the
table, so that the function entry point starts on the next cache line.
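
The arithmetic behind that experiment can be sketched as follows. This is
only a model of the padding calculation, not GHC's code generator: the
64-byte cache line size is an assumption (it varies by CPU), and
`padding_to_boundary`, `place_entry`, and the table sizes used in the
example are hypothetical.

```python
CACHE_LINE = 64  # assumed cache line size in bytes; CPU dependent


def padding_to_boundary(addr: int, boundary: int = CACHE_LINE) -> int:
    """Bytes needed to round addr up to the next multiple of boundary."""
    return -addr % boundary


def place_entry(table_start: int, table_size: int,
                boundary: int = CACHE_LINE) -> tuple[int, int]:
    """Return (padding, entry_address) for code placed after a table.

    The table occupies [table_start, table_start + table_size); padding is
    added after it so that the entry point lands on a boundary.
    """
    table_end = table_start + table_size
    pad = padding_to_boundary(table_end, boundary)
    return pad, table_end + pad


# Hypothetical example: a 24-byte table starting at 0x404000.
pad, entry = place_entry(0x404000, 24)
print("padding:", pad, "entry point:", hex(entry))
```

The same helper can be called with `boundary=32` to model the 32-byte
alignment that Reid's patch is trying for the *_entry symbols.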
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8279>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler