[GHC] #8279: bad alignment in code gen yields substantial perf issue
GHC
ghc-devs at haskell.org
Mon Jul 14 06:47:01 UTC 2014
--------------------------------------------+------------------------------
Reporter: carter                            | Owner:
Type: bug                                   | Status: new
Priority: high                              | Milestone: 7.10.1
Component: Compiler                         | Version: 7.7
Resolution:                                 | Keywords:
Operating System: Unknown/Multiple          | Architecture: Unknown/Multiple
Type of failure: Runtime performance bug    |
Test Case:                                  | Difficulty: Unknown
Blocking:                                   | Blocked By:
                                            | Related Tickets: #8082
--------------------------------------------+------------------------------
Description changed by jstolarek:
Old description:
> independently, a number of folks have noticed that in various ways, GHC
> currently has quite a few different memory alignment related performance
> problems that can have >= 10% perf impact!
>
> Nicolas Frisby notes
>
> {{{
> On my laptop, a program showed a consistent slowdown with -fdicts-strict
>
> I didn't find any obvious causes in the Core differences, so I turned to
> Intel's Performance Counter Monitor for measurements. After trying a few
> counters, I eventuall saw that there are about an order of magnitude more
> misaligned memory loads with -fdicts-strict than without, so I think that
> may be a significant part of the slowdown. I'm not sure if these are code
> or data reads.
>
> Can anyone suggest how to validate this hypothesis about misaligned
> reads?
>
> A subsequent commit has changed the behavior I was seeing, so I'm not
> interested in alternatives means to determine if -fdicts-strict is
> somehow at fault — I'm just asking specifically about data/code memory
> alignment in GHC and how to diagnose/experiment with it.
>
> }}}
>
> Reid Barton has independently noted
> {{{
>
> so I did a nofib run with llvm libraries, ghc quickbuild
>
> so there's this really simple benchmark tak,
> https://github.com/ghc/nofib/blob/master/imaginary/tak/Main.hs
> it doesn't use any libraries at all in the main loop because the Ints all
> get unboxed
> but it's still 8% slower with quick-llvm (vs -fasm)
> weird right?
>
> [14:36:30] <carter> could you post the asm it generates for that
> function?
> [14:36:49] <rwbarton> well it's identical between the two versions
> <rwbarton> but they get linked at different offsets because some
> llvm sections are different sizes
> <rwbarton> if I add a 128-byte symbol to the .text section to move
> it to the same address... then the llvm libs version is just as fast
> <rwbarton> well, apparently 404000 is good and 403f70 is bad
> <rwbarton> I guess I can test other alignments easily enough
> <rwbarton> I imagine it wants to start on a cache line
> <rwbarton> but I don't know if it's just a coincidence that it
> worked with the ncg libraries
> <rwbarton> that it got a good location
>
> <rwbarton> for this program every 32-byte aligned address is 10+%
> faster than any merely 16-byte aligned address
>
> <rwbarton> and by alignment I mean alignment of the start of the
> Haskell code section
> <carter> haswell, sandybridge, ivy bridge, other?
> <rwbarton> dunno
> <rwbarton> I have similar results on Intel(R) Core(TM)2 Duo CPU
> T7300 @ 2.00GHz
> <rwbarton> and on Quad-Core AMD Opteron(tm) Processor 2374 HE
> <carter> ok
> <rwbarton> trying a patch now that aligns all *_entry symbols to 32
> bytes
>
> }}}
>
> the key point in there is that on the tak benchmark, better alignment for
> the code made a 10% perf differnce on TAk on Core2 and opteron cpus!
>
> benjamin scarlet and Luite are speculating that this may be further
> induced by Tables next to code (TNC) accidentally creating bad alignment
> so theres cache line pollution / conflicts between the L1 Instruction-
> cache and data-caches.
> So one experiment would be to have the TNC transform pad after the table
> so the function entry point starts on the next cacheline?
New description:
Independently, several people have noticed that GHC currently has a number
of memory-alignment-related performance problems, which can have a >= 10%
performance impact.
Nicolas Frisby notes
{{{
On my laptop, a program showed a consistent slowdown with -fdicts-strict

I didn't find any obvious causes in the Core differences, so I turned to
Intel's Performance Counter Monitor for measurements. After trying a few
counters, I eventually saw that there are about an order of magnitude more
misaligned memory loads with -fdicts-strict than without, so I think that
may be a significant part of the slowdown. I'm not sure if these are code
or data reads.

Can anyone suggest how to validate this hypothesis about misaligned
reads?

A subsequent commit has changed the behavior I was seeing, so I'm not
interested in alternative means to determine if -fdicts-strict is
somehow at fault; I'm just asking specifically about data/code memory
alignment in GHC and how to diagnose/experiment with it.
}}}
Reid Barton has independently noted
{{{
so I did a nofib run with llvm libraries, ghc quickbuild
so there's this really simple benchmark tak,
https://github.com/ghc/nofib/blob/master/imaginary/tak/Main.hs
it doesn't use any libraries at all in the main loop because the Ints all
get unboxed
but it's still 8% slower with quick-llvm (vs -fasm)
weird right?
[14:36:30] <carter> could you post the asm it generates for that
function?
[14:36:49] <rwbarton> well it's identical between the two versions
<rwbarton> but they get linked at different offsets because some
llvm sections are different sizes
<rwbarton> if I add a 128-byte symbol to the .text section to move
it to the same address... then the llvm libs version is just as fast
<rwbarton> well, apparently 404000 is good and 403f70 is bad
<rwbarton> I guess I can test other alignments easily enough
<rwbarton> I imagine it wants to start on a cache line
<rwbarton> but I don't know if it's just a coincidence that it
worked with the ncg libraries
<rwbarton> that it got a good location
<rwbarton> for this program every 32-byte aligned address is 10+%
faster than any merely 16-byte aligned address
<rwbarton> and by alignment I mean alignment of the start of the
Haskell code section
<carter> haswell, sandybridge, ivy bridge, other?
<rwbarton> dunno
<rwbarton> I have similar results on Intel(R) Core(TM)2 Duo CPU
T7300 @ 2.00GHz
<rwbarton> and on Quad-Core AMD Opteron(tm) Processor 2374 HE
<carter> ok
<rwbarton> trying a patch now that aligns all *_entry symbols to 32
bytes
}}}
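The two link addresses quoted in the log can be checked with a little arithmetic. The following is only a sketch: the helper name `alignment_of` is made up for illustration, and it assumes the addresses in the log are hexadecimal, as linker output usually is.

```python
def alignment_of(addr: int) -> int:
    """Return the largest power of two that divides addr.

    addr & -addr isolates the lowest set bit, which is exactly the
    natural alignment of the address.
    """
    if addr <= 0:
        raise ValueError("address must be positive")
    return addr & -addr


# Addresses taken from the IRC log above (assumed to be hexadecimal).
for addr in (0x404000, 0x403F70):
    print(hex(addr), "is aligned to", alignment_of(addr), "bytes")
# 0x404000 is aligned to 16384 bytes
# 0x403f70 is aligned to 16 bytes
```

So the "good" address sits on a 32-byte (and 64-byte) boundary, while the "bad" one is only 16-byte aligned, which matches the later remark that 32-byte-aligned starts were faster than merely 16-byte-aligned ones.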
The key point is that on the tak benchmark, better alignment of the code
made a 10% performance difference on Core 2 and Opteron CPUs.

Benjamin Scarlet and Luite speculate that this may be made worse by
tables-next-to-code (TNC) accidentally creating bad alignment, so that
there is cache-line pollution or conflict between the L1 instruction
cache and the data cache.
One experiment would be to have the TNC transform add padding after the
table so that the function entry point starts on the next cache line.
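The amount of padding such an experiment would need is easy to compute. The sketch below is hypothetical throughout: the function name, the 64-byte cache-line size, and the example addresses and table size are all assumptions chosen for illustration, and it does not describe how GHC's code generator actually lays out info tables.

```python
CACHE_LINE = 64  # assumed line size; not stated in the ticket


def pad_for_aligned_entry(table_start: int, table_size: int,
                          line: int = CACHE_LINE) -> int:
    """Bytes of padding needed so the entry code that directly follows
    the info table starts on a cache-line boundary."""
    if line <= 0 or line & (line - 1):
        raise ValueError("line size must be a power of two")
    # The entry point currently sits at table_start + table_size;
    # round that address up to the next multiple of `line`.
    return -(table_start + table_size) % line


# Hypothetical layout: a 0x20-byte table at 0x403F50 puts the entry
# code at 0x403F70, the "bad" address from the log.
pad = pad_for_aligned_entry(0x403F50, 0x20)
print(pad, hex(0x403F50 + pad + 0x20))
# 16 0x403f80
```

With tables-next-to-code the info table is addressed at negative offsets from the entry point, so in practice the padding would probably have to go in front of the table rather than between the table and the code; the arithmetic is the same either way.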
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8279#comment:16>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler