[GHC] #8971: Native Code Generator for 8.0.1 is not as optimized as 7.6.3... (was: Native Code Generator for 7.8 is not as optimized as 7.6.3...)
GHC
ghc-devs at haskell.org
Wed Nov 2 04:28:49 UTC 2016
#8971: Native Code Generator for 8.0.1 is not as optimized as 7.6.3...
-------------------------------------+-------------------------------------
Reporter: GordonBGood | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler (NCG) | Version: 8.0.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Changes (by GordonBGood):
* milestone: => 8.2.1
* version: 7.8.1-rc2 => 8.0.1
@@ -2,1 +2,1 @@
- version 7.8.1 RC2 compiler as the Windows 7.6.3 compiler (32-bit) when the
+ version 8.0.1 compiler as the Windows 7.6.3 compiler (32-bit) when the
New description:
The output assembly code is not as optimized for the Windows 32-bit
version 8.0.1 compiler as the Windows 7.6.3 compiler (32-bit) when the
option switches are exactly the same although it may not be limited to
only the Windows platform; this has a negative impact on execution time
for tight loops of about a factor of two times slower.
The following code will reproduce the problem:
{{{#!haskell
-- GHC_NCG_OptimizationBug.hs
-- it seems the Haskell GHC 7.8.1 NCG Native Code Generator (NCG) doesn't
-- optimize as well for (at least) the x86 target as version 7.6.3
{-# OPTIONS_GHC -O3 -rtsopts -v -dcore-lint -ddump-asm -ddump-to-file
-dumpdir . #-} -- or O2
import Data.Bits
import Control.Monad.ST (runST,ST(..))
import Data.Array.Base
-- Uses a very simple Sieve of Eratosthenes to 2 ^ 18 to prove it.
accNumPrimes :: Int -> Int
accNumPrimes acc = acc `seq` runST $ do
let bfSz = (256 * 1024 - 3) `div` 2
bfLmtWrds = (bfSz + 1) `div` 32
bufw <- newArray (0, bfLmtWrds) (-1) :: ST s (STUArray s Int Int)
-- to clear the last "uneven" bit(s)
unsafeWrite bufw bfLmtWrds (complement ((-2) `shiftL` (bfSz .&. 31)))
bufb <- (castSTUArray :: STUArray s Int Int -> ST s (STUArray s Int
Bool)) bufw
let cullp i =
let p = i + i + 3 in
let s = (p * p - 3) `div` 2 in
if s > bfSz then
let count i sm = do
sm `seq` if i > bfLmtWrds then return (acc + sm) else do
wd <- unsafeRead bufw i
count (i + 1) (sm + (popCount wd)) in
count 0 1 -- use '1' for the '2' prime not in the array
else do
v <- unsafeRead bufb i
if v then
let cull j = do -- very tight inner loop
if j > bfSz then cullp (i + 1) else do
unsafeWrite bufb j False
cull (j + p) in
cull s
else cullp (i + 1)
cullp 0
main =
-- run the program a number of times to get a reasonable time...
let numloops = 2000 in
let loop n acc =
acc `seq` if n <= 0 then acc else
loop (n - 1) (accNumPrimes acc) in
print $ loop numloops 0
}}}
The above code takes almost twice as long to run when compiled under 7.8.1
RC2 for Windows (32-bit) as it does for the version 7.6.3 compiler (both
32-bit compilers).
The -ddump-simpl Core dump is almost identical between the two, which is
also evidenced by that using the -fllvm LLVM compiler back end switch for
each results in code that runs at about the same speed for each compiler
run (which would use the same Core output as used for NCG, right?).
Under Windows, the compilation and run for 7.8.1 RC2 goes like this:
{{{
*Main> :! E:\ghc-7.8.0.20140228_32\bin\ghc --make -pgmlo
"E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc"
"GHC_NCG_OptimizationBug.hs"
compile: input file WindowsVsLinuxNCG.hs
Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc15460_0
*** Checking old interface for main:Main:
*** Parser:
*** Renamer/typechecker:
[1 of 1] Compiling Main ( GHC_NCG_OptimizationBug.hs,
GHC_NCG_OptimizationBug.o )
*** Desugar:
Result size of Desugar (after optimization)
= {terms: 260, types: 212, coercions: 0}
*** Core Linted result of Desugar (after optimization):
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 213, types: 136, coercions: 52}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
= {terms: 215, types: 148, coercions: 67}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
= {terms: 209, types: 135, coercions: 51}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 209, types: 135, coercions: 42}
*** Core Linted result of Simplifier:
*** Specialise:
Result size of Specialise = {terms: 209, types: 135, coercions: 42}
*** Core Linted result of Specialise:
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
Result size of Float out(FOS {Lam = Just 0,
Consts = True,
PAPs = False})
= {terms: 286, types: 185, coercions: 42}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs
= False}):
*** Float inwards:
Result size of Float inwards
= {terms: 286, types: 185, coercions: 42}
*** Core Linted result of Float inwards:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 502, types: 393, coercions: 103}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
= {terms: 428, types: 326, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
= {terms: 420, types: 321, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 420, types: 321, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 418, types: 318, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 418, types: 318, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 475, types: 383, coercions: 32}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
= {terms: 444, types: 336, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 444, types: 336, coercions: 9}
*** Core Linted result of Simplifier:
*** Demand analysis:
Result size of Demand analysis
= {terms: 444, types: 336, coercions: 9}
*** Core Linted result of Demand analysis:
*** Worker Wrapper binds:
Result size of Worker Wrapper binds
= {terms: 579, types: 457, coercions: 9}
*** Core Linted result of Worker Wrapper binds:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 510, types: 415, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 420, types: 322, coercions: 9}
*** Core Linted result of Simplifier:
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
Result size of Float out(FOS {Lam = Just 0,
Consts = True,
PAPs = True})
= {terms: 426, types: 326, coercions: 9}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs
= True}):
*** Common sub-expression:
Result size of Common sub-expression
= {terms: 424, types: 326, coercions: 9}
*** Core Linted result of Common sub-expression:
*** Float inwards:
Result size of Float inwards
= {terms: 424, types: 326, coercions: 9}
*** Core Linted result of Float inwards:
*** Liberate case:
Result size of Liberate case
= {terms: 1,824, types: 1,259, coercions: 9}
*** Core Linted result of Liberate case:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 608, types: 422, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
= {terms: 604, types: 413, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
= {terms: 604, types: 413, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 604, types: 413, coercions: 9}
*** Core Linted result of Simplifier:
*** SpecConstr:
Result size of SpecConstr = {terms: 708, types: 505, coercions: 9}
*** Core Linted result of SpecConstr:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 702, types: 499, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 608, types: 405, coercions: 9}
*** Core Linted result of Simplifier:
*** Tidy Core:
Result size of Tidy Core = {terms: 608, types: 405, coercions: 9}
*** Core Linted result of Tidy Core:
*** CorePrep:
Result size of CorePrep = {terms: 825, types: 489, coercions: 9}
*** Core Linted result of CorePrep:
*** Stg2Stg:
*** CodeOutput:
*** New CodeGen:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** CPSZ:
*** Assembler:
"E:\ghc-7.8.0.20140228_32\lib/../mingw/bin/gcc.exe" "-U__i686" "-fno-
stack-protector" "-DTABLES_NEXT_TO_CODE" "-I." "-x" "assembler-with-cpp"
"-c" "C:\Users\Gordon\AppData\Local\Temp\ghc15460_0\ghc15460_2.s" "-o"
"GHC_NCG_OptimizationBug.o"
Linking GHC_NCG_OptimizationBug.exe ...
*Main> :! GHC_NCG_OptimizationBug +RTS -s
46000000
32,965,096 bytes allocated in the heap
7,032 bytes copied during GC
41,756 bytes maximum residency (2 sample(s))
19,684 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 61 colls, 0 par 0.00s 0.00s 0.0000s
0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s
0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.73s ( 1.73s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.73s ( 1.73s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 19,006,902 bytes per MUT second
Productivity 100.0% of total user, 100.2% of total elapsed
}}}
whereas under version 7.6.3 goes like this:
{{{
*Main> :! E:\ghc-7.6.3_32\bin\ghc --make -pgmlo
"E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc"
"GHC_NCG_OptimizationBug.hs"
compile: input file GHC_NCG_OptimizationBug.hs
Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc28200_0
*** Checking old interface for main:Main:
*** Parser:
*** Renamer/typechecker:
[1 of 1] Compiling Main ( GHC_NCG_OptimizationBug.hs,
GHC_NCG_OptimizationBug.o )
*** Desugar:
Result size of Desugar (after optimization)
= {terms: 247, types: 212, coercions: 0}
*** Core Linted result of Desugar (after optimization):
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 198, types: 132, coercions: 35}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
= {terms: 200, types: 144, coercions: 43}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
= {terms: 194, types: 131, coercions: 57}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 194, types: 131, coercions: 39}
*** Core Linted result of Simplifier:
*** Specialise:
Result size of Specialise = {terms: 194, types: 131, coercions: 39}
*** Core Linted result of Specialise:
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
Result size of Float out(FOS {Lam = Just 0,
Consts = True,
PAPs = False})
= {terms: 277, types: 191, coercions: 39}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs
= False}):
*** Float inwards:
Result size of Float inwards
= {terms: 277, types: 191, coercions: 39}
*** Core Linted result of Float inwards:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 514, types: 403, coercions: 103}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
= {terms: 420, types: 317, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
= {terms: 412, types: 312, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 412, types: 312, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 410, types: 309, coercions: 29}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 410, types: 309, coercions: 29}
*** Core Linted result of Simplifier:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 455, types: 364, coercions: 32}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
= {terms: 422, types: 317, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 422, types: 317, coercions: 9}
*** Core Linted result of Simplifier:
*** Demand analysis:
Result size of Demand analysis
= {terms: 422, types: 317, coercions: 9}
*** Core Linted result of Demand analysis:
*** Worker Wrapper binds:
Result size of Worker Wrapper binds
= {terms: 536, types: 427, coercions: 9}
*** Core Linted result of Worker Wrapper binds:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 480, types: 391, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 400, types: 306, coercions: 9}
*** Core Linted result of Simplifier:
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
Result size of Float out(FOS {Lam = Just 0,
Consts = True,
PAPs = True})
= {terms: 408, types: 311, coercions: 9}
*** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs
= True}):
*** Common sub-expression:
Result size of Common sub-expression
= {terms: 406, types: 311, coercions: 9}
*** Core Linted result of Common sub-expression:
*** Float inwards:
Result size of Float inwards
= {terms: 406, types: 311, coercions: 9}
*** Core Linted result of Float inwards:
*** Liberate case:
Result size of Liberate case
= {terms: 1,186, types: 824, coercions: 9}
*** Core Linted result of Liberate case:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 585, types: 411, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
= {terms: 569, types: 392, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=3
= {terms: 569, types: 392, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 569, types: 392, coercions: 9}
*** Core Linted result of Simplifier:
*** SpecConstr:
Result size of SpecConstr = {terms: 746, types: 566, coercions: 9}
*** Core Linted result of SpecConstr:
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 739, types: 560, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier iteration=2
= {terms: 762, types: 546, coercions: 9}
*** Core Linted result of Simplifier:
Result size of Simplifier = {terms: 642, types: 402, coercions: 9}
*** Core Linted result of Simplifier:
*** Tidy Core:
Result size of Tidy Core = {terms: 642, types: 402, coercions: 9}
*** Core Linted result of Tidy Core:
writeBinIface: 10 Names
writeBinIface: 34 dict entries
*** CorePrep:
Result size of CorePrep = {terms: 779, types: 483, coercions: 9}
*** Core Linted result of CorePrep:
*** Stg2Stg:
*** CodeOutput:
*** CodeGen:
*** Assembler:
"E:\ghc-7.6.3_32\lib/../mingw/bin/gcc.exe" "-fno-stack-protector" "-Wl
,--hash-size=31" "-Wl,--reduce-memory-overheads" "-I." "-c"
"C:\Users\Gordon\AppData\Local\Temp\ghc28200_0\ghc28200_0.s" "-o"
"GHC_NCG_OptimizationBug.o"
Linking GHC_NCG_OptimizationBug.exe ...
*Main> :! GHC_NCG_OptimizationBug +RTS -s
46000000
32,989,396 bytes allocated in the heap
4,976 bytes copied during GC
41,860 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 61 colls, 0 par 0.00s 0.00s 0.0000s
0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s
0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.64s ( 0.64s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.66s ( 0.64s elapsed)
%GC time 0.0% (0.1% elapsed)
Alloc rate 51,495,642 bytes per MUT second
Productivity 100.0% of total user, 102.3% of total elapsed
}}}
Looking at the ASM dump for the innermost tight culling loop reveals the
problem, with 7.8.1 RC2 outputting as follow:
{{{
_n3nx:
movl 76(%esp),%ecx
_c3gf:
cmpl %ecx,%eax
jg _c3jB
_c3jC:
movl %eax,%edx
sarl $5,%edx
movl %ecx,76(%esp)
movl $1,%ecx
movl %ecx,280(%esp)
movl %eax,%ecx
andl $31,%ecx
movl %eax,292(%esp)
movl 280(%esp),%eax
shll %cl,%eax
xorl $-1,%eax
movl 64(%esp),%ecx
addl $8,%ecx
movl (%ecx,%edx,4),%ecx
andl %eax,%ecx
movl 64(%esp),%eax
addl $8,%eax
movl %ecx,(%eax,%edx,4)
movl 292(%esp),%eax
addl $3,%eax
jmp _n3nx
}}}
and 7.6.3 outputting as follows:
{{{
.text
.align 4,0x90
.long 1894
.long 32
s1GZ_info:
_c1YB:
cmpl 16(%ebp),%esi
jg _c1YE
movl %esi,%edx
sarl $5,%edx
movl $1,%eax
movl %esi,%ecx
andl $31,%ecx
shll %cl,%eax
xorl $-1,%eax
movl 12(%ebp),%ecx
movl 8(%ecx,%edx,4),%ecx
andl %eax,%ecx
movl 12(%ebp),%eax
movl %ecx,8(%eax,%edx,4)
addl 4(%ebp),%esi
jmp s1GZ_info
_c1YE:
movl 8(%ebp),%esi
addl $8,%ebp
jmp s1GB_info
}}}
The second code is clearly much more efficient, with the only memory
access reading/writing the sieve buffer array and one register reload of
the prime value to add to the current position index, whereas the first
(7.8.1 RC2) code has three register spills and five register re-loads,
almost as if debugging were still turned on.
This bug was tested under Windows, but likely applies to other platforms,
at least for 32-bit versions but also possibly to others.
--
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8971#comment:29>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list