[GHC] #8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3...
GHC
ghc-devs at haskell.org
Fri Apr 25 04:44:29 UTC 2014
#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3...
--------------------------------------------+------------------------------
Reporter: GordonBGood | Owner:
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler (NCG) | Version: 7.8.1-rc2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Runtime performance bug | Difficulty: Unknown
Test Case: | Blocked By:
Blocking: | Related Tickets:
--------------------------------------------+------------------------------
Comment (by GordonBGood):
Replying to [comment:9 tibbe]:
> Replying to [comment:8 GordonBGood]:
> > I only referred to LLVM as proof that the problem seems to be limited
> > to NCG, as both NCG and LLVM will share the same C-- output (or at
> > least I think so???) yet NCG shows this step backwards whereas LLVM
> > does not.
>
> That's a good indication that the NCG is to blame, but it's also
> possible that the Cmm codegen has regressed but some LLVM optimizations
> make up for the regression.
Alright, as requested by "ezyang" earlier, I added the -ddump-cmm and
-ddump-opt-cmm switches to the compilation, with the following results:
For version 7.6.3, the CMM code for the loops looks like this:
{{{
s1Hq_ret()
{ label: s1Hq_info
rep:StackRep [True, False, True, True, True]
}
c1XX:
_c1Tr::I32 = %MO_S_Gt_W32(R1, I32[Sp + 4]);
;
if (_c1Tr::I32 >= 1) goto c1XZ;
_s1H9::I32 = %MO_S_Shr_W32(R1, 5);
_s1He::I32 = I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)];
_s1KJ::I32 = R1;
_s1Hh::I32 = _s1KJ::I32 & 31;
_s1Hj::I32 = _s1Hh::I32;
_s1KI::I32 = 1 << _s1Hj::I32;
_s1Hm::I32 = _s1KI::I32 ^ 18446744073709551615;
_s1KH::I32 = _s1He::I32 & _s1Hm::I32;
I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)] = _s1KH::I32;
_s1KG::I32 = R1 + 3;
R1 = _s1KG::I32;
jump s1Hq_info; // [R1]
c1XZ:
R1 = 1;
jump s1HY_info; // [R1]
},
}}}
and the opt-cmm code for about the same area looks like this:
{{{
s1Hq_ret()
{ Just s1Hq_info:
const 933;
const 32;
}
c1XX:
;
if (%MO_S_Gt_W32(R1, I32[Sp + 4])) goto c1XZ;
_s1H9::I32 = %MO_S_Shr_W32(R1, 5);
I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)] = I32[I32[Sp + 8] +
((_s1H9::I32 << 2) + 8)] & (1 << R1 & 31) ^ 18446744073709551615;
R1 = R1 + 3;
jump s1Hq_info; // [R1]
c1XZ:
R1 = 1;
jump s1HY_info; // [R1]
}
}}}
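An incidental observation about the dump above: the opt-cmm printer renders the bit mask as `1 << R1 & 31`, but the unoptimized dump makes the intended grouping clear: `R1 & 31` is computed first, so the expression is `1 << (R1 & 31)` (select a bit within a 32-bit word), not `(1 << R1) & 31`. A short Haskell check of the two readings (my own illustration, not part of the ticket's compile output):

```haskell
import Data.Bits (shiftL, (.&.))
import Data.Word (Word32)

main :: IO ()
main = do
  let r1 = 7 :: Int
  -- intended grouping: shift count is masked to the word size,
  -- producing a single-bit mask (here bit 7)
  print ((1 `shiftL` (r1 .&. 31)) :: Word32)  -- 128
  -- naive left-to-right reading: shift first, then mask the result
  print (((1 `shiftL` r1) .&. 31) :: Word32)  -- 0
```

So the two groupings give different results for most inputs; the printed form is just a pretty-printer eliding parentheses, not a change in semantics.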
For version 7.8.1 RC2 the Cmm code is about eight times as long, and the
section corresponding to the same loop looks like this:
{{{
c3jO:
_c3jQ::I32 = %MO_S_Gt_W32(_s33c::I32, _s32t::I32);
_s33e::I32 = _c3jQ::I32;
if (_s33e::I32 >= 1) goto c3jY; else goto c3jZ;
c3jY:
_s32S::I32 = _s335::I32;
goto c3gp;
c3jZ:
_c3kn::I32 = %MO_S_Shr_W32(_s33c::I32, 5);
_s33g::I32 = _c3kn::I32;
_s33j::I32 = I32[(_s327::P32 + 8) + (_s33g::I32 << 2)];
_s33j::I32 = _s33j::I32;
_c3kq::I32 = _s33c::I32;
_s33k::I32 = _c3kq::I32;
_c3kt::I32 = _s33k::I32 & 31;
_s33l::I32 = _c3kt::I32;
_c3kw::I32 = _s33l::I32;
_s33m::I32 = _c3kw::I32;
_c3kz::I32 = 1 << _s33m::I32;
_s33n::I32 = _c3kz::I32;
_c3kC::I32 = _s33n::I32 ^ 4294967295;
_s33o::I32 = _c3kC::I32;
_c3kF::I32 = _s33j::I32 & _s33o::I32;
_s33p::I32 = _c3kF::I32;
I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = _s33p::I32;
_c3kK::I32 = _s33c::I32 + _s32U::I32;
_s33r::I32 = _c3kK::I32;
_s33c::I32 = _s33r::I32;
goto c3jO;
}}}
and the optimized opt-cmm code looks like this:
{{{
c3jO:
if (%MO_S_Gt_W32(_s33c::I32,
_s32t::I32)) goto c3jY; else goto c3jZ;
c3jY:
Sp = Sp + 8;
_s32S::I32 = _s335::I32;
goto c3gp;
c3jZ:
_s33g::I32 = %MO_S_Shr_W32(_s33c::I32, 5);
I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = I32[(_s327::P32 + 8)
+ (_s33g::I32 << 2)] & (1 << _s33c::I32 & 31) ^ 4294967295;
_s33c::I32 = _s33c::I32 + _s32U::I32;
goto c3jO;
}}}
It appears to me that the opt-cmm code is about the same, but the straight
Cmm dump has regressed: it has lost even the basic optimizations that were
there with the older version.
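For context, every iteration in the dumps above performs the same operation: compute a word index `i >> 5` and a bit index `i & 31`, clear that bit in a packed 32-bit bit array, and advance `i` by a stride. A minimal Haskell sketch of such a culling loop (a hypothetical reconstruction for illustration; the ticket does not include the benchmark source) is:

```haskell
import Control.Monad (unless)
import Data.Array.IO (IOUArray, getElems, newArray, readArray, writeArray)
import Data.Bits (complement, shiftL, shiftR, (.&.))
import Data.Word (Word32)

-- Clear bit i of a packed bit array for i = start, start+step, ..
-- while i <= limit.  Each iteration does exactly the load/mask/store
-- visible in the Cmm dumps: word index = i >> 5, bit index = i & 31.
cull :: IOUArray Int Word32 -> Int -> Int -> Int -> IO ()
cull buf start step limit = go start
  where
    go i = unless (i > limit) $ do
      let w = i `shiftR` 5
      v <- readArray buf w
      writeArray buf w (v .&. complement (1 `shiftL` (i .&. 31)))
      go (i + step)

main :: IO ()
main = do
  buf <- newArray (0, 1) (complement 0) :: IO (IOUArray Int Word32)
  cull buf 3 3 31          -- clears bits 3,6,...,30 of word 0
  [w0, w1] <- getElems buf
  print (w0, w1)           -- (3067833783,4294967295)
```

The loop body compiles to a handful of machine operations when register-allocated well, which is why the extra copy chains in the 7.8.1 RC2 unoptimized Cmm are so costly if they survive into the backend.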
It may be that the NCG is using the non-optimized Cmm as its source while
the LLVM backend uses the optimized Cmm, which would explain why the
LLVM-generated code is still efficient where the NCG-generated code is
not.
Anticipating your next request to look at the STG output, I used the
-ddump-stg compiler option to examine that code. It is too unwieldy to
post here (verbose, with very long lines), but a quick examination shows
it to be about the same for both versions, except that internal constants
are recorded as 32-bit numbers in 7.8.1, whereas 7.6.3 recorded them as
64-bit numbers even when referring to 32-bit registers; this matches the
way they appear in both the optimized and non-optimized Cmm dumps listed
above.
Thus, the bug/regression appears to go further back than just the new NCG
(which is likely using the non-optimized Cmm code as input): the Cmm code
generator itself is producing much less efficient Cmm code.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8971#comment:10>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler