[GHC] #10397: Compiler performance regression 7.6 -> 7.8 in elimCommonBlocks
GHC
ghc-devs at haskell.org
Sat May 9 12:23:45 UTC 2015
#10397: Compiler performance regression 7.6 -> 7.8 in elimCommonBlocks
-------------------------------------+-------------------------------------
        Reporter:  TobyGoodwin       |                   Owner:
            Type:  bug               |                  Status:  new
        Priority:  normal            |               Milestone:
       Component:  Compiler          |                 Version:  7.8.4
        Keywords:  performance       |        Operating System:  Unknown/Multiple
    Architecture:  Unknown/Multiple  |         Type of failure:  None/Unknown
       Test Case:  see ticket        |              Blocked By:
        Blocking:                    |         Related Tickets:
                                     |  Differential Revisions:
-------------------------------------+-------------------------------------
I've been studying a compiler performance problem in ghc-7.8.4 that is
not seen in ghc-7.6.3. Profiling shows the compiler spends 74.3% of its
time in `elimCommonBlocks` (in `compiler/cmm/CmmCommonBlockElim.hs`).
I came across the problem in real code, and rwbarton on #ghc was able to
turn it into a sensible test case, which can be found at
http://static.tobold.org/ghc-prof/code/
Strictly speaking it's not a regression, since `elimCommonBlocks` is not
enabled by default in 7.6; turning it on there with `-fnew-codegen`
demonstrates a similar slowdown. However, 7.8 is worse than 7.6, and 7.10
is slightly worse still, for reasons I haven't explored. Compile times
for `RegBig.hs` are:
{{{
7.6.3 -O2 17s
7.6.3 -O2 -fnew-codegen 177s
7.8.4 -O2 230s
7.10.1 -O2 241s
}}}
The problem clearly stems from the applicative expression with lots of
branches, `Foo <$> a <*> b <*> c <*> ...`. It seems that some level of
inlining occurs such that each extra term in this expression doubles the
size of the code. Incidentally, this blowup only occurs when both
`RegBig.hs` ''and'' `Form.hs` are compiled with `-O2`. I haven't looked
into why the blowup occurs, although I do see that a similar blowup
happens with 7.6.
{{{
...
*** Simplifier:
Result size of Simplifier iteration=1
= {terms: 16,525, types: 31,503, coercions: 73}
Result size of Simplifier iteration=2
= {terms: 29,070, types: 45,079, coercions: 73}
Result size of Simplifier iteration=3
= {terms: 50,139, types: 62,057, coercions: 73}
Result size of Simplifier iteration=4
= {terms: 84,172, types: 83,805, coercions: 73}
Result size of Simplifier
= {terms: 84,172, types: 83,805, coercions: 73}
...
}}}
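For reference, the shape of the offending expression is roughly the
following (a hypothetical minimal sketch; the names and the `Maybe`
applicative are illustrative, and the real test case builds a much
larger record):
{{{
-- Hypothetical sketch of the shape of code that triggers the blowup:
-- a constructor applied to a long chain of applicative terms.  Each
-- extra `<*>` term is what appears to double the generated code size.
data Foo = Foo Int Int Int Int
  deriving (Show, Eq)

mkFoo :: Maybe Int -> Maybe Int -> Maybe Int -> Maybe Int -> Maybe Foo
mkFoo a b c d = Foo <$> a <*> b <*> c <*> d
}}}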
I've looked in some detail at what happens next, tinkering with
`elimCommonBlocks` to try to understand why it runs so slowly in this
case.
There are two major problems. First, `elimCommonBlocks` is worse than
O(n^2^), and the inlining blowup mentioned above results in a very large
input: 51,695 blocks in this example. (Across the project of 50 or so
files where this problem originated, I see the input to
`elimCommonBlocks` is normally a few hundred blocks at most.)
Second, `elimCommonBlocks` makes an effort to avoid comparing whole
blocks all the time by computing a hash of part of each block, but in
this case the hash fails to produce sufficiently distinct values.
The hash must obviously omit the unique block label, and in practice all
labels, and various other things, are omitted as well. Here, presumably
because the code is blown up from a tiny input, a great many blocks are
similar, and the remaining differences are elided by the hash function:
for example, 8177 blocks share the hash value 1094.
These 8177 similar blocks all end up in an ordinary list, which is
searched with `Data.List.find` and the full-block comparison function!
Eventually this process concludes that the vast majority are in fact
different: for all its hard work, `elimCommonBlocks` finds only about a
dozen common blocks in total.
Two blocks which both hash to 1094 are these:
{{{
cCHc:
_se0q::P64 = R1;
_c1grM::P64 = _se0q::P64 & 7;
if (_c1grM::P64 >= 2) goto c1grR; else goto c1grS;
cWK7:
_sig3::P64 = R1;
_c1iXk::P64 = _sig3::P64 & 7;
if (_c1iXk::P64 >= 2) goto c1iXp; else goto c1iXq;
}}}
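In outline, the bucket search described above amounts to the following
pattern (a simplified sketch with stand-in types and a deliberately
coarse hash, not the actual `CmmCommonBlockElim` code):
{{{
import qualified Data.Map.Strict as M
import Data.List (find)

-- Blocks are grouped by hash, and each new block is compared with
-- Data.List.find against every block already in its bucket.  When many
-- blocks share one hash, each insertion scans a long list, giving the
-- quadratic behaviour observed.  Block, hashBlock and sameBlock are
-- stand-ins for the real Cmm types and comparisons.
type Block = [String]          -- stand-in for a Cmm block body

hashBlock :: Block -> Int
hashBlock = length             -- deliberately coarse, like the real hash

sameBlock :: Block -> Block -> Bool
sameBlock = (==)

dedupe :: [Block] -> [Block]
dedupe = go M.empty
  where
    go _ [] = []
    go buckets (b:bs) =
      let h      = hashBlock b
          bucket = M.findWithDefault [] h buckets
      in case find (sameBlock b) bucket of
           Just _  -> go buckets bs                -- duplicate: drop it
           Nothing -> b : go (M.insert h (b:bucket) buckets) bs
}}}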
I suggest the following steps need to be taken.
== 1. Mitigation ==
Since this is a real problem in the 7.8 and 7.10 series, it needs a simple
but effective mitigation. I suggest testing that the input to
`elimCommonBlocks` is not "too big"; 1000 blocks seems a reasonable
cutoff. Another simple mitigation would be to disable common block
elimination altogether (after all, `-O2` is usually understood to mean
"make my code go faster even if it gets bigger").
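The proposed size guard might look something like this (a sketch only,
not the actual patch; `maxBlocks`, the function names, and the trivial
placeholder pass are illustrative):
{{{
-- Sketch of the proposed cutoff: skip common-block elimination entirely
-- when the input is large.  The 1000-block threshold matches the
-- suggestion above; everything else here is a stand-in.
maxBlocks :: Int
maxBlocks = 1000

elimCommonBlocksGuarded :: [block] -> [block]
elimCommonBlocksGuarded blocks
  | length blocks > maxBlocks = blocks   -- too big: leave untouched
  | otherwise                 = elimCommonBlocks blocks

elimCommonBlocks :: [block] -> [block]
elimCommonBlocks = id   -- placeholder for the real pass
}}}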
I've tried both these possibilities with 7.8.4. The trivial patches are
http://static.tobold.org/ghc-prof/0001-cbe-cutoff.patch and
http://static.tobold.org/ghc-prof/0001-no-cbe.patch
Compilation times for the test case `RegBig.hs` are as follows; there is
no difference in the size of the `.o` file.
{{{
7.8.4 -O2 230s
cbe only if input < 1000 blocks 88s
no common block elimination 86s
}}}
== 2. Investigation ==
I focused on `elimCommonBlocks` on the assumption that the inlining
blowup seen earlier in the compilation is reasonable. On reflection, it
probably isn't, and I think someone better acquainted with ghc internals
should investigate it.
== 3. Optimization ==
`-- TODO: Use optimization fuel` it says in `CmmCommonBlockElim.hs`
Since `elimCommonBlocks` is essentially doing a `nub`, it will never
perform better than O(n^2^). In fact, it's currently much worse than that
as the comparison function changes as duplicate blocks are found, which in
turn may create new duplicates, so we have to iterate the nub function
till no more dups are found.
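That iterate-until-nothing-changes structure can be sketched as a
generic fixpoint combinator (illustrative only; the real pass threads
more state than this):
{{{
-- Each sweep may merge blocks, and merging can make further blocks
-- equal (gotos now target the merged label), so the sweep repeats
-- until the result stops changing.
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint f x
  | x' == x   = x
  | otherwise = fixpoint f x'
  where x' = f x
}}}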
I did experiment with improving the hash function. It's possible to
include target labels, at the expense of recomputing block hash values
on each iteration (and, currently, some ''highly'' ugly code to convert
labels to integers). This considerably shortens some of the hash-bucket
lists, and so offers a significant speedup:
{{{
ghc-7.8.4 -O2 230s
"hack" 170s
}}}
The patch for this is http://static.tobold.org/ghc-prof/0001-cbe-hacks.patch
This looks like a promising approach. Including target labels doesn't
eliminate all the large hash buckets, so there is scope for further
improvement along these lines. I see no reason why the hash function
should not be as discerning as the equality function, with only
occasional accidental clashes.
I've also pondered whether a more radical rewrite of `elimCommonBlocks`
might pay dividends. It's a commonplace that, if you don't need to
preserve order, `map head . group . sort` is a faster `nub`. Is the order
of blocks significant? I'm not sure - they all seem to end with either
`goto` or `call`. I'd ''assume'' that `call` then falls through to the
next block, but I'm not sure. In any case, we could presumably
`zip [0..]` first and sort back afterwards, which should still be
asymptotically faster than `nub`.
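The zip-and-sort idea can be sketched like so; it preserves the original
order while deduplicating in O(n log n), assuming an `Ord` instance
exists:
{{{
import Data.List (sortOn, groupBy)
import Data.Function (on)

-- Order-preserving O(n log n) nub: tag each element with its position,
-- sort by value, keep the first occurrence of each group (sortOn is
-- stable, so that's the earliest one), then restore the original order
-- by sorting on the tags.
ordNub :: Ord a => [a] -> [a]
ordNub =
    map snd
  . sortOn fst                       -- restore original order
  . map head
  . groupBy ((==) `on` snd)          -- equal values are now adjacent
  . sortOn snd                       -- group equal elements together
  . zip [0 :: Int ..]
}}}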
Of course, this presupposes that blocks could be (sensibly? efficiently?)
made an instance of `Ord`. (If they were, hash buckets could be stored in
an O(1) data structure and binary search employed.)
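If blocks did have an `Ord` instance, each hash bucket could indeed
become a `Data.Set`, making the duplicate test O(log n) instead of a
linear `find` (again a sketch with stand-in types, not real Cmm code):
{{{
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Same bucket structure as before, but the bucket is a Set, so
-- membership is a logarithmic lookup rather than a list scan.
type Block = [String]

hashBlock :: Block -> Int
hashBlock = length   -- deliberately coarse stand-in

dedupeSet :: [Block] -> [Block]
dedupeSet = go M.empty
  where
    go _ [] = []
    go buckets (b:bs)
      | S.member b bucket = go buckets bs      -- duplicate: drop it
      | otherwise         = b : go buckets' bs
      where
        h        = hashBlock b
        bucket   = M.findWithDefault S.empty h buckets
        buckets' = M.insert h (S.insert b bucket) buckets
}}}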
I may be able to find time to pursue some of these avenues, but as a very
novice ghc hacker, I'd need some strategic direction here.
Any thoughts?
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10397>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler