[GHC] #12231: Eliminate redundant heap allocations/deallocations
GHC
ghc-devs at haskell.org
Sat Jun 25 21:01:26 UTC 2016
#12231: Eliminate redundant heap allocations/deallocations
-------------------------------------+-------------------------------------
Reporter: harendra | Owner:
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 8.0.1
(CodeGen) |
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Changes (by harendra):
* component: Compiler => Compiler (CodeGen)
@@ -51,2 +51,2 @@
- GHC_PERF_ISSUE directory. - https://github.com/harendra-kumar/unicode-
- transforms/tree/GHC_PERF_ISSUE
+ GHC_PERF_ISSUE directory: https://github.com/harendra-kumar/unicode-
+ transforms/tree/ghc-trac-12231
New description:
The code generated by GHC sometimes places the Heap allocation and
deallocation in a common path which has multiple sub-branches inside it.
Some of those branches may not need heap at all and it just checks
allocates and deallocates it without using. This adds unnecessary overhead
and becomes significant when the loop is small and runs very frequently.
Instead of putting the heap allocation code at a higher level branch we
can move it to more specific branches where it is really needed so that it
gets out of the paths which do not need it.
I am attaching a specific example of such a case. The code used here is in
a public github repository [2]. It should be quite easy to play with it,
there are detailed instructions along with the code to reproduce and
investigate the issue.
The files referenced in the following text are attached with this ticket.
They are available in the github repo as well. Look at the CMM trace in
`decompose-loop-execution-trace.cmm` and `decompose-loop-full.cmm`, start
at label `c4ic` where we allocate space on heap (+48). There are multiple
possible paths from this point on, some of those use the heap and some
don't. Some interesting paths are:
{{{
1) c4ic (allocate) -> c4mw -> {c4pv} -> ...
2) c4ic (allocate) -> c4mw -> c4pw -> ((c4pr -> ({c4pe} -> ... | c4ph ->
...)) | cp4ps -> ...)
}}}
I have put those which use the heap inside curly braces, the rest do not
use it at all. The paths which actually need the heap are rarely executed
in this case. But because of the placement it is always allocated and
deallocated in a the fast path which is executed almost all the time.
If we can place this allocation at both `c4pv` and `c4pe` instead of the
common parent then we can save the fast path from this check. The same
thing applies to the allocation at label `c4jd` as well.
The whole loop is composed on 51 instructions (ghc-8.0.1) and 8 of them
are heap adjustment related, which comes out to almost 16%. If I can get
this back then the performance of this loop will get quite close to the
gcc compiled code which is currently around 30% faster.
This is a CMM level optimization which will benefit all architectures. I
also noticed that llvm removed the heap adjustment though not the checks.
== References ==
* [1] mail thread - https://mail.haskell.org/pipermail/ghc-
devs/2016-June/012275.html
* [2] github repo for code, traces, reproduction details. See the
GHC_PERF_ISSUE directory: https://github.com/harendra-kumar/unicode-
transforms/tree/ghc-trac-12231
--
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12231#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list