[GHC] #12231: Eliminate redundant heap allocations/deallocations

Sat Jun 25 21:01:26 UTC 2016

#12231: Eliminate redundant heap allocations/deallocations
-------------------------------------+-------------------------------------
        Reporter:  harendra          |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.0.1
  (CodeGen)                          |
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Changes (by harendra):

 * component:  Compiler => Compiler (CodeGen)


@@ -51,2 +51,2 @@
- GHC_PERF_ISSUE directory. - https://github.com/harendra-kumar/unicode-
- transforms/tree/GHC_PERF_ISSUE
+ GHC_PERF_ISSUE directory: https://github.com/harendra-kumar/unicode-
+ transforms/tree/ghc-trac-12231

New description:

 The code generated by GHC sometimes places the Heap allocation and
 deallocation in a common path which has multiple sub-branches inside it.
 Some of those branches may not need heap at all and it just checks
 allocates and deallocates it without using. This adds unnecessary overhead
 and becomes significant when the loop is small and runs very frequently.

 Instead of putting the heap allocation code at a higher level branch we
 can move it to more specific branches where it is really needed so that it
 gets out of the paths which do not need it.

 I am attaching a specific example of such a case. The code used here is in
 a public github repository [2]. It should be quite easy to play with it,
 there are detailed instructions along with the code to reproduce and
 investigate the issue.

 The files referenced in the following text are attached with this ticket.
 They are available in the github repo as well. Look at the CMM trace in
 `decompose-loop-execution-trace.cmm` and `decompose-loop-full.cmm`, start
 at label `c4ic` where we allocate space on heap (+48). There are multiple
 possible paths from this point on, some of those use the heap and some
 don't. Some interesting paths are:

 {{{
 1) c4ic (allocate) -> c4mw -> {c4pv} -> ...
 2) c4ic (allocate) -> c4mw -> c4pw -> ((c4pr -> ({c4pe} -> ... | c4ph ->
 ...)) | cp4ps -> ...)
 }}}

 I have put those which use the heap inside curly braces, the rest do not
 use it at all. The paths which actually need the heap are rarely executed
 in this case. But because of the placement it is always allocated and
 deallocated in a the fast path which is executed almost all the time.

 If we can place this allocation at both `c4pv` and `c4pe` instead of the
 common parent then we can save the fast path from this check.  The same
 thing applies to the allocation at label `c4jd` as well.

 The whole loop is composed on 51 instructions (ghc-8.0.1) and 8 of them
 are heap adjustment related, which comes out to almost 16%. If I can get
 this back then the performance of this loop will get quite close to the
 gcc compiled code which is currently around 30% faster.

 This is a CMM level optimization which will benefit all architectures. I
 also noticed that llvm removed the heap adjustment though not the checks.

 == References ==

 * [1] mail thread - https://mail.haskell.org/pipermail/ghc-
 devs/2016-June/012275.html
 * [2] github repo for code, traces, reproduction details. See the
 GHC_PERF_ISSUE directory: https://github.com/harendra-kumar/unicode-
 transforms/tree/ghc-trac-12231

--

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12231#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler