[GHC] #10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure

GHC ghc-devs at haskell.org
Thu Jan 22 03:48:47 UTC 2015


#10012: Cheap-to-compute values aren't pushed into case branches inducing
unnecessary register pressure
-------------------------------------+-------------------------------------
        Reporter:  bgamari           |                   Owner:
            Type:  bug               |                  Status:  new
        Priority:  normal            |               Milestone:
       Component:  Compiler          |                 Version:  7.8.4
      Resolution:                    |                Keywords:
Operating System:  Unknown/Multiple  |            Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |               Test Case:
      Blocked By:                    |                Blocking:
 Related Tickets:                    |  Differential Revisions:
-------------------------------------+-------------------------------------
Description changed by bgamari:

Old description:

> When characterizing bytestring's `Builder` interface[1] I noticed that
> some benchmarks involving repeated appends (e.g. `Host endian/1MB of
> Word8 in chunks of 16`) perform inexplicably much worse than others. A
> glance at the assembly revealed that a large number of `Word8`s are being
> evaluated and saved to registers, only to be later stored. E.g.,
>
> ```
> # First we evaluate a number of Word8s, saving them in registers
> movzbl %bl,%ebx
> leaq 1(%r14),%rcx
> movzbl %cl,%ecx
> leaq 2(%r14),%rdx
> movzbl %dl,%edx
> leaq 3(%r14),%rsi
> movzbl %sil,%esi
> leaq 4(%r14),%r9
> movzbl %r9b,%r9d
> leaq 5(%r14),%r10
> movzbl %r10b,%r10d
> ...
> # Eventually we run out of registers and start spilling to the stack
> movq %rax,64(%rsp)
> leaq 7(%r14),%rax
> movzbl %al,%eax
> movq %rbx,72(%rsp)
> ...
>
> # Only after evaluating all of the needed words do we actually consume
> # them
> movq %rax,-152(%r12)
> movq 72(%rsp),%rax
> movq %rax,-144(%r12)
> movq 80(%rsp),%rax
> movq %rax,-136(%r12)
> movq 88(%rsp),%rax
> movq %rax,-128(%r12)
> ...
> movq %rsi,-56(%r12)
> movq %r9,-48(%r12)
> movq %r10,-40(%r12)
> movq %r11,-32(%r12)
> movq %r14,-24(%r12)
>
> ```
>
> This is due to the fact that the `Word`s are bound outside of a case
> analysis and GHC is reluctant to push them inside of the branches. The
> float-in pass will only float a binding into a case if the value is
> "small" and at least one branch doesn't use the binding. Unfortunately
> the case expression in question has only two branches.
>
> This is demonstrated in the attached testcase.
>
> [1] https://github.com/kolmodin/binary/pull/65
> [2] https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html

New description:

 When characterizing bytestring's `Builder` interface[1] I noticed that
 some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8
 in chunks of 16`) perform inexplicably much worse than others. A glance at
 the assembly revealed that a large number of `Word8`s are being evaluated
 and saved to registers, only to be later stored. E.g.,

 {{{#asm
 # First we evaluate a number of Word8s, saving them in registers
 movzbl %bl,%ebx
 leaq 1(%r14),%rcx
 movzbl %cl,%ecx
 leaq 2(%r14),%rdx
 movzbl %dl,%edx
 leaq 3(%r14),%rsi
 movzbl %sil,%esi
 leaq 4(%r14),%r9
 movzbl %r9b,%r9d
 leaq 5(%r14),%r10
 movzbl %r10b,%r10d
 ...
 # Eventually we run out of registers and start spilling to the stack
 movq %rax,64(%rsp)
 leaq 7(%r14),%rax
 movzbl %al,%eax
 movq %rbx,72(%rsp)
 ...

 # Only after evaluating all of the needed words do we actually consume
 # them
 movq %rax,-152(%r12)
 movq 72(%rsp),%rax
 movq %rax,-144(%r12)
 movq 80(%rsp),%rax
 movq %rax,-136(%r12)
 movq 88(%rsp),%rax
 movq %rax,-128(%r12)
 ...
 movq %rsi,-56(%r12)
 movq %r9,-48(%r12)
 movq %r10,-40(%r12)
 movq %r11,-32(%r12)
 movq %r14,-24(%r12)
 }}}

 This is due to the fact that the `Word`s are bound outside of a case
 analysis and GHC is reluctant to push them inside of the branches. The
 float-in pass will only float a binding into a case if the value is
 "small" and at least one branch doesn't use the binding. Unfortunately the
 case expression in question has only two branches.

 This is demonstrated in the attached testcase.

 [1] https://github.com/kolmodin/binary/pull/65
 [2] https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html

--

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10012#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler


More information about the ghc-tickets mailing list