[GHC] #12798: LLVM seeming to over optimize, producing inefficient assembly code...
GHC
ghc-devs at haskell.org
Tue Nov 22 08:33:50 UTC 2016
#12798: LLVM seeming to over optimize, producing inefficient assembly code...
-------------------------------------+-------------------------------------
Reporter: GordonBGood | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 8.2.1
Component: Compiler (LLVM) | Version: 8.0.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by GordonBGood):
@AlexET, Replying to [comment:1 AlexET]:
> On current HEAD with llvm 3.9.0, the following code:
>
> ... code skipped for brevity...
>
> The CMM and initial llvm code is the same as your code for 8.0.1, so it
seems the difference is due to the fact that rust ships with its own more
recent llvm than ghc 8.0.1 supports.
>
> The difference in the code between my version and your original version
is that we force `twos` early which is needed to prevent the evaluation of
that within the loop, an optimisation which seems to have been missed by
HEAD.
It's even worse with GHC version 8.0.1 than I thought:, with the very
minor change of the inner loop to as follows:
{{{
let cull j = -- very tight inner loop where all the time
is spent
if j > bfLmt then return () else do
let sh = unsafeAt twos (j .&. 31)
let w = j `shiftR` 5
ov <- unsafeRead cmpstsw w
unsafeWrite cmpstsw w (ov .|. sh) -- (1 `shiftL`
(j .&. 31)))
cull (j + p) in do { cull s; cullp (i + 1) } in
cullp 0
}}}
only changed so the loop back for the next prime value is outside the
`cull` loop, the assembly code is as follows:
{{{
.align 16, 0x90
.LBB33_3: # %caLP.i.caLP.i_crit_edge
# in Loop: Header=BB33_2 Depth=1
movq (%r12), %rdx
.LBB33_2: # %caLP.i
# =>This Inner Loop Header:
Depth=1
movq %rsi, %rcx
movl %ecx, %edi
movq %rdx, %rsi
addq %rcx, %rsi
sarq $5, %rcx
movq -24(%r12), %rdx
movq -16(%r12), %rbx
andl $31, %edi
movl 16(%rbx,%rdi,4), %edi
orl %edi, 16(%rdx,%rcx,4)
cmpq -8(%r12), %rsi
jle .LBB33_3
}}}
with three register reloads in the loop and another version in the code
also reloading the `p` increment value for a total of four reloads. I
can't see that the code generated should be any different then for the
original ticket code.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12798#comment:3>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list