[GHC] #9041: NCG generates slow loop code
GHC
ghc-devs at haskell.org
Sun Apr 27 15:45:19 UTC 2014
#9041: NCG generates slow loop code
-------------------------+-------------------------------------------------
Reporter: nh2 | Owner:
Type: bug | Status: new
Priority: | Milestone:
normal | Version: 7.6.3
Component: | Operating System: Unknown/Multiple
Compiler (NCG) | Type of failure: Compile-time performance bug
Keywords: | Test Case:
Architecture: | Blocking:
Unknown/Multiple |
Difficulty: |
Unknown |
Blocked By: |
Related Tickets: |
-------------------------+-------------------------------------------------
In http://stackoverflow.com/a/23322255/263061 we compile the code
{{{
main :: IO ()
main = do
loop (maxBound :: Word32) $ \i -> do
when (i `rem` 100000000 == 0) $
putStrLn "foo"
}}}
and compare the assembly generated by the NCG and `-fllvm`.
The LLVM code is 10 times faster.
While the LLVM code looks great, NCG one generated looks quite unfortunate
and messy, with a lot of unnecessary `mov`ing around.
I am by no means an expert in this, but some of it looks like low hanging
potential improvements.
This is the '''LLVM code''':
{{{
Main_zdwa_info:
.LBB1_1:
movl $4294967295, %esi /* load the loop bound */
movabsq $-6067343680855748867, %rdi /* load a magic number for the
modulus */
jmp .LBB1_2
.LBB1_4:
incl %ecx /* loop core start */
.LBB1_2:
cmpq %rsi, %rcx
je .LBB1_6 /* check loop bound */
/* do the modulus with two multiplications, a shift and a magic number
*/
/* note : gcc does the same reduction */
movq %rcx, %rax
mulq %rdi
shrq $26, %rdx
imulq $100000000, %rdx, %rax
cmpq %rax, %rcx
jne .LBB1_4 /* loop core end */
/* Code omitted: print, then return to loop beginning */
.LBB1_6:
}}}
So the core of the loop is a nice and short:
{{{
incl
cmpq
je
movq [division replaced by magic multiplication]
mulq
shrq
imulq
cmpq
jne
}}}
Now what '''NCG''' produces:
{{{
Main_zdwa_info: /* loop core start */
.Lc3JD:
leaq -16(%rbp),%rax /* stack check */
cmpq %r15,%rax
jb .Lc3JO /* jump not taken in the normal case */
.Lc3JP:
movl $4294967295,%eax /* issue: loading the bound on every iteration
*/
cmpq %rax,%r14
jne .Lc3JB
.Lc3JC:
/* Return from main. Code omitted */
.Lc3JB: /* test the index for modulus */
movl $100000000,%eax /* issue: unnecessary moves */
movq %rax,%rbx /* and loading the modulo arg on every iteration
*/
movq %r14,%rax
xorq %rdx,%rdx /* shorter alternative to mov $0,%rdx */
divq %rbx /* issue: doing the division (llvm and gcc avoid
this) */
testq %rdx,%rdx
jne .Lc3JU /* This jump is usually taken. */
.Lc3JV:
/* do the printing. Code omitted. Not relevant */
.Lc3JN:
/* increment index and (I guess) restore registers messed up by the
printing */
movq 8(%rbp),%rax /* This code is not very relevant because we
don't almost never */
incq %rax
movl %eax,%r14d
addq $16,%rbp
jmp Main_zdwa_info
.Lc3JU:
leaq 1(%r14),%rax /*issue: why not just increment r14? */
movl %eax,%r14d
jmp Main_zdwa_info
}}}
So the core of this loop is
{{{
movl [stack check]
cmpq
jne
movl
movq
movq
xorq
divq
testq
jne
leaq
movl
jmp
}}}
Some issues here:
* the stack check is inside the loop
* the loop limit is loaded every time inside the loop
* the modulo arg is loaded every time inside the loop
* we are shuffling around registers
* no strength reduction (div is not replaced by cheaper magic number imul)
* weird increment using leaq
It would be great if somebody with codegen insight could comment on
whether these are easy targets for improvement!
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9041>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list