[GHC] #9041: NCG generates slow loop code

Sun Apr 27 15:45:19 UTC 2014

#9041: NCG generates slow loop code
-------------------------+-------------------------------------------------
       Reporter:  nh2    |             Owner:
           Type:  bug    |            Status:  new
       Priority:         |         Milestone:
  normal                 |           Version:  7.6.3
      Component:         |  Operating System:  Unknown/Multiple
  Compiler (NCG)         |   Type of failure:  Compile-time performance bug
       Keywords:         |         Test Case:
   Architecture:         |          Blocking:
  Unknown/Multiple       |
     Difficulty:         |
  Unknown                |
     Blocked By:         |
Related Tickets:         |
-------------------------+-------------------------------------------------
 In http://stackoverflow.com/a/23322255/263061 we compile the code

 {{{
 main :: IO ()
 main = do
   loop (maxBound :: Word32) $ \i -> do
     when (i `rem` 100000000 == 0) $
       putStrLn "foo"
 }}}

 and compare the assembly generated by the NCG and `-fllvm`.

 The LLVM code is 10 times faster.

 While the LLVM code looks great, NCG one generated looks quite unfortunate
 and messy, with a lot of unnecessary `mov`ing around.

 I am by no means an expert in this, but some of it looks like low hanging
 potential improvements.

 This is the '''LLVM code''':

 {{{
  Main_zdwa_info:
 .LBB1_1:
     movl    $4294967295, %esi           /* load the loop bound */
     movabsq $-6067343680855748867, %rdi /* load a magic number for the
 modulus */
     jmp .LBB1_2
 .LBB1_4:
     incl    %ecx                        /* loop core start */
 .LBB1_2:
     cmpq    %rsi, %rcx
     je  .LBB1_6                         /* check loop bound */

     /* do the modulus with two multiplications, a shift and a magic number
 */
     /* note : gcc does the same reduction */
     movq    %rcx, %rax
     mulq    %rdi
     shrq    $26, %rdx
     imulq   $100000000, %rdx, %rax
     cmpq    %rax, %rcx
     jne .LBB1_4                         /* loop core end */
     /* Code omitted: print, then return to loop beginning */
 .LBB1_6:
 }}}

 So the core of the loop is a nice and short:

 {{{
 incl
 cmpq
 je

 movq [division replaced by magic multiplication]
 mulq
 shrq
 imulq

 cmpq
 jne
 }}}

 Now what '''NCG''' produces:

 {{{
 Main_zdwa_info:           /* loop core start */
 .Lc3JD:
     leaq -16(%rbp),%rax   /* stack check */
     cmpq %r15,%rax
     jb .Lc3JO             /* jump not taken in the normal case */
 .Lc3JP:
     movl $4294967295,%eax /* issue: loading the bound on every iteration
 */
     cmpq %rax,%r14
     jne .Lc3JB
 .Lc3JC:
    /* Return from main. Code omitted */
 .Lc3JB: /* test the index for modulus */
     movl $100000000,%eax  /* issue: unnecessary moves */
     movq %rax,%rbx        /* and loading the modulo arg on every iteration
 */
     movq %r14,%rax
     xorq %rdx,%rdx        /* shorter alternative to mov $0,%rdx */
     divq %rbx             /* issue: doing the division (llvm and gcc avoid
 this) */
     testq %rdx,%rdx
     jne .Lc3JU            /* This jump is usually taken. */
 .Lc3JV:
    /* do the printing. Code omitted. Not relevant */
 .Lc3JN:
    /* increment index and (I guess) restore registers messed up by the
 printing */
     movq 8(%rbp),%rax     /* This code is not very relevant because we
 don't almost never */
     incq %rax
     movl %eax,%r14d
     addq $16,%rbp
     jmp Main_zdwa_info
 .Lc3JU:
     leaq 1(%r14),%rax   /*issue: why not just increment r14? */
     movl %eax,%r14d
     jmp Main_zdwa_info
 }}}

 So the core of this loop is

 {{{
 movl [stack check]
 cmpq
 jne

 movl
 movq
 movq
 xorq
 divq
 testq
 jne

 leaq
 movl
 jmp
 }}}

 Some issues here:

 * the stack check is inside the loop
 * the loop limit is loaded every time inside the loop
 * the modulo arg is loaded every time inside the loop
 * we are shuffling around registers
 * no strength reduction (div is not replaced by cheaper magic number imul)
 * weird increment using leaq

 It would be great if somebody with codegen insight could comment on
 whether these are easy targets for improvement!

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9041>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler