<div dir="ltr">I have a loop that runs millions of times, and for some reason I have to run it in the IO monad. I noticed that when I convert the code from pure to the IO monad, the generated assembly is almost identical except for one difference: a piece of code is placed in a separate block, and that makes a huge difference in performance (the IO version is 4-6x slower).<br><br>I want to understand what makes GHC generate code this way, and whether anything can be done at the source level (or with a GHC option) to control it.<br><br>The pure code looks like this:<br><br>        decomposeChars :: [Char] -> [Char]<br><br>        decomposeChars [] = []<br>        decomposeChars [x] =<br>            case NFD.isDecomposable x of<br>                True -> decomposeChars (NFD.decomposeChar x)<br>                False -> [x]<br>        decomposeChars (x : xs) = decomposeChars [x] ++ decomposeChars xs<br><br>The equivalent IO code is this:<br><br>        decomposeStrIO :: [Char] -> IO [Char]<br><br>        decomposeStrPtr !p = decomposeStrIO<br>            where<br>                decomposeStrIO [] = return []<br>                decomposeStrIO [x] = do<br>                    res <- NFD.isDecomposable p x<br>                    case res of<br>                        True -> decomposeStrIO (NFD.decomposeChar x)<br>                        False -> return [x]<br>                decomposeStrIO (x : xs) = do<br>                    s1 <- decomposeStrIO [x]<br>                    s2 <- decomposeStrIO xs<br>                    return (s1 ++ s2)<br><br>The difference is in how the code for the call to (++) is generated.
In the pure case, the call to (++) is made directly in the main loop:<br><br>_cn5N:<br>movq $sat_sn2P_info,-48(%r12)<br>movq %rax,-32(%r12)<br>movq %rcx,-24(%r12)<br>movq $:_con_info,-16(%r12)<br>movq 16(%rbp),%rax<br>movq %rax,-8(%r12)<br>movq $GHC.Types.[]_closure+1,(%r12)<br>leaq -48(%r12),%rsi<br>leaq -14(%r12),%r14<br>addq $40,%rbp<br>jmp GHC.Base.++_info<br><br>In the IO monad version, this code is placed in a separate block, and a call to it is placed in the main loop.<br><br>The main loop call site:<br><br>_cn6A:<br>movq $sat_sn3w_info,-24(%r12)<br>movq 8(%rbp),%rax<br>movq %rax,-8(%r12)<br>movq %rbx,(%r12)<br>leaq -24(%r12),%rbx<br>addq $40,%rbp<br>jmp *(%rbp)<br><br>The out-of-line block. The code that was in the main loop in the pure case has moved into this block (see label _cn5s below):<br><br>sat_sn3w_info:<br>_cn5p:<br>leaq -16(%rbp),%rax<br>cmpq %r15,%rax<br>jb _cn5q<br>_cn5r:<br>addq $24,%r12<br>cmpq 856(%r13),%r12<br>ja _cn5t<br>_cn5s:<br>movq $stg_upd_frame_info,-16(%rbp)<br>movq %rbx,-8(%rbp)<br>movq 16(%rbx),%rax<br>movq 24(%rbx),%rbx<br>movq $:_con_info,-16(%r12)<br>movq %rax,-8(%r12)<br>movq $GHC.Types.[]_closure+1,(%r12)<br>movq %rbx,%rsi<br>leaq -14(%r12),%r14<br>addq $-16,%rbp<br>jmp GHC.Base.++_info<br>_cn5t:<br>movq $24,904(%r13)<br>_cn5q:<br>jmp *-16(%r13)<br><br>Apart from this difference, the rest of the assembly looks very similar in both cases.
The corresponding -ddump-simpl output for the pure case:<br><br>          False -><br>            ++<br>              @ Char<br>              (GHC.Types.: @ Char ww_amuh (GHC.Types.[] @ Char))<br>              (Data.Unicode.Internal.Normalization.decompose_$sdecompose<br>                 ipv_smuv ipv1_smuD);<br><br>And for the IO monad version:<br><br>                False -><br>                  case $sa1_sn0g ipv_smUT ipv1_smV6 ipv2_imWU<br>                  of _ [Occ=Dead] { (# ipv4_XmXv, ipv5_XmXx #) -><br>                  (# ipv4_XmXv,<br>                     ++<br>                       @ Char<br>                       (GHC.Types.: @ Char sc_sn0b (GHC.Types.[] @ Char))<br>                       ipv5_XmXx #)<br>                  };<br><br>The -ddump-simpl output is essentially the same in both cases, apart from the differences due to the RealWorld token in the IO version. Why is the generated code different? I would appreciate it if someone could shed some light on the reason, or point me to the relevant place in the GHC source where this happens.<br><br>I am using ghc-7.10.3 with the native code generator (no LLVM).<br><br>Thanks,<br>Harendra<br></div>