suboptimal ghc code generation in IO vs equivalent pure code case

Mon May 9 15:10:01 UTC 2016

I have a loop which runs millions of times. For some reason I have to run
it in the IO monad. I noticed that when I convert the code from pure to IO
monad the generated assembly code in essence is almost identical except one
difference where it puts a piece of code in a separate block which is
making a huge difference in performance (4-6x slower).

I want to understand what makes GHC to generate code in this way and if
there is anything that can be done at source level (or ghc option)  to
control that.

The pure code looks like this:

        decomposeChars :: [Char] -> [Char]

        decomposeChars [] = []
        decomposeChars [x] =
            case NFD.isDecomposable x of
                True -> decomposeChars (NFD.decomposeChar x)
                False -> [x]
        decomposeChars (x : xs) = decomposeChars [x] ++ decomposeChars xs

The equivalent IO code is this:

        decomposeStrIO :: [Char] -> IO [Char]

        decomposeStrPtr !p = decomposeStrIO
            where
                decomposeStrIO [] = return []
                decomposeStrIO [x] = do
                    res <- NFD.isDecomposable p x
                    case res of
                        True -> decomposeStrIO (NFD.decomposeChar x)
                        False -> return [x]
                decomposeStrIO (x : xs) = do
                    s1 <- decomposeStrIO [x]
                    s2 <- decomposeStrIO xs
                    return (s1 ++ s2)

The difference is in how the code corresponding to the call to the (++)
operation is generated. In the pure case the (++) operation is inline in
the main loop:

_cn5N:
movq $sat_sn2P_info,-48(%r12)
movq %rax,-32(%r12)
movq %rcx,-24(%r12)
movq $:_con_info,-16(%r12)
movq 16(%rbp),%rax
movq %rax,-8(%r12)
movq $GHC.Types.[]_closure+1,(%r12)
leaq -48(%r12),%rsi
leaq -14(%r12),%r14
addq $40,%rbp
jmp GHC.Base.++_info

In the IO monad version this code is placed in a separate block and a call
is placed in the main loop:

the main loop call site:

_cn6A:
movq $sat_sn3w_info,-24(%r12)
movq 8(%rbp),%rax
movq %rax,-8(%r12)
movq %rbx,(%r12)
leaq -24(%r12),%rbx
addq $40,%rbp
jmp *(%rbp)

out of the line block - the code that was in the main loop in the previous
case is now moved to this block (see label _cn5s below):

sat_sn3w_info:
_cn5p:
leaq -16(%rbp),%rax
cmpq %r15,%rax
jb _cn5q
_cn5r:
addq $24,%r12
cmpq 856(%r13),%r12
ja _cn5t
_cn5s:
movq $stg_upd_frame_info,-16(%rbp)
movq %rbx,-8(%rbp)
movq 16(%rbx),%rax
movq 24(%rbx),%rbx
movq $:_con_info,-16(%r12)
movq %rax,-8(%r12)
movq $GHC.Types.[]_closure+1,(%r12)
movq %rbx,%rsi
leaq -14(%r12),%r14
addq $-16,%rbp
jmp GHC.Base.++_info
_cn5t:
movq $24,904(%r13)
_cn5q:
jmp *-16(%r13)

Except this difference the rest of the assembly looks pretty similar in
both the cases. The corresponding dump-simpl output for the pure case:

          False ->
            ++
              @ Char
              (GHC.Types.: @ Char ww_amuh (GHC.Types.[] @ Char))
              (Data.Unicode.Internal.Normalization.decompose_$sdecompose
                 ipv_smuv ipv1_smuD);

And for the IO monad version:

                False ->
                  case $sa1_sn0g ipv_smUT ipv1_smV6 ipv2_imWU
                  of _ [Occ=Dead] { (# ipv4_XmXv, ipv5_XmXx #) ->
                  (# ipv4_XmXv,
                     ++
                       @ Char
                       (GHC.Types.: @ Char sc_sn0b (GHC.Types.[] @ Char))
                       ipv5_XmXx #)
                  };

The dump-simpl output is essentially the same except the difference due to
the realworld token in the IO case. Why is the generated code different? I
will appreciate if someone can throw some light on the reason or can point
to the relevant ghc source to look at where this happens.

I am using ghc-7.10.3 in native code generation mode (no llvm).

Thanks,
Harendra
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20160509/cb6fb018/attachment.html>