[Haskell-cafe] Re: speed: ghc vs gcc

Fri Feb 20 23:55:26 EST 2009

Don Stewart wrote:
> If we take what I usually see as the best loops GHC can do for this kind
> of thing:
> 
>     import Data.Array.Vector
> 
>     main = print (sumU (enumFromToU 1 (10^9 :: Int)))
> 
> And compile it:
> 
>     $ ghc-core A.hs -O2 -fvia-C -optc-O3
> 
> We get ideal core, all data structures fused away, and no heap allocation:
> 
>     $wfold_s15t :: Int# -> Int# -> Int#
>     $wfold_s15t =
>       \ (ww1_s150 :: Int#) (ww2_s154 :: Int#) ->
>         case ># ww2_s154 ww_s14U of wild_aWm {
>           False ->
>             $wfold_s15t
>               (+# ww1_s150 ww2_s154) (+# ww2_s154 1);
>           True -> ww1_s150
>         }; } in
>     case $wfold_s15t 0 1
> 
> Which produces nice assembly:
> 
>     s16e_info:
>       cmpq        6(%rbx), %rdi
>       jg  .L2
>       addq        %rdi, %rsi
>       leaq        1(%rdi), %rdi
>       jmp s16e_info

Note that this does the addition to the accumulator first, and then
increments the counter.

[snip]
> I wondered if we just got worse code on backwards counting loops. So
> translating into the "obvious" translation, counting up:
> 
>     main = print (sum0 0 1)
> 
>     sum0 :: Int -> Int -> Int
>     sum0 acc n | n > 10^9  = acc
>                | otherwise = sum0 (acc + n) (n + 1)
> 
> We start to notice something interesting:
> 
> 
>     $wsum0 :: Int# -> Int# -> Int#
>     $wsum0 =
>       \ (ww_sOH :: Int#) (ww1_sOL :: Int#) ->
>         case lvl2 of wild1_aHn { I# y_aHp ->
>         case ># ww1_sOL y_aHp of wild_B1 {
>           False ->
>             letrec {
> 
>               $wsum01_XPd :: Int# -> Int# -> Int#
>               $wsum01_XPd =
>                 \ (ww2_XP4 :: Int#) (ww3_XP9 :: Int#) ->
>                   case ># ww3_XP9 y_aHp of wild11_Xs {
>                     False ->
>                       $wsum01_XPd (+# ww2_XP4 ww3_XP9) (+# ww3_XP9 1);
>                     True -> ww2_XP4
>                   }; } in
>             $wsum01_XPd (+# ww_sOH ww1_sOL) (+# ww1_sOL 1);
> 
>       True -> ww_sOH
>     }

This is odd, but it doesn't hurt the inner loop, which only involves
$wsum01_XPd, and is identical to $wfold_s15t above.

> Checking the asm:
>     $ ghc -O2 -fasm
> 
>     sQ3_info:
>     .LcRt:
>       cmpq 8(%rbp),%rsi
>       jg .LcRw
>       leaq 1(%rsi),%rax
>       addq %rsi,%rbx
>       movq %rax,%rsi
>       jmp sQ3_info

So for some reason ghc ends up doing the (n + 1) addition before the
(acc + n) addition in this case - this accounts for the extra
instruction, because both n+1 and n need to be kept around for the
duration of the addq (which does the acc + n addition).

> Checking via C:
> 
>    $ ghc -O2 -optc-O3 -fvia-C
> 
> Better code, but still a bit slower:   
> 
>         sQ3_info:
>           cmpq        8(%rbp), %rsi
>           jg  .L8
>           addq        %rsi, %rbx
>           leaq        1(%rsi), %rsi
>           jmp sQ3_info

This code is identical (up to renaming registers and one offset that
I can't fully explain, but is probably related to a slight difference
in handling pointer tags between the two versions of the code) to the
"nice assembly" above.

> Running:
> 
>         $ time   ./B
>         500000000500000000
>         ./B  1.01s user 0.01s system 97% cpu 1.035 total

Hmm, about 5% slower, are you sure this isn't just noise?

If not noise, it may be some alignment effect. Hard to say.

Bertram