[Haskell-cafe] Re: speed: ghc vs gcc
Bertram Felgenhauer
bertram.felgenhauer at googlemail.com
Fri Feb 20 23:55:26 EST 2009
Don Stewart wrote:
> If we take what I usually see as the best loops GHC can do for this kind
> of thing:
>
> import Data.Array.Vector
>
> main = print (sumU (enumFromToU 1 (10^9 :: Int)))
>
> And compile it:
>
> $ ghc-core A.hs -O2 -fvia-C -optc-O3
>
> We get ideal core, all data structures fused away, and no heap allocation:
>
> $wfold_s15t :: Int# -> Int# -> Int#
> $wfold_s15t =
> \ (ww1_s150 :: Int#) (ww2_s154 :: Int#) ->
> case ># ww2_s154 ww_s14U of wild_aWm {
> False ->
> $wfold_s15t
> (+# ww1_s150 ww2_s154) (+# ww2_s154 1);
> True -> ww1_s150
> }; } in
> case $wfold_s15t 0 1
>
> Which produces nice assembly:
>
> s16e_info:
> cmpq 6(%rbx), %rdi
> jg .L2
> addq %rdi, %rsi
> leaq 1(%rdi), %rdi
> jmp s16e_info
Note that this does the addition to the accumulator first, and then
increments the counter.
[snip]
> I wondered if we just got worse code on backwards counting loops. So
> translating into the "obvious" translation, counting up:
>
> main = print (sum0 0 1)
>
> sum0 :: Int -> Int -> Int
> sum0 acc n | n > 10^9 = acc
> | otherwise = sum0 (acc + n) (n + 1)
>
> We start to notice something interesting:
>
>
> $wsum0 :: Int# -> Int# -> Int#
> $wsum0 =
> \ (ww_sOH :: Int#) (ww1_sOL :: Int#) ->
> case lvl2 of wild1_aHn { I# y_aHp ->
> case ># ww1_sOL y_aHp of wild_B1 {
> False ->
> letrec {
>
> $wsum01_XPd :: Int# -> Int# -> Int#
> $wsum01_XPd =
> \ (ww2_XP4 :: Int#) (ww3_XP9 :: Int#) ->
> case ># ww3_XP9 y_aHp of wild11_Xs {
> False ->
> $wsum01_XPd (+# ww2_XP4 ww3_XP9) (+# ww3_XP9 1);
> True -> ww2_XP4
> }; } in
> $wsum01_XPd (+# ww_sOH ww1_sOL) (+# ww1_sOL 1);
>
> True -> ww_sOH
> }
This is odd, but it doesn't hurt the inner loop, which only involves
$wsum01_XPd, and is identical to $wfold_s15t above.
> Checking the asm:
> $ ghc -O2 -fasm
>
> sQ3_info:
> .LcRt:
> cmpq 8(%rbp),%rsi
> jg .LcRw
> leaq 1(%rsi),%rax
> addq %rsi,%rbx
> movq %rax,%rsi
> jmp sQ3_info
So for some reason ghc ends up doing the (n + 1) addition before the
(acc + n) addition in this case - this accounts for the extra
instruction, because both n+1 and n need to be kept around for the
duration of the addq (which does the acc + n addition).
> Checking via C:
>
> $ ghc -O2 -optc-O3 -fvia-C
>
> Better code, but still a bit slower:
>
> sQ3_info:
> cmpq 8(%rbp), %rsi
> jg .L8
> addq %rsi, %rbx
> leaq 1(%rsi), %rsi
> jmp sQ3_info
This code is identical (up to renaming registers and one offset that
I can't fully explain, but is probably related to a slight difference
in handling pointer tags between the two versions of the code) to the
"nice assembly" above.
> Running:
>
> $ time ./B
> 500000000500000000
> ./B 1.01s user 0.01s system 97% cpu 1.035 total
Hmm, about 5% slower, are you sure this isn't just noise?
If not noise, it may be some alignment effect. Hard to say.
Bertram
More information about the Haskell-Cafe
mailing list