GHC vs. GCC on raw vector addition

Thu Jan 19 06:28:03 EST 2006

John Meacham wrote:
> On Wed, Jan 18, 2006 at 08:54:43PM +0300, Bulat Ziganshin wrote:
>> sorry, with the "gcc -O3 -ffast-math -fstrict-aliasing -funroll-loops"
>> the C version is 50 times faster than best Haskell one... it's the
>> loop from C version:
> 
> I believe something similar to what I noted here is the culprit:
> http://www.haskell.org//pipermail/glasgow-haskell-users/2005-October/009174.html
> 
> it is fixable, but not without modifying ghc.

Ah, I see what you mean by indirect jumps.  Those indirect jumps go away
if you compile with -optc-O2 or -fasm; they're droppings left by
inadequacies in gcc's standard -O optimisation.

Actually, -fasm does better by one instruction than gcc on this example:

.globl Test_zdwfac_info
Test_zdwfac_info:
	movq (%rbp),%rax
	cmpq $1,%rax
	jne .LcmO
	movq 8(%rbp),%r13
	addq $16,%rbp
	jmp *(%rbp)
.LcmO:
	leaq -1(%rax),%rcx
	imulq 8(%rbp),%rax
	movq %rax,8(%rbp)
	movq %rcx,(%rbp)
	jmp Test_zdwfac_info

vs. gcc -O2:

Test_zdwfac_info:
.text
         .align 8
         movq    (%rbp), %rdx
         cmpq    $1, %rdx
         je      .L6
.L3:
         movq    8(%rbp), %rax
         imulq   %rdx, %rax
         decq    %rdx
         movq    %rdx, (%rbp)
         movq    %rax, 8(%rbp)
         jmp     Test_zdwfac_info
         .p2align 4,,7
.L6:
         movq    8(%rbp), %r13
         addq    $16, %rbp
         jmp     *(%rbp)

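We should probably reverse the sense of that branch, like gcc does, so
that the loop body is the fall-through path instead of being reached by
a taken jne on every iteration.  The memory accesses through %rbp are
still there in both versions, of course.  Hopefully someday I'll get
around to trying to use more registers on x86_64 again.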
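
To make the point about memory accesses concrete, here is a rough C
sketch of what both listings compute.  It is an illustration only: the
names fac_stack and fac_regs and the two-slot stack layout are
hypothetical, read off the assembly above, and the Haskell source
mentioned in the comment is a guess, since it does not appear in this
message.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the worker is an accumulating factorial, roughly
 *   fac n acc = if n == 1 then acc else fac (n - 1) (n * acc)
 * Both functions assume n >= 1 and a result that fits in 64 bits. */

/* Stack version: mirrors the listings above.  sp[0] is n and sp[1] is
 * the accumulator, standing in for (%rbp) and 8(%rbp).  Every
 * iteration loads both slots and stores both back. */
int64_t fac_stack(int64_t *sp)
{
    assert(sp[0] >= 1);
    for (;;) {
        int64_t n = sp[0];      /* movq (%rbp), ...             */
        if (n == 1)
            return sp[1];       /* movq 8(%rbp), %r13           */
        sp[1] = n * sp[1];      /* imulq, then store to 8(%rbp) */
        sp[0] = n - 1;          /* leaq or decq, store to (%rbp) */
    }
}

/* Register version: the same loop with both values kept in locals,
 * which a C compiler can hold in registers for the whole loop. */
int64_t fac_regs(int64_t n, int64_t acc)
{
    assert(n >= 1);
    while (n != 1) {
        acc *= n;
        n -= 1;
    }
    return acc;
}
```

Calling fac_stack on a two-element array holding {5, 1} and calling
fac_regs(5, 1) both give 120.  The only difference is where n and the
accumulator live between iterations, which is the cost that passing
arguments in more registers on x86_64 would remove.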

Cheers,
	Simon