[Haskell-cafe] Re: speed: ghc vs gcc

Fri Feb 20 15:05:19 EST 2009

barsoap:
> Don Stewart <dons at galois.com> wrote:
> 
> > No! This is not how open source works! You *should submit bug
> > reports* and *analysis*. It is so so much more useful than
> > complaining and throwing stones.
> >
> Exactly. I don't know where, but I read that the vast majorities of
> Linux bugs are reported, nailed, and then fixed, by at least three
> different persons: The first reports a misbehaviour, the second manages
> to find it surfacing in a certain line of code, the third instantly
> knows how to make it go away.

Elaboarting further:

Thinking more about Bulat's code gen observations, I think there's something
wrong here -- other than that GHC needs the new codegen to do any of the
fancier loop optimisations.

If we take what I usually see as the best loops GHC can do for this kind of thing:

    import Data.Array.Vector

    main = print (sumU (enumFromToU 1 (10^9 :: Int)))

And compile it:

    $ ghc-core A.hs -O2 -fvia-C -optc-O3

We get ideal core, all data structures fused away, and no heap allocation:

    $wfold_s15t :: Int# -> Int# -> Int#
    $wfold_s15t =
      \ (ww1_s150 :: Int#) (ww2_s154 :: Int#) ->
        case ># ww2_s154 ww_s14U of wild_aWm {
          False ->
            $wfold_s15t
              (+# ww1_s150 ww2_s154) (+# ww2_s154 1);
          True -> ww1_s150
        }; } in
    case $wfold_s15t 0 1

Which produces nice assembly:

    s16e_info:
      cmpq        6(%rbx), %rdi
      jg  .L2
      addq        %rdi, %rsi
      leaq        1(%rdi), %rdi
      jmp s16e_info

This is the best GHC will do here, in my experience, and I'm satisifed with it.

Short of new backend tweaks, and realising that GHC is not the loop magic compiler GCC is.

    http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/IntegratedCodeGen

We can be happy with this. The compiler is doing exactly what we expect.

    $ time ./B
    500000000500000000
    ./B  0.96s user 0.00s system 99% cpu 0.967 total

Now, going back to the low level version, Bulat's loop:

    main()
    {
      int sum=0;
      //for(int j=0; j<100;j++)
        for(int i=0; i<1000*1000*1000;i++)
          sum += i;
      return sum;
    }

What was first confusing for me was that he wrote the loop "backwards" when translating to Haskell,
like this:

    main = print $ sum0 (10^9) 0

    sum0 :: Int -> Int -> Int
    sum0 0  !acc = acc
    sum0 !x !acc = sum0 (x-1) (acc+x)

(The bang patterns aren't needed). Note how he counts backwards from 10^9. Was there a reason for that, Bulat?

I wondered if we just got worse code on backwards counting loops. So
translating into the "obvious" translation, counting up:

    main = print (sum0 0 1)

    sum0 :: Int -> Int -> Int
    sum0 acc n | n > 10^9  = acc
               | otherwise = sum0 (acc + n) (n + 1)

Which I actually consider to be the same difficulty as writing the C version, fwiw... 
We start to notice something interesting:

    $wsum0 :: Int# -> Int# -> Int#
    $wsum0 =
      \ (ww_sOH :: Int#) (ww1_sOL :: Int#) ->
        case lvl2 of wild1_aHn { I# y_aHp ->
        case ># ww1_sOL y_aHp of wild_B1 {
          False ->
            letrec {

              $wsum01_XPd :: Int# -> Int# -> Int#
              $wsum01_XPd =
                \ (ww2_XP4 :: Int#) (ww3_XP9 :: Int#) ->
                  case ># ww3_XP9 y_aHp of wild11_Xs {
                    False ->
                      $wsum01_XPd (+# ww2_XP4 ww3_XP9) (+# ww3_XP9 1);
                    True -> ww2_XP4
                  }; } in
            $wsum01_XPd (+# ww_sOH ww1_sOL) (+# ww1_sOL 1);

      True -> ww_sOH
    }

Why is there an extra test? What is GHC doing?
Checking the asm:

    $ ghc -O2 -fasm

    sQ3_info:
    .LcRt:
      cmpq 8(%rbp),%rsi
      jg .LcRw
      leaq 1(%rsi),%rax
      addq %rsi,%rbx
      movq %rax,%rsi
      jmp sQ3_info

    $ time ./B
    500000000500000000
    ./B  1.30s user 0.01s system 98% cpu 1.328 total

So its a fair bit slower. Now, we should, as a principle, be able to write sum directly as I did , and get the
same code from the manual, and automatically , fused version. But we didn't.

Checking via C:

   $ ghc -O2 -optc-O3 -fvia-C

Better code, but still a bit slower:   

        sQ3_info:
          cmpq        8(%rbp), %rsi
          jg  .L8
          addq        %rsi, %rbx
          leaq        1(%rsi), %rsi
          jmp sQ3_info

Running:

        $ time   ./B
        500000000500000000
        ./B  1.01s user 0.01s system 97% cpu 1.035 total

So I think we have a bug report! Why did GHC put that extra test in place?

Now, none of this addresses (I think) Bulat's point that GCC can unroll loops and do other loop magic.
That's handled under a different workflow - the new code generator.

I'll create the ticket.

-- Don