Loop unrolling + fusion ?

Sat Feb 28 12:49:32 EST 2009

Hey guys,

We have nice fusion frameworks now. E.g. stream fusion on uvector,

    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uvector

Takes something like this:

    import Data.Array.Vector
    import Data.Bits
    main = print . productU . mapU (*2) . mapU (`shiftL` 2) $ replicateU (100000000 :: Int) (5::Int)

and turns it into a loop like this:

    $wfold :: Int# -> Int# -> Int#
    $wfold =
      \ (ww_sWX :: Int#) (ww1_sX1 :: Int#) ->
        case ww1_sX1 of wild_B1 {
          __DEFAULT ->
            $wfold (*# ww_sWX 40) (+# wild_B1 1);
          100000000 -> ww_sWX
        }

Now, that's fine in my book. Going via -fasm, we get:

   Main_zdwfold_info:
    .LcYt:
      movq %rdi,%rax
      cmpq $100000000,%rax
      jne .LcYx
      movq %rsi,%rbx
      jmp *(%rbp)
    .LcYx:
      incq %rax
      imulq $40,%rsi
      movq %rax,%rdi
      jmp Main_zdwfold_info

Ok:

    $ time ./product
    0
    ./product  0.31s user 0.00s system 96% cpu 0.316 total

Going via C, however, we get:

    Main_zdwfold_info:
      cmpq        $100000000, %rdi
      je  .L6
    .L2:
      leaq        (%rsi,%rsi,4), %rax
      leaq        1(%rdi), %rdi
      leaq        0(,%rax,8), %rsi
      jmp Main_zdwfold_info

Nice!

    $ time ./product
    0
    ./product  0.19s user 0.00s system 97% cpu 0.197 total

So now, since we've gone to such effort to produce a tiny loop like, this,
can't we unroll it just a little? Sadly, my attempts to get GCC to trigger
its loop unroller on this guy haven't succeeded. -funroll-loops and
-funroll-all-loops doesn't  touch it,

Anyone think of a way to apply Claus' TH unroller, or somehow convince GCC 
it is worth unrolling this guy, so we get the win of both aggressive high level
fusion, and aggressive low level loop optimisations?

-- Don