Loop unrolling + fusion ?
Don Stewart
dons at galois.com
Sat Feb 28 12:49:32 EST 2009
Hey guys,
We have nice fusion frameworks now. E.g. stream fusion on uvector,
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uvector
Takes something like this:
import Data.Array.Vector
import Data.Bits
main = print . productU . mapU (*2) . mapU (`shiftL` 2) $ replicateU (100000000 :: Int) (5::Int)
and turns it into a loop like this:
$wfold :: Int# -> Int# -> Int#
$wfold =
\ (ww_sWX :: Int#) (ww1_sX1 :: Int#) ->
case ww1_sX1 of wild_B1 {
__DEFAULT ->
$wfold (*# ww_sWX 40) (+# wild_B1 1);
100000000 -> ww_sWX
}
Now, that's fine in my book. Going via -fasm, we get:
Main_zdwfold_info:
.LcYt:
movq %rdi,%rax
cmpq $100000000,%rax
jne .LcYx
movq %rsi,%rbx
jmp *(%rbp)
.LcYx:
incq %rax
imulq $40,%rsi
movq %rax,%rdi
jmp Main_zdwfold_info
Ok:
$ time ./product
0
./product 0.31s user 0.00s system 96% cpu 0.316 total
Going via C, however, we get:
Main_zdwfold_info:
cmpq $100000000, %rdi
je .L6
.L2:
leaq (%rsi,%rsi,4), %rax
leaq 1(%rdi), %rdi
leaq 0(,%rax,8), %rsi
jmp Main_zdwfold_info
Nice!
$ time ./product
0
./product 0.19s user 0.00s system 97% cpu 0.197 total
So now, since we've gone to such effort to produce a tiny loop like, this,
can't we unroll it just a little? Sadly, my attempts to get GCC to trigger
its loop unroller on this guy haven't succeeded. -funroll-loops and
-funroll-all-loops doesn't touch it,
Anyone think of a way to apply Claus' TH unroller, or somehow convince GCC
it is worth unrolling this guy, so we get the win of both aggressive high level
fusion, and aggressive low level loop optimisations?
-- Don
More information about the Glasgow-haskell-users
mailing list