GHC vs. GCC on raw vector addition
Bulat Ziganshin
bulatz at HotPOP.com
Wed Jan 18 12:34:54 EST 2006
Hello Simon,
Wednesday, January 18, 2006, 5:31:25 PM, you wrote:
>> 2) generating random values takes about 1.5-2 seconds by itself.
>> Haskell's RNG is very different from C's one
SM> I squeezed a bit more out (see attached).
> x `seq` v `seq` return ()
it's new trick for me :) now the difference is less than 3x
btw, i also use "return $! length xs" trick to ensure that all xs
elements will be evaluated
>for from to action | from `seq` to `seq` False = undefined
and this changes nothing, at least with 6.4.1/mingw32
btw, using "mapM_ action [n..m]" is very common operation. can it be
automatically substituted with my code by using some RULE pragmas in ghc
libraries? that will automatically improve many ghc-compiled programs
too, i use the following code instead of replicateM:
myReplicateM n action = if (n<=5*10^4)
then sequence (replicate n action)
else goLarge n [] >>= return.reverse
where
goLarge 0 xs = return xs
goLarge n xs = do x <- action
(goLarge $! n-1) $! x:xs
it doesn't overflow stack and works much faster for the large n. that
is my testbed for this function:
import Control.Monad
main = do a <- replicateM 1 $ myReplicateM (1*10^6) (return 1)
return $! sum (map last a)
and also, how about adding to GHC strictness annotations?
x <- newArray (0,nelems-1) 0 :: IO !Vector
v <- newArray_ (0,nelems-1) :: IO !Vector
for :: !Int -> !Int -> (!Int -> IO a) -> IO ()
it's SO common source of performance problems...
SM> I think the main bottleneck
SM> is now the random number generator, in particular it is supplying boxed
SM> Doubles which have to be unboxed again before storing in the array.
as i say, it uses 1.5-2 seconds, i.e. only 10% of time when you run
1000 iterations (may be you not noticed that it used only in
initialization?). so, while RNG itself runs 150 times slower (!), it
doesn't make so much difference when you run 1000 iterations after
initial filling the array
and about "using Altivec instructions". the code produced for new.hs
contains only one `fadd` operation, so it is easy to find entire cycle
as it is compiled by GHC. that is one:
movl (%ebp), %eax
cmpl 12(%esi), %eax
jge L81
movl 8(%esi), %edx
leal 8(%edx,%eax,8), %eax
movl (%eax), %edx
movl %edx, 16(%esp)
movl 4(%eax), %eax
movl %eax, 20(%esp)
fldl 16(%esp)
fstpl 24(%esp)
fldl 24(%esp)
fstpl 48(%esp)
movl (%ebp), %eax
movl 4(%esi), %edx
leal 8(%edx,%eax,8), %eax
movl (%eax), %edx
movl %edx, 8(%esp)
movl 4(%eax), %eax
movl %eax, 12(%esp)
fldl 8(%esp)
fstpl 24(%esp)
fldl 24(%esp)
fstpl 40(%esp)
fldl 48(%esp)
faddl 40(%esp)
fstpl 32(%esp)
movl (%ebp), %ecx
movl 8(%esi), %eax
leal 8(%eax,%ecx,8), %ecx
fldl 32(%esp)
fstpl 24(%esp)
movl 24(%esp), %eax
movl 28(%esp), %edx
movl %eax, (%ecx)
movl %edx, 4(%ecx)
incl (%ebp)
movl $_s3IY_info, %eax
L85:
jmp *%eax
L81:
good work, yes? ;-) the C source is also amateur :)
IF_(s3IY_entry) {
W_ _c3MF;
StgDouble _s3IP;
StgDouble _s3IQ;
StgDouble _s3IS;
W_ _s3IW;
FB_
_c3MF = (I_)(*Sp) >= (I_)(R1.p[3]);
if (_c3MF >= 0x1U) goto _c3MI;
_s3IP = PK_DBL((P_)(((R1.p[2]) + 0x8U) + ((*Sp) << 0x3U)));
_s3IQ = PK_DBL((P_)(((R1.p[1]) + 0x8U) + ((*Sp) << 0x3U)));
_s3IS = _s3IP + _s3IQ;
ASSIGN_DBL((P_)(((R1.p[2]) + 0x8U) + ((*Sp) << 0x3U)),_s3IS);
_s3IW = (*Sp) + 0x1U;
*Sp = _s3IW;
JMP_((W_)&s3IY_info);
_c3MI:
R1.p = (P_)(W_)&GHCziBase_Z0T_closure;
Sp=Sp+1;
JMP_(*Sp);
FE_
}
the only cause that this code is only 3 times slower is that C version
is really limited by memory speed. when tested on 1000-element
arrays, it is 20 times slower. i'm not yet tried SSE optimization for
gcc ;)
--
Best regards,
Bulat mailto:bulatz at HotPOP.com
More information about the Glasgow-haskell-users
mailing list