GHC vs. GCC on raw vector addition

Wed Jan 18 12:34:54 EST 2006

Hello Simon,

Wednesday, January 18, 2006, 5:31:25 PM, you wrote:

>> 2) generating random values takes about 1.5-2 seconds by itself.
>> Haskell's RNG is very different from C's one

SM> I squeezed a bit more out (see attached).
>           x `seq` v `seq` return ()

it's new trick for me :)   now the difference is less than 3x

btw, i also use "return $! length xs" trick to ensure that all xs
elements will be evaluated

>for from to action | from `seq` to `seq` False = undefined

and this changes nothing, at least with 6.4.1/mingw32

btw, using "mapM_ action [n..m]" is very common operation. can it be
automatically substituted with my code by using some RULE pragmas in ghc
libraries? that will automatically improve many ghc-compiled programs

too, i use the following code instead of replicateM:

myReplicateM n action = if (n<=5*10^4)
                          then sequence (replicate n action)
                          else goLarge n [] >>= return.reverse
  where
    goLarge 0 xs = return xs
    goLarge n xs = do x <- action
                      (goLarge $! n-1) $! x:xs

it doesn't overflow stack and works much faster for the large n. that
is my testbed for this function:

import Control.Monad
main = do a <- replicateM 1 $ myReplicateM (1*10^6) (return 1)
          return $! sum (map last a)

and also, how about adding to GHC strictness annotations?

x <- newArray (0,nelems-1) 0 :: IO !Vector
v <- newArray_ (0,nelems-1) :: IO !Vector
for :: !Int -> !Int -> (!Int -> IO a) -> IO ()

it's SO common source of performance problems...

SM> I think the main bottleneck 
SM> is now the random number generator, in particular it is supplying boxed 
SM> Doubles which have to be unboxed again before storing in the array.

as i say, it uses 1.5-2 seconds, i.e. only 10% of time when you run
1000 iterations (may be you not noticed that it used only in
initialization?). so, while RNG itself runs 150 times slower (!), it
doesn't make so much difference when you run 1000 iterations after
initial filling the array

and about "using Altivec instructions". the code produced for new.hs
contains only one `fadd` operation, so it is easy to find entire cycle
as it is compiled by GHC. that is one: 

        movl    (%ebp), %eax
        cmpl    12(%esi), %eax
        jge     L81
        movl    8(%esi), %edx
        leal    8(%edx,%eax,8), %eax
        movl    (%eax), %edx
        movl    %edx, 16(%esp)
        movl    4(%eax), %eax
        movl    %eax, 20(%esp)
        fldl    16(%esp)
        fstpl   24(%esp)
        fldl    24(%esp)
        fstpl   48(%esp)
        movl    (%ebp), %eax
        movl    4(%esi), %edx
        leal    8(%edx,%eax,8), %eax
        movl    (%eax), %edx
        movl    %edx, 8(%esp)
        movl    4(%eax), %eax
        movl    %eax, 12(%esp)
        fldl    8(%esp)
        fstpl   24(%esp)
        fldl    24(%esp)
        fstpl   40(%esp)
        fldl    48(%esp)
        faddl   40(%esp)
        fstpl   32(%esp)
        movl    (%ebp), %ecx
        movl    8(%esi), %eax
        leal    8(%eax,%ecx,8), %ecx
        fldl    32(%esp)
        fstpl   24(%esp)
        movl    24(%esp), %eax
        movl    28(%esp), %edx
        movl    %eax, (%ecx)
        movl    %edx, 4(%ecx)
        incl    (%ebp)
        movl    $_s3IY_info, %eax
L85:
        jmp     *%eax
L81:

good work, yes? ;-)  the C source is also amateur :)

IF_(s3IY_entry) {
W_ _c3MF;
StgDouble _s3IP;
StgDouble _s3IQ;
StgDouble _s3IS;
W_ _s3IW;
FB_
_c3MF = (I_)(*Sp) >= (I_)(R1.p[3]);
if (_c3MF >= 0x1U) goto _c3MI;
_s3IP = PK_DBL((P_)(((R1.p[2]) + 0x8U) + ((*Sp) << 0x3U)));
_s3IQ = PK_DBL((P_)(((R1.p[1]) + 0x8U) + ((*Sp) << 0x3U)));
_s3IS = _s3IP + _s3IQ;
ASSIGN_DBL((P_)(((R1.p[2]) + 0x8U) + ((*Sp) << 0x3U)),_s3IS);
_s3IW = (*Sp) + 0x1U;
*Sp = _s3IW;
JMP_((W_)&s3IY_info);
_c3MI:
R1.p = (P_)(W_)&GHCziBase_Z0T_closure;
Sp=Sp+1;
JMP_(*Sp);
FE_
}

the only cause that this code is only 3 times slower is that C version
is really limited by memory speed. when tested on 1000-element
arrays, it is 20 times slower. i'm not yet tried SSE optimization for
gcc ;)

-- 
Best regards,
 Bulat                            mailto:bulatz at HotPOP.com