Speed of simple operations with Ptr Word32s
Ian Lynagh
igloo at earth.li
Sat Dec 4 12:17:54 EST 2004
Hi all,
I was under the impression that simple code like the below, which swaps
the endianness of a block of data, ought to be near C speed:
-----8<----------8<----------8<----------8<-----
module Main (main) where

import Word (Word32)
import Foreign.Ptr (Ptr)
import Foreign.Marshal.Array (mallocArray, advancePtr)
import Foreign.Storable (peek, poke)
import Bits ((.|.), (.&.), shiftL, shiftR)

main :: IO ()
main = do p <- mallocArray 104857600
          foo p 104857600

foo :: Ptr Word32 -> Int -> IO ()
foo p i | p `seq` i `seq` False = undefined
foo _ 0 = return ()
foo p n
 = do x <- peek p
      poke p (shiftL x 24 .|. shiftL (x .&. 0xff00) 8
                          .|. (shiftR x 8 .&. 0xff00)
                          .|. shiftR x 24)
      foo (p `advancePtr` 1) (n - 1)
-----8<----------8<----------8<----------8<-----
However, against this equally simple C code it doesn't fare too well:
-----8<----------8<----------8<----------8<-----
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    uint32_t *p32;
    int i;

    p32 = malloc(104857600 * sizeof(uint32_t));
    for (i = 0; i < 104857600; i++, p32++) {
        *p32 = ((*p32 << 24) | ((*p32 & 0xff00) << 8)
                             | ((*p32 >> 8) & 0xff00)
                             | (*p32 >> 24));
    }
    return 0;
}
-----8<----------8<----------8<----------8<-----
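For reference, the per-word transformation that both programs apply can be checked in isolation. This is only a minimal sketch: the names swap32 and swap_block are hypothetical helpers, not part of either benchmark above, and it says nothing about the timings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in the benchmarks above): reverse the byte
   order of one 32-bit word with the same shifts and masks as the loops. */
uint32_t swap32(uint32_t x) {
    return (x << 24)               /* lowest byte moves to the top    */
         | ((x & 0xff00u) << 8)    /* second byte moves up one place  */
         | ((x >> 8) & 0xff00u)    /* third byte moves down one place */
         | (x >> 24);              /* top byte moves to the bottom    */
}

/* Hypothetical helper: apply swap32 in place to n consecutive words. */
void swap_block(uint32_t *p, size_t n) {
    size_t i;
    assert(p != NULL || n == 0);
    for (i = 0; i < n; i++) {
        p[i] = swap32(p[i]);
    }
}
```

With these definitions, swap32(0x11223344) yields 0x44332211, and applying swap32 twice returns the original word.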
$ cat runme.sh
rm -f *.o *.hi c H
gcc -Wall -O2 c.c -o opt_c
gcc -Wall c.c -o c
ghc -Wall -O2 H.hs -o H
rm -f *.o *.hi
ghc -Wall -O2 H.hs -o Hsf -funbox-strict-fields
for i in opt_c c H Hsf; do echo $i; /usr/bin/time ./$i; done
rm -f *.o *.hi c H
$ ./runme.sh
opt_c
1.14user 0.40system 0:01.55elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+204888minor)pagefaults 0swaps
c
1.47user 0.43system 0:01.90elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+204888minor)pagefaults 0swaps
H
6.75user 0.42system 0:07.18elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+204974minor)pagefaults 0swaps
Hsf
6.76user 0.41system 0:07.18elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+204974minor)pagefaults 0swaps
so we're 3.5-6 times slower, depending on which numbers you want to use.
Is there anything I can do to get better performance in this sort of
code without resorting to calling out to C?
Thanks
Ian