[Haskell-cafe] blas bindings, why are they so much slower the C?
Anatoly Yakovenko
aeyakovenko at gmail.com
Fri Jun 27 21:20:51 EDT 2008
> I suspect that it is your initialization that is the difference. For
> one thing, you've initialized the arrays to different values, and in
> your C code you've fused what are two separate loops in your Haskell
> code. So you've not only given the C compiler an easier loop to run
> (since you're initializing the array to a constant rather than to a
> sequence of numbers), but you've also manually optimized that
> initialization. In fact, this fusion could be precisely the factor of
> two. Why not see what happens in Haskell if you create just one
> vector and dot it with itself? (of course, that'll also make the blas
> call faster, so you'll need to be careful in your interpretation of
> your results.)
The difference cant be in the initialization. I am calling the dot
product a million times, the malloc and init in both cases are
insignificant. Also, "fusing" the two loops in C probably wont help,
if anything having each loop run separate is likely to be faster and
result in less cache misses.
In this case, i am using vectors of size 10 only, and calling the loop
10 million times, haskell is far far slower, or 35 times. That's
pretty crappy.
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where
import Data.Vector.Dense.IO
import Control.Monad
main = do
let size = 10
let times = 10*1000*1000
v1::IOVector Int Double <- newListVector size $ replicate size 0.1
v2::IOVector Int Double <- newListVector size $ replicate size 0.1
sum <- foldM (\ ii zz -> do
rv <- v1 `getDot` v2
return $ zz + rv
) 0.0 [0..times]
print $ sum
$ ghc --make htestdot.hs
$ time ./htestdot
1.00000001e7
real 0m17.328s
user 0m17.320s
sys 0m0.010
$ cat testdot.c
#include <cblas.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main() {
int size = 10;
int times = 10*1000*1000;
int ii = 0;
double dd = 0.0;
double* v1 = malloc(sizeof(double) * (size));
double* v2 = malloc(sizeof(double) * (size));
for(ii = 0; ii < size; ++ii) {
v1[ii] = 0.1;
}
for(ii = 0; ii < size; ++ii) {
v2[ii] = 0.1;
}
for(ii = 0; ii < times; ++ii) {
dd += cblas_ddot(size, v1, 1, v2, 1);
}
free(v1);
free(v2);
printf("%f\n", dd);
return 0;
}
$ gcc -O2 testdot.c -lcblas -o testdot
$ time ./testdot
999999.999839
real 0m0.491s
user 0m0.480s
sys 0m0.020s
Just to make sure that fold isnt causing the slowdown, i reverted the
haskell program to use the mapM_, i still got almost the same
performance:
$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where
import Data.Vector.Dense.IO
import Control.Monad
main = do
let size = 10
let times = 10*1000*1000
v1::IOVector Int Double <- newListVector size $ replicate size 0.1
v2::IOVector Int Double <- newListVector size $ replicate size 0.1
mapM_ (\ ii -> do v1 `getDot` v2) [0..times]
$ ghc --make htestdot
$ time ./htestdot
real 0m15.660s
user 0m15.630s
sys 0m0.030s
This is what the profiler has to say:
$ cat htestdot.prof
Fri Jun 27 18:06 2008 Time and Allocation Profiling Report (Final)
htestdot +RTS -p -RTS
total time = 22.00 secs (1100 ticks @ 20 ms)
total alloc = 3,320,010,716 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
main Main 100.0 100.0
individual inherited
COST CENTRE MODULE
no. entries %time %alloc %time %alloc
MAIN MAIN
1 0 0.0 0.0 100.0 100.0
main Main
222 1 93.6 88.0 93.6 88.0
CAF Main
216 5 0.0 0.0 6.4 12.0
main Main
223 0 6.4 12.0 6.4 12.0
CAF GHC.Handle
168 3 0.0 0.0 0.0 0.0
More information about the Haskell-Cafe
mailing list