[Haskell-cafe] blas bindings, why are they so much slower than C?

Fri Jun 27 21:20:51 EDT 2008

> I suspect that it is your initialization that is the difference.  For
> one thing, you've initialized the arrays to different values, and in
> your C code you've fused what are two separate loops in your Haskell
> code.  So you've not only given the C compiler an easier loop to run
> (since you're initializing the array to a constant rather than to a
> sequence of numbers), but you've also manually optimized that
> initialization.  In fact, this fusion could be precisely the factor of
> two.  Why not see what happens in Haskell if you create just one
> vector and dot it with itself? (of course, that'll also make the blas
> call faster, so you'll need to be careful in your interpretation of
> your results.)

The difference can't be in the initialization.  I am calling the dot
product a million times, so the malloc and init are insignificant in
both cases.  Also, "fusing" the two loops in C probably won't help; if
anything, running each loop separately is likely to be faster and to
result in fewer cache misses.

In this case, I am using vectors of size 10 only and running the loop
10 million times.  Haskell is far slower, by a factor of about 35
(17.3 s versus 0.49 s in the runs below).  That's pretty crappy.

$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas #-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
   let size = 10
   let times = 10*1000*1000
   v1::IOVector Int Double <- newListVector size $ replicate size 0.1
   v2::IOVector Int Double <- newListVector size $ replicate size 0.1
   sum <- foldM (\ ii zz -> do
      rv <- v1 `getDot` v2
      return $ zz + rv
      ) 0.0 [0..times]
   print $ sum

$ ghc --make htestdot.hs
$ time ./htestdot
1.00000001e7

real    0m17.328s
user    0m17.320s
sys     0m0.010s

$ cat testdot.c
#include <cblas.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main() {
   int size = 10;
   int times = 10*1000*1000;
   int ii = 0;
   double dd = 0.0;
   double* v1 = malloc(sizeof(double) * (size));
   double* v2 = malloc(sizeof(double) * (size));
   for(ii = 0; ii < size; ++ii) {
      v1[ii] = 0.1;
   }
   for(ii = 0; ii < size; ++ii) {
      v2[ii] = 0.1;
   }
   for(ii = 0; ii < times; ++ii) {
      dd += cblas_ddot(size, v1, 1, v2, 1);
   }
   free(v1);
   free(v2);
   printf("%f\n", dd);
   return 0;
}

$ gcc -O2 testdot.c -lcblas -o testdot
$ time ./testdot
999999.999839

real    0m0.491s
user    0m0.480s
sys     0m0.020s
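
One thing to check before reading too much into these timings: the two
programs do not print the same total.  The C run prints 999999.999839,
roughly 10^7 calls times a dot product of 0.1, while the Haskell run
prints 1.00000001e7, which is 10^7 + 0.1.  That is what foldM returns
when the lambda's arguments are swapped: foldM passes the accumulator
first and the list element second, so in the program above ii is the
running total (ignored) and zz is the loop counter.  Because zz is
added to rv, the counter list [0..times] is also inferred to be a list
of Doubles.  How much of the 35-fold gap this explains is not measured
here.  A minimal sketch of the two behaviours, using a pure list-based
dotIO as a hypothetical stand-in for getDot so that it runs without the
blas bindings (asPosted and accumulating are illustrative names too):

```haskell
module Main where

import Control.Monad (foldM)

-- Hypothetical stand-in for getDot: a list dot product lifted into IO.
dotIO :: [Double] -> [Double] -> IO Double
dotIO xs ys = return $! sum (zipWith (*) xs ys)

-- Same shape as the posted loop: the first lambda argument (the
-- accumulator) is ignored and the list element is added to the dot
-- product, so the result is just (last element + dot product).
asPosted :: Int -> IO Double
asPosted times = foldM (\_acc x -> do
      rv <- dotIO v v
      return $ x + rv
      ) 0.0 [0 .. fromIntegral times]
  where v = replicate 10 0.1

-- Accumulator first, element second; Int counter, exactly `times` steps.
accumulating :: Int -> IO Double
accumulating times = foldM (\acc _i -> do
      rv <- dotIO v v
      return $! acc + rv
      ) 0.0 [1 .. times]
  where v = replicate 10 0.1

main :: IO ()
main = do
  asPosted 1000 >>= print      -- close to 1000.1, not a sum
  accumulating 1000 >>= print  -- close to 100, i.e. 1000 * 0.1
```

With the arguments in the intended order and an Int counter, the total
should come out near times * 0.1, which is what the C program computes.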

Just to make sure that the fold isn't causing the slowdown, I reverted
the Haskell program to use mapM_.  I still got almost the same
performance:

$ cat htestdot.hs
{-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields
-fglasgow-exts -fbang-patterns -lcblas#-}
module Main where

import Data.Vector.Dense.IO
import Control.Monad

main = do
   let size = 10
   let times = 10*1000*1000
   v1::IOVector Int Double <- newListVector size $ replicate size 0.1
   v2::IOVector Int Double <- newListVector size $ replicate size 0.1
   mapM_ (\ ii -> do v1 `getDot` v2) [0..times]

$ ghc --make htestdot
$ time ./htestdot

real    0m15.660s
user    0m15.630s
sys     0m0.030s
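
Both versions above still drive the loop from a list ([0..times]), so
swapping foldM for mapM_ does not by itself rule out the loop as a
cost.  One way to take the list out of the picture is an explicit
counting loop with a strict Int counter and a strict accumulator.  A
sketch, with a hypothetical pure dotIO standing in for getDot so it
runs without the blas bindings (timesLoop is just an illustrative
name):

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- Hypothetical stand-in for getDot: a list dot product lifted into IO.
dotIO :: [Double] -> [Double] -> IO Double
dotIO xs ys = return $! sum (zipWith (*) xs ys)

-- Run an IO action n times and sum its results.
-- The counter is an Int and both loop variables are forced on every
-- step, so no list of indices and no chain of thunks is built.
timesLoop :: Int -> IO Double -> IO Double
timesLoop n act = go 0 0.0
  where
    go :: Int -> Double -> IO Double
    go !i !acc
      | i >= n    = return acc
      | otherwise = do
          rv <- act
          go (i + 1) (acc + rv)

main :: IO ()
main = do
  let v = replicate 10 0.1
  total <- timesLoop 1000000 (dotIO v v)
  print total
```

If a loop of this shape around getDot is still far slower than the C
program, the overhead is more likely in the binding itself than in the
way the iterations are driven.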

This is what the profiler has to say:

 $ cat htestdot.prof
        Fri Jun 27 18:06 2008 Time and Allocation Profiling Report  (Final)

           htestdot +RTS -p -RTS

        total time  =       22.00 secs   (1100 ticks @ 20 ms)
        total alloc = 3,320,010,716 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

main                           Main                 100.0  100.0

                                          individual    inherited
COST CENTRE  MODULE      no.   entries  %time %alloc   %time %alloc

MAIN         MAIN          1         0    0.0    0.0   100.0  100.0
 main        Main        222         1   93.6   88.0    93.6   88.0
 CAF         Main        216         5    0.0    0.0     6.4   12.0
  main       Main        223         0    6.4   12.0     6.4   12.0
 CAF         GHC.Handle  168         3    0.0    0.0     0.0    0.0
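
The report shows only main and CAF cost centres, so it cannot say
whether the time goes to the getDot call or to the loop around it.  The
allocation figure is suggestive, though: 3,320,010,716 bytes over about
10^7 iterations is roughly 330 bytes allocated per iteration.  A sketch
of how manual cost centres could separate the two, using SCC pragmas
(the names "loop" and "dot_call" are arbitrary, run is an illustrative
name, and dotIO is a hypothetical stand-in for getDot); building with
-prof -auto-all should give a similar breakdown for top-level functions
without manual annotations:

```haskell
module Main where

import Control.Monad (foldM)

-- Hypothetical stand-in for getDot: a list dot product lifted into IO.
dotIO :: [Double] -> [Double] -> IO Double
dotIO xs ys = return $! sum (zipWith (*) xs ys)

-- Sum `times` dot products, with one cost centre for the whole loop
-- and a nested one for the dot-product call alone.
run :: Int -> IO Double
run times =
  {-# SCC "loop" #-}
  foldM step 0.0 [1 .. times]
  where
    step acc _ = do
      rv <- {-# SCC "dot_call" #-} dotIO v v
      return $! acc + rv
    v = replicate 10 0.1

main :: IO ()
main = run 100000 >>= print
```

The pragmas are ignored in a non-profiled build, so the same source can
be used for both timing and profiling runs.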