[Haskell-cafe] blas bindings, why are they so much slower than C?

Wed Jun 18 13:05:44 EDT 2008

On Wed, Jun 18, 2008 at 09:16:24AM -0700, Anatoly Yakovenko wrote:
> >> #include <cblas.h>
> >> #include <stdlib.h>
> >>
> >> int main() {
> >>   int size = 1024;
> >>   int ii = 0;
> >>   double* v1 = malloc(sizeof(double) * (size));
> >>   double* v2 = malloc(sizeof(double) * (size));
> >>   for(ii = 0; ii < size*size; ++ii) {
> >>      double _dd = cblas_ddot(0, v1, size, v2, size);
> >>   }
> >>   free(v1);
> >>   free(v2);
> >> }
> >
> > Your C compiler sees that you're not using the result of cblas_ddot,
> > so it doesn't even bother to call it. That loop never gets run. All
> > your program does at runtime is call malloc and free twice, which is
> > very fast :-)
> 
> C doesn't work like that :).  functions always get called.  but i did
> find a problem with my C code, i am incorrectly calling the dot
> product function:

A recent LWN article on pure and const functions explains how gcc
is able to perform dead code elimination and CSE, provided it's given
annotations on the relevant functions.  I'd certainly hope that your
blas library is properly annotated!
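
To make the point concrete, here is a minimal sketch of what such an
annotation looks like.  The names my_ddot, discard_results and
accumulate_results are made up for illustration and are not part of any
BLAS header; the only real feature used is the GCC/Clang `pure` function
attribute.  Whether a given cblas.h actually carries this annotation is
something you would have to check in your own installation.

```c
#include <stddef.h>

/* Hypothetical stand-in for a BLAS dot product (not the real cblas_ddot).
   The `pure` attribute promises the compiler that the function has no
   side effects and that its result depends only on its arguments and on
   memory it reads. */
__attribute__((pure))
double my_ddot(size_t n, const double *x, const double *y);

/* The result is thrown away, so an optimizing compiler is permitted
   (not required) to drop the calls, and with them the whole loop. */
void discard_results(size_t n, const double *x, const double *y, size_t reps)
{
    for (size_t i = 0; i < reps; ++i) {
        (void)my_ddot(n, x, y);
    }
}

/* The result feeds into the return value, so the calls cannot simply be
   deleted.  The compiler may still evaluate my_ddot once and reuse the
   value (CSE), since the arguments do not change between iterations. */
double accumulate_results(size_t n, const double *x, const double *y, size_t reps)
{
    double sum = 0.0;
    for (size_t i = 0; i < reps; ++i) {
        sum += my_ddot(n, x, y);
    }
    return sum;
}

/* Plain unit-stride dot product, defined here only so the sketch links. */
double my_ddot(size_t n, const double *x, const double *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
}
```

In a real program the library is compiled separately, so the declaration
in the header is all the compiler sees, and the annotation there is what
makes these optimizations possible.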

> #include <cblas.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> 
> int main() {
>    int size = 1024;
>    int ii = 0;
>    double dd = 0.0;
>    double* v1 = malloc(sizeof(double) * (size));
>    double* v2 = malloc(sizeof(double) * (size));
>    for(ii = 0; ii < size; ++ii) {
>       v1[ii] = 0.1;
>       v2[ii] = 0.1;
>    }
>    for(ii = 0; ii < size*size; ++ii) {
>       dd += cblas_ddot(size, v1, 0, v2, 0);
>    }
>    free(v1);
>    free(v2);
>    printf("%f\n", dd);
>    return 0;
> }
> 
> time ./testdot
> 10737418.240187
> 
> real    0m2.200s
> user    0m2.190s
> sys     0m0.010s
> 
> So C is about twice as fast.  I can live with that.

I suspect that it is your initialization that makes the difference.  For
one thing, you've initialized the arrays to different values than in
your Haskell code, and in your C code you've fused what are two
separate loops in your Haskell code.  So you've not only given the C
compiler an easier loop to run (since you're initializing the array to
a constant rather than to a sequence of numbers), but you've also
manually optimized that initialization.  In fact, this fusion could be
precisely the factor of two.  Why not see what happens in Haskell if
you create just one vector and dot it with itself?  (Of course, that'll
also make the blas call faster, so you'll need to be careful in
interpreting your results.)
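
One way to check this suspicion on the C side is to time the
initialization and the dot-product loop separately.  The following is
only a sketch under some assumptions: plain_dot is a hand-written
stand-in for a unit-stride cblas_ddot so that it builds without a BLAS
library, and fill_sequence, phase_times and run_benchmark are names
invented here.  The arrays are filled in two separate passes with a
sequence of numbers, as the Haskell version is described as doing.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for a unit-stride cblas_ddot. */
static double plain_dot(int n, const double *x, const double *y)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
}

/* Fill v with 0, 1, 2, ...: a sequence rather than a constant. */
static void fill_sequence(int n, double *v)
{
    for (int i = 0; i < n; ++i) {
        v[i] = (double)i;
    }
}

/* CPU time spent in each phase, plus the accumulated dot products. */
struct phase_times {
    double init_seconds;
    double dot_seconds;
    double result;
};

struct phase_times run_benchmark(int size, int reps)
{
    struct phase_times t = {0.0, 0.0, 0.0};
    double *v1 = malloc(sizeof(double) * (size_t)size);
    double *v2 = malloc(sizeof(double) * (size_t)size);
    if (v1 == NULL || v2 == NULL) {
        free(v1);
        free(v2);
        return t;               /* allocation failed: report zeros */
    }

    clock_t start = clock();
    fill_sequence(size, v1);    /* two separate, unfused passes */
    fill_sequence(size, v2);
    clock_t after_init = clock();

    for (int i = 0; i < reps; ++i) {
        t.result += plain_dot(size, v1, v2);
    }
    clock_t after_dot = clock();

    t.init_seconds = (double)(after_init - start) / CLOCKS_PER_SEC;
    t.dot_seconds  = (double)(after_dot - after_init) / CLOCKS_PER_SEC;

    free(v1);
    free(v2);
    return t;
}
```

Calling run_benchmark(1024, 1024 * 1024) mirrors the sizes used in your
C program; comparing init_seconds with dot_seconds shows how much of
the total the initialization really accounts for.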

David