[Haskell-cafe] Re: [Haskell] Matrix multiplication

Wed Apr 30 02:04:00 EDT 2008

> droundy 
> On Wed, Apr 23, 2008 at 05:24:20PM +0100, Sebastian Sylvan wrote:
> > On Wed, Apr 23, 2008 at 5:01 PM, Tillmann Vogt <Tillmann.Vogt <at> rwth-aachen.de>
> > wrote:
> > 
> > > Hi,
> > >
> > > I am currently experimenting with parallelizing C-programs. I have
> > > therefore written a matrix vector multiplication example that needs 13
> > > seconds to run (5 seconds with OpenMP). Because I like Haskell I did the
> > > same in this language, but it takes about 134 seconds. Why is it so slow?
> > > Does someone have an idea?
> > >
> > Yes, in the C version you use unboxed arrays, in the Haskell version you use
> > a linked list of linked lists. Naturally the latter will take up more space,
> > require more work to index, and will thrash the cache quite a bit.
> 
> In fact, I'm impressed that Haskell can come within a factor of 10, given
> that it's using such a challenging data type! (I wonder if this may be due
> to inefficiencies in the C code, although I haven't looked at it.)

I had a look at this code, using the (unboxed, strict, fused) data parallel
arrays library (http://darcs.haskell.org/packages/ndp). 

We get rather nice code. The original C program (modified, to sum the result,
rather than just filling the array and throwing it out):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define M 4000
    #define N 4000
    #define IT 100

    double a[M], b[M][N], c[N];

    int main(int argc, char *argv[])
    {
      double d;
      double sum;
      int i, j, l;
      time_t start,end;

      printf("Initializing matrix B and vector C\n");
      for(j=0; j<N; j++) c[j] = 2.0;
      for(i=0; i<M; i++)
            for(j=0; j<N; j++)
                   b[i][j] = j;

      printf("Executing %d matrix mult. for M = %d N = %d\n",IT,M,N);
      time (&start);

      sum = 0;
      for(i=0; i<M; i++)
         for (j=0; j<N; j++)
             sum += b[i][j]*c[j];

      time (&end);
      printf ("result: %.2lf\n", sum );

      d = difftime (end,start);
      printf ("calculation time: %.2lf seconds\n", d );
      return 0;
    }

And then rewritten in a data parallel style (but note, no parallelism here, just using 
the nice interface to STUArrays, with stream fusion):

    import Data.Array.Parallel.Unlifted
    import Text.Printf

    m  = 4000
    n  = 4000
    it = 100

    main = printf "result: %2f\n" . sumU . zipWithU (*) (concatSU b) $ repeatU m c
        where
           -- c[N]
           c = replicateU n (2.0 :: Double)
           -- b[M][N]
           b = replicateCU m t where t = mapU fromIntegral (enumFromToU 0 (n-1)) :: UArr Double

Finds 4 fusion sites, and runs in:

    $ time ./a
    result: 63984000000.0
    ./a  0.10s user 0.00s system 112% cpu 0.089 total

While the C program,

    $ time ./a.out
    Initializing matrix B and vector C
    Executing 100 matrix mult. for M = 4000 N = 4000
    result: 63984000000.00
    calculation time: 0.00 seconds
    ./a.out  0.14s user 0.09s system 92% cpu 0.248 total

GHC rocks :) NDP programming encourages a nice combinator style too,
which is interesting (and Haskell-ish).

The compiler is able to fuse array most of the arrays, (and Roman reports this
should do rather better with ghc head).

Anyway, we've a few new (fast!) array options on the horizon, looking promising.

For more info, check out the data parallel arrays page,

    http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell

-- Don