[Haskell-cafe] Efficient matrix multiply using accelerate
Trevor L. McDonell
tmcdonell at cse.unsw.edu.au
Thu Sep 5 03:47:41 CEST 2013
On 04/09/2013, at 9:53 PM, Morten Olsen Lysgaard <morten at lysgaard.no> wrote:
> I've been trying to get some speed out of the accelerate library today.
> What I want to implement is something as simple as a matrix multiply.
> I'd like it to be fast and memory efficient.
Well, the trouble with something like matrix multiply is that it is only simple conceptually. Making it fast and memory [bandwidth] efficient is another matter entirely, GPU or otherwise.
> Anyone know how to achieve this with accelerate? My first thought was
> to use the generate function to create the new C array, but I didn't
> manage to wrap my head around all the fancy type features that pop up
> when you want to return an array C that has dimensions dependent on
> the dimensions of it's inputs, A and B.
If in doubt, add a type signature; sometimes you need to help GHC's type checker along a bit. Typically this comes up when constructing/deconstructing things using lift/unlift respectively, which you'll need to do to unpack shapes & indices.
> I've search around a bit and found this  example implementation but
> it is just as slow as a simple sequential algorithm in C.
What kind of GPU are you running on?
Also I haven't benchmarked that matrix multiply code, but I am not overly surprised --- it wasn't designed to be efficient.
> Here's a snippet of what I have tried to make. There are several
> errors in there. Maybe I'm approaching the problem from the wrong
> matMul' arr brr =
> let dotProd shp =
> let (Z :. rowsA :. _) = unlift (shape arr) :: (Z :. Exp
> Int :. Exp Int)
> (Z :. _ :. colsB) = unlift (shape brr) :: (Z :. Exp
> Int :. Exp Int)
> (Z :. i :. j) = unlift shp :: (Z :. Exp Int :. Exp Int)
> rs = lift (Z :. All :.) (unlift i)
> cs = (lift (Z :.) (unlift j)) (:. All)
> in the $ A.fold1All (+) $ A.zipWith (+) (flatten (slice arr
> rs)) (flatten (slice brr cs))
> in A.generate (lift (Z :. rowsA :. colsB)) dotProd
Yeah, this introduces nested parallelism, because each thread (scalar computation) in the generate function is trying to initiate new parallel computation via dotProd. Anyway, even if it did work, you'd still perform the same number of memory transfers as the replicate-based version you linked to earlier . There is a bit more of an explanation of how reduction works in Accelerate at , slide 44.
I am thinking about how to implement some shared memory optimisations, but that isn't likely to be ready soon.
If you want truly fast matrix multiply, Accelerate has an escape hatch --- there is a foreign function interface, so you can just call into the matrix multiply routine provided by, for example, the CUBLAS library.
More information about the Haskell-Cafe