[Haskell-cafe] "sum" in hmatrix and blas?

Sun Jun 8 19:40:55 EDT 2008

Alberto Ruiz <aruiz at um.es> writes:

> My experience is that Haskell allocation time is very fast and usually
> negligible in most non trivial matrix computations.
>
> A good thing about
>
> sum v = constant 1 (dim v) <.> v
>
> is that a constant vector is efficiently created internally (not from
> an intermediate Haskell list), and the inner product will be computed
> by the possibly optimized blas version available in your machine.
>
> You can also write simple definitions like the next one for the
> average of the rows of a matrix as a simple vector-matrix product:
>
> mean m = w <> m
>     where w = constant k r
>           r = rows m
>           k = 1 / fromIntegral r
>
> I prefer high level "index free" matrix operations to C-like
> code. Definitions are clearer, have less bugs, and are also more
> efficient if you use optimized numerical libs.
>
> In any case, many algorithms can be solved by the recursive approach
> described by Tomas.

After reading don's blog, I tried to make a test with both
methods.  The code is very simple, as following

module Main where
import System
import Numeric.LinearAlgebra

vsum1 :: Vector Double -> Double
vsum1 v = constant 1 (dim v) <.> v

vsum2 :: Vector Double -> Double
vsum2 v = go (d - 1) 0
    where
      d = dim v
      go :: Int -> Double -> Double
      go 0 s = s + (v @> 0)
      go !j !s = go (j - 1) (s + (v @> j))

mean :: Vector Double -> Double
mean v = vsum v / fromIntegral (dim v)
    where vsum = vsum1

main :: IO ()
main = do
  fn:nrow:ncol:_ <- getArgs
  print . mean . flatten =<< fromFile fn (read nrow, read ncol)

Compile it with "-O2 -optc-O2", and run with a data set of
length 5000000.  The results are

with "vsum1":
 80,077,984 bytes allocated in the heap
      2,208 bytes copied during GC (scavenged)
         64 bytes copied during GC (not scavenged)
 40,894,464 bytes maximum residency (2 sample(s))
  %GC time       0.0%  (0.0% elapsed)
  Alloc rate    35,235,448 bytes per MUT second
./vsum1 huge 5000000 1  2.25s user 0.09s system 99% cpu 2.348 total

This is reasonable, exactly two copies of vector with size
of 40MB.

with "vsum2":
560,743,120 bytes allocated in the heap
     19,160 bytes copied during GC (scavenged)
     15,920 bytes copied during GC (not scavenged)
 40,919,040 bytes maximum residency (2 sample(s))
  %GC time       0.3%  (0.3% elapsed)
  Alloc rate    222,110,261 bytes per MUT second
./mean2 huge 5000000 1  2.53s user 0.06s system 99% cpu 2.598 total

This is strange.  A lot of extra usage of heap?  Probably
because '@>' is not efficient?  So it looks like the
inner-product approach wins with a fairly margin.
-- 
    c/*    __o/*
    <\     * (__
    */\      <