[Haskell-cafe] "sum" in hmatrix and blas?

Sun Jun 8 21:25:25 EDT 2008

Problem solved. The Haskell loop now wins.

> > After reading don's blog, I tried to make a test with both
> > methods.  The code is very simple, as following
> > 
> > module Main where
> > import System
> > import Numeric.LinearAlgebra
> > 
> > vsum1 :: Vector Double -> Double
> > vsum1 v = constant 1 (dim v) <.> v
> > 
> > vsum2 :: Vector Double -> Double
> > vsum2 v = go (d - 1) 0
> >     where
> >       d = dim v
> >       go :: Int -> Double -> Double
> >       go 0 s = s + (v @> 0)
> >       go !j !s = go (j - 1) (s + (v @> j))
> > 
> > 
> > mean :: Vector Double -> Double
> > mean v = vsum v / fromIntegral (dim v)
> >     where vsum = vsum1
> > 
> > main :: IO ()
> > main = do
> >   fn:nrow:ncol:_ <- getArgs
> >   print . mean . flatten =<< fromFile fn (read nrow, read ncol)
> > 
> > Compile it with "-O2 -optc-O2", and run with a data set of
> > length 5000000.  The results are
> > 
> > with "vsum1":
> >  80,077,984 bytes allocated in the heap
> >       2,208 bytes copied during GC (scavenged)
> >          64 bytes copied during GC (not scavenged)
> >  40,894,464 bytes maximum residency (2 sample(s))
> >   %GC time       0.0%  (0.0% elapsed)
> >   Alloc rate    35,235,448 bytes per MUT second
> > ./vsum1 huge 5000000 1  2.25s user 0.09s system 99% cpu 2.348 total
> > 
> > This is reasonable, exactly two copies of vector with size
> > of 40MB.
> > 
> > with "vsum2":
> > 560,743,120 bytes allocated in the heap
> >      19,160 bytes copied during GC (scavenged)
> >      15,920 bytes copied during GC (not scavenged)
> >  40,919,040 bytes maximum residency (2 sample(s))
> >   %GC time       0.3%  (0.3% elapsed)
> >   Alloc rate    222,110,261 bytes per MUT second
> > ./mean2 huge 5000000 1  2.53s user 0.06s system 99% cpu 2.598 total
> > 
> > This is strange.  A lot of extra usage of heap?  Probably
> > because '@>' is not efficient?  So it looks like the
> > inner-product approach wins with a fairly margin.

> Looking at the core we get from vsum2, we see:
> 
> vsum2 compiles quite well
> 
>       $wgo_s1Nn :: Int# -> Double# -> Double#
> 
>       $wgo_s1Nn =
>         \ (ww3_s1Mc :: Int#) (ww4_s1Mg :: Double#) ->
>           case ww3_s1Mc of ds_X17R {
>             __DEFAULT ->
>               case Data.Packed.Internal.Vector.$wat
>                      @ Double Foreign.Storable.$f9 ww_s1Ms ww1_s1Mu (I# ds_X17R)
>               of { D# y_a1KQ ->
>               $wgo_s1Nn (-# ds_X17R 1) (+## ww4_s1Mg y_a1KQ)
>               };
>             0 -> +## ww4_s1Mg ww2_s1M7
>           };
>     } in  $wgo_s1Nn (-# ww_s1Ms 1) 0.0
> 
> But note the return value from $wat is boxed, only to be immediately
> unboxed. That looks to be the source of the heap allocations.
> 
> Let's see if we can help that.

Ah, of course. The problem is the unsafePerformIO used to wrap up the
'peek'. This blocks the optimisations somewhat, after we unpack the Vector type.. 

So if we rewrite 'safeRead' to not use unsafePerformIO, but instead:

    safeRead v = inlinePerformIO . withForeignPtr (fptr v)

    inlinePerformIO (IO m) = case m realWorld# of (# _, r #) -> r

Along with this change to Vector:

    data Vector t = V { dim  :: {-# UNPACK #-} !Int
                      , fptr :: {-# UNPACK #-} !(ForeignPtr t)
                      }

and some inlining for the bounds checks.
Following bytestrings, we see the performance go from, with a 180M input
file:

Goal:
    vsum1:
        $ ghc-core A.hs -optc-O2 -fvia-C  -fbang-patterns
        $ time ./A data 4000 2500
        0.9999727441161678

        ./A data 4000 2500  5.37s user 0.20s system 99% cpu 5.609 total
        160,081,136 bytes allocated in the heap
        155 Mb total memory in use

Before:
vsum2-old:

         $wgo_s1Ns =
        \ (ww3_s1Me :: Int#) (ww4_s1Mi :: Double#) ->
          case ww3_s1Me of ds_X17Y {
            __DEFAULT ->
              case Data.Packed.Internal.Vector.$wat
                     @ Double Foreign.Storable.$f9 ww_s1Mu ww1_s1Mw (I# ds_X17Y)
              of wild1_a1Hg { D# y_a1Hi ->
              $wgo_s1Ns (-# ds_X17Y 1) (+## ww4_s1Mi y_a1Hi)
              };
            0 -> +## ww4_s1Mi ww2_s1M9
          };
    } in  $wgo_s1Ns (-# ww_s1Mu 1) 0.0

    ./A data 4000 2500 +RTS -sstderr 
    0.999972744115876

    ./A data 4000 2500 +RTS -sstderr  6.04s user 0.15s system 99% cpu 6.203 total    
    1,121,416,400 bytes allocated in the heap
    78 Mb total memory in use

After:
    True ->
          case readDoubleOffAddr#
                 @ RealWorld ww1_s1Nr ds_X180 realWorld#
          of wild2_a1HS { (# s2_a1HU, x_a1HV #) ->
          case touch# @ ForeignPtrContents ww2_s1Ns s2_a1HU
          of s_a1HG { __DEFAULT ->
          $wgo_s1Ok (-# ds_X180 1) (+## ww4_s1Ng x_a1HV)
          }

    No needless boxing.

    $ time ./A data 4000 2500 +RTS -sstderr          
    ./A data 4000 2500 +RTS -sstderr 
    0.999972744115876

    ./A data 4000 2500 +RTS -sstderr  5.41s user 0.10s system 99% cpu 5.548 total
     80,071,072 bytes allocated in the heap

    78 Mb total memory in use

And the total runtime and allocation now is better than the fully 'C' version.
Yay.

Patch attached to the darcs version of hmatrix.
Similar changes can likely be done to the other data types in hmatrix,
for pretty much ideal code.

-- Don

-------------- next part --------------
Sun Jun  8 18:21:18 PDT 2008  Don Stewart <dons at galois.com>
  * Use unpacking and inlining to ensure Vector can be stored unlifted at runtime

New patches:

[Use unpacking and inlining to ensure Vector can be stored unlifted at runtime
Don Stewart <dons at galois.com>**20080609012118] {
hunk ./lib/Data/Packed/Internal/Vector.hs 24
+import GHC.Base
+import GHC.IOBase
+
hunk ./lib/Data/Packed/Internal/Vector.hs 28
-data Vector t = V { dim  :: Int              -- ^ number of elements
-                  , fptr :: ForeignPtr t     -- ^ foreign pointer to the memory block
+data Vector t = V { dim  :: {-# UNPACK #-} !Int              -- ^ number of elements
+                  , fptr :: {-# UNPACK #-}!(ForeignPtr t)    -- ^ foreign pointer to the memory block
hunk ./lib/Data/Packed/Internal/Vector.hs 59
-safeRead v = unsafePerformIO . withForeignPtr (fptr v)
+safeRead v = inlinePerformIO . withForeignPtr (fptr v)
+{-# INLINE safeRead #-}
+
+inlinePerformIO :: IO a -> a
+inlinePerformIO (IO m) = case m realWorld# of (# _, r #) -> r
+{-# INLINE inlinePerformIO #-}
hunk ./lib/Data/Packed/Internal/Vector.hs 88
+{-# INLINE at #-}
}

Context:

[range checking and other additions to the mutable interface
Alberto Ruiz <aruiz at um.es>**20080606154148] 
[safe wrappers and examples/parallel.hs
Alberto Ruiz <aruiz at um.es>**20080606120801] 
[really add Data.Packed.ST.hs
Alberto Ruiz <aruiz at um.es>**20080605123525] 
[added acknowledgement to Bulat
Alberto Ruiz <aruiz at um.es>**20080605122046] 
[remove updateVector and updateMatrix
Alberto Ruiz <aruiz at um.es>**20080605121008] 
[first version of Data.Packed.ST
Alberto Ruiz <aruiz at um.es>**20080605120440] 
[Data.Packed.Convert
Alberto Ruiz <aruiz at um.es>**20080603090037] 
[added updateMatrix
Alberto Ruiz <aruiz at um.es>**20080601120950] 
[added updateVector
Alberto Ruiz <aruiz at um.es>**20080601113907] 
[TAG 0.3.0.0
Alberto Ruiz <aruiz at um.es>**20080527125906] 
Patch bundle hash:
189a5974a11e189e967d537176cf24e388e332e3