[Haskell-cafe] "sum" in hmatrix and blas?

Mon Jun 9 12:10:09 EDT 2008

Patch applied!

I will also make the recommended changes in the Matrix type and prepare 
some benchmarks for this kind of Haskell loops.

In fact, such excellent performance will also be very useful in image 
processing algorithms which must read and write a large number of C 
array elements. I now think that some C auxiliary functions in the 
easyVision library for operations not available in the IPP can be 
replaced by much nicer Haskell versions.

I'm currently working on the implementation of interest point 
descriptors based on histograms of local gradient directions. This may 
be a very good benchmark. I will prepare a first version and post here 
the results for discussion.

Don, many thanks for your all your help!

Alberto

Don Stewart wrote:
> Problem solved. The Haskell loop now wins.
> 
>>> After reading don's blog, I tried to make a test with both
>>> methods.  The code is very simple, as following
>>>
>>> module Main where
>>> import System
>>> import Numeric.LinearAlgebra
>>>
>>> vsum1 :: Vector Double -> Double
>>> vsum1 v = constant 1 (dim v) <.> v
>>>
>>> vsum2 :: Vector Double -> Double
>>> vsum2 v = go (d - 1) 0
>>>     where
>>>       d = dim v
>>>       go :: Int -> Double -> Double
>>>       go 0 s = s + (v @> 0)
>>>       go !j !s = go (j - 1) (s + (v @> j))
>>>
>>>
>>> mean :: Vector Double -> Double
>>> mean v = vsum v / fromIntegral (dim v)
>>>     where vsum = vsum1
>>>
>>> main :: IO ()
>>> main = do
>>>   fn:nrow:ncol:_ <- getArgs
>>>   print . mean . flatten =<< fromFile fn (read nrow, read ncol)
>>>
>>> Compile it with "-O2 -optc-O2", and run with a data set of
>>> length 5000000.  The results are
>>>
>>> with "vsum1":
>>>  80,077,984 bytes allocated in the heap
>>>       2,208 bytes copied during GC (scavenged)
>>>          64 bytes copied during GC (not scavenged)
>>>  40,894,464 bytes maximum residency (2 sample(s))
>>>   %GC time       0.0%  (0.0% elapsed)
>>>   Alloc rate    35,235,448 bytes per MUT second
>>> ./vsum1 huge 5000000 1  2.25s user 0.09s system 99% cpu 2.348 total
>>>
>>> This is reasonable, exactly two copies of vector with size
>>> of 40MB.
>>>
>>> with "vsum2":
>>> 560,743,120 bytes allocated in the heap
>>>      19,160 bytes copied during GC (scavenged)
>>>      15,920 bytes copied during GC (not scavenged)
>>>  40,919,040 bytes maximum residency (2 sample(s))
>>>   %GC time       0.3%  (0.3% elapsed)
>>>   Alloc rate    222,110,261 bytes per MUT second
>>> ./mean2 huge 5000000 1  2.53s user 0.06s system 99% cpu 2.598 total
>>>
>>> This is strange.  A lot of extra usage of heap?  Probably
>>> because '@>' is not efficient?  So it looks like the
>>> inner-product approach wins with a fairly margin.
>   
>> Looking at the core we get from vsum2, we see:
>>
>> vsum2 compiles quite well
>>
>>       $wgo_s1Nn :: Int# -> Double# -> Double#
>>
>>       $wgo_s1Nn =
>>         \ (ww3_s1Mc :: Int#) (ww4_s1Mg :: Double#) ->
>>           case ww3_s1Mc of ds_X17R {
>>             __DEFAULT ->
>>               case Data.Packed.Internal.Vector.$wat
>>                      @ Double Foreign.Storable.$f9 ww_s1Ms ww1_s1Mu (I# ds_X17R)
>>               of { D# y_a1KQ ->
>>               $wgo_s1Nn (-# ds_X17R 1) (+## ww4_s1Mg y_a1KQ)
>>               };
>>             0 -> +## ww4_s1Mg ww2_s1M7
>>           };
>>     } in  $wgo_s1Nn (-# ww_s1Ms 1) 0.0
>>
>> But note the return value from $wat is boxed, only to be immediately
>> unboxed. That looks to be the source of the heap allocations.
>>
>> Let's see if we can help that.
> 
> Ah, of course. The problem is the unsafePerformIO used to wrap up the
> 'peek'. This blocks the optimisations somewhat, after we unpack the Vector type.. 
> 
> So if we rewrite 'safeRead' to not use unsafePerformIO, but instead:
> 
>     safeRead v = inlinePerformIO . withForeignPtr (fptr v)
> 
>     inlinePerformIO (IO m) = case m realWorld# of (# _, r #) -> r
> 
> Along with this change to Vector:
> 
>     data Vector t = V { dim  :: {-# UNPACK #-} !Int
>                       , fptr :: {-# UNPACK #-} !(ForeignPtr t)
>                       }
> 
> and some inlining for the bounds checks.
> Following bytestrings, we see the performance go from, with a 180M input
> file:
>     
> Goal:
>     vsum1:
>         $ ghc-core A.hs -optc-O2 -fvia-C  -fbang-patterns
>         $ time ./A data 4000 2500
>         0.9999727441161678
> 
>         ./A data 4000 2500  5.37s user 0.20s system 99% cpu 5.609 total
>         160,081,136 bytes allocated in the heap
>         155 Mb total memory in use
> 
> Before:
> vsum2-old:
> 
>          $wgo_s1Ns =
>         \ (ww3_s1Me :: Int#) (ww4_s1Mi :: Double#) ->
>           case ww3_s1Me of ds_X17Y {
>             __DEFAULT ->
>               case Data.Packed.Internal.Vector.$wat
>                      @ Double Foreign.Storable.$f9 ww_s1Mu ww1_s1Mw (I# ds_X17Y)
>               of wild1_a1Hg { D# y_a1Hi ->
>               $wgo_s1Ns (-# ds_X17Y 1) (+## ww4_s1Mi y_a1Hi)
>               };
>             0 -> +## ww4_s1Mi ww2_s1M9
>           };
>     } in  $wgo_s1Ns (-# ww_s1Mu 1) 0.0
>  
>     ./A data 4000 2500 +RTS -sstderr 
>     0.999972744115876
> 
>     ./A data 4000 2500 +RTS -sstderr  6.04s user 0.15s system 99% cpu 6.203 total    
>     1,121,416,400 bytes allocated in the heap
>     78 Mb total memory in use
> 
> After:
>     True ->
>           case readDoubleOffAddr#
>                  @ RealWorld ww1_s1Nr ds_X180 realWorld#
>           of wild2_a1HS { (# s2_a1HU, x_a1HV #) ->
>           case touch# @ ForeignPtrContents ww2_s1Ns s2_a1HU
>           of s_a1HG { __DEFAULT ->
>           $wgo_s1Ok (-# ds_X180 1) (+## ww4_s1Ng x_a1HV)
>           }
> 
>     No needless boxing.
> 
>     $ time ./A data 4000 2500 +RTS -sstderr          
>     ./A data 4000 2500 +RTS -sstderr 
>     0.999972744115876
> 
>     ./A data 4000 2500 +RTS -sstderr  5.41s user 0.10s system 99% cpu 5.548 total
>      80,071,072 bytes allocated in the heap
> 
>     78 Mb total memory in use
> 
> And the total runtime and allocation now is better than the fully 'C' version.
> Yay.
> 
> Patch attached to the darcs version of hmatrix.
> Similar changes can likely be done to the other data types in hmatrix,
> for pretty much ideal code.
> 
> -- Don
>     
>