[Haskell-cafe] A Performance Puzzle

Fri Aug 3 00:16:42 UTC 2018

That's an interesting point.  Could the generation of the random matrix
be that slow?  Something to check.

In my comparison with dgetrf.c from ReLAPACK, I also used random
matrices, but measured the execution time from the start of the
factorization, so I did not include the generation of the random matrix.

The one piece of evidence that still points to a performance problem is
the scaling, since the execution time goes quite accurately as n^3 for n
* n linear systems.  I would expect the time for generation of a random
matrix, even if done very inefficiently, to scale as n^2.

On 8/2/18 7:47 PM, Vanessa McHale wrote:
> Looking at your benchmarks you may be benchmarking the wrong thing.
> The function you are benchmarking is runLUFactor, which generates
> random matrices in addition to factoring them.
>
> On 08/02/2018 05:27 PM, Gregory Wright wrote:
>>
>> Hi,
>>
>> Something Haskell has lacked for a long time is a good medium-duty
>> linear system solver based on the LU decomposition.  There are
>> bindings to the usual C/Fortran libraries, but not one in pure
>> Haskell.  (An example "LU factorization" routine that does not do
>> partial pivoting has been around for years, but lacking pivoting it
>> can fail unexpectedly on well-conditioned inputs.  Another Haskell LU
>> decomposition using partial pivoting is around, but it uses an
>> inefficient representation of the pivot matrix, so it's not suited to
>> solving systems of more than 100 x 100, say.)
>>
>> By medium duty I mean a linear system solver that can handle systems
>> of (1000s) x (1000s) and uses Crout's efficient in-place algorithm. 
>> In short, a program does everything short of exploiting SIMD vector
>> instructions for solving small subproblems.
>>
>> Instead of complaining about this, I have written a little library
>> that implements this.  It contains an LU factorization function and
>> an LU system solver.  The LU factorization also returns the parity of
>> the pivots ( = (-1)^(number of row swaps) ) so it can be used to
>> calculate determinants.  I used Gustavson's recursive (imperative)
>> version of Crout's method.  The implementation is quite simple and
>> deserves to be better known by people using functional languages to
>> do numeric work.  The library can be downloaded from GitHub:
>> https://github.com/gwright83/luSolve
>>
>> The performance scales as expected (as n^3, a linear system 10 times
>> larger in each dimension takes a 1000 times longer to solve):
>>
>> Benchmark luSolve-bench: RUNNING...
>> benchmarking LUSolve/luFactor 100 x 100 matrix
>> time                 1.944 ms   (1.920 ms .. 1.980 ms)
>>                      0.996 R²   (0.994 R² .. 0.998 R²)
>> mean                 1.981 ms   (1.958 ms .. 2.009 ms)
>> std dev              85.64 μs   (70.21 μs .. 107.7 μs)
>> variance introduced by outliers: 30% (moderately inflated)
>>
>> benchmarking LUSolve/luFactor 500 x 500 matrix
>> time                 204.3 ms   (198.1 ms .. 208.2 ms)
>>                      1.000 R²   (0.999 R² .. 1.000 R²)
>> mean                 203.3 ms   (201.2 ms .. 206.2 ms)
>> std dev              3.619 ms   (2.307 ms .. 6.231 ms)
>> variance introduced by outliers: 14% (moderately inflated)
>>
>> benchmarking LUSolve/luFactor 1000 x 1000 matrix
>> time                 1.940 s    (1.685 s .. 2.139 s)
>>                      0.998 R²   (0.993 R² .. 1.000 R²)
>> mean                 1.826 s    (1.696 s .. 1.880 s)
>> std dev              93.63 ms   (5.802 ms .. 117.8 ms)
>> variance introduced by outliers: 19% (moderately inflated)
>>
>> Benchmark luSolve-bench: FINISH
>>
>>
>> The puzzle is why the overall performance is so poor.  When I solve
>> random 1000 x 1000 systems using the linsys.c example file from the
>> Recursive LAPACK (ReLAPACK) library -- which implements the same
>> algorithm -- the average time is only 26 ms.  (I have tweaked
>> ReLAPACK's  dgetrf.c so that it doesn't use optimized routines for
>> small matrices.  As near as I can make it, the C and haskell versions
>> should be doing the same thing.)
>>
>> The haskell version runs 75 times slower.  This is the puzzle.
>>
>> My haskell version uses a mutable, matrix of unboxed doubles (from
>> Kai Zhang's matrices <https://hackage.haskell.org/package/matrices>
>> library).  Matrix reads and writes are unsafe, so there is no
>> overhead from bounds checking.
>>
>> Let's look at the result of profiling:
>>
>>         Tue Jul 31 21:07 2018 Time and Allocation Profiling Report 
>> (Final)
>>
>>            luSolve-hspec +RTS -N -p -RTS
>>
>>         total time  =     7665.31 secs   (7665309 ticks @ 1000 us, 1
>> processor)
>>         total alloc = 10,343,030,811,040 bytes  (excludes profiling
>> overheads)
>>
>> COST CENTRE           MODULE                           
>> SRC                                                      %time %alloc
>>
>> unsafeWrite           Data.Matrix.Dense.Generic.Mutable
>> src/Data/Matrix/Dense/Generic/Mutable.hs:(38,5)-(39,38)   17.7   29.4
>> basicUnsafeWrite      Data.Vector.Primitive.Mutable    
>> Data/Vector/Primitive/Mutable.hs:115:3-69                 14.7   13.0
>> unsafeRead            Data.Matrix.Dense.Generic.Mutable
>> src/Data/Matrix/Dense/Generic/Mutable.hs:(34,5)-(35,38)   14.2   20.7
>> matrixMultiply.\.\.\  Numeric.LinearAlgebra.LUSolve    
>> src/Numeric/LinearAlgebra/LUSolve.hs:(245,54)-(249,86)    13.4   13.5
>> readByteArray#        Data.Primitive.Types             
>> Data/Primitive/Types.hs:184:30-132                         9.0   15.5
>> basicUnsafeRead       Data.Vector.Primitive.Mutable    
>> Data/Vector/Primitive/Mutable.hs:112:3-63                  8.8    0.1
>> triangularSolve.\.\.\ Numeric.LinearAlgebra.LUSolve    
>> src/Numeric/LinearAlgebra/LUSolve.hs:(382,45)-(386,58)     5.2    4.5
>> matrixMultiply.\.\    Numeric.LinearAlgebra.LUSolve    
>> src/Numeric/LinearAlgebra/LUSolve.hs:(244,54)-(249,86)     4.1    0.3
>> primitive             Control.Monad.Primitive          
>> Control/Monad/Primitive.hs:152:3-16                        3.8    0.0
>> basicUnsafeRead       Data.Vector.Unboxed.Base         
>> Data/Vector/Unboxed/Base.hs:278:813-868                    3.3    0.0
>> basicUnsafeWrite      Data.Vector.Unboxed.Base         
>> Data/Vector/Unboxed/Base.hs:278:872-933                    1.5    0.0
>> triangularSolve.\.\   Numeric.LinearAlgebra.LUSolve    
>> src/Numeric/LinearAlgebra/LUSolve.hs:(376,33)-(386,58)     1.3    0.1
>>
>> <snip>
>>
>>
>> A large amount of time is spent on the invocations of unsafeRead and
>> unsafeWrite.  This is a bit suspicious -- it looks as if these call
>> may not be inlined.  In the Data.Vector.Unboxed.Mutable library,
>> which provides the underlying linear vector of storage locations, the
>> unsafeRead and unsafeWrite functions are declared INLINE.  Could this
>> be a failure of the 'matrices' library to mark its unsafeRead/Write
>> functions as INLINE or SPECIALIZABLE as well?
>>
>> On the other hand, looking at the core (.dump-simpl) of the library
>> doesn't show any dictionary passing, and the access to matrix seem to
>> be through GHC.Prim.writeDoubleArray# and GHC.Prim.readDoubleArray#.
>>
>> If this program took three to five times longer, I would not be
>> concerned, but a factor of seventy five indicates that I've missed
>> something important.  Can anyone tell me what it is?
>>
>> Best Wishes,
>>
>> Greg
>>
>>
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20180802/4d388a43/attachment-0001.html>