[Haskell-cafe] A Performance Puzzle

Thu Aug 2 23:47:58 UTC 2018

Looking at your benchmarks you may be benchmarking the wrong thing. The
function you are benchmarking is runLUFactor, which generates random
matrices in addition to factoring them.

On 08/02/2018 05:27 PM, Gregory Wright wrote:
>
> Hi,
>
> Something Haskell has lacked for a long time is a good medium-duty
> linear system solver based on the LU decomposition.  There are
> bindings to the usual C/Fortran libraries, but not one in pure
> Haskell.  (An example "LU factorization" routine that does not do
> partial pivoting has been around for years, but lacking pivoting it
> can fail unexpectedly on well-conditioned inputs.  Another Haskell LU
> decomposition using partial pivoting is around, but it uses an
> inefficient representation of the pivot matrix, so it's not suited to
> solving systems of more than 100 x 100, say.)
>
> By medium duty I mean a linear system solver that can handle systems
> of (1000s) x (1000s) and uses Crout's efficient in-place algorithm. 
> In short, a program does everything short of exploiting SIMD vector
> instructions for solving small subproblems.
>
> Instead of complaining about this, I have written a little library
> that implements this.  It contains an LU factorization function and an
> LU system solver.  The LU factorization also returns the parity of the
> pivots ( = (-1)^(number of row swaps) ) so it can be used to calculate
> determinants.  I used Gustavson's recursive (imperative) version of
> Crout's method.  The implementation is quite simple and deserves to be
> better known by people using functional languages to do numeric work. 
> The library can be downloaded from GitHub:
> https://github.com/gwright83/luSolve
>
> The performance scales as expected (as n^3, a linear system 10 times
> larger in each dimension takes a 1000 times longer to solve):
>
> Benchmark luSolve-bench: RUNNING...
> benchmarking LUSolve/luFactor 100 x 100 matrix
> time                 1.944 ms   (1.920 ms .. 1.980 ms)
>                      0.996 R²   (0.994 R² .. 0.998 R²)
> mean                 1.981 ms   (1.958 ms .. 2.009 ms)
> std dev              85.64 μs   (70.21 μs .. 107.7 μs)
> variance introduced by outliers: 30% (moderately inflated)
>
> benchmarking LUSolve/luFactor 500 x 500 matrix
> time                 204.3 ms   (198.1 ms .. 208.2 ms)
>                      1.000 R²   (0.999 R² .. 1.000 R²)
> mean                 203.3 ms   (201.2 ms .. 206.2 ms)
> std dev              3.619 ms   (2.307 ms .. 6.231 ms)
> variance introduced by outliers: 14% (moderately inflated)
>
> benchmarking LUSolve/luFactor 1000 x 1000 matrix
> time                 1.940 s    (1.685 s .. 2.139 s)
>                      0.998 R²   (0.993 R² .. 1.000 R²)
> mean                 1.826 s    (1.696 s .. 1.880 s)
> std dev              93.63 ms   (5.802 ms .. 117.8 ms)
> variance introduced by outliers: 19% (moderately inflated)
>
> Benchmark luSolve-bench: FINISH
>
>
> The puzzle is why the overall performance is so poor.  When I solve
> random 1000 x 1000 systems using the linsys.c example file from the
> Recursive LAPACK (ReLAPACK) library -- which implements the same
> algorithm -- the average time is only 26 ms.  (I have tweaked
> ReLAPACK's  dgetrf.c so that it doesn't use optimized routines for
> small matrices.  As near as I can make it, the C and haskell versions
> should be doing the same thing.)
>
> The haskell version runs 75 times slower.  This is the puzzle.
>
> My haskell version uses a mutable, matrix of unboxed doubles (from Kai
> Zhang's matrices <https://hackage.haskell.org/package/matrices>
> library).  Matrix reads and writes are unsafe, so there is no overhead
> from bounds checking.
>
> Let's look at the result of profiling:
>
>         Tue Jul 31 21:07 2018 Time and Allocation Profiling Report 
> (Final)
>
>            luSolve-hspec +RTS -N -p -RTS
>
>         total time  =     7665.31 secs   (7665309 ticks @ 1000 us, 1
> processor)
>         total alloc = 10,343,030,811,040 bytes  (excludes profiling
> overheads)
>
> COST CENTRE           MODULE                           
> SRC                                                      %time %alloc
>
> unsafeWrite           Data.Matrix.Dense.Generic.Mutable
> src/Data/Matrix/Dense/Generic/Mutable.hs:(38,5)-(39,38)   17.7   29.4
> basicUnsafeWrite      Data.Vector.Primitive.Mutable    
> Data/Vector/Primitive/Mutable.hs:115:3-69                 14.7   13.0
> unsafeRead            Data.Matrix.Dense.Generic.Mutable
> src/Data/Matrix/Dense/Generic/Mutable.hs:(34,5)-(35,38)   14.2   20.7
> matrixMultiply.\.\.\  Numeric.LinearAlgebra.LUSolve    
> src/Numeric/LinearAlgebra/LUSolve.hs:(245,54)-(249,86)    13.4   13.5
> readByteArray#        Data.Primitive.Types             
> Data/Primitive/Types.hs:184:30-132                         9.0   15.5
> basicUnsafeRead       Data.Vector.Primitive.Mutable    
> Data/Vector/Primitive/Mutable.hs:112:3-63                  8.8    0.1
> triangularSolve.\.\.\ Numeric.LinearAlgebra.LUSolve    
> src/Numeric/LinearAlgebra/LUSolve.hs:(382,45)-(386,58)     5.2    4.5
> matrixMultiply.\.\    Numeric.LinearAlgebra.LUSolve    
> src/Numeric/LinearAlgebra/LUSolve.hs:(244,54)-(249,86)     4.1    0.3
> primitive             Control.Monad.Primitive          
> Control/Monad/Primitive.hs:152:3-16                        3.8    0.0
> basicUnsafeRead       Data.Vector.Unboxed.Base         
> Data/Vector/Unboxed/Base.hs:278:813-868                    3.3    0.0
> basicUnsafeWrite      Data.Vector.Unboxed.Base         
> Data/Vector/Unboxed/Base.hs:278:872-933                    1.5    0.0
> triangularSolve.\.\   Numeric.LinearAlgebra.LUSolve    
> src/Numeric/LinearAlgebra/LUSolve.hs:(376,33)-(386,58)     1.3    0.1
>
> <snip>
>
>
> A large amount of time is spent on the invocations of unsafeRead and
> unsafeWrite.  This is a bit suspicious -- it looks as if these call
> may not be inlined.  In the Data.Vector.Unboxed.Mutable library, which
> provides the underlying linear vector of storage locations, the
> unsafeRead and unsafeWrite functions are declared INLINE.  Could this
> be a failure of the 'matrices' library to mark its unsafeRead/Write
> functions as INLINE or SPECIALIZABLE as well?
>
> On the other hand, looking at the core (.dump-simpl) of the library
> doesn't show any dictionary passing, and the access to matrix seem to
> be through GHC.Prim.writeDoubleArray# and GHC.Prim.readDoubleArray#.
>
> If this program took three to five times longer, I would not be
> concerned, but a factor of seventy five indicates that I've missed
> something important.  Can anyone tell me what it is?
>
> Best Wishes,
>
> Greg
>
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20180802/0dfa3230/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20180802/0dfa3230/attachment.sig>