[Haskell-cafe] A Performance Puzzle
Gregory Wright
gwright at antiope.com
Fri Aug 3 00:16:42 UTC 2018
That's an interesting point. Could the generation of the random matrix
be that slow? Something to check.
In my comparison with dgetrf.c from ReLAPACK, I also used random
matrices, but measured the execution time from the start of the
factorization, so I did not include the generation of the random matrix.
The one piece of evidence that still points to a performance problem is
the scaling, since the execution time goes quite accurately as n^3 for n
* n linear systems. I would expect the time for generation of a random
matrix, even if done very inefficiently, to scale as n^2.
On 8/2/18 7:47 PM, Vanessa McHale wrote:
> Looking at your benchmarks you may be benchmarking the wrong thing.
> The function you are benchmarking is runLUFactor, which generates
> random matrices in addition to factoring them.
>
> On 08/02/2018 05:27 PM, Gregory Wright wrote:
>>
>> Hi,
>>
>> Something Haskell has lacked for a long time is a good medium-duty
>> linear system solver based on the LU decomposition. There are
>> bindings to the usual C/Fortran libraries, but not one in pure
>> Haskell. (An example "LU factorization" routine that does not do
>> partial pivoting has been around for years, but lacking pivoting it
>> can fail unexpectedly on well-conditioned inputs. Another Haskell LU
>> decomposition using partial pivoting is around, but it uses an
>> inefficient representation of the pivot matrix, so it's not suited to
>> solving systems of more than 100 x 100, say.)
>>
>> By medium duty I mean a linear system solver that can handle systems
>> of (1000s) x (1000s) and uses Crout's efficient in-place algorithm.
>> In short, a program does everything short of exploiting SIMD vector
>> instructions for solving small subproblems.
>>
>> Instead of complaining about this, I have written a little library
>> that implements this. It contains an LU factorization function and
>> an LU system solver. The LU factorization also returns the parity of
>> the pivots ( = (-1)^(number of row swaps) ) so it can be used to
>> calculate determinants. I used Gustavson's recursive (imperative)
>> version of Crout's method. The implementation is quite simple and
>> deserves to be better known by people using functional languages to
>> do numeric work. The library can be downloaded from GitHub:
>> https://github.com/gwright83/luSolve
>>
>> The performance scales as expected (as n^3, a linear system 10 times
>> larger in each dimension takes a 1000 times longer to solve):
>>
>> Benchmark luSolve-bench: RUNNING...
>> benchmarking LUSolve/luFactor 100 x 100 matrix
>> time 1.944 ms (1.920 ms .. 1.980 ms)
>> 0.996 R² (0.994 R² .. 0.998 R²)
>> mean 1.981 ms (1.958 ms .. 2.009 ms)
>> std dev 85.64 μs (70.21 μs .. 107.7 μs)
>> variance introduced by outliers: 30% (moderately inflated)
>>
>> benchmarking LUSolve/luFactor 500 x 500 matrix
>> time 204.3 ms (198.1 ms .. 208.2 ms)
>> 1.000 R² (0.999 R² .. 1.000 R²)
>> mean 203.3 ms (201.2 ms .. 206.2 ms)
>> std dev 3.619 ms (2.307 ms .. 6.231 ms)
>> variance introduced by outliers: 14% (moderately inflated)
>>
>> benchmarking LUSolve/luFactor 1000 x 1000 matrix
>> time 1.940 s (1.685 s .. 2.139 s)
>> 0.998 R² (0.993 R² .. 1.000 R²)
>> mean 1.826 s (1.696 s .. 1.880 s)
>> std dev 93.63 ms (5.802 ms .. 117.8 ms)
>> variance introduced by outliers: 19% (moderately inflated)
>>
>> Benchmark luSolve-bench: FINISH
>>
>>
>> The puzzle is why the overall performance is so poor. When I solve
>> random 1000 x 1000 systems using the linsys.c example file from the
>> Recursive LAPACK (ReLAPACK) library -- which implements the same
>> algorithm -- the average time is only 26 ms. (I have tweaked
>> ReLAPACK's dgetrf.c so that it doesn't use optimized routines for
>> small matrices. As near as I can make it, the C and haskell versions
>> should be doing the same thing.)
>>
>> The haskell version runs 75 times slower. This is the puzzle.
>>
>> My haskell version uses a mutable, matrix of unboxed doubles (from
>> Kai Zhang's matrices <https://hackage.haskell.org/package/matrices>
>> library). Matrix reads and writes are unsafe, so there is no
>> overhead from bounds checking.
>>
>> Let's look at the result of profiling:
>>
>> Tue Jul 31 21:07 2018 Time and Allocation Profiling Report
>> (Final)
>>
>> luSolve-hspec +RTS -N -p -RTS
>>
>> total time = 7665.31 secs (7665309 ticks @ 1000 us, 1
>> processor)
>> total alloc = 10,343,030,811,040 bytes (excludes profiling
>> overheads)
>>
>> COST CENTRE MODULE
>> SRC %time %alloc
>>
>> unsafeWrite Data.Matrix.Dense.Generic.Mutable
>> src/Data/Matrix/Dense/Generic/Mutable.hs:(38,5)-(39,38) 17.7 29.4
>> basicUnsafeWrite Data.Vector.Primitive.Mutable
>> Data/Vector/Primitive/Mutable.hs:115:3-69 14.7 13.0
>> unsafeRead Data.Matrix.Dense.Generic.Mutable
>> src/Data/Matrix/Dense/Generic/Mutable.hs:(34,5)-(35,38) 14.2 20.7
>> matrixMultiply.\.\.\ Numeric.LinearAlgebra.LUSolve
>> src/Numeric/LinearAlgebra/LUSolve.hs:(245,54)-(249,86) 13.4 13.5
>> readByteArray# Data.Primitive.Types
>> Data/Primitive/Types.hs:184:30-132 9.0 15.5
>> basicUnsafeRead Data.Vector.Primitive.Mutable
>> Data/Vector/Primitive/Mutable.hs:112:3-63 8.8 0.1
>> triangularSolve.\.\.\ Numeric.LinearAlgebra.LUSolve
>> src/Numeric/LinearAlgebra/LUSolve.hs:(382,45)-(386,58) 5.2 4.5
>> matrixMultiply.\.\ Numeric.LinearAlgebra.LUSolve
>> src/Numeric/LinearAlgebra/LUSolve.hs:(244,54)-(249,86) 4.1 0.3
>> primitive Control.Monad.Primitive
>> Control/Monad/Primitive.hs:152:3-16 3.8 0.0
>> basicUnsafeRead Data.Vector.Unboxed.Base
>> Data/Vector/Unboxed/Base.hs:278:813-868 3.3 0.0
>> basicUnsafeWrite Data.Vector.Unboxed.Base
>> Data/Vector/Unboxed/Base.hs:278:872-933 1.5 0.0
>> triangularSolve.\.\ Numeric.LinearAlgebra.LUSolve
>> src/Numeric/LinearAlgebra/LUSolve.hs:(376,33)-(386,58) 1.3 0.1
>>
>> <snip>
>>
>>
>> A large amount of time is spent on the invocations of unsafeRead and
>> unsafeWrite. This is a bit suspicious -- it looks as if these call
>> may not be inlined. In the Data.Vector.Unboxed.Mutable library,
>> which provides the underlying linear vector of storage locations, the
>> unsafeRead and unsafeWrite functions are declared INLINE. Could this
>> be a failure of the 'matrices' library to mark its unsafeRead/Write
>> functions as INLINE or SPECIALIZABLE as well?
>>
>> On the other hand, looking at the core (.dump-simpl) of the library
>> doesn't show any dictionary passing, and the access to matrix seem to
>> be through GHC.Prim.writeDoubleArray# and GHC.Prim.readDoubleArray#.
>>
>> If this program took three to five times longer, I would not be
>> concerned, but a factor of seventy five indicates that I've missed
>> something important. Can anyone tell me what it is?
>>
>> Best Wishes,
>>
>> Greg
>>
>>
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20180802/4d388a43/attachment-0001.html>
More information about the Haskell-Cafe
mailing list