[Haskell-cafe] A Performance Puzzle

Vanessa McHale vanessa.mchale at iohk.io
Fri Aug 3 00:34:15 UTC 2018


I agree: I think the main problem would still remain once that was
accounted for, but it may be worth doing it correctly nonetheless :)

The only thing I can think of off the top of my head is using array
rather than vector. I am not sure that would fix the performance, but
it would bring it closer to the C implementation. You might also
consider looking at a heap profile or using ThreadScope.
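
For what it's worth, the kind of change I have in mind is sketched
below: the same sort of unchecked element access, but written against
IOUArray using unsafeRead/unsafeWrite from Data.Array.Base.  The
function and its indexing are made up for illustration only; I haven't
measured whether it actually helps.

    import Control.Monad (forM_)
    import Data.Array.Base (unsafeRead, unsafeWrite)
    import Data.Array.IO (IOUArray)

    -- row(y) <- row(y) - alpha * row(x) over a flat, row-major buffer;
    -- a stand-in for the kind of inner loop an LU factorization performs.
    axpyRow :: IOUArray Int Double  -- flat n x n buffer
            -> Int                  -- offset of row x
            -> Int                  -- offset of row y
            -> Int                  -- row length n
            -> Double               -- multiplier alpha
            -> IO ()
    axpyRow buf xOff yOff n alpha =
        forM_ [0 .. n - 1] $ \j -> do
            xj <- unsafeRead buf (xOff + j)
            yj <- unsafeRead buf (yOff + j)
            unsafeWrite buf (yOff + j) (yj - alpha * xj)

(For ThreadScope, compiling with -eventlog and running with +RTS -l
should produce an eventlog you can open in it.)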


On 08/02/2018 07:16 PM, Gregory Wright wrote:
> That's an interesting point.  Could the generation of the random
> matrix be that slow?  Something to check.
>
> In my comparison with dgetrf.c from ReLAPACK, I also used random
> matrices, but measured the execution time from the start of the
> factorization, so I did not include the generation of the random matrix.
>
> The one piece of evidence that still points to a performance problem
> is the scaling: the execution time goes quite accurately as n^3 for
> n x n linear systems.  I would expect the time to generate a random
> matrix, even if done very inefficiently, to scale as n^2.  (Going
> from 100 x 100 to 1000 x 1000 multiplies an n^2 cost by 100, while
> the measured times grow by roughly 1000.)
>
>
> On 8/2/18 7:47 PM, Vanessa McHale wrote:
>> Looking at your benchmarks you may be benchmarking the wrong thing.
>> The function you are benchmarking is runLUFactor, which generates
>> random matrices in addition to factoring them.
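>>
>> If you want to keep the generation out of the measurement, criterion's
>> env combinator builds the input once, outside the timed action.  A
>> minimal sketch, where randomMatrix is a hypothetical generator,
>> luFactor stands for the pure factorization (its actual name and type
>> may differ), and the matrix type is assumed to have an NFData
>> instance:
>>
>>     import Criterion.Main
>>
>>     -- randomMatrix and luFactor are placeholders for the generator
>>     -- and the pure factorization in luSolve; their real names and
>>     -- types may differ.
>>     main :: IO ()
>>     main = defaultMain
>>         [ env (randomMatrix 1000 1000) $ \m ->
>>               bench "luFactor 1000 x 1000 matrix" (nf luFactor m)
>>         ]
>>
>> That way only the factorization itself is timed.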
>>
>> On 08/02/2018 05:27 PM, Gregory Wright wrote:
>>>
>>> Hi,
>>>
>>> Something Haskell has lacked for a long time is a good medium-duty
>>> linear system solver based on the LU decomposition.  There are
>>> bindings to the usual C/Fortran libraries, but not one in pure
>>> Haskell.  (An example "LU factorization" routine that does not do
>>> partial pivoting has been around for years, but without pivoting it
>>> can fail unexpectedly even on well-conditioned inputs.  Another
>>> Haskell LU decomposition using partial pivoting is around, but it
>>> uses an inefficient representation of the pivot matrix, so it isn't
>>> suited to solving systems larger than 100 x 100, say.)
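>>>
>>> For comparison, LAPACK stores the pivots as a plain integer vector,
>>> where entry k records the row exchanged with row k; applying it is
>>> O(n) and the permutation matrix is never materialized.  A small
>>> sketch of that representation (illustration only, not code from my
>>> library):
>>>
>>>     import Control.Monad (when)
>>>     import qualified Data.Vector.Unboxed as VU
>>>     import qualified Data.Vector.Unboxed.Mutable as VUM
>>>
>>>     -- piv VU.! k is the row exchanged with row k during elimination.
>>>     -- Applying the pivots to a right-hand side costs O(n).
>>>     applyPivots :: VU.Vector Int -> VU.Vector Double -> VU.Vector Double
>>>     applyPivots piv b = VU.modify go b
>>>       where
>>>         go v = VU.forM_ (VU.indexed piv) $ \(k, p) ->
>>>                    when (p /= k) $ VUM.swap v k p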
>>>
>>> By medium duty I mean a linear system solver that can handle systems
>>> of (1000s) x (1000s) and uses Crout's efficient in-place algorithm. 
>>> In short, a program that does everything short of exploiting SIMD
>>> vector instructions for solving small subproblems.
>>>
>>> Instead of complaining about this, I have written a little library
>>> that fills the gap.  It contains an LU factorization function and
>>> an LU system solver.  The LU factorization also returns the parity
>>> of the pivots ( = (-1)^(number of row swaps) ) so it can be used to
>>> calculate determinants.  I used Gustavson's recursive (imperative)
>>> version of Crout's method.  The implementation is quite simple and
>>> deserves to be better known by people using functional languages to
>>> do numeric work.  The library can be downloaded from GitHub:
>>> https://github.com/gwright83/luSolve
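>>>
>>> As a small example of what the parity is for: with partial pivoting
>>> P A = L U and a unit-diagonal L, det A = parity * product of the
>>> diagonal of U.  A minimal sketch (the interface to the diagonal is
>>> assumed here, not the library's actual one):
>>>
>>>     import qualified Data.Vector.Unboxed as VU
>>>
>>>     -- determinant from the parity and the diagonal of U
>>>     detFromLU :: Double -> VU.Vector Double -> Double
>>>     detFromLU parity diagU = parity * VU.product diagU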
>>>
>>> The performance scales as expected (as n^3: a linear system 10 times
>>> larger in each dimension takes 1000 times longer to solve):
>>>
>>> Benchmark luSolve-bench: RUNNING...
>>> benchmarking LUSolve/luFactor 100 x 100 matrix
>>> time                 1.944 ms   (1.920 ms .. 1.980 ms)
>>>                      0.996 R²   (0.994 R² .. 0.998 R²)
>>> mean                 1.981 ms   (1.958 ms .. 2.009 ms)
>>> std dev              85.64 μs   (70.21 μs .. 107.7 μs)
>>> variance introduced by outliers: 30% (moderately inflated)
>>>
>>> benchmarking LUSolve/luFactor 500 x 500 matrix
>>> time                 204.3 ms   (198.1 ms .. 208.2 ms)
>>>                      1.000 R²   (0.999 R² .. 1.000 R²)
>>> mean                 203.3 ms   (201.2 ms .. 206.2 ms)
>>> std dev              3.619 ms   (2.307 ms .. 6.231 ms)
>>> variance introduced by outliers: 14% (moderately inflated)
>>>
>>> benchmarking LUSolve/luFactor 1000 x 1000 matrix
>>> time                 1.940 s    (1.685 s .. 2.139 s)
>>>                      0.998 R²   (0.993 R² .. 1.000 R²)
>>> mean                 1.826 s    (1.696 s .. 1.880 s)
>>> std dev              93.63 ms   (5.802 ms .. 117.8 ms)
>>> variance introduced by outliers: 19% (moderately inflated)
>>>
>>> Benchmark luSolve-bench: FINISH
>>>
>>>
>>> The puzzle is why the overall performance is so poor.  When I solve
>>> random 1000 x 1000 systems using the linsys.c example file from the
>>> Recursive LAPACK (ReLAPACK) library -- which implements the same
>>> algorithm -- the average time is only 26 ms.  (I have tweaked
>>> ReLAPACK's dgetrf.c so that it doesn't use optimized routines for
>>> small matrices.  As near as I can tell, the C and Haskell versions
>>> should be doing the same thing.)
>>>
>>> The Haskell version runs 75 times slower.  This is the puzzle.
>>>
>>> My Haskell version uses a mutable matrix of unboxed doubles (from
>>> Kai Zhang's matrices <https://hackage.haskell.org/package/matrices>
>>> library).  Matrix reads and writes are unsafe, so there is no
>>> overhead from bounds checking.
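>>>
>>> For a sense of where the n^3 work goes: the dominant kernel is a
>>> Schur-complement update of the form C <- C - A * B on sub-blocks of
>>> the matrix.  Over the underlying flat, row-major buffer it amounts
>>> to something like the following (a stand-in for illustration, not
>>> the actual code in LUSolve.hs):
>>>
>>>     import Control.Monad (forM_)
>>>     import qualified Data.Vector.Unboxed.Mutable as VUM
>>>
>>>     -- C <- C - A * B, where C is mb x nb, A is mb x kb, B is kb x nb,
>>>     -- all stored in one flat row-major buffer with leading dimension n
>>>     -- at the given top-left corners.
>>>     gemmMinus :: VUM.IOVector Double
>>>               -> Int -> (Int, Int) -> (Int, Int) -> (Int, Int)
>>>               -> Int -> Int -> Int -> IO ()
>>>     gemmMinus buf n (ci, cj) (ai, aj) (bi, bj) mb nb kb =
>>>         forM_ [0 .. mb - 1] $ \i ->
>>>             forM_ [0 .. nb - 1] $ \j ->
>>>                 forM_ [0 .. kb - 1] $ \k -> do
>>>                     a <- VUM.unsafeRead buf ((ai + i) * n + (aj + k))
>>>                     b <- VUM.unsafeRead buf ((bi + k) * n + (bj + j))
>>>                     c <- VUM.unsafeRead buf ((ci + i) * n + (cj + j))
>>>                     VUM.unsafeWrite buf ((ci + i) * n + (cj + j)) (c - a * b)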
>>>
>>> Let's look at the result of profiling:
>>>
>>>         Tue Jul 31 21:07 2018 Time and Allocation Profiling Report  (Final)
>>>
>>>            luSolve-hspec +RTS -N -p -RTS
>>>
>>>         total time  =     7665.31 secs   (7665309 ticks @ 1000 us, 1 processor)
>>>         total alloc = 10,343,030,811,040 bytes  (excludes profiling overheads)
>>>
>>> COST CENTRE            MODULE                              SRC                                                        %time  %alloc
>>>
>>> unsafeWrite            Data.Matrix.Dense.Generic.Mutable   src/Data/Matrix/Dense/Generic/Mutable.hs:(38,5)-(39,38)     17.7    29.4
>>> basicUnsafeWrite       Data.Vector.Primitive.Mutable       Data/Vector/Primitive/Mutable.hs:115:3-69                   14.7    13.0
>>> unsafeRead             Data.Matrix.Dense.Generic.Mutable   src/Data/Matrix/Dense/Generic/Mutable.hs:(34,5)-(35,38)     14.2    20.7
>>> matrixMultiply.\.\.\   Numeric.LinearAlgebra.LUSolve       src/Numeric/LinearAlgebra/LUSolve.hs:(245,54)-(249,86)      13.4    13.5
>>> readByteArray#         Data.Primitive.Types                Data/Primitive/Types.hs:184:30-132                           9.0    15.5
>>> basicUnsafeRead        Data.Vector.Primitive.Mutable       Data/Vector/Primitive/Mutable.hs:112:3-63                    8.8     0.1
>>> triangularSolve.\.\.\  Numeric.LinearAlgebra.LUSolve       src/Numeric/LinearAlgebra/LUSolve.hs:(382,45)-(386,58)       5.2     4.5
>>> matrixMultiply.\.\     Numeric.LinearAlgebra.LUSolve       src/Numeric/LinearAlgebra/LUSolve.hs:(244,54)-(249,86)       4.1     0.3
>>> primitive              Control.Monad.Primitive             Control/Monad/Primitive.hs:152:3-16                          3.8     0.0
>>> basicUnsafeRead        Data.Vector.Unboxed.Base            Data/Vector/Unboxed/Base.hs:278:813-868                      3.3     0.0
>>> basicUnsafeWrite       Data.Vector.Unboxed.Base            Data/Vector/Unboxed/Base.hs:278:872-933                      1.5     0.0
>>> triangularSolve.\.\    Numeric.LinearAlgebra.LUSolve       src/Numeric/LinearAlgebra/LUSolve.hs:(376,33)-(386,58)       1.3     0.1
>>>
>>> <snip>
>>>
>>>
>>> A large amount of time is spent on the invocations of unsafeRead and
>>> unsafeWrite.  This is a bit suspicious -- it looks as if these calls
>>> may not be inlined.  In the Data.Vector.Unboxed.Mutable library,
>>> which provides the underlying linear vector of storage locations,
>>> the unsafeRead and unsafeWrite functions are declared INLINE.  Could
>>> this be a failure of the 'matrices' library to mark its own
>>> unsafeRead/unsafeWrite functions as INLINE or INLINABLE as well?
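>>>
>>> If that is what's happening, the fix would be pragmas in the matrices
>>> library along roughly these lines.  The wrapper type below is only my
>>> guess at the shape of the real one (rows, columns, leading dimension,
>>> offset, flat storage); the actual definition may differ:
>>>
>>>     import Control.Monad.Primitive (PrimMonad, PrimState)
>>>     import qualified Data.Vector.Generic.Mutable as GM
>>>
>>>     -- a hypothetical mutable-matrix wrapper over a flat buffer
>>>     data MMat v s a = MMat !Int !Int !Int !Int !(v s a)
>>>
>>>     {-# INLINE unsafeRead #-}
>>>     unsafeRead :: (PrimMonad m, GM.MVector v a)
>>>                => MMat v (PrimState m) a -> (Int, Int) -> m a
>>>     unsafeRead (MMat _ _ tda off vec) (i, j) =
>>>         GM.unsafeRead vec (off + i * tda + j)
>>>
>>>     {-# INLINE unsafeWrite #-}
>>>     unsafeWrite :: (PrimMonad m, GM.MVector v a)
>>>                 => MMat v (PrimState m) a -> (Int, Int) -> a -> m ()
>>>     unsafeWrite (MMat _ _ tda off vec) (i, j) x =
>>>         GM.unsafeWrite vec (off + i * tda + j) x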
>>>
>>> On the other hand, looking at the core (.dump-simpl) of the library
>>> doesn't show any dictionary passing, and the matrix accesses seem to
>>> go through GHC.Prim.writeDoubleArray# and GHC.Prim.readDoubleArray#.
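>>>
>>> (For anyone who wants to look as well: one way to get such a dump is
>>> to compile the module with something like the options below, which
>>> write the simplified core to a .dump-simpl file alongside the object
>>> code.)
>>>
>>>     {-# OPTIONS_GHC -ddump-simpl -dsuppress-all -ddump-to-file #-}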
>>>
>>> If this program took three to five times longer, I would not be
>>> concerned, but a factor of seventy-five indicates that I've missed
>>> something important.  Can anyone tell me what it is?
>>>
>>> Best Wishes,
>>>
>>> Greg
>>>
>>>
>>>