[Haskell-cafe] A Performance Puzzle
Vanessa McHale
vanessa.mchale at iohk.io
Fri Aug 3 00:34:15 UTC 2018
I agree, I think the main problem would still remain once that was
accounted for, but it may be worth doing correctly nonetheless :)
The only thing I can think of off the top of my head is using array
rather thanvector. I am not sure that would fix performance, but it
would make it closer to the C implementation. You might also consider
looking at a heap profile or using threadscope.
On 08/02/2018 07:16 PM, Gregory Wright wrote:
> That's an interesting point. Could the generation of the random
> matrix be that slow? Something to check.
>
> In my comparison with dgetrf.c from ReLAPACK, I also used random
> matrices, but measured the execution time from the start of the
> factorization, so I did not include the generation of the random matrix.
>
> The one piece of evidence that still points to a performance problem
> is the scaling, since the execution time goes quite accurately as n^3
> for n * n linear systems. I would expect the time for generation of a
> random matrix, even if done very inefficiently, to scale as n^2.
>
>
> On 8/2/18 7:47 PM, Vanessa McHale wrote:
>> Looking at your benchmarks you may be benchmarking the wrong thing.
>> The function you are benchmarking is runLUFactor, which generates
>> random matrices in addition to factoring them.
>>
>> On 08/02/2018 05:27 PM, Gregory Wright wrote:
>>>
>>> Hi,
>>>
>>> Something Haskell has lacked for a long time is a good medium-duty
>>> linear system solver based on the LU decomposition. There are
>>> bindings to the usual C/Fortran libraries, but not one in pure
>>> Haskell. (An example "LU factorization" routine that does not do
>>> partial pivoting has been around for years, but lacking pivoting it
>>> can fail unexpectedly on well-conditioned inputs. Another Haskell
>>> LU decomposition using partial pivoting is around, but it uses an
>>> inefficient representation of the pivot matrix, so it's not suited
>>> to solving systems of more than 100 x 100, say.)
>>>
>>> By medium duty I mean a linear system solver that can handle systems
>>> of (1000s) x (1000s) and uses Crout's efficient in-place algorithm.
>>> In short, a program does everything short of exploiting SIMD vector
>>> instructions for solving small subproblems.
>>>
>>> Instead of complaining about this, I have written a little library
>>> that implements this. It contains an LU factorization function and
>>> an LU system solver. The LU factorization also returns the parity
>>> of the pivots ( = (-1)^(number of row swaps) ) so it can be used to
>>> calculate determinants. I used Gustavson's recursive (imperative)
>>> version of Crout's method. The implementation is quite simple and
>>> deserves to be better known by people using functional languages to
>>> do numeric work. The library can be downloaded from GitHub:
>>> https://github.com/gwright83/luSolve
>>>
>>> The performance scales as expected (as n^3, a linear system 10 times
>>> larger in each dimension takes a 1000 times longer to solve):
>>>
>>> Benchmark luSolve-bench: RUNNING...
>>> benchmarking LUSolve/luFactor 100 x 100 matrix
>>> time 1.944 ms (1.920 ms .. 1.980 ms)
>>> 0.996 R² (0.994 R² .. 0.998 R²)
>>> mean 1.981 ms (1.958 ms .. 2.009 ms)
>>> std dev 85.64 μs (70.21 μs .. 107.7 μs)
>>> variance introduced by outliers: 30% (moderately inflated)
>>>
>>> benchmarking LUSolve/luFactor 500 x 500 matrix
>>> time 204.3 ms (198.1 ms .. 208.2 ms)
>>> 1.000 R² (0.999 R² .. 1.000 R²)
>>> mean 203.3 ms (201.2 ms .. 206.2 ms)
>>> std dev 3.619 ms (2.307 ms .. 6.231 ms)
>>> variance introduced by outliers: 14% (moderately inflated)
>>>
>>> benchmarking LUSolve/luFactor 1000 x 1000 matrix
>>> time 1.940 s (1.685 s .. 2.139 s)
>>> 0.998 R² (0.993 R² .. 1.000 R²)
>>> mean 1.826 s (1.696 s .. 1.880 s)
>>> std dev 93.63 ms (5.802 ms .. 117.8 ms)
>>> variance introduced by outliers: 19% (moderately inflated)
>>>
>>> Benchmark luSolve-bench: FINISH
>>>
>>>
>>> The puzzle is why the overall performance is so poor. When I solve
>>> random 1000 x 1000 systems using the linsys.c example file from the
>>> Recursive LAPACK (ReLAPACK) library -- which implements the same
>>> algorithm -- the average time is only 26 ms. (I have tweaked
>>> ReLAPACK's dgetrf.c so that it doesn't use optimized routines for
>>> small matrices. As near as I can make it, the C and haskell
>>> versions should be doing the same thing.)
>>>
>>> The haskell version runs 75 times slower. This is the puzzle.
>>>
>>> My haskell version uses a mutable, matrix of unboxed doubles (from
>>> Kai Zhang's matrices <https://hackage.haskell.org/package/matrices>
>>> library). Matrix reads and writes are unsafe, so there is no
>>> overhead from bounds checking.
>>>
>>> Let's look at the result of profiling:
>>>
>>> Tue Jul 31 21:07 2018 Time and Allocation Profiling Report
>>> (Final)
>>>
>>> luSolve-hspec +RTS -N -p -RTS
>>>
>>> total time = 7665.31 secs (7665309 ticks @ 1000 us, 1
>>> processor)
>>> total alloc = 10,343,030,811,040 bytes (excludes profiling
>>> overheads)
>>>
>>> COST CENTRE MODULE
>>> SRC %time %alloc
>>>
>>> unsafeWrite Data.Matrix.Dense.Generic.Mutable
>>> src/Data/Matrix/Dense/Generic/Mutable.hs:(38,5)-(39,38) 17.7 29.4
>>> basicUnsafeWrite Data.Vector.Primitive.Mutable
>>> Data/Vector/Primitive/Mutable.hs:115:3-69 14.7 13.0
>>> unsafeRead Data.Matrix.Dense.Generic.Mutable
>>> src/Data/Matrix/Dense/Generic/Mutable.hs:(34,5)-(35,38) 14.2 20.7
>>> matrixMultiply.\.\.\ Numeric.LinearAlgebra.LUSolve
>>> src/Numeric/LinearAlgebra/LUSolve.hs:(245,54)-(249,86) 13.4 13.5
>>> readByteArray# Data.Primitive.Types
>>> Data/Primitive/Types.hs:184:30-132 9.0 15.5
>>> basicUnsafeRead Data.Vector.Primitive.Mutable
>>> Data/Vector/Primitive/Mutable.hs:112:3-63 8.8 0.1
>>> triangularSolve.\.\.\ Numeric.LinearAlgebra.LUSolve
>>> src/Numeric/LinearAlgebra/LUSolve.hs:(382,45)-(386,58) 5.2 4.5
>>> matrixMultiply.\.\ Numeric.LinearAlgebra.LUSolve
>>> src/Numeric/LinearAlgebra/LUSolve.hs:(244,54)-(249,86) 4.1 0.3
>>> primitive Control.Monad.Primitive
>>> Control/Monad/Primitive.hs:152:3-16 3.8 0.0
>>> basicUnsafeRead Data.Vector.Unboxed.Base
>>> Data/Vector/Unboxed/Base.hs:278:813-868 3.3 0.0
>>> basicUnsafeWrite Data.Vector.Unboxed.Base
>>> Data/Vector/Unboxed/Base.hs:278:872-933 1.5 0.0
>>> triangularSolve.\.\ Numeric.LinearAlgebra.LUSolve
>>> src/Numeric/LinearAlgebra/LUSolve.hs:(376,33)-(386,58) 1.3 0.1
>>>
>>> <snip>
>>>
>>>
>>> A large amount of time is spent on the invocations of unsafeRead and
>>> unsafeWrite. This is a bit suspicious -- it looks as if these call
>>> may not be inlined. In the Data.Vector.Unboxed.Mutable library,
>>> which provides the underlying linear vector of storage locations,
>>> the unsafeRead and unsafeWrite functions are declared INLINE. Could
>>> this be a failure of the 'matrices' library to mark its
>>> unsafeRead/Write functions as INLINE or SPECIALIZABLE as well?
>>>
>>> On the other hand, looking at the core (.dump-simpl) of the library
>>> doesn't show any dictionary passing, and the access to matrix seem
>>> to be through GHC.Prim.writeDoubleArray# and GHC.Prim.readDoubleArray#.
>>>
>>> If this program took three to five times longer, I would not be
>>> concerned, but a factor of seventy five indicates that I've missed
>>> something important. Can anyone tell me what it is?
>>>
>>> Best Wishes,
>>>
>>> Greg
>>>
>>>
>>>
>>> _______________________________________________
>>> Haskell-Cafe mailing list
>>> To (un)subscribe, modify options or view archives go to:
>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>>> Only members subscribed via the mailman list are allowed to post.
>>
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20180802/b696257d/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20180802/b696257d/attachment.sig>
More information about the Haskell-Cafe
mailing list