<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    That's an interesting point.  Could the generation of the random

    matrix be that slow?  Something to check.<br>

    <br>

    In my comparison with dgetrf.c from ReLAPACK, I also used random

    matrices, but measured the execution time from the start of the

    factorization, so I did not include the generation of the random

    matrix.<br>

    <br>

    The one piece of evidence that still points to a performance problem

    is the scaling, since the execution time goes quite accurately as

    n^3 for n * n linear systems.  I would expect the time for

    generation of a random matrix, even if done very inefficiently, to

    scale as n^2.<br>

    <br>

    <br>

    <div class="moz-cite-prefix">On 8/2/18 7:47 PM, Vanessa McHale

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:bac0c282-aab6-3e85-e81b-a9c3935e4f38@iohk.io">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      Looking at your benchmarks you may be benchmarking the wrong

      thing. The function you are benchmarking is <tt>runLUFactor, </tt>which

      generates random matrices in addition to factoring them.<br>

      <tt><br>

      </tt>

      <div class="moz-cite-prefix">On 08/02/2018 05:27 PM, Gregory

        Wright wrote:<br>

      </div>

      <blockquote type="cite"

        cite="mid:4c9d1388-099f-6e00-113e-0da97c6e1041@antiope.com">

        <meta http-equiv="content-type" content="text/html;

          charset=UTF-8">

        <p>Hi,</p>

        <p>Something Haskell has lacked for a long time is a good

          medium-duty linear system solver based on the LU

          decomposition.  There are bindings to the usual C/Fortran

          libraries, but not one in pure Haskell.  (An example "LU

          factorization" routine that does not do partial pivoting has

          been around for years, but lacking pivoting it can fail

          unexpectedly on well-conditioned inputs.  Another Haskell LU

          decomposition using partial pivoting is around, but it uses an

          inefficient representation of the pivot matrix, so it's not

          suited to solving systems of more than 100 x 100, say.)</p>

        <p>By medium duty I mean a linear system solver that can handle

          systems of (1000s) x (1000s) and uses Crout's efficient

          in-place algorithm.  In short, a program does everything short

          of exploiting SIMD vector instructions for solving small

          subproblems.</p>

        <p>Instead of complaining about this, I have written a little

          library that implements this.  It contains an LU factorization

          function and an LU system solver.  The LU factorization also

          returns the parity of the pivots ( = (-1)^(number of row

          swaps) ) so it can be used to calculate determinants.  I used

          Gustavson's recursive (imperative) version of Crout's method. 

          The implementation is quite simple and deserves to be better

          known by people using functional languages to do numeric

          work.  The library can be downloaded from GitHub: <a

            moz-do-not-send="true"

            href="https://github.com/gwright83/luSolve">https://github.com/gwright83/luSolve</a><br>

        </p>

        <p>The performance scales as expected (as n^3, a linear system

          10 times larger in each dimension takes a 1000 times longer to

          solve):<br>

        </p>

        <p><tt>Benchmark luSolve-bench: RUNNING...</tt><tt><br>

          </tt><tt>benchmarking LUSolve/luFactor 100 x 100 matrix</tt><tt><br>

          </tt><tt>time                 1.944 ms   (1.920 ms .. 1.980

            ms)</tt><tt><br>

          </tt><tt>                     0.996 R²   (0.994 R² .. 0.998

            R²)</tt><tt><br>

          </tt><tt>mean                 1.981 ms   (1.958 ms .. 2.009

            ms)</tt><tt><br>

          </tt><tt>std dev              85.64 μs   (70.21 μs .. 107.7

            μs)</tt><tt><br>

          </tt><tt>variance introduced by outliers: 30% (moderately

            inflated)</tt><tt><br>

          </tt><tt><br>

          </tt><tt>benchmarking LUSolve/luFactor 500 x 500 matrix</tt><tt><br>

          </tt><tt>time                 204.3 ms   (198.1 ms .. 208.2

            ms)</tt><tt><br>

          </tt><tt>                     1.000 R²   (0.999 R² .. 1.000

            R²)</tt><tt><br>

          </tt><tt>mean                 203.3 ms   (201.2 ms .. 206.2

            ms)</tt><tt><br>

          </tt><tt>std dev              3.619 ms   (2.307 ms .. 6.231

            ms)</tt><tt><br>

          </tt><tt>variance introduced by outliers: 14% (moderately

            inflated)</tt><tt><br>

          </tt><tt><br>

          </tt><tt>benchmarking LUSolve/luFactor 1000 x 1000 matrix</tt><tt><br>

          </tt><tt>time                 1.940 s    (1.685 s .. 2.139 s)</tt><tt><br>

          </tt><tt>                     0.998 R²   (0.993 R² .. 1.000

            R²)</tt><tt><br>

          </tt><tt>mean                 1.826 s    (1.696 s .. 1.880 s)</tt><tt><br>

          </tt><tt>std dev              93.63 ms   (5.802 ms .. 117.8

            ms)</tt><tt><br>

          </tt><tt>variance introduced by outliers: 19% (moderately

            inflated)</tt><tt><br>

          </tt><tt><br>

          </tt><tt>Benchmark luSolve-bench: FINISH</tt></p>

        <p><tt><br>

          </tt></p>

        <p>The puzzle is why the overall performance is so poor.  When I

          solve random 1000 x 1000 systems using the linsys.c example

          file from the Recursive LAPACK (ReLAPACK) library -- which

          implements the same algorithm -- the average time is only 26

          ms.  (I have tweaked ReLAPACK's  dgetrf.c so that it doesn't

          use optimized routines for small matrices.  As near as I can

          make it, the C and haskell versions should be doing the same

          thing.)</p>

        <p>The haskell version runs 75 times slower.  This is the

          puzzle.</p>

        <p>My haskell version uses a mutable, matrix of unboxed doubles

          (from Kai Zhang's <a moz-do-not-send="true"

            href="https://hackage.haskell.org/package/matrices">matrices</a>

          library).  Matrix reads and writes are unsafe, so there is no

          overhead from bounds checking.</p>

        <p>Let's look at the result of profiling:</p>

        <p><tt>        Tue Jul 31 21:07 2018 Time and Allocation

            Profiling Report  (Final)</tt><tt><br>

          </tt><tt><br>

          </tt><tt>           luSolve-hspec +RTS -N -p -RTS</tt><tt><br>

          </tt><tt><br>

          </tt><tt>        total time  =     7665.31 secs   (7665309

            ticks @ 1000 us, 1 processor)</tt><tt><br>

          </tt><tt>        total alloc = 10,343,030,811,040 bytes 

            (excludes profiling overheads)</tt><tt><br>

          </tt><tt><br>

          </tt><tt>COST CENTRE          

            MODULE                           

            SRC                                                     

            %time %alloc</tt><tt><br>

          </tt><tt><br>

          </tt><tt>unsafeWrite          

            Data.Matrix.Dense.Generic.Mutable

            src/Data/Matrix/Dense/Generic/Mutable.hs:(38,5)-(39,38)  

            17.7   29.4</tt><tt><br>

          </tt><tt>basicUnsafeWrite     

            Data.Vector.Primitive.Mutable    

            Data/Vector/Primitive/Mutable.hs:115:3-69                

            14.7   13.0</tt><tt><br>

          </tt><tt>unsafeRead           

            Data.Matrix.Dense.Generic.Mutable

            src/Data/Matrix/Dense/Generic/Mutable.hs:(34,5)-(35,38)  

            14.2   20.7</tt><tt><br>

          </tt><tt>matrixMultiply.\.\.\ 

            Numeric.LinearAlgebra.LUSolve    

            src/Numeric/LinearAlgebra/LUSolve.hs:(245,54)-(249,86)   

            13.4   13.5</tt><tt><br>

          </tt><tt>readByteArray#       

            Data.Primitive.Types             

            Data/Primitive/Types.hs:184:30-132                        

            9.0   15.5</tt><tt><br>

          </tt><tt>basicUnsafeRead      

            Data.Vector.Primitive.Mutable    

            Data/Vector/Primitive/Mutable.hs:112:3-63                 

            8.8    0.1</tt><tt><br>

          </tt><tt>triangularSolve.\.\.\

            Numeric.LinearAlgebra.LUSolve    

            src/Numeric/LinearAlgebra/LUSolve.hs:(382,45)-(386,58)    

            5.2    4.5</tt><tt><br>

          </tt><tt>matrixMultiply.\.\   

            Numeric.LinearAlgebra.LUSolve    

            src/Numeric/LinearAlgebra/LUSolve.hs:(244,54)-(249,86)    

            4.1    0.3</tt><tt><br>

          </tt><tt>primitive            

            Control.Monad.Primitive          

            Control/Monad/Primitive.hs:152:3-16                       

            3.8    0.0</tt><tt><br>

          </tt><tt>basicUnsafeRead      

            Data.Vector.Unboxed.Base         

            Data/Vector/Unboxed/Base.hs:278:813-868                   

            3.3    0.0</tt><tt><br>

          </tt><tt>basicUnsafeWrite     

            Data.Vector.Unboxed.Base         

            Data/Vector/Unboxed/Base.hs:278:872-933                   

            1.5    0.0</tt><tt><br>

          </tt><tt>triangularSolve.\.\  

            Numeric.LinearAlgebra.LUSolve    

            src/Numeric/LinearAlgebra/LUSolve.hs:(376,33)-(386,58)    

            1.3    0.1</tt><tt><br>

          </tt><tt><br>

          </tt><tt><snip></tt></p>

        <p><tt><br>

          </tt></p>

        <p>A large amount of time is spent on the invocations of

          unsafeRead and unsafeWrite.  This is a bit suspicious -- it

          looks as if these call may not be inlined.  In the

          Data.Vector.Unboxed.Mutable library, which provides the

          underlying linear vector of storage locations, the unsafeRead

          and unsafeWrite functions are declared INLINE.  Could this be

          a failure of the 'matrices' library to mark its

          unsafeRead/Write functions as INLINE or SPECIALIZABLE as well?</p>

        <p>On the other hand, looking at the core (.dump-simpl) of the

          library doesn't show any dictionary passing, and the access to

          matrix seem to be through GHC.Prim.writeDoubleArray# and

          GHC.Prim.readDoubleArray#.</p>

        <p>If this program took three to five times longer, I would not

          be concerned, but a factor of seventy five indicates that I've

          missed something important.  Can anyone tell me what it is?</p>

        <p>Best Wishes,</p>

        <p>Greg<br>

        </p>

        <br>

        <fieldset class="mimeAttachmentHeader"></fieldset>

        <br>

        <pre wrap="">_______________________________________________

Haskell-Cafe mailing list

To (un)subscribe, modify options or view archives go to:

<a class="moz-txt-link-freetext" href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" moz-do-not-send="true">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a>

Only members subscribed via the mailman list are allowed to post.</pre>

      </blockquote>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br>

      <pre wrap="">_______________________________________________

Haskell-Cafe mailing list

To (un)subscribe, modify options or view archives go to:

<a class="moz-txt-link-freetext" href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a>

Only members subscribed via the mailman list are allowed to post.</pre>

    </blockquote>

    <br>

  </body>

</html>