<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>I agree; I think the main problem would still remain once that
was accounted for, but it may be worth doing correctly nonetheless
:)<br>
<br>
The only thing I can think of off the top of my head is using
<tt>array</tt> rather than <tt>vector</tt>. I am not sure that
would fix the performance problem, but it would bring the code
closer to the C implementation. You might also consider looking
at a heap profile or using threadscope. <br>
</p>
<br>
<div class="moz-cite-prefix">On 08/02/2018 07:16 PM, Gregory Wright
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:e6fc7f54-b04d-1284-32f6-c734bcf6b250@antiope.com">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
That's an interesting point. Could the generation of the random
matrix be that slow? Something to check.<br>
<br>
In my comparison with dgetrf.c from ReLAPACK, I also used random
matrices, but measured the execution time from the start of the
factorization, so I did not include the generation of the random
matrix.<br>
<br>
The one piece of evidence that still points to a performance
problem is the scaling, since the execution time goes quite
accurately as n^3 for n * n linear systems. I would expect the
time for generation of a random matrix, even if done very
inefficiently, to scale as n^2.<br>
<br>
<br>
<div class="moz-cite-prefix">On 8/2/18 7:47 PM, Vanessa McHale
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:bac0c282-aab6-3e85-e81b-a9c3935e4f38@iohk.io">
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">
Looking at your benchmarks, you may be benchmarking the wrong
thing. The function you are benchmarking is <tt>runLUFactor</tt>,
which generates random matrices in addition to factoring them.<br>
<tt><br>
</tt>
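<p>A minimal sketch of the fix, assuming criterion (the names
<tt>randomMatrix</tt> and <tt>luFactor</tt> are placeholders for
whatever luSolve actually exports): criterion's <tt>env</tt>
builds the input once, outside the timed region, so only the
factorization itself is measured.</p>

```haskell
import Criterion.Main

-- Sketch only: 'randomMatrix' and 'luFactor' stand in for the actual
-- generator and factorization routine used in the luSolve benchmarks.
main :: IO ()
main = defaultMain
  [ env (randomMatrix 1000 1000) $ \m ->   -- generated once, not timed
      bench "luFactor 1000 x 1000 matrix" (nf luFactor m)
  ]
```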
<div class="moz-cite-prefix">On 08/02/2018 05:27 PM, Gregory
Wright wrote:<br>
</div>
<blockquote type="cite"
cite="mid:4c9d1388-099f-6e00-113e-0da97c6e1041@antiope.com">
<meta http-equiv="content-type" content="text/html;
charset=utf-8">
<p>Hi,</p>
<p>Something Haskell has lacked for a long time is a good
medium-duty linear system solver based on the LU
decomposition. There are bindings to the usual C/Fortran
libraries, but not one in pure Haskell. (An example "LU
factorization" routine that does not do partial pivoting has
been around for years, but lacking pivoting it can fail
unexpectedly on well-conditioned inputs. Another Haskell LU
decomposition using partial pivoting is around, but it uses
an inefficient representation of the pivot matrix, so it's
not suited to solving systems larger than, say, 100 x 100.)</p>
<p>By medium duty I mean a linear system solver that can
handle systems of (1000s) x (1000s) and uses Crout's
efficient in-place algorithm. In short, a program that does
everything short of exploiting SIMD vector instructions for
solving small subproblems.</p>
<p>Instead of complaining about the gap, I have written a little
library to fill it. It contains an LU
factorization function and an LU system solver. The LU
factorization also returns the parity of the pivots ( =
(-1)^(number of row swaps) ) so it can be used to calculate
determinants. I used Gustavson's recursive (imperative)
version of Crout's method. The implementation is quite
simple and deserves to be better known by people using
functional languages to do numeric work. The library can be
downloaded from GitHub: <a moz-do-not-send="true"
href="https://github.com/gwright83/luSolve">https://github.com/gwright83/luSolve</a><br>
</p>
<p>The performance scales as expected (as n^3; a linear system
10 times larger in each dimension takes 1000 times longer
to solve):<br>
</p>
<pre>Benchmark luSolve-bench: RUNNING...
benchmarking LUSolve/luFactor 100 x 100 matrix
time                 1.944 ms   (1.920 ms .. 1.980 ms)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 1.981 ms   (1.958 ms .. 2.009 ms)
std dev              85.64 μs   (70.21 μs .. 107.7 μs)
variance introduced by outliers: 30% (moderately inflated)

benchmarking LUSolve/luFactor 500 x 500 matrix
time                 204.3 ms   (198.1 ms .. 208.2 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 203.3 ms   (201.2 ms .. 206.2 ms)
std dev              3.619 ms   (2.307 ms .. 6.231 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking LUSolve/luFactor 1000 x 1000 matrix
time                 1.940 s    (1.685 s .. 2.139 s)
                     0.998 R²   (0.993 R² .. 1.000 R²)
mean                 1.826 s    (1.696 s .. 1.880 s)
std dev              93.63 ms   (5.802 ms .. 117.8 ms)
variance introduced by outliers: 19% (moderately inflated)

Benchmark luSolve-bench: FINISH</pre>
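<p>As a quick sanity check, the quoted timings can be compared
against ideal cubic scaling (a base-only sketch, using the numbers
above):</p>

```haskell
-- Compare measured time ratios with the ideal k^3 predicted by
-- cubic scaling when the matrix dimension grows by a factor of k.
main :: IO ()
main = do
  let times = [(100, 1.944e-3), (500, 204.3e-3), (1000, 1.940)] :: [(Int, Double)]
      check ((n1, t1), (n2, t2)) =
        putStrLn $ show n1 ++ " -> " ++ show n2
                ++ ": measured x" ++ show (t2 / t1)
                ++ ", ideal x"
                ++ show ((fromIntegral n2 / fromIntegral n1) ** 3 :: Double)
  mapM_ check (zip times (tail times))
```

<p>The measured ratios (about 105x going from 100 to 500, and 9.5x
from 500 to 1000) bracket the ideal 125x and 8x, consistent with
n^3 scaling.</p>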
<p>The puzzle is why the overall performance is so poor. When
I solve random 1000 x 1000 systems using the linsys.c
example file from the Recursive LAPACK (ReLAPACK) library --
which implements the same algorithm -- the average time is
only 26 ms. (I have tweaked ReLAPACK's dgetrf.c so that it
doesn't use optimized routines for small matrices. As near
as I can tell, the C and Haskell versions should be doing
the same thing.)</p>
<p>The Haskell version runs 75 times slower. This is the
puzzle.</p>
<p>My Haskell version uses a mutable matrix of unboxed
doubles (from Kai Zhang's <a moz-do-not-send="true"
href="https://hackage.haskell.org/package/matrices">matrices</a>
library). Matrix reads and writes are unsafe, so there is
no overhead from bounds checking.</p>
<p>Let's look at the result of profiling:</p>
<pre>        Tue Jul 31 21:07 2018 Time and Allocation Profiling Report  (Final)

           luSolve-hspec +RTS -N -p -RTS

        total time  = 7665.31 secs  (7665309 ticks @ 1000 us, 1 processor)
        total alloc = 10,343,030,811,040 bytes  (excludes profiling overheads)

COST CENTRE            MODULE                             SRC                                                      %time  %alloc

unsafeWrite            Data.Matrix.Dense.Generic.Mutable  src/Data/Matrix/Dense/Generic/Mutable.hs:(38,5)-(39,38)   17.7   29.4
basicUnsafeWrite       Data.Vector.Primitive.Mutable      Data/Vector/Primitive/Mutable.hs:115:3-69                 14.7   13.0
unsafeRead             Data.Matrix.Dense.Generic.Mutable  src/Data/Matrix/Dense/Generic/Mutable.hs:(34,5)-(35,38)   14.2   20.7
matrixMultiply.\.\.\   Numeric.LinearAlgebra.LUSolve      src/Numeric/LinearAlgebra/LUSolve.hs:(245,54)-(249,86)    13.4   13.5
readByteArray#         Data.Primitive.Types               Data/Primitive/Types.hs:184:30-132                         9.0   15.5
basicUnsafeRead        Data.Vector.Primitive.Mutable      Data/Vector/Primitive/Mutable.hs:112:3-63                  8.8    0.1
triangularSolve.\.\.\  Numeric.LinearAlgebra.LUSolve      src/Numeric/LinearAlgebra/LUSolve.hs:(382,45)-(386,58)     5.2    4.5
matrixMultiply.\.\     Numeric.LinearAlgebra.LUSolve      src/Numeric/LinearAlgebra/LUSolve.hs:(244,54)-(249,86)     4.1    0.3
primitive              Control.Monad.Primitive            Control/Monad/Primitive.hs:152:3-16                        3.8    0.0
basicUnsafeRead        Data.Vector.Unboxed.Base           Data/Vector/Unboxed/Base.hs:278:813-868                    3.3    0.0
basicUnsafeWrite       Data.Vector.Unboxed.Base           Data/Vector/Unboxed/Base.hs:278:872-933                    1.5    0.0
triangularSolve.\.\    Numeric.LinearAlgebra.LUSolve      src/Numeric/LinearAlgebra/LUSolve.hs:(376,33)-(386,58)     1.3    0.1

&lt;snip&gt;</pre>
<p>A large amount of time is spent in the invocations of
unsafeRead and unsafeWrite. This is a bit suspicious: it
looks as if these calls may not be inlined. In the
Data.Vector.Unboxed.Mutable library, which provides the
underlying linear vector of storage locations, the
unsafeRead and unsafeWrite functions are declared INLINE.
Could this be a failure of the 'matrices' library to mark
its unsafeRead/Write functions as INLINE or INLINABLE as
well?</p>
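<p>If that is the cause, the fix would be pragmas in the matrices
library itself. A minimal sketch of the idea (the wrapper names
<tt>readElem</tt>/<tt>writeElem</tt> are hypothetical, standing in
for the library's unsafeRead/unsafeWrite):</p>

```haskell
-- Thin wrappers over the array primitives, like those the matrices
-- library defines around its underlying vector. Without the INLINE
-- pragmas a call site can remain an out-of-line function call; with
-- them, GHC can reduce each call to a readDoubleArray#/
-- writeDoubleArray# primop at the use site.
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.IO (IOUArray, newArray)

{-# INLINE readElem #-}
readElem :: IOUArray Int Double -> Int -> IO Double
readElem = unsafeRead

{-# INLINE writeElem #-}
writeElem :: IOUArray Int Double -> Int -> Double -> IO ()
writeElem = unsafeWrite

main :: IO ()
main = do
  arr <- newArray (0, 9) 0 :: IO (IOUArray Int Double)
  writeElem arr 3 42
  print =<< readElem arr 3   -- prints 42.0
```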
<p>On the other hand, looking at the core (.dump-simpl) of the
library doesn't show any dictionary passing, and the accesses
to the matrix seem to go through GHC.Prim.writeDoubleArray#
and GHC.Prim.readDoubleArray#.</p>
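<p>(For anyone reproducing this: the Core dump referred to above
can be produced with standard GHC flags, for example per module:)</p>

```haskell
{-# OPTIONS_GHC -ddump-simpl -dsuppress-all -ddump-to-file #-}
-- Writes the simplified Core for this module to a .dump-simpl file;
-- grep it for out-of-line calls to unsafeRead/unsafeWrite versus
-- direct uses of readDoubleArray#/writeDoubleArray#.
```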
<p>If this program took three to five times longer, I would
not be concerned, but a factor of seventy five indicates
that I've missed something important. Can anyone tell me
what it is?</p>
<p>Best Wishes,</p>
<p>Greg<br>
</p>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
<a class="moz-txt-link-freetext" href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" moz-do-not-send="true">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a>
Only members subscribed via the mailman list are allowed to post.</pre>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
</body>
</html>