<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi,</p>
<p>Something Haskell has lacked for a long time is a good
medium-duty linear system solver based on the LU decomposition.
There are bindings to the usual C/Fortran libraries, but not one
in pure Haskell. (An example "LU factorization" routine that does
not do partial pivoting has been around for years, but without
pivoting it can fail unexpectedly, even on well-conditioned inputs.
Another Haskell LU decomposition that does use partial pivoting
exists, but it uses an inefficient representation of the pivot
matrix, so it isn't suited to systems much larger than
100 x 100, say.)</p>
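<p>The usual cure for that last problem is to store the row permutation as a
flat vector of indices rather than as an explicit permutation matrix, so that
applying it to a right-hand side costs O(n) instead of O(n^2). A minimal sketch
(applyPivots is my own name, not taken from either package):</p>
<pre>import qualified Data.Vector.Unboxed as U

-- The permutation is a vector of row indices: entry i records which
-- original row ends up in position i.  Applying it is a gather, not a
-- matrix multiplication.
applyPivots :: U.Unbox a => U.Vector Int -> U.Vector a -> U.Vector a
applyPivots perm b = U.backpermute b perm

main :: IO ()
main = print (applyPivots (U.fromList [2, 0, 1])
                          (U.fromList [10, 20, 30 :: Double]))
-- prints [30.0,10.0,20.0]
</pre>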
<p>By medium duty I mean a linear system solver that can handle
systems of (1000s) x (1000s) and uses Crout's efficient in-place
algorithm. In short, a program that does everything short of
exploiting SIMD vector instructions to solve small subproblems.</p>
<p>Rather than just complain about this, I have written a little
library that fills the gap. It contains an LU factorization
function and an LU system solver. The LU factorization also
returns the parity of the pivots ( = (-1)^(number of row swaps) )
so it can be used to calculate determinants. I used Gustavson's
recursive (imperative) version of Crout's method. The
implementation is quite simple and deserves to be better known by
people using functional languages to do numeric work. The library
can be downloaded from GitHub: <a
href="https://github.com/gwright83/luSolve">https://github.com/gwright83/luSolve</a><br>
</p>
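<p>As an aside on using the parity for determinants, here is a minimal,
self-contained sketch of the idea, written over plain lists rather than the
library's mutable matrices (detViaLU is an invented name, not part of luSolve's
API). The accumulator carries the pivot parity folded together with the product
of the pivots:</p>
<pre>import Data.List (maximumBy)
import Data.Ord  (comparing)

-- LU elimination with partial pivoting on a square matrix given as rows;
-- det A = (sign of the row permutation) * (product of the pivots).
detViaLU :: [[Double]] -> Double
detViaLU = go 1
  where
    go acc []   = acc
    go acc rows =
      let -- partial pivoting: take the row with the largest leading entry
          (i, pivotRow) = maximumBy (comparing (abs . head . snd))
                                    (zip [0 :: Int ..] rows)
          rest = [ row | (j, row) <- zip [0 ..] rows, j /= i ]
          -- moving row i to the front is a cycle of i transpositions,
          -- so the sign flips exactly when i is odd
          acc' = if even i then acc else negate acc
          p    = head pivotRow
          -- subtract a multiple of the pivot row and drop the leading column
          reduce row = zipWith (\a b -> a - (head row / p) * b)
                               (tail row) (tail pivotRow)
      in go (acc' * p) (map reduce rest)

main :: IO ()
main = print (detViaLU [[2,1,1],[4,3,3],[8,7,9]])   -- determinant is 4
</pre>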
<p>The performance scales as expected (as n^3: a linear system 10
times larger in each dimension takes 1000 times longer to
solve):<br>
</p>
<pre>Benchmark luSolve-bench: RUNNING...
benchmarking LUSolve/luFactor 100 x 100 matrix
time                 1.944 ms   (1.920 ms .. 1.980 ms)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 1.981 ms   (1.958 ms .. 2.009 ms)
std dev              85.64 μs   (70.21 μs .. 107.7 μs)
variance introduced by outliers: 30% (moderately inflated)

benchmarking LUSolve/luFactor 500 x 500 matrix
time                 204.3 ms   (198.1 ms .. 208.2 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 203.3 ms   (201.2 ms .. 206.2 ms)
std dev              3.619 ms   (2.307 ms .. 6.231 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking LUSolve/luFactor 1000 x 1000 matrix
time                 1.940 s    (1.685 s .. 2.139 s)
                     0.998 R²   (0.993 R² .. 1.000 R²)
mean                 1.826 s    (1.696 s .. 1.880 s)
std dev              93.63 ms   (5.802 ms .. 117.8 ms)
variance introduced by outliers: 19% (moderately inflated)

Benchmark luSolve-bench: FINISH</pre>
<p>The puzzle is why the overall performance is so poor. When I
solve random 1000 x 1000 systems using the linsys.c example file
from the Recursive LAPACK (ReLAPACK) library -- which implements
the same algorithm -- the average time is only 26 ms. (I have
tweaked ReLAPACK's dgetrf.c so that it doesn't use optimized
routines for small matrices. As near as I can tell, the C and
Haskell versions should be doing the same thing.)</p>
<p>The Haskell version runs 75 times slower. This is the puzzle.</p>
<p>My Haskell version uses a mutable matrix of unboxed doubles
(from Kai Zhang's <a
href="https://hackage.haskell.org/package/matrices">matrices</a>
library). Matrix reads and writes are unsafe, so there is no
overhead from bounds checking.</p>
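<p>For concreteness, the inner loops in question look roughly like the
following. This is a standalone sketch written directly against
Data.Vector.Unboxed.Mutable rather than the matrices package (schurUpdate and
the flat row-major indexing are mine, not luSolve's code); it shows the kind of
update the profile below attributes to matrixMultiply:</p>
<pre>import Control.Monad (forM_)
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed         as U
import qualified Data.Vector.Unboxed.Mutable as UM

-- C := C - A * B for n x n blocks stored row-major in flat vectors.
schurUpdate :: Int -> U.Vector Double -> U.Vector Double
            -> U.Vector Double -> U.Vector Double
schurUpdate n a b c0 = runST $ do
  c <- U.thaw c0                          -- work on a mutable copy
  forM_ [0 .. n - 1] $ \i ->
    forM_ [0 .. n - 1] $ \j ->
      forM_ [0 .. n - 1] $ \k -> do
        let aik = U.unsafeIndex a (i * n + k)
            bkj = U.unsafeIndex b (k * n + j)
        cij <- UM.unsafeRead c (i * n + j)
        UM.unsafeWrite c (i * n + j) (cij - aik * bkj)
  U.unsafeFreeze c

main :: IO ()
main = print (schurUpdate 2 (U.fromList [1, 0, 0, 1])    -- A = identity
                            (U.fromList [1, 2, 3, 4])    -- B
                            (U.fromList [10, 20, 30, 40]))
-- prints [9.0,18.0,27.0,36.0]
</pre>
<p>Each pass through the innermost loop is a couple of unsafe element reads and
one unsafe write, so any per-call overhead on those accessors is paid roughly
n^3 times.</p>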
<p>Let's look at the result of profiling:</p>
<pre>        Tue Jul 31 21:07 2018 Time and Allocation Profiling Report  (Final)

           luSolve-hspec +RTS -N -p -RTS

        total time  =     7665.31 secs   (7665309 ticks @ 1000 us, 1 processor)
        total alloc = 10,343,030,811,040 bytes  (excludes profiling overheads)

COST CENTRE            MODULE                              SRC                                                        %time %alloc

unsafeWrite            Data.Matrix.Dense.Generic.Mutable   src/Data/Matrix/Dense/Generic/Mutable.hs:(38,5)-(39,38)     17.7   29.4
basicUnsafeWrite       Data.Vector.Primitive.Mutable       Data/Vector/Primitive/Mutable.hs:115:3-69                   14.7   13.0
unsafeRead             Data.Matrix.Dense.Generic.Mutable   src/Data/Matrix/Dense/Generic/Mutable.hs:(34,5)-(35,38)     14.2   20.7
matrixMultiply.\.\.\   Numeric.LinearAlgebra.LUSolve       src/Numeric/LinearAlgebra/LUSolve.hs:(245,54)-(249,86)      13.4   13.5
readByteArray#         Data.Primitive.Types                Data/Primitive/Types.hs:184:30-132                           9.0   15.5
basicUnsafeRead        Data.Vector.Primitive.Mutable       Data/Vector/Primitive/Mutable.hs:112:3-63                    8.8    0.1
triangularSolve.\.\.\  Numeric.LinearAlgebra.LUSolve       src/Numeric/LinearAlgebra/LUSolve.hs:(382,45)-(386,58)       5.2    4.5
matrixMultiply.\.\     Numeric.LinearAlgebra.LUSolve       src/Numeric/LinearAlgebra/LUSolve.hs:(244,54)-(249,86)       4.1    0.3
primitive              Control.Monad.Primitive             Control/Monad/Primitive.hs:152:3-16                          3.8    0.0
basicUnsafeRead        Data.Vector.Unboxed.Base            Data/Vector/Unboxed/Base.hs:278:813-868                      3.3    0.0
basicUnsafeWrite       Data.Vector.Unboxed.Base            Data/Vector/Unboxed/Base.hs:278:872-933                      1.5    0.0
triangularSolve.\.\    Numeric.LinearAlgebra.LUSolve       src/Numeric/LinearAlgebra/LUSolve.hs:(376,33)-(386,58)       1.3    0.1

&lt;snip&gt;</pre>
<p>A large amount of time is spent in the invocations of unsafeRead
and unsafeWrite. This is a bit suspicious -- it looks as if these
calls may not be inlined. In the Data.Vector.Unboxed.Mutable
library, which provides the underlying linear vector of storage
locations, the unsafeRead and unsafeWrite functions are declared
INLINE. Could this be a failure of the 'matrices' library to mark
its own unsafeRead/unsafeWrite wrappers as INLINE (or at least
INLINABLE) as well?</p>
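<p>If that turned out to be the problem, the fix would just be pragmas on the
thin accessor wrappers. A hypothetical sketch (MMatrix, readM and writeM are
invented here for illustration; this is not the matrices package's source)
showing where the INLINE pragmas would go:</p>
<pre>import Control.Monad.Primitive (PrimMonad, PrimState)
import qualified Data.Vector.Unboxed.Mutable as UM

-- A row-major mutable matrix: rows, columns, flat unboxed storage.
data MMatrix s a = MMatrix !Int !Int !(UM.MVector s a)

{-# INLINE readM #-}
readM :: (PrimMonad m, UM.Unbox a)
      => MMatrix (PrimState m) a -> Int -> Int -> m a
readM (MMatrix _ cols v) i j = UM.unsafeRead v (i * cols + j)

{-# INLINE writeM #-}
writeM :: (PrimMonad m, UM.Unbox a)
       => MMatrix (PrimState m) a -> Int -> Int -> a -> m ()
writeM (MMatrix _ cols v) i j = UM.unsafeWrite v (i * cols + j)

main :: IO ()
main = do
  m <- UM.replicate 6 (0 :: Double)       -- storage for a 2 x 3 matrix
  let mat = MMatrix 2 3 m
  writeM mat 1 2 42
  readM mat 1 2 >>= print                 -- 42.0
</pre>
<p>If wrappers like these fail to inline, every element access pays for an
out-of-line call, which would be consistent with the time the profile charges
to unsafeRead and unsafeWrite.</p>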
<p>On the other hand, looking at the core (.dump-simpl) of the
library doesn't show any dictionary passing, and the accesses to
the matrix seem to go through GHC.Prim.writeDoubleArray# and
GHC.Prim.readDoubleArray#.</p>
<p>If this program took three to five times longer than the C
version, I would not be concerned, but a factor of seventy-five
indicates that I've missed something important. Can anyone tell me
what it is?</p>
<p>Best Wishes,</p>
<p>Greg<br>
</p>
</body>
</html>