<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>I agree; I think the main problem would still remain once that
was accounted for, but it may be worth doing correctly nonetheless
:)<br>
<br>
The only thing I can think of off the top of my head is using
<tt>array</tt> rather than <tt>vector</tt>. I am not sure that
would fix the performance problem, but it would bring the code
closer to the C implementation. You might also consider looking
at a heap profile or using threadscope. <br>
</p>
<br>
<div class="moz-cite-prefix">On 08/02/2018 07:16 PM, Gregory Wright
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:e6fc7f54-b04d-1284-32f6-c734bcf6b250@antiope.com">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
That's an interesting point. Could the generation of the random
matrix be that slow? Something to check.<br>
<br>
In my comparison with dgetrf.c from ReLAPACK, I also used random
matrices, but measured the execution time from the start of the
factorization, so I did not include the generation of the random
matrix.<br>
<br>
The one piece of evidence that still points to a performance
problem is the scaling, since the execution time goes quite
accurately as n^3 for n * n linear systems. I would expect the
time for generation of a random matrix, even if done very
inefficiently, to scale as n^2.<br>
<br>
<br>
<div class="moz-cite-prefix">On 8/2/18 7:47 PM, Vanessa McHale
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:bac0c282-aab6-3e85-e81b-a9c3935e4f38@iohk.io">
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">
Looking at your benchmarks, you may be benchmarking the wrong
thing. The function you are benchmarking is <tt>runLUFactor</tt>,
which generates random matrices in addition to factoring them.<br>
<tt><br>
</tt>
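<p>A minimal sketch of the fix, assuming criterion (the names
<tt>randomMatrix</tt> and <tt>luFactor</tt> are placeholders for
whatever luSolve actually exports): criterion's <tt>env</tt>
builds the input once, outside the timed region, so only the
factorization itself is measured.</p>

```haskell
import Criterion.Main

-- Sketch only: 'randomMatrix' and 'luFactor' stand in for the actual
-- generator and factorization routine used in the luSolve benchmarks.
main :: IO ()
main = defaultMain
  [ env (randomMatrix 1000 1000) $ \m ->   -- generated once, not timed
      bench "luFactor 1000 x 1000 matrix" (nf luFactor m)
  ]
```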
<div class="moz-cite-prefix">On 08/02/2018 05:27 PM, Gregory
Wright wrote:<br>
</div>
<blockquote type="cite"
cite="mid:4c9d1388-099f-6e00-113e-0da97c6e1041@antiope.com">
<meta http-equiv="content-type" content="text/html;
charset=utf-8">
<p>Hi,</p>
<p>Something Haskell has lacked for a long time is a good
medium-duty linear system solver based on the LU
decomposition. There are bindings to the usual C/Fortran
libraries, but not one in pure Haskell. (An example "LU
factorization" routine that does not do partial pivoting has
been around for years, but lacking pivoting it can fail
unexpectedly on well-conditioned inputs. Another Haskell LU
decomposition using partial pivoting is around, but it uses
an inefficient representation of the pivot matrix, so it's
not suited to solving systems larger than, say, 100 x 100.)</p>
<p>By medium duty I mean a linear system solver that can
handle systems of (1000s) x (1000s) and uses Crout's
efficient in-place algorithm. In short, a program that does
everything short of exploiting SIMD vector instructions for
solving small subproblems.</p>
<p>Instead of complaining about the gap, I have written a little
library to fill it. It contains an LU
factorization function and an LU system solver. The LU
factorization also returns the parity of the pivots ( =
(-1)^(number of row swaps) ) so it can be used to calculate
determinants. I used Gustavson's recursive (imperative)
version of Crout's method. The implementation is quite
simple and deserves to be better known by people using
functional languages to do numeric work. The library can be
downloaded from GitHub: <a moz-do-not-send="true"
href="https://github.com/gwright83/luSolve">https://github.com/gwright83/luSolve</a><br>
</p>
<p>The performance scales as expected (as n^3; a linear system
10 times larger in each dimension takes 1000 times longer
to solve):<br>
</p>
<pre>Benchmark luSolve-bench: RUNNING...
benchmarking LUSolve/luFactor 100 x 100 matrix
time                 1.944 ms   (1.920 ms .. 1.980 ms)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 1.981 ms   (1.958 ms .. 2.009 ms)
std dev              85.64 μs   (70.21 μs .. 107.7 μs)
variance introduced by outliers: 30% (moderately inflated)

benchmarking LUSolve/luFactor 500 x 500 matrix
time                 204.3 ms   (198.1 ms .. 208.2 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 203.3 ms   (201.2 ms .. 206.2 ms)
std dev              3.619 ms   (2.307 ms .. 6.231 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking LUSolve/luFactor 1000 x 1000 matrix
time                 1.940 s    (1.685 s .. 2.139 s)
                     0.998 R²   (0.993 R² .. 1.000 R²)
mean                 1.826 s    (1.696 s .. 1.880 s)
std dev              93.63 ms   (5.802 ms .. 117.8 ms)
variance introduced by outliers: 19% (moderately inflated)

Benchmark luSolve-bench: FINISH</pre>
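<p>As a quick sanity check, the quoted timings can be compared
against ideal cubic scaling (a base-only sketch, using the numbers
above):</p>

```haskell
-- Compare measured time ratios with the ideal k^3 predicted by
-- cubic scaling when the matrix dimension grows by a factor of k.
main :: IO ()
main = do
  let times = [(100, 1.944e-3), (500, 204.3e-3), (1000, 1.940)] :: [(Int, Double)]
      check ((n1, t1), (n2, t2)) =
        putStrLn $ show n1 ++ " -> " ++ show n2
                ++ ": measured x" ++ show (t2 / t1)
                ++ ", ideal x"
                ++ show ((fromIntegral n2 / fromIntegral n1) ** 3 :: Double)
  mapM_ check (zip times (tail times))
```

<p>The measured ratios (about 105x going from 100 to 500, and 9.5x
from 500 to 1000) bracket the ideal 125x and 8x, consistent with
n^3 scaling.</p>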
<p>The puzzle is why the overall performance is so poor. When
I solve random 1000 x 1000 systems using the linsys.c
example file from the Recursive LAPACK (ReLAPACK) library --
which implements the same algorithm -- the average time is
only 26 ms. (I have tweaked ReLAPACK's dgetrf.c so that it
doesn't use optimized routines for small matrices. As near
as I can tell, the C and Haskell versions should be doing
the same thing.)</p>
<p>The Haskell version runs 75 times slower. This is the
puzzle.</p>
<p>My Haskell version uses a mutable matrix of unboxed
doubles (from Kai Zhang's <a moz-do-not-send="true"
href="https://hackage.haskell.org/package/matrices">matrices</a>
library). Matrix reads and writes are unsafe, so there is
no overhead from bounds checking.</p>
<p>Let's look at the result of profiling:</p>
<pre>        Tue Jul 31 21:07 2018 Time and Allocation Profiling Report  (Final)

           luSolve-hspec +RTS -N -p -RTS

        total time  = 7665.31 secs  (7665309 ticks @ 1000 us, 1 processor)
        total alloc = 10,343,030,811,040 bytes  (excludes profiling overheads)

COST CENTRE            MODULE                             SRC                                                      %time  %alloc

unsafeWrite            Data.Matrix.Dense.Generic.Mutable  src/Data/Matrix/Dense/Generic/Mutable.hs:(38,5)-(39,38)   17.7   29.4
basicUnsafeWrite       Data.Vector.Primitive.Mutable      Data/Vector/Primitive/Mutable.hs:115:3-69                 14.7   13.0
unsafeRead             Data.Matrix.Dense.Generic.Mutable  src/Data/Matrix/Dense/Generic/Mutable.hs:(34,5)-(35,38)   14.2   20.7
matrixMultiply.\.\.\   Numeric.LinearAlgebra.LUSolve      src/Numeric/LinearAlgebra/LUSolve.hs:(245,54)-(249,86)    13.4   13.5
readByteArray#         Data.Primitive.Types               Data/Primitive/Types.hs:184:30-132                         9.0   15.5
basicUnsafeRead        Data.Vector.Primitive.Mutable      Data/Vector/Primitive/Mutable.hs:112:3-63                  8.8    0.1
triangularSolve.\.\.\  Numeric.LinearAlgebra.LUSolve      src/Numeric/LinearAlgebra/LUSolve.hs:(382,45)-(386,58)     5.2    4.5
matrixMultiply.\.\     Numeric.LinearAlgebra.LUSolve      src/Numeric/LinearAlgebra/LUSolve.hs:(244,54)-(249,86)     4.1    0.3
primitive              Control.Monad.Primitive            Control/Monad/Primitive.hs:152:3-16                        3.8    0.0
basicUnsafeRead        Data.Vector.Unboxed.Base           Data/Vector/Unboxed/Base.hs:278:813-868                    3.3    0.0
basicUnsafeWrite       Data.Vector.Unboxed.Base           Data/Vector/Unboxed/Base.hs:278:872-933                    1.5    0.0
triangularSolve.\.\    Numeric.LinearAlgebra.LUSolve      src/Numeric/LinearAlgebra/LUSolve.hs:(376,33)-(386,58)     1.3    0.1

&lt;snip&gt;</pre>
<p>A large amount of time is spent in the invocations of
unsafeRead and unsafeWrite. This is a bit suspicious: it
looks as if these calls may not be inlined. In the
Data.Vector.Unboxed.Mutable library, which provides the
underlying linear vector of storage locations, the
unsafeRead and unsafeWrite functions are declared INLINE.
Could this be a failure of the 'matrices' library to mark
its unsafeRead/Write functions as INLINE or INLINABLE as
well?</p>
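<p>If that is the cause, the fix would be pragmas in the matrices
library itself. A minimal sketch of the idea (the wrapper names
<tt>readElem</tt>/<tt>writeElem</tt> are hypothetical, standing in
for the library's unsafeRead/unsafeWrite):</p>

```haskell
-- Thin wrappers over the array primitives, like those the matrices
-- library defines around its underlying vector. Without the INLINE
-- pragmas a call site can remain an out-of-line function call; with
-- them, GHC can reduce each call to a readDoubleArray#/
-- writeDoubleArray# primop at the use site.
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.IO (IOUArray, newArray)

{-# INLINE readElem #-}
readElem :: IOUArray Int Double -> Int -> IO Double
readElem = unsafeRead

{-# INLINE writeElem #-}
writeElem :: IOUArray Int Double -> Int -> Double -> IO ()
writeElem = unsafeWrite

main :: IO ()
main = do
  arr <- newArray (0, 9) 0 :: IO (IOUArray Int Double)
  writeElem arr 3 42
  print =<< readElem arr 3   -- prints 42.0
```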
<p>On the other hand, looking at the core (.dump-simpl) of the
library doesn't show any dictionary passing, and the accesses
to the matrix seem to go through GHC.Prim.writeDoubleArray#
and GHC.Prim.readDoubleArray#.</p>
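<p>(For anyone reproducing this: the Core dump referred to above
can be produced with standard GHC flags, for example per module:)</p>

```haskell
{-# OPTIONS_GHC -ddump-simpl -dsuppress-all -ddump-to-file #-}
-- Writes the simplified Core for this module to a .dump-simpl file;
-- grep it for out-of-line calls to unsafeRead/unsafeWrite versus
-- direct uses of readDoubleArray#/writeDoubleArray#.
```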
<p>If this program took three to five times longer, I would
not be concerned, but a factor of seventy five indicates
that I've missed something important. Can anyone tell me
what it is?</p>
<p>Best Wishes,</p>
<p>Greg<br>
</p>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
<a class="moz-txt-link-freetext" href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" moz-do-not-send="true">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a>
Only members subscribed via the mailman list are allowed to post.</pre>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
</body>
</html>