[Haskell-cafe] A Performance Puzzle
Bertram Felgenhauer
bertram.felgenhauer at googlemail.com
Sat Aug 4 13:11:32 UTC 2018
Hi Gregory,
Gregory Wright wrote:
> Something Haskell has lacked for a long time is a good medium-duty
> linear system solver based on the LU decomposition. There are bindings
> to the usual C/Fortran libraries, but not one in pure Haskell. (An
> example "LU factorization" routine that does not do partial pivoting has
> been around for years, but lacking pivoting it can fail unexpectedly on
> well-conditioned inputs. Another Haskell LU decomposition using partial
> pivoting is around, but it uses an inefficient representation of the
> pivot matrix, so it's not suited to solving systems of more than 100 x
> 100, say.)
There are ways to improve the code; in particular, one can change the
matrix multiplication to accumulate the result of multiplying a row
by a column:
numLoopState 0 (ca - 1) 0 $ \s k -> do
aik <- MU.unsafeRead a (i, k)
bkj <- MU.unsafeRead b (k, j)
return $! s + aik * bkj
At the end of the day, ghc's native code generator is pretty terrible
for tight inner loops as they arise in matrix multiplication. The
above results in an assembly language loop that does lots of useless
shuffling of registers (saving and restoring them on the stack), index
recomputation, and even checking a pointer tag to see if some closure
has already been evaluated, all in the innermost loop. (Full code at
end.)
If you use -fllvm, the code becomes quite a bit faster, and the inner
loop looks amazingly decent; there is no saving of state anymore, no
tag check, and the index computation has mostly disappeared thanks to
strength reduction:
4114a0: f2 0f 10 0f movsd (%rdi),%xmm1
; fetch entry from first matrix
4114a4: f2 0f 59 0b mulsd (%rbx),%xmm1
; multiply by entry from second matrix
4114a8: f2 0f 58 c1 addsd %xmm1,%xmm0
; add to accumulator
4114ac: 49 ff c5 inc %r13
; increment loop counter
4114af: 4c 01 e3 add %r12,%rbx
; advance pointer into second matrix
4114b2: 48 83 c7 08 add $0x8,%rdi
; advance pointer into first matrix
4114b6: 4c 39 ee cmp %r13,%rsi
; test for end of loop
4114b9: 75 e5 jne 4114a0 <cgbl_info$def+0xe0>
But this could be quite a bit faster if the code were vectorized, for
example, by processing 4 columns at the same time. I would expect a
good implementation of BLAS to do that.
To speed matrix multiplication up on a higher level, Strassen
multiplication comes to mind, though "at the price of a somewhat
reduced numerical stability" (Wikipedia).
Cheers,
Bertram
NCG inner loop:
_cdUi:
cmpq %rbx,%r10 ; check loop upper bound
je _cdUx ; exit loop
_cdUr:
movq $block_cdUp_info,-96(%rbp)
movq %rbx,%r11 ; save %rbx in %r11
movq %rcx,%rbx
movq %rdx,-88(%rbp)
movq %rsi,-80(%rbp)
movq %rax,-72(%rbp)
movq %rdi,-64(%rbp)
movq %r8,-56(%rbp)
movq %r9,-48(%rbp)
movq %r11,-40(%rbp)
movq %rcx,-32(%rbp)
movq %r14,-24(%rbp)
movq %r10,-16(%rbp)
movsd %xmm0,-8(%rbp) ; save lots of state
addq $-96,%rbp
testb $7,%bl ; check whether (rbx) is evaluated
jne _cdUp ; yes -> skip closure
_cdUs:
jmp *(%rbx) ; enter closure
.align 8
.quad 122571
.quad 30
block_cdUp_info:
_cdUp:
movq 8(%rbp),%rax ; was: %rdx
movq 16(%rbp),%rcx ; was: %rsi
movq 24(%rbp),%rdx ; was: %rax
movq 32(%rbp),%rsi ; was: %rdi
movq 40(%rbp),%rdi ; was: %8
movq 48(%rbp),%r8 ; was: %r9
movq 56(%rbp),%r9 ; was: %r11
movq 64(%rbp),%r10 ; was: %rcx
movq 72(%rbp),%r11 ; was: %r14
movq 80(%rbp),%r14 ; was: %r10
movq %rax,64(%rsp) ; spill, was: %rdx
movq %r14,%rax
movq %rcx,72(%rsp) ; spill, was: %rsi
movq 64(%rsp),%rcx
imulq %rcx,%rax
addq %r11,%rax
movq %rsi,%rcx
addq %rax,%rcx
movq 72(%rsp),%rax
addq %rcx,%rax
movq 7(%rbx),%rbx
movq 64(%rsp),%rcx
imulq %rcx,%rbx
addq %r14,%rbx
movq %rdi,%rcx
addq %rbx,%rcx
movq 72(%rsp),%rbx
addq %rcx,%rbx
movsd 16(%rdx,%rbx,8),%xmm0 ; read one array entry
mulsd 16(%rdx,%rax,8),%xmm0 ; multiply by other array entry
movsd 88(%rbp),%xmm1 ; add accumulator (was: %xmm0)
addsd %xmm0,%xmm1
addq $96,%rbp
incq %r14
_nefT:
movsd %xmm1,%xmm0 ; update accumulator
movq %r10,%rcx ; was: %rcx
movq %r14,%r10 ; was: %r10 (now + 1)
movq %r11,%r14 ; was: %r9
movq %r9,%rbx ; was: %r11 (where we saved %rbx earlier)
movq %r8,%r9 ; was: %r9
movq %rdi,%r8 ; was: %r8
movq %rsi,%rdi ; was: %rdi
movq %rdx,%rax ; was: %rax
movq 72(%rsp),%rsi ; was: %rsi
movq 64(%rsp),%rdi ; was: %rdx ???
jmp _cdUi ; loop
More information about the Haskell-Cafe
mailing list