[Haskell-cafe] Why is boxed mutable array so slow?
Daniel Fischer
daniel.is.fischer at googlemail.com
Sat Dec 1 21:12:29 CET 2012
On Saturday, 1 December 2012, 16:09:05, Branimir Maksimovic wrote:
> All in all even unboxed array is about 10 times slower than Java version.
> I don't understand why is even unboxed array so slow.
It's not the unboxed arrays that are slow.
Your code has a couple of weak spots, and GHC's native code generator has a
weakness that bites here.
On my box, I don't quite see a 10× difference to my translation to Java; it's
a bit less than 7× (0.82s vs 0.12s) with the LLVM backend and 8× (0.93s) with
the native code generator. (I don't want to bring my box to its knees by
running something that takes 3GB+ of RAM, so I only run the unboxed array
part.)
That's in the same ballpark, though.
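(For reference: the native code generator is what plain ghc -O2 gives you;
adding -fllvm selects the LLVM backend, assuming an LLVM toolchain is
installed.)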
So what's the deal? Looking at the Core (e.g. via -ddump-simpl):
Main.main_$s$wa1 [Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> GHC.Types.Int
     -> GHC.Types.Int
     -> GHC.Types.Int
     -> ...
Your loops carry boxed Ints around, which is always a bad sign. In this case
it doesn't hurt too much, however, since these values are neither read nor
updated during the loop (they're the first and last index of the array and
the number of elements). Additionally, the loops carry an IOUArray
constructor around. That is unnecessary. Eliminating a couple of dead
parameters
init' a = do
    (_,n) <- getBounds a
    let init k
            | k > n = return ()
            | otherwise = do
                let x = fromIntegral $ k + k `div` 3
                unsafeWrite a k x
                init (k+1)
    init 0
partial_sum a = do
    (_,n) <- getBounds a
    let ps i s
            | i > n = return ()
            | otherwise = do
                k <- unsafeRead a i
                let l = s + k
                unsafeWrite a i l
                ps (i+1) l
    k <- unsafeRead a 0
    ps 1 k
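
For completeness, here is a minimal driver the two loops above could slot
into (my sketch, not the code from the original post; I'm assuming an
IOUArray Int Int32 of 50 million elements, so adjust the size and element
type to whatever you actually benchmark):

import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.IO (IOUArray, getBounds, newArray)
import Data.Int (Int32)

main :: IO ()
main = do
    -- 50,000,000 elements, all initialised to 0
    a <- newArray (0, 49999999) 0 :: IO (IOUArray Int Int32)
    init' a
    partial_sum a
    -- read back the last element so the work can't be optimised away
    x <- unsafeRead a 49999999
    print x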
brings the time for the native code generator down to 0.82s, and for the LLVM
backend the time remains the same.
Next problem: you're using `div` for the division.
`div` does some checking and potentially a fixup (not needed here, since
everything is non-negative) after the machine division, because `div` is
specified to round toward negative infinity and to satisfy
    a = (a `div` b) * b + (a `mod` b)
with a `mod` b taking the sign of b (so 0 <= a `mod` b < b for our positive
divisor).
That is in itself slower than the pure machine division, which rounds toward
zero and is what `quot` gives you.
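As a quick reminder of the difference (it only shows up when the operands'
signs differ, which is why no fixup is actually needed in this program):

Prelude> (-7) `div` 3
-3
Prelude> (-7) `mod` 3
2
Prelude> (-7) `quot` 3
-2
Prelude> (-7) `rem` 3
-1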
So let's see what we get with `quot`.
0.65s with the native code generator, and 0.13s with the LLVM backend.
Whoops, what's that?
The problem is that division is a **very** slow operation. That can be seen
by manually replacing k `quot` 3 with
(k * 2863311531) `shiftR` 33
(requires 64-bit Ints; the Java equivalent is k * 2863311531L >> 33): with
that change, the native backend, the LLVM backend and Java (as well as C)
all take more or less the same time [well, the NCG is a bit slower than the
other two: 0.14s vs 0.11s and 0.11s].
Java and LLVM know how to replace the division by the constant 3 with a
multiplication, a couple of shifts and an addition (since we never have
negative numbers here, a single multiplication and shift would suffice, but
neither Java nor LLVM can do that on their own because it's not guaranteed
by the type). The native code generator doesn't - not yet.
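
For the curious, that manual replacement can be packaged as a small helper;
this is just a sketch of mine, valid on 64-bit Ints for non-negative k small
enough that k * 2863311531 doesn't overflow (which comfortably covers the
32-bit range):

import Data.Bits (shiftR)

-- Divide a non-negative k by 3 via the "magic" reciprocal
-- 2863311531 = (2^33 + 1) `div` 3, followed by a right shift by 33.
-- Only valid while k >= 0 and k * 2863311531 fits in a 64-bit Int.
quot3 :: Int -> Int
quot3 k = (k * 2863311531) `shiftR` 33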
So the programme spends the majority of its time dividing. The array reads
and writes are on par with Java's (and, for that matter, C's).
If you make the divisor a parameter instead of a compile-time constant, the
NCG is not affected at all, and the LLVM backend gives you equal performance
(it can't optimise a division by a divisor it doesn't know). Java is at an
advantage there: after a while, the JIT sees that it might be a good idea to
optimise the division, and so its time only trebles.
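
If you want to try that yourself, one simple way to keep the divisor out of
the compiler's sight is to pass it on the command line; a sketch of mine,
not the benchmark from the original post:

{-# LANGUAGE BangPatterns #-}
import System.Environment (getArgs)

main :: IO ()
main = do
    [ds] <- getArgs
    let d = read ds :: Int
    print (go d 0 0)
  where
    -- sums k `quot` d for k in [0 .. 10^8); since d is only known at
    -- run time, the backend cannot turn the division into a
    -- multiply-and-shift
    go :: Int -> Int -> Int -> Int
    go d !acc !k
        | k >= 100000000 = acc
        | otherwise      = go d (acc + k `quot` d) (k + 1)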