[Haskell-cafe] Why is boxed mutable array so slow?

Sat Dec 1 21:12:29 CET 2012

On Samstag, 1. Dezember 2012, 16:09:05, Branimir Maksimovic wrote:
> All in all even unboxed array is about 10 times slower than Java version.
> I don't understand why is even unboxed array so slow.

It's not the unboxed arrays that are slow.

Your code has a couple of weak spots, and GHC's native code generator has a 
weakness that bites here.

On my box, I don't quite have a 10× difference to my translation to Java, it's 
a bit less than 7× (0.82s vs 0.12s - I don't want to bring my box to its knees 
by running something that takes 3GB+ of RAM, so I run the unboxed array part 
only) with the LLVM backend and 8× (0.93s) with the native code generator. 
That's in the same ballpark, though.

So what's the deal?

Main.main_$s$wa1 [Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> GHC.Types.Int
     -> GHC.Types.Int
     -> GHC.Types.Int
     -> ...

Your loops carry boxed Ints around, that's always a bad sign. In this case it 
doesn't hurt too much, however, since these values are neither read nor 
substituted during the loop (they're first and last index of the array and 
number of elements). Additionally, they carry an IOUArray constructor around. 
That is unnecessary. Eliminating a couple of dead parameters

init' a = do
    (_,n) <- getBounds a
    let init k
          | k > n     = return ()
          | otherwise = do
              let x = fromIntegral $ k + k `div` 3
              unsafeWrite a k x
              init (k+1)
    init 0

partial_sum a = do
    (_,n) <- getBounds a
    let ps i s
          | i > n     = return ()
          | otherwise = do
              k <- unsafeRead a i
              let l = s + k
              unsafeWrite a i l
              ps (i+1) l
    k <- unsafeRead a 0
    ps 1 k

brings the time for the native code generator down to 0.82s, and for the LLVM 
backend the time remains the same.

Next problem, you're using `div` for the division.

`div` does some checking and potentially fixup (not here, since everything is 
non-negative) after the machine division because `div` is specified to satisfy

a = (a `div` b) * b + (a `mod` b)

with 0 <= a `mod` b < abs b.

That is in itself slower than the pure machine division you get with quot.

So let's see what we get with `quot`.

0.65s with the native code generator, and 0.13 with the LLVM backend.

Whoops, what's that?

The problem is, as can be seen by manually replacing k `quot` 3 with

(k *2863311531) `shiftR` 33

(requires 64-bit Ints; equivalent in Java: k*28..1L >> 33), when the native 
backend, the LLVM backend and Java (as well as C) all take more or less the 
same time [well, the NCG is a bit slower than the other two, 0.11s, 0.11s, 
0.14s], that division is a **very** slow operation.

Java and LLVM know how to replace the division by the constant 3 with a 
mulitplication, a couple of shifts and an addition (since we never have 
negative numbers here, just one multiplication and shift suffice, but neither 
Java nor LLVM can do that on their own because it's not guaranteed by the 
type). The native code generator doesn't - not yet.

So the programme spends the majority of the time dividing. The array reads and 
writes are on par with Java's (and, for that matter, C's).

If you make the divisor a parameter instead of a compile time constant, the 
NCG is not affected at all, the LLVM backend gives you equal performance (it 
can't optimise a division by a divisor it doesn't know). Java is at an 
advantage there, after a while the JIT sees that it might be a good idea to 
optimise the division and so its time only trebles.