[Haskell-cafe] Why is boxed mutable array so slow?

Sun Dec 2 06:57:02 CET 2012

Wow, now performance is on par with Java ;)So slow division was main problem, that and GC .
Thanks!

> From: daniel.is.fischer at googlemail.com
> To: haskell-cafe at haskell.org
> CC: bmaxa at hotmail.com
> Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow?
> Date: Sat, 1 Dec 2012 21:12:29 +0100
> 
> On Samstag, 1. Dezember 2012, 16:09:05, Branimir Maksimovic wrote:
> > All in all even unboxed array is about 10 times slower than Java version.
> > I don't understand why is even unboxed array so slow.
> 
> It's not the unboxed arrays that are slow.
> 
> Your code has a couple of weak spots, and GHC's native code generator has a 
> weakness that bites here.
> 
> On my box, I don't quite have a 10× difference to my translation to Java, it's 
> a bit less than 7× (0.82s vs 0.12s - I don't want to bring my box to its knees 
> by running something that takes 3GB+ of RAM, so I run the unboxed array part 
> only) with the LLVM backend and 8× (0.93s) with the native code generator. 
> That's in the same ballpark, though.
> 
> So what's the deal?
> 
> Main.main_$s$wa1 [Occ=LoopBreaker]
>   :: GHC.Prim.Int#
>      -> GHC.Prim.Int#
>      -> GHC.Prim.State# GHC.Prim.RealWorld
>      -> GHC.Types.Int
>      -> GHC.Types.Int
>      -> GHC.Types.Int
>      -> ...
> 
> Your loops carry boxed Ints around, that's always a bad sign. In this case it 
> doesn't hurt too much, however, since these values are neither read nor 
> substituted during the loop (they're first and last index of the array and 
> number of elements). Additionally, they carry an IOUArray constructor around. 
> That is unnecessary. Eliminating a couple of dead parameters
> 
> 
> init' a = do
>     (_,n) <- getBounds a
>     let init k
>           | k > n     = return ()
>           | otherwise = do
>               let x = fromIntegral $ k + k `div` 3
>               unsafeWrite a k x
>               init (k+1)
>     init 0
> 
> partial_sum a = do
>     (_,n) <- getBounds a
>     let ps i s
>           | i > n     = return ()
>           | otherwise = do
>               k <- unsafeRead a i
>               let l = s + k
>               unsafeWrite a i l
>               ps (i+1) l
>     k <- unsafeRead a 0
>     ps 1 k
> 
> brings the time for the native code generator down to 0.82s, and for the LLVM 
> backend the time remains the same.
> 
> Next problem, you're using `div` for the division.
> 
> `div` does some checking and potentially fixup (not here, since everything is 
> non-negative) after the machine division because `div` is specified to satisfy
> 
> a = (a `div` b) * b + (a `mod` b)
> 
> with 0 <= a `mod` b < abs b.
> 
> That is in itself slower than the pure machine division you get with quot.
> 
> So let's see what we get with `quot`.
> 
> 0.65s with the native code generator, and 0.13 with the LLVM backend.
> 
> Whoops, what's that?
> 
> The problem is, as can be seen by manually replacing k `quot` 3 with
> 
> (k *2863311531) `shiftR` 33
> 
> (requires 64-bit Ints; equivalent in Java: k*28..1L >> 33), when the native 
> backend, the LLVM backend and Java (as well as C) all take more or less the 
> same time [well, the NCG is a bit slower than the other two, 0.11s, 0.11s, 
> 0.14s], that division is a **very** slow operation.
> 
> Java and LLVM know how to replace the division by the constant 3 with a 
> mulitplication, a couple of shifts and an addition (since we never have 
> negative numbers here, just one multiplication and shift suffice, but neither 
> Java nor LLVM can do that on their own because it's not guaranteed by the 
> type). The native code generator doesn't - not yet.
> 
> So the programme spends the majority of the time dividing. The array reads and 
> writes are on par with Java's (and, for that matter, C's).
> 
> If you make the divisor a parameter instead of a compile time constant, the 
> NCG is not affected at all, the LLVM backend gives you equal performance (it 
> can't optimise a division by a divisor it doesn't know). Java is at an 
> advantage there, after a while the JIT sees that it might be a good idea to 
> optimise the division and so its time only trebles.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20121202/c87d56f4/attachment.htm>