[Haskell-cafe] Why is boxed mutable array so slow?

Sat Dec 1 19:22:47 CET 2012

Regarding when to use mutable arrays versus vectors, I would always
use vectors -- they optimize better, and have a better interface.

Also, I have updated and released a new version of the tool mentioned below.
You can get it on Hackage, updated to ghc 7 series.

    http://hackage.haskell.org/package/ghc-gc-tune-0.3

For your boxed vector program, we get results that show a clear
performance peak with a -A of around 64M, about  the size of the
allocated array ...

http://i.imgur.com/dZ2Eo.png

Best settings for Running time:
0.16s:  +RTS -A67108864 -H1048576
0.16s:  +RTS -A67108864 -H2097152
0.16s:  +RTS -A67108864 -H8388608

E.g.

$ time ./A +RTS -A67M -H1M
boxed vector
last 945735787 seconds 0.123

-- Don

On Sat, Dec 1, 2012 at 12:20 PM, Don Stewart <dons00 at gmail.com> wrote:
> The obvious difference between boxed and unboxed arrays is that the
> boxed arrays are full of pointers to heap allocated objects. This
> means you pay indirection to access the values, much more time in GC
> spent chasing pointers (though card marking helps), and generally do
> more allocation.
>
> Compare the GC stats below, for
>
> * Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s
> * Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s
>
> So there's your main answer. The increased data density of unboxed
> arrays also helps a too.
>
> Now, you can help out  the GC signifcantly by hinting at how much
> you're going to allocated in the youngest generation (see the
> ghc-gc-tune app for a methodical approach to this, though it needs
> updating to ghc 7 --
> http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/
>  and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection
> ).
>
> Use the +RTS -A flag to set an initial youngest generation heap size
> to the size of your array, and watch the GC cost disappear. For our
> boxed vector, we'd use +RTS -A50M, resulting in:
>
> * Boxed vector: 8k copied, 1% of time in GC, 0.157s
>
> So not bad. 3x speedup through a RTS flag. -A is very useful if you
> are working with boxed, mutable arrays.
>
> For reference, there's a generic version below that specializes based
> on the vector type parameter.
>
> ---------------------------------
>
> {-# LANGUAGE BangPatterns #-}
>
> import System.CPUTime
> import Text.Printf
> import Data.Int
> import Control.DeepSeq
> import System.Mem
>
> import qualified Data.Vector.Mutable as V
> import qualified Data.Vector.Unboxed.Mutable as U
> import qualified Data.Vector.Generic.Mutable as G
>
> main :: IO()
> main = do
>
> --   (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
> --   performGC
>    (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
>    performGC
>
> test' s a = do
>     putStrLn s
>     begin <- getCPUTime
>     init'' a
>     partial_sum' a
>     end <- getCPUTime
>     let diff = (fromIntegral (end - begin)) / (10**12)
>     last <- G.read a (n'-1)
>     printf "last %d seconds %.3f\n" last (diff::Double)
>
> n' :: Int
> n' = 1000 * 1000
>
> init'' !a = init 0 (n'-1)
>   where
>     init :: Int -> Int -> IO ()
>     init !k !n
>         | k > n     = return ()
>         | otherwise = do
>             let !x = fromIntegral $ k + k `div` 3
>             G.write a k x
>             init (k+1) n
>
>
>
> partial_sum' !a = do
>     k <- G.read a 0
>     ps 1 (n'-1) k
>   where
>     ps :: Int -> Int -> Int32 -> IO ()
>     ps i n s
>         | i > n     = return ()
>         | otherwise = do
>             k <- G.read a i
>             let !l = fromIntegral $ s + k
>             G.write a i l
>             ps (i+1) n l
>
>
> ---------------------------------
>
> $ time ./A +RTS -s
> boxed vector
> last 945735787 seconds 0.420
>       40,121,448 bytes allocated in the heap
>       88,355,272 bytes copied during GC
>       24,036,456 bytes maximum residency (6 sample(s))
>          380,632 bytes maximum slop
>               54 MB total memory in use (0 MB lost due to fragmentation)
>
>   %GC     time      75.2%  (75.9% elapsed)
>
>   Alloc rate    359,655,602 bytes per MUT second
>
> ./A +RTS -s  0.40s user 0.07s system 98% cpu 0.475 total
>
>
> $ time ./A +RTS -s
> unboxed vector
> last 945735787 seconds 0.080
>        4,113,568 bytes allocated in the heap
>           11,288 bytes copied during GC
>        4,003,256 bytes maximum residency (3 sample(s))
>          182,856 bytes maximum slop
>                5 MB total memory in use (0 MB lost due to fragmentation)
>
>   %GC     time       1.3%  (1.3% elapsed)
>
>   Alloc rate    51,416,660 bytes per MUT second
>
> ./A +RTS -s  0.08s user 0.01s system 98% cpu 0.088 total
>
>
> $ time ./A +RTS -A50M -s
> boxed vector
> last 945735787 seconds 0.127
>       40,121,504 bytes allocated in the heap
>            8,032 bytes copied during GC
>           44,704 bytes maximum residency (2 sample(s))
>           20,832 bytes maximum slop
>               59 MB total memory in use (0 MB lost due to fragmentation)
>
>   %GC     time       1.0%  (1.0% elapsed)
>
>   Productivity  97.4% of total user, 99.6% of total elapsed
>
> ./A +RTS -A50M -s  0.10s user 0.05s system 97% cpu 0.157 total
>
>
>
> ---------------------------------
>
>
> On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <bmaxa at hotmail.com> wrote:
>> I have made benchmark test inspired by
>> http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
>>
>> What surprised me is that unboxed array is much faster than boxed array.
>> Actually boxed array performance is on par with standard Haskell list
>> which is very slow.
>> All in all even unboxed array is about 10 times slower than Java version.
>> I don't understand why is even unboxed array so slow.
>> But! unboxed array consumes least amount of RAM.
>> (warning, program consumes more than 3gb of ram)
>>
>>  bmaxa at maxa:~/examples$ time ./Cumul
>> boxed array
>> last 262486571 seconds 4.972
>> unboxed array
>> last 262486571 seconds 0.776
>> list
>> last 262486571 seconds 6.812
>>
>> real    0m13.086s
>> user    0m11.996s
>> sys     0m1.080s
>>
>> -------------------------------------------------------------------------
>> {-# LANGUAGE CPP, BangPatterns #-}
>> import System.CPUTime
>> import Text.Printf
>> import Data.Array.IO
>> import Data.Array.Base
>> import Data.Int
>> import Control.DeepSeq
>> import System.Mem
>>
>> main :: IO()
>> main = do
>> (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
>> performGC
>> (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
>> performGC
>> begin <- getCPUTime
>> printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
>> end <- getCPUTime
>> let diff = (fromIntegral (end - begin)) / (10^12)
>> printf " seconds %.3f\n" (diff::Double)
>>
>> test s a = do
>> putStrLn s
>> begin <- getCPUTime
>> init' a
>> partial_sum a
>> end <- getCPUTime
>> let diff = (fromIntegral (end - begin)) / (10^12)
>> last <- readArray a (n'-1)
>> printf "last %d seconds %.3f\n" last (diff::Double)
>>
>> n' :: Int
>> n' = 50 * 1000 * 1000
>>
>> type A = IOArray Int Int32
>> type B = IOUArray Int Int32
>>
>> init' a = do
>> (_,n) <- getBounds a
>> init a 0 n
>> where
>> init a k n
>> | k > n = return ()
>> | otherwise = do
>> let  !x = fromIntegral $ k + k `div` 3
>> unsafeWrite a k x
>> init a (k+1) n
>>
>> partial_sum a = do
>> (_,n) <- getBounds a
>> k <- unsafeRead a 0
>> ps a 1 n k
>> where
>> ps a i n s
>> | i > n = return ()
>> | otherwise = do
>> k <- unsafeRead a i
>> let !l = fromIntegral $ s + k
>> unsafeWrite a i l
>> ps a (i+1) n l
>>
>> data' :: [Int32]
>> data' = [k + k `div` 3 | k <- [0..] ]
>>
>> sum' = scanl1 (+)
>>
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> Haskell-Cafe at haskell.org
>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>>