[Haskell-cafe] Why is boxed mutable array so slow?
Don Stewart
dons00 at gmail.com
Sat Dec 1 19:22:47 CET 2012
Regarding when to use mutable arrays versus vectors, I would always
use vectors -- they optimize better, and have a better interface.
Also, I have updated and released a new version of the tool mentioned below.
You can get it on Hackage, updated to ghc 7 series.
http://hackage.haskell.org/package/ghc-gc-tune-0.3
For your boxed vector program, we get results that show a clear
performance peak with a -A of around 64M, about the size of the
allocated array ...
http://i.imgur.com/dZ2Eo.png
Best settings for Running time:
0.16s: +RTS -A67108864 -H1048576
0.16s: +RTS -A67108864 -H2097152
0.16s: +RTS -A67108864 -H8388608
E.g.
$ time ./A +RTS -A67M -H1M
boxed vector
last 945735787 seconds 0.123
-- Don
On Sat, Dec 1, 2012 at 12:20 PM, Don Stewart <dons00 at gmail.com> wrote:
> The obvious difference between boxed and unboxed arrays is that the
> boxed arrays are full of pointers to heap allocated objects. This
> means you pay indirection to access the values, much more time in GC
> spent chasing pointers (though card marking helps), and generally do
> more allocation.
>
> Compare the GC stats below, for
>
> * Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s
> * Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s
>
> So there's your main answer. The increased data density of unboxed
> arrays also helps a too.
>
> Now, you can help out the GC signifcantly by hinting at how much
> you're going to allocated in the youngest generation (see the
> ghc-gc-tune app for a methodical approach to this, though it needs
> updating to ghc 7 --
> http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/
> and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection
> ).
>
> Use the +RTS -A flag to set an initial youngest generation heap size
> to the size of your array, and watch the GC cost disappear. For our
> boxed vector, we'd use +RTS -A50M, resulting in:
>
> * Boxed vector: 8k copied, 1% of time in GC, 0.157s
>
> So not bad. 3x speedup through a RTS flag. -A is very useful if you
> are working with boxed, mutable arrays.
>
> For reference, there's a generic version below that specializes based
> on the vector type parameter.
>
> ---------------------------------
>
> {-# LANGUAGE BangPatterns #-}
>
> import System.CPUTime
> import Text.Printf
> import Data.Int
> import Control.DeepSeq
> import System.Mem
>
> import qualified Data.Vector.Mutable as V
> import qualified Data.Vector.Unboxed.Mutable as U
> import qualified Data.Vector.Generic.Mutable as G
>
> main :: IO()
> main = do
>
> -- (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
> -- performGC
> (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
> performGC
>
> test' s a = do
> putStrLn s
> begin <- getCPUTime
> init'' a
> partial_sum' a
> end <- getCPUTime
> let diff = (fromIntegral (end - begin)) / (10**12)
> last <- G.read a (n'-1)
> printf "last %d seconds %.3f\n" last (diff::Double)
>
> n' :: Int
> n' = 1000 * 1000
>
> init'' !a = init 0 (n'-1)
> where
> init :: Int -> Int -> IO ()
> init !k !n
> | k > n = return ()
> | otherwise = do
> let !x = fromIntegral $ k + k `div` 3
> G.write a k x
> init (k+1) n
>
>
>
> partial_sum' !a = do
> k <- G.read a 0
> ps 1 (n'-1) k
> where
> ps :: Int -> Int -> Int32 -> IO ()
> ps i n s
> | i > n = return ()
> | otherwise = do
> k <- G.read a i
> let !l = fromIntegral $ s + k
> G.write a i l
> ps (i+1) n l
>
>
> ---------------------------------
>
> $ time ./A +RTS -s
> boxed vector
> last 945735787 seconds 0.420
> 40,121,448 bytes allocated in the heap
> 88,355,272 bytes copied during GC
> 24,036,456 bytes maximum residency (6 sample(s))
> 380,632 bytes maximum slop
> 54 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 75.2% (75.9% elapsed)
>
> Alloc rate 359,655,602 bytes per MUT second
>
> ./A +RTS -s 0.40s user 0.07s system 98% cpu 0.475 total
>
>
> $ time ./A +RTS -s
> unboxed vector
> last 945735787 seconds 0.080
> 4,113,568 bytes allocated in the heap
> 11,288 bytes copied during GC
> 4,003,256 bytes maximum residency (3 sample(s))
> 182,856 bytes maximum slop
> 5 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 1.3% (1.3% elapsed)
>
> Alloc rate 51,416,660 bytes per MUT second
>
> ./A +RTS -s 0.08s user 0.01s system 98% cpu 0.088 total
>
>
> $ time ./A +RTS -A50M -s
> boxed vector
> last 945735787 seconds 0.127
> 40,121,504 bytes allocated in the heap
> 8,032 bytes copied during GC
> 44,704 bytes maximum residency (2 sample(s))
> 20,832 bytes maximum slop
> 59 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 1.0% (1.0% elapsed)
>
> Productivity 97.4% of total user, 99.6% of total elapsed
>
> ./A +RTS -A50M -s 0.10s user 0.05s system 97% cpu 0.157 total
>
>
>
> ---------------------------------
>
>
> On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <bmaxa at hotmail.com> wrote:
>> I have made benchmark test inspired by
>> http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
>>
>> What surprised me is that unboxed array is much faster than boxed array.
>> Actually boxed array performance is on par with standard Haskell list
>> which is very slow.
>> All in all even unboxed array is about 10 times slower than Java version.
>> I don't understand why is even unboxed array so slow.
>> But! unboxed array consumes least amount of RAM.
>> (warning, program consumes more than 3gb of ram)
>>
>> bmaxa at maxa:~/examples$ time ./Cumul
>> boxed array
>> last 262486571 seconds 4.972
>> unboxed array
>> last 262486571 seconds 0.776
>> list
>> last 262486571 seconds 6.812
>>
>> real 0m13.086s
>> user 0m11.996s
>> sys 0m1.080s
>>
>> -------------------------------------------------------------------------
>> {-# LANGUAGE CPP, BangPatterns #-}
>> import System.CPUTime
>> import Text.Printf
>> import Data.Array.IO
>> import Data.Array.Base
>> import Data.Int
>> import Control.DeepSeq
>> import System.Mem
>>
>> main :: IO()
>> main = do
>> (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
>> performGC
>> (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
>> performGC
>> begin <- getCPUTime
>> printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
>> end <- getCPUTime
>> let diff = (fromIntegral (end - begin)) / (10^12)
>> printf " seconds %.3f\n" (diff::Double)
>>
>> test s a = do
>> putStrLn s
>> begin <- getCPUTime
>> init' a
>> partial_sum a
>> end <- getCPUTime
>> let diff = (fromIntegral (end - begin)) / (10^12)
>> last <- readArray a (n'-1)
>> printf "last %d seconds %.3f\n" last (diff::Double)
>>
>> n' :: Int
>> n' = 50 * 1000 * 1000
>>
>> type A = IOArray Int Int32
>> type B = IOUArray Int Int32
>>
>> init' a = do
>> (_,n) <- getBounds a
>> init a 0 n
>> where
>> init a k n
>> | k > n = return ()
>> | otherwise = do
>> let !x = fromIntegral $ k + k `div` 3
>> unsafeWrite a k x
>> init a (k+1) n
>>
>> partial_sum a = do
>> (_,n) <- getBounds a
>> k <- unsafeRead a 0
>> ps a 1 n k
>> where
>> ps a i n s
>> | i > n = return ()
>> | otherwise = do
>> k <- unsafeRead a i
>> let !l = fromIntegral $ s + k
>> unsafeWrite a i l
>> ps a (i+1) n l
>>
>> data' :: [Int32]
>> data' = [k + k `div` 3 | k <- [0..] ]
>>
>> sum' = scanl1 (+)
>>
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> Haskell-Cafe at haskell.org
>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>>
More information about the Haskell-Cafe
mailing list