[Haskell-cafe] Why is boxed mutable array so slow?
Branimir Maksimovic
bmaxa at hotmail.com
Sat Dec 1 19:04:03 CET 2012
Wow that sped it up 5 times.I see that boxed Vector is 25% faster than IOArray.What is the difference and when to use Vector,when IOArray?Thanks!
bmaxa at maxa:~/examples$ time ./Cumul +RTS -A1600Mboxed arraylast 262486571 seconds 1.196unboxed arraylast 262486571 seconds 0.748boxed vectorlast 262486571 seconds 0.908unboxed vectorlast 262486571 seconds 0.720
real 0m3.805suser 0m3.428ssys 0m0.372s
> Date: Sat, 1 Dec 2012 12:20:37 -0500
> Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow?
> From: dons00 at gmail.com
> To: bmaxa at hotmail.com
> CC: haskell-cafe at haskell.org
>
> The obvious difference between boxed and unboxed arrays is that the
> boxed arrays are full of pointers to heap allocated objects. This
> means you pay indirection to access the values, much more time in GC
> spent chasing pointers (though card marking helps), and generally do
> more allocation.
>
> Compare the GC stats below, for
>
> * Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s
> * Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s
>
> So there's your main answer. The increased data density of unboxed
> arrays also helps a too.
>
> Now, you can help out the GC signifcantly by hinting at how much
> you're going to allocated in the youngest generation (see the
> ghc-gc-tune app for a methodical approach to this, though it needs
> updating to ghc 7 --
> http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/
> and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection
> ).
>
> Use the +RTS -A flag to set an initial youngest generation heap size
> to the size of your array, and watch the GC cost disappear. For our
> boxed vector, we'd use +RTS -A50M, resulting in:
>
> * Boxed vector: 8k copied, 1% of time in GC, 0.157s
>
> So not bad. 3x speedup through a RTS flag. -A is very useful if you
> are working with boxed, mutable arrays.
>
> For reference, there's a generic version below that specializes based
> on the vector type parameter.
>
> ---------------------------------
>
> {-# LANGUAGE BangPatterns #-}
>
> import System.CPUTime
> import Text.Printf
> import Data.Int
> import Control.DeepSeq
> import System.Mem
>
> import qualified Data.Vector.Mutable as V
> import qualified Data.Vector.Unboxed.Mutable as U
> import qualified Data.Vector.Generic.Mutable as G
>
> main :: IO()
> main = do
>
> -- (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
> -- performGC
> (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
> performGC
>
> test' s a = do
> putStrLn s
> begin <- getCPUTime
> init'' a
> partial_sum' a
> end <- getCPUTime
> let diff = (fromIntegral (end - begin)) / (10**12)
> last <- G.read a (n'-1)
> printf "last %d seconds %.3f\n" last (diff::Double)
>
> n' :: Int
> n' = 1000 * 1000
>
> init'' !a = init 0 (n'-1)
> where
> init :: Int -> Int -> IO ()
> init !k !n
> | k > n = return ()
> | otherwise = do
> let !x = fromIntegral $ k + k `div` 3
> G.write a k x
> init (k+1) n
>
>
>
> partial_sum' !a = do
> k <- G.read a 0
> ps 1 (n'-1) k
> where
> ps :: Int -> Int -> Int32 -> IO ()
> ps i n s
> | i > n = return ()
> | otherwise = do
> k <- G.read a i
> let !l = fromIntegral $ s + k
> G.write a i l
> ps (i+1) n l
>
>
> ---------------------------------
>
> $ time ./A +RTS -s
> boxed vector
> last 945735787 seconds 0.420
> 40,121,448 bytes allocated in the heap
> 88,355,272 bytes copied during GC
> 24,036,456 bytes maximum residency (6 sample(s))
> 380,632 bytes maximum slop
> 54 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 75.2% (75.9% elapsed)
>
> Alloc rate 359,655,602 bytes per MUT second
>
> ./A +RTS -s 0.40s user 0.07s system 98% cpu 0.475 total
>
>
> $ time ./A +RTS -s
> unboxed vector
> last 945735787 seconds 0.080
> 4,113,568 bytes allocated in the heap
> 11,288 bytes copied during GC
> 4,003,256 bytes maximum residency (3 sample(s))
> 182,856 bytes maximum slop
> 5 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 1.3% (1.3% elapsed)
>
> Alloc rate 51,416,660 bytes per MUT second
>
> ./A +RTS -s 0.08s user 0.01s system 98% cpu 0.088 total
>
>
> $ time ./A +RTS -A50M -s
> boxed vector
> last 945735787 seconds 0.127
> 40,121,504 bytes allocated in the heap
> 8,032 bytes copied during GC
> 44,704 bytes maximum residency (2 sample(s))
> 20,832 bytes maximum slop
> 59 MB total memory in use (0 MB lost due to fragmentation)
>
> %GC time 1.0% (1.0% elapsed)
>
> Productivity 97.4% of total user, 99.6% of total elapsed
>
> ./A +RTS -A50M -s 0.10s user 0.05s system 97% cpu 0.157 total
>
>
>
> ---------------------------------
>
>
> On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <bmaxa at hotmail.com> wrote:
> > I have made benchmark test inspired by
> > http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
> >
> > What surprised me is that unboxed array is much faster than boxed array.
> > Actually boxed array performance is on par with standard Haskell list
> > which is very slow.
> > All in all even unboxed array is about 10 times slower than Java version.
> > I don't understand why is even unboxed array so slow.
> > But! unboxed array consumes least amount of RAM.
> > (warning, program consumes more than 3gb of ram)
> >
> > bmaxa at maxa:~/examples$ time ./Cumul
> > boxed array
> > last 262486571 seconds 4.972
> > unboxed array
> > last 262486571 seconds 0.776
> > list
> > last 262486571 seconds 6.812
> >
> > real 0m13.086s
> > user 0m11.996s
> > sys 0m1.080s
> >
> > -------------------------------------------------------------------------
> > {-# LANGUAGE CPP, BangPatterns #-}
> > import System.CPUTime
> > import Text.Printf
> > import Data.Array.IO
> > import Data.Array.Base
> > import Data.Int
> > import Control.DeepSeq
> > import System.Mem
> >
> > main :: IO()
> > main = do
> > (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
> > performGC
> > (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
> > performGC
> > begin <- getCPUTime
> > printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
> > end <- getCPUTime
> > let diff = (fromIntegral (end - begin)) / (10^12)
> > printf " seconds %.3f\n" (diff::Double)
> >
> > test s a = do
> > putStrLn s
> > begin <- getCPUTime
> > init' a
> > partial_sum a
> > end <- getCPUTime
> > let diff = (fromIntegral (end - begin)) / (10^12)
> > last <- readArray a (n'-1)
> > printf "last %d seconds %.3f\n" last (diff::Double)
> >
> > n' :: Int
> > n' = 50 * 1000 * 1000
> >
> > type A = IOArray Int Int32
> > type B = IOUArray Int Int32
> >
> > init' a = do
> > (_,n) <- getBounds a
> > init a 0 n
> > where
> > init a k n
> > | k > n = return ()
> > | otherwise = do
> > let !x = fromIntegral $ k + k `div` 3
> > unsafeWrite a k x
> > init a (k+1) n
> >
> > partial_sum a = do
> > (_,n) <- getBounds a
> > k <- unsafeRead a 0
> > ps a 1 n k
> > where
> > ps a i n s
> > | i > n = return ()
> > | otherwise = do
> > k <- unsafeRead a i
> > let !l = fromIntegral $ s + k
> > unsafeWrite a i l
> > ps a (i+1) n l
> >
> > data' :: [Int32]
> > data' = [k + k `div` 3 | k <- [0..] ]
> >
> > sum' = scanl1 (+)
> >
> >
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe at haskell.org
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20121201/6a364bb0/attachment.htm>
More information about the Haskell-Cafe
mailing list