[Haskell-cafe] Why is boxed mutable array so slow?

Sat Dec 1 19:58:25 CET 2012

Thanks! I have downloaded tool and playing with it.I will use boxed vectors in the future ;)

> Date: Sat, 1 Dec 2012 13:22:47 -0500
> Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow?
> From: dons00 at gmail.com
> To: bmaxa at hotmail.com
> CC: haskell-cafe at haskell.org
> 
> Regarding when to use mutable arrays versus vectors, I would always
> use vectors -- they optimize better, and have a better interface.
> 
> Also, I have updated and released a new version of the tool mentioned below.
> You can get it on Hackage, updated to ghc 7 series.
> 
>     http://hackage.haskell.org/package/ghc-gc-tune-0.3
> 
> For your boxed vector program, we get results that show a clear
> performance peak with a -A of around 64M, about  the size of the
> allocated array ...
> 
> http://i.imgur.com/dZ2Eo.png
> 
> Best settings for Running time:
> 0.16s:  +RTS -A67108864 -H1048576
> 0.16s:  +RTS -A67108864 -H2097152
> 0.16s:  +RTS -A67108864 -H8388608
> 
> E.g.
> 
> $ time ./A +RTS -A67M -H1M
> boxed vector
> last 945735787 seconds 0.123
> 
> -- Don
> 
> On Sat, Dec 1, 2012 at 12:20 PM, Don Stewart <dons00 at gmail.com> wrote:
> > The obvious difference between boxed and unboxed arrays is that the
> > boxed arrays are full of pointers to heap allocated objects. This
> > means you pay indirection to access the values, much more time in GC
> > spent chasing pointers (though card marking helps), and generally do
> > more allocation.
> >
> > Compare the GC stats below, for
> >
> > * Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s
> > * Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s
> >
> > So there's your main answer. The increased data density of unboxed
> > arrays also helps a too.
> >
> > Now, you can help out  the GC signifcantly by hinting at how much
> > you're going to allocated in the youngest generation (see the
> > ghc-gc-tune app for a methodical approach to this, though it needs
> > updating to ghc 7 --
> > http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/
> >  and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection
> > ).
> >
> > Use the +RTS -A flag to set an initial youngest generation heap size
> > to the size of your array, and watch the GC cost disappear. For our
> > boxed vector, we'd use +RTS -A50M, resulting in:
> >
> > * Boxed vector: 8k copied, 1% of time in GC, 0.157s
> >
> > So not bad. 3x speedup through a RTS flag. -A is very useful if you
> > are working with boxed, mutable arrays.
> >
> > For reference, there's a generic version below that specializes based
> > on the vector type parameter.
> >
> > ---------------------------------
> >
> > {-# LANGUAGE BangPatterns #-}
> >
> > import System.CPUTime
> > import Text.Printf
> > import Data.Int
> > import Control.DeepSeq
> > import System.Mem
> >
> > import qualified Data.Vector.Mutable as V
> > import qualified Data.Vector.Unboxed.Mutable as U
> > import qualified Data.Vector.Generic.Mutable as G
> >
> > main :: IO()
> > main = do
> >
> > --   (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
> > --   performGC
> >    (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
> >    performGC
> >
> > test' s a = do
> >     putStrLn s
> >     begin <- getCPUTime
> >     init'' a
> >     partial_sum' a
> >     end <- getCPUTime
> >     let diff = (fromIntegral (end - begin)) / (10**12)
> >     last <- G.read a (n'-1)
> >     printf "last %d seconds %.3f\n" last (diff::Double)
> >
> > n' :: Int
> > n' = 1000 * 1000
> >
> > init'' !a = init 0 (n'-1)
> >   where
> >     init :: Int -> Int -> IO ()
> >     init !k !n
> >         | k > n     = return ()
> >         | otherwise = do
> >             let !x = fromIntegral $ k + k `div` 3
> >             G.write a k x
> >             init (k+1) n
> >
> >
> >
> > partial_sum' !a = do
> >     k <- G.read a 0
> >     ps 1 (n'-1) k
> >   where
> >     ps :: Int -> Int -> Int32 -> IO ()
> >     ps i n s
> >         | i > n     = return ()
> >         | otherwise = do
> >             k <- G.read a i
> >             let !l = fromIntegral $ s + k
> >             G.write a i l
> >             ps (i+1) n l
> >
> >
> > ---------------------------------
> >
> > $ time ./A +RTS -s
> > boxed vector
> > last 945735787 seconds 0.420
> >       40,121,448 bytes allocated in the heap
> >       88,355,272 bytes copied during GC
> >       24,036,456 bytes maximum residency (6 sample(s))
> >          380,632 bytes maximum slop
> >               54 MB total memory in use (0 MB lost due to fragmentation)
> >
> >   %GC     time      75.2%  (75.9% elapsed)
> >
> >   Alloc rate    359,655,602 bytes per MUT second
> >
> > ./A +RTS -s  0.40s user 0.07s system 98% cpu 0.475 total
> >
> >
> > $ time ./A +RTS -s
> > unboxed vector
> > last 945735787 seconds 0.080
> >        4,113,568 bytes allocated in the heap
> >           11,288 bytes copied during GC
> >        4,003,256 bytes maximum residency (3 sample(s))
> >          182,856 bytes maximum slop
> >                5 MB total memory in use (0 MB lost due to fragmentation)
> >
> >   %GC     time       1.3%  (1.3% elapsed)
> >
> >   Alloc rate    51,416,660 bytes per MUT second
> >
> > ./A +RTS -s  0.08s user 0.01s system 98% cpu 0.088 total
> >
> >
> > $ time ./A +RTS -A50M -s
> > boxed vector
> > last 945735787 seconds 0.127
> >       40,121,504 bytes allocated in the heap
> >            8,032 bytes copied during GC
> >           44,704 bytes maximum residency (2 sample(s))
> >           20,832 bytes maximum slop
> >               59 MB total memory in use (0 MB lost due to fragmentation)
> >
> >   %GC     time       1.0%  (1.0% elapsed)
> >
> >   Productivity  97.4% of total user, 99.6% of total elapsed
> >
> > ./A +RTS -A50M -s  0.10s user 0.05s system 97% cpu 0.157 total
> >
> >
> >
> > ---------------------------------
> >
> >
> > On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <bmaxa at hotmail.com> wrote:
> >> I have made benchmark test inspired by
> >> http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
> >>
> >> What surprised me is that unboxed array is much faster than boxed array.
> >> Actually boxed array performance is on par with standard Haskell list
> >> which is very slow.
> >> All in all even unboxed array is about 10 times slower than Java version.
> >> I don't understand why is even unboxed array so slow.
> >> But! unboxed array consumes least amount of RAM.
> >> (warning, program consumes more than 3gb of ram)
> >>
> >>  bmaxa at maxa:~/examples$ time ./Cumul
> >> boxed array
> >> last 262486571 seconds 4.972
> >> unboxed array
> >> last 262486571 seconds 0.776
> >> list
> >> last 262486571 seconds 6.812
> >>
> >> real    0m13.086s
> >> user    0m11.996s
> >> sys     0m1.080s
> >>
> >> -------------------------------------------------------------------------
> >> {-# LANGUAGE CPP, BangPatterns #-}
> >> import System.CPUTime
> >> import Text.Printf
> >> import Data.Array.IO
> >> import Data.Array.Base
> >> import Data.Int
> >> import Control.DeepSeq
> >> import System.Mem
> >>
> >> main :: IO()
> >> main = do
> >> (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
> >> performGC
> >> (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
> >> performGC
> >> begin <- getCPUTime
> >> printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
> >> end <- getCPUTime
> >> let diff = (fromIntegral (end - begin)) / (10^12)
> >> printf " seconds %.3f\n" (diff::Double)
> >>
> >> test s a = do
> >> putStrLn s
> >> begin <- getCPUTime
> >> init' a
> >> partial_sum a
> >> end <- getCPUTime
> >> let diff = (fromIntegral (end - begin)) / (10^12)
> >> last <- readArray a (n'-1)
> >> printf "last %d seconds %.3f\n" last (diff::Double)
> >>
> >> n' :: Int
> >> n' = 50 * 1000 * 1000
> >>
> >> type A = IOArray Int Int32
> >> type B = IOUArray Int Int32
> >>
> >> init' a = do
> >> (_,n) <- getBounds a
> >> init a 0 n
> >> where
> >> init a k n
> >> | k > n = return ()
> >> | otherwise = do
> >> let  !x = fromIntegral $ k + k `div` 3
> >> unsafeWrite a k x
> >> init a (k+1) n
> >>
> >> partial_sum a = do
> >> (_,n) <- getBounds a
> >> k <- unsafeRead a 0
> >> ps a 1 n k
> >> where
> >> ps a i n s
> >> | i > n = return ()
> >> | otherwise = do
> >> k <- unsafeRead a i
> >> let !l = fromIntegral $ s + k
> >> unsafeWrite a i l
> >> ps a (i+1) n l
> >>
> >> data' :: [Int32]
> >> data' = [k + k `div` 3 | k <- [0..] ]
> >>
> >> sum' = scanl1 (+)
> >>
> >>
> >> _______________________________________________
> >> Haskell-Cafe mailing list
> >> Haskell-Cafe at haskell.org
> >> http://www.haskell.org/mailman/listinfo/haskell-cafe
> >>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20121201/1061ccb9/attachment-0001.htm>