Is evacuate for StgMutArrPtrs and StgArrPtrs expensive to GC?
Andrew Farmer
afarmer
Tue Oct 1 20:56:58 UTC 2013
Definitely... I'm pretty much fully occupied for the next two weeks, but
I should be able to dig it out then and organize/share it.
On Oct 1, 2013 3:50 PM, "Carter Schonwald" <carter.schonwald at gmail.com>
wrote:
> awesome!
>
> please let us know when some of the info is available publicly, perhaps so
> other folks can help out with experimentation
>
>
> On Tue, Oct 1, 2013 at 4:30 PM, Andrew Farmer <afarmer at ittc.ku.edu> wrote:
>
>> I did indeed implement dynamic nursery sizing and did some preliminary
>> benchmarking. The headline figure: 15% speedup on the nofib/gc benchmarks,
>> though the variance was pretty large, and there were some slowdowns.
>>
>> My scheme was very simple... I kept track of the size and rough
>> collection time of the previous three collections and did a sort of crude
>> binary search to find a minimum in the search space. I did it this way
>> because it was simple and required constant time and memory to make a
>> decision (a rough sketch of the idea is below). One conclusion, though,
>> was that collection time is a bad metric, due to the way the RTS re-uses
>> blocks. As Simon pointed out, tracking retention or some other metric
>> would probably be better, but I still need to explore that. Another
>> result: the default nursery size is almost always too small (at least for
>> the nofib programs). CPUs come with huge caches, and using the RTS flag
>> -A to set the allocation area to roughly the size of the L3 cache usually
>> gave pretty decent speedups.
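>>
>> For concreteness, the decision logic was roughly of this shape (a
>> from-memory, much-simplified sketch with invented names, not the actual
>> patch):
>>
>> import Data.List (minimumBy)
>> import Data.Ord (comparing)
>>
>> -- One observation: a nursery size (in blocks) and the GC time it cost.
>> type Sample = (Int, Double)
>>
>> -- Pick the next nursery size from the last few samples: move halfway
>> -- toward the cheapest size seen so far, or keep growing until we have
>> -- enough data. Constant time and memory as long as the sample list is
>> -- capped (we kept three).
>> nextNurserySize :: [Sample] -> Int -> Int
>> nextNurserySize samples cur
>>   | length samples < 3 = cur * 2                  -- too little data: grow
>>   | best < cur         = (cur + best) `div` 2     -- step down toward best
>>   | best > cur         = (cur + best) `div` 2 + 1 -- step up toward best
>>   | otherwise          = cur                      -- current size won: stay
>>   where
>>     (best, _) = minimumBy (comparing snd) samples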
>>
>> I did this for a class project, and had to put it down to focus on other
>> things, and just haven't picked it back up. I still have a patch lying
>> around, and several pages of notes with ideas for improving both the
>> metric and the search. I'm hoping to pick it back up again in a couple of
>> months, with an eye on a workshop paper and a real patch for 7.10.
>>
>>
>> On Tue, Oct 1, 2013 at 3:36 AM, Simon Marlow <marlowsd at gmail.com> wrote:
>>
>>> It's typical for benchmarks that allocate a large data structure to
>>> spend a lot of time in the GC. The data gets copied twice - once in the
>>> young generation and then again when promoted to the old generation. You
>>> can make this kind of benchmark much faster by just using a bigger
>>> allocation area.
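>>>
>>> For example, something along these lines (the right value is
>>> machine-dependent; -A8m is just a plausible L3-sized guess):
>>>
>>> $ ./Test +RTS -s -A8m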
>>>
>>> There's nothing inherently costly about StgMutArrPtrs compared to other
>>> objects, except that they are variable-sized and therefore we can't
>>> unroll the copy loop, but I don't think that's a big effect. The actual
>>> copying is the major cost.
>>>
>>> The way to improve this kind of benchmark would be to add some
>>> heuristics for varying the nursery size based on the quantity of data
>>> retained, for example. I think there's a lot of room for improvement here,
>>> but someone needs to do some careful benchmarking and experimentation.
>>> Andrew Farmer did some work on this and allegedly got good results but we
>>> never saw the code (hint hint!).
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>> On 1 October 2013 06:43, Johan Tibell <johan.tibell at gmail.com> wrote:
>>>
>>>> The code for 'allocate' in rts/sm/Storage.c doesn't seem that
>>>> expensive: one extra branch compared to inline allocation, with the
>>>> allocation made in the next nursery block (risking fragmentation?).
>>>>
>>>> -- Johan
>>>>
>>>> On Mon, Sep 30, 2013 at 9:50 PM, Johan Tibell <johan.tibell at gmail.com>
>>>> wrote:
>>>> > Hi,
>>>> >
>>>> > When I benchmark Data.HashMap.insert from unordered-containers
>>>> > (inserting the keys [0..10000]) the runtime is dominated by GC:
>>>> >
>>>> > $ cat Test.hs
>>>> > module Main where
>>>> >
>>>> > import Control.DeepSeq
>>>> > import Control.Exception
>>>> > import Control.Monad
>>>> > import qualified Data.HashMap.Strict as HM
>>>> > import Data.List (foldl')
>>>> >
>>>> > main = do
>>>> >   -- force the key list up front so GC pressure comes from the maps
>>>> >   let ks = [0..10000] :: [Int]
>>>> >   evaluate (rnf ks)
>>>> >   -- build a fresh 10001-element HashMap on every iteration
>>>> >   forM_ ([0..1000] :: [Int]) $ \ x -> do
>>>> >     evaluate $ HM.null $ foldl' (\ m k -> HM.insert k x m) HM.empty ks
>>>> >
>>>> > $ perf record -g ./Test +RTS -s
>>>> >    6,187,678,112 bytes allocated in the heap
>>>> >    3,309,887,128 bytes copied during GC
>>>> >        1,299,200 bytes maximum residency (1002 sample(s))
>>>> >          118,816 bytes maximum slop
>>>> >                5 MB total memory in use (0 MB lost due to fragmentation)
>>>> >
>>>> >                                    Tot time (elapsed)  Avg pause  Max pause
>>>> >   Gen  0     11089 colls,     0 par    1.31s    1.30s     0.0001s    0.0005s
>>>> >   Gen  1      1002 colls,     0 par    0.49s    0.51s     0.0005s    0.0022s
>>>> >
>>>> >   INIT    time    0.00s  (  0.00s elapsed)
>>>> >   MUT     time    1.02s  (  1.03s elapsed)
>>>> >   GC      time    1.80s  (  1.80s elapsed)
>>>> >   EXIT    time    0.00s  (  0.00s elapsed)
>>>> >   Total   time    2.82s  (  2.84s elapsed)
>>>> >
>>>> > %GC time 63.7% (63.5% elapsed)
>>>> >
>>>> > Alloc rate 6,042,264,963 bytes per MUT second
>>>> >
>>>> > Productivity 36.3% of total user, 36.1% of total elapsed
>>>> >
>>>> > $ perf report
>>>> > 41.46% Test Test [.] evacuate
>>>> > 15.47% Test Test [.] scavenge_block
>>>> > 11.04% Test Test [.] s3cN_info
>>>> > 8.74% Test Test [.] s3aZ_info
>>>> > 3.59% Test Test [.] 0x7ff5
>>>> > 2.83% Test Test [.] scavenge_mut_arr_ptrs
>>>> > 2.69% Test libc-2.15.so [.] 0x147fd9
>>>> > 2.51% Test Test [.] allocate
>>>> > 2.00% Test Test [.] s3oo_info
>>>> > 0.91% Test Test [.] todo_block_full
>>>> > 0.87% Test Test [.] hs_popcnt64
>>>> > 0.80% Test Test [.] s3en_info
>>>> > 0.62% Test Test [.] s3el_info
>>>> >
>>>> > Is GC'ing StgMutArrPtrs and StgArrPtrs, which I create a lot of, more
>>>> > expensive than GC'ing normal heap objects (i.e. those of standard data
>>>> > types)?
>>>> >
>>>> > -- Johan
>>>>
>>>
>>>
>>
>> _______________________________________________
>> ghc-devs mailing list
>> ghc-devs at haskell.org
>> http://www.haskell.org/mailman/listinfo/ghc-devs
>>
>>
>