Performance of small allocations via prim ops
Harendra Kumar
harendra.kumar at gmail.com
Fri Apr 7 06:07:05 UTC 2023
Little bit of grepping in the code gave me this:
emitPrimOp cfg primop =
let max_inl_alloc_size = fromIntegral (stgToCmmMaxInlAllocSize cfg)
in case primop of
NewByteArrayOp_Char -> \case
[(CmmLit (CmmInt n w))]
| asUnsigned w n <= max_inl_alloc_size --
<------------------------------- see this line
-> opIntoRegs $ \ [res] -> doNewByteArrayOp res (fromInteger n)
_ -> PrimopCmmEmit_External
We are emitting a more efficient code when the size of the array is
smaller. And the threshold is governed by a compiler flag:
, make_ord_flag defGhcFlag "fmax-inline-alloc-size"
(intSuffix (\n d -> d { maxInlineAllocSize = n }))
This means allocation of smaller arrays is extremely efficient and we can
control it using `-fmax-inline-alloc-size`, the default is 128. That's a
new thing I learnt today.
Given this new finding, my original question now applies only to the case
when the array size is bigger than this configurable threshold, which is a
little less motivating. And Ben says that the call is not expensive, so we
can leave it there.
-harendra
On Fri, 7 Apr 2023 at 11:08, Harendra Kumar <harendra.kumar at gmail.com>
wrote:
> Ah, some other optimization seems to be kicking in here. When I increase
> the size of the array to > 128 then I see a call to stg_newByteArray# being
> emitted:
>
> {offset
> c1kb: // global
> if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto
> c1kd;
> c1kc: // global
> R1 = Main.main1_closure;
> call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
> c1kd: // global
> I64[Sp - 8] = c1k9;
> R1 = 129;
> Sp = Sp - 8;
> call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8,
> upd: 8;
>
> -harendra
>
> On Fri, 7 Apr 2023 at 10:49, Harendra Kumar <harendra.kumar at gmail.com>
> wrote:
>
>> Thanks Ben and Carter.
>>
>> I compiled the following to Cmm:
>>
>> {-# LANGUAGE MagicHash #-}
>> {-# LANGUAGE UnboxedTuples #-}
>>
>> import GHC.IO
>> import GHC.Exts
>>
>> data M = M (MutableByteArray# RealWorld)
>>
>> main = do
>> _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M
>> arr #))
>> return ()
>>
>> It produced the following Cmm:
>>
>> {offset
>> c1k3: // global
>> Hp = Hp + 24;
>> if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;
>> c1k7: // global
>> HpAlloc = 24;
>> R1 = Main.main1_closure;
>> call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
>> c1k6: // global
>> I64[Hp - 16] = stg_ARR_WORDS_info;
>> I64[Hp - 8] = 1;
>> R1 = GHC.Tuple.()_closure+1;
>> call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
>> }
>>
>> It seems to be as good as it gets. There is absolutely no scope for
>> improvement in this.
>>
>> -harendra
>>
>> On Fri, 7 Apr 2023 at 03:32, Ben Gamari <ben at smart-cactus.org> wrote:
>>
>>> Harendra Kumar <harendra.kumar at gmail.com> writes:
>>>
>>> > I was looking at the RTS code for allocating small objects via prim ops
>>> > e.g. newByteArray# . The code looks like:
>>> >
>>> > stg_newByteArrayzh ( W_ n )
>>> > {
>>> > MAYBE_GC_N(stg_newByteArrayzh, n);
>>> >
>>> > payload_words = ROUNDUP_BYTES_TO_WDS(n);
>>> > words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
>>> > ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
>>> >
>>> > We are making a foreign call here (ccall). I am wondering how much
>>> overhead
>>> > a ccall adds? I guess it may have to save and restore registers. Would
>>> it
>>> > be better to do the fast path case of allocating small objects from the
>>> > nursery using cmm code like in stg_gc_noregs?
>>> >
>>> GHC's operational model is designed in such a way that foreign calls are
>>> fairly cheap (e.g. we don't need to switch stacks, which can be quite
>>> costly). Judging by the assembler produced for newByteArray# in one
>>> random x86-64 tree that I have lying around, it's only a couple of
>>> data-movement instructions, an %eax clear, and a stack pop:
>>>
>>> 36: 48 89 ce mov %rcx,%rsi
>>> 39: 48 89 c7 mov %rax,%rdi
>>> 3c: 31 c0 xor %eax,%eax
>>> 3e: e8 00 00 00 00 call 43
>>> <stg_newByteArrayzh+0x43>
>>> 43: 48 83 c4 08 add $0x8,%rsp
>>>
>>> The data movement operations in particular are quite cheap on most
>>> microarchitectures where GHC would run due to register renaming. I doubt
>>> that this overhead would be noticable in anything but a synthetic
>>> benchmark. However, it never hurts to measure.
>>>
>>> Cheers,
>>>
>>> - Ben
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20230407/ce12ab99/attachment.html>
More information about the ghc-devs
mailing list