Performance of small allocations via prim ops

Fri Apr 7 05:38:16 UTC 2023

Ah, some other optimization seems to be kicking in here. When I increase
the size of the array to > 128 then I see a call to stg_newByteArray# being
emitted:

     {offset
       c1kb: // global
           if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto c1kd;
       c1kc: // global
           R1 = Main.main1_closure;
           call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
       c1kd: // global
           I64[Sp - 8] = c1k9;
           R1 = 129;
           Sp = Sp - 8;
           call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8,
upd: 8;

-harendra

On Fri, 7 Apr 2023 at 10:49, Harendra Kumar <harendra.kumar at gmail.com>
wrote:

> Thanks Ben and Carter.
>
> I compiled the following to Cmm:
>
> {-# LANGUAGE MagicHash #-}
> {-# LANGUAGE UnboxedTuples #-}
>
> import GHC.IO
> import GHC.Exts
>
> data M = M (MutableByteArray# RealWorld)
>
> main = do
>      _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M
> arr #))
>      return ()
>
> It produced the following Cmm:
>
>      {offset
>        c1k3: // global
>            Hp = Hp + 24;
>            if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;
>        c1k7: // global
>            HpAlloc = 24;
>            R1 = Main.main1_closure;
>            call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
>        c1k6: // global
>            I64[Hp - 16] = stg_ARR_WORDS_info;
>            I64[Hp - 8] = 1;
>            R1 = GHC.Tuple.()_closure+1;
>            call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
>      }
>
> It seems to be as good as it gets. There is absolutely no scope for
> improvement in this.
>
> -harendra
>
> On Fri, 7 Apr 2023 at 03:32, Ben Gamari <ben at smart-cactus.org> wrote:
>
>> Harendra Kumar <harendra.kumar at gmail.com> writes:
>>
>> > I was looking at the RTS code for allocating small objects via prim ops
>> > e.g. newByteArray# . The code looks like:
>> >
>> > stg_newByteArrayzh ( W_ n )
>> > {
>> >     MAYBE_GC_N(stg_newByteArrayzh, n);
>> >
>> >     payload_words = ROUNDUP_BYTES_TO_WDS(n);
>> >     words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
>> >     ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
>> >
>> > We are making a foreign call here (ccall). I am wondering how much
>> overhead
>> > a ccall adds? I guess it may have to save and restore registers. Would
>> it
>> > be better to do the fast path case of allocating small objects from the
>> > nursery using cmm code like in stg_gc_noregs?
>> >
>> GHC's operational model is designed in such a way that foreign calls are
>> fairly cheap (e.g. we don't need to switch stacks, which can be quite
>> costly). Judging by the assembler produced for newByteArray# in one
>> random x86-64 tree that I have lying around, it's only a couple of
>> data-movement instructions, an %eax clear, and a stack pop:
>>
>>       36:       48 89 ce                mov    %rcx,%rsi
>>       39:       48 89 c7                mov    %rax,%rdi
>>       3c:       31 c0                   xor    %eax,%eax
>>       3e:       e8 00 00 00 00          call   43
>> <stg_newByteArrayzh+0x43>
>>       43:       48 83 c4 08             add    $0x8,%rsp
>>
>> The data movement operations in particular are quite cheap on most
>> microarchitectures where GHC would run due to register renaming. I doubt
>> that this overhead would be noticable in anything but a synthetic
>> benchmark. However, it never hurts to measure.
>>
>> Cheers,
>>
>> - Ben
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20230407/227a264a/attachment.html>