Performance of small allocations via prim ops

Fri Apr 7 07:28:56 UTC 2023

> We are emitting a more efficient code when the size of the array is
smaller. And the threshold is governed by a compiler flag:

It would be good if this was documented.  Perhaps in the Haddock for
`newByteArray#`?  Or where?

S

On Fri, 7 Apr 2023 at 07:07, Harendra Kumar <harendra.kumar at gmail.com>
wrote:

> Little bit of grepping in the code gave me this:
>
> emitPrimOp cfg primop =
>   let max_inl_alloc_size = fromIntegral (stgToCmmMaxInlAllocSize cfg)
>   in case primop of
>   NewByteArrayOp_Char -> \case
>     [(CmmLit (CmmInt n w))]
>       | asUnsigned w n <= max_inl_alloc_size     --
> <------------------------------- see this line
>       -> opIntoRegs  $ \ [res] -> doNewByteArrayOp res (fromInteger n)
>     _ -> PrimopCmmEmit_External
>
> We are emitting a more efficient code when the size of the array is
> smaller. And the threshold is governed by a compiler flag:
>
>   , make_ord_flag defGhcFlag "fmax-inline-alloc-size"
>       (intSuffix (\n d -> d { maxInlineAllocSize = n }))
>
> This means allocation of smaller arrays is extremely efficient and we can
> control it using `-fmax-inline-alloc-size`, the default is 128. That's a
> new thing I learnt today.
>
> Given this new finding, my original question now applies only to the case
> when the array size is bigger than this configurable threshold, which is a
> little less motivating. And Ben says that the call is not expensive, so we
> can leave it there.
>
> -harendra
>
> On Fri, 7 Apr 2023 at 11:08, Harendra Kumar <harendra.kumar at gmail.com>
> wrote:
>
>> Ah, some other optimization seems to be kicking in here. When I increase
>> the size of the array to > 128 then I see a call to stg_newByteArray# being
>> emitted:
>>
>>      {offset
>>        c1kb: // global
>>            if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto
>> c1kd;
>>        c1kc: // global
>>            R1 = Main.main1_closure;
>>            call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
>>        c1kd: // global
>>            I64[Sp - 8] = c1k9;
>>            R1 = 129;
>>            Sp = Sp - 8;
>>            call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8,
>> upd: 8;
>>
>> -harendra
>>
>> On Fri, 7 Apr 2023 at 10:49, Harendra Kumar <harendra.kumar at gmail.com>
>> wrote:
>>
>>> Thanks Ben and Carter.
>>>
>>> I compiled the following to Cmm:
>>>
>>> {-# LANGUAGE MagicHash #-}
>>> {-# LANGUAGE UnboxedTuples #-}
>>>
>>> import GHC.IO
>>> import GHC.Exts
>>>
>>> data M = M (MutableByteArray# RealWorld)
>>>
>>> main = do
>>>      _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M
>>> arr #))
>>>      return ()
>>>
>>> It produced the following Cmm:
>>>
>>>      {offset
>>>        c1k3: // global
>>>            Hp = Hp + 24;
>>>            if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;
>>>        c1k7: // global
>>>            HpAlloc = 24;
>>>            R1 = Main.main1_closure;
>>>            call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
>>>        c1k6: // global
>>>            I64[Hp - 16] = stg_ARR_WORDS_info;
>>>            I64[Hp - 8] = 1;
>>>            R1 = GHC.Tuple.()_closure+1;
>>>            call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
>>>      }
>>>
>>> It seems to be as good as it gets. There is absolutely no scope for
>>> improvement in this.
>>>
>>> -harendra
>>>
>>> On Fri, 7 Apr 2023 at 03:32, Ben Gamari <ben at smart-cactus.org> wrote:
>>>
>>>> Harendra Kumar <harendra.kumar at gmail.com> writes:
>>>>
>>>> > I was looking at the RTS code for allocating small objects via prim
>>>> ops
>>>> > e.g. newByteArray# . The code looks like:
>>>> >
>>>> > stg_newByteArrayzh ( W_ n )
>>>> > {
>>>> >     MAYBE_GC_N(stg_newByteArrayzh, n);
>>>> >
>>>> >     payload_words = ROUNDUP_BYTES_TO_WDS(n);
>>>> >     words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
>>>> >     ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
>>>> >
>>>> > We are making a foreign call here (ccall). I am wondering how much
>>>> overhead
>>>> > a ccall adds? I guess it may have to save and restore registers.
>>>> Would it
>>>> > be better to do the fast path case of allocating small objects from
>>>> the
>>>> > nursery using cmm code like in stg_gc_noregs?
>>>> >
>>>> GHC's operational model is designed in such a way that foreign calls are
>>>> fairly cheap (e.g. we don't need to switch stacks, which can be quite
>>>> costly). Judging by the assembler produced for newByteArray# in one
>>>> random x86-64 tree that I have lying around, it's only a couple of
>>>> data-movement instructions, an %eax clear, and a stack pop:
>>>>
>>>>       36:       48 89 ce                mov    %rcx,%rsi
>>>>       39:       48 89 c7                mov    %rax,%rdi
>>>>       3c:       31 c0                   xor    %eax,%eax
>>>>       3e:       e8 00 00 00 00          call   43
>>>> <stg_newByteArrayzh+0x43>
>>>>       43:       48 83 c4 08             add    $0x8,%rsp
>>>>
>>>> The data movement operations in particular are quite cheap on most
>>>> microarchitectures where GHC would run due to register renaming. I doubt
>>>> that this overhead would be noticable in anything but a synthetic
>>>> benchmark. However, it never hurts to measure.
>>>>
>>>> Cheers,
>>>>
>>>> - Ben
>>>>
>>> _______________________________________________
> ghc-devs mailing list
> ghc-devs at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20230407/cfa76ad8/attachment.html>