Performance of small allocations via prim ops
Harendra Kumar
harendra.kumar at gmail.com
Fri Apr 7 05:19:59 UTC 2023
Thanks Ben and Carter.
I compiled the following to Cmm:
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
import GHC.IO
import GHC.Exts
data M = M (MutableByteArray# RealWorld)
main = do
    _ <- IO (\s -> case newByteArray# 1# s of
                       (# s1, arr #) -> (# s1, M arr #))
    return ()
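(For anyone reproducing this: a dump like the one below can be obtained by
compiling with optimizations and asking for the Cmm, something along the
lines of "ghc -O2 -ddump-cmm Main.hs". The exact invocation here is
approximate.)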
It produced the following Cmm:
{offset
  c1k3: // global
      Hp = Hp + 24;
      if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;
  c1k7: // global
      HpAlloc = 24;
      R1 = Main.main1_closure;
      call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
  c1k6: // global
      I64[Hp - 16] = stg_ARR_WORDS_info;
      I64[Hp - 8] = 1;
      R1 = GHC.Tuple.()_closure+1;
      call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}
It seems to be as good as it gets; there is absolutely no scope for
improvement here.
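If one did want to measure the ccall overhead Ben mentions below, a rough
sketch of an allocation loop might look like the following (the loop shape
and iteration count are chosen arbitrarily for illustration, not taken from
any existing benchmark):

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.IO
import GHC.Exts

-- Allocate n one-byte arrays in a tight loop, discarding each one,
-- so that the per-allocation path (heap check + ccall) dominates.
allocLoop :: Int -> IO ()
allocLoop (I# n0) = IO (go n0)
  where
    go 0# s = (# s, () #)
    go i  s = case newByteArray# 1# s of
                  (# s1, _arr #) -> go (i -# 1#) s1

main :: IO ()
main = allocLoop 10000000

Running it with +RTS -s, and comparing against a variant that allocates
larger (or pinned) arrays, should give a first-order idea of how much time
goes into the allocation path itself.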
-harendra
On Fri, 7 Apr 2023 at 03:32, Ben Gamari <ben at smart-cactus.org> wrote:
> Harendra Kumar <harendra.kumar at gmail.com> writes:
>
> > I was looking at the RTS code for allocating small objects via prim ops,
> > e.g. newByteArray#. The code looks like:
> >
> > stg_newByteArrayzh ( W_ n )
> > {
> >     MAYBE_GC_N(stg_newByteArrayzh, n);
> >
> >     payload_words = ROUNDUP_BYTES_TO_WDS(n);
> >     words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
> >     ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
> >
> > We are making a foreign call here (ccall). I am wondering how much
> > overhead a ccall adds? I guess it may have to save and restore
> > registers. Would it be better to do the fast path case of allocating
> > small objects from the nursery using cmm code like in stg_gc_noregs?
> >
> GHC's operational model is designed in such a way that foreign calls are
> fairly cheap (e.g. we don't need to switch stacks, which can be quite
> costly). Judging by the assembler produced for newByteArray# in one
> random x86-64 tree that I have lying around, it's only a couple of
> data-movement instructions, an %eax clear, and a stack pop:
>
> 36: 48 89 ce mov %rcx,%rsi
> 39: 48 89 c7 mov %rax,%rdi
> 3c: 31 c0 xor %eax,%eax
> 3e: e8 00 00 00 00 call 43 <stg_newByteArrayzh+0x43>
> 43: 48 83 c4 08 add $0x8,%rsp
>
> The data movement operations in particular are quite cheap on most
> microarchitectures where GHC would run, due to register renaming. I doubt
> that this overhead would be noticeable in anything but a synthetic
> benchmark. However, it never hurts to measure.
>
> Cheers,
>
> - Ben
>