<div dir="ltr"><div>Little bit of grepping in the code gave me this:</div><div><br></div><div>emitPrimOp cfg primop =<br> let max_inl_alloc_size = fromIntegral (stgToCmmMaxInlAllocSize cfg)<br> in case primop of<br> NewByteArrayOp_Char -> \case<br> [(CmmLit (CmmInt n w))]<br> | asUnsigned w n <= max_inl_alloc_size -- <------------------------------- see this line<br> -> opIntoRegs $ \ [res] -> doNewByteArrayOp res (fromInteger n)<br> _ -> PrimopCmmEmit_External</div><div><br></div><div>We are emitting a more efficient code when the size of the array is smaller. And the threshold is governed by a compiler flag:</div><div><br></div><div> , make_ord_flag defGhcFlag "fmax-inline-alloc-size"<br> (intSuffix (\n d -> d { maxInlineAllocSize = n }))</div><div><br></div><div>This means allocation of smaller arrays is extremely efficient and we can control it using `-fmax-inline-alloc-size`, the default is 128. That's a new thing I learnt today.</div><div><br></div><div>Given this new finding, my original question now applies only to the case when the array size is bigger than this configurable threshold, which is a little less motivating. And Ben says that the call is not expensive, so we can leave it there.<br></div><div><br></div><div>-harendra<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 7 Apr 2023 at 11:08, Harendra Kumar <<a href="mailto:harendra.kumar@gmail.com">harendra.kumar@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Ah, some other optimization seems to be kicking in here. When I increase the size of the array to > 128 then I see a call to stg_newByteArray# being emitted:</div><div><br></div><div> {offset<br> c1kb: // global<br> if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto c1kd;<br> c1kc: // global<br> R1 = Main.main1_closure;<br> call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;<br> c1kd: // global<br> I64[Sp - 8] = c1k9;<br> R1 = 129;<br> Sp = Sp - 8;<br> call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8, upd: 8;</div><div><br></div><div>-harendra<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 7 Apr 2023 at 10:49, Harendra Kumar <<a href="mailto:harendra.kumar@gmail.com" target="_blank">harendra.kumar@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Thanks Ben and Carter.</div><div><br></div><div>I compiled the following to Cmm:</div><div><br></div><div>{-# LANGUAGE MagicHash #-}<br>{-# LANGUAGE UnboxedTuples #-}<br><br>import <a href="http://GHC.IO" target="_blank">GHC.IO</a><br>import GHC.Exts<br><br>data M = M (MutableByteArray# RealWorld)<br><br>main = do<br> _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M arr #))<br> return ()</div><div><br></div><div>It produced the following Cmm:</div><div><br></div><div> {offset<br> c1k3: // global<br> Hp = Hp + 24;<br> if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;<br> c1k7: // global<br> HpAlloc = 24;<br> R1 = Main.main1_closure;<br> call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;<br> c1k6: // global<br> I64[Hp - 16] = stg_ARR_WORDS_info;<br> I64[Hp - 8] = 1;<br> R1 = GHC.Tuple.()_closure+1;<br> call (P64[Sp])(R1) args: 8, res: 0, upd: 8;<br> }</div><div><br></div><div>It seems to be as good as it gets. There is absolutely no scope for improvement in this.</div><div><br></div><div>-harendra<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 7 Apr 2023 at 03:32, Ben Gamari <<a href="mailto:ben@smart-cactus.org" target="_blank">ben@smart-cactus.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Harendra Kumar <<a href="mailto:harendra.kumar@gmail.com" target="_blank">harendra.kumar@gmail.com</a>> writes:<br>
<br>
> I was looking at the RTS code for allocating small objects via prim ops<br>
> e.g. newByteArray# . The code looks like:<br>
><br>
> stg_newByteArrayzh ( W_ n )<br>
> {<br>
> MAYBE_GC_N(stg_newByteArrayzh, n);<br>
><br>
> payload_words = ROUNDUP_BYTES_TO_WDS(n);<br>
> words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;<br>
> ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);<br>
><br>
> We are making a foreign call here (ccall). I am wondering how much overhead<br>
> a ccall adds? I guess it may have to save and restore registers. Would it<br>
> be better to do the fast path case of allocating small objects from the<br>
> nursery using cmm code like in stg_gc_noregs?<br>
><br>
GHC's operational model is designed in such a way that foreign calls are<br>
fairly cheap (e.g. we don't need to switch stacks, which can be quite<br>
costly). Judging by the assembler produced for newByteArray# in one<br>
random x86-64 tree that I have lying around, it's only a couple of<br>
data-movement instructions, an %eax clear, and a stack pop:<br>
<br>
36: 48 89 ce mov %rcx,%rsi<br>
39: 48 89 c7 mov %rax,%rdi<br>
3c: 31 c0 xor %eax,%eax<br>
3e: e8 00 00 00 00 call 43 <stg_newByteArrayzh+0x43><br>
43: 48 83 c4 08 add $0x8,%rsp<br>
<br>
The data movement operations in particular are quite cheap on most<br>
microarchitectures where GHC would run due to register renaming. I doubt<br>
that this overhead would be noticable in anything but a synthetic<br>
benchmark. However, it never hurts to measure.<br>
<br>
Cheers,<br>
<br>
- Ben<br>
</blockquote></div></div>
</blockquote></div></div>
</blockquote></div>