FFI calls: is it possible to allocate a small memory block on a stack?

Fri Apr 23 06:10:35 EDT 2010

On 23/04/2010 04:39, Denys Rtveliashvili wrote:

> OK, the code I have checked out from the repository contains this in
> "rts/sm/Storage.h":
>
>     extern bdescr * pinned_object_block;
>
>
> And in "rts/sm/Storage.c":
>
>     bdescr *pinned_object_block;

Ah, I was looking in the HEAD, where I've already fixed this by moving 
pinned_object_block into the Capability and hence making it CPU-local. 
The patch that fixed it was

Tue Dec  1 16:03:21 GMT 2009  Simon Marlow <marlowsd at gmail.com>
   * Make allocatePinned use local storage, and other refactorings

> As for locking, here is one example:
>
>     StgPtr
>     allocatePinned( lnat n )
>     {
>         StgPtr p;
>         bdescr *bd = pinned_object_block;
>
>         // If the request is for a large object, then allocate()
>         // will give us a pinned object anyway.
>         if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
>             p = allocate(n);
>             Bdescr(p)->flags |= BF_PINNED;
>             return p;
>         }
>
>         ACQUIRE_SM_LOCK; // [RTVD: here we acquire the lock]
>
>         TICK_ALLOC_HEAP_NOCTR(n);
>         CCS_ALLOC(CCCS,n);
>
>         // If we don't have a block of pinned objects yet, or the current
>         // one isn't large enough to hold the new object, allocate a new one.
>         if (bd == NULL || (bd->free + n) > (bd->start + BLOCK_SIZE_W)) {
>             pinned_object_block = bd = allocBlock();
>             dbl_link_onto(bd, &g0s0->large_objects);
>             g0s0->n_large_blocks++;
>             bd->gen_no = 0;
>             bd->step = g0s0;
>             bd->flags = BF_PINNED | BF_LARGE;
>             bd->free = bd->start;
>             alloc_blocks++;
>         }
>
>         p = bd->free;
>         bd->free += n;
>         RELEASE_SM_LOCK; // [RTVD: here we release the lock]
>         return p;
>     }

Yes, this was also fixed by the aforementioned patch.

Bear in mind that in the vast majority of programs allocatePinned is not 
in the inner loop, which is why it hasn't been a priority to optimise it 
until now.

>     Of course, TICK_ALLOC_HEAP_NOCTR and CCS_ALLOC may require
>     synchronization if they use shared state (which is, again, probably
>     unnecessary). However, in case no profiling goes on and
>     "pinned_object_block" is TSO-local, isn't it possible to remove
>     locking completely from this code? The only case when locking will
>     be necessary is when a fresh block has to be allocated, and that can
>     be done within the "allocBlock" method (or, more precisely, by using
>     "allocBlock_lock").

TSO-local would be bad: TSOs are lightweight threads and in many cases 
are smaller than a block.  Capability-local is what you want.
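
To illustrate the capability-local idea, here is a rough sketch. It is not
the actual RTS code: Block, Cap, BLOCK_WORDS, alloc_block and alloc_pinned
are simplified, hypothetical stand-ins, and malloc stands in for the block
allocator. The point is only that the common path touches nothing but state
owned by the calling capability, so it needs no lock; only the refill path
goes to the shared allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the RTS types and constants. */
#define BLOCK_WORDS 512

typedef struct Block {
    size_t *start;   /* first word of the block */
    size_t *free;    /* next unallocated word */
} Block;

typedef struct Cap {
    Block *pinned_block;   /* owned by this capability only */
} Cap;

/* Stand-in for the shared block allocator.  In a real runtime this is
 * the only step that would have to take a lock. */
static Block *alloc_block(void)
{
    Block *bd = malloc(sizeof *bd);
    if (bd == NULL) return NULL;
    bd->start = malloc(BLOCK_WORDS * sizeof(size_t));
    if (bd->start == NULL) { free(bd); return NULL; }
    bd->free = bd->start;
    return bd;
}

/* Bump-allocate n words from the capability's own pinned block.
 * Requests larger than a block are rejected in this sketch. */
static size_t *alloc_pinned(Cap *cap, size_t n)
{
    Block *bd = cap->pinned_block;
    size_t *p;

    if (n > BLOCK_WORDS) return NULL;

    /* Refill when there is no block yet or the current one is too full. */
    if (bd == NULL || (size_t)(bd->free - bd->start) + n > BLOCK_WORDS) {
        bd = alloc_block();
        if (bd == NULL) return NULL;
        /* The old block is simply dropped here; a real runtime would
         * keep it on a list so the GC can find it. */
        cap->pinned_block = bd;
    }

    /* Fast path: no lock, because only this capability uses bd. */
    p = bd->free;
    bd->free += n;
    assert(bd->free <= bd->start + BLOCK_WORDS);
    return p;
}
```

Each capability would keep its own Cap, so two capabilities calling
alloc_pinned at the same time never share a block.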

>     ACQUIRE_SM_LOCK/RELEASE_SM_LOCK pair is present in other places too,
>     but I have not analysed yet if it is really necessary there. For
>     example, things like newCAF and newDynCAF are wrapped into it.

Right, but these are not common cases that need to be optimised.  newCAF 
is only called once per CAF; thereafter the CAF is accessed without locks.

It may be that we could find benchmarks where access to the block 
allocator is the performance bottleneck; indeed, in the parallel GC we 
sometimes see contention for it.  If that turns out to be a problem then 
we may need to think about per-CPU free lists in the block allocator, 
but I think it would entail a fair bit of complexity and, if we're not 
careful, extra memory overhead, e.g. where one CPU has all the free 
blocks in its local free list and the others have none.  So I'd like to 
avoid going down that route unless we absolutely have to.  The block 
allocator is nice and simple right now.
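
A toy sketch of the imbalance problem, purely for illustration: Allocator,
FreeBlock, N_CPUS, free_block and get_block are hypothetical names, not
anything in the RTS, and the lock on the shared list is only indicated by a
comment. With naive per-CPU lists, a CPU whose local list is empty gets
nothing even though another CPU is sitting on free blocks.

```c
#include <assert.h>
#include <stddef.h>

#define N_CPUS 2   /* hypothetical number of CPUs */

typedef struct FreeBlock {
    struct FreeBlock *next;
} FreeBlock;

typedef struct {
    FreeBlock *local[N_CPUS];  /* each list touched only by its own CPU */
    FreeBlock *global;         /* shared list; needs a lock in real code */
} Allocator;

/* Return a block to the freeing CPU's local list (no lock needed). */
static void free_block(Allocator *a, int cpu, FreeBlock *b)
{
    assert(b != NULL);
    b->next = a->local[cpu];
    a->local[cpu] = b;
}

/* Take a block: local list first, then the shared list. */
static FreeBlock *get_block(Allocator *a, int cpu)
{
    FreeBlock *b = a->local[cpu];
    if (b != NULL) {
        a->local[cpu] = b->next;
        return b;
    }
    /* Slow path: a real allocator would take the global lock here. */
    b = a->global;
    if (b != NULL) {
        a->global = b->next;
    }
    return b;   /* may be NULL even if another CPU holds free blocks */
}
```

Avoiding that NULL means either stealing from other CPUs' lists or
periodically rebalancing through the shared list, which is where the extra
complexity comes from.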

Cheers,
	Simon