FFI calls: is it possible to allocate a small memory block on a stack?

Fri Apr 23 14:03:14 EDT 2010

Hi Simon,

> > OK, the code I have checked out from the repository contains this in
> > "rts/sm/Storage.h":
> >
> >     extern bdescr * pinned_object_block;
> >
> >
> > And in "rts/sm/Storage.c":
> >
> >     bdescr *pinned_object_block;
> 
> Ah, I was looking in the HEAD, where I've already fixed this by moving 
> pinned_object_block into the Capability and hence making it CPU-local. 
> The patch that fixed it was
> 
> Tue Dec  1 16:03:21 GMT 2009  Simon Marlow <marlowsd at gmail.com>
>    * Make allocatePinned use local storage, and other refactorings

The version I have checked out is 6.12 and that's why I haven't seen
this patch.
Are there any plans for including this patch in the next GHC release?

> Yes, this was also fixed by the aforementioned patch.
> 
> Bear in mind that in the vast majority of programs allocatePinned is not 
> in the inner loop, which is why it hasn't been a priority to optimise it 
> until now.

I guess that code which uses ByteStrings (especially when it splits
them into many smaller substrings) calls allocatePinned very
frequently, even within inner loops.
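
As an illustration, here is a minimal sketch of an inner loop that
makes many small pinned allocations. It does not use ByteString
itself. It assumes that mallocForeignPtrBytes from base is backed by a
pinned byte array and so ends up in allocatePinned, which as far as I
understand is also how ByteString buffers are allocated. The function
name fillAndSum and the sizes are made up for the example.

```haskell
module Main (main) where

import Control.Monad (forM_)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import Foreign.Storable (peekElemOff, pokeElemOff)

-- Hypothetical helper: allocate a small buffer, fill it with
-- 0 .. size-1 and return the sum of its bytes.
-- Assumption: mallocForeignPtrBytes allocates pinned memory via the RTS.
fillAndSum :: Int -> IO Int
fillAndSum size = do
  fp <- mallocForeignPtrBytes size :: IO (ForeignPtr Word8)
  withForeignPtr fp $ \p -> do
    forM_ [0 .. size - 1] $ \i -> pokeElemOff p i (fromIntegral i)
    bytes <- mapM (peekElemOff p) [0 .. size - 1]
    return (sum (map fromIntegral bytes))

main :: IO ()
main = do
  -- 1000 separate 16-byte allocations, one per iteration.
  totals <- mapM (const (fillAndSum 16)) [1 .. 1000 :: Int]
  print (sum totals)
```

Running something like this with -threaded and several capabilities
would be one way to see whether the pinned allocation path shows up as
contention.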

> TSO-local would be bad: TSOs are lightweight threads and in many cases 
> are smaller than a block.  Capability-local is what you want.

Ah... Yes, capabilities are a far better choice.

> Right, but these are not common cases that need to be optimised.  newCAF 
> is only called once per CAF, thereafter it is accessed without locks.

I can't recall off the top of my head, but I think I had a case where
newCAF was called very actively in a simple piece of code. The code
looked like this:

sequence_ $ replicate N $ doSmth

The Cmm code showed that it produced calls to newCAF and something
related to black holes. When I added "return ()" after that line, both
the black holes and the calls to newCAF disappeared. It was on 6.12.1,
I believe. I still have no idea why it happened or why these black
holes were necessary, but I'll try to reproduce it one more time and
show you an example if it is of any interest to you.
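
A minimal sketch of what such a test case might look like follows. It
is not the original program: n, doSmth and loop are hypothetical
stand-ins. The one thing it relies on is that a top-level binding with
no arguments is a CAF, so the whole sequence_ expression is a
candidate for newCAF the first time it is entered.

```haskell
module Main (main) where

-- Hypothetical stand-in for N in the snippet above.
n :: Int
n = 3

-- Hypothetical stand-in for doSmth.
doSmth :: IO ()
doSmth = putStrLn "tick"

-- A top-level binding with no arguments: a CAF.
loop :: IO ()
loop = sequence_ $ replicate n $ doSmth

main :: IO ()
main = do
  loop
  -- The variant described above: an explicit "return ()" after the loop.
  return ()
```

Compiling this with and without the final "return ()", and comparing
the output of -ddump-cmm, should show whether the newCAF and black
hole code appears. The result is likely to depend on the GHC version
and the optimisation level.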

> It may be that we could find benchmarks where access to the block 
> allocator is the performance bottleneck, indeed in the parallel GC we 
> sometimes see contention for it.  If that turns out to be a problem then 
> we may need to think about per-CPU free lists in the block allocator, 
> but I think it would entail a fair bit of complexity and if we're not 
> careful extra memory overhead, e.g. where one CPU has all the free 
> blocks in its local free list and the others have none.  So I'd like to 
> avoid going down that route unless we absolutely have to.  The block 
> allocator is nice and simple right now.

I suppose I should check out the HEAD then and give it a try, because
earlier I had performance issues in the threaded runtime (~20%
overhead and far more noise) in an application which was doing some
slicing, reshuffling and composing of text via ByteStrings, with a
modest amount of data being passed around via "Chan"s.

On a slightly different topic: please could you point me to a place
where stg_upd_frame_info is generated? I can't find it in *.c, *.cmm or
*.hs and guess it is something very special.

With kind regards,
Denys Rtveliashvili