optimizing StgPtr allocate (Capability *cap, W_ n)

Thu Oct 16 07:04:22 UTC 2014

Hi Bulat,

This seems quite reasonable to me. Have you eyeballed the assembly
GCC produces to see that the hotpath is improved? If you can submit
a patch that would be great!

Cheers,
Edward

Excerpts from Bulat Ziganshin's message of 2014-10-14 10:08:59 -0700:
> Hello Glasgow-haskell-users,
> 
> I'm looking at https://github.com/ghc/ghc/blob/23bb90460d7c963ee617d250fa0a33c6ac7bbc53/rts/sm/Storage.c#L680
> 
> If I understand correctly, this is a speed-critical routine?
> 
> I think it could be improved like this:
> 
> StgPtr allocate (Capability *cap, W_ n)
> {
>     bdescr *bd;
>     StgPtr p;
> 
>     TICK_ALLOC_HEAP_NOCTR(WDS(n));
>     CCS_ALLOC(cap->r.rCCCS,n);
> 
> /// here starts new improved code:
> 
>     bd = cap->r.rCurrentAlloc;
>     if (bd == NULL || bd->free + n > bd->end) {
>         if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
>             ....
>         }
>         if (bd != NULL && bd->free + n <= bd->start + BLOCK_SIZE_W) {
>             bd->end = min (bd->start + BLOCK_SIZE_W, bd->free + LARGE_OBJECT_THRESHOLD/sizeof(W_));
>             goto usual_alloc;
>         }
>         ....
>     }
> 
> /// and here it stops
> 
> usual_alloc:
>     p = bd->free;
>     bd->free += n;
> 
>     IF_DEBUG(sanity, ASSERT(*((StgWord8*)p) == 0xaa));
>     return p;
> }
> 
> 
> I think it's obvious: we consolidate two ifs on the critical path
> into a single one, and avoid one ADD by keeping the highly useful
> bd->end pointer.
> 
> Further improvements may include removing the bd==NULL check by
> initializing bd->free=bd->end=NULL, and moving the entire "if" body
> into a separate slow_allocate() procedure marked "noinline", with
> allocate() probably marked as forceinline:
> 
> StgPtr allocate (Capability *cap, W_ n)
> {
>     bdescr *bd;
>     StgPtr p;
> 
>     TICK_ALLOC_HEAP_NOCTR(WDS(n));
>     CCS_ALLOC(cap->r.rCCCS,n);
> 
>     bd = cap->r.rCurrentAlloc;
>     if (bd->free + n > bd->end)
>         return slow_allocate(cap,n);
> 
>     p = bd->free;
>     bd->free += n;
> 
>     IF_DEBUG(sanity, ASSERT(*((StgWord8*)p) == 0xaa));
>     return p;
> }
> 
> This change will greatly simplify the optimizer's work. In my
> experience, current C++ compilers are weak at optimizing large
> functions with complex execution paths, and such transformations really
> improve the generated code.
>
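
For anyone who wants to eyeball the generated assembly outside the RTS,
the fast-path/slow-path split proposed above can be sketched as a
standalone toy. This is not GHC code: toy_cap, toy_block,
toy_allocate(), toy_slow_allocate() and TOY_BLOCK_WORDS are all
hypothetical names, malloc() stands in for the block allocator, and the
noinline attribute is the GCC/Clang spelling. Instead of starting with
free == end == NULL, the sketch starts with an empty sentinel block,
which gives the same "first call falls through to the slow path"
behaviour while keeping the pointer arithmetic valid ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t word;          /* stand-in for GHC's W_ (assumption) */

#define TOY_BLOCK_WORDS 512      /* hypothetical block size, in words */

/* Toy block descriptor: 'free' is the bump pointer, 'end' the cached limit. */
typedef struct {
    word *start;
    word *free;
    word *end;
} toy_block;

typedef struct {
    toy_block cur;
    word sentinel[1];            /* backing store for the initial empty block */
    unsigned slow_calls;         /* counts slow-path entries, for inspection */
} toy_cap;

/* Start with an empty block (free == end), so no separate NULL test is needed. */
static void toy_cap_init(toy_cap *cap)
{
    cap->cur.start = cap->cur.free = cap->cur.end = cap->sentinel;
    cap->slow_calls = 0;
}

/* Slow path: kept out of line so the fast path stays small. */
__attribute__((noinline))
static word *toy_slow_allocate(toy_cap *cap, size_t n)
{
    size_t words = n > TOY_BLOCK_WORDS ? n : TOY_BLOCK_WORDS;
    word *mem = malloc(words * sizeof(word));
    if (mem == NULL)
        abort();
    cap->slow_calls++;
    /* The previous block is simply abandoned here; a real allocator tracks it. */
    cap->cur.start = mem;
    cap->cur.free  = mem + n;
    cap->cur.end   = mem + words;
    return mem;
}

/* Fast path: one comparison against the cached limit, then a pointer bump. */
static inline word *toy_allocate(toy_cap *cap, size_t n)
{
    toy_block *bd = &cap->cur;
    assert(bd->free <= bd->end);
    if (n > (size_t)(bd->end - bd->free))
        return toy_slow_allocate(cap, n);
    word *p = bd->free;
    bd->free += n;
    return p;
}
```

Compiling this with optimisation and reading the output for
toy_allocate() is one way to check whether the hot path really shrinks
to a compare, a branch and an add; it says nothing about how the real
allocate() behaves with its profiling and sanity-check macros enabled.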