[Haskell-cafe] Boxed Mutable Arrays

Wed Dec 16 08:27:24 EST 2009

On Dec 16, 2009, at 5:50 AM, Serguey Zefirov wrote:

> 2009/12/16 Matt Morrow <moonpatio at gmail.com>:
>> What are peoples' thoughts on this?
>> http://hackage.haskell.org/trac/ghc/ticket/650#comment:16
> 
> I think it won't get any better.
> 
> Either we have O(log(N)) updates because we have to update
> hierarchical structure to speed up GC scanning (to get it to
> O(Mlog(N)), where M is a number of updated cells), or we have O(N)
> scanning.
> 
> As far as I can tell, other systems (Java, for example) suffer from
> that problem as well.

The ticket suggests using VM protection to track writes.  This has been tried a number of times over the years by the GC community, and has usually gone badly - it turns out getting the OS to tell you about the page fault in reasonable time is hard.  It usually turns out to be cheaper just to make a cheap write barrier.

There are tricks that let us avoid re-scanning an array frequently during generational GC (and in particular avoid the problem of re-scanning the entire array during minor GC just to catch a single write).  But they require that we design appropriate read or write barriers for array accesses.  This is a Simple Matter of Programming, but hasn't risen to the top of anyone's priority list (in spite of the fact that this bug has existed for over a decade now, as far as I know).  It's fiddly coding and annoying to debug.

For those who are curious, here's one trick that avoids repeated re-scanning of arrays during GC:
  * During post-allocation tracing, move all data pointed to by the array into contiguous chunks of memory.
  * Group together that memory logically with the memory containing the array itself.
  * Keep track of whether anything points in to this memory; if nothing does, free the lot of it (or use the bad old tracing method; you won't find the big array and won't pay anything).
  * If there are subsequent writes, trace those and move them to the region.
  * Re-scan the whole region only if there have been enough writes (presumably causing the region to fill with data that has been overwritten and thrown away).

Here the cost is roughly proportional to the number of writes you perform.  Haskell being lazy, that might be more than you would expect.  There are (lots of) other tricks along the same lines with differing tradeoffs, and naturally I've skimmed over some important details.  What you should take away is that GCing an array of pointers need not be expensive---we can do O(writes) scanning over its lifespan, rather than O(N) scanning work at every single GC.  And note that in general we don't pay any GC costs at all unless we actually keep the array around for a while.

-Jan-Willem Maessen

> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe