Replacement for GMP: Update

skaller skaller at
Fri Aug 11 23:06:15 EDT 2006

On Fri, 2006-08-11 at 18:56 -0400, Peter Tanski wrote:
> John,
> > After all on the average call where an object of that
> > size is free already it is a single array lookup, we have:
> >
> > (a) fetch pointer (one read)
> > (b) fetch next (one read)
> > (c) store next as current (one write)
> This is true for memory access; it is not true for memory  
> allocation.

That's an *allocation* obtained by popping a suitable
block off a linked list (sorry should have explained better)

>   I do not know how malloc allocates memory on Windows but  
> on general POSIX systems the kernel uses a linked list and lots of  
> other management things to reduce fragmentation, such as KMEMSTAT.  

Yes, but most allocations don't require a kernel call:
malloc is implemented in libc and suballocates a block
obtained from the kernel. Freed blocks go on a free list
indexed by block size .. so the typical program gets
to a state where most allocations reduce to popping a list.
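To make that concrete, here is a minimal sketch of a size-class free list, a hypothetical illustration only and not any particular libc's implementation (the class count and granularity are made up). Freed blocks are pushed onto a list indexed by size class; allocation pops the head when one is available, which is exactly the read-read-write sequence above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NCLASSES 32
#define GRANULE  16  /* hypothetical size-class granularity */

typedef struct node { struct node *next; } node_t;

static node_t *free_list[NCLASSES];

static void *alloc_block(size_t size) {
    size_t cls = (size + GRANULE - 1) / GRANULE;
    if (cls < NCLASSES && free_list[cls]) {
        node_t *p = free_list[cls];   /* (a) fetch pointer             */
        free_list[cls] = p->next;     /* (b) fetch next, (c) store it  */
        return p;
    }
    return malloc(size);  /* miss: fall back to the general allocator */
}

static void free_block(void *ptr, size_t size) {
    size_t cls = (size + GRANULE - 1) / GRANULE;
    if (cls < NCLASSES) {
        node_t *p = ptr;
        p->next = free_list[cls];     /* push onto the class's list */
        free_list[cls] = p;
    } else {
        free(ptr);
    }
}
```

In the steady state the `free_list[cls]` hit path is taken almost every time, so no kernel call and no search is involved.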

> Malloc may also block, which is something that you have more control  
> over in your own garbage collected heap.


>  A really good explanation  
> for the problem of rapidly allocating and deallocating temporary  
> blocks of memory under 35kb is here: 
> archives/2006/05/16/36/ .

Eh? That explains why OSX is slower with 35K blocks because
it switches to kernel calls for larger allocations at that 
size .. the article doesn't say anything about smaller blocks.

I assume we're considering small blocks here: list nodes and
so on are small, and these would be the majority case in a
functional programming language.

> > A purely functional system -- one which does NOT convert
> > self tail calls into jumps and reuse storage -- can
> > perhaps be faster, since each thread can have its own
> > local arena to allocate from (without need of any write
> > barrier) .. however it isn't clear to me what the trade
> > off is between cache misses and parallelism.
> That is interesting but I do not understand whether your mention of  
> self tail calls turned into jumps was low or high level. 

What I mean is the optimisation (which Felix does for functional
code) whereby you convert:

	int f(int x) {
		int y = x + x;
		if (x > 100) return x;
		return f(y);
	}

into:

	int f(int x) {
	start:;
		int y = x + x;
		if (x > 100) return x;
		x = y; // assign argument
		goto start;
	}

This optimisation eliminates the need to spawn a new stack
frame, and will eliminate cache misses.

So it looks very efficient.

My point here is that actually this is a disastrous optimisation
in a multi-processing environment, because in general,
assigning to a pointer variable means the store isn't write-once.

With write-once store -- ignoring registers -- you can
collect lazily in a background thread. Even though you
aren't scanning a frozen snapshot of all of memory .. new
memory is being allocated as the sweep runs .. it doesn't
matter. Any garbage you find is certain to be garbage.

This is because you cannot miss a heap block in your sweep.
Even though new blocks are being allocated in parallel,
they can only obtain a pointer from the old picture
of the heap (implying the block is reachable), and
the pointer in the old block from which it was obtained
cannot be erased, since constructed blocks are read-only.

I haven't figured all the details. I am imagining a generational
copying collector with a separate young heap for every
thread, whilst the great grand mother heap is collected
in the background.

This means (a) mother collection and (b) young heap compaction 
both work on a per-thread basis, independently
of other threads and without any synchronisation or locking.

What requires synchronisation is aging. Moving objects
out of the child heaps into a parent needs to be
serialised, and the start of a new sweep of the parent
heap must twiddle a lock or something to ensure it
has a coherent snapshot of the heap at some point
of this serialisation (however the picture of the heap
only needs to be synchronised once).

For example, if you have a linked list of the mother heap
blocks, you have to serialise pushing stuff onto it during
aging .. and you have to get a valid pointer to start the
sweep from.
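As a rough sketch of that serialisation point -- purely hypothetical, the names and structure are mine -- a single mutex can guard the mother heap's block list, so promoting a block and taking the sweep's starting snapshot are serialised, while the sweep itself then runs lock-free because blocks already on the list are immutable:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct block { struct block *next; /* payload ... */ } block_t;

static block_t *mother_head = NULL;
static pthread_mutex_t mother_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called when a young object survives and is promoted (aged). */
void age_block(block_t *b) {
    pthread_mutex_lock(&mother_lock);
    b->next = mother_head;   /* push onto the mother heap's list */
    mother_head = b;
    pthread_mutex_unlock(&mother_lock);
}

/* Called by the background collector: one synchronised read gives a
 * coherent snapshot to sweep from. Blocks pushed after this point are
 * simply not part of this sweep, which is safe because anything the
 * sweep finds unreachable in its snapshot stays unreachable. */
block_t *begin_sweep(void) {
    pthread_mutex_lock(&mother_lock);
    block_t *snapshot = mother_head;
    pthread_mutex_unlock(&mother_lock);
    return snapshot;
}
```

The lock is only touched on the infrequent events (aging, starting a sweep); raw allocation and the sweep itself never take it.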

The point is, this scheme reduces the need for serialisation
to the infrequent need to age objects or start a new
sweep of the mother heap: the actual sweep, raw allocations,
and even young heap compaction, all occur in separate threads and
don't interfere with each other.

Thus, it seems a lot more scalable to a multi-processing
environment to actually have LOW LEVEL purely functional
paradigm: no store can be written more than once.

More precisely: pointers are initialised to NULL, and may
be assigned a non-NULL heap pointer once, but a non-NULL
pointer can never be erased.
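As an illustration of that discipline -- a toy sketch, not how any real collector enforces it -- a pointer slot starts NULL, may be set once, and any attempt to overwrite a non-NULL slot is rejected:

```c
#include <assert.h>
#include <stddef.h>

typedef struct cell { struct cell *next; int value; } cell_t;

/* Write-once store: returns 1 if the store succeeded,
 * 0 if the slot already held a non-NULL pointer. */
int write_once(cell_t **slot, cell_t *p) {
    if (*slot != NULL) return 0;  /* non-NULL pointers are never erased */
    *slot = p;
    return 1;
}
```

In a real multi-threaded setting the check-then-store would need to be a single atomic compare-and-swap, but the invariant is the same: once non-NULL, the background sweep can trust what it reads.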

Which means: converting tail calls to a loop and
using mutable store is a no-no: it is faster on a single CPU
but will actually be slower on a multi-processor because
you'd be forced to use an inferior garbage collection scheme.

Felix uses an inferior scheme because it uses the C/C++
object model by specification. I have to stop ALL threads
to collect. This is REALLY BAD: if you have N processors,
N-1 of them can wait for a long time for the last one
to stop! [Felix cannot signal the threads to stop. It has
to wait until they reach a yield point]

> (Maybe that is what he meant when some time ago he  
> wrote John Meacham and said that they (the GHC researchers)  
> considered compiling via C a dead end.) 

Yes, I tend to agree. C is really very dead: it lacks really
basic serial primitives, and it uses a stack model .. it isn't
really any good even for C :)

Unfortunately, hardware design tends to supply stuff that
runs C code fast ;(

> The curious thing about GHC- 
> Haskell is that through the prism of Cmm, which enforces such things  
> as immutable variables and recursion right at the machine level,  
> Haskell is less a language of translation to sequential machine code  
> and more a description of a computational model.  If you still think  
> I am wrong about this, 

I don't. I think you're right. Purely functional storage model
seems to scale better to multiple threads.

John Skaller <skaller at users dot sf dot net>
Felix, successor to C++:

More information about the Glasgow-haskell-users mailing list