[Haskell] Programming language shootout (completing the Haskell entry)

Tue Mar 30 09:49:16 EST 2004

As another data point, when I was writing the run-time system both for 
pH (on an 8-way Sun) and later for Eager Haskell (on x86) I 
experimented with simple prefetching of various sorts.  Two things 
seemed to be moderately effective:

* The heap was organized into "chunks" which were parceled out among 
running threads.  By writing into the pages of a chunk when it is 
allocated, the necessary TLB entries can be pulled in early.  This 
seemed to be a universal win on both architectures.  On a system with 
full software TLB miss handling it would be a wash, I expect.

* Writing a line or two ahead in the heap.  The idea here is to get 
the appropriate lines into the cache (and dirty) before they actually 
saw heavy use.  Alas, this can cause fenceposting problems when you 
get near the end of a chunk (it's not safe to write past a chunk 
boundary).  A processor which buffers writes doesn't benefit 
enormously, either; we can buffer the last cache line in the heap 
until it has been completely written, and never fetch from memory at 
all.  There were no big wins here in my experience.

One has to be careful about the semantics of prefetch instructions, by 
the way.  At least on SPARC they probably don't do what you want; if 
my memory serves me correctly, there is a separate prefetch cache 
that's connected to the floating-point load/store unit, and what you 
actually want for most Haskell-ish code would be a non-blocking load 
instruction instead.  On x86 the prefetch instructions are part of the 
SIMD floating-point instruction set, and might (or might not) have 
similar caveats.

The prefetch builtins for gcc are a pretty recent innovation, by the 
way.  I would have loved to have them back when I was experimenting. 
Of course, you'd want to make sure you got an appropriate instruction 
out of the compiler.

Andrew Cheadle wrote:
> Personally I think looking at 'cache-concious' layout of specific data
> structures within the allocator and depth first vs breadth first copying
> within the collector provide more realistic optimisation opportunities. I
> believe there's been a fair amount of work on this, although specific papers
> elude me at the mo.

I would agree with this---though the conclusion from the GC literature 
  seems to be that no particular copying strategy is best for all 
applications.

-Jan-Willem Maessen