[Haskell-cafe] Glome.hs-0.3 and other things

Mon Apr 21 14:54:24 EDT 2008

Jim Snow wrote:
> Useful references: "What Every Programmer Needs to Know About Memory" 
> http://lwn.net/Articles/250967/

Thank you for monopolising my entire afternoon. This is probably the 
single most interesting thing I've read in ages! :-D

Thinking about it, maybe this explains my sorting benchmark - sorting 
Word32 values probably involves more shifting data from place to place 
than actual computation, so it's probably limited by the bandwidth of 
the FSB than the CPU's ability to perform number chrunking. Running it 
in multiple threads is probably just thrashing the L2 cache rather than 
actually helping...

After studying all this material, I do find myself feeling slightly 
concerned. The article shows how in C / C++ / assembly you can arrange 
your data and order your computations to make maximum use of the various 
caches and avoid certain bottlenecks in the system. And the *vast* 
performance difference it can yield. But what happens if you happen to 
be writing your program in Smalltalk or Java or... Haskell. Up here, 
you're so far from the hardware that it would seem that you have *no 
hope* of using the CPU caches effectively. Think about it - data 
scattered all over the heap in an essentially random pattern, getting 
moved around periodically by the GC. [Oh, the GC! That sounds like it 
must nail the hell out of the cache. And even if it doesn't, it more or 
less negates its usefulness because all the "hot" data is now at a 
different address.] Just think about trying to process a Haskell list - 
all those multiple levels of pointer indirection! The CPU has no hope of 
prefetching that stuff... Damn, it seems as if there's simply no way of 
ever making Haskell fast. :-(

I suppose the idea is that Haskell is supposed to help you work at a 
higher level of abstraction, so you can concentrate on building better 
*algorithms* which require less work in the first place. Surely using an 
algorithm in a suitably superior complexity class could more than make 
up for the performance loss from lack of cache coherence. But people who 
work in C and C++ know how to build efficient algorithms too, so... hmm, 
we seem to be in trouble here.