No "last core parallel slowdown" on OS X

Mon Apr 20 12:30:39 EDT 2009

[Sorry if this turns out to be a dup, it appears that my first send  
got lost, while my followup message went through.]

I ran some longer trials, and noticed a further pattern I wish I could  
explain:

I'm comparing the enumeration of the roughly 69 billion atomic  
lattices on six atoms, on my four core, 2.4 GHz Q6600 box running OS  
X, against an eight core, 2 x 3.16 GHz Xeon X5460 box at my department  
running Linux. Note that my processor now costs $200 (it's the  
venerable "Dodge Dart" of quad core chips), while the pair of Xeon  
processors cost $2400. The Haskell code is straightforward; it uses  
bit fields and reverse search, but it doesn't take advantage of  
symmetry, so it must "touch" every lattice to complete the  
enumeration. Its memory footprint is insignificant.

Never mind 7 cores, Linux performs worse before it runs out of cores.  
Comparing 1, 2, 3, 4 cores on each machine, look at "real" and "user"  
time in minutes, and the ratio of user to real time:

Linux
2 x 3.16 GHz Xeon X5460
cores   1       2       3       4
real    466.7   250.8   183.7   149.3
user    466.4   479.0   505.2   528.1
ratio   1.00    1.91    2.75    3.54

OS X
2.4 GHz Q6600
cores   1       2       3       4
real    676.9   359.4   246.7   191.4
user    673.4   673.7   675.9   674.8
ratio   0.99    1.87    2.74    3.53

These ratios match up like physical constants, or at least invariants  
of my Haskell implementation. However, the user time is constant on OS  
X, so there the ratio of user to real time reflects the actual parallel  
speedup. On Linux the user time climbs steadily, from 466.4 minutes on  
one core to 528.1 on four (about 13% more), so the same ratio overstates  
the speedup, which is significantly diluted. Somehow, whatever is going  
wrong in the interaction between Haskell and Linux is being captured in  
this increase in user time.
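
To make the dilution concrete, here is a small sketch that recomputes
the user/real ratio from the timings quoted in this message and, next
to it, the wall-clock speedup relative to the one-core run on the same
machine. The names (Trial, userOverReal, speedups, report) are my own
for illustration and are not part of the enumeration code; the only
data is the timing figures above.

```haskell
import Text.Printf (printf)

-- Timings in minutes, copied from this message: (cores, real, user).
type Trial = (Int, Double, Double)

linux, osx :: [Trial]
linux = [(1, 466.7, 466.4), (2, 250.8, 479.0), (3, 183.7, 505.2), (4, 149.3, 528.1)]
osx   = [(1, 676.9, 673.4), (2, 359.4, 673.7), (3, 246.7, 675.9), (4, 191.4, 674.8)]

-- The ratio reported in the message: user time divided by real time.
userOverReal :: Trial -> Double
userOverReal (_, real, user) = user / real

-- Wall-clock speedup of each run relative to the first (one-core) run.
speedups :: [Trial] -> [Double]
speedups [] = []
speedups trials@((_, real1, _) : _) = [ real1 / real | (_, real, _) <- trials ]

-- Print both figures side by side for one machine.
report :: String -> [Trial] -> IO ()
report name trials = do
  putStrLn name
  mapM_ row (zip trials (speedups trials))
  where
    row :: (Trial, Double) -> IO ()
    row (t@(n, _, _), s) =
      printf "  %d cores: user/real %.2f, speedup %.2f\n" n (userOverReal t) s

main :: IO ()
main = report "Linux" linux >> report "OS X" osx
```

By this measure the four-core wall-clock speedup comes out near 3.13
on Linux against roughly 3.54 on OS X, even though the user/real ratios
are almost identical on the two machines.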

I love how my cheap little box comes close to pulling even with a  
departmental compute server I can't afford, because of this difference  
in operating systems.