Test performance impact (was: The dreaded M-R)

Thu Feb 2 07:34:30 EST 2006

On 02 February 2006 09:52, John Hughes wrote:

> 	Summary: 2 programs failed to compile due to type errors (anna, gg).
> 	One program did 19% more allocation, a few other programs increased
> 	allocation very slightly (<2%).
> 
> 	            pic         +0.28%   +19.27%      0.02
> 
> 
> 
> Thanks, that was interesting. A follow-up question: pic has a space
> bug. How long will it take you to find and fix it?

I just tried this, and it took only a few minutes.  Compiling both
versions with profiling gives, for the original:

	total time  =        0.00 secs   (0 ticks @ 20 ms)
	total alloc =  11,200,656 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

chargeDensity                  ChargeDensity          0.0    2.5
accumCharge                    ChargeDensity          0.0   13.5
relax                          Potential              0.0   31.4
correct                        Potential              0.0    5.0
genRand                        Utils                  0.0    1.0
fineMesh                       Utils                  0.0    2.4
applyOpToMesh                  Utils                  0.0   12.7
=:                             Utils                  0.0    2.3
pushParticle                   PushParticle           0.0   16.1
timeStep                       Pic                    0.0   11.0

and with the monomorphism restriction turned off:

        total time  =        0.02 secs   (1 ticks @ 20 ms)
        total alloc =  12,893,544 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

pushParticle                   PushParticle         100.0   20.8
chargeDensity                  ChargeDensity          0.0    2.2
accumCharge                    ChargeDensity          0.0   18.0
relax                          Potential              0.0   27.3
correct                        Potential              0.0    4.4
fineMesh                       Utils                  0.0    2.1
applyOpToMesh                  Utils                  0.0   11.1
=:                             Utils                  0.0    2.0
timeStep                       Pic                    0.0    9.5

So, ignoring the %time column (the program didn't run long enough for
the profiler to get enough time samples), we can see the following
functions increased their allocation as a % of the total:

  pushParticle, accumCharge

Looking at the code for accumCharge:

accumCharge :: [Position] -> [MeshAssoc]
accumCharge [] = []
accumCharge ((x,y):xys) =
	[((i ,j ) , charge * (1-dx) * (1-dy))] ++
	[((i',j ) , charge * dx * (1-dy))] ++
	[((i ,j') , charge * (1-dx) * dy)] ++
	[((i',j') , charge * dx * dy)] ++
	accumCharge xys
	where
	    i = truncate x
	    i' = (i+1) `rem` nCell
	    j = truncate y
	    j' = (j+1) `rem` nCell
	    dx = x - fromIntegral i
	    dy = y - fromIntegral j

Now, because I know what I'm looking for, I can pretty quickly spot the
problem.  I had to look at the definition of MeshAssoc to figure out
that the result type of this function forces i to have type Int, yet i
is also used as the argument to fromIntegral, where, if i is overloaded,
its type will be defaulted to Integer.  When I give type signatures to
i and j (:: Int), the allocation reduces.
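
A minimal sketch of that fix, with the restriction switched off by a
pragma.  The definitions of Position, MeshAssoc, nCell and charge are
assumptions made so the fragment compiles on its own; the real ones
live elsewhere in the pic sources and may differ.

```haskell
{-# LANGUAGE NoMonomorphismRestriction #-}

-- Assumed stand-ins for definitions that live elsewhere in pic.
type Position  = (Double, Double)
type MeshAssoc = ((Int, Int), Double)

nCell :: Int
nCell = 16

charge :: Double
charge = 1.0

accumCharge :: [Position] -> [MeshAssoc]
accumCharge [] = []
accumCharge ((x,y):xys) =
    [((i ,j ), charge * (1-dx) * (1-dy))] ++
    [((i',j ), charge * dx * (1-dy))] ++
    [((i ,j'), charge * (1-dx) * dy)] ++
    [((i',j'), charge * dx * dy)] ++
    accumCharge xys
  where
    -- The local signature keeps i and j monomorphic, so each is
    -- computed once and shared by the mesh index and by dx/dy.
    i, j :: Int
    i  = truncate x
    i' = (i+1) `rem` nCell
    j  = truncate y
    j' = (j+1) `rem` nCell
    dx = x - fromIntegral i
    dy = y - fromIntegral j
```

Loaded into GHCi, accumCharge [(1.5, 2.25)] gives the four weighted
cells around the particle.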

The pushParticle function has an identical pattern.  Fixing these two
functions brought the performance back to the original.  But I've also
changed the semantics - the author might have *wanted* i at type Integer
in the definition of dx to avoid overflow, and the monomorphism
restriction had prevented it.
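
To make the semantic difference concrete, here is a small sketch with
two hypothetical helpers (neither is in pic) that compute the same
offset as dx, once through Int and once through Integer.  They agree
for ordinary coordinates; for a Double whose magnitude exceeds the Int
range, only the Integer version is guaranteed to give a meaningful
answer.

```haskell
-- Hypothetical helpers, for illustration only.

-- Offset computed through Int, as in the fixed accumCharge.
-- For inputs outside the Int range the truncation is not meaningful.
fracViaInt :: Double -> Double
fracViaInt x = x - fromIntegral (truncate x :: Int)

-- Offset computed through Integer, which is what defaulting picks
-- when i is overloaded.  Integer cannot overflow.
fracViaInteger :: Double -> Double
fracViaInteger x = x - fromIntegral (truncate x :: Integer)
```

Whether pic can ever see coordinates that large is a separate
question; the sketch only shows that the two typings are not
interchangeable in general.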

I suppose you could ask how you'd find the problem if you didn't know
what to look for.  So I added some more annotations:

	    i = {-# SCC "i" #-} truncate x
	    i' = {-# SCC "i'" #-}  (i+1) `rem` nCell
	    j = {-# SCC "j" #-} truncate y
	    j' = {-# SCC "j'" #-} (j+1) `rem` nCell
	    dx = {-# SCC "dx" #-} x - fromIntegral i
	    dy = {-# SCC "dy" #-} y - fromIntegral j

and the profiling output shows:

i                              ChargeDensity        100.0    6.8
j                              ChargeDensity          0.0    6.8
chargeDensity                  ChargeDensity          0.0    2.2
accumCharge                    ChargeDensity          0.0    3.9
relax                          Potential              0.0   27.2
...

So this pretty clearly identifies the problem area (although the figures
don't quite add up; I suspect the insertion of the annotations has
affected optimisation in some way).

Still, you could argue that it doesn't actually tell you the cause of
the problem: namely that i&j are being evaluated twice as often as you
might expect by looking at the code.  This is what the compiler warning
would do, and I completely agree that not having this property evident
by looking at the source code is a serious shortcoming.
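
One way to see where the second evaluation comes from is to write out
by hand what the generalised binding amounts to.  This is a sketch
with hypothetical names (iGen, cellIndex, offset), not what GHC
literally generates: an overloaded i is in effect a function of its
Integral instance, so each use at a different type recomputes it.

```haskell
-- Hypothetical names; a hand-written picture of the generalised i.

-- Without the restriction, i gets the type (Integral b => b), which
-- behaves like a function awaiting an instance, not a shared value.
iGen :: Integral b => Double -> b
iGen x = truncate x

-- Use 1: the mesh index in the result fixes the instance to Int.
cellIndex :: Double -> Int
cellIndex x = iGen x

-- Use 2: under fromIntegral nothing fixes the type, so defaulting
-- picks Integer, and truncate runs a second time at that type.
offset :: Double -> Double
offset x = x - fromIntegral (iGen x :: Integer)
```

Nothing in the original source text of accumCharge distinguishes
these two uses, which is the shortcoming described above.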

> And how come speed
> improved slightly in many cases--that seems counter-intuitive.

The runtimes are unreliable, due to the short running time of most of
these benchmarks.  We have a "slow" mode for the benchmark suite that
runs each program with larger test data, but I didn't use it this time -
mostly we find that measuring allocations is useful as a first
approximation, and it's certainly more reliable.

(rest of email snipped, most of which I agree with).

Cheers,
	Simon