timings using optimized codewarrior
John Hall
johnharveyhall at qwest.net
Wed Jun 6 03:55:45 UTC 2001
Gang:
Well, my student and I just got an optimized version of a simple
little diffusion stencil built with Metrowerks CodeWarrior, and frankly
the results are kind of interesting. First, optimized code runs around
6 times faster than unoptimized code. Dave Nystrom and I seem to recall
that optimized code under R1 ran 10-20 times faster than unoptimized
code using KCC. I don't want to attach too much significance to this,
though. It indicates either that the Metrowerks optimizer is not
all that hot (my belief) or that the abstractions of R2 are somehow
less onerous in debug mode (also a possibility).
Anyhow, this was a 2-D diffusion stencil, and no matter what we tried
we always got a linear response in the timing study. For the first
block of runs, this is a good result, since we were just running the
same-size problem for more cycles.
But we also tried running larger and larger problems to push the
working set out of L2 cache (this PIII has a 256 KB L2 cache), and we
were never able to see a drop-off. The timing stayed linear versus the
total number of cells. This will make more sense once we can convert
the units to something like MFLOPS. So either you guys have done
something really impressive regarding cache utilization, or we are
running so slowly that cache misses are not noticeable.
Also, we ran the optimized code on a Mac and a PC, and the results
differed exactly by the difference in clock speed. This was a surprise,
since Mac advocates had always claimed that Motorola floating-point
performance was better than Intel's at a given clock rate.
(This was for a 650 MHz PIII laptop and a 500 MHz G3 laptop.)
As an aside, the Brick Engine consistently ran slightly slower than
compressibleBrick (only about 5 percent, so it was basically a dead
heat), but I would have expected Brick to be slightly faster than
compressibleBrick (and probably by more than a few percent, since it
should have less overhead).
Anyhow, here is the raw data for the Brick PIII runs:
cellsXY  cycles  elapsed time (secs, repeated runs)  total cells
(cellsXY is the number of cells per side)
    101    1000    24   24   24            10201
    101    2000    48   48   47            10201
    101    3000    72   71   72            10201
    101    5000   119  122  121            10201
    101   10000   243  243  244            10201
    101   25000   608  607  606            10201
     25    1000     2    2    2              625
     51    1000     6    5    5             2601
    101    1000    24   24   24            10201
    201    1000   103  103  103            40401
    501    1000   635                     251001
   1001    1000  2520                    1002001
     15    5000     4    4    3              225
     25    5000     8    9    8              625
     51    5000    29   29   28             2601
     75    5000    61   61   61             5625
    101    5000   124                      10201
    201    5000   518                      40401
    401    5000  2054                     160801
      3   10000     3                          9
      6   10000     3                         36
     12   10000     6                        144
     24   10000    15                        576
      3  100000    28                          9
      6  100000    34                         36
     12  100000    59                        144
     24  100000   150                        576
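As a rough illustration of that unit conversion, the update rate in
cell updates per second falls straight out of any row of the table;
multiplying by an estimated flop count per cell update (which we have
not pinned down) would turn that into an approximate MFLOPS number.
For example, using the 201x201, 1000-cycle row:

#include <iostream>

int main()
{
  // Example conversion using the 201 x 201, 1000-cycle row above.
  const double totalCells = 40401.0;  // 201 * 201
  const double cycles     = 1000.0;
  const double seconds    = 103.0;    // elapsed time from the table

  // Roughly 3.9e5 cell updates per second for this run.
  std::cout << "cell updates per second: "
            << totalCells * cycles / seconds << std::endl;
  return 0;
}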
This was a single-processor run with no MPI, etc., under Win 2000. All
I/O was turned off within the timed region, so only the Cycle Manager
loop overhead (Tecolote loops over models, in this case 1 Model) and
the floating-point calculations were being timed. There were 3 fields
involved: Temperature, Conductivity, and a TmpField to collect the
stencil info. We used difftime and the time_t time functions to
collect our data (in seconds), so only coarse-grained timings can be
studied.
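In case it helps, here is a minimal sketch of that style of timing
(runOneCycle is just a placeholder for the real cycle body, which is
not shown here):

#include <ctime>
#include <iostream>

// Stand-in for one cycle of the Cycle Manager loop (relations + update).
void runOneCycle() { /* ... */ }

int main()
{
  const int cycles = 1000;

  std::time_t start = std::time(0);    // wall clock, 1-second resolution
  for (int c = 0; c < cycles; ++c)
    runOneCycle();                     // no I/O inside the timed region
  std::time_t stop = std::time(0);

  // difftime() reports the elapsed interval in seconds.
  std::cout << "elapsed (secs): " << std::difftime(stop, start) << std::endl;
  return 0;
}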
Code:
This is the relation between Conductivity (Lval) and Temperature:
template<class Traits>
void DiffRelation<Traits>::ConFuncT6( const ScalarField& Conductivity,
                                      const ScalarField& Temperature )
{
  // Conductivity = Temperature^6 / (2*Dim)
  Conductivity = (1.0/(2.0*Dim))*pow(Temperature,Real(6.0));
}
This is the relation between TmpField (Lval) and Conductivity and Temperature:
template<class Traits>
void OffsetRelation<Traits>::sumNeighbors( const ScalarField& TmpField,
                                           const ScalarField& Conductivity,
                                           const ScalarField& Temperature )
{
  Interval<Dim> ND = Temperature.domain();
  Loc<Dim> offset;

  TmpField = 0.0;
  for ( int d = 0; d < Dim; ++d ) {
    // Build the unit offset along dimension d.
    for ( int off = 0; off < Dim; ++off ) {
      offset[off] = (off == d) ? 1 : 0;
    }
    // Accumulate the pair of neighbors along dimension d.
    TmpField(ND) +=
      Temperature(ND+offset)*Conductivity(ND+offset) +
      Temperature(ND-offset)*Conductivity(ND-offset);
  }
}
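For readers less used to the data-parallel notation, here is roughly
what that relation does per cell in 2-D, written as a plain-array
sketch (illustrative only; it assumes one layer of guard cells around
interior indices 1..N rather than the POOMA field machinery):

// Rough per-cell equivalent of sumNeighbors for Dim == 2, using plain
// 2-D arrays instead of POOMA fields.  Interior cells are 1..N on each
// side, with a one-cell guard layer assumed at indices 0 and N+1.
void sumNeighborsScalar(double** Tmp, double** T, double** C, int N)
{
  for (int i = 1; i <= N; ++i) {
    for (int j = 1; j <= N; ++j) {
      Tmp[i][j] =
        T[i+1][j]*C[i+1][j] + T[i-1][j]*C[i-1][j] +   // x neighbors
        T[i][j+1]*C[i][j+1] + T[i][j-1]*C[i][j-1];    // y neighbors
    }
  }
}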
So the timed loop was just over this one line:
Temperature = (1.0 - 2.0*Conductivity*Dim)*Temperature + TmpField;
which, on each cycle, caused the two updaters above to be invoked as
relations (dependencies of the Temperature update).
There are some obvious optimizations that could be performed on this
code, but it was the relative timing of optimized and unoptimized
executables (the factor of 6), along with these simple scaling studies,
that we were interested in.
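For instance, one such optimization would be to fuse the three
relations into a single pass: substituting Conductivity =
Temperature^6/(2*Dim) into the update gives a per-cell form like the
following sketch (plain 2-D arrays, Dim == 2, one guard layer assumed;
not something we actually timed):

#include <cmath>

// Hypothetical hand-fused version of one diffusion cycle for Dim == 2,
// writing the new Temperature into a separate array so old neighbor
// values are not overwritten mid-sweep.  Illustrative only.
void fusedCycle(double** Tnew, double** T, int N)
{
  const double c = 1.0 / 4.0;   // 1/(2*Dim) with Dim == 2
  for (int i = 1; i <= N; ++i) {
    for (int j = 1; j <= N; ++j) {
      Tnew[i][j] = T[i][j] - std::pow(T[i][j], 7.0)
                 + c * ( std::pow(T[i+1][j], 7.0) + std::pow(T[i-1][j], 7.0)
                       + std::pow(T[i][j+1], 7.0) + std::pow(T[i][j-1], 7.0) );
    }
  }
}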
Hope this is interesting,
John Hall and Richard Williams