timings using optimized codewarrior
John Hall
johnharveyhall at qwest.net
Wed Jun 6 03:55:45 UTC 2001
Gang:
Well, my student and I just got an optimized version of a simple
little diffusion stencil built with Metrowerks CodeWarrior, and frankly
the results are kind of interesting. First, optimized code runs around
6 times faster than unoptimized code. Dave Nystrom and I seem to recall
that optimized code under R1 ran 10-20 times faster than unoptimized
code using KCC. I don't want to attach too much significance to this,
though. It indicates either that the Metrowerks optimizer is not
all that hot (my belief) or that the abstractions of R2 are somehow
less onerous in debug mode (also a possibility).
Anyhow, this was a 2-D diffusion stencil, and no matter what we tried
we always got a linear response in the timing study. For the first
block of runs, this is a good result, since we were just running the
same-size problem for more cycles.
But we also tried running larger and larger problems to push the
working set out of L2 cache (this PIII has a 256 KB L2 cache), and we
were never able to see a drop-off. The timing stayed linear versus the
total number of cells. This will make more sense once we can convert
the units to something like MFLOPS. So either you guys have done
something really impressive regarding cache utilization, or we are
running so slowly that cache misses are not noticeable.
Also, we ran the optimized code on a Mac and a PC, and the results
differed exactly by the difference in clock speed. This was a surprise,
since Mac advocates had always claimed that Motorola floating-point
performance was better than Intel's at a given clock rate.
(This was for a 650 MHz PIII laptop and a 500 MHz G3 laptop.)
As an aside, the Brick Engine consistently ran slightly slower than
compressibleBrick (only about 5 percent, so it was basically a dead
heat), but I would have expected Brick to be slightly faster than
compressibleBrick (and probably by more than a few percent, since it
should have less overhead).
Anyhow, here is the raw data for the Brick PIII runs:
cellsXY  cycles  elapsed time (secs, repeated runs)  total cells
(cellsXY is the number of cells per side)
    101    1000    24   24   24            10201
    101    2000    48   48   47            10201
    101    3000    72   71   72            10201
    101    5000   119  122  121            10201
    101   10000   243  243  244            10201
    101   25000   608  607  606            10201
     25    1000     2    2    2              625
     51    1000     6    5    5             2601
    101    1000    24   24   24            10201
    201    1000   103  103  103            40401
    501    1000   635                     251001
   1001    1000  2520                    1002001
     15    5000     4    4    3              225
     25    5000     8    9    8              625
     51    5000    29   29   28             2601
     75    5000    61   61   61             5625
    101    5000   124                      10201
    201    5000   518                      40401
    401    5000  2054                     160801
      3   10000     3                          9
      6   10000     3                         36
     12   10000     6                        144
     24   10000    15                        576
      3  100000    28                          9
      6  100000    34                         36
     12  100000    59                        144
     24  100000   150                        576
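As a rough illustration of that unit conversion, the update rate in
cell updates per second falls straight out of any row of the table;
multiplying by an estimated flop count per cell update (which we have
not pinned down) would turn that into an approximate MFLOPS number.
For example, using the 201x201, 1000-cycle row:

#include <iostream>

int main()
{
  // Example conversion using the 201 x 201, 1000-cycle row above.
  const double totalCells = 40401.0;  // 201 * 201
  const double cycles     = 1000.0;
  const double seconds    = 103.0;    // elapsed time from the table

  // Roughly 3.9e5 cell updates per second for this run.
  std::cout << "cell updates per second: "
            << totalCells * cycles / seconds << std::endl;
  return 0;
}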
This was a single-processor run with no MPI, etc., under Win 2000. All
I/O was turned off within the timed region, so only the Cycle Manager
loop overhead (Tecolote loops over models, in this case 1 Model) and
the floating-point calculations were being timed. There were 3 fields
involved: Temperature, Conductivity, and a TmpField to collect the
stencil info. We used difftime and the time_t time functions to
collect our data (in seconds), so only coarse-grained timings can be
studied.
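In case it helps, here is a minimal sketch of that style of timing
(runOneCycle is just a placeholder for the real cycle body, which is
not shown here):

#include <ctime>
#include <iostream>

// Stand-in for one cycle of the Cycle Manager loop (relations + update).
void runOneCycle() { /* ... */ }

int main()
{
  const int cycles = 1000;

  std::time_t start = std::time(0);    // wall clock, 1-second resolution
  for (int c = 0; c < cycles; ++c)
    runOneCycle();                     // no I/O inside the timed region
  std::time_t stop = std::time(0);

  // difftime() reports the elapsed interval in seconds.
  std::cout << "elapsed (secs): " << std::difftime(stop, start) << std::endl;
  return 0;
}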
Code:
This is the relation between Conductivity (Lval) and Temperature:
template<class Traits>
void DiffRelation<Traits>::ConFuncT6( const ScalarField& Conductivity,
                                      const ScalarField& Temperature )
{
  // Conductivity = Temperature^6 / (2*Dim)
  Conductivity = (1.0/(2.0*Dim))*pow(Temperature,Real(6.0));
}
This is the relation between TmpField (Lval) and Conductivity and Temperature:
template<class Traits>
void OffsetRelation<Traits>::sumNeighbors( const ScalarField& TmpField,
                                           const ScalarField& Conductivity,
                                           const ScalarField& Temperature )
{
  Interval<Dim> ND = Temperature.domain();
  Loc<Dim> offset;

  TmpField = 0.0;
  for ( int d = 0; d < Dim; ++d ) {
    // Build the unit offset along dimension d.
    for ( int off = 0; off < Dim; ++off ) {
      offset[off] = (off == d) ? 1 : 0;
    }
    // Accumulate the pair of neighbors along dimension d.
    TmpField(ND) +=
      Temperature(ND+offset)*Conductivity(ND+offset) +
      Temperature(ND-offset)*Conductivity(ND-offset);
  }
}
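For readers less used to the data-parallel notation, here is roughly
what that relation does per cell in 2-D, written as a plain-array
sketch (illustrative only; it assumes one layer of guard cells around
interior indices 1..N rather than the POOMA field machinery):

// Rough per-cell equivalent of sumNeighbors for Dim == 2, using plain
// 2-D arrays instead of POOMA fields.  Interior cells are 1..N on each
// side, with a one-cell guard layer assumed at indices 0 and N+1.
void sumNeighborsScalar(double** Tmp, double** T, double** C, int N)
{
  for (int i = 1; i <= N; ++i) {
    for (int j = 1; j <= N; ++j) {
      Tmp[i][j] =
        T[i+1][j]*C[i+1][j] + T[i-1][j]*C[i-1][j] +   // x neighbors
        T[i][j+1]*C[i][j+1] + T[i][j-1]*C[i][j-1];    // y neighbors
    }
  }
}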
So the timed loop was just over this one line:
Temperature = (1.0 - 2.0*Conductivity*Dim)*Temperature + TmpField;
which, on each cycle, caused the two updaters above to be invoked as
relations (dependencies of the Temperature update).
There are some obvious optimizations that could be performed on this
code, but it was the relative timing of optimized and unoptimized
executables (the factor of 6), along with these simple scaling studies,
that we were interested in.
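For instance, one such optimization would be to fuse the three
relations into a single pass: substituting Conductivity =
Temperature^6/(2*Dim) into the update gives a per-cell form like the
following sketch (plain 2-D arrays, Dim == 2, one guard layer assumed;
not something we actually timed):

#include <cmath>

// Hypothetical hand-fused version of one diffusion cycle for Dim == 2,
// writing the new Temperature into a separate array so old neighbor
// values are not overwritten mid-sweep.  Illustrative only.
void fusedCycle(double** Tnew, double** T, int N)
{
  const double c = 1.0 / 4.0;   // 1/(2*Dim) with Dim == 2
  for (int i = 1; i <= N; ++i) {
    for (int j = 1; j <= N; ++j) {
      Tnew[i][j] = T[i][j] - std::pow(T[i][j], 7.0)
                 + c * ( std::pow(T[i+1][j], 7.0) + std::pow(T[i-1][j], 7.0)
                       + std::pow(T[i][j+1], 7.0) + std::pow(T[i][j-1], 7.0) );
    }
  }
}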
Hope this is interesting,
John Hall and Richard Williams