[pooma-dev] timers and performance measurement under Linux

Mon Aug 6 18:47:36 UTC 2001

-----Original Message-----
From: Julian C. Cummings [mailto:cummings at linkline.com]
Sent: Monday, August 06, 2001 12:21 PM
To: James Crotinger; pooma-dev at pooma.codesourcery.com
Subject: RE: [pooma-dev] timers and performance measurement under Linux

Julian: ---------------------------------------------------
The gettimeofday() function is probably the best thing to use for wallclock
time measurement.  This is what we used in the old Timer class in Pooma r1.
I haven't looked at your check-in yet, but hopefully you remembered to check
for overflow in the microseconds counter and increment the seconds counter
accordingly.  Other than that, I remember that code as being pretty simple.

-----------------------------------------------------------

I just did:

  return tv.tv_sec + 1.e-6 * tv.tv_usec;

This mirrors what we are doing with clock_gettime. My interpretation of
gettimeofday is that tv_usec should always be less than 1e6 - it is supposed
to return the number of seconds and microseconds since 12:00 am Jan 1, 1970.
I checked this under Linux - tv_usec resets to zero everytime tv_sec is
increased. So I don't see a reason to put our own (% 1000000) after it, and
indeed if it were over 1000000 I'm not even sure how I'd interpret tv_sec. 

Julian: ---------------------------------------------------
As for your comments on the PIII performance, I think what you are seeing 
is correct.  The out-of-cache performance is not very good.  You will see 
closer to optimal performance only when the problem size is in-cache, and 
the caches are much smaller than what we were used to on the SGI boxes.
With an optimized C code kernel, you should be able to see the cache effect
and stronger flops numbers for small problem sizes.  (But of course, it gets
harder to measure accurately, too.)  I'm not aware of any profiling tools
from
KAI, so I think prof/gprof is all there is, unless you know how to access
Pentium
hardware counters. 

-----------------------------------------------------------

Oh, this number is definitely memory bandwidth limited - there are three to
four loads and two stores every trip through the loop, which does four flops
(two multiplies and two adds). I get a peak C performance of about 390
MFlops for N = 60 or so. The peak POOMA II Brick performance is only 115 at
a slightly higher N and then it drops off very rapidly to about 30.

I tried gprof with "KCC -pg" generated code this morning, and gprof crashed
after about 10 minutes of crunching on the output of a run. Has anyone else
out there seen this? I'm going to try compiling with gcc, but I'm not sure
it generates good enough code for me to trust the profile results to guide
me to the right optimizations. 

  Jim

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://sourcerytools.com/pipermail/pooma-dev/attachments/20010806/c9cda917/attachment.html>