[vsipl++] [patch] Profiling chapter for tutorial

Jules Bergmann jules at codesourcery.com
Wed Jul 26 01:20:13 UTC 2006


Don McCoy wrote:
 > The attached patch adds a new chapter to the tutorial that describes how
 > to use the profiler feature.  An html version is included also, for ease
 > of review.
 >
 > Regards,
 >

Don,

This is looking good.  I have some wordsmithing suggestions below,
in quotes, that you can feel free to ignore.  Please check this in
when you're happy with it.

Minor things:
  - make sure that code fragments in the text have <code>...</code> tags.
    For example, the Scope_event class.
  - likewise, make sure the directories in the text have <filename> tags.
  - change mflops to mops

 >
 >
 > ------------------------------------------------------------------------
 >
 >     Chapter 4. Performance
 >
 > *Table of Contents*
 >
 > Library Profiling <ch04.html#library-profiling>
 >
 >     Configuration Options <ch04.html#configuration>
 >     Accumulating Profile Data <ch04.html#accumulating-profile-data>
 >     Trace Profile Data <ch04.html#trace-profile-data>
 >     Performance API <ch04.html#performance-api>
 >
 > Application Profiling <application-profiling.html>
 >
 >
 >     Library Profiling
 >
 > Sourcery VSIPL++ provides some features that help speed application
 > development by helping you locate and quantify the expensive
 > computations in your algorithm.

"Sourcery VSIPL++ provides library profiling features that speed
application development by locating and quantifying the expensive
computations in your algorithm."

 > Built-in profiling capabilities, when
 > enabled, provide timing data for many signal processing functions, such
 > as FFT's, as well as common linear algebra computations like matrix
 > multiplication and addition.

"These profiling capabilities provide timing data for signal processing
functions (such as FFTs), linear algebra computations (such as matrix
multiply), and elementwise expressions (such as vector addition).
When not required, profiling can be disabled at configure time, resulting
in no application overhead."

 > A full listing of functions covered is
 > shown in the table below.
 >
 > The profiler operates in two modes. In 'trace' mode, the time spent in
 > each function is stored separately and presented in chronological order.
 > This mode is preferred when a highly detailed view of program execution
 > is desired. In 'accumulate' mode, the times and opcounts are summed so
 > that an average runtime and MFLOP/s for the function can be computed.
 > This is desirable when investigating a specific function's performance.
 >
 > *Table 4.1. Functions Profiled*
 >
 > Section	Object/Function
 > |signal|	Convolution
 > |signal|	Correlation
 > |signal|	Fft
 > |signal|	Fir
 > |signal|	Iir
 >
 > See the file "profiling.txt" for a detailed explanation of the profiler
 > output for each of the functions above.
 >
 >
 >       Configuration Options

"Before using profiling, you need to configure the library with profiling
enabled. ..."

 > A timer is required to obtain the profile data. For profiling to be
 > useful, the timer should have high resolution and low overhead, such as
 > the Pentium and x86_64 time-stamp counters. When building the library
 > from source, you should enable a timer suitable for your particular
 > platform along with the profiler itself. These may be subsequently
 > disabled for the production version of the code without altering the
 > source code. For 64-bit Intel and AMD processors, use:
 >
 > --enable-timer=x86_64_tsc
 >
 > --enable-profiler
 >
 > If you are using a binary package on either of these platforms, then you
 > need take no special steps, as the timer and profiler are already
 > enabled for you.

Hmm, this raises a good point.  If profiling has any appreciable overhead,
we should add a third configuration option with profiling disabled.
  - debug   - debug, profiling enabled
  - profile - optimized, profiling enabled
  - release - optimized, most profiling disabled (maybe leave some very
    low overhead things enabled, like performance API).

 >
 >
 >       Accumulating Profile Data
 >
 > Using this feature is very easy. Simply pass the path to a log file to
 > the constructor of the Profile object as follows:

"Using profiler's accumulate mode is easy.  Simply construct a
'Profile' object with the name of a log file as follows:

	Profile profile("/dev/stdout", pm_accum);

Or simply:

	Profile profile("/dev/stdout");

I would mention using 'pm_accum' as a second argument.  Otherwise it
seems arbitrary that pm_trace is required for tracing mode.

 >
 > Profile profile("/dev/stdout");


 >
 > Profiled library functions will store timing data in memory while this
 > object is in scope. The profile data is written to the log file when the
 > object is destroyed. Note that for this reason, only one object of this
 > type may be created at any given time.
 >
 > The examples/ subdirectory provided with the source distribution
 > demonstrates this profiling mode using a 2048-point forward FFT followed
 > by an inverse FFT scaled by the length. The profiler uses the timer to
 > measure each FFT call and uses the size to compute an estimate of the
 > performance. For each unique event, the profiler outputs an identifying
 > tag, the accumulated time spent 'in scope' (in "ticks"), the number of
 > times called, the total number of floating point operations performed
 > per call and the computed performance in millions of flops per second.
 > The time value may be converted to seconds by dividing it by the
 > 'clocks_per_second' constant.

"... by the 'clocks_per_second' constant in the log file header."

 >
 > # mode: pm_accum
 > # timer: x86_64_tsc_time
 > # clocks_per_sec: 3591371008
 > #
 > # tag : total ticks : num calls : op count : mflops

                                                ^^ mops

 > Fwd FFT C-C by_val 2048x1 : 208089 : 1 : 112640 : 1944.03
 > Inv FFT C-C by_val 2048x1 : 209736 : 1 : 112640 : 1928.77
 >
 >
 >
 > This information is important in analyzing total processing requirements
 > for an algorithm. However, care should be taken in interpreting the
 > results to ensure that they are representative of the intended
 > application. For example, in the FFT above, the data will most likely not
 > be resident in cache, as it would be in some instances. With a
 > well-designed pipelined processing chain (typical of many embedded
 > applications), the data will be in cache, yielding significantly better
 > performance. To obtain a good estimate of the in-cache performance, place
 > the FFT in a loop so that it is called many times.
 >
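
It might be worth showing the loop itself at this point.  Something like
the sketch below, which would produce output along the lines of the log
that follows.  (The Fft typedef, the data setup, and the profile::
namespace qualification of Profile are my guesses based on the 2048-point
example; adjust to match the actual example code.)

	#include <vsip/initfin.hpp>
	#include <vsip/support.hpp>
	#include <vsip/vector.hpp>
	#include <vsip/signal.hpp>
	#include <vsip/impl/profile.hpp>

	using namespace vsip;
	using namespace impl;

	int
	main()
	{
	  vsipl init;

	  length_type const N = 2048;

	  typedef Fft<const_Vector, cscalar_f, cscalar_f, fft_fwd, by_value>
	    fft_type;

	  Vector<cscalar_f> in(N, cscalar_f(1.f, 0.f));
	  Vector<cscalar_f> out(N);
	  fft_type fwd_fft(Domain<1>(N), 1.0);

	  // Construct the Profile object in accumulate mode.  (The profile::
	  // namespace here is my assumption; use whatever the library
	  // actually exports.)
	  profile::Profile prof("/dev/stdout", profile::pm_accum);

	  // Call the FFT repeatedly so the data stays resident in cache;
	  // the accumulated log then reflects in-cache performance.
	  for (int i = 0; i < 100; ++i)
	    out = fwd_fft(in);

	  return 0;
	}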
 > # mode: pm_accum
 > # timer: x86_64_tsc_time
 > # clocks_per_sec: 3591371008
 > #
 > # tag : total ticks : num calls : op count : mflops
 > Fwd FFT C-C by_val 2048x1 : 6212808 : 100 : 112640 : 6511.26
 >
 >
 >
 > This is only a portion of the analysis that would be necessary to
 > predict the performance of a real-world application. Once you are able
 > to accurately measure library performance, you may then extend that to
 > profile your own application code, using the same features used
 > internally by the library.
 >
 >
 >       Trace Profile Data
 >
 > This mode is used similarly to accumulate mode, except that an extra
 > parameter is passed to the creation of the Profile object.
 >
 > Profile profile("/dev/stdout", pm_trace);
 >
 > This mode is more important when investigating the execution sequence of
 > your program. The profiler simply records each library call as a pair of
 > events, allowing you to see where it entered and exited scope in each
 > case.

"This mode is useful for investigating the ...

 >
 > Long traces can result when profiling in this mode, so be sure to avoid
 > taking more data than you have memory to store (and have time to process
 > later). The output is very similar to the output in accumulate mode.
 >
 > # mode: pm_trace
 > # timer: x86_64_tsc_time
 > # clocks_per_sec: 3591371008
 > #
 > # index : tag : ticks : open id : op count
 > 1 : FFT Fwd 1D C-C by_val    2048x1 : 4688163420488244 : 0 : 112640
 > 2 : FFT Fwd 1D C-C by_val    2048x1 : 4688163420626385 : 1 : 0
 > 3 : FFT Inv 1D C-C by_val    2048x1 : 4688163420643116 : 0 : 112640
 > 4 : FFT Inv 1D C-C by_val    2048x1 : 4688163420830298 : 3 : 0
 >
 >
 >
 > For each event, the profiler outputs an event number, an identifying
 > tag, and the current timestamp (in "ticks"). The next two fields differ
 > depending on whether the event is coming into scope or out of scope.
 > When coming into scope, a zero is shown followed by the estimated count
 > of floating point operations for that function. When exiting scope, the
 > profiler displays the event number being closed followed by a zero. In
 > all cases, the timestamp (and intervals) may be converted to seconds by
 > dividing by the 'clocks_per_second' constant.
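
Maybe also walk through one pair of events as an example: event 2
closes event 1, so the forward FFT took 4688163420626385 -
4688163420488244 = 138141 ticks, or about 38 microseconds at the
clocks_per_sec shown in the header.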
 >
 >
 >       Performance API
 >
 > An additional interface is provided for getting run-time profile data.
 > This allows you to selectively monitor the performance of a particular
 > instance of a VSIPL++ class such as Fft, Convolution or Correlation.
 >
 > Classes with the Performance API built in contain a function that takes
 > a string parameter and returns single-precision floating point number.
 > This flexible interface allows you to obtain a variety of useful values
 > all through a single function.

"Classes with the Performance API provide a function called
'impl_performance' that takes a string parameter and returns 
single-precision
floating point number."

 >
 > For example, given an Fft object named "fwd_fft", the following call
 > shows how to obtain an estimate of the number of floating point
 > operations per second performed.
 >
 > float mflops = fwd_fft.impl_performance("mflops");

"the following call shows how to obtain an estimate of the performance
in number of operations per second:

	float mops = fwd_fft.impl_performance("mops");

An "operation" will vary depending on the object and type of data
being processed.  For example, a single-precison Fft object will
return the number of single-precison floating-point operations
performed per second.

 >
 > The table below lists the current types of information available.
 >
 > *Table 4.2. Performance API Metrics*
 >
 > Parameter	Description
 > |mflops|	performance in millions of floating point operations per second

   |mops|	performance in millions of operations per second

 > |count|	number of times invoked
 > |time|	total time spent performing the operation, in seconds
 > |op_count|	number of floating point operations per invocation
 > |mbs|	data rate in millions of bytes per second (not applicable for
 > all operations)
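
It might also help to follow the table with a short fragment that
queries a few of these, assuming the fwd_fft object from the example
above (and the mflops-to-mops rename):

	float mops  = fwd_fft.impl_performance("mops");
	float count = fwd_fft.impl_performance("count");
	float time  = fwd_fft.impl_performance("time");

	// Average time per invocation, in seconds.
	float avg = time / count;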
 >
 > ------------------------------------------------------------------------
 >
 >
 > ------------------------------------------------------------------------
 >
 >
 >
 >     Application Profiling
 >
 > When knowing detailed run-time information regarding the library
 > functions used by your algorithm is not enough, you may want to add
 > profiling capabilities to some of your own code. Here, we introduce a
 > new object, the Scope_event class, and show you how to use it in your
 > application.

"The profiling mode provides an API that allows you to instrument
your own code.  Here we introduce ..."

 >
 > To create a Scope_event, simply call the constructor, passing it the
 > string that will become the event tag and, optionally, an integer value
 > expressing the number of floating point operations that will be
 > performed by the time the Scope_event object is destroyed. For example,
 > to measure the time taken to compute a simple running sum of squares
 > over a C array:
 >
 > #include <vsip/initfin.hpp>
 > #include <vsip/support.hpp>
 > #include <vsip/impl/profile.hpp>
 >
 > using namespace vsip;
 > using namespace impl;
 >
 > int
 > main()
 > {
 >   vsipl init;
 >
 >   int data[1024];
 >   for (int i = 0; i < 1024; ++i)
 >     data[i] = i;
 >
 >   profile::Scope_enable scope("/dev/stdout");
 >
 >   // This computation will be timed and included in the profiler output.
 >   {
 >     profile::Scope_event user_event("sum of squares", 2 * 1024);
 >
 >     int sum = 0;
 >     for (int i = 0; i < 1024; ++i)
 >       sum += data[i] * data[i];
 >   }
 >
 >   return 0;
 > }
 >
 >
 >
 >
 > The resulting profile data is identical in format to that used for
 > profiling library functions.
 >
 > # mode: pm_accum
 > # timer: x86_64_tsc_time
 > # clocks_per_sec: 3591371008
 > #
 > # tag : total ticks : num calls : op count : mflops
 > sum of squares : 18153 : 1 : 2048 : 405.174
 >
 >
 >
 > Combining both application and library profiling is possible in either
 > trace or accumulate modes. Performance events can be nested to help
 > identify points of interest in your program. Events can be used to label
 > different regions, such as "range processing" and "azimuth processing"
 > for SAR. When examining the trace output, profile events for library
 > functions, such as FFTs, will be nested within profile events for
 > application regions.
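
Maybe worth a tiny sketch of the nesting here.  (range_process(),
azimuth_process(), and data are placeholders for whatever the
application actually does.)

	{
	  profile::Scope_event event("range processing");

	  // Library profile events (e.g. the FFTs performed inside
	  // range_process()) will show up nested within this
	  // "range processing" event in the trace output.
	  range_process(data);
	}
	{
	  profile::Scope_event event("azimuth processing");
	  azimuth_process(data);
	}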
 >
 > ------------------------------------------------------------------------
 >


-- 
Jules Bergmann
CodeSourcery
jules at codesourcery.com
(650) 331-3385 x705


