[vsipl++] [patch] Tutorial updates
Jules Bergmann
jules at codesourcery.com
Wed Sep 13 13:09:24 UTC 2006
Don,
This looks good. I have several suggestions below on the tutorial chapter.
Use them as you please :)  Once you're happy, please check it in. We can
continue to incorporate edits as we review the whole document.
I haven't had a chance to read the reference chapter yet; I will send
comments on that later.
thanks,
-- Jules
> + <para>
> + In addition to the accumulate and trace modes, which have pre-defined
> + output formats, Sourcery VSIPL++ exposes a profiling API that you can
^^ Performance API
> + use to gather data directly on individual objects, such as FFTs.
> + If you need finer control of what operations are profiled, or if you
> + want to record the profiling data in a custom format, you may wish to
> + use this API directly. See <xref linkend="performance_api"/> for
> + more details.
> + </para>
> +<table xml:id="supported_ops" frame="none" rowsep="0">
> + <title>Operations Supporting Profiling</title>
> <para>
> See the file <filename>profiling.txt</filename> for a detailed
> explanation of the profiler output for each of the functions above.
> + See the file <filename>profiling.txt</filename> for a detailed
Isn't profiling.txt now Chapter 5?
> + explanation of the profiler output for each of the functions above.
> + For information about how to configure the library for profiling,
> + see the Quickstart also.
> </para>
> + <para>
> + This macro enables profiling operations in several different areas
> + of the library, depending on the value of
> + <replaceable>mask</replaceable>. To profile all operations, use
> + the value <literal>15</literal>. See <xref linkend="mask-values"/>
> + for other possible values.
I would mention the motivation behind why we have a mask:
"Since profiling can introduce overhead, especially for element-wise
expressions, this macro allows you to choose which operations in the
library are profiled. To profile all operations, use the value
<literal>15</literal>. See ..."
> + </para>
> + <note>
> + <para>
> + Profiling support requires that you link with a version of Sourcery
> + VSIPL++ that supports profiling. If you have received a binary
> + distribution of Sourcery VSIPL++ from CodeSourcery, you probably
> + already have an appropriate version of the library. If you are
> + building Sourcery VSIPL++ yourself, see the Quickstart guide for
> + more information about the requirements for building Sourcery
> + VSIPL++ with profiling enabled.
We've changed things so that all libraries support profiling, if a timer
is provided:
"Profiling requires that the library be configured with a high-resolution
timer. Binary distributions of Sourcery VSIPL++ from CodeSourcery are
already configured this way. If you are building Sourcery VSIPL++ from
source, see the Quickstart guide for more information about configuring
high-resolution timers."
> + </para>
> + </note>
> + <section><title>Setup</title>
> + <para>
> + The only computation performed in the setup phase is a forward FFT
> + that maps the pulse replica into the frequency domain. This
> + computation corresponds to the following line of the profiling
> + data:
> +<screen>Fft Fwd C-C by_ref 256 : 142119 : 1 : 10240 : 258.767
> +</screen>
> + The "Fft Fwd C-C by_ref 256" tag indicates that this computation
> + is a 256-element forward FFT with complex, single-precision
inputs
> + and outputs, returning its result by reference. The notation
used
> + for data types (e.g., "C-C" in this example) is given in
^^ described
> + <xref linkend="data-type-names"/>.
> + </para>
> +
> </section>
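It might be worth tying this profile line back to the code that produces it.
Something along these lines could work (just a sketch -- I'm guessing at the
variable names, so adjust to whatever the fast_conv example actually uses):

  #include <vsip/initfin.hpp>
  #include <vsip/support.hpp>
  #include <vsip/vector.hpp>
  #include <vsip/signal.hpp>

  using namespace vsip;

  int main(int argc, char** argv)
  {
    vsipl init(argc, argv);

    // Forward 256-point complex FFT, out-of-place, by reference.  This is
    // the object behind the "Fft Fwd C-C by_ref 256" line; its theoretical
    // op count, 5 * N * log2(N) = 10240 for N = 256, matches the count
    // reported above.
    typedef Fft<const_Vector, cscalar_f, cscalar_f, fft_fwd, by_reference>
            fwd_fft_type;
    fwd_fft_type fwd_fft(Domain<1>(256), 1.f);

    Vector<cscalar_f> replica(256, cscalar_f());   // pulse replica
    Vector<cscalar_f> freq_replica(256);

    fwd_fft(replica, freq_replica);                // setup-phase FFT

    return 0;
  }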
> - <section id="trace-profile-data"><title>Trace Profile Data</title>
> + <section><title>Convert to frequency domain</title>
> <para>
> - This mode is used similarly to accumulate mode, except that an
> - extra parameter is passed to the creation of the <code>Profile</code>
> - object.
> - <screen>Profile profile("/dev/stdout", pm_trace);</screen>
> + The next step of the computation is to convert from the time domain
> + to the frequency domain. In particular, an FFT is applied to a data
> + cube of 64 pulses, each containing 256 range cells:
"In particular, a FFT is applied to each pulse of a data cube, which
consists
of 64 pulses each containing 256 range cells:"
> +<screen>Fftm Fwd row_type C-C by_ref 64x256 : 1188144 : 1 : 1146880 : 3466.65
> +</screen>
> + For this FFT, the size is reported differently (rows x columns)
> + because this is a two-dimensional FFT.
It's not a 2-D FFT, it's a "Multiple 1D FFT":
"For this operation, an Fftm object was used to perform multiple FFTs on
each row of the data cube."
> + The operation count (1.1 million) far outweighs that of
> + any other step, except the inverse FFT.
> + The performance measured was 3.5 GFLOP/s on a 3.6 GHz Xeon.
> + Since the theoretical peak performance on such
> + a machine is about 14.4 GFLOP/s, the program has achieved
> + a very good 24% of peak.
> + Other example programs measure in-cache FFT performance on vectors
> + of the same size at 4.9 GFLOP/s. Therefore, considering that the
> + 3.5 GFLOP/s includes cache overheads, the result is still good.
I would move the first sentence to a new paragraph following, to give it
some more context:
"Since the operation count (1.1 million) of the FFT (and inverse
FFT) outweigh the rest of the computation, the overall performance
will be very close to the FFT performance."
> + </para>
> + </section>
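As above, a short fragment showing the object behind this line might help
(a sketch, continuing the code from the setup section and adding
<vsip/matrix.hpp>; the real example may differ):

  // Multiple 1-D FFTs, one per row of the 64 x 256 data cube.  This is the
  // object behind the "Fftm Fwd row_type C-C by_ref 64x256" line.
  typedef Fftm<cscalar_f, cscalar_f, row, fft_fwd, by_reference>
          fwd_fftm_type;
  fwd_fftm_type fwd_fftm(Domain<2>(64, 256), 1.f);

  Matrix<cscalar_f> data(64, 256);        // 64 pulses x 256 range cells
  Matrix<cscalar_f> freq_data(64, 256);

  fwd_fftm(data, freq_data);              // one FFT per row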
> + <section><title>Convolution</title>
> + <para>
> + The actual convolution consists of a vector-matrix multiplication.
> + The corresponding profiling output is:
> +<screen>Expr_Loop_Vmmul 2D vmmul(C,C) 64x256 : 1539531 : 1 : 98304 : 229.321
> +</screen>
> + Sourcery VSIPL++ chose to evaluate this expression by performing a
> + row-wise vector-vector multiplication on each of the rows of the
> + matrix. Therefore, there is a second line:
> +<screen>Expr_SIMD_VV-simd::vmul 1D *(C,C) 256 : 316674 : 64 : 1536 : 1114.86
> +</screen>
> + The tag used for this expression is "*(C,C)". The profiling tag for
> + many operations is shown using a prefix notation; the operation
> + performed is followed by the types of the arguments. The "simd" tag
> + indicates that VSIPL++ used the Single Instruction Multiple Data (SIMD)
> + facilities on the Xeon architecture for maximum performance.
> + </para>
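Maybe also show the expression that generates these two lines (a sketch;
vmmul comes from <vsip/math.hpp>, and the names follow the fragments
suggested above rather than the actual example source):

  // Multiply the frequency-domain replica element-wise across each row of
  // the frequency-domain data.  Evaluating this expression produces the
  // "vmmul(C,C)" event; each row multiply produces a "*(C,C)" event.
  freq_data = vmmul<0>(freq_replica, freq_data);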
> + <para>
> + The tick count for the vector-matrix multiplication (vmmul) includes
> + the time spent in the multiple row-wise scalar-vector multiplications.
> + Therefore the total amount of time used by the program is *not* the
> + sum of the tick counts given for each line.
We should mention why vmmul performance is less than the constituent
scalar-vector multiplies:
"You should notice the performance difference between the vmmul event and
the individual scalar-vector multiplications. Some of this is due to the
extra work vmmul does to set up each individual multiplication: loop
overhead and subview creation. However, most of this is due to the
overhead of profiling: the cost of accessing timers and the cost of
maintaining profile data structures.
In general, profiling overhead only slows the program execution but
does not affect the measurements taken. However, when an operation
being profiled (such as vmmul) consists of many invocations of other
profiled operations (such as scalar-vector multiplication), measurements
may be affected.
When profiling is disabled, the performance of vmmul will be very close
to the performance measured for the individual scalar-vector
multiplications."
> + </para>
> + </section>
> + <section><title>Convert back to time domain</title>
> + <para>
> + The last step of the algorithm is to convert back to the time domain
> + by using an inverse FFT. An inverse FFT is computationally
> + equivalent to a forward FFT, except that an additional multiplication
> + is performed to handle scaling. The lines corresponding to the
> + inverse FFT are:
When scaling is done is a choice left up to the user, so instead of
saying "An inverse FFT is computationally equivalent to a forward FFT,
except ...", which implies this is true of all FFTs, you might say
"The inverse FFT is computationally equivalent to the forward FFT, except
...", which implies this is true for the FFTs in the example.
> +<screen>Expr_Dense 2D *(C,s) 64x256 : 687285 : 1 : 32768 : 171.228
> +Expr_Loop 1D *(C,s) 16384 : 653265 : 1 : 32768 : 180.145
> +Fftm Inv row_type C-C by_ref 64x256 : 1559304 : 1 : 1146880 : 2641.48
> +</screen>
> + The first line describes an evaluation of a "dense" two-dimensional
> + multiplication between a single-precision complex
> + view (a matrix) and a single-precision scalar. Note that
> + scalars are represented using lower-case equivalents for
> + the data types in the table above.
> + </para>
> + <para>
> + A "dense" matrix is one in which the values are packed
> + tightly in memory with no intervening space between the rows
> + or columns. Therefore, the two-dimensional multiplication can
> + be thought of as a 1-dimensional multiplication of a long vector.
> + The evaluation of the 2-D operation includes the time required for
> + the 1-D operation, together with a small amount of overhead.
> + You can tell that this is the case as the time shown on the
> + first line is slightly greater than the time shown on the second.
> + Both show the same number of operations because they are
> + referring to the same calculation.
> + </para>
> + <para>
> + Similarly, the time required for the inverse FFT includes both the
> + time spent actually computing the FFT as well as the time required
> + for the scaling multiplication. Because the multiplication is not
> + included in the theoretical operation count, the MOP/s count shown
> + is somewhat smaller than for the forward FFT.
I believe the theoretical operation count is intended to include this
scaling cost, but it requires extra effort on the part of the
implementation:
"For FFTs, Sourcery VSIPL++ uses the commonly accepted theoretical
operation count of 5 N log2(N). This includes the cost of scaling,
which may be folded in with the final twiddle factors. However, as this
example illustrates, not all FFT backends have this capability; as a
result, scaled FFTs often have a MOP/s rate lower than non-scaled
FFTs."
> + </para>
> + </section>
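We could also show the inverse Fftm with its scale factor, since that is
where the extra "*(C,s)" multiplies come from (a sketch, continuing the
fragments above; the actual scale factor used by the example may differ):

  // Inverse Fftm over the rows, with 1/N scaling.  When the FFT backend
  // cannot fold the scale into its twiddle factors, the scaling appears
  // as the separate "*(C,s)" multiply events shown above.
  typedef Fftm<cscalar_f, cscalar_f, row, fft_inv, by_reference>
          inv_fftm_type;
  inv_fftm_type inv_fftm(Domain<2>(64, 256), 1.f / 256);

  inv_fftm(freq_data, data);              // back to the time domain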
> + </section>
> + <para>
> + The analysis presented in this section is only a portion of what
> + one would do to verify an algorithm is performing as desired.
> + Core routines utilizing techniques such as the fast convolution
> + method comprise only a portion of larger programs whose
> + performance is also of interest.
> + The profiling capabilities utilized here can be extended to cover
> + those areas of the application as well.
> + See <xref linkend="application_profiling"/> for more details.
> + </para>
> + </section>
> + <section><title>Trace Profile Data</title>
> + <para>
Flow suggestion: describe what trace mode is, then give details on how
to enable it:
"In trace mode, the profiler records each library call as a pair
of events, allowing you to see where each call was made and when it
returned. This provides two time stamps per call, showing not only
which functions were executed, but how they were nested with respect
to one another. This mode is useful for investigating the execution
sequence of your program.
To enable trace mode, construct the 'Profile' object with a 'pm_trace'
flag, as in this line:
<screen>Profile profile("profile.txt", pm_trace);</screen>
Long traces can result when profiling in this mode, so be sure to
avoid gathering more data than you have memory to store (and have
time to process later). The output is very similar to the output in
accumulate mode."
> + By passing an additional parameter to the 'Profile' constructor,
> + you can switch from "accumulate" mode to "trace" mode. This line:
> +<screen>Profile profile("profile.txt", pm_trace);</screen>
> + will cause Sourcery VSIPL++ to enter trace profiling mode.
> + All computations performed by your program while
> + <code>profile</code> is in scope will be traced.
> This mode is useful for investigating the execution sequence
> of your program.
> - The profiler simply records each library call as a pair of events,
> - allowing you to see where it entered and exited scope in each case.
> + The profiler records each library call as a pair of events,
> + allowing you to see where each call was made and when it returned.
> + This provides two time stamps per call, showing not only which
> + functions were executed, but how they were nested with respect
> + to one another.
> + Long traces can result when profiling in this mode, so
> + be sure to avoid gathering more data than you have memory to
> + store (and have time to process later). The output is very
> + similar to the output in accumulate mode.
> </para>
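It might also be worth showing that the extent of the trace is controlled
by the lifetime of the Profile object (a sketch; 'process_pulse' and
'npulse' are stand-ins for whatever the surrounding program does):

  {
    // Only operations performed while 'profile' is alive are traced.
    Profile profile("profile.txt", pm_trace);

    for (index_type p = 0; p < npulse; ++p)
      process_pulse(p);                   // hypothetical processing loop
  }
  // Tracing stops when 'profile' goes out of scope.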
> <para>
> - Long traces can result when profiling in this mode, so be sure to
> - avoid taking more data than you have memory to store (and have time
> - to process later). The output is very similar to the output in
> - accumulate mode.
> + Here is a sample of the output obtained by running the fast
> + convolution example in trace mode, which can also be run with
> + the options
> +<screen>--vsipl++-profile-mode=trace --vsipl++-profile-output=profile.txt
> +</screen>
> </para>
> <programlisting><xi:include href="src/profile_trace.txt" parse="text"/>
> </programlisting>
> <para>
> - For each event, the profiler outputs an event number, an identifying
> - tag, and the current timestamp (in "ticks"). The next two fields
> - differ depending on whether the event is coming into scope or out of
> - scope. When coming into scope, a zero is shown followed by the
> - estimated count of floating point operations for that function.
> - When exiting scope, the profiler displays the event number being
> - closed followed by a zero. In all cases, the timestamp (and
> - intervals) may be converted to seconds by dividing by the
> - 'clocks_per_second' constant in the log file header.
> + For each event, Sourcery VSIPL++ outputs an event number,
> + an identifying tag, and the current timestamp (in "ticks").
> + The next two fields differ depending on whether the event
> + marks the entry point of a library function or its return.
> + At the start of a call, a zero is shown followed by the estimated
> + count of floating point operations for that function. When
> + returning from a call, the profiler displays the event number
> + created when the function was called, followed by a zero.
> + In all cases, the timestamp (and intervals) may be converted to
> + seconds by dividing by the 'clocks_per_second' constant in the
> + log file header.
> </para>
> + <para>
> + In the break shown by the ellipsis, the program is in the middle of
> + performing the vector-matrix multiply, which has been broken down
> + into 64 separate vector-multiplies. The first two FFTs are
> + completed, as shown by the fact that each has two entries,
> + one for where the computation began and one for where it ended.
> + The Vmmul function has started, but not yet finished, so it only
> + has one entry as of yet.
The output includes the end event of the vmmul. How about
"For brevity, events for some of the 64 scalar-vector mulitplies
performed in
the vmmul operation have been replaced with an ellipses."
> + </para>
> </section>
> - <section id="performance-api"><title>Performance API</title>
> + <section xml:id="performance_api"><title>Performance API</title>
> <para>
> An additional interface is provided for getting run-time profile data.
> This allows you to selectively monitor the performance of a
> @@ -166,19 +331,19 @@
> </para>
> <para>
> Classes with the Performance API provide a function called
> - <code>impl_performance</code> that takes a string parameter and returns
> - single-precision floating point number.
> + <code>impl_performance</code> that takes a std::string parameter
> + and returns a single-precision floating point number.
Doesn't impl_performance take a 'char const*' parameter?
> </para>
> <para>
> The following call shows how to obtain an estimate of the performance
> in number of operations per second:
> -
> - <screen>float mops = fwd_fft.impl_performance("mops");</screen>
> -
> - An "operation" will vary depending on the object and type of data
> - being processed. For example, a single-precison Fft object will
> - return the number of single-precison floating-point operations
> - performed per second.
> +<screen>float mops = fwd_fft.impl_performance("mops");</screen>
> + The definition of "operation" varies depending on the object
> + and type of data being processed. For example, a single-precision
> + Fft object will return the number of single-precision
> + floating-point operations performed per second while a complex
> + double-precision FFT object will return the number of double-
> + precision floating-point operations performed per second.
> </para>
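Perhaps extend the example to show the result being used, e.g. (a sketch;
'fwd_fft' and 'inv_fftm' follow the fragments suggested above, I'm assuming
Fftm provides impl_performance as well, "mops" is the only metric used here
since the others are in the table below, and <iostream> is needed):

  // After the processing loop has run, query the Performance API on the
  // FFT objects.
  float fwd_mops = fwd_fft.impl_performance("mops");
  float inv_mops = inv_fftm.impl_performance("mops");

  std::cout << "Forward FFT:  " << fwd_mops << " MOP/s\n"
            << "Inverse Fftm: " << inv_mops << " MOP/s\n";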
> <para>
> The table below lists the current types of information available.
> @@ -219,37 +384,59 @@
> </table>
> </section>
> </section>
> - <section id="application-profiling"><title>Application Profiling</title>
> + <section xml:id="application_profiling">
> + <title>Application Profiling</title>
> <para>
> - The profiling mode provides an API that allows you to instrument
> - your own code. Here we introduce a new object, the
> - <code>Scope_event</code> class, and show you how to use it in your
> - application.
> + Sourcery VSIPL++ provides an interface that allows you to instrument
> + your own code through the <code>Scope_event</code> class.
For avoidance of doubt, you could mention that these events get included
in the profiling output:
"Sourcery VSIPL++ provides an interface that allows you to instrument
your own code with profiling events that will be included in the
accumulate mode and trace mode output.
"Profiling events are recorded by constructing a 'Scope_even' object.
... MERGE WITH NEXT PARAGRAPH"
> </para>
> <para>
> - To create a <code>Scope_event</code>, simply call the constructor,
> - passing it the string that will become the event tag and, optionally,
> - an integer value expressing the number of floating point operations
> - that will be performed by the time the <code>Scope_event</code> object
> - is destroyed. For example, to measure the time taken to compute a
> - simple running sum of squares over a C array:
> + To create a <code>Scope_event</code>, call the constructor, passing
> + it a std::string that will become the event tag and, optionally, an
> + integer value expressing the number of floating point operations
> + that will be performed by the time the <code>Scope_event</code>
> + object is destroyed. For example, to measure the time taken to
> + compute the main portion in the fast convolution example,
> + modify the source as follows:
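For the example itself, maybe something like this (a sketch: I'm guessing at
the variable names from the fast_conv example, 'ops_estimate' is just a
placeholder for the estimated operation count, and the header that declares
Scope_event should be named wherever the class is introduced above):

  {
    // Wrap the whole fast-convolution step in a single profiling event.
    // 'ops_estimate' is a placeholder for the estimated floating-point
    // operation count of the code below.
    Scope_event scope("fast_conv", ops_estimate);

    fwd_fftm(data, freq_data);                      // to frequency domain
    freq_data = vmmul<0>(freq_replica, freq_data);  // convolve with replica
    inv_fftm(freq_data, data);                      // back to time domain
  }
  // The event is closed when 'scope' is destroyed at the closing brace.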
--
Jules Bergmann
CodeSourcery
jules at codesourcery.com
(650) 331-3385 x705