PATCH: Updates to tutorial
Mark Mitchell
mark at codesourcery.com
Fri Sep 15 01:11:54 UTC 2006
This patch fixes some typos/grammar/etc. in the tutorial. There's
clearly more we could do to improve the documentation, but this will
do for the upcoming release.
Jules, Don, I noticed that there's no performance graph for the
temporal-locality version of the parallel fast convolution. Is that
graph available?
Thanks,
--
Mark Mitchell
CodeSourcery
mark at codesourcery.com
(650) 331-3385 x713
2006-09-14 Mark Mitchell <mark at codesourcery.com>
* doc/tutorial/tutorial.xml: Add references to API reference and
specification.
* doc/tutorial/performance.xml: Edit.
* doc/tutorial/parallel.xml: Likewise.
* doc/tutorial/serial.xml: Likewise.
Index: performance.xml
===================================================================
--- performance.xml (revision 149238)
+++ performance.xml (working copy)
@@ -12,8 +12,27 @@
]>
<chapter id="chap-performance"
xmlns:xi="http://www.w3.org/2003/XInclude">
- <title>Performance</title>
+ <title>Profiling</title>
+ <chapterinfo>
+ <abstract>
+ <para>
+ This chapter explains how to use the profiling features of Sourcery
+ VSIPL++ to improve the performance of your application.
+ </para>
+ </abstract>
+ </chapterinfo>
+
+ <para>
+ This chapter explains how to use the profiling features of Sourcery
+ VSIPL++ to improve the performance of your application. Sourcery
+ VSIPL++ provides two profiling modes. The <firstterm>library
+ profiling</firstterm> mode allows you to gather data about the
+ time used for computations performed through the VSIPL++ API. The
+ <firstterm>application profiling</firstterm> mode allows you to
+ instrument blocks of application code to gather data at a higher
+ level.
+ </para>
<section><title>Library Profiling</title>
<para>
@@ -90,16 +109,15 @@
<para>
To enable profiling, define
<option>-DVSIP_IMPL_PROFILER=<replaceable>mask</replaceable></option>
- on the command line when compiling your program.
- On many systems, this option may be added to the CXXFLAGS variable
- in the project makefile.
- </para>
- <para>
- Since profiling can introduce overhead, especially for element-wise
- expressions, this macro allows you to choose which operations in the
- library are profiled. To profile all operations, use
+ on the command line when compiling your program. (If you are
+ using <command>make</command> to build your program, you might
+ want to add this command-line option to the
+ <varname>CXXFLAGS</varname> variable.) To profile all operations, use
<option>-DVSIP_IMPL_PROFILER=15</option>.
See <xref linkend="mask-values"/> for other possible values.
+ Since profiling introduces some overhead, especially for element-wise
+ expressions, you may wish to limit the set of operations that are
+ profiled.
</para>
<note>
<para>
@@ -115,27 +133,31 @@
<section><title>Accumulating Profile Data</title>
<para>
- To use the accumulate mode, you must declare a <code>Profile</code>
+ To use the accumulate mode, you must declare a
+ <classname>Profile</classname>
object. Sourcery VSIPL++ will collect profiling data throughout
- its lifetime. When the object goes out of scope, the data
- collected by profiling will be written to a log file. For
+ the lifetime of this object. When the object goes out of scope,
+ the data collected by profiling will be written to a log file. For
example, to profile your entire program, with all data written
to the file <filename>profile.txt</filename>, you would add
this line:
<screen>Profile profile("profile.txt", pm_accum);</screen>
- to the beginning of your <code>main</code> function, after
+ to the beginning of your <function>main</function> function, after
initializing Sourcery VSIPL++. Then, when the program exits,
this object will go out of scope and profiling data will be
written to the output file. For this reason, only one object
of this type may be in scope at any given time.
</para>
<para>
- If you are profiling your entire program, you may specify options
- on the command line that perform the equivalent of the above two steps:
-
+ If you want to profile your entire program, you may invoke your
+ program with the following command-line options:
<screen>--vsipl++-profile-mode=accum --vsipl++-profile-output=profile.txt</screen>
+ These options will be processed during the call to
+ <function>vsip::init</function>, and are equivalent to declaring
+ the profiling object in <function>main</function>, as described
+ above.
</para>
<para>
Using this technique on the example program <filename>fce-serial.cpp
@@ -149,7 +171,8 @@
(or "event"). The first column gives a name for the event. The
second column is the total amount of time spent in this operation
in "ticks". (You can convert ticks to seconds by dividing by the
- value given by the "clocks_per_sec" value in the profiling header.)
+ <varname>clocks_per_sec</varname> value given in
+ the profiling header.)
The third column indicates the number of times this operation was
performed. The fourth column indicates the number of mathematical
operations performed during the computation. (This is the number of
@@ -369,32 +392,34 @@ Fftm Inv C-C by_ref 64x256 : 1559304 : 1
<section xml:id="performance_api"><title>Performance API</title>
<para>
- An additional interface is provided for getting run-time profile data.
- This allows you to selectively monitor the performance of a
- particular instance of a VSIPL class such as Fft, Convolution or
- Correlation.
- </para>
- <para>
- Classes instrumented the Performance API provide a function
- called <code>impl_performance</code> that takes a pointer to a
- constant character string and returns a single-precision floating
- point number.
+ Sourcery VSIPL++ provides an additional, low-level interface for
+ accessing profile data. This interface allows you to
+ selectively monitor the performance of a particular instance of
+ a class that implements the Performance API. Classes
+ instrumented with the Performance API provide a function called
+ <methodname>impl_performance</methodname>. This function maps
+ keywords (provided as C-style strings) to floating-point values.
+ The <classname>Fft</classname>,
+ <classname>Convolution</classname>, and
+ <classname>Correlation</classname> classes all implement the
+ Performance API.
</para>
<para>
The following call shows how to obtain an estimate of the performance
- in number of operations per second:
+ in number of operations per second from a particular FFT object:
<screen>float mops = fwd_fft.impl_performance("mops");</screen>
- The definition of "operation" varies depending on the object
- and type of data being processed. For example, a single-precison
- Fft object will return the number of single-precison
- floating-point operations performed per second while a complex
- double-precision FFT object will return the number of double-
- precision floating-point operations performed per second.
+ The definition of "operation" varies depending on the
+ object and type of data being processed. For example, a
+ single-precision FFT object will return the number of
+ single-precision floating-point operations performed per second
+ while a complex double-precision FFT object will return the
+ number of double-precision floating-point operations performed
+ per second.
</para>
<para>
- The table below lists the current types of information available.
+ The table below lists the information available.
</para>
<table frame="none" rowsep="0"><title>Performance API Metrics</title>
<tgroup cols="2">
@@ -442,28 +467,28 @@ Fftm Inv C-C by_ref 64x256 : 1559304 : 1
included in the accumulate mode and trace mode output.
</para>
<para>
- Profiling events are recorded by constructing a <code>Scope_event
- </code> object. To create a <code>Scope_event</code>, call the
- constructor, passing it a <code>std::string</code> that will
+ Profiling events are recorded by constructing a <classname>Scope_event
+ </classname> object. To create a
+ <classname>Scope_event</classname>, call the
+ constructor, passing it a <classname>std::string</classname> that will
become the event tag and, optionally, an integer value expressing
the number of floating point operations that will be performed by
- the time the object is destroyed.
- For example, to measure the time taken to compute the main portion
- in the fast convolution example, modify the source as follows:
+ the time the object is destroyed. The following example shows
+ how to use this facility:
</para>
<programlisting><xi:include href="src/profile_example.cpp" parse="text"/>
</programlisting>
- <para>
- The operation count passed as the second parameter is the
- sum of the two FFT's and the vector-matrix multiply.
- This resulting profile data is identical in format to that used for
- profiling library functions.
- </para>
+ <para>
+ The operation count passed as the second parameter is the
+ sum of the counts for the two FFTs and the vector-matrix multiply.
+ The resulting profile data is identical in format to that
+ obtained using the library API:
+ </para>
<programlisting><xi:include href="src/profile_output.txt" parse="text"/>
</programlisting>
<para>
Now the output has a new line that represents the time that
- the <code>Scope_event</code> object exists, i.e. only while the
+ the <classname>Scope_event</classname> object exists, i.e. only while the
program executes the three main steps of the fast convolution.
<screen>Fast Convolution : 4256109 : 1 : 2424832 : 2046.11</screen>
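The tick arithmetic described in the profiling text above can be sketched in standalone C++. This is only an illustration and not part of Sourcery VSIPL++: the ProfileRow struct and both function names are hypothetical, the caller is assumed to have read clocks_per_sec from the profile header, and the fourth column is assumed to be an operation count per call (for a row with a call count of 1, such as the "Fast Convolution" row, that assumption makes no difference).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the numeric columns of one accumulate-mode row:
//   name : ticks : count : ops : mops
struct ProfileRow {
  std::uint64_t ticks;  // total time spent in the event, in clock ticks
  std::uint64_t count;  // number of times the event was performed
  std::uint64_t ops;    // operation count (assumed here to be per call)
};

// Convert ticks to seconds by dividing by the clocks_per_sec value
// taken from the profiling header.
inline double ticks_to_seconds(std::uint64_t ticks, double clocks_per_sec) {
  assert(clocks_per_sec > 0.0);
  return static_cast<double>(ticks) / clocks_per_sec;
}

// Millions of operations per second for one row, under the per-call
// assumption stated above.
inline double mops(const ProfileRow& row, double clocks_per_sec) {
  const double seconds = ticks_to_seconds(row.ticks, clocks_per_sec);
  const double total_ops =
      static_cast<double>(row.ops) * static_cast<double>(row.count);
  return total_ops / seconds / 1e6;
}
```

For the row shown above you would fill in ProfileRow{4256109, 1, 2424832} and pass the clocks_per_sec value reported in your own profile header; the result should be close to the last column of that row if the assumptions hold.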
Index: tutorial.xml
===================================================================
--- tutorial.xml (revision 149238)
+++ tutorial.xml (working copy)
@@ -61,7 +61,11 @@
<title>Reference</title>
<partintro>
<para>
- The sections in Part II form a reference manual for Sourcery VSIPL++.
+ The sections in Part II provide reference information about
+ Sourcery VSIPL++. You should also refer to the VSIPL++ API
+ Specification and Sourcery VSIPL++ API Reference, both of
+ which are available at <ulink
+ url="http://www.codesourcery.com/vsiplplusplus"/>.
</para>
<literallayout>
Index: parallel.xml
===================================================================
--- parallel.xml (revision 149238)
+++ parallel.xml (working copy)
@@ -28,7 +28,7 @@
<para>
The first fast convolution program in the previous chapter makes
use of two implicitly parallel operators: <function>Fftm</function> and
- <function>vmmul</function>. These operators are implicity parallel
+ <function>vmmul</function>. These operators are implicitly parallel
in the sense that they process each row of the matrix
independently. If you had enough processors, you could put each
row on a separate processor and then perform the entire
@@ -38,19 +38,20 @@
<para>
In the VSIPL++ API, you have explicit control of the number of
processors used for a computation. Since the default is to use
- just a single processor, the program above will not run in
- parallel, even on a multi-processor system. This section will show
- you how to use <firstterm>maps</firstterm> to take advantage of
- multiple processors. Using a map tells Sourcery VSIPL++ to
- distribute a single block of data across multiple processors.
- Then, Sourcery VSIPL++ will automatically move data between
- processors as necessary.
+ just a single processor, the program in <xref
+ linkend="sec-serial-fastconv"/> will not run in parallel, even on a
+ multi-processor system. This section will show you how to use
+ <firstterm>maps</firstterm> to take advantage of multiple
+ processors. Using a map tells Sourcery VSIPL++ to distribute a
+ single block of data across multiple processors. Then, Sourcery
+ VSIPL++ will automatically move data between processors as
+ necessary.
</para>
<para>
The VSIPL++ API uses the Single-Program Multiple-Data (SPMD) model
for parallelism. In this model, every processor runs the same
- program, but operates on different sets of data. For instance, in
+ program, but operates on different sets of data. For example, in
the fast convolution example, multiple processors perform FFTs at
the same time, but each processor handles different rows in the
matrix.
@@ -218,12 +219,12 @@
<title>Implicit Parallelism: Parallel Foreach</title>
<para>
- You may feel that the original formulation was simpler and more
+ You may feel that the original formulation using implicitly
+ parallel operators was simpler and more
intuitive than the more-efficient variant using explicit loops.
Sourcery VSIPL++ provides an extension to the VSIPL++ API that
allows you to retain the elegance of that formulation while still
- obtaining the temporal locality obtained with the style shown in
- the previous two sections.
+ obtaining good temporal locality.
</para>
<para>
@@ -373,11 +374,11 @@
<para>
Because the data will be arriving via DMA, you must explicitly
- manage the memory used by Sourcery VSIPL++. Each processor must allocate
- the memory for its local portion of
- <varname>data_in_block</varname>. (All processors except the
- actual input processor will allocate zero bytes, since the input
- data is located on a single processor.) The code required to
+ manage the memory used by Sourcery VSIPL++. Because VSIPL++ uses the
+ SPMD model, each processor must allocate
+ the memory for its local portion of the input block, even though all
+ processors except the actual input processor will allocate zero
+ bytes. The code required to
set up the views is:
</para>
Index: serial.xml
===================================================================
--- serial.xml (revision 149238)
+++ serial.xml (working copy)
@@ -151,7 +151,7 @@
Before performing the actual convolution, you must convert the
replica to the frequency domain using the FFT created above. Because
the replica data is a property of the chirp, we only need to do
- this once; even if our radar system runs for a long time, the
+ this once; even if the radar system runs for a long time, the
converted replica will always be the same. VSIPL++ FFT
objects behave like functions, so you can just "call" the
FFT object:
@@ -165,7 +165,7 @@
objects you've already created to go into and out of the frequency
domain. While in the frequency domain, you will use the
<function>vmmul</function> operator to perform a
- vector-matrix multiply. This will multiply each row
+ vector-matrix multiply. This operator multiplies each row
(dimension zero) of the frequency-domain matrix by the replica.
The <function>vmmul</function> operator is a template taking a
single parameter which indicates whether the multiplication should
@@ -284,7 +284,7 @@
}]]></programlisting>
<para>
- The following graph shows that the new "interleaves"
+ The following graph shows that the new "interleaved"
formulation is faster than the original "phased" approach
for large data sets. For smaller data sets (where all of the data
fits in the cache anyhow), the original method is faster because
@@ -309,12 +309,12 @@
</para>
<para>
- To perform I/O with external routines (such as posix
- <function>read</function> and <function>write</function>
- it is necessary to obtain a pointer to data.
- Sourcery VSIPL++ provides multiple ways to do this:
- using <firstterm>user-defined storage</firstterm>, and
- using <firstterm>external data access</firstterm>.
+ To perform I/O with external routines (such as the POSIX
+ <function>read</function> and <function>write</function> functions)
+ it is necessary to obtain a pointer to the raw data used by
+ Sourcery VSIPL++. Sourcery VSIPL++ provides two ways to do this:
+ you may use either <firstterm>user-defined storage</firstterm> or
+ <firstterm>external data access</firstterm>.
In this section you will use user-defined storage to
perform I/O. Later, in <xref linkend="sec-io-extdata"/> you
will see how to use external data access for I/O.
@@ -385,7 +385,7 @@
The <varname>true</varname> argument indicates that the data
values sould be preserved by the admit. In cases where the
values do not need to preserved (such as admitting a block
- after outout I/O has been performed and before the block will be
+ after output I/O has been performed and before the block will be
overwritten by new values in VSIPL++) you can use
<varname>false</varname> instead.
</para>
@@ -417,14 +417,13 @@
<para>
In this section, you will use <firstterm>External Data
- Access</firstterm> to get pointer to a block's data.
- External data access allows a pointer to any block's
- data to be taken, even if the block was not created with
- user-specified storage (or if the block is not a <varname>Dense</varname>
- block at all!) This capability is useful in context where you
- cannot control how a block is created. To illustrate
- this, you will create a utility routine for I/O that works
- with any view passed as a parameter.
+ Access</firstterm> to get a pointer to a block's data.
+ You can use this method with any block, even if the block does not
+ use user-specified storage. The external data access method is
+ useful in contexts where you cannot control how the block is
+ allocated. For example, in this section, you will create a utility
+ routine for I/O that works with any matrix or vector, even if it
+ was not created with user-defined storage.
</para>
<para>
@@ -440,30 +439,30 @@
<varname>block_type</varname> and the requested layout
<varname>layout_type</varname>. The constructor takes
two parameters: the block being accessed, and the type of
- syncing necessary.
+ synchronization necessary.
</para>
<para>
- The <varname>layout_type</varname> parameter is an
- specialized <varname>Layout</varname> class template that
+ The <varname>layout_type</varname> parameter is a
+ specialized <classname>Layout</classname> class template that
determines the layout of data that <function>Ext_data</function>
provides. If no type is given,
the natural layout of the block is used. However, in some
- cases it is necessary to access the data in a certain way,
- such as dense or row-major.
+ cases you may wish to specify row-major or column-major layout.
</para>
<para>
- <varname>Layout</varname> class template takes 4 parameters to
+ The <classname>Layout</classname> class template takes 4 parameters to
indicate dimensionality, dimension-ordering, packing format,
and complex storage format (if complex). In the example below
you will use the layout_type to request the data access to be dense,
- row-major, with interleaved real and imaginar values if complex.
- This will allow you to read data sequentially from a file.
+ row-major, with interleaved real and imaginary values. This layout
+ corresponds to a common storage format used for binary files
+ storing complex data.
</para>
<para>
- The sync type is analgous to the update flags for
+ The synchronization type is analogous to the update flags for
<function>admit()</function> and <function>release()</function>.
<varname>SYNC_IN</varname> indicates that the block and pointer
@@ -486,15 +485,18 @@
<programlisting><![CDATA[ value_type* ptr = ext.data();]]></programlisting>
<para>
- The pointer provided is valid only during the life of the object.
- Moreover, the block being accessed should not be used during that time.
+ The pointer provided is valid only during the life of the
+ <classname>Ext_data</classname> object.
+ Moreover, the block referred to by the
+ <classname>Ext_data</classname> object must not be used during this
+ period.
</para>
<para>
- Putting this together, you can create a routine to perform
+ Using these capabilities together, you can create a routine to perform
I/O into a block. This routine will take two arguments:
- a filename to read, and a view to put the data into.
+ a filename to read, and a view in which to store the data.
The amount of data read from the file will be determined by
the view's size.
</para>