From don at codesourcery.com Sun Sep 3 00:48:39 2006 From: don at codesourcery.com (Don McCoy) Date: Sat, 02 Sep 2006 18:48:39 -0600 Subject: [patch] Profiling configuration change Message-ID: <44FA2667.50404@codesourcery.com> This patch changes the way users will compile their programs when using profiling. Before, configuration options were needed. Now user programs will be compiled with a macro that serves the same purpose. Use -DVSIP_IMPL_PROFILER=15 to enable profiling of all operations. The value 15 (0x0F) is a mask composed of bits now defined in impl/profile.hpp. Each bit corresponds to a set of operations as before. It is not necessary to define anything when building the library (other than a timer), so both the debug and release binary builds will be suitable for use with profiling. Regards, -- Don McCoy don (at) CodeSourcery (888) 776-0262 / (650) 331-3385, x712 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: pe.changes URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: pe.diff URL: From jules at codesourcery.com Sun Sep 3 00:54:45 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Sat, 02 Sep 2006 20:54:45 -0400 Subject: [vsipl++] [patch] Profiling configuration change In-Reply-To: <44FA2667.50404@codesourcery.com> References: <44FA2667.50404@codesourcery.com> Message-ID: <44FA27D5.6050708@codesourcery.com> Don McCoy wrote: > This patch changes the way users will compile their programs when using > profiling. Before, configuration options were needed. Now user > programs will be compiled with a macro that serves the same purpose. > > Use -DVSIP_IMPL_PROFILER=15 to enable profiling of all operations. The > value 15 (0x0F) is a mask composed of bits now defined in > impl/profile.hpp. Each bit corresponds to a set of operations as before. 
> > It is not necessary to define anything when building the library (other > than a timer), so both the debug and release binary builds will be > suitable for use with profiling. > > Regards, Don, this looks good, please check it in. thanks, -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From don at codesourcery.com Sun Sep 3 19:53:17 2006 From: don at codesourcery.com (Don McCoy) Date: Sun, 03 Sep 2006 13:53:17 -0600 Subject: [vsipl++] Readme for Profiling In-Reply-To: <44EB5004.1090409@codesourcery.com> References: <44EB5004.1090409@codesourcery.com> Message-ID: <44FB32AD.8050707@codesourcery.com> This has been updated to reflect recent changes in how we enable profiling. Don McCoy wrote: > This 'readme' file is referred to in the tutorial section on > profiling, meant to reside in the top-level directory of the source > distribution. It serves as a place to put implementation details that > would otherwise clutter the tutorial. It also makes a nice handy > mini-reference. > > In the near future I'd like to add some more details regarding each of > the objects or events we profile internally. This may make it more > clear how to determine which events are "nested" (i.e. listed by more > than one expression evaluator). > -- Don McCoy don (at) CodeSourcery (888) 776-0262 / (650) 331-3385, x712 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: pr2.changes URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: pr2.diff URL: From don at codesourcery.com Tue Sep 5 19:00:35 2006 From: don at codesourcery.com (Don McCoy) Date: Tue, 05 Sep 2006 13:00:35 -0600 Subject: [patch] SIMD 'rscvmul' evaluators Message-ID: <44FDC953.4030509@codesourcery.com> This patch corrects a problem where two of the SIMD evaluators were not handling re-dimensioned (2-D --> 1-D) views correctly. 
This only affects the case where an element-wise multiplication is being performed with a real scalar and a complex view (and the view was re-dim'd). It is worth mentioning also that this defect was uncovered using the profiling features recently added. In general, the evaluators were doing the right thing, but when they did not, it fell back to using loop fusion, thereby still getting the correct answer (just not taking advantage of SIMD instructions). Regards, -- Don McCoy don (at) CodeSourcery (888) 776-0262 / (650) 331-3385, x712 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: sl.changes URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: sl.diff URL: From mark at codesourcery.com Tue Sep 5 20:08:43 2006 From: mark at codesourcery.com (Mark Mitchell) Date: Tue, 05 Sep 2006 13:08:43 -0700 Subject: [vsipl++] [patch] SIMD 'rscvmul' evaluators In-Reply-To: <44FDC953.4030509@codesourcery.com> References: <44FDC953.4030509@codesourcery.com> Message-ID: <44FDD94B.4070900@codesourcery.com> Don McCoy wrote: > This patch corrects a problem where two of the SIMD evaluators were not > handling re-dimensioned (2-D --> 1-D) views correctly. This only > affects the case where an element-wise multiplication is being performed > with a real scalar and a complex view (and the view was re-dim'd). > > It is worth mentioning also that this defect was uncovered using the > profiling features recently added. In general, the evaluators were > doing the right thing, but when they did not, it fell back to using loop > fusion, thereby still getting the correct answer (just not taking > advantage of SIMD instructions). This is very nice on several levels: nice that the profilers found the problem, nice that the bug was performance degradation rather than failure, and nice that you were able to easily fix it! 
-- Mark Mitchell CodeSourcery mark at codesourcery.com (650) 331-3385 x713 From jules at codesourcery.com Tue Sep 5 20:38:25 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Tue, 05 Sep 2006 16:38:25 -0400 Subject: [vsipl++] [patch] SIMD 'rscvmul' evaluators In-Reply-To: <44FDC953.4030509@codesourcery.com> References: <44FDC953.4030509@codesourcery.com> Message-ID: <44FDE041.3000802@codesourcery.com> Don McCoy wrote: > This patch corrects a problem where two of the SIMD evaluators were not > handling re-dimensioned (2-D --> 1-D) views correctly. This only > affects the case where an element-wise multiplication is being performed > with a real scalar and a complex view (and the view was re-dim'd). Don, this looks good, please check it in. thanks -- Jules > > It is worth mentioning also that this defect was uncovered using the > profiling features recently added. In general, the evaluators were > doing the right thing, but when they did not, it fell back to using loop > fusion, thereby still getting the correct answer (just not taking > advantage of SIMD instructions). This is good stuff! -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From jules at codesourcery.com Tue Sep 5 21:54:35 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Tue, 05 Sep 2006 17:54:35 -0400 Subject: PAS support for split-complex Message-ID: <44FDF21B.2040000@codesourcery.com> This patch adds support to PAS parallel services (and some testing) for split complex. -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: pas-split.diff URL: From jules at codesourcery.com Wed Sep 6 15:19:58 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Wed, 06 Sep 2006 11:19:58 -0400 Subject: [vsipl++] Readme for Profiling In-Reply-To: <44FB32AD.8050707@codesourcery.com> References: <44EB5004.1090409@codesourcery.com> <44FB32AD.8050707@codesourcery.com> Message-ID: <44FEE71E.8070906@codesourcery.com> Don McCoy wrote: > This has been updated to reflect recent changes in how we enable profiling. Don, This looks good. I originally envisioned this to be primarily the event descriptions in section 5 ("Event Tags") that would be fluctuating around enough that a text file is the best place to keep them. However, the additional material looks good. It just increases the pressure to get it into (soon to be created) reference section of the user's guide! I have some suggestions below. I think the document looks good overall, once you're happy, please check it in as a text file. In the meantime, I will rename the "Tutorial" document to be a "User's Guide" with tutorial and reference sections. When this is done, you can create a chapter with the stable bits from this document. -- Jules > > ------------------------------------------------------------------------ > > 2006-09-03 Don McCoy > > * profiling.txt: New file. Readme for built-in profiling. > > > ------------------------------------------------------------------------ > > Index: profiling.txt > =================================================================== > --- profiling.txt (revision 0) > +++ profiling.txt (revision 0) > @@ -0,0 +1,256 @@ > +------------------------------------------------------------------------- > + Sourcery VSIPL++ Profiling API > +------------------------------------------------------------------------- > +Copyright (c) 2006 by CodeSourcery. All rights reserved. 
> +
> +
> +Contents
> +-------------------------------------------------------------------------
> +1) Compiling with Profiling Enabled
> +2) Command Line Options
> +3) Profiling Functions
> +4) Profile Log Files
> +5) Event Tags
> +
> +
> +
> +1) Compiling with Profiling Enabled

I would call this section "Configure and Compile Options for Profiling"
or just "Configure and Compile Options" (since this is the profiling
reference chapter).

> +-------------------------------------------------------------------------

"There are no configure options for profiling; instead, it is enabled via
compile-time options. However, to use profiling it is necessary to
configure the library with a suitable high-resolution timer (cross
reference to the '--enable-timer' option in quickstart). For example,"

> +If building from source, enable a suitable high-resolution timer
> +when configuring the library. For example,
> +
> +   --enable-timer=x86_64_tsc
> +
> +Pre-built versions of the library enable a suitable timer for your
> +system.
> +
> +
> +To enable profiling, define VSIP_IMPL_PROFILER= on the command
> +line when compiling your program. On many systems, this option may be
> +added to the CXXFLAGS variable in the project makefile.
> +
> +This macro enables profiling operations in several different areas
> +of the library, depending on the value of
> +
> +   Profiling Configuration Mask
> +
> +   Section   Description               Value
> +   -------------------------------------
> +   signal    Signal Processing           1
> +   matvec    Linear Algebra              2
> +   fns       Elementwise Functions       4
> +   user      User-defined Operations     8
> +
> +Determine the mask value by summing the values listed in the table
> +for the areas you wish to profile. For example, if you wish to
> +gather performance data on your own code as well as for FFTs,
> +you would enable 'user' and 'signal' from the table above. The
> +value you would choose would be 1 + 8 = 9.
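The mask arithmetic described above can be sketched in a few lines of C++. This is only an illustration: the enumerator and function names below are hypothetical, and the real bit definitions live in impl/profile.hpp, where they may be spelled differently. Only the numeric values (1, 2, 4, 8) come from the table.

```cpp
// Hypothetical names for illustration only; the real bit definitions
// live in impl/profile.hpp and may be spelled differently.
enum profile_area
{
  area_signal = 1,  // Signal Processing
  area_matvec = 2,  // Linear Algebra
  area_fns    = 4,  // Elementwise Functions
  area_user   = 8   // User-defined Operations
};

// Combine up to four areas into the value given to VSIP_IMPL_PROFILER.
// Each area is a distinct bit, so OR-ing them equals summing them.
inline unsigned profiler_mask(unsigned a, unsigned b = 0,
                              unsigned c = 0, unsigned d = 0)
{
  return a | b | c | d;
}

// True if 'mask' enables profiling for 'area'.
inline bool area_enabled(unsigned mask, profile_area area)
{
  return (mask & area) != 0;
}
```

With these definitions, profiler_mask(area_signal, area_user) gives 9, matching the example in the readme, and combining all four areas gives 15, the value used with -DVSIP_IMPL_PROFILER=15 in the original patch announcement.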
> +
> +
> +
> +2) Command Line Options
> +-------------------------------------------------------------------------

I would emphasize that this is contingent on enabling profiling when
compiling.

"For programs that have been compiled with profiling enabled, the
profiling mode and output file can be controlled from the command line."

> +You may profile programs without inserting any code by specifying the
> +options on the command line. Use this to choose the profiler mode:
> +
> +   --vsipl++-profile-mode={accum, trace}
> +

These paragraphs on trace and accumulate mode could go into a separate
section "Profiling Modes", or into the section on the log file format.

This point in the file is the first point they are used, so it is the
logical place to define them, but since this is reference text, it may
not be used in a linear fashion. That is, if a user wants to refresh
their memory on what the modes are ("what are the profiling modes
again?"), it would not be readily apparent from the table of contents
that their definitions are in this section.

> +In 'trace' mode, the start and stop times where events begin and end
> +are stored as profile data. The log will present these events in
> +chronological order. This mode is preferred when a highly detailed
> +view of program execution is desired.
> +
> +In 'accumulate' mode, the start and stop times are subtracted to
> +compute the duration of an event and the cumulative sum of these
> +durations is stored as profile data. The log will indicate the
> +total amount of time spent in each event. This mode is desirable
> +when investigating a specific function's average performance.
> +
> +
> +Specify the path to the log file for profile output using:
> +
> +   --vsipl++-profile-output=/path/to/logfile
> +
> +The second option defaults to the standard output on most
> +systems, so it may be omitted.
> +
> +
> +
> +3) Profiling Functions

These are all objects, so this should be "Profiling Objects".

> +-------------------------------------------------------------------------

It would be good to clarify that manually creating a Profile object is
an alternative to controlling profiling from the command line. Maybe end
the previous section with a transition paragraph:

"The profiling command line options control profiling for the entire
program execution. For finer-grained control, such as enabling profiling
during a specific portion of the program, or to mix different profiling
modes, explicit Profiling objects can be created."

Also, I would mention the arguments (object creation) before mentioning
what happens when the object is destroyed:

"The 'Profile' object is used to enable profiling during the lifetime of
the object. When created, it takes arguments to indicate the output file
and the profiling mode (trace or accumulate). When destroyed (i.e. goes
out of scope or is explicitly deleted), the profile data is written to
the specified output file. For example:"

> +The 'Profile' object is created to gather timing data for the
> +duration of its existence. When it is destroyed (i.e. goes
> +out of scope or is explicitly deleted) the profile data is written
> +to the specified output file. The first parameter specifies the
> +logfile and the second, the profiling mode. For example:
> +
> +  impl::profile::Profile profile("profile.txt", impl::profile::accum);

Let's not overwrite this file with profiling output!

It would be good to clarify or hint that a user only needs to create
Scope_event objects for user-defined events. The library already defines
a host of Scope_events for internal events.

> +
> +The 'Scope_event' object is used to insert a profiler event
> +into the log.

"'Scope_event' is only necessary in user code for user-defined events."
> This object should be created at the point where
> +you wish to begin timing and destroyed when the event is over
> +(such as a computation). For example:
> +
> +  impl::profile::Scope_event event("User Event", op_count);
                                      ^^^^^^^^^^^^
"Event Tag" would tie this to the use of 'tag' in the log file
description.

> +
> +The first parameter is the tag that will be used to display the
> +event's performance data in the log file. The second parameter is
                                            ^
"(Section 5 "Event Tags" describes the event tags used internally by the
library)"

> +optional. If used, 'op_count' should be an unsigned integer specifying
> +an estimate of the total number of operations (floating point or
> +otherwise) performed. This is used by the profiler to compute
> +the rate of computation. Without it, the profiler will still
> +yield useful timing data.
> +
> +Creating a Scope_event object on the stack is the easiest way
> +to control the region it will profile. For example, from within
> +the body of a function (or as the entire function), use
> +this to define a region of interest:
> +
> +  {
> +    impl::profile::Scope_event event("Main computation:");
> +
> +    // perform main computation
> +    //
> +    ...
> +  }
> +
> +The closing brace causes 'event' to go out of scope, logging
> +the amount of time spent doing the computation.
> +
> +
> +
> +4) Profile Log Files
> +-------------------------------------------------------------------------
> +The profiler outputs a small header at the beginning of each log file.
> +The headers differ slightly for accumulate and trace modes.
4a) Log file header

   # mode: pm_accum
   # timer: x86_64_tsc_time
   # clocks_per_sec: 3591375104

The log file header has separate lines that describe:

 - the profiling mode used,
 - the low-level timer used to measure clock ticks,
 - the number of clock ticks per second.

> +
> +4a) Accumulate mode
> +
> +# mode: pm_accum
> +# timer: x86_64_tsc_time
> +# clocks_per_sec: 3591375104
> +#
> +# tag : total ticks : num calls : op count : mops
> +
> +The respective columns that follow this header are:
> +
> +  tag          A descriptive name of the operation. This is either
> +               a name used internally or specified by the user.
> +
> +  total ticks  The duration of the event in processor ticks.
> +
> +  num calls    The number of times the event occurred.
> +
> +  op count     The number of operations performed per event.
> +
> +  mops         The calculated performance figure in millions
> +               of operations per second.

You could describe how mops is computed:

          num_calls * op_count
          --------------------
                  10^6
   mops = --------------------
              total_ticks
             --------------
             clocks_per_sec

> +
> +
> +4b) Trace mode
> +
> +# mode: pm_trace
> +# timer: x86_64_tsc_time
> +# clocks_per_sec: 3591375104
> +#
> +# index : tag : ticks : open id : op count
> +
> +The respective columns that follow this header are:
> +
> +  index        The entry number, beginning at one.
> +
> +  tag          A descriptive name of the operation. This is either
> +               a name used internally or specified by the user.
> +
> +  ticks        The current reading from the processor clock.
> +
> +  open id      A zero to indicate an event was created.
> +               An event index to indicate the end of an event.

"If zero, indicates the start of an event. If non-zero, this indicates
the end of an event and refers to the index of the corresponding start
of the event."

> +
> +  op count     The number of operations performed per event, or
> +               zero to indicate the end of an event.
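To make the suggested mops computation concrete, here is a minimal C++ sketch. It assumes the formula exactly as proposed in this review (total operations in millions, divided by elapsed seconds); it is not taken from the library source, and the function names are hypothetical. Note that it does not guard against a zero tick count.

```cpp
// Convert a tick count to seconds using the clocks_per_sec value
// reported in the log file header.
inline double ticks_to_seconds(double ticks, double clocks_per_sec)
{
  return ticks / clocks_per_sec;
}

// Millions of operations per second for one accumulate-mode entry.
// 'op_count' is per event, so num_calls * op_count is the total.
inline double mops(double num_calls, double op_count,
                   double total_ticks, double clocks_per_sec)
{
  double const seconds = ticks_to_seconds(total_ticks, clocks_per_sec);
  return (num_calls * op_count / 1e6) / seconds;
}
```

For instance, 10 calls of one million operations each, taking 2e9 ticks on a 1e9 ticks-per-second clock, amounts to 10 million operations in 2 seconds, so mops(10, 1e6, 2e9, 1e9) evaluates to 5.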
> +
> +
> +Note that the timings expressed in 'ticks' may be converted to seconds
> +by dividing by the 'clocks_per_sec' constant in the header.
> +
> +
> +
> +5) Event Tags
> +-------------------------------------------------------------------------
> +Sourcery VSIPL++ uses the following tags for profiling objects/functions
> +within the library. These tags are readable text containing information
> +that varies depending on the event, but generally it tells you:
                                                        ^^^^^^^^^
"but generally it describes:"

> +
> +  * The object/function name
> +  * The number of dimensions
> +  * Information about the data types involved
> +  * The size of each dimension
> +
> +In all cases, data types (, and below) are expressed using
> +the BLAS/LAPACK convention of
> +
> +  S - float
> +  C - complex<float>
> +  D - double
> +  Z - complex<double>
> +
> +Expressions on views (vectors, matrices) are shown using prefix
> +notation, i.e.
> +
> +  operator(operand, ...)
> +
> +Each operand may be the result of another computation, so expressions
> +are nested, the parentheses determining the order of evaluation.
> +When the operand types are views, the usual S/D/C/Z are used to
> +indicate the type. When operands are scalars, lower-case values
> +are used instead (s/d/c/z).
> +
> +
> +Current Tag List:
> +
> +  --signal--
> +  Convolution [1D|2D] x
> +  Correlation [1D|2D] x
> +  Fft 1D [Inv|Fwd] - [by_ref|by_val] x1

What about 2D and 3D Ffts? Perhaps this should be:

   Fft [1D|2D|3D] [Inv|Fwd] - [by_ref|by_val]

> +  Fftm 2D [Inv|Fwd] - [by_ref|by_val] x

All Fftm's are 2D. However, they can be either row-wise or column-wise.
Perhaps this could be:

   Fftm [row|col] [Inv|Fwd] - [by_ref|by_val] x

> +  Fir
> +  Iir
> +
> +  --matvec--
> +  dot x1
> +  cvjdot x1
> +  trans x
> +  herm x
> +  kron x x
> +  outer x1 x1
> +  gemp x x
> +  gems x
> +  cumsum x
> +  modulate x1
> +
> +  --fns--

"--Element-wise expressions--" would be more descriptive to the user.
Also (although some of this is redundant with above):

"For element-wise expressions, event tags have the following format:

   EVALUATOR DIM EXPR SIZE

The EVALUATOR indicates which VSIPL++ evaluator was dispatched to compute
the expression. DIM indicates the dimensionality of the expression. EXPR
is a mnemonic of the expression. SIZE is ..."

Also, a brief description of the evaluators would be useful:

"The following evaluators are provided (dispatch to vendor math
libraries, such as SAL and IPP, is implemented with multiple evaluators
that share the same prefix):

 - Expr_Loop - generic loop-fusion evaluator.
 - Expr_SIMD_Loop - SIMD loop-fusion evaluator.
 - Expr_Copy - optimized data-copy evaluator.
 - Expr_Trans - optimized matrix transpose evaluator.
 - Expr_Dense - evaluator for dense, multi-dimensional expressions.
   Converts them into corresponding 1-dim expressions that are
   re-dispatched.
 - Expr_SAL_* - evaluators for dispatch to the SAL vendor math library.
 - Expr_IPP_* - evaluators for dispatch to the IPP vendor math library.
 - Expr_SIMD_* - evaluators for dispatch to the builtin SIMD routines
   (with the exception of Expr_SIMD_Loop, see above)."

A complete listing of the evaluators is useful, but I would be OK with
leaving it out in favor of a condensed list (Expr_SAL_* instead of a
complete listing of the SAL evaluators). The complete list is going to
fluctuate and it's going to have extraneous detail that we won't
document here (for example, this isn't the place to describe the
difference between Expr_IPP_SV- and Expr_IPP_SV_FO- and between
Expr_SAL_VVV and Expr_SAL_fVVV). The condensed list should be enough for
the user to determine if their function was dispatched to a math library
(i.e. Expr_IPP_*), handled internally in an optimized fashion, or just
handled with loop fusion.
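As a rough illustration of how a user might apply the condensed list when reading a log, the sketch below classifies an event tag by its evaluator name. It assumes the proposed "EVALUATOR DIM EXPR SIZE" layout, with the evaluator as the first space-delimited field. The category names and functions are hypothetical and not part of the library; evaluators absent from the condensed list (such as Expr_Loop_Vmmul) are reported as unrecognized rather than guessed at.

```cpp
#include <string>

// Hypothetical categories for illustration; not part of the library.
enum dispatch_kind
{
  vendor_library,  // Expr_SAL_* or Expr_IPP_*
  builtin_simd,    // Expr_SIMD_* other than Expr_SIMD_Loop
  loop_fusion,     // Expr_Loop or Expr_SIMD_Loop
  other_internal,  // Expr_Copy, Expr_Trans, Expr_Dense
  unrecognized     // anything not in the condensed list
};

inline bool has_prefix(std::string const& s, std::string const& prefix)
{
  return s.compare(0, prefix.size(), prefix) == 0;
}

inline dispatch_kind classify_tag(std::string const& tag)
{
  // The evaluator name is assumed to be the first space-delimited field.
  std::string const name = tag.substr(0, tag.find(' '));

  if (has_prefix(name, "Expr_SAL_") || has_prefix(name, "Expr_IPP_"))
    return vendor_library;
  // Exact matches first, so Expr_SIMD_Loop is not taken for a
  // builtin SIMD routine.
  if (name == "Expr_Loop" || name == "Expr_SIMD_Loop")
    return loop_fusion;
  if (has_prefix(name, "Expr_SIMD_"))
    return builtin_simd;
  if (name == "Expr_Copy" || name == "Expr_Trans" || name == "Expr_Dense")
    return other_internal;
  return unrecognized;
}
```

A script that post-processes log files could use the same prefix logic to flag expressions that fell back to loop fusion, which is how the 'rscvmul' defect earlier in this thread showed up.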
> + Expr_Loop [1D|2D|3D] > + Expr_Copy " " (all have dim/expr/size) > + Expr_Trans > + Expr_Dense > + Expr_SAL_COPY > + Expr_SAL_V > + Expr_SAL_VV > + Expr_SAL_VVV > + Expr_SAL_fVVV > + Expr_SAL_VV_V > + Expr_SAL_V_VV > + Expr_SAL_fVV_V > + Expr_Loop_Vmmul > + Expr_IPP_V- > + Expr_IPP_VV- > + Expr_IPP_SV- > + Expr_IPP_SV_FO- > + Expr_IPP_VS- > + Expr_IPP_VS_AS_SV- > + Expr_SIMD_V- > + Expr_SIMD_VV- > + Expr_SIMD_Loop > + -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From stefan at codesourcery.com Thu Sep 7 04:19:34 2006 From: stefan at codesourcery.com (Stefan Seefeld) Date: Thu, 07 Sep 2006 00:19:34 -0400 Subject: patch: Disable exceptions when compiler doesn't support them. Message-ID: <44FF9DD6.9090707@codesourcery.com> The attached patch allows the library to detect when the icl doesn't support exceptions (i.e. when -GX or equivalent is not used) and makes it use vsip::impl::fatal_exception() instead. Additionally, the latter now also reports the call-site. I have tested with 'icl -GX' as well as with 'icl' and got the desired results. OK to check in ? Thanks, Stefan -- Stefan Seefeld CodeSourcery stefan at codesourcery.com (650) 331-3385 x718 -------------- next part -------------- A non-text attachment was scrubbed... Name: no-exception.patch Type: text/x-patch Size: 2674 bytes Desc: not available URL: From mark at codesourcery.com Thu Sep 7 04:28:49 2006 From: mark at codesourcery.com (Mark Mitchell) Date: Wed, 06 Sep 2006 21:28:49 -0700 Subject: [vsipl++] patch: Disable exceptions when compiler doesn't support them. 
In-Reply-To: <44FF9DD6.9090707@codesourcery.com> References: <44FF9DD6.9090707@codesourcery.com> Message-ID: <44FFA001.6040108@codesourcery.com> Stefan Seefeld wrote: > +// If the Intel compiler on windows is used without exception handling (-GX) > +# if defined(__ICL) && __EXCEPTIONS != 1 Picking nits: it's usually best to say "&& !__EXCEPTIONS" for things like this, since they might set __EXCEPTIONS to 2 in the future to indicate that they have a superset of what we currently think of as exceptions. -- Mark Mitchell CodeSourcery mark at codesourcery.com (650) 331-3385 x713 From jules at codesourcery.com Thu Sep 7 11:04:37 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Thu, 07 Sep 2006 07:04:37 -0400 Subject: [vsipl++] patch: Disable exceptions when compiler doesn't support them. In-Reply-To: <44FF9DD6.9090707@codesourcery.com> References: <44FF9DD6.9090707@codesourcery.com> Message-ID: <44FFFCC5.6060705@codesourcery.com> Stefan Seefeld wrote: > > OK to check in ? > Stefan, yes this looks good (with Mark's suggestion). thanks, -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From joseph_sacco at comcast.net Fri Sep 8 16:45:36 2006 From: joseph_sacco at comcast.net (Joseph E. Sacco, Ph.D.) Date: Fri, 08 Sep 2006 12:45:36 -0400 Subject: configure fails to recognize mpich Message-ID: <1157733936.25286.22.camel@plantain.jesacco.com> System: * G4-based PPC running YDL-4.1 [FC4 clone] * gcc-4.0.2 * mpich-1.2.7p1 [installed in /opt/mpich] * mpich2-1.0.4pl [installed in /opt/mpich2] ========================================================================== Problem: configure test for mpich fails. Discussion ---------- Running ./configure --prefix=/opt/vsipl++ --with-mpi-prefix=/opt/mpich fails: ... checking for mpi.h... yes checking whether MPICH_NAME is declared... yes checking for MPI build instructions... configure: error: Unable to compile / link test MPI application. 
The same result is obtained with mpich2 ./configure --prefix=/opt/vsipl++ --with-mpi-prefix=/opt/mpich The test for mpi within configure appears rather innocuous: #include VSIP_IMPL_MPI_H <<=== #include int main () { MPI_Init(0, 0); ; return 0; } and does compile /link when run outside of configure. Thoughts??? -Joseph -- joseph_sacco [at] comcast [dot] net From jules at codesourcery.com Fri Sep 8 17:24:26 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Fri, 08 Sep 2006 13:24:26 -0400 Subject: [vsipl++] configure fails to recognize mpich In-Reply-To: <1157733936.25286.22.camel@plantain.jesacco.com> References: <1157733936.25286.22.camel@plantain.jesacco.com> Message-ID: <4501A74A.9060700@codesourcery.com> Joseph, We've tested with MPICH in the past, but unfortunately much of our recent work has been with LAM/MPI. We would like to fix this though. Would you mind sending the config.log file? thanks, -- Jules Joseph E. Sacco, Ph.D. wrote: > > Problem: > > configure test for mpich fails. > > -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From stefan at codesourcery.com Fri Sep 8 20:59:41 2006 From: stefan at codesourcery.com (Stefan Seefeld) Date: Fri, 08 Sep 2006 16:59:41 -0400 Subject: [vsipl++] configure fails to recognize mpich In-Reply-To: <1157733936.25286.22.camel@plantain.jesacco.com> References: <1157733936.25286.22.camel@plantain.jesacco.com> Message-ID: <4501D9BD.3040004@codesourcery.com> Joseph, I believe I have found the cause of the error. Our configuration script assumes that running 'mpicxx -show -c' will generate a command string in which the last token is '-c', which we then filter out using sed. However, it appears in your case mpicxx generates a command string where the '-c' option is in between other options, and so our attempt to filter it out fails. The attached patch makes sed filter out the '-c' option no matter where in the command it appears. Please confirm that this fixes the error for you. 
Thanks, Stefan -- Stefan Seefeld CodeSourcery stefan at codesourcery.com (650) 331-3385 x718 -------------- next part -------------- A non-text attachment was scrubbed... Name: configure.diff Type: text/x-patch Size: 604 bytes Desc: not available URL: From joseph_sacco at comcast.net Sat Sep 9 17:26:30 2006 From: joseph_sacco at comcast.net (Joseph E. Sacco, Ph.D.) Date: Sat, 09 Sep 2006 13:26:30 -0400 Subject: [vsipl++] configure fails to recognize mpich In-Reply-To: <4501D9BD.3040004@codesourcery.com> References: <1157733936.25286.22.camel@plantain.jesacco.com> <4501D9BD.3040004@codesourcery.com> Message-ID: <1157822790.2513.10.camel@plantain.jesacco.com> Stefan, I can confirm that your patch works: [patch applied to configure] ... with mpi enabled: yes With parallel service: mpich ... There are many ways to resolve this issue. mpich1 supports command line arguments: $ mpicxx -compile-info g++ -DUSE_STDARG -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STDARG_H=1 -DUSE_STDARG=1 -DMALLOC_RET_VOID=1 -DHAVE_MPI_CPP -I/opt/mpich/include/mpi2c++ -fexceptions -c -I/opt/mpich/include $ mpicxx -link-info g++ -DUSE_STDARG -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STDARG_H=1 -DUSE_STDARG=1 -DMALLOC_RET_VOID=1 -L/opt/mpich/lib -lpmpich++ -lmpich -lpthread -lrt Maybe it would be cleaner to use these directly rather than using "-show". Be well, -Joseph ========================================================================= On Fri, 2006-09-08 at 16:59 -0400, Stefan Seefeld wrote: > Joseph, > > I believe I have found the cause of the error. Our configuration script > assumes that running 'mpicxx -show -c' will generate a command string in which > the last token is '-c', which we then filter out using sed. > > However, it appears in your case mpicxx generates a command string where > the '-c' option is in between other options, and so our attempt to filter > it out fails. 
> > The attached patch makes sed filter out the '-c' option no matter where in the > command it appears. Please confirm that this fixes the error for you. > > Thanks, > Stefan > -- joseph_sacco [at] comcast [dot] net From stefan at codesourcery.com Sat Sep 9 18:04:09 2006 From: stefan at codesourcery.com (Stefan Seefeld) Date: Sat, 09 Sep 2006 14:04:09 -0400 Subject: [vsipl++] configure fails to recognize mpich In-Reply-To: <1157822790.2513.10.camel@plantain.jesacco.com> References: <1157733936.25286.22.camel@plantain.jesacco.com> <4501D9BD.3040004@codesourcery.com> <1157822790.2513.10.camel@plantain.jesacco.com> Message-ID: <45030219.9030604@codesourcery.com> Joseph E. Sacco, Ph.D. wrote: > Stefan, > > I can confirm that your patch works: > > [patch applied to configure] > > ... > with mpi enabled: yes > With parallel service: mpich > ... Excellent ! > There are many ways to resolve this issue. mpich1 supports command line > arguments: > > $ mpicxx -compile-info > g++ -DUSE_STDARG -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1 > -DHAVE_STDARG_H=1 -DUSE_STDARG=1 -DMALLOC_RET_VOID=1 -DHAVE_MPI_CPP > -I/opt/mpich/include/mpi2c++ -fexceptions -c -I/opt/mpich/include > > $ mpicxx -link-info > g++ -DUSE_STDARG -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1 > -DHAVE_STDARG_H=1 -DUSE_STDARG=1 -DMALLOC_RET_VOID=1 -L/opt/mpich/lib > -lpmpich++ -lmpich -lpthread -lrt > > > Maybe it would be cleaner to use these directly rather than using > "-show". I agree. However, as we have to deal with different versions of that applet, we are aiming for a mechanism that is supported by all of them. The '-show' / '-showme' option seems to be the least common denominator. 
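The filtering discussed in this thread (dropping the '-c' option wherever it appears in the wrapper's output) can be sketched as follows. The actual fix is a sed expression in the configure script; this C++ function is only an illustration of the same idea, and its name is hypothetical.

```cpp
#include <sstream>
#include <string>

// Remove every standalone "-c" token from a compiler command line,
// wherever it appears. Tokens that merely begin with "-c" are kept.
// Runs of whitespace are normalized to single spaces.
inline std::string drop_compile_only_flag(std::string const& command)
{
  std::istringstream in(command);
  std::string token;
  std::string result;
  while (in >> token)
  {
    if (token == "-c")
      continue;              // skip the compile-only flag
    if (!result.empty())
      result += ' ';
    result += token;
  }
  return result;
}
```

Matching whole tokens rather than a trailing substring is what makes the position of '-c' irrelevant, which is the case the original trailing-token assumption missed with mpich.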
Thanks, Stefan -- Stefan Seefeld CodeSourcery stefan at codesourcery.com (650) 331-3385 x718 From don at codesourcery.com Sun Sep 10 01:13:32 2006 From: don at codesourcery.com (Don McCoy) Date: Sat, 09 Sep 2006 19:13:32 -0600 Subject: [patch] Fixes for building benchmarks with IPP (and MPI) Message-ID: <450366BC.70904@codesourcery.com> This patch does the following: * adds missing includes for several of the IPP benchmarks * removes two dependencies on headers in tests/ for the benchmarks (by moving them to vsip_csl/) * changes the standalone benchmarks makefile to exclude MPI-specific benchmarks by default * corrects a missing definition needed in the benchmark's makefile for detecting whether or not MPI is used Regards, -- Don McCoy don (at) CodeSourcery (888) 776-0262 / (650) 331-3385, x712 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ib.changes URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ib.diff URL: From don at codesourcery.com Sun Sep 10 21:24:25 2006 From: don at codesourcery.com (Don McCoy) Date: Sun, 10 Sep 2006 15:24:25 -0600 Subject: [patch] CFAR benchmark storage order Message-ID: <45048289.4000700@codesourcery.com> This change reverts the storage order of the tensor back to 'tuple<0, 1, 2>' for the Vector and Hybrid methods. The Slice method explicitly uses 'tuple<2, 1, 0>' in order to get the best performance. This was tested in the 'builtin' configuration on both 32-bit and 64-bit platforms, using GCC 4.1 and 3.4 respectively. Regards, -- Don McCoy don (at) CodeSourcery (888) 776-0262 / (650) 331-3385, x712 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: cb.changes URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: cb.diff URL: From stefan at codesourcery.com Mon Sep 11 14:28:22 2006 From: stefan at codesourcery.com (Stefan Seefeld) Date: Mon, 11 Sep 2006 10:28:22 -0400 Subject: patch: Some more adjustments for intel-win Message-ID: <45057286.3090408@codesourcery.com> The attached patch contains some more (very minor) adjustments for compiling with intel-win, as well as a fix for a bug in our MPI detection, as reported by Joseph E. Sacco. Thanks, Stefan -- Stefan Seefeld CodeSourcery stefan at codesourcery.com (650) 331-3385 x718 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: patch URL: From jules at codesourcery.com Mon Sep 11 15:04:09 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Mon, 11 Sep 2006 11:04:09 -0400 Subject: [patch] Tutorial updates Message-ID: <45057AE9.3040906@codesourcery.com> This patch makes some of the tutorial updates we have discussed: - Focuses tutorial on fast convolution by splitting the parallel fast convolution chapter into separate chapters for serial and parallel. - Makes the tutorial a user's guide with two parts: tutorial (Part I) and reference (Part II). Also attached is a PDF. -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: doc.diff URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: tutorial.pdf Type: application/pdf Size: 173784 bytes Desc: not available URL: From stefan at codesourcery.com Tue Sep 12 03:40:36 2006 From: stefan at codesourcery.com (Stefan Seefeld) Date: Mon, 11 Sep 2006 23:40:36 -0400 Subject: [vsipl++] patch: Some more adjustments for intel-win In-Reply-To: <45057286.3090408@codesourcery.com> References: <45057286.3090408@codesourcery.com> Message-ID: <45062C34.3050000@codesourcery.com> Here is an enhanced and extended version of the previous patch. 
New additions include a new vsip/impl/inttypes.hpp header providing fixed-size integer types such as int8_type, which makes the vsip_csl::matlab code work even on windows (where there is neither nor ), and a fix to a bug related to the handling of Rt_ext_data. OK to check in ? Thanks, Stefan -- Stefan Seefeld CodeSourcery stefan at codesourcery.com (650) 331-3385 x718 -------------- next part -------------- A non-text attachment was scrubbed... Name: intel-win.patch Type: text/x-patch Size: 28618 bytes Desc: not available URL: From mark at codesourcery.com Tue Sep 12 03:50:12 2006 From: mark at codesourcery.com (Mark Mitchell) Date: Mon, 11 Sep 2006 20:50:12 -0700 Subject: [vsipl++] patch: Some more adjustments for intel-win In-Reply-To: <45062C34.3050000@codesourcery.com> References: <45057286.3090408@codesourcery.com> <45062C34.3050000@codesourcery.com> Message-ID: <45062E74.7090201@codesourcery.com> Stefan Seefeld wrote: > +# if SIZEOF_CHAR == 1 > + typedef signed char int8_type; > + typedef unsigned char uint8_type; > +# else > +# error "No 8-bit integer type" > +# endif > + > +# if SIZEOF_SHORT == 2 > + typedef short int16_type; > + typedef unsigned short uint16_type; Just for the record: 1. sizeof (char) is required to be 1 in C++. 2. However, char is not required to be an 8-bit type. So, in theory, these checks (which you added at my suggestion) are not fully robust. For example, on a machine for which char is a 32-bit type, the above code will not work as intended. However, I would not worry about this -- not even a little bit. There are a very few such machines, and none in mainstream use, and if we find one, we can always fix this at that point. (One relatively portable way might be to use UCHAR_{MIN,MAX} to tell us how many bits are actually in a char.) 
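A minimal sketch of the more robust check hinted at above, assuming only the standard `<climits>` header. Standard C and C++ provide `CHAR_BIT` (the number of bits in a char) and `UCHAR_MAX`; there is no `UCHAR_MIN`. The typedef names and the helper `uchar_bits` are hypothetical and are not taken from vsip/impl/inttypes.hpp:

```cpp
#include <cassert>
#include <climits>

// sizeof(char) is always 1, so it cannot tell us how wide a char is.
// CHAR_BIT gives the width in bits directly (the standard requires >= 8).
#if CHAR_BIT == 8
typedef signed char   int8_type_sketch;    // hypothetical name
typedef unsigned char uint8_type_sketch;   // hypothetical name
#else
#  error "No 8-bit integer type"
#endif

// Alternative: derive the number of bits from UCHAR_MAX by counting
// how many times the maximum value can be shifted right.
inline int uchar_bits()
{
  int bits = 0;
  for (unsigned int max = UCHAR_MAX; max != 0; max >>= 1)
    ++bits;
  return bits;
}
```

Because unsigned char has no padding bits, `uchar_bits()` and `CHAR_BIT` agree on any conforming implementation, so either can drive the preprocessor check.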
-- Mark Mitchell CodeSourcery mark at codesourcery.com (650) 331-3385 x713 From jules at codesourcery.com Tue Sep 12 12:41:23 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Tue, 12 Sep 2006 08:41:23 -0400 Subject: [vsipl++] patch: Some more adjustments for intel-win In-Reply-To: <45062C34.3050000@codesourcery.com> References: <45057286.3090408@codesourcery.com> <45062C34.3050000@codesourcery.com> Message-ID: <4506AAF3.7060304@codesourcery.com> Stefan Seefeld wrote: > Here is an enhanced and extended version of the previous patch. > New additions include a new vsip/impl/inttypes.hpp header > providing fixed-size integer types such as int8_type, which > makes the vsip_csl::matlab code work even on windows (where > there is neither nor ), > and a fix to a bug related to the handling of Rt_ext_data. > > OK to check in ? > > Thanks, > Stefan Stefan, this looks good, please commit. thanks, -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From jules at codesourcery.com Tue Sep 12 14:12:29 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Tue, 12 Sep 2006 10:12:29 -0400 Subject: [patch] Fix SIMD loop fusion to handle re-dimensioned expressions Message-ID: <4506C04D.9070207@codesourcery.com> This patch uses Adjust_layout_dim so that SIMD loop fusion Ext_data access works with re-dimensioned expressions (i.e. those generated by Eval_dense_expr). It also adds a regression test for the case, and extends coverage_binary as well. This was causing the fft test to fail for the builtin binary packages. Patch applied. -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: fix.diff URL: From jules at codesourcery.com Tue Sep 12 16:10:17 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Tue, 12 Sep 2006 12:10:17 -0400 Subject: [patch] Changes for merged packages Message-ID: <4506DBE9.1040307@codesourcery.com> This patch makes the changes necessary to build merged packages. Major changes: - configure.ac: Move macros for parallel services, FFT, and ATLAS from acconfig.hpp to command line, so that different library variants in merged package will have *similar* acconfig.hpp. I say similar because there are some macros that configure places in acconfig.hpp that are only included in some variants (SIZEOF_DOUBLE, SIZEOF_LONG_DOUBLE) and different between variants (SIZEOF_LONG_DOUBLE). Since we only use these values during configure, and not in the library, the differences don't affect the merged package. However, to be safe, I've undefined those in config.hpp. - package.py and scripts/config: Change to build merged packages. Primarily use --libdir to distinguish between variants instead of suffixes (although suffixes are still used to save away acconfig.hpp and results.qmr files for later inspection). This patch also includes: - adds -lvsip_csl to context.in and vsipl++.pc.in so that tests using vsip_csl pass. - adds some verbose macros to fft.cpp to make failures easier to debug. Ok to commit? -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: mondo.diff URL: From jules at codesourcery.com Tue Sep 12 17:26:21 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Tue, 12 Sep 2006 13:26:21 -0400 Subject: [vsipl++] [patch] CFAR benchmark storage order In-Reply-To: <45048289.4000700@codesourcery.com> References: <45048289.4000700@codesourcery.com> Message-ID: <4506EDBD.7020602@codesourcery.com> Don McCoy wrote: > This change reverts the storage order of the tensor back to 'tuple<0, 1, > 2>' for the Vector and Hybrid methods. The Slice method explicitly uses > 'tuple<2, 1, 0>' in order to get the best performance. > > This was tested in the 'builtin' configuration on both 32-bit and 64-bit > platforms, using GCC 4.1 and 3.4 respectively. Don, this looks good, please commit. thanks, -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From jules at codesourcery.com Tue Sep 12 17:38:47 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Tue, 12 Sep 2006 13:38:47 -0400 Subject: [vsipl++] [patch] Fixes for building benchmarks with IPP (and MPI) In-Reply-To: <450366BC.70904@codesourcery.com> References: <450366BC.70904@codesourcery.com> Message-ID: <4506F0A7.1040800@codesourcery.com> Don McCoy wrote: > This patch does the following: > > * adds missing includes for several of the IPP benchmarks > * removes two dependencies on headers in tests/ for the benchmarks > (by moving them to vsip_csl/) > * changes the standalone benchmarks makefile to exclude MPI-specific > benchmarks by default > * corrects a missing definition needed in the benchmark's makefile > for detecting whether or not MPI is used Don, This looks good. I have one comment below, please check it in once that is addressed. 
thanks, -- Jules > Index: GNUmakefile.in > =================================================================== > --- GNUmakefile.in (revision 148805) > +++ GNUmakefile.in (working copy) > @@ -116,8 +116,8 @@ > VSIP_IMPL_SAL_FFT := @VSIP_IMPL_SAL_FFT@ > VSIP_IMPL_IPP_FFT := @VSIP_IMPL_IPP_FFT@ > VSIP_IMPL_FFTW3 := @VSIP_IMPL_FFTW3@ > +VSIP_IMPL_MPI_H := @VSIP_IMPL_MPI_H@ Since VSIP_IMPL_MPI_H is used here as a boolean (1 if MPI is present), and is used elsewhere as the name of the MPI header file, can you call it something else to avoid confusion -- for example VSIP_IMPL_MPI or VSIP_IMPL_HAVE_MPI? > Index: configure.ac > =================================================================== > --- configure.ac (revision 148805) > +++ configure.ac (working copy) > @@ -858,7 +858,8 @@ > vsipl_par_service=0 > CPPFLAGS="$save_CPPFLAGS" > fi > - else > + else > + AC_SUBST(VSIP_IMPL_MPI_H, 1) > AC_DEFINE_UNQUOTED([VSIP_IMPL_MPI_H], $vsipl_mpi_h_name, > [The name of the header to include for the MPI interface, with <> quotes.]) -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From stefan at codesourcery.com Tue Sep 12 19:39:22 2006 From: stefan at codesourcery.com (Stefan Seefeld) Date: Tue, 12 Sep 2006 15:39:22 -0400 Subject: patch: Fix issues with hypotf. Message-ID: <45070CEA.3070300@codesourcery.com> The attached patch properly forward-declares hypotf as extern "C", and falls back to ::hypot(double, double) if not HAVE_HYPOTF. The patch is checked in. Thanks, Stefan -- Stefan Seefeld CodeSourcery stefan at codesourcery.com (650) 331-3385 x718 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: patch URL: From don at codesourcery.com Tue Sep 12 19:49:46 2006 From: don at codesourcery.com (Don McCoy) Date: Tue, 12 Sep 2006 13:49:46 -0600 Subject: [vsipl++] [patch] Fixes for building benchmarks with IPP (and MPI) In-Reply-To: <4506F0A7.1040800@codesourcery.com> References: <450366BC.70904@codesourcery.com> <4506F0A7.1040800@codesourcery.com> Message-ID: <45070F5A.4050303@codesourcery.com> Jules Bergmann wrote: > Since VSIP_IMPL_MPI_H is used here as a boolean (1 if MPI is present), > and is used elsewhere as the name of the MPI header file, can you call > it something else to avoid confusion -- for example VSIP_IMPL_MPI or > VSIP_IMPL_HAVE_MPI? Done. Checked in. -- Don McCoy don (at) CodeSourcery (888) 776-0262 / (650) 331-3385, x712 From don at codesourcery.com Wed Sep 13 00:16:08 2006 From: don at codesourcery.com (Don McCoy) Date: Tue, 12 Sep 2006 18:16:08 -0600 Subject: [vsipl++] [patch] Tutorial updates In-Reply-To: <45057AE9.3040906@codesourcery.com> References: <45057AE9.3040906@codesourcery.com> Message-ID: <45074DC8.6000608@codesourcery.com> Jules Bergmann wrote: > This patch makes some of the tutorial updates we have discussed: > > - Focuses tutorial on fast convolution by splitting the parallel fast > convolution chapter into separate chapters for serial and parallel. > > - Makes the tutorial a user's guide with two parts: tutorial (Part I) > and reference (Part II). > This patch extends these changes further by: - Rewriting the performance chapter about profiling in part I using fast convolution as the example. - Adding a profiling section to the reference part II. Regards, -- Don McCoy don (at) CodeSourcery (888) 776-0262 / (650) 331-3385, x712 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: pt3.changes URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: pt3.diff URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: tutorial.pdf Type: application/pdf Size: 220169 bytes Desc: not available URL: From jules at codesourcery.com Wed Sep 13 02:23:52 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Tue, 12 Sep 2006 22:23:52 -0400 Subject: [patch] Fast path for FFT Message-ID: <45076BB8.60005@codesourcery.com> This patch adds a fast path for 1-dim, CC FFTs with unit-stride data. The fast path uses compile-time Ext_data instead of Rt_ext_data for a marginal performance improvement. To determine whether the backend will work with the fast path (in particular, whether it supports split- or interleaved-complex), it is queried when the workspace is created. -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: fp.diff URL: From don at codesourcery.com Wed Sep 13 03:55:48 2006 From: don at codesourcery.com (Don McCoy) Date: Tue, 12 Sep 2006 21:55:48 -0600 Subject: [vsipl++] [patch] CFAR benchmark storage order In-Reply-To: <4506EDBD.7020602@codesourcery.com> References: <45048289.4000700@codesourcery.com> <4506EDBD.7020602@codesourcery.com> Message-ID: <45078144.1020009@codesourcery.com> Jules Bergmann wrote: > Don McCoy wrote: >> This change reverts the storage order of the tensor back to 'tuple<0, >> 1, 2>' for the Vector and Hybrid methods. The Slice method >> explicitly uses 'tuple<2, 1, 0>' in order to get the best performance. >> > This patch corrects an error with the previous patch (tuple<2,1,0> should have been tuple<2,0,1>). Mea culpa, -- Don McCoy don (at) CodeSourcery (888) 776-0262 / (650) 331-3385, x712 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: cb2.changes URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: cb2.diff URL: From don at codesourcery.com Wed Sep 13 07:54:24 2006 From: don at codesourcery.com (Don McCoy) Date: Wed, 13 Sep 2006 01:54:24 -0600 Subject: [vsipl++] [patch] CFAR benchmark storage order In-Reply-To: <45078144.1020009@codesourcery.com> References: <45048289.4000700@codesourcery.com> <4506EDBD.7020602@codesourcery.com> <45078144.1020009@codesourcery.com> Message-ID: <4507B930.6070403@codesourcery.com> Please disregard the previous version(s) of this patch. The attached version has been checked more thoroughly than before. This time I ran all the sets with varying storage orders for the CFAR data cube, then I compared results at the points specified by the HPEC Challenge (in terms of the number of range gates, RG). This retesting resulted in a change for the "by-vector" algorithm for about a 5% performance improvement. See the table below, produced from data taken from the Xeon cluster at GTRI.

                 Slice
         RG      2-0-1   0-2-1
Set 1    64      293     210
Set 2    3500    186     147
Set 3    1909    187     145
Set 4    9900    202     150

                 Vector          Hybrid
         RG      0-1-2   1-0-2   0-1-2   1-0-2
Set 1    64      96      97      384     415
Set 2    3500    125     133     693     666
Set 3    1909    123     130     692     650
Set 4    9900    124     132     697     670

Regards, -- Don McCoy don (at) CodeSourcery (888) 776-0262 / (650) 331-3385, x712 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: cb3.changes URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: cb3.diff URL: From assem at codesourcery.com Wed Sep 13 10:36:27 2006 From: assem at codesourcery.com (Assem Salama) Date: Wed, 13 Sep 2006 06:36:27 -0400 Subject: Matlab IO docbook Message-ID: <4507DF2B.4040705@codesourcery.com> Everyone, I had sent this patch out a while back but didn't get any replies about it. So, I'm assuming that this might be useful now. It is the docbook section that I wrote for Matlab IO.
Thanks, Assem -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: svn.diff.09132006.1.log URL: From jules at codesourcery.com Wed Sep 13 13:09:24 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Wed, 13 Sep 2006 09:09:24 -0400 Subject: [vsipl++] [patch] Tutorial updates In-Reply-To: <45074DC8.6000608@codesourcery.com> References: <45057AE9.3040906@codesourcery.com> <45074DC8.6000608@codesourcery.com> Message-ID: <45080304.6010704@codesourcery.com> Don, This looks good. I have several suggestions below on the tutorial chapter. Use them as you please :) Once you're happy please check it in. We can continue to incorporate edits as we review the whole document. I haven't had a chance to read the reference chapter yet; I will send comments on that later. thanks, -- Jules > + > + In addition to the accumulate and trace modes, which have pre-defined > + output formats, Sourcery VSIPL++ exposes a profiling API that you can ^^ Performance API > + use to gather data directly on individual objects, such as FFTs. > + If you need finer control of what operations are profiled, or if you > + want to record the profiling data in a custom format, you may wish to > + use this API directly. See for > + more details. > + > + > + Operations Supporting Profiling > > See the file profiling.txt for a detailed > explanation of the profiler output for each of the functions above. > + See the file profiling.txt for a detailed Isn't profiling.txt now Chapter 5? > + explanation of the profiler output for each of the functions above. > + For information about how to configure the library for profiling, > + see the Quickstart also. > > + > + This macro enables profiling operations in several different areas > + of the library, depending on the value of > + mask. To profile all operations, use > + the value 15. See > + for other possible values. 
I would mention the motivation behind why we have a mask: "Since profiling can introduce overhead, especially for element-wise expressions, this macro allows you to choose which operations in the library are profiled. To profile all operations, use the value 15. See ..." > + > + > + > + Profiling support requires that you link with a version of Sourcery > + VSIPL++ that supports profiling. If you have received a binary > + distribution of Sourcery VSIPL++ from CodeSourcery, you probably > + already have an appropriate version of the library. If you are > + building Sourcery VSIPL++ yourself, see the Quickstart guide for > + more information about the requirements for building Sourcery > + VSIPL++ with profiling enabled. We've changed things so that all libraries support profiling, if a timer is provided: "Profiling requires that the library be configured with a high-resolution timer. Binary distributions of Sourcery VSIPL++ from CodeSourcery have this done. If you are building Sourcery VSIPL++ from source, see the Quickstart guide for more information about configuring high-resolution timers." > + > + > +
Setup > + > + The only computation performed in the setup phase is a forward FFT > + that maps the pulse replica into the frequency domain. This > + computation corresponds to the following line of the profiling > + data: > +Fft Fwd C-C by_ref 256 : 142119 : 1 : 10240 : 258.767 > + > + The "Fft Fwd C-C by_ref 256" tag indicates that this computation > + is a 256-element forward FFT with complex, single-precision inputs > + and outputs, returning its result by reference. The notation used > + for data types (e.g., "C-C" in this example) is given in ^^ described > + . > + > + >
> -
Trace Profile Data > +
Convert to frequency domain > > - This mode is used similarly to accumulate mode, except that an > - extra parameter is passed to the creation of the Profile > - object. > - Profile profile("/dev/stdout", pm_trace); > + The next step of the computation is to convert from the time domain > + to the frequency domain. In particular, an FFT is applied to a data > + cube of 64 pulses, each containing 256 range cells: "In particular, an FFT is applied to each pulse of a data cube, which consists of 64 pulses each containing 256 range cells:" > +Fftm Fwd row_type C-C by_ref 64x256 : 1188144 : 1 : 1146880 : 3466.65 > + > + For this FFT, the size is reported differently (rows x columns) > + because this is a two-dimensional FFT. It's not a 2-D FFT, it's a "multiple 1-D FFT": "For this operation, an Fftm object was used to perform multiple FFTs, one on each row of the data cube." > + The operation count (1.1 million) far outweighs that of > + any other step, except the inverse FFT. > + The performance measured was 3.5 GFLOPS/s on a 3.6 GHz Xeon. > + Since the theoretical peak performance on such > + a machine is about 14.4 GFLOP/s, the program has achieved an > + a very good 24% of peak. > + Other example programs measure in-cache FFT perfomance on vectors > + of the same size at 4.9 GFLOP/s. Therefore, considering that the > + 3.5 GFLOP/s includes cache overheads, the result is still good. I would move the first sentence to a new paragraph following, to give it some more context: "Since the operation count (1.1 million) of the FFT (and inverse FFT) outweighs the rest of the computation, the overall performance will be very close to the FFT performance." > + > + 
> +
Convolution > + > + The actual convolution consists of a vector-matrix multiplication. > + The corresponding profiling output is: > +Expr_Loop_Vmmul 2D vmmul(C,C) 64x256 : 1539531 : 1 : 98304 : 229.321 > + > + Sourcery VSIPL++ chose to evaluate this expression by performing a > + row-wise vector-vector multiplication on each of the rows of the > + matrix. Therefore, there is a second line: > +Expr_SIMD_VV-simd::vmul 1D *(C,C) 256 : 316674 : 64 : 1536 : 1114.86 > + > + The tag used for this expression is "*(C,C)". The profiling tag for > + many operations is shown using a prefix notation; the operation > + performed is followed by the types of the arguments. The "simd" tag > + indicates that VSIPL++ used the Single Instruction Multiple Data (SIMD) > + facilities on the Xeon architecture for maximum performance. > + > + > + The tick count for the vector-matrix multiplication (vmmul) includes > + the time spent in the multiple row-wise scalar-vector multiplications. > + Therefore the total number of time used by the program is *not* the > + sum of the tick counts given for each line. We should mention why vmmul performance is lower than that of the constituent scalar-vector multiplies: "You should notice the performance difference between the vmmul event and the individual scalar-vector multiplications. Some of this is due to the extra work vmmul does to set up each individual multiplication: loop overhead and subview creation. However, most of this is due to the overhead of profiling: the cost of accessing timers and the cost of maintaining profile data structures. In general, profiling overhead only slows the program execution but does not affect the measurements taken. However, when an operation being profiled (such as vmmul) consists of many invocations of other profiled operations (such as scalar-vector multiplication), measurements may be affected. 
When profiling is disabled, the performance of vmmul will be very close to the performance measured for the individual scalar-vector multiplications." > + > +
> +
Convert back to time domain > + > + The last step of the algorithm is to convert back to the time domain > + by using an inverse FFT. An inverse FFT is computationally > + equivalent to a forward FFT, except that an additional multiplication > + is performed to handle scaling. The lines corresponding to the > + inverse FFT are: When the scaling is applied is a choice left up to the user, so instead of saying "An inverse FFT is computationally equivalent to a forward FFT, except ...", which implies this is true of all FFTs, you might say "The inverse FFT is computationally equivalent to the forward FFT, except ...", which implies this is true for the FFTs in the example. > +Expr_Dense 2D *(C,s) 64x256 : 687285 : 1 : 32768 : 171.228 > +Expr_Loop 1D *(C,s) 16384 : 653265 : 1 : 32768 : 180.145 > +Fftm Inv row_type C-C by_ref 64x256 : 1559304 : 1 : 1146880 : 2641.48 > + > + The first line describes a evaluation of a "dense" two- > + dimensional multiplication between a single-precision complex > + view (a matrix) and a single-precision scalar. Note that > + scalars are represented using lower-case equivalents for > + the data types in the table above. > + > + > + A "dense" matrix is one in which the values are packed > + tightly in memory with no intervening space between the rows > + or columns. Therefore, the two-dimensional multiplication can > + be thought of as a 1-dimensional multiplication of a long vector. > + The evaluation of the 2-D operation includes the time required for > + the 1-D operation, together with a small amount of overhead. > + You can tell that this is the case as the time shown on the > + first line is slightly greater than the time shown on the second. > + Both show the same number of operations because they are > + referring to the same calculation. > + > + > + Similarly, the time required for the inverse FFT includes both the > + time spent actually computing the FFT as well as the time required > + for the scaling multiplication. 
Because the multiplication is not > + included in the theoretical operation count, the MOP/s count shown > + is somewhat smaller than than for the forward FFT. I believe the theoretical operation count is intended to include this scaling cost, but it requires extra effort on the part of the implementation: "For FFTs, Sourcery VSIPL++ uses the commonly accepted theoretical operation count of 5 N log2(N). This includes the cost of scaling, which may be folded in with the final twiddle factors. However, as this example illustrates, not all FFT backends have this capability; as a result, scaled FFTs often have a MOP/s rate lower than non-scaled FFTs." > + > + 
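The arithmetic behind the profile lines quoted in this review can be sketched as follows. This is an illustration, not library code: the helper names are hypothetical, the columns are read as "tag : ticks : calls : operations per call : MOP/s" (an assumption that matches the numbers shown), and the timer rate of 3.6e9 ticks per second is assumed from the "3.6 GHz Xeon" mentioned in the tutorial text rather than taken from a log file header.

```cpp
#include <cassert>
#include <cmath>

// Commonly used operation-count estimate for a complex FFT of length n:
// 5 * n * log2(n).
inline double fft_ops(double n)
{
  return 5.0 * n * std::log(n) / std::log(2.0);
}

// Convert a tick count to seconds using the timer rate
// (the 'clocks_per_second' value from the log file header).
inline double ticks_to_seconds(double ticks, double clocks_per_second)
{
  return ticks / clocks_per_second;
}

// Millions of operations per second for `calls` invocations that
// together took `ticks` timer ticks.
inline double mops(double ops_per_call, double calls,
                   double ticks, double clocks_per_second)
{
  double const seconds = ticks_to_seconds(ticks, clocks_per_second);
  return ops_per_call * calls / seconds / 1e6;
}
```

For the setup FFT, `fft_ops(256)` gives 10240, which is the operation count on the "Fft Fwd C-C by_ref 256" line, and `mops(10240, 1, 142119, 3.6e9)` comes out within about one MOP/s of the reported 258.767. The small difference is consistent with the real timer rate being slightly below the nominal 3.6 GHz.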
> +
> + > + The analysis presented in this section is only a portion of what > + one would do to verify an algorithm is performing as desired. > + Core routines utilizing techniques such as the fast convolution > + method comprise only a portion of larger programs whose > + performance is also of interest. > + The profiling capabilities utilized here can be extended to cover > + those areas of the application as well. > + See for more details. > + > + > +
Trace Profile Data > + Flow suggestion: describe what trace mode is, then give details on how to enable it: "In trace mode, the profiler records each library call as a pair of events, allowing you to see where each call was made and when it returned. This provides two time stamps per call, showing not only which functions were executed, but how they were nested with respect to one another. This mode is useful for investigating the execution sequence of your program. To enable trace mode, construct the 'Profile' object with a 'pm_trace' flag, as in this line: Profile profile("profile.txt", pm_trace); Long traces can result when profiling in this mode, so be sure to avoid gathering more data than you have memory to store (and have time to process later). The output is very similar to the output in accumulate mode." > + By passing an additional parameter to the 'Profile' constructor, > + you can switch from "accumulate" mode to "trace" mode. This line: > +Profile profile("profile.txt", pm_trace); > + will cause Sourcery VSIPL++ to enter trace profiling mode. > + All computations performed by your program while > + profile is in scope will be traced. > This mode is useful for investigating the execution sequence > of your program. > - The profiler simply records each library call as a pair of events, > - allowing you to see where it entered and exited scope in each case. > + The profiler records each library call as a pair of events, > + allowing you to see where each call was made and when it returned. > + This provides two time stamps per call, showing not only which > + functions were executed, but how they were nested with respect > + to one another. > + Long traces can result when profiling in this mode, so > + be sure to avoid gathering more data than you have memory to > + store (and have time to process later). The output is very > + similar to the output in accumulate mode. 
> > > - Long traces can result when profiling in this mode, so be sure to > - avoid taking more data than you have memory to store (and have time > - to process later). The output is very similar to the output in > - accumulate mode. > + Here is a sample of the output obtained by running the fast > + convolution example in trace mode, which can also be run with > + the options > +--vsipl++-profile-mode=trace --vsipl++-profile-output=profile.txt > + > > > > > - For each event, the profiler outputs an event number, an indentifying > - tag, and the current timestamp (in "ticks"). The next two fields > - differ depending on whether the event is coming into scope or out of > - scope. When coming into scope, a zero is shown followed by the > - estimated count of floating point operations for that function. > - When exiting scope, the profiler displays the event number being > - closed followed by a zero. In all cases, the timestamp (and > - intervals) may be converted to seconds by dividing by the > - 'clocks_per_second' constant in the log file header. > + For each event, the Sourcery VSIPL++ outputs an event number, > + an indentifying tag, and the current timestamp (in "ticks"). > + The next two fields differ depending on whether the event > + marks the entry point of a library function or its return. > + At the start of a call, a zero is shown followed by the estimated > + count of floating point operations for that function. When > + returning from a call, the profiler displays the event number > + created when the function was called, followed by a zero. > + In all cases, the timestamp (and intervals) may be converted to > + seconds by dividing by the 'clocks_per_second' constant in the > + log file header. > > + > + In the break shown by the ellipses, the program is in the middle of > + performing the vector-matrix multiply, which has been broken down > + into 64 separate vector-multiplies. 
The first two FFT's are > + completed, as shown by the fact that each have two entries, > + one for where the computation began and one for where it ended. > + The Vmmul function has started, but not yet finished, so it only > + has one entry as of yet. The output includes the end event of the vmmul. How about "For brevity, events for some of the 64 scalar-vector multiplies performed in the vmmul operation have been replaced with an ellipsis." > + >
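The event pairing described in the trace-mode text can be sketched like this. It is an illustration under stated assumptions, not the profiler's implementation: `Trace_event`, `Call_record` and `pair_events` are hypothetical names, the on-disk format of the trace is not modelled, and event numbers are assumed to start at 1 so that 0 can mark an entry event.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One line of trace output, as described in the tutorial text.
struct Trace_event
{
  int         number;  // event number
  std::string tag;     // identifying tag
  double      ticks;   // timestamp, in ticks
  int         closes;  // 0 on entry; on return, the number of the entry event
  double      ops;     // estimated operation count on entry; 0 on return
};

// One completed call, reconstructed from an entry/return pair.
struct Call_record
{
  std::string tag;
  double      seconds;
  double      ops;
};

// Match each return event with its entry event and convert the tick
// interval to seconds by dividing by clocks_per_second.
std::vector<Call_record>
pair_events(std::vector<Trace_event> const& events, double clocks_per_second)
{
  std::map<int, Trace_event> open;   // entry events not yet closed
  std::vector<Call_record> calls;
  for (std::size_t i = 0; i < events.size(); ++i)
  {
    Trace_event const& e = events[i];
    if (e.closes == 0)               // entry event: remember it
    {
      open[e.number] = e;
      continue;
    }
    std::map<int, Trace_event>::iterator start = open.find(e.closes);
    if (start == open.end())         // return without a recorded entry
      continue;
    Call_record r;
    r.tag     = start->second.tag;
    r.seconds = (e.ticks - start->second.ticks) / clocks_per_second;
    r.ops     = start->second.ops;
    calls.push_back(r);
    open.erase(start);
  }
  return calls;
}
```

Calls are reported in the order in which they return, so a nested call appears before the call that encloses it. Any entry event still open at the end corresponds to a call that had not returned when the trace stopped.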
> -
Performance API > +
Performance API > > An additional interface is provided for getting run-time profile data. > This allows you to selectively monitor the performance of a > @@ -166,19 +331,19 @@ > > > Classes with the Performance API provide a function called > - impl_performance that takes a string parameter and returns > - single-precision floating point number. > + impl_performance that takes a std::string parameter > + and returns a single-precision floating point number. Doesn't impl_performance take a 'char const*' parameter? > > > The following call shows how to obtain an estimate of the performance > in number of operations per second: > - > - float mops = fwd_fft.impl_performance("mops"); > - > - An "operation" will vary depending on the object and type of data > - being processed. For example, a single-precison Fft object will > - return the number of single-precison floating-point operations > - performed per second. > +float mops = fwd_fft.impl_performance("mops"); > + The definition of "operation" varies depending on the object > + and type of data being processed. For example, a single-precison > + Fft object will return the number of single-precison > + floating-point operations performed per second while a complex > + double-precision FFT object will return the number of double- > + precision floating-point operations performed per second. > > > The table below lists the current types of information available. > @@ -219,37 +384,59 @@ >
> > > -
Application Profiling > +
> + Application Profiling > > - The profiling mode provides an API that allows you to instrument > - your own code. Here we introduce a new object, the > - Scope_event class, and show you how to use it in your > - application. > + Sourcery VSIPL++ provides an interface that allows you to instrument > + your own code through the Scope_event class. For avoidance of doubt, you could mention that these events get included in the profiling output: "Sourcery VSIPL++ provides an interface that allows you to instrument your own code with profiling events that will be included in the accumulate mode and trace mode output. "Profiling events are recorded by constructing a 'Scope_event' object. ... MERGE WITH NEXT PARAGRAPH" > > > - To create a Scope_event, simply call the constructor, passing > - it the string that will become the event tag and, optionally, an integer > - value expressing the number of floating point operations that will > - be performed by the time the Scope_event object is destroyed. > - For example, to measure the time taken to compute a simple running sum > - of squares over a C array: > + To create a Scope_event, call the constructor, passing > + it a std::string that will become the event tag and, optionally, an > + integer value expressing the number of floating point operations > + that will be performed by the time the Scope_event > + object is destroyed. For example, to measure the time taken to > + compute the main portion in the fast convolution example, > + modify the source as follows: -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From stefan at codesourcery.com Wed Sep 13 18:32:55 2006 From: stefan at codesourcery.com (Stefan Seefeld) Date: Wed, 13 Sep 2006 14:32:55 -0400 Subject: patch: Add support for IPP on windows. Message-ID: <45084ED7.8000003@codesourcery.com> The attached patch makes the library compile with IPP on windows. OK to commit ?
Regards, Stefan -- Stefan Seefeld CodeSourcery stefan at codesourcery.com (650) 331-3385 x718 -------------- next part -------------- A non-text attachment was scrubbed... Name: IPP.patch Type: text/x-patch Size: 4575 bytes Desc: not available URL: From mark at codesourcery.com Wed Sep 13 18:37:20 2006 From: mark at codesourcery.com (Mark Mitchell) Date: Wed, 13 Sep 2006 11:37:20 -0700 Subject: [vsipl++] patch: Add support for IPP on windows. In-Reply-To: <45084ED7.8000003@codesourcery.com> References: <45084ED7.8000003@codesourcery.com> Message-ID: <45084FE0.40608@codesourcery.com> Stefan Seefeld wrote: > The attached patch makes the library compile with IPP on windows. > OK to commit ? Looks OK to me. > Index: src/vsip/impl/config.hpp > =================================================================== > --- src/vsip/impl/config.hpp (revision 149109) > +++ src/vsip/impl/config.hpp (working copy) > @@ -29,6 +29,13 @@ > # define VSIP_IMPL_PI 3.14159265358979323846 > #endif > > +#if defined(_WIN32) && VSIP_IMPL_HAVE_IPP > +// IPP on Windows uses __stdcall for all functions. > +# define VSIP_IMPL_IPP_CALL __stdcall > +#else > +# define VSIP_IMPL_IPP_CALL > +#endif Here, you probably don't really need the HAVE_IPP test, since the macro is only used with IPP. It's not harmful; just seems redundant. Thanks, -- Mark Mitchell CodeSourcery mark at codesourcery.com (650) 331-3385 x713 From jules at codesourcery.com Wed Sep 13 21:11:23 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Wed, 13 Sep 2006 17:11:23 -0400 Subject: [vsipl++] patch: Add support for IPP on windows. In-Reply-To: <45084FE0.40608@codesourcery.com> References: <45084ED7.8000003@codesourcery.com> <45084FE0.40608@codesourcery.com> Message-ID: <450873FB.7030909@codesourcery.com> Mark Mitchell wrote: > Stefan Seefeld wrote: >> The attached patch makes the library compile with IPP on windows. >> OK to commit ? > > Looks OK to me. Looks good here too. 
Same comment as Mark about the VSIP_IMPL_HAVE_IPP test. :) Please check it in. thanks, -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From don at codesourcery.com Thu Sep 14 07:52:28 2006 From: don at codesourcery.com (Don McCoy) Date: Thu, 14 Sep 2006 01:52:28 -0600 Subject: [vsipl++] [patch] Tutorial updates In-Reply-To: <45080304.6010704@codesourcery.com> References: <45057AE9.3040906@codesourcery.com> <45074DC8.6000608@codesourcery.com> <45080304.6010704@codesourcery.com> Message-ID: <45090A3C.4040403@codesourcery.com> Jules Bergmann wrote: > Don, > > This looks good. I have several suggestions below on the tutorial > chapter. > Use them as you please :) Once you're happy please check it in. We can > continue to incorporate edits as we review at the whole document. > Thanks again for the comments. It is now checked in with these edits and a few others, as attached. -- Don McCoy don (at) CodeSourcery (888) 776-0262 / (650) 331-3385, x712 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: pt4.changes URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: pt4.diff URL: From jules at codesourcery.com Thu Sep 14 21:00:58 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Thu, 14 Sep 2006 17:00:58 -0400 Subject: [patch] work around for icl transpose bug Message-ID: <4509C30A.2010507@codesourcery.com> This patch attempts to work around the icl bug with complex transpose. It has been tested with Intel C++ 9.1 for Windows ia32 against a simplified test case that triggered the bug (I will post that test later today). It has not been tested with the original solver failures. Patch applied. -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: trans.diff URL: From mark at codesourcery.com Thu Sep 14 22:18:22 2006 From: mark at codesourcery.com (Mark Mitchell) Date: Thu, 14 Sep 2006 15:18:22 -0700 Subject: [vsipl++] [patch] work around for icl transpose bug In-Reply-To: <4509C30A.2010507@codesourcery.com> References: <4509C30A.2010507@codesourcery.com> Message-ID: <4509D52E.9000004@codesourcery.com> Jules Bergmann wrote: > This patch attempts to work around the icl bug with complex transpose. > > It has been tested with Intel C++ 9.1 for Windows ia32 against a > simplified test case that triggered the bug (I will post that test later > today). It has not been tested with the original solver failures. How horribly awfully sad. :-( Looking at the test case you posted, I don't spot a coding bug. It's always possible, but I didn't see it. So, it does seem most likely to be a coding bug. In any case, given the schedule, I definitely agree that a work-around is in order. I'm finishing up minor edits to the tutorial this afternoon/evening. Thanks, -- Mark Mitchell CodeSourcery mark at codesourcery.com (650) 331-3385 x713 From mark at codesourcery.com Thu Sep 14 22:21:33 2006 From: mark at codesourcery.com (Mark Mitchell) Date: Thu, 14 Sep 2006 15:21:33 -0700 Subject: [vsipl++] [patch] work around for icl transpose bug In-Reply-To: <4509D52E.9000004@codesourcery.com> References: <4509C30A.2010507@codesourcery.com> <4509D52E.9000004@codesourcery.com> Message-ID: <4509D5ED.1070908@codesourcery.com> Mark Mitchell wrote: > Jules Bergmann wrote: >> This patch attempts to work around the icl bug with complex transpose. >> >> It has been tested with Intel C++ 9.1 for Windows ia32 against a >> simplified test case that triggered the bug (I will post that test >> later today). It has not been tested with the original solver failures. > > How horribly awfully sad. :-( Looking at the test case you posted, I > don't spot a coding bug. It's always possible, but I didn't see it. 
So, > it does seem most likely to be a coding bug. compiler bug, I mean. -- Mark Mitchell CodeSourcery mark at codesourcery.com (650) 331-3385 x713 From mark at codesourcery.com Fri Sep 15 01:11:54 2006 From: mark at codesourcery.com (Mark Mitchell) Date: Thu, 14 Sep 2006 18:11:54 -0700 Subject: PATCH: Updates to tutorial Message-ID: <200609150111.k8F1Bsm2013570@sethra.codesourcery.com> This patch fixes some typos/grammar/etc. in the tutorial. There's clearly more we could do to improve the documentation, but this will do for the upcoming release. Jules, Don, I noticed that there's no performance graph for the temporal-locality version of the parallel fast convolution. Is that graph available? Thanks, -- Mark Mitchell CodeSourcery mark at codesourcery.com (650) 331-3385 x713 2006-09-14 Mark Mitchell * doc/tutorial/tutorial.xml: Add references to API reference and specification. * doc/tutorial/performance.xml: Edit. * doc/tutorial/parallel.xml: Likewise. * doc/tutorial/serial.xml: Likewise. Index: performance.xml =================================================================== --- performance.xml (revision 149238) +++ performance.xml (working copy) @@ -12,8 +12,27 @@ ]> - Performance + Profiling + + + + This chapter explains how to use the profiling features of Sourcery + VSIPL++ to improve the performance of your application. + + + + + + This chapter explains how to use the profiling features of Sourcery + VSIPL++ to improve the performance of your application. Sourcery + VSIPL++ provides two profiling modes. The library + profiling mode allows you to gather data about the + time used for computations performed through the VSIPL++ API. The + application profiling mode allows you to + instrument blocks of application code to gather data at a higher + level. +
Library Profiling @@ -90,16 +109,15 @@ To enable profiling, define - on the command line when compiling your program. - On many systems, this option may be added to the CXXFLAGS variable - in the project makefile. - - - Since profiling can introduce overhead, especially for element-wise - expressions, this macro allows you to choose which operations in the - library are profiled. To profile all operations, use + on the command line when compiling your program. (If you are + using make to build your program, you might + want to add this command-line option to the + CXXFLAGS variable.) To profile all operations, use . See for other possible values. + Since profiling introduces some overhead, especially for element-wise + expressions, you may wish to limit the set of operations that are + are profiled. @@ -115,27 +133,31 @@
Accumulating Profile Data - To use the accumulate mode, you must declare a Profile + To use the accumulate mode, you must declare a + Profile object. Sourcery VSIPL++ will collect profiling data throughout - its lifetime. When the object goes out of scope, the data - collected by profiling will be written to a log file. For + the lifetime of this object. When the object goes out of scope, + the data collected by profiling will be written to a log file. For example, to profile your entire program, with all data written to the file profile.txt, you would add this line: Profile profile("profile.txt", pm_accum); - to the beginning of your main function, after + to the beginning of your main function, after initializing Sourcery VSIPL++. Then, when the program exits, this object will go out of scope and profiling data will be written to the output file. For this reason, only one object of this type may be in scope at any given time. - If you are profiling your entire program, you may specify options - on the command line that perform the equivalent of the above two steps: - + If you want to profile your entire program, you may invoke your + program with the following command-line options: --vsipl++-profile-mode=accum --vsipl++-profile-output=profile.txt + These options will be processed during the call to + vsip::init, and are equivalent to declaring + the profiling object in maine, as described + above. Using this technique on the example program fce-serial.cpp @@ -149,7 +171,8 @@ (or "event"). The first column gives a name for the event. The second column is the total amount of time spent in this operation in "ticks". (You can convert ticks to seconds by dividing by the - value given by the "clocks_per_sec" value in the profiling header.) + value given by the clocks_per_sec value in + the profiling header.) The third column indicates the number of times this operation was performed. 
The fourth column indicates the number of mathematical operations performed during the computation. (This is the number of @@ -369,32 +392,34 @@ Fftm Inv C-C by_ref 64x256 : 1559304 : 1
Performance API - An additional interface is provided for getting run-time profile data. - This allows you to selectively monitor the performance of a - particular instance of a VSIPL class such as Fft, Convolution or - Correlation. - - - Classes instrumented the Performance API provide a function - called impl_performance that takes a pointer to a - constant character string and returns a single-precision floating - point number. + Sourcery VSIPL++ provides an additional, low-level interface for + accessing profile data. This interface allows you to + selectively monitor the performance of a particular instance of + classes that implement the Performance API. Classes + instrumented the Performance API provide a function called + impl_performance. This function maps + keywords (provided as C-style strings) to floating-point values. + The Fft, + Convolution, and + Correlation classes all implement the + performance API. The following call shows how to obtain an estimate of the performance - in number of operations per second: + in number of operations per second from a particular FFT object: float mops = fwd_fft.impl_performance("mops"); - The definition of "operation" varies depending on the object - and type of data being processed. For example, a single-precison - Fft object will return the number of single-precison - floating-point operations performed per second while a complex - double-precision FFT object will return the number of double- - precision floating-point operations performed per second. + The definition of "operation" varies depending on the + object and type of data being processed. For example, a + single-precison FFT object will return the number of + single-precison floating-point operations performed per second + while a complex double-precision FFT object will return the + number of double-precision floating-point operations performed + per second. - The table below lists the current types of information available. 
+ The table below lists the information available. Performance API Metrics @@ -442,28 +467,28 @@ Fftm Inv C-C by_ref 64x256 : 1559304 : 1 included in the accumulate mode and trace mode output. - Profiling events are recorded by constructing a Scope_event - object. To create a Scope_event, call the - constructor, passing it a std::string that will + Profiling events are recorded by constructing a Scope_event + object. To create a + Scope_event, call the + constructor, passing it a std::string that will become the event tag and, optionally, an integer value expressing the number of floating point operations that will be performed by - the time the object is destroyed. - For example, to measure the time taken to compute the main portion - in the fast convolution example, modify the source as follows: + the time the object is destroyed. The following example shows + how to use this facility: - - The operation count passed as the second parameter is the - sum of the two FFT's and the vector-matrix multiply. - This resulting profile data is identical in format to that used for - profiling library functions. - + + The operation count passed as the second parameter is the + sum of the two FFT's and the vector-matrix multiply. + The resulting profile data is identical in format to that + obtained using the library API: + Now the output has a new line that represents the time that - the Scope_event object exists, i.e. only while the + the Scope_event object exists, i.e. only while the program executes the three main steps of the fast convolution. Fast Convolution : 4256109 : 1 : 2424832 : 2046.11 Index: tutorial.xml =================================================================== --- tutorial.xml (revision 149238) +++ tutorial.xml (working copy) @@ -61,7 +61,11 @@ Reference - The sections in Part II form a reference manual for Sourcery VSIPL++. + The sections in Part II provide reference information about + Sourcery VSIPL++. 
You should also refer to the VSIPL++ API + Specification and Sourcery VSIPL++ API Reference, both of + which are available at . Index: parallel.xml =================================================================== --- parallel.xml (revision 149238) +++ parallel.xml (working copy) @@ -28,7 +28,7 @@ The first fast convolution program in the previous chapter makes use of two implicitly parallel operators: Fftm and - vmmul. These operators are implicity parallel + vmmul. These operators are implicitly parallel in the sense that they process each row of the matrix independently. If you had enough processors, you could put each row on a separate processor and then perform the entire @@ -38,19 +38,20 @@ In the VSIPL++ API, you have explicit control of the number of processors used for a computation. Since the default is to use - just a single processor, the program above will not run in - parallel, even on a multi-processor system. This section will show - you how to use maps to take advantage of - multiple processors. Using a map tells Sourcery VSIPL++ to - distribute a single block of data across multiple processors. - Then, Sourcery VSIPL++ will automatically move data between - processors as necessary. + just a single processor, the program in will not run in parallel, even on a + multi-processor system. This section will show you how to use + maps to take advantage of multiple + processors. Using a map tells Sourcery VSIPL++ to distribute a + single block of data across multiple processors. Then, Sourcery + VSIPL++ will automatically move data between processors as + necessary. The VSIPL++ API uses the Single-Program Multiple-Data (SPMD) model for parallelism. In this model, every processor runs the same - program, but operates on different sets of data. For instance, in + program, but operates on different sets of data. 
For example, in the fast convolution example, multiple processors perform FFTs at the same time, but each processor handles different rows in the matrix. @@ -218,12 +219,12 @@ Implicit Parallelism: Parallel Foreach - You may feel that the original formulation was simpler and more + You may feel that the original formulation using implicitly + parallel operators was simpler and more intuitive than the more-efficient variant using explicit loops. Sourcery VSIPL++ provides an extension to the VSIPL++ API that allows you to retain the elegance of that formulation while still - obtaining the temporal locality obtained with the style shown in - the previous two sections. + obtaining good temporal locality. @@ -373,11 +374,11 @@ Because the data will be arriving via DMA, you must explicitly - manage the memory used by Sourcery VSIPL++. Each processor must allocate - the memory for its local portion of - data_in_block. (All processors except the - actual input processor will allocate zero bytes, since the input - data is located on a single processor.) The code required to + manage the memory used by Sourcery VSIPL++. Because VSIPL++ uses the + SPMD model, each processor must allocate + the memory for its local portion the input block, even though all + processors except the actual input processor will allocate zero + bytes. The code required to set up the views is: Index: serial.xml =================================================================== --- serial.xml (revision 149238) +++ serial.xml (working copy) @@ -151,7 +151,7 @@ Before performing the actual convolution, you must convert the replica to the frequency domain using the FFT created above. Because the replica data is a property of the chirp, we only need to do - this once; even if our radar system runs for a long time, the + this once; even if the radar system runs for a long time, the converted replica will always be the same. 
VSIPL++ FFT objects behave like functions, so you can just "call" the FFT object: @@ -165,7 +165,7 @@ objects you've already created to go into and out of the frequency domain. While in the frequency domain, you will use the vmmul operator to perform a - vector-matrix multiply. This will multiply each row + vector-matrix multiply. This operator multiplies each row (dimension zero) of the frequency-domain matrix by the replica. The vmmul operator is a template taking a single parameter which indicates whether the multiplication should @@ -284,7 +284,7 @@ }]]> - The following graph shows that the new "interleaves" + The following graph shows that the new "interleaved" formulation is faster than the original "phased" approach for large data sets. For smaller data sets (where all of the data fits in the cache anyhow), the original method is faster because @@ -309,12 +309,12 @@ - To perform I/O with external routines (such as posix - read and write - it is necessary to obtain a pointer to data. - Sourcery VSIPL++ provides multiple ways to do this: - using user-defined storage, and - using external data access. + To perform I/O with external routines (such as the POSIX + read and write functions) + it is necessary to obtain a pointer to the raw data used by + Sourcery VSIPL++. Sourcery VSIPL++ provides two ways to do this: + you may use either user-defined storage or + external data access. In this section you will use user-defined storage to perform I/O. Later, in you will see how to use external data access for I/O. @@ -385,7 +385,7 @@ The true argument indicates that the data values sould be preserved by the admit. In cases where the values do not need to preserved (such as admitting a block - after outout I/O has been performed and before the block will be + after output I/O has been performed and before the block will be overwritten by new values in VSIPL++) you can use false instead. 
@@ -417,14 +417,13 @@ In this section, you will use External Data - Access to get pointer to a block's data. - External data access allows a pointer to any block's - data to be taken, even if the block was not created with - user-specified storage (or if the block is not a Dense - block at all!) This capability is useful in context where you - cannot control how a block is created. To illustrate - this, you will create a utility routine for I/O that works - with any view passed as a parameter. + Access to get a pointer to a block's data. + You can use this method with any block, even if the block does not + use user-specified storage. The external data access method is + useful in contexts where you cannot control how the block is + allocate. For example, in this section, you will create a utility + routine for I/O that works with any matrix or vector, even if it + was not created with user-defined storage. @@ -440,30 +439,30 @@ block_type and the requested layout layout_type. The constructor takes two parameters: the block being accessed, and the type of - syncing necessary. + synchronization necessary. - The layout_type parameter is an - specialized Layout class template that + The layout_type parameter is a + specialized Layout class template that determines the layout of data that Ext_data provides. If no type is given, the natural layout of the block is used. However, in some - cases it is necessary to access the data in a certain way, - such as dense or row-major. + cases you may wish to specify row-major or column-major layout. - Layout class template takes 4 parameters to + The Layout class template takes 4 parameters to indicate dimensionality, dimension-ordering, packing format, and complex storage format (if complex). In the example below you will use the layout_type to request the data access to be dense, - row-major, with interleaved real and imaginar values if complex. - This will allow you to read data sequentially from a file. 
+ row-major, with interleaved real and imaginary values. This layout + corresponds to a common storage format used for binary files + storing complex data. - The sync type is analgous to the update flags for + The synchronization type is analgous to the update flags for admit() and release(). SYNC_IN indicates that the block and pointer @@ -486,15 +485,18 @@ - The pointer provided is valid only during the life of the object. - Moreover, the block being accessed should not be used during that time. + The pointer provided is valid only during the life of the + Ext_data object. + Moreover, the block referred to by the + Ext_data object must not be used during this + period. - Putting this together, you can create a routine to perform + Using these capabilities together, you can create a routine to perform I/O into a block. This routine will take two arguments: - a filename to read, and a view to put the data into. + a filename to read, and a view in which to store the data. The amount of data read from the file will be determined by the view's size. From jules at codesourcery.com Fri Sep 15 02:11:15 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Thu, 14 Sep 2006 22:11:15 -0400 Subject: [patch] Shared builtin libdirs for merged package Message-ID: <450A0BC3.3080903@codesourcery.com> This patch installs builtin libraries such as ATLAS and FFTW3 into 'builtin_libdir', which can be set from configure. By default it is the same as libdir, so it only makes a difference when explicitly used. packpage.py and scripts/config is updated to use this so that builtin libraries are shared amongst different library variants when possible. This patch includes a small bug-fix to simd.hpp, and some test updates made for debugging the windows' solver-lu failures. -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: misc URL: From jules at codesourcery.com Fri Sep 15 02:19:12 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Thu, 14 Sep 2006 22:19:12 -0400 Subject: [vsipl++] [patch] Shared builtin libdirs for merged package In-Reply-To: <450A0BC3.3080903@codesourcery.com> References: <450A0BC3.3080903@codesourcery.com> Message-ID: <450A0DA0.1030407@codesourcery.com> Oops! Here's the right patch. -- Jules Jules Bergmann wrote: > This patch installs builtin libraries such as ATLAS and FFTW3 into > 'builtin_libdir', which can be set from configure. By default it is the > same as libdir, so it only makes a difference when explicitly used. > > packpage.py and scripts/config is updated to use this so that builtin > libraries are shared amongst different library variants when possible. > > This patch includes a small bug-fix to simd.hpp, and some test updates > made for debugging the windows' solver-lu failures. > > -- Jules > > > ------------------------------------------------------------------------ > > > configure options for gannon > --disable-mpi > --with-lapack=builtin > --with-atlas-tarball=/home/jules/csl/atlas/atlas3.7.11_SunOS_SunUS2.tar.gz > --enable-fft=builtin --disable-fft-long-double > --enable-profile-timer=posix -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: misc.diff URL: From jules at codesourcery.com Fri Sep 15 03:30:07 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Thu, 14 Sep 2006 23:30:07 -0400 Subject: [patch] Regression test for icc-windows bug with transpose. Message-ID: <450A1E3F.2070503@codesourcery.com> Patch applied. -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: ta.diff URL: From jules at codesourcery.com Fri Sep 15 05:52:10 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Fri, 15 Sep 2006 01:52:10 -0400 Subject: [patch] IPP and MKL configuration for windows Message-ID: <450A3F8A.3040102@codesourcery.com> This patch makes it slightly easier to configure for using IPP and MKL on Windows. First, since configure doesn't like paths with spaces, it is necessary to put the paths for IPP and MKL into the LIB and INCLUDE environment variables. Once this is done, use the configure options: --enable-ipp=win and --with-lapack=mkl_win In the future we can clean these up to have configure "do the right thing" on Windows without the "win" hints, but that can wait for now. -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: wincfg.diff URL: From jules at codesourcery.com Fri Sep 15 18:02:10 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Fri, 15 Sep 2006 14:02:10 -0400 Subject: [patch] benchmark updates Message-ID: <450AEAA2.6060807@codesourcery.com> NOTE: This patch does not affect the core library or the contents of binary packages. This patch adds an Impl_pop case to the Fftm benchmark, which measures performance of an out-of-place Fftm as implemented by a loop of out-of-place Ffts. Besides illustrating the advantages of using Fftm over Fft for some backends, this can be used to measure the performance of Fft when its data is not guaranteed to start in cache. For example, in the FIR bank benchmark, when processing is done one row at a time, the forward Ffts are performing a disjoint Fftm. Depending on the problem size vs cache size, their data may not be in cache. This benchmark case partially models that. Patch applied.
-- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: bm.diff URL: From jules at codesourcery.com Fri Sep 15 21:14:36 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Fri, 15 Sep 2006 17:14:36 -0400 Subject: [patch] Updated Qr benchmark Message-ID: <450B17BC.2020503@codesourcery.com> This updates the benchmark to cover the various Q save options (no Q, thin Q, full Q). It also adds coverage for row-major and col-major source data. Patch applied. -- Jules -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705 From jules at codesourcery.com Tue Sep 19 02:22:49 2006 From: jules at codesourcery.com (Jules Bergmann) Date: Mon, 18 Sep 2006 22:22:49 -0400 Subject: Sourcery VSIPL++ 1.2 Available Message-ID: <450F5479.4060503@codesourcery.com> CodeSourcery is pleased to announce the availability of Sourcery VSIPL++ 1.2. This new version of Sourcery VSIPL++, a toolkit for developing high-performance signal- and image-processing applications, has a number of improvements and new features. Highlights include greater portability with support for the Windows platform and the Intel C++ compiler, improved performance with SIMD loop fusion to make greater use of the PowerPC AltiVec and Intel SSE instruction sets, improved parallelism with support for Mercury's Parallel Acceleration System (PAS) library, and increased productivity with an integrated profiling capability to gather application performance data. Sourcery VSIPL++ is a full implementation of the VSIPL++ API, an open standard for platform-independent signal- and image-processing developed by the DOD High Performance Embedded Computing Software Initiative (HPEC-SI) and the VSIPL Forum. Sourcery VSIPL++ provides many high-level routines used in SIP computing, such as FFTs, FIR filters, SVD and QR decomposition, and linear algebra.
For more information about Sourcery VSIPL++, including information about receiving a free 30-day evaluation, please visit our website: http://www.codesourcery.com/vsiplplusplus For more information on the new features in this release, please visit: http://www.codesourcery.com/vsiplplusplus/1.2/news.html -- Jules Bergmann CodeSourcery jules at codesourcery.com (650) 331-3385 x705