[vsipl++] [patch] Built-in function profiling
Jules Bergmann
jules at codesourcery.com
Mon Jul 10 14:31:02 UTC 2006
Don McCoy wrote:
> This is the first of a series of patches to add built-in profiling
> capability to VSIPL++. For this initial version, only the FFT,
> Convolution and Correlation objects have been modified.
>
> An example of how to use the profiling features is included in
> examples/fft.cpp. In brief, it shows the exact type of
> operation performed, the total amount of time spent (even over multiple
> calls), the number of times called, the operation count per call and
> finally, the (calculated) rate of operations (in MFLOPS). Full
> documentation will be provided in an upcoming patch.
Don, this looks good. I have several comments below. Can you please
address #1 and #3-#5 before checking in? We can address #2 later.
thanks
-- Jules
> @@ -171,6 +179,7 @@
> Profiler::dump(char* filename, char /*mode*/)
> {
> std::ofstream file;
> + const char delim[] = " : ";
While the spaces improve human readability, they are more difficult
to post-process in languages like perl and python (although the extra
difficulty is pretty minor).  Let's leave the space in, but in general
we should err on the side of easier post-processing rather than
readability, since the profiling output will have a lot of raw
information that will be difficult to digest without some reduction.
>
> file.open(filename);
>
> @@ -179,16 +188,20 @@
> file << "# mode: pm_trace" << std::endl;
> file << "# timer: " << TP::name() << std::endl;
> file << "# clocks_per_sec: " << TP::ticks(TP::clocks_per_sec) << std::endl;
> + file << "# " << std::endl;
> + file << "# index" << delim << "tag" << delim << "ticks" << delim << "open id"
> + << delim << "op count" << std::endl;
>
> for (iterator cur = accum_.begin(); cur != accum_.end(); ++cur)
> {
> - file << (*cur).first << ":"
> - << TP::ticks((*cur).second.total) << ":"
> - << (*cur).second.count << std::endl;
> + float mflops = (*cur).second.count * (*cur).second.value /
> + (1e6 * TP::seconds((*cur).second.total));
> + file << (*cur).first
> + << delim << TP::ticks((*cur).second.total)
> + << delim << (*cur).second.count
> + << delim << (*cur).second.value
> + << delim << mflops
> + << std::endl;
> + // clear log
> cur->second.total = TP::zero();
> cur->second.count = 0;
[1] I think this should also clear 'value'.
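I.e., something like this at the bottom of the loop (untested sketch;
assumes 'value' is a plain arithmetic type):

  // Reset the accumulated data for this tag so the next dump
  // starts from a clean slate.
  cur->second.total = TP::zero();
  cur->second.count = 0;
  cur->second.value = 0;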
>
> +namespace conv
> +{
> +template <typename T>
> +struct Op_count
> +{
> + static length_type
> + apply(length_type kernel_len, length_type output_len)
> + {
> + return static_cast<length_type>(kernel_len * output_len *
> + (Ops_info<T>::mul + Ops_info<T>::add));
> + }
> +};
This could also have been a template function:

  template <typename T>
  inline length_type
  op_count(length_type kernel_len, length_type output_len)
  {
    return kernel_len * output_len * (Ops_info<T>::mul + Ops_info<T>::add);
  }
Using a template function slightly simplifies the definition and use,
but it makes it difficult to use partial specializations (for example,
if you needed to compute the ops counts for complex versus non-complex
differently). However, the 'Ops_info' class makes such specialization
unnecessary.
Either way (class template as it currently is, or function template) is
fine.
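For the record, if per-type specialization ever did become necessary,
the class-template form would allow it via a partial specialization,
e.g. (hypothetical sketch only -- as noted above, the Ops_info class
makes this unnecessary):

  // Hypothetical: count complex convolutions differently by partially
  // specializing Op_count for complex<T>.  A function template could
  // not be partially specialized this way.
  // (Assumes this lives in namespace impl::conv with <complex> included.)
  template <typename T>
  struct Op_count<std::complex<T> >
  {
    static length_type
    apply(length_type kernel_len, length_type output_len)
    {
      return static_cast<length_type>(kernel_len * output_len *
               (Ops_info<std::complex<T> >::mul +
                Ops_info<std::complex<T> >::add));
    }
  };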
> +} // namespace conv
> } // namespace impl
>
> template <template <typename, typename> class ConstViewT,
> @@ -110,7 +124,11 @@
> impl_View<V2, Block2, T, dim> out)
> VSIP_NOTHROW
> {
> - timer_.start();
> + length_type const M = this->kernel_size()[0].size();
> + length_type const P = this->output_size()[0].size();
> + int ops = impl::conv::Op_count<T>::apply(M, P);
> + impl::profile::Scope_event scope_event("convolve_impl_view", ops);
> + impl::profile::Time_in_scope scope(this->timer_);
[2] This is correct in functionality, but it would be good to abstract
it somewhat so that each Scope_event doesn't need to compute the ops
count.  One way to do this is with a new class 'Profile_event' or
'Persistent_event' that is initialized with its name and ops count:

  class Convolution
  {
    ...

    Convolution(...)
      : ...
        event_("convolve_vector", ... precompute ops count ...)
    { ... }

    // Member data
    Profile_event event_;
  };

This would then be used in the operator:

  impl::profile::Event_in_scope scope(this->event_);

We could remove the duplicated effort between this->event_ and
this->timer_ by having the 'Profile_event' class be able to return its
accumulated time and invocation count:

  this->event_.total() and this->event_.count()

Since not all users of 'Profile_event' will need this ability, it could
be made a policy:

  Profile_event<Accumulate_time> event_;

or

  Profile_event<No_accumulate_time> event_;
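A rough, untested sketch of the shape I have in mind (all names and
details are illustrative only, not an actual interface):

  // Hypothetical sketch.  The event owns its name and precomputed
  // ops count; Event_in_scope is an RAII helper that records one
  // invocation of the event when it goes out of scope.
  struct Accumulate_time    { /* would hold total time and call count */ };
  struct No_accumulate_time { /* empty: no accumulation overhead */ };

  template <typename AccumPolicy = No_accumulate_time>
  class Profile_event
  {
  public:
    Profile_event(char const* name, unsigned ops)
      : name_(name), ops_(ops) {}

    char const* name() const { return name_; }
    unsigned    ops()  const { return ops_; }

  private:
    char const* name_;
    unsigned    ops_;
    AccumPolicy accum_;  // total()/count() would live here when enabled
  };

  template <typename AccumPolicy>
  class Event_in_scope
  {
  public:
    explicit Event_in_scope(Profile_event<AccumPolicy>& event)
      : event_(event) { /* start timer */ }
    ~Event_in_scope() { /* stop timer; log event_.name(), event_.ops() */ }
  private:
    Profile_event<AccumPolicy>& event_;
  };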
> for (dimension_type d=0; d<dim; ++d)
> assert(in.size(d) == this->input_size()[d].size());
>
>
> return out;
> }
> @@ -152,7 +172,11 @@
> Matrix<T, Block2> out)
> VSIP_NOTHROW
> {
> - timer_.start();
> + length_type const M = this->kernel_size()[0].size();
> + length_type const P = this->output_size()[0].size();
> + int ops = impl::conv::Op_count<T>::apply(M, P);
[3a] For a matrix convolution, the ops count will also depend on the
kernel_size()[1].size() and output_size()[1].size(). (The ops
computation in the original impl_performance was also broken for matrices).
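Something along these lines would cover the matrix case (hypothetical
sketch; the name and exact form are up to you), with the call site
passing both dimensions of kernel_size() and output_size():

  // Hypothetical 2-D variant: every output element costs a full
  // kernel's worth of multiply/adds, in both dimensions.
  template <typename T>
  struct Op_count_2d
  {
    static length_type
    apply(length_type kernel_rows, length_type kernel_cols,
          length_type output_rows, length_type output_cols)
    {
      return static_cast<length_type>(kernel_rows * kernel_cols *
                                      output_rows * output_cols *
                                      (Ops_info<T>::mul + Ops_info<T>::add));
    }
  };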
> + impl::profile::Scope_event scope_event("convolve_matrix", ops);
> + impl::profile::Time_in_scope scope(this->timer_);
> for (dimension_type d=0; d<dim; ++d)
> assert(in.size(d) == this->input_size()[d].size());
> {
> if (!strcmp(what, "mflops"))
> {
> - int count = timer_.count();
> length_type const M = this->kernel_size()[0].size();
> length_type const P = this->output_size()[0].size();
> - float ops = 2.f * count * P * M;
> - return ops / (1e6*timer_.total());
> + int ops = impl::conv::Op_count<T>::apply(M, P);
> + return timer_.count() * ops / (1e6 * timer_.total());
[3b] also correct matrix ops count here.
> @@ -124,7 +141,11 @@
> Matrix<T, Block2> out)
> VSIP_NOTHROW
> {
> - impl::profile::Scope_timer t(timer_);
> + length_type const M = this->reference_size()[0].size();
> + length_type const P = this->output_size()[0].size();
> + int ops = impl::corr::Op_count<T>::apply(M, P);
[3c] matrix ops count
> + impl::profile::Scope_event scope_event("correlate_matrix", ops);
> + impl::profile::Time_in_scope scope(this->timer_);
>
> for (dimension_type d=0; d<dim; ++d)
> {
> @@ -142,11 +163,10 @@
> {
> if (!strcmp(what, "mflops"))
> {
> - int count = timer_.count();
> - length_type const M = this->kernel_size()[0].size();
> + length_type const M = this->reference_size()[0].size();
> length_type const P = this->output_size()[0].size();
> - float ops = 2.f * count * P * M;
> - return ops / (1e6*timer_.total());
> + int ops = impl::corr::Op_count<T>::apply(M, P);
[3d] matrix ops count
> + return timer_.count() * ops / (1e6 * timer_.total());
> +
> +template <typename T>
> +struct Ops_info
> +{
> + static int const div = 1;
> + static int const sqr = 1;
> + static int const mul = 1;
> + static int const add = 1;
[4] If we change these from 'int' to 'unsigned', we can get rid of the
static_casts in the Op_count::apply functions.
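I.e. (sketch):

  template <typename T>
  struct Ops_info
  {
    static unsigned const div = 1;
    static unsigned const sqr = 1;
    static unsigned const mul = 1;
    static unsigned const add = 1;
  };

  // ... after which Op_count<T>::apply can simply be
  //   return kernel_len * output_len * (Ops_info<T>::mul + Ops_info<T>::add);
  // with no cast, since every operand in the expression is unsigned.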
>
> +length_type
> +op_count(length_type len)
> +{
> + return static_cast<length_type>(5 * len * std::log((float)len) /
> + std::log(2.f));
> +}
> +
[5] This should take into account the FFT input/output types to
determine if the transform is C to C (in which case the ops count is
correct at 5 * len * log2(len)), or R to C / C to R (in which case the
ops count is halved).
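For example (rough sketch; the choice could equally well be keyed off
the Fft's input/output view types rather than a flag):

  // Hypothetical sketch: 5 N log2(N) for a complex<->complex
  // transform, roughly half that for R->C / C->R.
  inline length_type
  op_count(length_type len, bool complex_to_complex)
  {
    length_type ops = static_cast<length_type>(
      5 * len * std::log((float)len) / std::log(2.f));
    return complex_to_complex ? ops : ops / 2;
  }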
--
Jules Bergmann
CodeSourcery
jules at codesourcery.com
(650) 331-3385 x705