[vsipl++] [patch] Built-in function profiling
Jules Bergmann
jules at codesourcery.com
Mon Jul 10 14:31:02 UTC 2006
Don McCoy wrote:
> This is the first of a series of patches to add built-in profiling
> capability to VSIPL++. For this initial version, only the FFT,
> Convolution and Correlation objects have been modified.
>
> An example of how to use the profiling features is included in
> examples/fft.cpp. In brief, it shows the exact type of
> operation performed, the total amount of time spent (even over multiple
> calls), the number of times called, the operation count per call and
> finally, the (calculated) rate of operations (in MFLOPS). Full
> documentation will be provided in an upcoming patch.
Don, this looks good. I have several comments below. Can you please
address #1 and #3-#5 before checking in? We can address #2 later.
thanks
-- Jules
> @@ -171,6 +179,7 @@
> Profiler::dump(char* filename, char /*mode*/)
> {
> std::ofstream file;
> + const char delim[] = " : ";
While the spaces improve human readability, they are more difficult
to post-process in languages like perl and python (although the extra
difficulty is pretty minor).  Let's leave the space in, but in general
we should err on the side of easier post-processing rather than
readability, since the profiling output will have a lot of raw
information that will be difficult to digest without some reduction.
>
> file.open(filename);
>
> @@ -179,16 +188,20 @@
> file << "# mode: pm_trace" << std::endl;
> file << "# timer: " << TP::name() << std::endl;
> file << "# clocks_per_sec: " << TP::ticks(TP::clocks_per_sec) << std::endl;
> + file << "# " << std::endl;
> + file << "# index" << delim << "tag" << delim << "ticks" << delim << "open id"
> + << delim << "op count" << std::endl;
>
> for (iterator cur = accum_.begin(); cur != accum_.end(); ++cur)
> {
> - file << (*cur).first << ":"
> - << TP::ticks((*cur).second.total) << ":"
> - << (*cur).second.count << std::endl;
> + float mflops = (*cur).second.count * (*cur).second.value /
> + (1e6 * TP::seconds((*cur).second.total));
> + file << (*cur).first
> + << delim << TP::ticks((*cur).second.total)
> + << delim << (*cur).second.count
> + << delim << (*cur).second.value
> + << delim << mflops
> + << std::endl;
> + // clear log
> cur->second.total = TP::zero();
> cur->second.count = 0;
[1] I think this should also clear 'value'.
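I.e., something like this at the bottom of the loop (untested sketch;
assumes 'value' is a plain arithmetic type):

  // Reset the accumulated data for this tag so the next dump
  // starts from a clean slate.
  cur->second.total = TP::zero();
  cur->second.count = 0;
  cur->second.value = 0;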
>
> +namespace conv
> +{
> +template <typename T>
> +struct Op_count
> +{
> + static length_type
> + apply(length_type kernel_len, length_type output_len)
> + {
> + return static_cast<length_type>(kernel_len * output_len *
> + (Ops_info<T>::mul + Ops_info<T>::add));
> + }
> +};
This could also have been a template function:

  template <typename T>
  inline length_type
  op_count(length_type kernel_len, length_type output_len)
  {
    return kernel_len * output_len * (Ops_info<T>::mul + Ops_info<T>::add);
  }
Using a template function slightly simplifies the definition and use,
but it makes it difficult to use partial specializations (for example,
if you needed to compute the ops counts for complex versus non-complex
differently). However, the 'Ops_info' class makes such specialization
unnecessary.
Either way (class template as it currently is, or function template) is
fine.
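For the record, if per-type specialization ever did become necessary,
the class-template form would allow it via a partial specialization,
e.g. (hypothetical sketch only -- as noted above, the Ops_info class
makes this unnecessary):

  // Hypothetical: count complex convolutions differently by partially
  // specializing Op_count for complex<T>.  A function template could
  // not be partially specialized this way.
  // (Assumes this lives in namespace impl::conv with <complex> included.)
  template <typename T>
  struct Op_count<std::complex<T> >
  {
    static length_type
    apply(length_type kernel_len, length_type output_len)
    {
      return static_cast<length_type>(kernel_len * output_len *
               (Ops_info<std::complex<T> >::mul +
                Ops_info<std::complex<T> >::add));
    }
  };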
> +} // namespace conv
> } // namespace impl
>
> template <template <typename, typename> class ConstViewT,
> @@ -110,7 +124,11 @@
> impl_View<V2, Block2, T, dim> out)
> VSIP_NOTHROW
> {
> - timer_.start();
> + length_type const M = this->kernel_size()[0].size();
> + length_type const P = this->output_size()[0].size();
> + int ops = impl::conv::Op_count<T>::apply(M, P);
> + impl::profile::Scope_event scope_event("convolve_impl_view", ops);
> + impl::profile::Time_in_scope scope(this->timer_);
[2] This is correct in functionality, but it would be good to abstract
it somewhat so that each Scope_event doesn't need to compute the ops
count.  One way to do this is with a new class 'Profile_event' or
'Persistent_event' that is initialized with its name and ops count:

  class Convolution
  {
    ...

    Convolution(...)
      : ...
        event_("convolve_vector", ... precompute ops count ...)
    { ... }

    // Member data
    Profile_event event_;
  };

This would then be used in the operator:

  impl::profile::Event_in_scope scope(this->event_);

We could remove the duplicated effort between this->event_ and
this->timer_ by having the 'Profile_event' class be able to return its
accumulated time and invocation count:

  this->event_.total() and this->event_.count()

Since not all users of 'Profile_event' will need this ability, it could
be made a policy:

  Profile_event<Accumulate_time> event_;

or

  Profile_event<No_accumulate_time> event_;
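A rough, untested sketch of the shape I have in mind (all names and
details are illustrative only, not an actual interface):

  // Hypothetical sketch.  The event owns its name and precomputed
  // ops count; Event_in_scope is an RAII helper that records one
  // invocation of the event when it goes out of scope.
  struct Accumulate_time    { /* would hold total time and call count */ };
  struct No_accumulate_time { /* empty: no accumulation overhead */ };

  template <typename AccumPolicy = No_accumulate_time>
  class Profile_event
  {
  public:
    Profile_event(char const* name, unsigned ops)
      : name_(name), ops_(ops) {}

    char const* name() const { return name_; }
    unsigned    ops()  const { return ops_; }

  private:
    char const* name_;
    unsigned    ops_;
    AccumPolicy accum_;  // total()/count() would live here when enabled
  };

  template <typename AccumPolicy>
  class Event_in_scope
  {
  public:
    explicit Event_in_scope(Profile_event<AccumPolicy>& event)
      : event_(event) { /* start timer */ }
    ~Event_in_scope() { /* stop timer; log event_.name(), event_.ops() */ }
  private:
    Profile_event<AccumPolicy>& event_;
  };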
> for (dimension_type d=0; d<dim; ++d)
> assert(in.size(d) == this->input_size()[d].size());
>
>
> return out;
> }
> @@ -152,7 +172,11 @@
> Matrix<T, Block2> out)
> VSIP_NOTHROW
> {
> - timer_.start();
> + length_type const M = this->kernel_size()[0].size();
> + length_type const P = this->output_size()[0].size();
> + int ops = impl::conv::Op_count<T>::apply(M, P);
[3a] For a matrix convolution, the ops count will also depend on the
kernel_size()[1].size() and output_size()[1].size(). (The ops
computation in the original impl_performance was also broken for matrices).
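Something along these lines would cover the matrix case (hypothetical
sketch; the name and exact form are up to you), with the call site
passing both dimensions of kernel_size() and output_size():

  // Hypothetical 2-D variant: every output element costs a full
  // kernel's worth of multiply/adds, in both dimensions.
  template <typename T>
  struct Op_count_2d
  {
    static length_type
    apply(length_type kernel_rows, length_type kernel_cols,
          length_type output_rows, length_type output_cols)
    {
      return static_cast<length_type>(kernel_rows * kernel_cols *
                                      output_rows * output_cols *
                                      (Ops_info<T>::mul + Ops_info<T>::add));
    }
  };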
> + impl::profile::Scope_event scope_event("convolve_matrix", ops);
> + impl::profile::Time_in_scope scope(this->timer_);
> for (dimension_type d=0; d<dim; ++d)
> assert(in.size(d) == this->input_size()[d].size());
> {
> if (!strcmp(what, "mflops"))
> {
> - int count = timer_.count();
> length_type const M = this->kernel_size()[0].size();
> length_type const P = this->output_size()[0].size();
> - float ops = 2.f * count * P * M;
> - return ops / (1e6*timer_.total());
> + int ops = impl::conv::Op_count<T>::apply(M, P);
> + return timer_.count() * ops / (1e6 * timer_.total());
[3b] also correct matrix ops count here.
> @@ -124,7 +141,11 @@
> Matrix<T, Block2> out)
> VSIP_NOTHROW
> {
> - impl::profile::Scope_timer t(timer_);
> + length_type const M = this->reference_size()[0].size();
> + length_type const P = this->output_size()[0].size();
> + int ops = impl::corr::Op_count<T>::apply(M, P);
[3c] matrix ops count
> + impl::profile::Scope_event scope_event("correlate_matrix", ops);
> + impl::profile::Time_in_scope scope(this->timer_);
>
> for (dimension_type d=0; d<dim; ++d)
> {
> @@ -142,11 +163,10 @@
> {
> if (!strcmp(what, "mflops"))
> {
> - int count = timer_.count();
> - length_type const M = this->kernel_size()[0].size();
> + length_type const M = this->reference_size()[0].size();
> length_type const P = this->output_size()[0].size();
> - float ops = 2.f * count * P * M;
> - return ops / (1e6*timer_.total());
> + int ops = impl::corr::Op_count<T>::apply(M, P);
[3d] matrix ops count
> + return timer_.count() * ops / (1e6 * timer_.total());
> +
> +template <typename T>
> +struct Ops_info
> +{
> + static int const div = 1;
> + static int const sqr = 1;
> + static int const mul = 1;
> + static int const add = 1;
[4] If we change these from 'int' to 'unsigned', we can get rid of the
static_casts in the Op_count::apply functions.
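I.e. (sketch):

  template <typename T>
  struct Ops_info
  {
    static unsigned const div = 1;
    static unsigned const sqr = 1;
    static unsigned const mul = 1;
    static unsigned const add = 1;
  };

  // ... after which Op_count<T>::apply can simply be
  //   return kernel_len * output_len * (Ops_info<T>::mul + Ops_info<T>::add);
  // with no cast, since every operand in the expression is unsigned.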
>
> +length_type
> +op_count(length_type len)
> +{
> + return static_cast<length_type>(5 * len * std::log((float)len) /
> + std::log(2.f));
> +}
> +
[5] This should take into account the FFT input/output types to
determine if the transform is C to C (in which case the ops count is
correct at 5 * len * log2(len)), or R to C / C to R (in which case the
ops count is halved).
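For example (rough sketch; the choice could equally well be keyed off
the Fft's input/output view types rather than a flag):

  // Hypothetical sketch: 5 N log2(N) for a complex<->complex
  // transform, roughly half that for R->C / C->R.
  inline length_type
  op_count(length_type len, bool complex_to_complex)
  {
    length_type ops = static_cast<length_type>(
      5 * len * std::log((float)len) / std::log(2.f));
    return complex_to_complex ? ops : ops / 2;
  }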
--
Jules Bergmann
CodeSourcery
jules at codesourcery.com
(650) 331-3385 x705