[vsipl++] patch: SIMD loop fusion

Wed Jul 26 21:06:54 UTC 2006

Stefan Seefeld wrote:
 > The attached patch adds support for SIMD (i.e. vectorized) loop fusion
 > to our expression block evaluation harness. It is enabled for homogenous
 > scalar expressions (i.e. all leaf blocks have float or double value_type)
 > on platforms supporting SSE(2) instructions (such as 'gcc -msse2').

Stefan,

Excellent!  I have some comments below and a general comment:

  - We need to make sure that mixed types are handled properly
    I.e. we need to avoid dispatching on expressions like:

	Vector<float>  fA, fB;
	Vector<double> dC;

	dC = fA + fB; // 1

	fA = dC + fB; // 2

    IIUC, the Proxy_factor dectects #2.  We should extend the expression
    evalutor to catch #1.

plus a couple of general nits:

  - When you use single-char template parameters, can you document
    what they are?  i.e. "B is a block type"?

  - There are a few 80+ character lines, can you split them?

Once these are fixed, please check this in.

				thanks,
				-- Jules

 > ------------------------------------------------------------------------

 > Index: src/vsip/impl/simd/expr_evaluator.hpp
 > ===================================================================

B is a block?  Can you capture that in a comment?  Likewise below for
other single character template parameters (other than T).

 > +template <typename B>
 > +struct Proxy_factory
 > +{
 > +  typedef Direct_access_traits<typename B::value_type> access_traits;
 > +  typedef Proxy<access_traits> proxy_type;
 > +  static bool const ct_valid = true;

I think this should check if block B has direct access.

	ct_valid = Ext_data_cost<B>::value == 0;

 > +
 > +  static bool
 > +  rt_valid(B const &b)
 > +  {
 > +    Ext_data<B> dda(b, SYNC_IN);
 > +    return dda.stride(0) == 1 &&
 > +      Simd_traits<typename B::value_type>::alignment_of(dda.data()) 
== 0;
 > +  }
 > +  static proxy_type
 > +  create(B const &b)
 > +  {
 > +    Ext_data<B> dda(b, SYNC_IN);
 > +    return proxy_type(dda.data());
 > +  }
 > +};

 > +template <typename LB,
 > +	  typename RB>
 > +struct Serial_expr_evaluator<1, LB, RB, Simd_tag>
 > +{
 > +  static bool const ct_valid =
 > +    // Is SIMD supported at all ?
 > +    simd::Simd_traits<typename LB::value_type>::is_accel &&
 > +    // Check that direct access is possible.
 > +    Ext_data_cost<LB>::value == 0 &&
 > +    simd::Proxy_factory<RB>::ct_valid;

To be safe, this should check if LB::value_type == RB::value_type

 > +
 > +  static bool rt_valid(LB& lhs, RB const& rhs)
 > +  {
 > +    Ext_data<LB> dda(lhs, SYNC_OUT);
 > +    return (dda.stride(0) == 1 &&
 > +	    simd::Simd_traits<typename LB::value_type>::
 > +	      alignment_of(dda.data()) == 0 &&
 > +	    simd::Proxy_factory<RB>::rt_valid(rhs));
 > +  }
 > +
 > +  static void exec(LB& lhs, RB const& rhs)
 > +  {

Maybe these types could named be LAT/RAT to follow LB/RB and li/ri?

 > +    typedef typename simd::LValue_access_traits<typename 
LB::value_type> WAT;
 > +    typedef typename simd::Proxy_factory<RB>::access_traits EAT;
 > +    length_type const vec_size = simd::Simd_traits<typename 
LB::value_type>::vec_size;
 > +    Ext_data<LB> dda(lhs, SYNC_OUT);
 > +    length_type size = dda.size(0) / vec_size;
 > +    simd::Proxy<WAT> lp(dda.data());
 > +    simd::Iterator<WAT> li(lp);
 > +    // Map the expression block to an expression iterator and loop.
 > +    simd::Iterator<EAT> ri(simd::Proxy_factory<RB>::create(rhs));
 > +    while (size--) *li++ = *ri++;
 > +    // Process the remainder, using simple loop fusion.
 > +    size = dda.size(0);
 > +    length_type remainder = size % vec_size;
 > +    for (index_type i = size - remainder; i != size; ++i) lhs.put(i, 
rhs.get(i));
 > +  }

 > Index: src/vsip/impl/simd/expr_iterator.hpp
 > ===================================================================

 > +template <typename T>
 > +struct Unary_operator_map<T, op::Plus>
 > +{
 > +  typedef typename Simd_traits<T>::simd_type simd_type;
 > +  static bool const is_supported = true;
 > +  static simd_type
 > +  apply(simd_type const &op)
 > +  { return Simd_traits<T>::add(Simd_traits<T>::zero(), op);}

This looks correct, but it would be faster to just return 'op' directly.

 > +template <typename T>
 > +class Proxy<Scalar_access_traits<T> >
 > +{
 > +public:
 > +  typedef T value_type;
 > +  typedef typename Simd_traits<value_type>::simd_type simd_type;
 > +
 > +  Proxy(value_type value) : value_(value) {}
 > +
 > +  simd_type load() const
 > +  { return Simd_traits<value_type>::load_scalar_all(value_);}

You could create a simd_valud_ member variable to avoid the call to
'load_scalar_all' each time load is called.  This increases the SIMD
register pressure, but it would be a win in simple expressions like:

	vector = scalar * vector;

 > +
 > +  void increment() {}
 > +
 > +private:
 > +  value_type value_;
 > +};

-- 
Jules Bergmann
CodeSourcery
jules at codesourcery.com
(650) 331-3385 x705