[vsipl++] SIMD Loop fusion support for unaligned vectors

Mon May 21 15:32:27 UTC 2007

Assem Salama wrote:
 > Everyone,
 >  This patch adds support for unaligned vectors using loop fusion.

Assem,

This looks promising.  I have a couple of design comments below.  Can 
you address those and then send out an updated patch?

Let me know if you have any questions.

				-- Jules

 >
 > Index: src/vsip/opt/simd/simd.hpp
 > ===================================================================
 > --- src/vsip/opt/simd/simd.hpp	(revision 171547)
 > +++ src/vsip/opt/simd/simd.hpp	(working copy)
 > @@ -143,6 +143,10 @@
 >    static simd_type load_unaligned(value_type const* addr)
 >    { return *addr; }

[1] I see what you are trying to do with this variant of load_unaligned,
however, there are a couple of problems with the design:

  - First the name 'load_unaligned'.

    It implies that a load is being done.

    In faux-SIMD and SSE/SSE2 cases this is true (although in those
    cases it is confusing what x0 and x1 are for since they aren't
    used.

    In the AltiVec case, no load is being done (but x0 and x1 are
    being used).

    From below, it looks like the new 'load_unaligned' is intended to
    be used for AltiVec (where no load is being done).  The faux-SIMD
    and SSE/SSE2 versions are provided to avoid complation error.

    The function should give a good a indication of what it is to be
    used for.  Instead I would call it something like "shift_unaligned"
    or "extract_unaligned".

  - Second, if the function can't easily be defined for SSE/SSE2, instead
    of giving it different behavior (doing a load instead of a
    permute), it would be better to either leave it undefined, or
    have it assert(0).

  - Finally, the permutation vector (sh) can be reused.

Given this, it would expose more efficiency to have two functions:

    	static simd_itype unaligned_permutation(value_type const* addr)
	{
	  return vec_lvsl(0, (value_type*)addr);
	}

	static simd_type permute(simd_type x0, simd_type x1,
				 simd_itype sh)
         {
           return vec_perm(x0, x1, sh);
         }

Does SSE/SSE2 have an equivalent permute?

If so, let's use it.

If not, let's either:

  - have permute and unaligned_permute assert(0), and define a static
    bool in the simd traits as to the presence of permute

    	static bool const has_permute = true/false;

    By defining the functions to assert 0, the 'has_permute' check
    can either be performed at run-time or compile-time.

  - fake it with a union.

 >
 > +  static simd_type load_unaligned(simd_type x0, simd_type x1,
 > +                                  value_type const* addr)
 > +  { return *addr; }
 > +
 >    static simd_type load_scalar(value_type value)
 >    { return value; }
 >

 > @@ -262,6 +266,13 @@
 >      return vec_perm(x0, x1, sh);
 >    }
 >
 > +  static simd_type load_unaligned(simd_type x0, simd_type x1,
 > +                                  value_type const* addr)
 > +  {
 > +    __vector unsigned char sh = vec_lvsl(0, (value_type*)addr);
 > +    return vec_perm(x0, x1, sh);
 > +  }
 > +
 >    static simd_type load_scalar(value_type value)
 >    {
 >      union

 > @@ -646,6 +676,9 @@
 >    static simd_type load_unaligned(value_type* addr)
 >    { return _mm_loadu_si128((simd_type*)addr); }
 >
 > +  static simd_type load_unaligned(simd_type x0, simd_type x1, 
value_type* addr)
 > +  { return _mm_loadu_si128((simd_type*)addr); }
 > +
 >    static simd_type load_scalar(value_type value)
 >    { return _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
 >  			0, 0, 0, 0, 0, 0, 0, value); }

 > Index: src/vsip/opt/simd/expr_evaluator.hpp
 > ===================================================================
 > --- src/vsip/opt/simd/expr_evaluator.hpp	(revision 171353)
 > +++ src/vsip/opt/simd/expr_evaluator.hpp	(working copy)
 > @@ -43,11 +43,11 @@
 >  namespace simd
 >  {
 >
 > -template <typename BlockT>
 > +template <typename BlockT, bool A>

[2] what is A?  Aligned?  Document, or pick a more descriptive name
(or both :)

You use "IsAligned" in expr_iterator.hpp.  That would be good here too.

 >  struct Proxy_factory
 >  {
 >    typedef Direct_access_traits<typename BlockT::value_type> 
access_traits;
 > -  typedef Proxy<access_traits> proxy_type;
 > +  typedef Proxy<access_traits, A> proxy_type;
 >    typedef typename Adjust_layout_dim<
 >                       1, typename 
Block_layout<BlockT>::layout_type>::type
 >  		layout_type;
 > @@ -62,7 +62,15 @@
 >      return dda.stride(0) == 1 &&
 >        Simd_traits<typename 
BlockT::value_type>::alignment_of(dda.data()) == 0;
 >    }
 > -  static proxy_type
 > +  static bool
 > +  is_aligned(BlockT const& b)
 > +  {
 > +    Ext_data<BlockT, layout_type> dda(b, SYNC_IN);
 > +    return
 > +      Simd_traits<typename 
BlockT::value_type>::alignment_of(dda.data()) == 0;
 > +  }

[3] I don't see how this works.  Since rt_valid still checks alignment,
rt_valid will be false for unaligned data.  There is no to tell whether
rt_valid is false because data is unaligned, or because it doesn't have
unit stride.

Rather than add an 'is_aligned()' method to each proxy, you should
modify 'rt_valid' to account for IsAligned.  When IsAligned (aka A) is
true, it should check alignment and stride.  When IsAlined is false,
it should only check stride:

   static bool
   rt_valid(BlockT const &b)
   {
     Ext_data<BlockT, layout_type> dda(b, SYNC_IN);
     return dda.stride(0) == 1 &&
       (!IsAligned ||
        Simd_traits<typename 
BlockT::value_type>::alignment_of(dda.data()) == 0);
   }

 > @@ -221,20 +262,17 @@
 >      return (dda.stride(0) == 1 &&
 >  	    simd::Simd_traits<typename LB::value_type>::
 >  	      alignment_of(dda.data()) == 0 &&
 > -	    simd::Proxy_factory<RB>::rt_valid(rhs));
 > +	    simd::Proxy_factory<RB, true>::rt_valid(rhs));

[4] Again, I don't see how this works for unaligned data.  rt_valid
will be false if any data is unaligned, preventing this evaluator from
being used.

The is_aligned check below will always be true, because exec() will
only be called when rt_valid is true, i.e. when the data is aligned.

With the rt_valid change suggested above, you could have two
evaluators: Simd_loop_fusion and Simd_loop_fusion_unaligned.  They
could share a common base class that takes alignment as a template
parameter.

(You could have a single evaluator, but that requires either checking
the Proxy's rt_valid a third time (bad), or keeping state between the
evaluator's rt_valid and exec (difficult, since no evaluator object is
created.  Using two evaluators captures that state in the
conditional).

 >    }
 >
 >    static void exec(LB& lhs, RB const& rhs)
 >    {
 >      typedef typename simd::LValue_access_traits<typename 
LB::value_type> WAT;
 > -    typedef typename simd::Proxy_factory<RB>::access_traits EAT;
 >      length_type const vec_size =
 >        simd::Simd_traits<typename LB::value_type>::vec_size;
 >      Ext_data<LB, layout_type> dda(lhs, SYNC_OUT);
 >      length_type const size = dda.size(0);
 >      length_type n = size;
 > -    simd::Proxy<WAT> lp(dda.data());
 > -    simd::Proxy<EAT> rp(simd::Proxy_factory<RB>::create(rhs));
 >  #if 0
 >      // simple iterator-based loop. It has the most concise syntax,
 >      // but generates suboptimal code with gcc 3.4
 > @@ -271,12 +309,35 @@
 >      // loop using proxy interface. This generates the best code
 >      // with gcc 3.4 (with gcc 4.1 the difference to the first case
 >      // above is negligible).
 > -    while (n >= vec_size)
 > -    {
 > -      lp.store(rp.load());
 > -      n -= vec_size;
 > -      lp.increment();
 > -      rp.increment();
 > +
 > +    // If any of the blocks are unaligned, we treat all of the blocks as
 > +    // unaligned
 > +    if(simd::Proxy_factory<RB, true>::is_aligned(rhs)) {
 > +      typedef typename simd::Proxy_factory<RB, true>::access_traits EAT;
 > +
 > +      simd::Proxy<WAT,true> lp(dda.data());
 > +      simd::Proxy<EAT,true> 
rp(simd::Proxy_factory<RB,true>::create(rhs));
 > +
 > +      while (n >= vec_size)
 > +      {
 > +        lp.store(rp.load());
 > +        n -= vec_size;
 > +        lp.increment();
 > +        rp.increment();
 > +      }
 > +    } else {

[5] Coding standards: put braces on separate lines for consistency

 > +      typedef typename simd::Proxy_factory<RB, false>::access_traits 
EAT;
 > +
 > +      simd::Proxy<WAT,false> lp(dda.data());
 > +      simd::Proxy<EAT,false> 
rp(simd::Proxy_factory<RB,false>::create(rhs));
 > +
 > +      while (n >= vec_size)
 > +      {
 > +        lp.store(rp.load());
 > +        n -= vec_size;
 > +        lp.increment();
 > +        rp.increment();
 > +      }
 >      }
 >  #endif
 >      // Process the remainder, using simple loop fusion.
 > Index: src/vsip/opt/simd/expr_iterator.hpp
 > ===================================================================

 > +template <typename T>
 > +class Proxy<Direct_access_traits<T>,false >
 > +{
 > +public:
 > +  typedef T value_type;
 > +  typedef Simd_traits<value_type>                     simd;
 > +  typedef typename simd::simd_type                    simd_type;
 > +
 > +  Proxy(value_type const *ptr) : ptr_unaligned_(ptr)
 > +  {
 > +    ptr_aligned_    = (simd_type*)((intptr_t)ptr & 
~(simd::alignment-1));
 > +
 > +    // We do not need x0 and x1 if we are using sse because sse has
 > +    // a uload intruction.
 > +#if !defined(__SSE__) and !defined(_SSE2__)
 > +    x0_             = simd::load((value_type*)ptr_aligned_);
 > +    x1_             = 
simd::load((value_type*)(ptr_aligned_+simd::vec_size));
 > +#endif

[6] Don't mix the pre-processor and traits like this!

The simd traits class abstracts the interface to the SIMD ISA.  Using
the preprocessor on top of the simd traits creates redundancy that
makes the code more difficult to maintain and understand.

 > +  }
 > +
 > +  simd_type load() const
 > +  {
 > +    return simd::load_unaligned(x0_, x1_, ptr_unaligned_);
 > +  }
 > +
 > +  void set_x0(simd_type x0) { x0_ = x0; }

[7] what is set_x0 for?

 > +
 > +  void increment(length_type n = 1)
 > +  {
 > +    ptr_unaligned_ += n * Simd_traits<value_type>::vec_size;
 > +    ptr_aligned_   += n;
 > +
 > +    // We do not need x0 and x1 if we are using sse because sse has
 > +    // a uload intruction.
 > +#if !defined(__SSE__) and !defined(_SSE2__)
 > +    // update x0
 > +    x0_ = (n == 1)? x1_:simd::load((value_type*)ptr_aligned_);
 > +
 > +    // update x1
 > +    x1_ = simd::load((value_type*)(ptr_aligned_+simd::vec_size));
 > +#endif
 > +  }
 > +
 > +private:
 > +  simd_type  x0_;
 > +  simd_type  x1_;
 > +
 > +  simd_type  const *ptr_aligned_;
 > +  value_type const *ptr_unaligned_;
 > +};

[8] You're using the preprocessor because some SIMD ISAs support
permute (AltiVec), and some don't (SSE).  Instead, let's add a static
const 'has_premute') that indicates the presence of permute, as
suggested above in [1].

Then you can make the use of permute conditional.

Becuase using permute requires storage for x0 and x1, the best
time to make the decision is at compile-time with a template
parameter:

	template <typename T>
	class Proxy<Direct_access_traits<T>, false>
	  : public Proxy_direct_access_helper<T, Simt_traits<T>::has_permute>
	{
	   ...
	};

Then you can specialize Proxy_direct_access_helper to use permute
if it is available

	template <typename T,
		  bool     HasPermute>
	struct Proxy_direct_access_helper;

	template <typename T>
	struct Proxy_direct_access_helper<T, true>
	{
	  Proxy(value_type const *ptr)
	    : ptr_unaligned_(ptr)
	      ptr_aligned_  ((simd_type*)((intptr_t)ptr & ~(simd::alignment-1))),
	      x1_ 
(simd::load((value_type*)(ptr_aligned_+simd::vec_size))),
	      perm_         (simd::unaligned_permuation((value_type*)(ptr)))
	  {}

	  simd_type load() const
	  {
	    // update x0
	    x0_ = (n == 1)? x1_:simd::load((value_type*)ptr_aligned_);

	    // update x1
	    x1_ = simd::load((value_type*)(ptr_aligned_+simd::vec_size));

	    return simd::permute(x0_, x1_, perm_);
	  }

	  void increment(length_type n = 1)
	  {
	    ptr_unaligned_ += n * Simd_traits<value_type>::vec_size;
	    ptr_aligned_   += n;

	  }

	private:
	  simd_type  x0_;
	  simd_type  x1_;
	  simd_type  perm_;

	  simd_type  const *ptr_aligned_;
	  value_type const *ptr_unaligned_;
	};

 > @@ -522,7 +578,7 @@
 >    B b_;
 >    C c_;
 >  };
 > -
 > +/*
 >  template <typename T>
 >  struct Iterator
 >  {
 > @@ -548,7 +604,7 @@
 >    r += n;
 >    return r;
 >  }
 > -
 > +*/

[9] Why is Iterator being commented out?

 >  } // namespace vsip::impl::simd
 >  } // namespace vsip::impl
 >  } // namespace vsip

-- 
Jules Bergmann
CodeSourcery
jules at codesourcery.com
(650) 331-3385 x705