Patch: Speed Pooma Evaluations

Jeffrey Oldham oldham at codesourcery.com
Tue Dec 18 00:50:38 UTC 2001


1. This patch permits the KCC optimizer to move most of non-data array
accesses out of the inner loop.  By using local variables rather than
references, the optimizer can determine that the inner loops'
assignments do not change the local objects' data members, e.g.,
strides_m.  Thus, these values need not be reloaded inside the inner
loops so the code is similar to hand-coded C loops.  In effect, the
compiler can determine which data members are loop-invariant and which
are not.  Example execution time (seconds) for Linux/KCC's Doof2d with
N=1000 include
 
                  C     Brick   FieldBrick stencil Brick
before change     6.20  9.89    13.29      7.29
after change      6.21  7.48    7.44       7.16
 
2. The idea can be implemented in at least two different ways.
Stephen Smith suggested the idea for the attached patch.  It relies on
two assumptions:
1) cheap, shallow copies and
2) copies of all non-pointer data members.

3. In Mark Mitchell's suggested implementation, container and engine
data members that are invariant during loop iterations are explicitly
stored in LoopInvariant_t structures.  These constant structures are
constructed before the loop and passed to the reads and writes within
the loop.  These operations use the constant structures rather than
the containers' and engines' data members.  Thus, the optimizer can
determine that the uses of the constant data members can be hoisted
out of the loops.  Although this implementation can deliver cleaner
code since we, as smart humans, might be able to determine better
code, it requires much more programmer time and code.  We can always
implement the idea if needed.  A patch for part of the work is
attached.

4. Two other sets of loops could be sped up using a similar technique
but were not.

a. Evaluator/LoopApply.h uses a function object.  Since we do not know
   whether we can copy the object much less copy back into the
   original, I do not know how to transform the loops.

b. Engine/RemoteEngine.h's EngineBlockSerialize could be modified but
   I could not find any user code to confirm the transformation's
   correctness.

Thanks to Mark Mitchell for finding the idea and creating the
technique.  Thanks to Stephen Smith for finding the slicker
implementation.

2001-11-02  Jeffrey D. Oldham  <oldham at codesourcery.com>

        * InlineEvaluator.h
        (KernelEvaluator<InlineKernelTag>::evaluate() for Dim=1..7:
        Use local variables for the left-hand side and the right-hand
        side.  This permits the KCC optimizer to move loop-invariant code
        out of the innermost loop, significantly reducing running times.
        * ReductionEvaluator.h
        (ReductionEvaluator<InlineKernelTag>::evaluate() for Dim=1..7:
        Use local variables for the expression and the accumulator
        variable.  This permits the KCC optimizer to move loop-invariant
        code out of the innermost loop, significantly reducing running
        times.

Tested on       Linux/KCC by compiling Doof2d and running all the
                array regression tests.  Only the inner loops of
                Doof2d and src/Evaluator/tests/ReductionTest1 were
                investigated.  (Doof2d compiled 17Dec using LINUXgcc
                --opt.)
Approved by     Stephen Smith
Applied to      mainline

Thanks,
Jeffrey D. Oldham
oldham at codesourcery.com
-------------- next part --------------
Index: InlineEvaluator.h
===================================================================
RCS file: /home/pooma/Repository/r2/src/Evaluator/InlineEvaluator.h,v
retrieving revision 1.26
diff -c -p -r1.26 InlineEvaluator.h
*** InlineEvaluator.h	2001/04/13 02:15:06	1.26
--- InlineEvaluator.h	2001/11/03 00:15:14
*************** struct KernelEvaluator<InlineKernelTag>
*** 157,165 ****
    {
      CTAssert(Domain::unitStride);
      PAssert(domain[0].first() == 0);
      int e0 = domain[0].length();
      for (int i0=0; i0<e0; ++i0)
!       op(lhs(i0),rhs.read(i0));
    }
  
    template<class LHS,class Op,class RHS,class Domain>
--- 157,167 ----
    {
      CTAssert(Domain::unitStride);
      PAssert(domain[0].first() == 0);
+     LHS localLHS(lhs);
+     RHS localRHS(rhs);
      int e0 = domain[0].length();
      for (int i0=0; i0<e0; ++i0)
!       op(localLHS(i0),localRHS.read(i0));
    }
  
    template<class LHS,class Op,class RHS,class Domain>
*************** struct KernelEvaluator<InlineKernelTag>
*** 169,179 ****
      CTAssert(Domain::unitStride);
      PAssert(domain[0].first() == 0);
      PAssert(domain[1].first() == 0);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      for (int i1=0; i1<e1; ++i1)
        for (int i0=0; i0<e0; ++i0)
! 	op(lhs(i0,i1),rhs.read(i0,i1));
    }
    
    template<class LHS,class Op,class RHS,class Domain>
--- 171,183 ----
      CTAssert(Domain::unitStride);
      PAssert(domain[0].first() == 0);
      PAssert(domain[1].first() == 0);
+     LHS localLHS(lhs);
+     RHS localRHS(rhs);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      for (int i1=0; i1<e1; ++i1)
        for (int i0=0; i0<e0; ++i0)
! 	op(localLHS(i0,i1),localRHS.read(i0,i1));
    }
    
    template<class LHS,class Op,class RHS,class Domain>
*************** struct KernelEvaluator<InlineKernelTag>
*** 184,196 ****
      PAssert(domain[0].first() == 0);
      PAssert(domain[1].first() == 0);
      PAssert(domain[2].first() == 0);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
      for (int i2=0; i2<e2; ++i2)
        for (int i1=0; i1<e1; ++i1)
  	for (int i0=0; i0<e0; ++i0)
! 	  op(lhs(i0,i1,i2),rhs.read(i0,i1,i2));
    }
  
    template<class LHS,class Op,class RHS,class Domain>
--- 188,202 ----
      PAssert(domain[0].first() == 0);
      PAssert(domain[1].first() == 0);
      PAssert(domain[2].first() == 0);
+     LHS localLHS(lhs);
+     RHS localRHS(rhs);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
      for (int i2=0; i2<e2; ++i2)
        for (int i1=0; i1<e1; ++i1)
  	for (int i0=0; i0<e0; ++i0)
! 	  op(localLHS(i0,i1,i2),localRHS.read(i0,i1,i2));
    }
  
    template<class LHS,class Op,class RHS,class Domain>
*************** struct KernelEvaluator<InlineKernelTag>
*** 202,207 ****
--- 208,215 ----
      PAssert(domain[1].first() == 0);
      PAssert(domain[2].first() == 0);
      PAssert(domain[3].first() == 0);
+     LHS localLHS(lhs);
+     RHS localRHS(rhs);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
*************** struct KernelEvaluator<InlineKernelTag>
*** 210,216 ****
        for (int i2=0; i2<e2; ++i2)
  	for (int i1=0; i1<e1; ++i1)
  	  for (int i0=0; i0<e0; ++i0)
! 	    op(lhs(i0,i1,i2,i3),rhs.read(i0,i1,i2,i3));
    }
  
    template<class LHS,class Op,class RHS,class Domain>
--- 218,224 ----
        for (int i2=0; i2<e2; ++i2)
  	for (int i1=0; i1<e1; ++i1)
  	  for (int i0=0; i0<e0; ++i0)
! 	    op(localLHS(i0,i1,i2,i3),localRHS.read(i0,i1,i2,i3));
    }
  
    template<class LHS,class Op,class RHS,class Domain>
*************** struct KernelEvaluator<InlineKernelTag>
*** 223,228 ****
--- 231,238 ----
      PAssert(domain[2].first() == 0);
      PAssert(domain[3].first() == 0);
      PAssert(domain[4].first() == 0);
+     LHS localLHS(lhs);
+     RHS localRHS(rhs);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
*************** struct KernelEvaluator<InlineKernelTag>
*** 233,239 ****
  	for (int i2=0; i2<e2; ++i2)
  	  for (int i1=0; i1<e1; ++i1)
  	    for (int i0=0; i0<e0; ++i0)
! 	      op(lhs(i0,i1,i2,i3,i4),rhs.read(i0,i1,i2,i3,i4));
    }
  
    template<class LHS,class Op,class RHS,class Domain>
--- 243,249 ----
  	for (int i2=0; i2<e2; ++i2)
  	  for (int i1=0; i1<e1; ++i1)
  	    for (int i0=0; i0<e0; ++i0)
! 	      op(localLHS(i0,i1,i2,i3,i4),localRHS.read(i0,i1,i2,i3,i4));
    }
  
    template<class LHS,class Op,class RHS,class Domain>
*************** struct KernelEvaluator<InlineKernelTag>
*** 247,252 ****
--- 257,264 ----
      PAssert(domain[3].first() == 0);
      PAssert(domain[4].first() == 0);
      PAssert(domain[5].first() == 0);
+     LHS localLHS(lhs);
+     RHS localRHS(rhs);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
*************** struct KernelEvaluator<InlineKernelTag>
*** 259,266 ****
  	  for (int i2=0; i2<e2; ++i2)
  	    for (int i1=0; i1<e1; ++i1)
  	      for (int i0=0; i0<e0; ++i0)
! 		op(lhs(i0,i1,i2,i3,i4,i5),
! 		   rhs.read(i0,i1,i2,i3,i4,i5));
    }
  
    template<class LHS,class Op,class RHS,class Domain>
--- 271,278 ----
  	  for (int i2=0; i2<e2; ++i2)
  	    for (int i1=0; i1<e1; ++i1)
  	      for (int i0=0; i0<e0; ++i0)
! 		op(localLHS(i0,i1,i2,i3,i4,i5),
! 		   localRHS.read(i0,i1,i2,i3,i4,i5));
    }
  
    template<class LHS,class Op,class RHS,class Domain>
*************** struct KernelEvaluator<InlineKernelTag>
*** 275,280 ****
--- 287,294 ----
      PAssert(domain[4].first() == 0);
      PAssert(domain[5].first() == 0);
      PAssert(domain[6].first() == 0);
+     LHS localLHS(lhs);
+     RHS localRHS(rhs);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
*************** struct KernelEvaluator<InlineKernelTag>
*** 289,296 ****
  	    for (int i2=0; i2<e2; ++i2)
  	      for (int i1=0; i1<e1; ++i1)
  		for (int i0=0; i0<e0; ++i0)
! 		  op(lhs(i0,i1,i2,i3,i4,i5,i6),
! 		     rhs.read(i0,i1,i2,i3,i4,i5,i6));
    }
  
  private:
--- 303,310 ----
  	    for (int i2=0; i2<e2; ++i2)
  	      for (int i1=0; i1<e1; ++i1)
  		for (int i0=0; i0<e0; ++i0)
! 		  op(localLHS(i0,i1,i2,i3,i4,i5,i6),
! 		     localRHS.read(i0,i1,i2,i3,i4,i5,i6));
    }
  
  private:
Index: ReductionEvaluator.h
===================================================================
RCS file: /home/pooma/Repository/r2/src/Evaluator/ReductionEvaluator.h,v
retrieving revision 1.5
diff -c -p -r1.5 ReductionEvaluator.h
*** ReductionEvaluator.h	2001/09/13 19:27:58	1.5
--- ReductionEvaluator.h	2001/11/03 00:15:14
*************** struct ReductionEvaluator<InlineKernelTa
*** 124,134 ****
    {
      CTAssert(Domain::unitStride);
      PAssert(domain[0].first() == 0);
      int e0 = domain[0].length();
  
!     ret = e.read(0);
      for (int i0 = 1; i0 < e0; ++i0)
!       op(ret, e.read(i0));
    }
  
    template<class T, class Op, class Expr, class Domain>
--- 124,137 ----
    {
      CTAssert(Domain::unitStride);
      PAssert(domain[0].first() == 0);
+     Expr localExpr(e);
      int e0 = domain[0].length();
  
!     T answer(localExpr.read(0));
      for (int i0 = 1; i0 < e0; ++i0)
!       op(answer, localExpr.read(i0));
! 
!     ret = answer;
    }
  
    template<class T, class Op, class Expr, class Domain>
*************** struct ReductionEvaluator<InlineKernelTa
*** 138,150 ****
      CTAssert(Domain::unitStride);
      PAssert(domain[0].first() == 0);
      PAssert(domain[1].first() == 0);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
  
      int i00;
      bool firstLoop = true;
      
!     ret = e.read(0, 0);
      for (int i1 = 0; i1 < e1; ++i1)
        {
          if (firstLoop)
--- 141,154 ----
      CTAssert(Domain::unitStride);
      PAssert(domain[0].first() == 0);
      PAssert(domain[1].first() == 0);
+     Expr localExpr(e);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
  
      int i00;
      bool firstLoop = true;
      
!     T answer(localExpr.read(0, 0));
      for (int i1 = 0; i1 < e1; ++i1)
        {
          if (firstLoop)
*************** struct ReductionEvaluator<InlineKernelTa
*** 155,162 ****
          else
            i00 = 0;
          for (int i0 = i00; i0 < e0; ++i0)
!           op(ret, e.read(i0, i1));
        }
    }
    
    template<class T, class Op, class Expr, class Domain>
--- 159,168 ----
          else
            i00 = 0;
          for (int i0 = i00; i0 < e0; ++i0)
!           op(answer, localExpr.read(i0, i1));
        }
+ 
+     ret = answer;
    }
    
    template<class T, class Op, class Expr, class Domain>
*************** struct ReductionEvaluator<InlineKernelTa
*** 167,172 ****
--- 173,179 ----
      PAssert(domain[0].first() == 0);
      PAssert(domain[1].first() == 0);
      PAssert(domain[2].first() == 0);
+     Expr localExpr(e);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
*************** struct ReductionEvaluator<InlineKernelTa
*** 174,180 ****
      int i00;
      bool firstLoop = true;
      
!     ret = e.read(0, 0, 0);
      for (int i2 = 0; i2 < e2; ++i2)
        for (int i1 = 0; i1 < e1; ++i1)
          {
--- 181,187 ----
      int i00;
      bool firstLoop = true;
      
!     T answer(localExpr.read(0, 0, 0));
      for (int i2 = 0; i2 < e2; ++i2)
        for (int i1 = 0; i1 < e1; ++i1)
          {
*************** struct ReductionEvaluator<InlineKernelTa
*** 186,193 ****
            else
              i00 = 0;
            for (int i0 = i00; i0 < e0; ++i0)
!             op(ret, e.read(i0, i1, i2));
          }
    }
  
    template<class T, class Op, class Expr, class Domain>
--- 193,202 ----
            else
              i00 = 0;
            for (int i0 = i00; i0 < e0; ++i0)
!             op(answer, localExpr.read(i0, i1, i2));
          }
+     
+     ret = answer;
    }
  
    template<class T, class Op, class Expr, class Domain>
*************** struct ReductionEvaluator<InlineKernelTa
*** 199,204 ****
--- 208,214 ----
      PAssert(domain[1].first() == 0);
      PAssert(domain[2].first() == 0);
      PAssert(domain[3].first() == 0);
+     Expr localExpr(e);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
*************** struct ReductionEvaluator<InlineKernelTa
*** 207,213 ****
      int i00;
      bool firstLoop = true;
      
!     ret = e.read(0, 0, 0, 0);
      for (int i3 = 0; i3 < e3; ++i3)
        for (int i2 = 0; i2 < e2; ++i2)
          for (int i1 = 0; i1 < e1; ++i1)
--- 217,223 ----
      int i00;
      bool firstLoop = true;
      
!     T answer(localExpr.read(0, 0, 0, 0));
      for (int i3 = 0; i3 < e3; ++i3)
        for (int i2 = 0; i2 < e2; ++i2)
          for (int i1 = 0; i1 < e1; ++i1)
*************** struct ReductionEvaluator<InlineKernelTa
*** 220,227 ****
              else
                i00 = 0;
              for (int i0 = i00; i0 < e0; ++i0)
!               op(ret, e.read(i0, i1, i2, i3));
            }
    }
  
    template<class T, class Op, class Expr, class Domain>
--- 230,239 ----
              else
                i00 = 0;
              for (int i0 = i00; i0 < e0; ++i0)
!               op(answer, localExpr.read(i0, i1, i2, i3));
            }
+ 
+     ret = answer;
    }
  
    template<class T, class Op, class Expr, class Domain>
*************** struct ReductionEvaluator<InlineKernelTa
*** 234,239 ****
--- 246,252 ----
      PAssert(domain[2].first() == 0);
      PAssert(domain[3].first() == 0);
      PAssert(domain[4].first() == 0);
+     Expr localExpr(e);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
*************** struct ReductionEvaluator<InlineKernelTa
*** 243,249 ****
      int i00;
      bool firstLoop = true;
      
!     ret = e.read(0, 0, 0, 0, 0);
      for (int i4 = 0; i4 < e4; ++i4)
        for (int i3 = 0; i3 < e3; ++i3)
          for (int i2 = 0; i2 < e2; ++i2)
--- 256,262 ----
      int i00;
      bool firstLoop = true;
      
!     T answer(localExpr.read(0, 0, 0, 0, 0));
      for (int i4 = 0; i4 < e4; ++i4)
        for (int i3 = 0; i3 < e3; ++i3)
          for (int i2 = 0; i2 < e2; ++i2)
*************** struct ReductionEvaluator<InlineKernelTa
*** 257,264 ****
                else
                  i00 = 0;
                for (int i0 = i00; i0 < e0; ++i0)
!                 op(ret, e.read(i0, i1, i2, i3, i4));
              }
    }
  
    template<class T, class Op, class Expr, class Domain>
--- 270,279 ----
                else
                  i00 = 0;
                for (int i0 = i00; i0 < e0; ++i0)
!                 op(answer, localExpr.read(i0, i1, i2, i3, i4));
              }
+ 
+     ret = answer;
    }
  
    template<class T, class Op, class Expr, class Domain>
*************** struct ReductionEvaluator<InlineKernelTa
*** 272,277 ****
--- 287,293 ----
      PAssert(domain[3].first() == 0);
      PAssert(domain[4].first() == 0);
      PAssert(domain[5].first() == 0);
+     Expr localExpr(e);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
*************** struct ReductionEvaluator<InlineKernelTa
*** 282,288 ****
      int i00;
      bool firstLoop = true;
      
!     ret = e.read(0, 0, 0, 0, 0, 0);
      for (int i5 = 0; i5 < e5; ++i5)
        for (int i4 = 0; i4 < e4; ++i4)
          for (int i3 = 0; i3 < e3; ++i3)
--- 298,304 ----
      int i00;
      bool firstLoop = true;
      
!     T answer(localExpr.read(0, 0, 0, 0, 0, 0));
      for (int i5 = 0; i5 < e5; ++i5)
        for (int i4 = 0; i4 < e4; ++i4)
          for (int i3 = 0; i3 < e3; ++i3)
*************** struct ReductionEvaluator<InlineKernelTa
*** 297,304 ****
                  else
                    i00 = 0;
                  for (int i0 = i00; i0 < e0; ++i0)
!                   op(ret, e.read(i0, i1, i2, i3, i4, i5));
                }
    }
  
    template<class T, class Op, class Expr, class Domain>
--- 313,322 ----
                  else
                    i00 = 0;
                  for (int i0 = i00; i0 < e0; ++i0)
!                   op(answer, localExpr.read(i0, i1, i2, i3, i4, i5));
                }
+ 
+     ret = answer;
    }
  
    template<class T, class Op, class Expr, class Domain>
*************** struct ReductionEvaluator<InlineKernelTa
*** 313,318 ****
--- 331,337 ----
      PAssert(domain[4].first() == 0);
      PAssert(domain[5].first() == 0);
      PAssert(domain[6].first() == 0);
+     Expr localExpr(e);
      int e0 = domain[0].length();
      int e1 = domain[1].length();
      int e2 = domain[2].length();
*************** struct ReductionEvaluator<InlineKernelTa
*** 324,330 ****
      int i00;
      bool firstLoop = true;
      
!     ret = e.read(0, 0, 0, 0, 0, 0, 0);
      for (int i6 = 0; i6 < e6; ++i6)
        for (int i5 = 0; i5 < e5; ++i5)
          for (int i4 = 0; i4 < e4; ++i4)
--- 343,349 ----
      int i00;
      bool firstLoop = true;
      
!     T answer(localExpr.read(0, 0, 0, 0, 0, 0, 0));
      for (int i6 = 0; i6 < e6; ++i6)
        for (int i5 = 0; i5 < e5; ++i5)
          for (int i4 = 0; i4 < e4; ++i4)
*************** struct ReductionEvaluator<InlineKernelTa
*** 340,347 ****
                    else
                      i00 = 0;
                    for (int i0 = i00; i0 < e0; ++i0)
!                     op(ret, e.read(i0, i1, i2, i3, i4, i5, i6));
                  }
    }
  };
  
--- 359,368 ----
                    else
                      i00 = 0;
                    for (int i0 = i00; i0 < e0; ++i0)
!                     op(answer, localExpr.read(i0, i1, i2, i3, i4, i5, i6));
                  }
+ 
+     ret = answer;
    }
  };
  


More information about the pooma-dev mailing list