Temporary copies do appear...??

Fri May 21 08:07:39 UTC 2004

Dear POOMA developers,

My name is Radek and I am a researcher at the University of Oxford, UK.
I have started using POOMA in our numerical model of liquid crystals 
in 3D. I feel that it is a suitable tool for this challenging problem, 
where we have 10 unknowns at each node of the finite-element mesh. If 
things go well, we will be happy to acknowledge POOMA in all our 
subsequent publications.

The reason I am contacting you today is to report a possible POOMA 
problem that I encountered while testing my POOMA/PETE-based 
implementation of an automatic-differentiation class, which otherwise 
works perfectly (I can share the code with you if you are interested, 
by the way). I have already found a couple of minor POOMA bugs, such as:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
- although Tensor.h:338 claims:
// The format is: ((t(0,0) t(1,0),... ) ( t(0,1) t(1,1) ... ) ... ))
  the actual output format is:
// The format is: ((t(0,0) t(0,1),... ) ( t(1,0) t(1,1) ... ) ... ))
- this is also inconsistent with TinyMatrix, where i and j are swapped
  (compare Tensor.h:361 and TinyMatrix.h:236)
====================================================================
- line /src/Tiny/VectorOperators.h:189
inline typename BinaryReturn< Vector<D,T1,E>, T2, TAG >::Type_t
  should correctly be:
inline typename BinaryReturn< T1, Vector<D,T2,E>, TAG >::Type_t
- this error may cause problems if T1 and T2 are different types and
  stricter type conversions are imposed
====================================================================
- line /src/DynamicArray/DynamicArray.h:373
: Array<Dim, T, EngineTag>(s1, model)
  should correctly be:
: Array<1, T, EngineTag>(s1, model)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

but the problem I would like to describe in the rest of this email 
seems more serious.

Basically, simple algebraic expressions based on the tiny Vector class 
do create temporary Full-engine copies of individual subexpressions, 
contrary to what POOMA claims to prevent. The following short main 
program:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#include "Pooma/Arrays.h"

int main(int argc, char* argv[])
{
  Pooma::initialize(argc, argv);

  Vector<2> v1(1, 2), v2;
  v2 = v1*v1 + v1*v1;

  Pooma::finalize();
  return 0;
}
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

was tested after instrumenting the file /src/Tiny/Vector.h with the 
following line:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
PrintTypeName(this); PrintTypeName(x); std::cout << std::endl;
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

to the Vector(const X& x)-constructor on line 117 and the
VectorEngine(const X& x)-constructor on line 290. The diagnostic 
function PrintTypeName is defined as:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
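#include <cstdlib>   // std::system
#include <sstream>   // std::ostringstream
#include <typeinfo>  // typeid

// Passes the mangled type name of t to the c++filt tool for demangling
template<class T> inline void PrintTypeName(const T& t)
{
  std::ostringstream out;
  out << "c++filt " << typeid(t).name();
  std::system(out.str().c_str());
}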
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

where the GNU tool c++filt demangles the type names. The executable 
was built under SuSE Linux 9.0 (i586) with the following optimising 
g++ (v. 3.3.1) command:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
g++ -ftemplate-depth-60 -Drestrict=__restrict__ -fno-exceptions \
    -DNOPAssert -DNOCTAssert -O2 -fno-default-inline -funroll-loops \
    -fstrict-aliasing -o Main Main.cpp \
    -I$HOME/lib/Optim/POOMA/linux/lib/PoomaConfiguration-gcc \
    -I$HOME/lib/Optim/POOMA/linux/src \
    -I$HOME/lib/Optim/POOMA/linux/lib -fno-exceptions \
    -L$HOME/lib/Optim/POOMA/linux/lib -lpooma-gcc -lm
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

The program produces the following output:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
VectorEngine<2, double, Full>*
Vector<2, double, BinaryVectorOp<Vector<2, double, Full>, Vector<2, double, Full>, OpMultiply> >

Vector<2, double, Full>*
Vector<2, double, BinaryVectorOp<Vector<2, double, Full>, Vector<2, double, Full>, OpMultiply> >

VectorEngine<2, double, Full>*
Vector<2, double, BinaryVectorOp<Vector<2, double, Full>, Vector<2, double, Full>, OpMultiply> >

Vector<2, double, Full>*
Vector<2, double, BinaryVectorOp<Vector<2, double, Full>, Vector<2, double, Full>, OpMultiply> >

VectorEngine<2, double, Full>*
Vector<2, double, BinaryVectorOp<Vector<2, double, Full>, Vector<2, double, Full>, OpAdd> >

Vector<2, double, Full>*
Vector<2, double, BinaryVectorOp<Vector<2, double, Full>, Vector<2, double, Full>, OpAdd> >
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Evidently, each of the three operations in the expression 
v1*v1 + v1*v1 (two multiplications and one addition) makes a 
Full-engine copy of its BinaryVectorOp-engine result. Is this the 
intended behaviour, or am I doing something wrong? Do I need a 
better-optimising compiler (I have already ordered the latest Intel 
ICC, the successor of KAI C++) or any other command-line flags? If 
there is any way to prevent this waste of resources, I would very much 
appreciate your help.

Sincerely,
Radek

__________________________________
Dr. Radek Pecher
Research Assistant
Department of Engineering Science
University of Oxford
Parks Road, Oxford, OX1 3PJ, UK
Tel:  +44 (0)1865 273044
Fax: +44 (0)1865 273905
radek.pecher at eng.ox.ac.uk