[vsipl++] Example of parallel processing

Wed Jul 14 14:51:28 UTC 2010

Your response does help, although I would like to clarify a few things
further, if I may.

(1) You stated that foreach_vector is not recommended.  I presume the
recommended way is instead to code an explicit loop using local
subviews, as was done in section 8.2 of the users-guide.pdf file.

(2) You stated that each process runs the exact same code.  Let's take a
distributed scenario where, for example, I have 4 individual machines
named A, B, C, and D.  Each of these machines has 4 processors, so I
have 16 processors on which "work" can be done.  However, only Node A
has a particular input file on disk that contains the data that must
first be read in before being processed.

   (A) Is my understanding correct that the actual binary program must
       physically exist on each and every machine, or does VSIPL++/MPI
       "take care of" sending the necessary instruction codes from the
       master node (in this case Node A) to each of the slave machines?

   (B) Does the input file have to exist on all 4 nodes, or is it
       possible to read the data in on Node A, load the data into an
       appropriate data structure, and then let VSIPL++ "distribute" the
       processing using either foreach_vector or the explicitly coded
       loop?

(3) Is there a complete example somewhere of running a parallel
program?  I found the directory ../sourceryvsipl++-2.2/src/tests/parallel
and there are good examples of coding parallel programs.  At this point,
however, I have found scant documentation on actually running these
programs in parallel mode.  I had to figure out on my own that a
program must be invoked using mpirun, and had I had no prior knowledge
of mpirun I probably wouldn't have even tried it.  This is just a
documentation/example suggestion for those of us who are just starting
to learn parallel programming.  I haven't yet delved into the CUDA parts
of vsipl++ either, so an example runner script would be quite handy and
would probably cut down on some questions.

VSIPL++ shows great promise and at this point at least it seems a bit
easier than trying to code an MPI program from scratch.

Thanks,
Bill
-----Original Message-----
From: Stefan Seefeld [mailto:stefan at codesourcery.com] 
Sent: Tuesday, July 13, 2010 6:05 PM
To: vsipl++ at codesourcery.com
Subject: Re: [vsipl++] Example of parallel processing

Hi Bill,

On 07/13/2010 04:24 PM, Cassanova, Bill wrote:
> Hi all,
>
> I was just wondering if there is a good example available of the
> foreach_vector method for parallel processing.

Please be aware that the foreach_vector function is not part of the 
public Sourcery VSIPL++ API, nor is it part of the VSIPL++ 
specification.

We don't recommend using functions or types from the vsip::impl 
namespace, as we can't make any guarantees about their stability or
support.

That being said, we are currently experimenting with new APIs to address
similar problems, and expect those to be published soon.

> In particular I am thinking of a case with the following constraints:
>
> (1) A very large matrix. Each row or block of rows should be processed
> by a single processor. The assumption is there will be multiple
> processors and that using a parallel processing scheme makes "sense".

OK.

> (2) The primary thread, or using MPI terminology, the root process will
> initialize or otherwise acquire the data.

OK.

> (3) The secondary thread or threads, assuming again MPI methodology,
> mpirun was started with -np of greater than 1.
>
> (4) The secondary threads do the "work" on the matrix, a row or group of
> rows at a time.
>
> (5) The main thread waits until all processing is complete.

You are using a vocabulary from multi-threading that is not quite 
accurate in this context: While you may identify a single process as the
"main" process (typically the one with rank=0), there really is nothing 
particular about that, as far as its work-flow is concerned.

All processes normally execute the exact same code. This is the "Single 
Program Multiple Data" model, which is different from the worker thread 
or thread pool pattern.

Thus, the line

     foreach_vector< tuple<0,1> >( mw, (*matrix) );

is executed by all processes, and there is typically no need to "wait" 
for other processes to reach the same point.
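
As a rough illustration of what "same code, different data" means, here
is a minimal sketch in plain C++. It is not Sourcery VSIPL++ or MPI API:
block_range is a hypothetical helper, and the contiguous block split it
computes is just one common way to divide rows. Every process would call
the same function, and only the rank it passes in differs.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (not part of VSIPL++ or MPI): returns the half-open
// row range [first, last) owned by process `rank` when `nrows` rows are
// split into contiguous blocks across `nproc` processes.
std::pair<std::size_t, std::size_t>
block_range(std::size_t rank, std::size_t nproc, std::size_t nrows)
{
  assert(nproc > 0 && rank < nproc);
  std::size_t const base  = nrows / nproc;  // rows every process gets
  std::size_t const extra = nrows % nproc;  // first `extra` ranks get one more
  std::size_t const first = rank * base + (rank < extra ? rank : extra);
  std::size_t const count = base + (rank < extra ? 1 : 0);
  return std::make_pair(first, first + count);
}
```

With 10 rows and 4 processes, ranks 0 through 3 would own rows [0,3),
[3,6), [6,8) and [8,10). No process hands out work to the others; each
one derives its own share from its rank.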

> I have searched the VSIPL++ distribution and have a working example of
> more than one thread doing "work". I am having trouble understanding how
> the main thread waits until all other processing is done. Using MPI
> terminology, how does one determine when the rank of the process is 0
> and considered the root process and if so wait.

[...]

> foreach_vector< tuple<0,1> >( mw, (*matrix) );

After this line you can assume that all processes have finished this 
function. You may insert a barrier, but that shouldn't actually be 
needed in most cases:

   comm.barrier();

To print out the result, you should indeed use the
"if (comm.rank() == 0)" idiom.
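
The rank-0 guard can be sketched in isolation as follows. This is a
stand-alone illustration only: report_if_root is a hypothetical helper,
and the rank is passed in as a plain argument so that the sketch
compiles without MPI. In a real program the rank would come from the
communicator, as in the comm.rank() call above.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: every process calls this with its own rank, but
// only the process with rank 0 produces any text. All other ranks get an
// empty string, so the result is reported once instead of once per process.
std::string report_if_root(int rank, double result)
{
  assert(rank >= 0);
  if (rank != 0)
    return std::string();  // non-root processes stay silent
  std::ostringstream os;
  os << "result = " << result;
  return os.str();
}
```

All processes still execute the call. The guard only decides which one
of them writes the output.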

I'm not sure whether I actually answered any of your questions. If not, 
let me know.

Thanks,
		Stefan

-- 
Stefan Seefeld
CodeSourcery
stefan at codesourcery.com
(650) 331-3385 x718