[pooma-dev] Re: [PATCH] Fix deadlocks in MPI reduction evaluators
Jeffrey D. Oldham
oldham at codesourcery.com
Fri Jan 16 02:58:21 UTC 2004
Richard Guenther wrote:
> On Tue, 13 Jan 2004, Jeffrey D. Oldham wrote:
>
>
>>Richard Guenther wrote:
>>
>>>Hi!
>>>
>>>The following patch is necessary to avoid deadlocks with the MPI
>>>implementation and multi-patch setups where one context does not
>>>participate in the reduction.
>>>
>>>Fixes the failure of array_test_.. (I don't remember which) with MPI.
>>>
>>>Basically the scenario is that the collective synchronous MPI_Gather is
>>>called from ReduceOverContexts<> on the non-participating (and thus
>>>not receiving) contexts while the SendIterates are still in the
>>>scheduler's queue. The contexts participating in the calculation will
>>>then wait forever on the CSem for the ReceiveIterates and patch
>>>reductions to complete.
>>>
>>>So the fix is to make the non-participating contexts wait on the CSem,
>>>too, by using a fake write iterate that is queued after the send iterates
>>>and triggers as soon as they complete.
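
To make the mechanism concrete, here is a rough sketch of that idea. The
types below are simplified stand-ins (a plain FIFO queue and a small
completion counter), not the real POOMA CountingSemaphore, Iterate, or
Scheduler interfaces, and queueSendsThenWait is a made-up helper name:

  #include <condition_variable>
  #include <deque>
  #include <functional>
  #include <mutex>

  // Simplified stand-in for the counting semaphore: wait() blocks until
  // signal() has been called n times.
  class CompletionCounter {
    int remaining_;
    std::mutex m_;
    std::condition_variable cv_;
  public:
    explicit CompletionCounter(int n) : remaining_(n) {}
    void signal() {                   // one queued iterate has completed
      std::lock_guard<std::mutex> l(m_);
      if (--remaining_ <= 0) cv_.notify_all();
    }
    void wait() {                     // block until everything has signalled
      std::unique_lock<std::mutex> l(m_);
      cv_.wait(l, [this] { return remaining_ <= 0; });
    }
  };

  // A send-only context queues its SendIterates and then one extra no-op
  // "write" iterate.  When the scheduler eventually runs that iterate
  // (i.e. after the sends have gone out), it releases the counter; the
  // evaluator calls csem.wait() before entering the collective gather,
  // so it can no longer get there with unrun sends still in the queue.
  void queueSendsThenWait(std::deque<std::function<void()> >& schedulerQueue,
                          std::deque<std::function<void()> >& sendIterates,
                          CompletionCounter& csem)
  {
    for (const auto& send : sendIterates)
      schedulerQueue.push_back(send);                     // the real SendIterates
    schedulerQueue.push_back([&csem] { csem.signal(); }); // fake write iterate
  }

In this sketch the plain FIFO ordering stands in for the scheduler's
data-flow dependence that makes the fake write iterate fire only after the
send iterates have completed.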
>>
>>Instead of adding a fake write iterate, can we adjust the MPI_Gather so
>>that non-participating contexts do not participate?
>
>
> The problem is not easy to tackle in MPI_Gather, as collective
> communication primitives involve all contexts and this can be overcome
> only by creating a new MPI communicator, which is costly. Also I'm not
> sure that this will solve the problem at all.
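
For reference, restricting the collective to the participating contexts
would look roughly like the sketch below. It is only an illustration of
that alternative, not POOMA code; gatherOnParticipants, the participation
flag, and the use of a plain double are all assumptions. Note that
MPI_Comm_split is itself a collective over the parent communicator, which
is part of the cost mentioned above:

  #include <mpi.h>
  #include <numeric>
  #include <vector>

  // Sketch: split off a sub-communicator containing just the participating
  // contexts and perform the gather there.
  double gatherOnParticipants(double local, bool participates, int myRank)
  {
    MPI_Comm reduceComm;
    MPI_Comm_split(MPI_COMM_WORLD,
                   participates ? 0 : MPI_UNDEFINED,  // color
                   myRank,                            // key: keep rank order
                   &reduceComm);

    double result = 0.0;
    if (reduceComm != MPI_COMM_NULL)     // non-participants get MPI_COMM_NULL
    {
      int n, rank;
      MPI_Comm_size(reduceComm, &n);
      MPI_Comm_rank(reduceComm, &rank);
      std::vector<double> partials(n);   // only significant on the root
      MPI_Gather(&local, 1, MPI_DOUBLE,
                 &partials[0], 1, MPI_DOUBLE, 0, reduceComm);
      if (rank == 0)
        result = std::accumulate(partials.begin(), partials.end(), 0.0);
      MPI_Comm_free(&reduceComm);
    }
    return result;
  }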
>
> The problem is that contexts participating only by sending their data to
> a remote context (i.e. contexts that participate but do not compute) have
> no counting semaphore to block on (its height is zero for them). So after
> queuing the send iterates they go straight to the final reduction, which
> is not done via an extra iterate, and block there, never having fired off
> the send iterates in the first place. Ugh. The same of course holds for
> completely non-participating contexts, and even that may be a problem
> because of old unrun iterates.
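
Stripped down to its bones, that hole looks like the following; there is
no real POOMA or MPI here, and the queue, the height variable, and the
output are made up purely for illustration:

  #include <cstdio>
  #include <deque>
  #include <functional>

  int main()
  {
    std::deque<std::function<void()> > schedulerQueue;
    int csemHeight = 0;   // CSem height on a send-only context is zero

    // The send iterate is queued but not yet run.
    schedulerQueue.push_back([] { std::puts("SendIterate runs"); });

    // "Waiting" on a semaphore of height zero falls straight through ...
    while (csemHeight > 0) { /* would park until the iterates signalled */ }

    // ... so the context arrives at the collective gather while the
    // SendIterate is still sitting unrun in the scheduler queue.
    std::printf("entering gather with %zu unrun iterate(s) queued\n",
                schedulerQueue.size());
    return 0;
  }

Running it reports one iterate still queued when the "gather" is reached,
which is exactly the state in which the computing contexts then wait on
their CSem forever.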
>
> So at first I thought of creating a DataObject to hold the reduction
> result, so that we could do the usual data-flow evaluation on it instead
> of ignoring dependencies on it, as we do now. But this turned out to be
> more invasive and I didn't have time to complete it.
>
> So the fake write iterate solves the problem for me, though only partly,
> because I could imagine the problem is still there for completely
> non-participating contexts.
>
> Anyway, I'm not pushing this very hard now, but without the patch
> reductions are guaranteed to deadlock with MPI for me (so there is a race
> even in the case of all-participating contexts, or the intersector is
> doing something strange).
>
> Richard.
I appreciate your finding the difficulty and taking the time to explain
the problem. I am reluctant to add code that is known to be broken in
some situations. Is there a way to mark the code so that 1) the known
brokenness is documented and 2) the program acts sensibly when the
brokenness is encountered?
--
Jeffrey D. Oldham
oldham at codesourcery.com