[pooma-dev] Re: [PATCH] Fix deadlocks in MPI reduction evaluators
Jeffrey D. Oldham
oldham at codesourcery.com
Fri Jan 16 02:58:21 UTC 2004
Richard Guenther wrote:
> On Tue, 13 Jan 2004, Jeffrey D. Oldham wrote:
>
>
>>Richard Guenther wrote:
>>
>>>Hi!
>>>
>>>The following patch is necessary to avoid deadlocks with the MPI
>>>implementation and multi-patch setups where one context does not
>>>participate in the reduction.
>>>
>>>Fixes the failure of array_test_.. (I don't remember which) with MPI.
>>>
>>>Basically the scenario is that the collective synchronous MPI_Gather is
>>>called from ReduceOverContexts<> on the non-participating (and thus
>>>not receiving) contexts while the SendIterates are still in the
>>>scheduler's queue. The contexts participating in the calculation will
>>>then wait forever on the CSem for the ReceiveIterates and patch
>>>reductions to complete.
>>>
>>>So the fix is to make the non-participating contexts wait on the CSem,
>>>too, by using a fake write iterate that is queued after the send iterates
>>>and triggers as soon as they complete.
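
To make the mechanism concrete, here is a rough sketch of that idea. The
types below are simplified stand-ins (a plain FIFO queue and a small
completion counter), not the real POOMA CountingSemaphore, Iterate, or
Scheduler interfaces, and queueSendsThenWait is a made-up helper name:

  #include <condition_variable>
  #include <deque>
  #include <functional>
  #include <mutex>

  // Simplified stand-in for the counting semaphore: wait() blocks until
  // signal() has been called n times.
  class CompletionCounter {
    int remaining_;
    std::mutex m_;
    std::condition_variable cv_;
  public:
    explicit CompletionCounter(int n) : remaining_(n) {}
    void signal() {                   // one queued iterate has completed
      std::lock_guard<std::mutex> l(m_);
      if (--remaining_ <= 0) cv_.notify_all();
    }
    void wait() {                     // block until everything has signalled
      std::unique_lock<std::mutex> l(m_);
      cv_.wait(l, [this] { return remaining_ <= 0; });
    }
  };

  // A send-only context queues its SendIterates and then one extra no-op
  // "write" iterate.  When the scheduler eventually runs that iterate
  // (i.e. after the sends have gone out), it releases the counter; the
  // evaluator calls csem.wait() before entering the collective gather,
  // so it can no longer get there with unrun sends still in the queue.
  void queueSendsThenWait(std::deque<std::function<void()> >& schedulerQueue,
                          std::deque<std::function<void()> >& sendIterates,
                          CompletionCounter& csem)
  {
    for (const auto& send : sendIterates)
      schedulerQueue.push_back(send);                     // the real SendIterates
    schedulerQueue.push_back([&csem] { csem.signal(); }); // fake write iterate
  }

In this sketch the plain FIFO ordering stands in for the scheduler's
data-flow dependence that makes the fake write iterate fire only after the
send iterates have completed.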
>>
>>Instead of adding a fake write iterate, can we adjust the MPI_Gather so
>>that non-participating contexts do not participate?
>
>
> The problem is not easy to tackle in MPI_Gather, as collective
> communication primitives involve all contexts and this can be overcome
> only by creating a new MPI communicator, which is costly. Also I'm not
> sure that this will solve the problem at all.
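
For reference, restricting the collective to the participating contexts
would look roughly like the sketch below. It is only an illustration of
that alternative, not POOMA code; gatherOnParticipants, the participation
flag, and the use of a plain double are all assumptions. Note that
MPI_Comm_split is itself a collective over the parent communicator, which
is part of the cost mentioned above:

  #include <mpi.h>
  #include <numeric>
  #include <vector>

  // Sketch: split off a sub-communicator containing just the participating
  // contexts and perform the gather there.
  double gatherOnParticipants(double local, bool participates, int myRank)
  {
    MPI_Comm reduceComm;
    MPI_Comm_split(MPI_COMM_WORLD,
                   participates ? 0 : MPI_UNDEFINED,  // color
                   myRank,                            // key: keep rank order
                   &reduceComm);

    double result = 0.0;
    if (reduceComm != MPI_COMM_NULL)     // non-participants get MPI_COMM_NULL
    {
      int n, rank;
      MPI_Comm_size(reduceComm, &n);
      MPI_Comm_rank(reduceComm, &rank);
      std::vector<double> partials(n);   // only significant on the root
      MPI_Gather(&local, 1, MPI_DOUBLE,
                 &partials[0], 1, MPI_DOUBLE, 0, reduceComm);
      if (rank == 0)
        result = std::accumulate(partials.begin(), partials.end(), 0.0);
      MPI_Comm_free(&reduceComm);
    }
    return result;
  }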
>
> The problem is that contexts participating only by sending their data to
> a remote context (i.e. contexts that participate but do not compute) have
> no counting semaphore to block on (its height is zero for them). So after
> queuing the send iterates they go straight to the final reduction, which
> is not done via an extra iterate, and block there, never having fired off
> the send iterates in the first place. Ugh. The same of course holds for
> completely non-participating contexts, and even that may be a problem
> because of old unrun iterates.
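
Stripped down to its bones, that hole looks like the following; there is
no real POOMA or MPI here, and the queue, the height variable, and the
output are made up purely for illustration:

  #include <cstdio>
  #include <deque>
  #include <functional>

  int main()
  {
    std::deque<std::function<void()> > schedulerQueue;
    int csemHeight = 0;   // CSem height on a send-only context is zero

    // The send iterate is queued but not yet run.
    schedulerQueue.push_back([] { std::puts("SendIterate runs"); });

    // "Waiting" on a semaphore of height zero falls straight through ...
    while (csemHeight > 0) { /* would park until the iterates signalled */ }

    // ... so the context arrives at the collective gather while the
    // SendIterate is still sitting unrun in the scheduler queue.
    std::printf("entering gather with %zu unrun iterate(s) queued\n",
                schedulerQueue.size());
    return 0;
  }

Running it reports one iterate still queued when the "gather" is reached,
which is exactly the state in which the computing contexts then wait on
their CSem forever.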
>
> So at first I thought of creating a DataObject to hold the reduction
> result, so that we could do the usual data-flow evaluation on it instead
> of ignoring dependencies on it, as we do now. But this turned out to be
> more invasive and I didn't have time to complete it.
>
> So the fake write iterate solves the problem for me, though only partly,
> because I could imagine the problem is still there for completely
> non-participating contexts.
>
> Anyway, I'm not pushing this very hard now, but without the patch
> reductions are guaranteed to deadlock with MPI for me (so there is a race
> even in the case of all-participating contexts, or the intersector is
> doing something strange).
>
> Richard.
I appreciate your finding the difficulty and taking the time to explain
the problem. I am reluctant to add code that is known to be broken in
some situations. Is there a way to mark the code so that 1) the known
brokenness is documented and 2) the program acts sensibly when the
brokenness is encountered?
--
Jeffrey D. Oldham
oldham at codesourcery.com