mpi: fix setting local ranks
In very constrained environments like GHA it may sometimes happen that
some ranks _completely_ finish (i.e. call destroy) before others have
even called Init.

With our current implementation, this meant that the world may have been
removed from the registry, even though there were still active local
ranks to run.

To fix this, we set the number of active local ranks once, at the
beginning, and decrement it on a per-thread basis. Note that this does
not invalidate the fix for the previous race condition because, in fact,
what that fix guarantees is that we now _always_ init a world when we
execute in it.
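
As a rough illustration of that scheme (a minimal sketch, not the actual
faabric implementation; only the activeLocalRanks name comes from the diff
below, the class and method names are hypothetical), the counter is stored
once during initialisation and each rank's thread decrements it on destroy,
so only the last local rank removes the world from the registry:

    #include <atomic>
    #include <cstddef>

    // Sketch of the counter pattern described in the commit message; the
    // surrounding class and method names are hypothetical.
    class WorldSketch
    {
      public:
        // Set once, while initialising the world on this host, to the number
        // of ranks that will run locally.
        void setActiveLocalRanks(std::size_t numLocalRanks)
        {
            activeLocalRanks.store(numLocalRanks, std::memory_order_release);
        }

        // Called from each rank's thread when it destroys its view of the
        // world. Returns true only for the last local rank, which is then
        // the only one allowed to remove the world from the registry.
        bool destroyLocalRank()
        {
            return activeLocalRanks.fetch_sub(1, std::memory_order_acq_rel) ==
                   1;
        }

      private:
        std::atomic<std::size_t> activeLocalRanks{ 0 };
    };

Because fetch_sub returns the previous value, the "am I the last local rank"
check is atomic with the decrement, so two ranks can never both conclude they
should tear the world down.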
csegarragonz committed Apr 21, 2024
1 parent 5b65b3e commit 4f5a928
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion src/mpi/MpiWorld.cpp
@@ -276,7 +276,6 @@ void MpiWorld::initialiseFromMsg(faabric::Message& msg)
 void MpiWorld::initialiseRankFromMsg(faabric::Message& msg)
 {
     rankState.msg = &msg;
-    activeLocalRanks++;

     // Pin this thread to a free CPU
 #ifdef FAABRIC_USE_SPINLOCK
@@ -343,6 +342,11 @@ void MpiWorld::initLocalRemoteLeaders()

     // Persist the local leader in this host for further use
     localLeader = (*ranksForHost[thisHost].begin());
+
+    // Lastly, set the number of local ranks to know when it is safe to remove
+    // the world from the registry
+    activeLocalRanks.store(ranksForHost.at(thisHost).size(),
+                           std::memory_order_release);
 }

 void MpiWorld::getCartesianRank(int rank,