static ncclResult_t doLaunches(struct ncclComm* head) {
  ncclResult_t result = ncclSuccess;
  struct ncclComm* cliqueComm0 = head->intraComm0;
  struct ncclComm* cliqueHead = head;
  struct ncclComm* cliqueNextHead;
  bool useBarrier = ncclParamLaunchMode == ncclLaunchModeGroup;
  // This outer loop iterates over cliques of comms which are siblings of the
  // same global entity. We calculate a clique as all comms which have the same
  // `intraComm0` value.
  do {
    struct ncclComm* comm = cliqueHead;
    bool capturingYes = false, capturingNo = false;
    do {
      (ncclCudaGraphValid(comm->planner.capturingGraph) ? capturingYes : capturingNo) = true;
      CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), result, failure);
      NCCLCHECKGOTO(ncclLaunchPrepare(comm), result, failure);
      if (useBarrier) ncclCommIntraBarrierIn(comm, 1);
      comm = comm->groupNext;
    } while (comm != nullptr && comm->intraComm0 == cliqueComm0);
    cliqueNextHead = comm;
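In this code, cliqueComm0 is set to head->intraComm0 once, in the initialization phase.
The outer loop then uses cliqueComm0 to find all comms that have the same intraComm0.
But after the first clique is found, cliqueComm0 is not updated for the next round, so the next round still compares against the old intraComm0. If that reading is right, the inner do-while would run once for the next clique head and then exit, because that comm's successor no longer matches the stale value, so every clique after the first would contain a single comm.
Maybe this is a potential bug. @sjeaugey @gcongiu @kwen2501 @borisfom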
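To make the concern concrete, here is a minimal standalone sketch of the same loop shape. It is not NCCL code: `Comm`, `cliqueSizes`, `demoCliqueSizes`, and the `refreshKey` flag are hypothetical names, the clique key is an `int` instead of a comm pointer, and the outer loop's termination is an assumption because the quoted excerpt ends before that point.

```cpp
#include <vector>

// Hypothetical stand-in for ncclComm: only the two fields the loop reads.
struct Comm {
  int intraComm0;   // clique key (a comm pointer in the real struct)
  Comm* groupNext;  // next comm in the group list
};

// Walks the list and returns the size of each clique it finds.
// refreshKey=false mimics the quoted code (key read once from head);
// refreshKey=true re-reads the key from each new clique head.
// Like the quoted code, this assumes head is non-null.
std::vector<int> cliqueSizes(Comm* head, bool refreshKey) {
  std::vector<int> sizes;
  int key = head->intraComm0;
  Comm* cliqueHead = head;
  while (cliqueHead != nullptr) {  // assumed outer-loop termination
    if (refreshKey) key = cliqueHead->intraComm0;
    Comm* comm = cliqueHead;
    int n = 0;
    do {
      ++n;  // stands in for the per-comm launch preparation
      comm = comm->groupNext;
    } while (comm != nullptr && comm->intraComm0 == key);
    sizes.push_back(n);
    cliqueHead = comm;  // next clique starts where this one stopped
  }
  return sizes;
}

// Four comms forming two cliques with keys 1, 1, 2, 2.
std::vector<int> demoCliqueSizes(bool refreshKey) {
  Comm c[4] = {{1, nullptr}, {1, nullptr}, {2, nullptr}, {2, nullptr}};
  for (int i = 0; i < 3; ++i) c[i].groupNext = &c[i + 1];
  return cliqueSizes(c, refreshKey);
}
```

With the stale key, the sketch splits the second clique into two cliques of one comm each (sizes 2, 1, 1). With the key refreshed per clique head, it finds two cliques of two (sizes 2, 2). Whether the real function behaves this way depends on the part of doLaunches that the excerpt does not show.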