Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential bug in group.cc #1599

Open
songhexiang opened this issue Feb 8, 2025 · 1 comment
Open

Potential bug in group.cc #1599

songhexiang opened this issue Feb 8, 2025 · 1 comment

Comments

@songhexiang
Copy link

songhexiang commented Feb 8, 2025

Hi, NCCL Team

static ncclResult_t doLaunches(struct ncclComm* head) {
  ncclResult_t result = ncclSuccess;
  struct ncclComm* cliqueComm0 = head->intraComm0;
  struct ncclComm* cliqueHead = head;
  struct ncclComm* cliqueNextHead;
  bool useBarrier = ncclParamLaunchMode == ncclLaunchModeGroup;
  // This outer loop iterates over cliques of comms which are siblings of the
  // same global entity. We calculate a clique as all comms which have the same
  // `intraComm0` value.
  do {
    struct ncclComm* comm = cliqueHead;
    bool capturingYes = false, capturingNo = false;
    do {
      (ncclCudaGraphValid(comm->planner.capturingGraph) ? capturingYes : capturingNo) = true;
      CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), result, failure);
      NCCLCHECKGOTO(ncclLaunchPrepare(comm), result, failure);
      if (useBarrier) ncclCommIntraBarrierIn(comm, 1);
      comm = comm->groupNext;
    } while (comm != nullptr && comm->intraComm0 == cliqueComm0);
    cliqueNextHead = comm;

In this code, cliqueComm0 is set to head->intraComm0 in the initialization phase;
Then the outer loop use cliqueComm0 to find all comms which have the same intraComm0;
But after the first clique is found, the cliqueComm0 is not updated for the next round, so the next round still use the old intraComm0.
Maybe this is a potential bug, @sjeaugey @gcongiu @kwen2501 @borisfom

@jbachan
Copy link
Collaborator

jbachan commented Feb 8, 2025

You are correct this is a bug. A quick fix is to delete cliqueComm0 and always use cliqueHead->intraComm0 in it's place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants