Is there a relationship between ncclTopoSearchNextGpuSort and followPath #1596

Open
zhangdexin opened this issue Feb 7, 2025 · 0 comments

zhangdexin commented Feb 7, 2025

In search.cc, the main effect of followPath seems to be subtracting fwBw from path->list[step]->bw:

static ncclResult_t followPath(struct ncclTopoLinkList* path, struct ncclTopoNode* start, int maxSteps, float bw, int* steps) {
  // ...

  struct ncclTopoNode* node = start;
  for (int step=0; step<maxSteps; step++) {
    struct ncclTopoLink* link = path->list[step];
    struct ncclTopoLink* revLink = NULL;
    float fwBw = link->type == LINK_PCI ? pciBw : bw;

    // ...
    
    SUB_ROUND(link->bw, fwBw);                                // <<  here
    node = link->remNode;
  }
  *steps = maxSteps;
  return ncclSuccess;
}

I also found that ncclTopoSearchNextGpuSort uses path.bw (not path.link.bw) to calculate the score.
In that case, the score is the same in every iteration and is not adjusted dynamically as bandwidth is consumed. Is that a problem?
Is there a relationship between ncclTopoSearchNextGpuSort and followPath?
Should followPath also subtract fwBw from path.bw?
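
To make the question concrete, here is a toy model of the two bandwidth fields. It is only a sketch and not NCCL code: toy_link, toy_path, toy_follow_path, toy_score, and toy_demo are hypothetical names, and the assumption that path.bw is cached once when the path is built is mine. It shows a follow step reducing the per-link bandwidth while a score that reads only path.bw stays unchanged.

```c
#include <assert.h>
#include <stdio.h>

#define TOY_MAX_HOPS 4

/* Toy stand-ins for the real structs; names and layout are assumptions. */
struct toy_link { float bw; };            /* remaining bandwidth on one link */
struct toy_path {
  struct toy_link* list[TOY_MAX_HOPS];    /* links traversed by the path */
  int count;                              /* number of links in list */
  float bw;                               /* cached path bandwidth, assumed fixed after path setup */
};

/* Consume `bw` on every link of the path.
 * Returns 0 on success, or -1 without modifying any link
 * if some link does not have enough bandwidth left. */
static int toy_follow_path(struct toy_path* path, float bw) {
  for (int s = 0; s < path->count; s++)
    if (path->list[s]->bw < bw) return -1;
  for (int s = 0; s < path->count; s++)
    path->list[s]->bw -= bw;
  return 0;
}

/* Ordering score: reads only the cached path-level bandwidth. */
static float toy_score(const struct toy_path* path) { return path->bw; }

/* Small demonstration of the behaviour described in the question. */
static void toy_demo(void) {
  struct toy_link a = { 24.0f }, b = { 12.0f };
  struct toy_path p = { { &a, &b }, 2, 12.0f };

  float before = toy_score(&p);
  int first = toy_follow_path(&p, 12.0f);   /* enough bandwidth on both links */
  float after = toy_score(&p);
  int second = toy_follow_path(&p, 12.0f);  /* link b is now exhausted */

  assert(before == after);                  /* the score did not move */
  printf("score %.1f -> %.1f, links %.1f / %.1f, follow results %d then %d\n",
         before, after, a.bw, b.bw, first, second);
}
```

In this toy model the score only orders the candidates, while the per-link check in toy_follow_path is what rejects a path whose links are already used up. Whether NCCL relies on the same split between ncclTopoSearchNextGpuSort and followPath is exactly what I am asking.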

zhangdexin changed the title from "Why followpath function subtract link->bw?" to "Is there a relationship between ncclTopoSearchNextGpuSort and followPath" on Feb 7, 2025