🚀[FEA]: Distributed Training/Inference: handle scatter/gather better and more consistently #520

Open
stadlmax opened this issue May 23, 2024 · 2 comments
Labels
? - Needs Triage (Need team to review and classify) · distributed (Distributed and model parallel tools) · enhancement (New feature or request)

Comments

@stadlmax (Collaborator)

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request?

Low (would be nice)

Please provide a clear description of the problem you would like to solve.

The problem exists in model-parallel settings where not all ranks hold valid tensors, mainly around the gather and scatter routines.

Scatter

  • scatter assumes a single tensor on a source rank that is distributed in chunks across the other ranks
  • to be able to receive these chunks, however, the other ranks need to know the dtype and other meta-information such as requires_grad in order not to break training pipelines
  • current solutions either require the user to specify this information on each rank or assume empty "dummy" tensors on each rank that carry it; these dummies, however, might not be robust when registered in the compute graphs of autograd frameworks (see the sketch after this list)
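
A minimal sketch of how this could look, assuming plain torch.distributed collectives and equal-sized chunks along dim 0; the function name scatter_with_meta and its arguments are illustrative, not an existing API. The source rank broadcasts each chunk's shape, dtype, and requires_grad before the scatter so that the receiving ranks can allocate matching buffers without user-supplied meta-information:

```python
import torch
import torch.distributed as dist


def scatter_with_meta(tensor=None, src=0, device="cuda", group=None):
    """Scatter `tensor` from rank `src` along dim 0; non-source ranks pass None."""
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)

    # The source rank splits the tensor into equal-sized chunks along dim 0
    # (equal chunk sizes are assumed here to keep the sketch simple).
    if rank == src:
        chunks = list(torch.chunk(tensor, world_size, dim=0))
        meta = [(tuple(c.shape), c.dtype, tensor.requires_grad) for c in chunks]
    else:
        chunks, meta = None, None

    # Broadcast the metadata so every rank knows what it is about to receive.
    obj = [meta]
    dist.broadcast_object_list(obj, src=src, group=group)
    shape, dtype, requires_grad = obj[0][rank]

    # Allocate a receive buffer with the correct shape/dtype, then scatter.
    out = torch.empty(shape, dtype=dtype, device=device)
    dist.scatter(out, scatter_list=chunks, src=src, group=group)
    out.requires_grad_(requires_grad)
    return out
```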

Gather

  • the backward pass of gather is a scatter call, so similar problems arise, although this case can be handled more easily, e.g. by storing meta-data in the corresponding context of the torch.autograd.Function (see the sketch after this list)
  • the main issue rather arises in upstream layers when gather returns None on the participating ranks: it would be more informative to have an object indicating that this None is just the null part of a distributed tensor that is currently valid on rank X
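
A minimal sketch of a gather whose backward is a scatter, again assuming equal-sized dim-0 chunks and plain torch.distributed collectives; the class name GatherToRank is illustrative. The forward stashes the local chunk's metadata in ctx for the backward scatter, and non-destination ranks return an empty dummy tensor, which is exactly the fragile workaround mentioned above:

```python
import torch
import torch.distributed as dist


class GatherToRank(torch.autograd.Function):
    """Gather equal-sized dim-0 chunks onto rank `dst`; backward scatters grads."""

    @staticmethod
    def forward(ctx, local_chunk, dst, group):
        ctx.dst, ctx.group = dst, group
        # Stash the local metadata that the backward scatter will need.
        ctx.local_shape = local_chunk.shape
        ctx.local_dtype = local_chunk.dtype
        ctx.local_device = local_chunk.device

        rank = dist.get_rank(group)
        world_size = dist.get_world_size(group)
        gather_list = (
            [torch.empty_like(local_chunk) for _ in range(world_size)]
            if rank == dst
            else None
        )
        dist.gather(local_chunk.contiguous(), gather_list=gather_list, dst=dst, group=group)
        if rank == dst:
            return torch.cat(gather_list, dim=0)
        # Non-destination ranks return an empty "dummy" tensor so the op stays
        # registered in the autograd graph -- the fragile workaround noted above.
        return local_chunk.new_empty(0)

    @staticmethod
    def backward(ctx, grad_output):
        rank = dist.get_rank(ctx.group)
        world_size = dist.get_world_size(ctx.group)
        # The stashed metadata tells every rank how to receive its gradient chunk.
        grad_local = torch.empty(
            ctx.local_shape, dtype=ctx.local_dtype, device=ctx.local_device
        )
        scatter_list = (
            list(torch.chunk(grad_output, world_size, dim=0)) if rank == ctx.dst else None
        )
        dist.scatter(grad_local, scatter_list=scatter_list, src=ctx.dst, group=ctx.group)
        return grad_local, None, None
```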

Potential Solution

  • in general, the handling of these cases should be made more consistent throughout
  • a potential solution would be to define something like a TensorPlaceholder that carries meta-data on ranks where the tensor is currently not valid and is more informative than a plain None (a minimal sketch follows below)
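
A minimal sketch of what such a placeholder could carry; the class name TensorPlaceholder and its fields are assumptions, not an existing API:

```python
from dataclasses import dataclass

import torch


@dataclass
class TensorPlaceholder:
    """Stands in for a tensor on ranks where it is not materialized."""

    shape: torch.Size
    dtype: torch.dtype
    device: torch.device
    requires_grad: bool
    src_rank: int  # rank on which the real tensor currently lives

    @classmethod
    def from_tensor(cls, t: torch.Tensor, src_rank: int) -> "TensorPlaceholder":
        return cls(t.shape, t.dtype, t.device, t.requires_grad, src_rank)

    def empty_like(self) -> torch.Tensor:
        """Allocate a local buffer matching the remote tensor's metadata."""
        return torch.empty(
            self.shape,
            dtype=self.dtype,
            device=self.device,
            requires_grad=self.requires_grad,
        )
```

Gather could then return a TensorPlaceholder instead of None on the ranks that do not hold the result, and a subsequent scatter (or an upstream layer) could allocate its receive buffers directly from it.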

Describe any alternatives you have considered

No response

@stadlmax added the enhancement and ? - Needs Triage labels May 23, 2024
@akshaysubr added the distributed label May 28, 2024
@mnabian (Collaborator) commented Oct 18, 2024

@stadlmax could you please provide an update on this issue?

@stadlmax (Collaborator, Author)

I would say that the problem still exists in general. In our own implementations, we should have enough workarounds in place to avoid these issues. If we want to simplify the lives of people implementing other distributed solutions, one could think of tackling this issue; I would say low priority.
Orthogonal to that, we can keep monitoring the progress on DTensor in upstream PyTorch, which could be a solution to this minor problem.
