PyTorch DDP-related fixes and improvements #2028
Merged
Overview
Follow-up to #2018. This PR makes some PyTorch DDP-related fixes and improvements. It also bumps the required `torch` and `torchvision` versions.
Changes include:
- [...] `DDPContextManager`).
- [...] `torchrun` scenarios (e.g. assuming that `dist.is_initialized()` is `True` before `dist.init_process_group()`).
- [...] `Learner.main()` within a single `DDPContextManager`.
- [...] `time.time()` with `time.perf_counter()`.
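The PR text does not state why `time.perf_counter()` is preferred. A common reason for measuring elapsed durations with it is that it is monotonic and high-resolution, whereas `time.time()` follows the wall clock and can jump if the system clock is adjusted. A minimal sketch of that timing pattern follows; the `timed` helper and the `timings` dictionary are hypothetical names for illustration and are not taken from this PR.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results, key):
    """Store the elapsed seconds for the enclosed block in results[key].

    Hypothetical helper for illustration; not part of the PR's code.
    """
    start = time.perf_counter()  # monotonic clock, unaffected by wall-clock changes
    try:
        yield
    finally:
        results[key] = time.perf_counter() - start


timings = {}
with timed(timings, "epoch"):
    # Stand-in workload; a real training loop would go here.
    total = sum(i * i for i in range(100_000))

print(f"epoch took {timings['epoch']:.6f} s")
```

Because `time.perf_counter()` has an undefined reference point, only the difference between two readings is meaningful, which is all an elapsed-time measurement needs.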
Checklist
- [...] `needs-backport` label if the change should be back-ported to the previous release
Notes
N/A
Testing Instructions
These changes have been successfully tested on a 2-node, 8-GPU setup.