PyTorch DDP-related fixes and improvements #2028

AdeelH · 2024-01-11T19:42:44Z

Overview

Follow-up to #2018. This PR makes some PyTorch DDP-related fixes and improvements. It also bumps up the required torch and torchvision versions.

Changes include:

Refactor process group initialization and destruction into a context manager (DDPContextManager).
Fix incorrect assumptions about torchrun scenarios (e.g. assuming that dist.is_initialized() is True before dist.init_process_group()).
Download data on "local" master processes instead of only the "global" master process.
Run the whole of Learner.main() within a single DDPContextManager.
Replace time.time() with time.perf_counter().
General refactoring and doc improvements.
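
For readers unfamiliar with the pattern, the first three changes can be illustrated with a minimal sketch. This is not the PR's actual DDPContextManager: the names process_group, is_global_master, and is_local_master are hypothetical, and the sketch assumes a torchrun-style launch, where RANK and LOCAL_RANK are set as environment variables but the process group does not exist until dist.init_process_group() is called. The dist argument is assumed to be the torch.distributed module; it is passed in explicitly so that the sketch can be exercised without a GPU or an installed torch.

```python
import os
from contextlib import contextmanager


def is_global_master(env=os.environ) -> bool:
    """True only for the single process whose global rank is 0."""
    return int(env.get("RANK", 0)) == 0


def is_local_master(env=os.environ) -> bool:
    """True for the process whose local rank is 0, i.e. one per node."""
    return int(env.get("LOCAL_RANK", 0)) == 0


@contextmanager
def process_group(dist, backend="nccl"):
    """Initialize a process group on entry and destroy it on exit.

    Hypothetical helper. `dist` is assumed to be `torch.distributed`
    (or any object exposing the same four functions).
    """
    # Under torchrun, is_initialized() is still False at this point;
    # only the environment variables have been set by the launcher.
    created_here = dist.is_available() and not dist.is_initialized()
    if created_here:
        dist.init_process_group(backend=backend)
    try:
        yield
    finally:
        # Tear down only a group that this context manager created.
        if created_here:
            dist.destroy_process_group()
```

In real use this would presumably look like `import torch.distributed as dist` followed by `with process_group(dist): ...` around the whole training entry point. On a multi-node run, each node has its own local disk, which is why a download step would be gated on is_local_master() rather than is_global_master().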

Checklist

Added unit tests, if applicable
Updated documentation, if applicable
Added needs-backport label if the change should be back-ported to the previous release
PR has a name that won't get you publicly shamed for vagueness

Notes

N/A

Testing Instructions

These changes have been successfully tested on a 2-node, 8-GPU setup.

codecov · 2024-01-11T20:43:19Z

Codecov Report

Attention: 21 lines in your changes are missing coverage. Please review.

Comparison is base (9d034ac) 85.21% compared to head (4314b4f) 84.90%.

Files	Patch %	Lines
...ch_learner/rastervision/pytorch_learner/learner.py	69.56%	21 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2028      +/-   ##
==========================================
- Coverage   85.21%   84.90%   -0.32%     
==========================================
  Files         195      196       +1     
  Lines        9809     9856      +47     
==========================================
+ Hits         8359     8368       +9     
- Misses       1450     1488      +38


bump torch and torchvision versions

812c13e

AdeelH force-pushed the ddp-fixes branch from e9324a8 to 8daf304 on January 11, 2024 19:43

ddp fixes and improvements

4314b4f

AdeelH force-pushed the ddp-fixes branch from 8daf304 to 4314b4f on January 11, 2024 20:18

AdeelH merged commit a378468 into azavea:master on Jan 11, 2024
2 checks passed

AdeelH deleted the ddp-fixes branch on January 17, 2024 21:14
