Custom DDP implementation to halt when any of the iterable node datasets are exhausted #19557
Unanswered · jamiesalter asked this question in DDP / multi-GPU / multi-node
Summary
It seems like it should be possible to make DDP work with an iterable dataset of unknown length.
How it would work
Each node receives a different part of the dataset. As soon as one of the nodes exhausts its dataset, training on all nodes is halted and the epoch ends. We can then shuffle the dataset and repeat.
Does this make sense, and do you agree it's possible? If so, any recommendations on where to start and what to override? One immediate question is whether I would need to use Lightning Fabric, or whether I can override everything I need within the existing Lightning Trainer.
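To make the idea concrete, here is roughly what I imagine the loop looking like if I drove it manually with Fabric. This is an untested sketch: `make_epoch_loader` is a placeholder for whatever reshuffles the sessions and builds each rank's dataloader, and the model is assumed to return its loss from `forward`.

```python
import torch
import torch.distributed as dist
from lightning.fabric import Fabric


def train(model, optimizer, make_epoch_loader, num_epochs):
    fabric = Fabric(accelerator="gpu", devices=2, strategy="ddp")
    fabric.launch()
    model, optimizer = fabric.setup(model, optimizer)

    for epoch in range(num_epochs):
        # make_epoch_loader (placeholder) reshuffles sessions and returns a fresh
        # dataloader over this rank's shard for the given epoch.
        loader = fabric.setup_dataloaders(make_epoch_loader(epoch, fabric.global_rank))
        it = iter(loader)
        batch = next(it, None)
        while True:
            # 1 if this rank still has a batch, 0 otherwise. Reducing with MIN means
            # every rank stops as soon as any single rank is exhausted.
            flag = torch.tensor(0 if batch is None else 1, device=fabric.device)
            dist.all_reduce(flag, op=dist.ReduceOp.MIN)
            if flag.item() == 0:
                break  # epoch ends on all ranks together
            loss = model(batch)  # placeholder: assume forward returns the loss
            fabric.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
            batch = next(it, None)
```

The same all_reduce trick should presumably also work inside the existing Trainer if I can find the right place to hook it in (see the dataset sketch further down).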
At the moment, I do have DDP running on my iterable dataset, but only if I set a batch limit that is smaller than the total number of batches.
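For context, a fixed cap such as the Trainer's limit_train_batches works because every rank then runs the same number of steps, provided every rank can actually supply that many batches. A sketch with made-up values:

```python
import lightning.pytorch as pl

# Works because every rank processes exactly the same number of batches per epoch,
# so no rank is left waiting for a gradient sync. The cap has to stay below the
# smallest number of batches any rank can actually produce.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    limit_train_batches=500,  # made-up value
)
```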
More details of my setup
I have an iterable dataset made up of a known number of sessions, where each session is time-series data of unknown length. Each session is cut into patches, and the patches are batched and put through the model. When training with multiple dataloader workers, each worker receives a different list of sessions, and each session yields an unknown number of patches.
When training with DDP, I want each node to have a different set of sessions (similar to having multiple workers), but because the sessions are of unknown length, I need to gracefully stop the epoch when one of the nodes exhausts its list of sessions, so DDP doesn't hang waiting to synchronize gradients at the end of a step.
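To show what I mean by the splitting, here is a stripped-down sketch of the dataset side. `load_session_patches` is a placeholder for my actual loader; the only point is that sessions get striped first across DDP ranks and then across dataloader workers:

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info


class SessionPatchDataset(IterableDataset):
    def __init__(self, sessions):
        self.sessions = sessions  # known list of sessions, unknown lengths

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        info = get_worker_info()
        worker_id = info.id if info else 0
        num_workers = info.num_workers if info else 1

        # Stride the session list so every (rank, worker) pair gets a disjoint subset.
        shard = self.sessions[rank * num_workers + worker_id :: world_size * num_workers]
        for session in shard:
            yield from load_session_patches(session)  # placeholder: unknown number of patches
```

Each (rank, worker) pair therefore sees a disjoint set of sessions, but nobody knows in advance how many patches their shard will yield.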
Below is an example of how I might do this, though it might be the totally wrong approach as I'm new to DDP:
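Something along these lines, perhaps: a one-item lookahead in __iter__, plus an all_reduce(MIN) of a "still have data" flag, so every rank ends the epoch at the same step. Untested, and it assumes num_workers=0 so the collective runs in the main process; `patch_stream` is a placeholder for a callable returning this rank's patch iterator (e.g. built from the sharded dataset above).

```python
import torch
import torch.distributed as dist
from torch.utils.data import IterableDataset


class SynchronizedIterableDataset(IterableDataset):
    """Wraps a per-rank patch iterator and stops on every rank as soon as any
    single rank runs out of data."""

    def __init__(self, patch_stream):
        self.patch_stream = patch_stream  # callable returning this rank's iterator

    def __iter__(self):
        it = self.patch_stream()
        nxt = next(it, None)  # one-item lookahead
        while True:
            # 1 if this rank still has data, 0 otherwise.
            have_data = torch.tensor(0 if nxt is None else 1)
            if dist.is_initialized():
                if torch.cuda.is_available():
                    have_data = have_data.cuda()  # NCCL collectives need CUDA tensors
                dist.all_reduce(have_data, op=dist.ReduceOp.MIN)
            if have_data.item() == 0:
                return  # some rank is exhausted -> all ranks end the epoch together
            yield nxt
            nxt = next(it, None)
```

If this works, every rank's dataloader stops at exactly the same batch index, so DDP never waits on a missing gradient sync, and the leftover sessions just get picked up after reshuffling at the start of the next epoch. I'm not sure how it interacts with Lightning's prefetching or with num_workers > 0, which is partly why I'm asking.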