Custom DDP implementation to halt when any of the iterable node datasets are exhausted #19557
Unanswered · jamiesalter asked this question in DDP / multi-GPU / multi-node
Summary
It seems like it should be possible to make DDP work with an iterable dataset of unknown length.
How it would work
Each node receives a different part of the dataset. As soon as one of the nodes exhausts its dataset, training on all nodes is halted and the epoch ends. We can then shuffle the dataset and repeat.
Does this make sense, and do you agree it's possible? If so, any recommendations on where to start and what to override? One immediate question is whether I would need to use Lightning Fabric, or whether I can override everything I need within the existing Lightning Trainer.
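To make the idea concrete, here is roughly what I imagine the loop looking like if I drove it manually with Fabric. This is an untested sketch: `make_epoch_loader` is a placeholder for whatever reshuffles the sessions and builds each rank's dataloader, and the model is assumed to return its loss from `forward`.

```python
import torch
import torch.distributed as dist
from lightning.fabric import Fabric


def train(model, optimizer, make_epoch_loader, num_epochs):
    fabric = Fabric(accelerator="gpu", devices=2, strategy="ddp")
    fabric.launch()
    model, optimizer = fabric.setup(model, optimizer)

    for epoch in range(num_epochs):
        # make_epoch_loader (placeholder) reshuffles sessions and returns a fresh
        # dataloader over this rank's shard for the given epoch.
        loader = fabric.setup_dataloaders(make_epoch_loader(epoch, fabric.global_rank))
        it = iter(loader)
        batch = next(it, None)
        while True:
            # 1 if this rank still has a batch, 0 otherwise. Reducing with MIN means
            # every rank stops as soon as any single rank is exhausted.
            flag = torch.tensor(0 if batch is None else 1, device=fabric.device)
            dist.all_reduce(flag, op=dist.ReduceOp.MIN)
            if flag.item() == 0:
                break  # epoch ends on all ranks together
            loss = model(batch)  # placeholder: assume forward returns the loss
            fabric.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
            batch = next(it, None)
```

The same all_reduce trick should presumably also work inside the existing Trainer if I can find the right place to hook it in (see the dataset sketch further down).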
At the moment, I do have DDP running on my iterable dataset, but only if I set a batch limit that is smaller than the total number of batches.
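For context, a fixed cap such as the Trainer's limit_train_batches works because every rank then runs the same number of steps, provided every rank can actually supply that many batches. A sketch with made-up values:

```python
import lightning.pytorch as pl

# Works because every rank processes exactly the same number of batches per epoch,
# so no rank is left waiting for a gradient sync. The cap has to stay below the
# smallest number of batches any rank can actually produce.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    limit_train_batches=500,  # made-up value
)
```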
More details of my setup
I have an iterable dataset made up of a known number of sessions, where each session is time-series data of unknown length. Each session is cut into patches, and the patches are batched and put through the model. When training with multiple dataloader workers, each worker receives a different list of sessions, and each session yields an unknown number of patches.
When training with DDP, I want each node to have a different set of sessions (similar to having multiple workers), but because the sessions are of unknown length, I need to gracefully stop the epoch when one of the nodes exhausts its list of sessions, so DDP doesn't hang waiting to synchronize gradients at the end of a step.
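To show what I mean by the splitting, here is a stripped-down sketch of the dataset side. `load_session_patches` is a placeholder for my actual loader; the only point is that sessions get striped first across DDP ranks and then across dataloader workers:

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info


class SessionPatchDataset(IterableDataset):
    def __init__(self, sessions):
        self.sessions = sessions  # known list of sessions, unknown lengths

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        info = get_worker_info()
        worker_id = info.id if info else 0
        num_workers = info.num_workers if info else 1

        # Stride the session list so every (rank, worker) pair gets a disjoint subset.
        shard = self.sessions[rank * num_workers + worker_id :: world_size * num_workers]
        for session in shard:
            yield from load_session_patches(session)  # placeholder: unknown number of patches
```

Each (rank, worker) pair therefore sees a disjoint set of sessions, but nobody knows in advance how many patches their shard will yield.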
Below is an example of how I might do this, though it might be the totally wrong approach as I'm new to DDP:
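Something along these lines, perhaps: a one-item lookahead in __iter__, plus an all_reduce(MIN) of a "still have data" flag, so every rank ends the epoch at the same step. Untested, and it assumes num_workers=0 so the collective runs in the main process; `patch_stream` is a placeholder for a callable returning this rank's patch iterator (e.g. built from the sharded dataset above).

```python
import torch
import torch.distributed as dist
from torch.utils.data import IterableDataset


class SynchronizedIterableDataset(IterableDataset):
    """Wraps a per-rank patch iterator and stops on every rank as soon as any
    single rank runs out of data."""

    def __init__(self, patch_stream):
        self.patch_stream = patch_stream  # callable returning this rank's iterator

    def __iter__(self):
        it = self.patch_stream()
        nxt = next(it, None)  # one-item lookahead
        while True:
            # 1 if this rank still has data, 0 otherwise.
            have_data = torch.tensor(0 if nxt is None else 1)
            if dist.is_initialized():
                if torch.cuda.is_available():
                    have_data = have_data.cuda()  # NCCL collectives need CUDA tensors
                dist.all_reduce(have_data, op=dist.ReduceOp.MIN)
            if have_data.item() == 0:
                return  # some rank is exhausted -> all ranks end the epoch together
            yield nxt
            nxt = next(it, None)
```

If this works, every rank's dataloader stops at exactly the same batch index, so DDP never waits on a missing gradient sync, and the leftover sessions just get picked up after reshuffling at the start of the next epoch. I'm not sure how it interacts with Lightning's prefetching or with num_workers > 0, which is partly why I'm asking.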