How to auto-resume torchrun multi-node training after hitting SLURM time limit? #20263
Unanswered
amorehead asked this question in DDP / multi-GPU / multi-node
Hello. I have recently been trying to set up fault-tolerant (and time-limit-tolerant) multi-node training on a SLURM cluster I have access to. I can successfully train a model with `torchrun` using 2 nodes and 4 GPUs per node. However, when my SLURM job hits its time limit (e.g., after 1 hour), the job does not get resubmitted automatically (as it would when using the `SLURMEnvironment` plugin), nor does my rendezvous node automatically restart the timed-out workers. My question is: what is the standard process for setting up auto-restarts with `torchrun` and PyTorch Lightning on a SLURM cluster?
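
For concreteness, here is a minimal sketch of the kind of training script I mean (the toy model, the `checkpoints/` directory, and the exact `torchrun` flags in the comment are placeholders rather than my real code). It saves a `last.ckpt` so a resubmitted job could resume, but nothing currently triggers the resubmission or worker restart itself:

```python
# Launched on every node with something like:
#   torchrun --nnodes=2 --nproc_per_node=4 \
#       --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 train.py
import os

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.callbacks import ModelCheckpoint


class ToyModel(LightningModule):
    """Stand-in for the real model; only here to make the sketch runnable."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def main():
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    loader = DataLoader(dataset, batch_size=64)

    # Periodic checkpoints so a requeued job has something to resume from.
    checkpoint_cb = ModelCheckpoint(dirpath="checkpoints/", save_last=True)

    trainer = Trainer(
        accelerator="gpu",
        devices=4,       # 4 GPUs per node
        num_nodes=2,     # 2 nodes
        strategy="ddp",
        max_epochs=10,
        callbacks=[checkpoint_cb],
    )

    # Resume from the last checkpoint if one exists (e.g., after a restart),
    # but note that nothing here requeues the SLURM job when it times out.
    last_ckpt = "checkpoints/last.ckpt"
    trainer.fit(
        ToyModel(),
        loader,
        ckpt_path=last_ckpt if os.path.exists(last_ckpt) else None,
    )


if __name__ == "__main__":
    main()
```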