Mitigating program hang from on_train_epoch_end() with self.all_gather() call #20294
isaacgerg asked this question in DDP / multi-GPU / multi-node (unanswered)
I am trying to manually save my loss (a single scalar) each epoch and then print it out in on_train_epoch_end().
I append it with self.train_loss.append(loss.item()) in my training step. Then, in on_train_epoch_end(), I immediately call self.all_gather(self.train_loss), but it hangs until NCCL times out. I am on a single node with 2 GPUs.
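Roughly, my code looks like this (the linear layer, loss function, and class name are simplified placeholders, not my actual model):

```python
import torch
import lightning as L


class LitModel(L.LightningModule):
    # Minimal sketch of the setup described above; the linear layer and
    # MSE loss are placeholders, not the actual model.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self.train_loss = []  # per-batch losses collected on this rank

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.train_loss.append(loss.item())  # plain Python float
        return loss

    def on_train_epoch_end(self):
        # This is the call that hangs until the NCCL timeout with 2 GPUs.
        gathered = self.all_gather(self.train_loss)
        print(f"epoch {self.current_epoch}: {gathered}")
        self.train_loss.clear()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```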
What really stumps me is that this paradigm works fine for on_test_epoch_end() and test_step().
Any thoughts on how to debug or fix this? What is the "right" way this code should look when operating correctly?
Environment: PyTorch 2.4 and Lightning 2.4.0.
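One pattern I have seen suggested is to let self.log handle the cross-batch and cross-rank aggregation with sync_dist=True instead of calling all_gather manually. A sketch of that (same placeholder model), in case it helps frame the question:

```python
import torch
import lightning as L


class LitModelLogged(L.LightningModule):
    # Same placeholder model; relies on Lightning's logging to aggregate
    # across batches and ranks instead of a manual all_gather.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # on_epoch=True averages over the epoch's batches, and
        # sync_dist=True reduces across ranks, so no manual collective
        # is needed in on_train_epoch_end.
        self.log("train_loss", loss, on_step=False, on_epoch=True, sync_dist=True)
        return loss

    def on_train_epoch_end(self):
        # The epoch-level aggregate should be in callback_metrics by the
        # time this hook runs (my understanding, not verified here).
        print(self.trainer.callback_metrics.get("train_loss"))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

But I would still like to understand why the manual all_gather works in the test hooks and not in on_train_epoch_end.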