Replies: 3 comments
-
Dear @shamanez, I am not sure exactly which function you are running or whether it could be moved to a Callback. Could you give us more details?
1. Yes, you could use a Callback: https://pytorch-lightning.readthedocs.io/en/stable/generated/pytorch_lightning.callbacks.Callback.html?highlight=callback.
4. If you want all processes to wait, you can use a barrier.
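A minimal sketch of such a Callback, assuming the periodic update lives in a hypothetical `update_branch()` method on your LightningModule (hook signatures and the barrier attribute path vary between Lightning versions):

```python
from pytorch_lightning.callbacks import Callback


class PeriodicUpdateCallback(Callback):
    """Every `every_n_steps` global steps, let rank 0 run the update and
    make the other DDP processes wait for it."""

    def __init__(self, every_n_steps: int = 10_000):
        self.every_n_steps = every_n_steps

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0):
        if trainer.global_step > 0 and trainer.global_step % self.every_n_steps == 0:
            if trainer.is_global_zero:
                pl_module.update_branch()  # hypothetical update routine
            # all processes block here until rank 0 has finished;
            # the attribute path differs across versions
            # (e.g. trainer.strategy.barrier() in newer releases)
            trainer.accelerator.barrier()
```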
Best,
-
Thanks a lot for the quick response @tchaton. In order to run the function, I need to access self.trainer.model (basically a branch in the main model). Actually, the self.trainer.accelerator.barrier function you mentioned seems to match my requirement perfectly. Can you elaborate a bit more on the use of the barrier? I have a function (process) A that gets called only on the master worker (GPU 0). So I want to make sure the other GPUs don't take batches until function A completes.
-
I tried to use self.trainer.accelerator.barrier('block my processes'), but it throws an error: AttributeError: 'Trainer' object has no attribute 'accelerator'. P.S. The script uses a custom accelerator as mentioned here. I went through the code base; it should be self.trainer.accelerator_connector.accelerator.barrier.
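For what it's worth, a minimal sketch of the pattern inside training_step with that attribute path (`update_branch()` and `compute_loss()` are placeholders for my actual methods):

```python
def training_step(self, batch, batch_idx):
    if self.global_step > 0 and self.global_step % 10_000 == 0:
        # only the master process (GPU 0) runs the expensive update
        if self.trainer.is_global_zero:
            self.update_branch()  # placeholder for "function A"
        # every DDP process stops here until rank 0 is done, so nobody
        # consumes new batches against a half-updated branch
        self.trainer.accelerator_connector.accelerator.barrier("block my processes")

    loss = self.compute_loss(batch)  # placeholder for the usual loss computation
    return loss
```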
-
I want to modify a certain function during the training loop (let's say every 10,000 global training steps). I am using multiple GPUs. Currently, I have implemented it inside training_step. While the update is happening, I want to make sure the other DDP processes wait. In the following code, I run the update only if the global_rank is zero.
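Roughly, the code looks like this (a simplified sketch; `update_branch()` and `compute_loss()` are placeholders for my actual methods):

```python
def training_step(self, batch, batch_idx):
    # every 10,000 global steps, only rank 0 updates the shared function
    if self.global_step > 0 and self.global_step % 10_000 == 0 and self.global_rank == 0:
        self.update_branch()  # placeholder: updates one branch of the model
        # no synchronisation here: the other DDP processes keep taking batches
        # and may call the branch while it is being rewritten

    loss = self.compute_loss(batch)  # placeholder for the usual loss computation
    return loss
```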
Sometimes this process crashes because, while one GPU is updating the function, the others are trying to use it.
So is there a callback I can use to execute my command, where I can make sure the other processes wait until this step is finished? Something like on_end_of_global_training_step?
One more question: how can I get the global_step count inside the training loop?
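For context, this is the kind of access I am hoping for (not sure whether this is the right property in my version):

```python
def training_step(self, batch, batch_idx):
    # hoping to read the global optimiser-step counter here
    step = self.trainer.global_step
    if step > 0 and step % 10_000 == 0:
        ...  # run the periodic update
```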