Replies: 3 comments
-
Dear @shamanez, I am not sure exactly which function you are running or whether it could be moved to a Callback. Could you give us more details?
1. Yes, you could use a Callback: https://pytorch-lightning.readthedocs.io/en/stable/generated/pytorch_lightning.callbacks.Callback.html?highlight=callback.
4. If you want all processes to wait, you can use a barrier.
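A minimal sketch of such a Callback, assuming the periodic update lives in a hypothetical `update_branch()` method on your LightningModule (hook signatures and the barrier attribute path vary between Lightning versions):

```python
from pytorch_lightning.callbacks import Callback


class PeriodicUpdateCallback(Callback):
    """Every `every_n_steps` global steps, let rank 0 run the update and
    make the other DDP processes wait for it."""

    def __init__(self, every_n_steps: int = 10_000):
        self.every_n_steps = every_n_steps

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0):
        if trainer.global_step > 0 and trainer.global_step % self.every_n_steps == 0:
            if trainer.is_global_zero:
                pl_module.update_branch()  # hypothetical update routine
            # all processes block here until rank 0 has finished;
            # the attribute path differs across versions
            # (e.g. trainer.strategy.barrier() in newer releases)
            trainer.accelerator.barrier()
```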
Best,
-
Thanks a lot for the quick response @tchaton. In order to run the function, I need to access self.trainer.model (basically a branch in the main model). Actually, the self.trainer.accelerator.barrier function you mentioned seems to match my requirement perfectly. Can you elaborate a bit more on the use of the barrier? I have a function (process) A that gets called only on the master worker (GPU 0). So I want to make sure the other GPUs don't take batches until function A completes.
-
I tried to use self.trainer.accelerator.barrier('block my processes'), but it throws an error: AttributeError: 'Trainer' object has no attribute 'accelerator'. P.S. The script uses a custom accelerator as mentioned here. I went through the code base; it should be self.trainer.accelerator_connector.accelerator.barrier.
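For what it's worth, a minimal sketch of the pattern inside training_step with that attribute path (`update_branch()` and `compute_loss()` are placeholders for my actual methods):

```python
def training_step(self, batch, batch_idx):
    if self.global_step > 0 and self.global_step % 10_000 == 0:
        # only the master process (GPU 0) runs the expensive update
        if self.trainer.is_global_zero:
            self.update_branch()  # placeholder for "function A"
        # every DDP process stops here until rank 0 is done, so nobody
        # consumes new batches against a half-updated branch
        self.trainer.accelerator_connector.accelerator.barrier("block my processes")

    loss = self.compute_loss(batch)  # placeholder for the usual loss computation
    return loss
```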
-
I want to modify a certain function during the training loop (let's say every 10,000 global training steps). I am using multiple GPUs. Currently, I have implemented it inside training_step. While the update is happening, I want to make sure the other DDP processes wait. In the following code, I run the update only if the global_rank is zero.
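Roughly, the code looks like this (a simplified sketch; `update_branch()` and `compute_loss()` are placeholders for my actual methods):

```python
def training_step(self, batch, batch_idx):
    # every 10,000 global steps, only rank 0 updates the shared function
    if self.global_step > 0 and self.global_step % 10_000 == 0 and self.global_rank == 0:
        self.update_branch()  # placeholder: updates one branch of the model
        # no synchronisation here: the other DDP processes keep taking batches
        # and may call the branch while it is being rewritten

    loss = self.compute_loss(batch)  # placeholder for the usual loss computation
    return loss
```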
Sometimes this process crashes because, while one GPU is updating the function, the others are trying to use it.
So is there a callback I can use to execute my command, where I can make sure the other processes wait until this step is finished? Something like on_end_of_global_training_step?
One more question: how can I get the global_step count inside the training loop?
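For context, this is the kind of access I am hoping for (not sure whether this is the right property in my version):

```python
def training_step(self, batch, batch_idx):
    # hoping to read the global optimiser-step counter here
    step = self.trainer.global_step
    if step > 0 and step % 10_000 == 0:
        ...  # run the periodic update
```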