tensorboard step and self.global_step do not correspond under accumulate_grad #20346

wuzhiyue111 · 2024-10-18T07:44:06Z

Bug description

Assume accumulate_grad=2, log_every_n_steps=50, and val_check_interval=8000, with TensorBoard written through self.log. self.global_step increases by one each time model.forward completes, but the TensorBoard step increases by one only when loss.step() completes (that is, once per optimizer update). As a result, my model runs validation when the TensorBoard step is 4000, because self.global_step is already 8000 at that point.

I think this logic should be unified.
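
To make the arithmetic concrete, here is a minimal plain-Python sketch of the two counters. It is not Lightning code: the function name `simulate` is hypothetical, and it assumes (as described above) that validation is triggered by a counter that advances on every batch, while the logged step advances only once per optimizer update.

```python
def simulate(num_batches, accumulate_grad, val_check_interval):
    """Model the two counters described in this issue.

    Assumption (not taken from Lightning's source): validation fires on a
    per-batch counter, while the logged step counts optimizer updates.
    Returns a list of (batch_count, optimizer_steps) pairs, one per
    validation run.
    """
    optimizer_steps = 0
    validation_points = []
    for batch_count in range(1, num_batches + 1):
        # One forward/backward pass happens on every batch.
        if batch_count % accumulate_grad == 0:
            # Gradients are applied only every `accumulate_grad` batches.
            optimizer_steps += 1
        if batch_count % val_check_interval == 0:
            validation_points.append((batch_count, optimizer_steps))
    return validation_points


if __name__ == "__main__":
    # Settings from this report: accumulate_grad=2, val_check_interval=8000.
    print(simulate(16000, 2, 8000))  # [(8000, 4000), (16000, 8000)]
```

Under these assumptions the first validation run lands at batch 8000, which is optimizer step 4000, matching the mismatch reported here.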

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

wuzhiyue111 added the bug and needs triage labels Oct 18, 2024

github-actions bot added the ver: 2.4.x label Oct 18, 2024
