tensorboard step and self.global_step do not correspond under accumulate_grad #20346

Open
wuzhiyue111 opened this issue Oct 18, 2024 · 0 comments
Labels
bug Something isn't working needs triage Waiting to be triaged by maintainers ver: 2.4.x

Comments

@wuzhiyue111

Bug description

Assume `accumulate_grad_batches=2`, `log_every_n_steps=50`, and `val_check_interval=8000`, with the TensorBoard values logged through `self.log`. From what I observe, `self.global_step` increases by one each time `model.forward` completes, whereas the TensorBoard step increases by one only after the optimizer step completes. As a result, my model runs validation when the TensorBoard step is 4000, because at that point `self.global_step` is already 8000.

I think these two step counters should follow the same logic.
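The arithmetic behind the mismatch can be sketched in plain Python. This is only an illustration of the counting described above, not Lightning's actual implementation: the function `simulate` and the counters `batches_seen` and `optimizer_steps` are hypothetical names, and the sketch assumes validation is triggered by a per-batch counter while the logged step advances once per optimizer step.

```python
def simulate(num_batches, accumulate_grad_batches=2, val_check_interval=8000):
    """Return (per-batch counter, optimizer-step counter) at each validation run.

    Hypothetical model of the behaviour described in this report; it does not
    call into Lightning at all.
    """
    batches_seen = 0      # assumed to advance once per forward pass / batch
    optimizer_steps = 0   # assumed to advance once per optimizer step
    validations = []
    for _ in range(num_batches):
        batches_seen += 1
        # With gradient accumulation, the optimizer steps every N batches.
        if batches_seen % accumulate_grad_batches == 0:
            optimizer_steps += 1
        # Validation is assumed to be scheduled on the per-batch counter.
        if batches_seen % val_check_interval == 0:
            validations.append((batches_seen, optimizer_steps))
    return validations


print(simulate(16000))
# [(8000, 4000), (16000, 8000)]
```

Under these assumptions, the first validation run lands at optimizer step 4000 rather than 8000, which matches the step shown in TensorBoard.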

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

@wuzhiyue111 wuzhiyue111 added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Oct 18, 2024