Everything prints fine, but the loss doesn't descend #20344
Comments
My full code is a little bit complicated, but I believe the problem is within the logic above. Did I use Lightning incorrectly in the code above?
I'd be grateful if someone could give me any idea about what might cause such an issue. I wondered whether cls_model was erroneously frozen, but it is not; its parameters have requires_grad set.
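A quick way to verify that nothing is frozen, as a minimal sketch assuming `cls_model` is the submodule mentioned above:

```python
# List any parameters of cls_model that are accidentally frozen.
frozen = [name for name, p in cls_model.named_parameters() if not p.requires_grad]
print("frozen parameters:", frozen or "none")
```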
Maybe it is related to issue #20128. I am also using Hugging Face's AutoModel.from_pretrained, and the mode is eval. I tried to manually call train(), but it does not work.
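If eval mode were the cause, one workaround would be to re-assert train mode from a Lightning hook. A minimal sketch (the module layout here is hypothetical; `on_train_epoch_start` is a standard LightningModule hook):

```python
import torch.nn as nn
import lightning.pytorch as pl

class LitModule(pl.LightningModule):
    def __init__(self, backbone: nn.Module):
        super().__init__()
        # e.g. a model loaded via AutoModel.from_pretrained
        self.backbone = backbone

    def on_train_epoch_start(self):
        # Re-assert train mode in case the pretrained backbone was
        # left in eval mode after loading.
        self.backbone.train()
```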
No, it is not because of that issue. I double-checked that I called train().
To debug, I print the L2 norm of the parameters and gradients at every step.
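That kind of check can live in the `on_after_backward` hook, which Lightning calls right after `loss.backward()`. A sketch of a method you could drop into the LightningModule from the report:

```python
def on_after_backward(self):
    # Print the global L2 norm of all parameters and of their gradients.
    param_sq = grad_sq = 0.0
    for p in self.parameters():
        param_sq += p.detach().norm(2).item() ** 2
        if p.grad is not None:
            grad_sq += p.grad.detach().norm(2).item() ** 2
    print(f"param L2 = {param_sq ** 0.5:.4f}, grad L2 = {grad_sq ** 0.5:.4f}")
```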
The optimizer indeed made a change to the model. Can any expert kindly tell me where I am using Lightning incorrectly?
Hey, I am not an expert, but I checked your code and you seem to do [...]. Can you see if this helps?
Bug description
Even after I set the learning rate to 1, and even to 100, the loss doesn't change at all; it is always 4.60.
I tried to debug what happens, but everything seems to work fine: the loss is backpropagated successfully, the gradients of each parameter look reasonable, and the optimizer is indeed called.
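One direct way to confirm this end to end is to run a single manual step and compare a parameter before and after. A sketch under the assumption that `model` is the LightningModule from the report and `batch` is one training batch (both hypothetical names here):

```python
import torch

before = next(model.parameters()).detach().clone()

loss = model.training_step(batch, 0)      # forward pass + loss
loss.backward()                           # populate gradients
optimizer = model.configure_optimizers()  # assumes a bare optimizer is returned
optimizer.step()                          # apply the update

after = next(model.parameters()).detach()
print("loss:", loss.item())
print("weights changed:", not torch.equal(before, after))
```

If the weights change here but the loss still never moves during real training, the problem is likely upstream of the optimizer, e.g. the loss being computed from tensors detached from the trainable parameters.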
What version are you seeing the problem on?
v2.3
How to reproduce the bug
Error messages and logs
Nothing is crashing, and the model summary looks good, but the training loss just doesn't change (different batch samples cause slight variation, but not from any training of the model).
Environment
Current environment
The collect env script is not working, by the way.
More info
No response