Save ckpt error #18
I find there may be a bug in the "get_fsdp_wrap_module_list" method, because the full Chameleon model consists of more submodules than the ones returned by this method. However, the training can still proceed.
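To make the concern concrete, here is a hypothetical sketch (not the repository's actual code; the submodule names are assumptions) of a wrap list that covers only the repeated transformer blocks, leaving e.g. the token embedding, the final norm, and the output head outside the list:

```python
import torch.nn as nn

class ChameleonLikeModel(nn.Module):
    """Toy stand-in for the full model; names are illustrative only."""
    def __init__(self, dim=64, n_layers=4, vocab=100):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.norm = nn.LayerNorm(dim)        # parameters like "norm.weight" live here
        self.lm_head = nn.Linear(dim, vocab)

    def get_fsdp_wrap_module_list(self):
        # Only the repeated transformer blocks are listed for wrapping;
        # embed_tokens, norm, and lm_head are intentionally not included.
        return list(self.layers)
```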
That's weird. The "get_fsdp_wrap_module_list" method is used for the auto_wrap_policy argument in the FSDP call: Lumina-mGPT/xllmx/solvers/finetune/finetune.py, lines 383 to 389 (commit 104abe4).
Note that FSDP wrapping is a recursive process, which means not only the outermost model, but also some of the inner submodules, are wrapped into FSDP modules. Operations like parameter sharding, gathering, and flattening are then conducted at the FSDP-module level. Importantly, the outermost model is itself wrapped, so parameters that are not covered by any inner FSDP module are still managed by the root FSDP module. Therefore, according to our experience, the problem you mentioned might not be the real cause of the error you met. Have you made any other modifications to the code? Or what's your PyTorch version?
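One way to see this (a quick diagnostic sketch, not part of the repository) is to list which submodules ended up as FSDP units after wrapping; the root model itself always appears as one, and parameters outside the wrap list are flattened into that root unit:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Print every module that was turned into an FSDP unit.  The empty name
# corresponds to the outermost (root) FSDP module, which also owns the
# parameters (e.g. the final norm) that no inner unit claimed.
for name, module in model.named_modules():
    if isinstance(module, FSDP):
        print(name if name else "<root>")
```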
Thanks for your response! I use 1 GPU to debug the code. The only modification I made is probably that I defined a ... BTW, my PyTorch version is 2.3.0. Best regards.
During training, I found the training procedure crashes when running Lumina-mGPT/xllmx/util/ckpt.py, line 91 (commit 104abe4).
And the error is:
AssertionError: FSDP assumes model.norm.weight is in the state_dict but the state_dict only has odict_keys
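For reference, consolidated checkpoints under FSDP are typically gathered with a pattern like the sketch below (an assumption about the general mechanism, not the exact code at ckpt.py line 91); the "FSDP assumes ... is in the state_dict" assertion is raised inside FSDP's state_dict post-processing when a parameter it expects to own is missing from the gathered dict:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)

# Gather a full (unsharded) state_dict onto rank 0 and save it.
cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
    state = model.state_dict()   # the AssertionError surfaces around this call

if dist.get_rank() == 0:
    torch.save(state, "consolidated.model.pth")
```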