Set ddp_find_unused_parameters to False when using distributed training #222
base: main
Conversation
Thank you for the PR! Using DDP is quite powerful with QLoRA, as even the large LLaMA models can fit on a single 48GB GPU. My question about your change is the following: why do you think this solution is better than manually adding the following setting to DDP scripts? I am slightly leaning in favor of adding a section to the README about DDP, with sample scripts showing how to use it. But I am happy to hear your thoughts.
When using DDP, if gradient checkpointing is enabled, ddp_find_unused_parameters should be set to False.
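For reference, a minimal sketch of that manual approach, assuming a standard transformers Trainer setup (the argument values below are placeholders, not the repo's defaults):

```python
from transformers import TrainingArguments

# Minimal sketch of the manual approach: each DDP fine-tuning script
# sets the flag itself. Values below are placeholders.
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    # With gradient checkpointing, DDP's search for unused parameters
    # breaks the backward pass, so it is disabled explicitly.
    ddp_find_unused_parameters=False,
)
```

The trade-off is that every DDP script (and every README example) has to remember this flag, which is what the PR tries to avoid.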
I agree! It took me two days to get "QLoRA + Expand Vocab + DDP + Gradient Checkpointing" to work properly. I encountered numerous bugs, conflicts, and obscure configuration issues, including this particular one. These different aspects were intertwined with each other, making the entire process difficult and frustrating. Therefore, I strongly believe that explicit code and detailed comments are highly preferable.
@chenjiasheng |
As described in the Hugging Face docs, ddp_find_unused_parameters should be set to False when gradient_checkpointing is enabled. I've tested this on my machine with two 3090 Ti GPUs using a two-process DDP launch; a rough sketch of the setup is below.
It may resolve #12.