Set ddp_find_unused_parameters to False when using distributed training #222

Open · wants to merge 2 commits into main
Conversation

@aresnow1 commented on Jul 21, 2023

As described in the Hugging Face docs, ddp_find_unused_parameters should be set to False when gradient_checkpointing is enabled.
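
For reference, a minimal sketch of what the change boils down to (assuming the script builds its TrainingArguments through the usual Hugging Face API; the WORLD_SIZE check and variable names here are illustrative, not the actual diff in this PR):

import os

from transformers import TrainingArguments

# Illustrative: treat any multi-process launch (torchrun sets WORLD_SIZE)
# as a DDP run.
ddp = int(os.environ.get("WORLD_SIZE", "1")) != 1

training_args = TrainingArguments(
    output_dir="./output",  # placeholder path
    gradient_checkpointing=True,
    # With gradient checkpointing on, DDP must not search for unused
    # parameters, otherwise backward() raises an error.
    ddp_find_unused_parameters=False if ddp else None,
)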

I've tested this on my machine with two 3090 Ti GPUs, running the following script:

WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun \
    --nproc_per_node=2 \
    --master_port=1234 \
    qlora.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --use_auth \

It may resolve #12.

@artidoro (Owner)

Thank you for the PR! Using DDP is quite powerful with QLoRA, as even the large LLaMA models can fit on a single 48GB GPU.

My question about your change is the following: why do you think this solution is better than manually adding the following setting to DDP scripts?
--ddp_find_unused_parameters False

I am slightly leaning in favor of adding a section to the README about DDP with sample scripts for how to use it. But I am happy to hear your thoughts.
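
For context on the manual route: the flag works because transformers' HfArgumentParser maps --ddp_find_unused_parameters onto the TrainingArguments dataclass, so a launch script only needs to append it. A rough, self-contained illustration (not the actual qlora.py argument setup):

from transformers import HfArgumentParser, TrainingArguments

# Simulate the extra CLI flag that a DDP launch script would pass.
parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses(
    args=["--output_dir", "./output", "--ddp_find_unused_parameters", "False"]
)
print(training_args.ddp_find_unused_parameters)  # False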

@aresnow1 (Author)

When using DDP with gradient checkpointing enabled, ddp_find_unused_parameters must be set to False; otherwise an error is raised. I prefer to make it non-configurable to reduce the chance of errors.
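
To make that concrete, a hedged sketch of the "not configurable" behaviour (the helper name and warning are mine, not the code in this PR): if the run is distributed and gradient checkpointing is on, force the flag off rather than letting DDP fail during backward().

import os
import warnings

from transformers import TrainingArguments

def enforce_ddp_find_unused(args: TrainingArguments) -> TrainingArguments:
    # Hypothetical helper: override the setting instead of exposing it.
    ddp = int(os.environ.get("WORLD_SIZE", "1")) != 1
    if ddp and args.gradient_checkpointing:
        if args.ddp_find_unused_parameters:
            warnings.warn(
                "gradient_checkpointing is enabled; forcing "
                "ddp_find_unused_parameters=False to avoid a DDP error."
            )
        args.ddp_find_unused_parameters = False
    return args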

@chenjiasheng


I agree! It took me two days to get "QLoRA + vocab expansion + DDP + gradient checkpointing" to work properly. I ran into numerous bugs, conflicts, and obscure configuration options, including this particular issue, and they were all intertwined, which made the whole process difficult and frustrating. So I strongly believe that explicit code and detailed comments are highly preferable.

@nickmitchko commented on Aug 29, 2023


@chenjiasheng would you mind sharing your code or script? I have been struggling to implement parallel training on a multi-GPU node.
