Set ddp_find_unused_parameters to False when using distributed training #222

Open · wants to merge 2 commits into main
Conversation

@aresnow1 commented on Jul 21, 2023

As described in the Hugging Face docs, ddp_find_unused_parameters should be set to False when gradient_checkpointing is enabled.
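
For reference, a minimal sketch of what the change boils down to (assuming the script builds its TrainingArguments through the usual Hugging Face API; the WORLD_SIZE check and variable names here are illustrative, not the actual diff in this PR):

import os

from transformers import TrainingArguments

# Illustrative: treat any multi-process launch (torchrun sets WORLD_SIZE)
# as a DDP run.
ddp = int(os.environ.get("WORLD_SIZE", "1")) != 1

training_args = TrainingArguments(
    output_dir="./output",  # placeholder path
    gradient_checkpointing=True,
    # With gradient checkpointing on, DDP must not search for unused
    # parameters, otherwise backward() raises an error.
    ddp_find_unused_parameters=False if ddp else None,
)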

I've tested this on my machine with two 3090 Ti GPUs, running the following script:

WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun \
    --nproc_per_node=2 \
    --master_port=1234 \
    qlora.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --use_auth \

It may resolve #12.

@artidoro (Owner)

Thank you for the PR! Using DDP is quite powerful with QLoRA, as even the large LLaMA models can fit on a single 48GB GPU.

My question about your change is the following: why do you think this solution is better than manually adding the following setting to DDP scripts?
--ddp_find_unused_parameters False

I am slightly leaning in favor of adding a section to the README about DDP with sample scripts for how to use it. But I am happy to hear your thoughts.
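
For context on the manual route: the flag works because transformers' HfArgumentParser maps --ddp_find_unused_parameters onto the TrainingArguments dataclass, so a launch script only needs to append it. A rough, self-contained illustration (not the actual qlora.py argument setup):

from transformers import HfArgumentParser, TrainingArguments

# Simulate the extra CLI flag that a DDP launch script would pass.
parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses(
    args=["--output_dir", "./output", "--ddp_find_unused_parameters", "False"]
)
print(training_args.ddp_find_unused_parameters)  # False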

@aresnow1 (Author)

When using DDP with gradient checkpointing enabled, ddp_find_unused_parameters must be set to False; otherwise an error is raised. I prefer to make it non-configurable to reduce the chance of errors.
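
To make that concrete, a hedged sketch of the "not configurable" behaviour (the helper name and warning are mine, not the code in this PR): if the run is distributed and gradient checkpointing is on, force the flag off rather than letting DDP fail during backward().

import os
import warnings

from transformers import TrainingArguments

def enforce_ddp_find_unused(args: TrainingArguments) -> TrainingArguments:
    # Hypothetical helper: override the setting instead of exposing it.
    ddp = int(os.environ.get("WORLD_SIZE", "1")) != 1
    if ddp and args.gradient_checkpointing:
        if args.ddp_find_unused_parameters:
            warnings.warn(
                "gradient_checkpointing is enabled; forcing "
                "ddp_find_unused_parameters=False to avoid a DDP error."
            )
        args.ddp_find_unused_parameters = False
    return args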

@chenjiasheng


I agree! It took me two days to get "QLoRA + vocab expansion + DDP + gradient checkpointing" to work properly. I ran into numerous bugs, conflicts, and obscure configuration options, including this particular issue, and they were all intertwined, which made the whole process difficult and frustrating. So I strongly believe that explicit code and detailed comments are highly preferable.

@nickmitchko commented on Aug 29, 2023


@chenjiasheng would you mind sharing your code or script? I have been struggling to implement parallel training on a multi-GPU node.
