Multi-gpu training example? #12

Open
Qubitium opened this issue May 25, 2023 · 25 comments
Comments

@Qubitium
Contributor

I am testing 4-bit QLoRA training on a 33B LLaMA model. Training runs fine on 1 GPU but fails with the following error when using torchrun on 2 GPUs. I am referring to data-parallel training where each GPU holds a full copy of the model.

Has anyone gotten multi-GPU parallel training working yet?

WORLD_SIZE=2 torchrun --rdzv-endpoint=localhost:23456 --nproc_per_node=2
device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))}
 File "/root/miniconda3/lib/python3.10/site-packages/transformers-4.30.0.dev0-py3.10.egg/transformers/trainer.py", line 2804, in training_step
    loss.backward()
  File "/root/miniconda3/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 204, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 226, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 204, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 557 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
@photonOli

photonOli commented May 25, 2023

On my side, when I run the following command, it works across multiple GPUs, but the GPUs are under-utilized: one CPU core is at 100% while the rest are at 0%. It seems the full power of the machine is not being used.

python qlora.py --model_name_or_path /path/to/model-7B
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   35C    P0    70W / 400W |   4037MiB / 40960MiB |      1%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0    57W / 400W |   3021MiB / 40960MiB |     34%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   35C    P0    62W / 400W |   3021MiB / 40960MiB |     21%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:09:00.0 Off |                    0 |
| N/A   35C    P0    61W / 400W |   3021MiB / 40960MiB |     11%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0    60W / 400W |   3021MiB / 40960MiB |     11%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   34C    P0   158W / 400W |   3021MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:0C:00.0 Off |                    0 |
| N/A   37C    P0    84W / 400W |   3021MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:0D:00.0 Off |                    0 |
| N/A   36C    P0    60W / 400W |   7435MiB / 40960MiB |     15%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

@yea0

yea0 commented May 25, 2023

That's perfect information

@chu-tianxiang

Not sure if it's right, but in my case adding ddp_find_unused_parameters=False makes it work.
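
For reference, a minimal sketch of where that flag lives when driving the HF Trainer directly (paths and batch sizes here are illustrative, not taken from qlora.py):

from transformers import TrainingArguments, Trainer

# ddp_find_unused_parameters=False skips DDP's extra scan for parameters that
# did not receive gradients; combined with (reentrant) gradient checkpointing,
# leaving it on is a known trigger for errors like the one reported above.
training_args = TrainingArguments(
    output_dir="./output",              # illustrative
    per_device_train_batch_size=1,      # illustrative
    gradient_accumulation_steps=16,     # illustrative
    ddp_find_unused_parameters=False,
)

# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()

In qlora.py the same thing can be done from the command line, since the script parses TrainingArguments-style flags (e.g. --ddp_find_unused_parameters=False), as a later comment in this thread shows.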

@SparkJiao

I haven't gotten DDP with DeepSpeed working yet.

I got a multiplication error:

File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/transformers-4.30.0.dev0-py3.9.egg/transformers/models/llama/modeling_llama.py", line 194, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/peft/tuners/lora.py", line 348, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (44x6656 and 1x22151168)

@ChenDelong1999

I haven't gotten DDP with DeepSpeed working yet.

I got a multiplication error:

File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/transformers-4.30.0.dev0-py3.9.egg/transformers/models/llama/modeling_llama.py", line 194, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/peft/tuners/lora.py", line 348, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (44x6656 and 1x22151168)

Same problem:

│ /cpfs/user/chendelong/anaconda3/envs/openflamingo/lib/python3.9/site-packages/transformers/model │
│ s/llama/modeling_llama.py:194 in forward                                                         │
│                                                                                                  │
│   191 │   ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:       │
│   192 │   │   bsz, q_len, _ = hidden_states.size()                                               │
│   193 │   │                                                                                      │
│ ❱ 194 │   │   query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.   │
│   195 │   │   key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.he   │
│   196 │   │   value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.   │
│   197                                                                                            │
│                                                                                                  │
│ /cpfs/user/chendelong/anaconda3/envs/openflamingo/lib/python3.9/site-packages/torch/nn/modules/m │
│ odule.py:1501 in _call_impl                                                                      │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /cpfs/user/chendelong/anaconda3/envs/openflamingo/lib/python3.9/site-packages/peft/tuners/lora.p │
│ y:768 in forward                                                                                 │
│                                                                                                  │
│   765 │   │   │   │   self.active_adapter = adapter_name                                         │
│   766 │   │   │                                                                                  │
│   767 │   │   │   def forward(self, x: torch.Tensor):                                            │
│ ❱ 768 │   │   │   │   result = super().forward(x)                                                │
│   769 │   │   │   │                                                                              │
│   770 │   │   │   │   if self.disable_adapters or self.active_adapter not in self.lora_A.keys(   │
│   771 │   │   │   │   │   return result                                                          │
│                                                                                                  │
│ /cpfs/user/chendelong/anaconda3/envs/openflamingo/lib/python3.9/site-packages/bitsandbytes/nn/mo │
│ dules.py:219 in forward                                                                          │
│                                                                                                  │
│   216 │   │   │   x = x.to(self.compute_dtype)                                                   │
│   217 │   │                                                                                      │
│   218 │   │   bias = None if self.bias is None else self.bias.to(self.compute_dtype)             │
│ ❱ 219 │   │   out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.qua   │
│   220 │   │                                                                                      │
│   221 │   │   out = out.to(inp_dtype)                                                            │
│   222                                                                                            │
│                                                                                                  │
│ /cpfs/user/chendelong/anaconda3/envs/openflamingo/lib/python3.9/site-packages/bitsandbytes/autog │
│ rad/_functions.py:564 in matmul_4bit                                                             │
│                                                                                                  │
│   561                                                                                            │
│   562 def matmul_4bit(A: tensor, B: tensor, quant_state: List, out: tensor = None, bias=None):   │
│   563 │   assert quant_state is not None                                                         │
│ ❱ 564 │   return MatMul4Bit.apply(A, B, out, bias, quant_state)                                  │
│   565                                                                                            │
│                                                                                                  │
│ /cpfs/user/chendelong/anaconda3/envs/openflamingo/lib/python3.9/site-packages/torch/autograd/fun │
│ ction.py:506 in apply                                                                            │
│                                                                                                  │
│   503 │   │   if not torch._C._are_functorch_transforms_active():                                │
│   504 │   │   │   # See NOTE: [functorch vjp and autograd interaction]                           │
│   505 │   │   │   args = _functorch.utils.unwrap_dead_wrappers(args)                             │
│ ❱ 506 │   │   │   return super().apply(*args, **kwargs)  # type: ignore[misc]                    │
│   507 │   │                                                                                      │
│   508 │   │   if cls.setup_context == _SingleLevelFunction.setup_context:                        │
│   509 │   │   │   raise RuntimeError(                                                            │
│                                                                                                  │
│ /cpfs/user/chendelong/anaconda3/envs/openflamingo/lib/python3.9/site-packages/bitsandbytes/autog │
│ rad/_functions.py:512 in forward                                                                 │
│                                                                                                  │
│   509 │   │                                                                                      │
│   510 │   │   # 1. Dequantize                                                                    │
│   511 │   │   # 2. MatmulnN                                                                      │
│ ❱ 512 │   │   output = torch.nn.functional.linear(A, F.dequantize_fp4(B, state).to(A.dtype).t(   │
│   513 │   │                                                                                      │
│   514 │   │   # 3. Save state                                                                    │
│   515 │   │   ctx.state = state                                                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1011x4096 and 1x8388608)

@SparkJiao

@ChenDelong1999 I find that disabling ZeRO, i.e. setting the DeepSpeed ZeRO stage to 0, helps resolve this problem. But I cannot train a 30B model on two A6000s due to OOM.
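
For anyone else wondering what "zero=0" refers to: a minimal, hypothetical sketch of a DeepSpeed setup with ZeRO disabled (stage 0), passed as a dict through TrainingArguments (all values illustrative; a path to an equivalent JSON file works too):

from transformers import TrainingArguments

# Hypothetical DeepSpeed config with ZeRO sharding turned off ("zero=0").
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 0},  # stage 0 = plain data parallelism, no parameter/optimizer sharding
}

training_args = TrainingArguments(
    output_dir="./output",  # illustrative
    bf16=True,
    deepspeed=ds_config,    # accepts a dict or a path to a DeepSpeed JSON config
)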

@ChenDelong1999

@SparkJiao Sorry but what do you mean by zero=0?

By the way, I just found that removing model.cuda() or model.eval() helps me solve the multiplication error:

model = AutoModelForCausalLM.from_pretrained(
    args.lm_path,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
    device_map={"": 0},
)

# model = model.cuda().eval()  # <- DO NOT ADD THIS

@AlpinDale
Contributor

@SparkJiao Sorry but what do you mean by zero=0?

By the way, I just found that removing model.cuda() or model.eval() helps me solve the multiplication error:

model = AutoModelForCausalLM.from_pretrained(
    args.lm_path,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
    device_map={"": 0},
)

# model = model.cuda().eval()  # <- DO NOT ADD THIS

Where do I look for that? It doesn't seem to be in the QLoRA code.

@ChenDelong1999

Yes, my own code originally had the additional line model = model.cuda().eval(), and I found that removing it solves my multiplication error.

@hemangjoshi37a

@Qubitium,

Based on the information provided, it seems that when running the training command with multiple GPUs, you encountered a runtime error related to variable readiness. This error message suggests that there might be issues with sharing module parameters across multiple concurrent forward-backward passes.

To address this issue, we recommend trying the following approaches:

  1. Ensure that PyTorch's distributed support is installed and up to date. torch.distributed ships with PyTorch itself (it is not a separate pip package), so updating PyTorch is enough:
pip install --upgrade torch
  2. Modify your training script to include the following line of code before starting the training loop:
torch.distributed.init_process_group(backend="nccl")

This line initializes the distributed training process group using the NCCL backend, which is commonly used for multi-GPU training with PyTorch.

  3. In your script, set the environment variable TORCH_DISTRIBUTED_DEBUG to either "INFO" or "DETAIL". This will provide additional information for debugging purposes and may help identify the root cause of the issue. (See the short sketch of steps 2 and 3 after this list.)

Additionally, we recommend checking the compatibility and version requirements of the libraries and frameworks you are using. Ensuring that all dependencies are up to date can help resolve potential compatibility issues.
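
For completeness, a minimal sketch of what steps 2 and 3 look like in a script launched with torchrun (note: when you use the Hugging Face Trainer, the process group is normally initialized for you, so this is only needed in custom training loops):

import os

import torch
import torch.distributed as dist

# Step 3: richer DDP diagnostics; can equally be exported in the shell
# before launching, e.g. TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun ...
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

# Step 2: torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)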

@ghost

ghost commented May 28, 2023

added multi gpu reference: #68

@bliu3650

Temporarily disabling gradient checkpointing can help training on multi-GPU.

@SkyAndCloud

@bliu3650 Can you share the command you used?

@zhangluoyang

This way, QLoRA can use DeepSpeed. I have two GPU devices:

local_rank = int(os.environ.get("LOCAL_RANK", 0))
if local_rank == 0:
    max_memory = {0: "{0}MB".format(45 * 1024), 1: "0MB"}
else:
    max_memory = {1: "{0}MB".format(45 * 1024), 0: "0MB"}
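
To spell out the idea, a hedged sketch of how such a per-rank max_memory dict might be wired into model loading so that each rank keeps the whole model on its own GPU (the checkpoint name and the 45 GB budget are illustrative):

import os

import torch
from transformers import AutoModelForCausalLM

local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Allow ~45 GB on this rank's GPU and 0 MB everywhere else, so the
# "auto" device map has no choice but to place all weights on local_rank.
max_memory = {i: "0MB" for i in range(torch.cuda.device_count())}
max_memory[local_rank] = "{0}MB".format(45 * 1024)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-30b",      # illustrative checkpoint
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=max_memory,
)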

@whcisci

whcisci commented Jun 2, 2023

By this way, qlora can use deepspeed. I have two gpu devices. local_rank = int(os.environ.get("LOCAL_RANK", 0)) if local_rank ==0: max_memory = {0 : "{0}MB".format(45 * 1024), 1: "0MB"} else: max_memory = {1: "{0}MB".format(45 * 1024), 0: "0MB"}

@zhangluoyang, I tried this and it's indeed effective, but I encountered a deadlock during the training process. Have you ever encountered this?

@bliu3650

bliu3650 commented Jun 3, 2023

@bliu3650 Can you share the command you used?

@SkyAndCloud Code change in qlora.py:

line 273:
device_map = 'auto' -> device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))}
line 175:
gradient_checkpointing: bool = field(default=True, ... -> gradient_checkpointing: bool = field(default=False, ...

Run command:
WORLD_SIZE=2 torchrun --nproc_per_node=2 qlora.py --model_name_or_path <path_or_name>

@danielwonght

@bliu3650 Can you share the command you used?

@SkyAndCloud Code change in qlora.py:

line 273: device_map = 'auto' -> device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))} line 175: gradient_checkpointing: bool = field(default=True, ... -> gradient_checkpointing: bool = field(default=False, ...

Run command: WORLD_SIZE=2 torchrun --nproc_per_node=2 qlora.py --model_name_or_path <path_or_name>

The method above works for me. But the point is that if gradient checkpointing is turned off, it is hard to load a big model into limited memory, which is not what QLoRA was proposed for. I am wondering if there is a way to keep gradient checkpointing turned on.
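
One avenue that may be worth trying (my assumption, not something verified in this thread): newer transformers releases (4.35+) expose non-reentrant activation checkpointing, which avoids the reentrant backward passes that DDP complains about, so checkpointing could stay on. A rough sketch:

import os

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # illustrative checkpoint
    load_in_4bit=True,
    device_map={"": int(os.environ.get("LOCAL_RANK", 0))},
)

# Non-reentrant checkpointing (assumes transformers >= 4.35 and torch >= 2.0);
# older versions do not accept gradient_checkpointing_kwargs.
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)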

@ichsan2895

ichsan2895 commented Jul 16, 2023

On my side, when I run the following command, it works across multiple GPUs, but the GPUs are under-utilized: one CPU core is at 100% while the rest are at 0%. It seems the full power of the machine is not being used.

python qlora.py --model_name_or_path /path/to/model-7B
(nvidia-smi output omitted; identical to the table shown earlier in the thread)

I face a similar case: VRAM is distributed across all GPUs, but the GPUs themselves are under-utilized.

@bliu3650 Can you share the command you used?

@SkyAndCloud Code change in qlora.py:

line 273: device_map = 'auto' -> device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))} line 175: gradient_checkpointing: bool = field(default=True, ... -> gradient_checkpointing: bool = field(default=False, ...

Run command: WORLD_SIZE=2 torchrun --nproc_per_node=2 qlora.py --model_name_or_path <path_or_name>

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 255 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

@shawnanastasio

Using the ddp_find_unused_parameters fix from #222, I can launch training on multiple GPUs (accelerate launch qlora.py) and I see the expected utilization on all cards in nvidia-smi, but the actual speed of training isn't faster than using a single GPU. Has anybody else been seeing this behavior?

@artidoro
Owner

I found DDP to significantly reduce training time. One reason why you might not see that is because using DDP changes the way per_device_train_batch_size is interpreted.

  • When you are NOT using DDP and use the standard qlora.py script, the batch size is global regardless of how many GPUs you use. Note that using more GPUs in this setup means you spread the model over several GPUs according to device_map. This helps fit large models on GPUs that are not large enough individually.
  • When you ARE using DDP, the batch size is per device. So if you use 4 GPUs and per_device_train_batch_size=3, you have an effective batch size of 12. Here the model is replicated on each GPU, so you need a GPU that is large enough to fit the whole model. Each GPU sees a portion of the batch (in our example, each GPU sees 3 data points). I found this setup to significantly reduce training time. (See the arithmetic sketch after this list.)
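
To make the difference concrete, a small arithmetic sketch (numbers match the example above; gradient_accumulation_steps added for completeness):

# Effective batch size under the two setups (illustrative numbers).
world_size = 4                       # number of GPUs / DDP processes
per_device_train_batch_size = 3
gradient_accumulation_steps = 1

# Without DDP: one process, model sharded across GPUs via device_map.
effective_batch_no_ddp = per_device_train_batch_size * gradient_accumulation_steps            # = 3

# With DDP: one full replica per GPU, each seeing its own slice of the data.
effective_batch_ddp = per_device_train_batch_size * gradient_accumulation_steps * world_size  # = 12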

@ehartford

I found DDP to significantly reduce training time. One reason why you might not see that is because using DDP changes the way per_device_train_batch_size is interpreted.

  • When you are NOT using DDP and use the standard qlora.py scripts the batch size is global regardless of how many gpus you use. Note that using more GPUs in this setup means you spread the model over several GPUs according to device_map. This helps fit large models on GPUs that are not large enough individually.
  • When you ARE using DDP, the batch size is per device. So if you use 4 GPUs and per_device_train_batch_size=3 you have an effective batch size of 12. Here the model is replicated on each GPU so you need to have a GPU that is large enough to fit the model. Each GPU sees a portion of the batch (in our example each GPU sees 3 data points). I found this setup to significantly reduce the training time.

Can you please give an example of how to make this work?
I can't get it working either, for the same reasons everyone else is having trouble.

@ichsan2895

ichsan2895 commented Aug 21, 2023

On my side, when I run the following command, it works across multiple GPUs, but the GPUs are under-utilized: one CPU core is at 100% while the rest are at 0%. It seems the full power of the machine is not being used.

I face a similar case: VRAM is distributed across all GPUs, but the GPUs themselves are under-utilized.

Finally, it works.
Now it utilizes all GPUs.

!pip install bitsandbytes==0.41.1
!pip install transformers==4.31.0
!pip install peft==0.4.0 
!pip install accelerate==0.21.0 einops==0.6.1 evaluate==0.4.0 scikit-learn==1.2.2 sentencepiece==0.1.99

Change in qlora.py:
device_map = 'auto' -> device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))}

!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=True --ddp_find_unused_parameters=False

Tested in Runpod environment with Python 3.10 and Torch 2.0.0+cu117

Got 15 seconds/iter.

Compared to

!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=False

Got 10 seconds/iter, but VRAM consumption is multiplied by the number of GPUs.

Compared to the vanilla one (original)

!python3.10 qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True

Got 55 seconds/iter, so it is very slow compared to the previous methods.

@Qubitium
Contributor Author

@ichsan2895 Interesting result with gradient_checkpointing off and on. Using it causes a ~33% slowdown for multi-GPU QLoRA, yet the train/eval loss values are identical. Did you see exactly matching wandb graphs between the two over a full fine-tune? Also, was there a large memory difference between gradient_checkpointing on/off? Thanks.

@ichsan2895

ichsan2895 commented Aug 22, 2023

@ichsan2895 Interesting result with gradient_checkpointing off and on. Using it causes a ~33% slowdown for multi-GPU QLoRA, yet the train/eval loss values are identical. Did you see exactly matching wandb graphs between the two over a full fine-tune? Also, was there a large memory difference between gradient_checkpointing on/off? Thanks.

When gradient_checkpointing is False, training is faster, but it consumes more GPU VRAM.

For example, if one GPU needs 20 GB of VRAM,
two GPUs need 20x2 = 40 GB in total,
and three GPUs need 20x3 = 60 GB in total.

When gradient_checkpointing is True, training is a little slower, but the VRAM usage is spread across the GPUs.

For example, if one GPU needs 20 GB of VRAM,
with two GPUs it is 20/2 = 10 GB per GPU,
and with three GPUs it is 20/3 ≈ 6.67 GB per GPU.

But I'm sorry, I don't use wandb for tracking logs.

@shubhanjan99

@ichsan2895 your solution above worked. Do you know how we can extend it to multi-node multi-gpu?
