Multi-gpu training example? #12
On my side, when I run the following command, it works on multiple GPUs. But the GPUs are under-used: one CPU core is at 100% and the rest are at 0%. It seems like the full power of the machine is not being used.
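If the model is loaded with device_map="auto" (a common qlora-style setup; this is an assumption, the actual command is not shown above), this behaviour is expected: the layers are sharded across the GPUs and executed one after another (naive model parallelism), so only one GPU is busy at any moment and a single CPU core can become the preprocessing bottleneck. A minimal sketch with a placeholder model name:

```python
# Minimal sketch, assuming device_map="auto" loading (placeholder model name).
# Layers are sharded across all visible GPUs and run sequentially, so only one
# GPU is active at a time.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                                      # placeholder model
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit quantization
    device_map="auto",                                          # shard layers across GPUs
)
print(model.hf_device_map)  # shows which module ended up on which device
```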
That's perfect information.
Not sure if it's right, but in my case adding …
I couldn't get DDP with DeepSpeed to work until now; I got a multiplication error.
Same problem.
@ChenDelong1999 I found that disabling ZeRO, i.e., setting …
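For reference, a hypothetical DeepSpeed config with ZeRO turned off; the exact setting used above is not shown, so this is an assumption about what "disable zero" looks like in practice:

```python
# Hypothetical DeepSpeed config sketch with ZeRO disabled (stage 0).
# The keys are standard DeepSpeed options; "auto" lets the HF Trainer fill them in.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 0},  # 0 = ZeRO partitioning disabled
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Then launch with something like (assumes the script exposes the standard
# HF TrainingArguments --deepspeed flag):
#   deepspeed --num_gpus=2 qlora.py --deepspeed ds_config.json <other args>
```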
@SparkJiao Sorry, but what do you mean by that? By the way, I just found that removing model.cuda() or model.eval() helped me solve the multiplication error.
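A sketch of the idea behind this fix, under the assumption that the model is loaded in 4-bit with a device_map (placeholder model name): once from_pretrained has dispatched the quantized weights according to device_map, an extra model.cuda() moves them again and can leave weights and activations on different devices, which shows up as a matmul device error.

```python
# Sketch only: the model is already placed on GPU 0 by device_map, so the extra
# .cuda() call is unnecessary and can cause device mismatches with 4-bit weights.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                                      # placeholder model
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map={"": 0},                                         # already on GPU 0
)
# model = model.cuda()  # the kind of extra line this comment suggests removing
```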
Where do I look for that? It doesn't seem to be in the QLoRA code.
Yes, my own code has this additional line.
Based on the information provided, when running the training command with multiple GPUs you hit a runtime error about marking variables ready, which suggests that module parameters are being shared across multiple concurrent forward-backward passes. To address this, we recommend the following:
Make sure distributed training is initialized. torch.distributed ships with PyTorch, so no separate install is needed; calling torch.distributed.init_process_group(backend="nccl") initializes the distributed process group with the NCCL backend, which is commonly used for multi-GPU training with PyTorch.
Additionally, we recommend checking the compatibility and version requirements of the libraries and frameworks you are using. Ensuring that all dependencies are up to date can help resolve potential compatibility issues.
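A minimal, self-contained DDP sketch of that initialization (not the qlora.py code; the toy model is a stand-in):

```python
# Minimal DDP sketch. torch.distributed ships with PyTorch.
# Launch with, for example:
#   torchrun --nproc_per_node=2 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # NCCL backend for multi-GPU
    local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 10).cuda(local_rank)   # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```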
Added multi-GPU reference: #68
Temporarily disabling gradient checkpointing can help training on multi-GPU.
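For example, via the standard transformers flag (a sketch assuming a TrainingArguments-based setup, not a qlora-specific option):

```python
# Sketch: disabling gradient checkpointing through transformers' TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    gradient_checkpointing=False,  # off: faster and DDP-friendlier, but uses more VRAM
)
```

On the command line this is the usual `--gradient_checkpointing False`.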
@bliu3650 Can you share the command you used?
This way, qlora can use DeepSpeed.
@zhangluoyang I tried this and it's indeed effective, but I encountered a deadlock during training. Have you ever encountered this?
@SkyAndCloud Code change in qlora.py, line 273: … Run command: …
The method above works for me. But the point is that if gradient checkpointing is turned off, it is hard to fit a big model into limited memory, which is not what QLoRA is meant for. I am wondering if there is a way to keep gradient checkpointing turned on.
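One thing worth trying (a sketch under the assumption of a reasonably recent transformers version, not something verified in this thread) is the non-reentrant checkpointing variant, which tends to play better with DDP:

```python
# Sketch: keeping gradient checkpointing on while training with DDP
# (assumes transformers new enough to support gradient_checkpointing_kwargs).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},  # non-reentrant checkpointing
    ddp_find_unused_parameters=False,  # avoids the "marking variables ready" error
)
```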
Facing a similar case.
Using the …
I found DDP significantly reduces training time. One reason why you might not see that is that using DDP changes the way …
Can you please give an example of how to make this work?
@ichsan2895 Interesting result with gradient_checkpointing off and on. Using it causes a 33% slowdown in qlora on multi-GPU, yet the train/eval loss values are identical. Did you compare the exact wandb graphs between these two over the full fine-tune? Also, was there a huge memory difference between gradient_checkpointing on/off? Thanks.
When gradient_checkpointing is False, it is faster, but it consumes more GPU VRAM; for example, with one GPU it needs 20 GB of VRAM. When gradient_checkpointing is True, it is a little slower, but it spreads the VRAM usage across all GPUs; for example, with one GPU it needs 20 GB of VRAM. Sorry, I don't use wandb for tracking logs.
@ichsan2895 Your solution above worked. Do you know how we can extend it to multi-node multi-GPU?
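Not verified in this thread, but the standard torchrun flags extend the same launch to multiple nodes; the addresses, ports, and GPU counts below are placeholders:

```python
# Sketch: multi-node launch with torchrun (run one command on each node).
#
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
#       --master_addr=10.0.0.1 --master_port=29500 qlora.py <args>   # node 0
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
#       --master_addr=10.0.0.1 --master_port=29500 qlora.py <args>   # node 1
#
# Inside the script, global rank and world size come from the environment:
import os

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
print(f"rank {rank}/{world_size}, local rank {local_rank}")
```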
Testing 4-bit QLoRA training on 33B LLaMA: training runs fine on 1 GPU but fails with the following when using torchrun on 2 GPUs. I am referring to parallel training where each GPU has a full model. Has anyone got multi-GPU parallel training working yet?