Multi-gpu training example? #12
On my side, when I run the following command, it works on multiple GPUs. But the GPUs are under-used: one CPU core is at 100% and the rest are at 0%. It seems like the full power of the machine is not being used.
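If the model is loaded with device_map="auto" (a common qlora-style setup; this is an assumption, the actual command is not shown above), this behaviour is expected: the layers are sharded across the GPUs and executed one after another (naive model parallelism), so only one GPU is busy at any moment and a single CPU core can become the preprocessing bottleneck. A minimal sketch with a placeholder model name:

```python
# Minimal sketch, assuming device_map="auto" loading (placeholder model name).
# Layers are sharded across all visible GPUs and run sequentially, so only one
# GPU is active at a time.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                                      # placeholder model
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit quantization
    device_map="auto",                                          # shard layers across GPUs
)
print(model.hf_device_map)  # shows which module ended up on which device
```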
That's perfect information.
Not sure if it's right, but in my case adding …
I couldn't get DDP with DeepSpeed to work until now; I got a multiplication error.
Same problem.
@ChenDelong1999 I found that disabling ZeRO, i.e., setting …
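For reference, a hypothetical DeepSpeed config with ZeRO turned off; the exact setting used above is not shown, so this is an assumption about what "disable zero" looks like in practice:

```python
# Hypothetical DeepSpeed config sketch with ZeRO disabled (stage 0).
# The keys are standard DeepSpeed options; "auto" lets the HF Trainer fill them in.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 0},  # 0 = ZeRO partitioning disabled
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Then launch with something like (assumes the script exposes the standard
# HF TrainingArguments --deepspeed flag):
#   deepspeed --num_gpus=2 qlora.py --deepspeed ds_config.json <other args>
```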
@SparkJiao Sorry, but what do you mean by that? By the way, I just found that removing model.cuda() or model.eval() helped me solve the multiplication error.
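A sketch of the idea behind this fix, under the assumption that the model is loaded in 4-bit with a device_map (placeholder model name): once from_pretrained has dispatched the quantized weights according to device_map, an extra model.cuda() moves them again and can leave weights and activations on different devices, which shows up as a matmul device error.

```python
# Sketch only: the model is already placed on GPU 0 by device_map, so the extra
# .cuda() call is unnecessary and can cause device mismatches with 4-bit weights.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                                      # placeholder model
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map={"": 0},                                         # already on GPU 0
)
# model = model.cuda()  # the kind of extra line this comment suggests removing
```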
Where do I look for that? It doesn't seem to be in the QLoRA code.
Yes, my own code has this additional line.
Based on the information provided, when running the training command with multiple GPUs you hit a runtime error about marking variables ready, which suggests that module parameters are being shared across multiple concurrent forward-backward passes. To address this, we recommend the following:
Make sure distributed training is initialized. torch.distributed ships with PyTorch, so no separate install is needed; calling torch.distributed.init_process_group(backend="nccl") initializes the distributed process group with the NCCL backend, which is commonly used for multi-GPU training with PyTorch.
Additionally, we recommend checking the compatibility and version requirements of the libraries and frameworks you are using. Ensuring that all dependencies are up to date can help resolve potential compatibility issues.
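A minimal, self-contained DDP sketch of that initialization (not the qlora.py code; the toy model is a stand-in):

```python
# Minimal DDP sketch. torch.distributed ships with PyTorch.
# Launch with, for example:
#   torchrun --nproc_per_node=2 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # NCCL backend for multi-GPU
    local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 10).cuda(local_rank)   # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```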
Added multi-GPU reference: #68
Temporarily disabling gradient checkpointing can help training on multi-GPU.
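For example, via the standard transformers flag (a sketch assuming a TrainingArguments-based setup, not a qlora-specific option):

```python
# Sketch: disabling gradient checkpointing through transformers' TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    gradient_checkpointing=False,  # off: faster and DDP-friendlier, but uses more VRAM
)
```

On the command line this is the usual `--gradient_checkpointing False`.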
@bliu3650 Can you share the command you used?
This way, qlora can use DeepSpeed.
@zhangluoyang I tried this and it's indeed effective, but I encountered a deadlock during training. Have you ever encountered this?
@SkyAndCloud Code change in qlora.py, line 273: … Run command: …
The method above works for me. But the point is that if gradient checkpointing is turned off, it is hard to fit a big model into limited memory, which is not what QLoRA is meant for. I am wondering if there is a way to keep gradient checkpointing turned on.
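One thing worth trying (a sketch under the assumption of a reasonably recent transformers version, not something verified in this thread) is the non-reentrant checkpointing variant, which tends to play better with DDP:

```python
# Sketch: keeping gradient checkpointing on while training with DDP
# (assumes transformers new enough to support gradient_checkpointing_kwargs).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},  # non-reentrant checkpointing
    ddp_find_unused_parameters=False,  # avoids the "marking variables ready" error
)
```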
Facing a similar case.
Using the …
I found DDP significantly reduces training time. One reason why you might not see that is that using DDP changes the way …
Can you please give an example of how to make this work?
@ichsan2895 Interesting result with gradient_checkpointing off and on. Using it causes a 33% slowdown in qlora on multi-GPU, yet the train/eval loss values are identical. Did you compare the exact wandb graphs between these two over the full fine-tune? Also, was there a huge memory difference between gradient_checkpointing on/off? Thanks.
When gradient_checkpointing is False, it is faster, but it consumes more GPU VRAM; for example, with one GPU it needs 20 GB of VRAM. When gradient_checkpointing is True, it is a little slower, but it spreads the VRAM usage across all GPUs; for example, with one GPU it needs 20 GB of VRAM. Sorry, I don't use wandb for tracking logs.
@ichsan2895 Your solution above worked. Do you know how we can extend it to multi-node multi-GPU?
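Not verified in this thread, but the standard torchrun flags extend the same launch to multiple nodes; the addresses, ports, and GPU counts below are placeholders:

```python
# Sketch: multi-node launch with torchrun (run one command on each node).
#
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
#       --master_addr=10.0.0.1 --master_port=29500 qlora.py <args>   # node 0
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
#       --master_addr=10.0.0.1 --master_port=29500 qlora.py <args>   # node 1
#
# Inside the script, global rank and world size come from the environment:
import os

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
print(f"rank {rank}/{world_size}, local rank {local_rank}")
```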
Testing 4-bit QLoRA training on 33B LLaMA: training runs fine on 1 GPU but fails with the following when using torchrun on 2 GPUs. I am referring to parallel training where each GPU has a full model. Has anyone got multi-GPU parallel training working yet?