Multi-Node Training Timeout Error #688
Comments
@jonghyunL there are several things that could cause this, but have you tried setting the variables below on your two VMs?
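The exact settings from the original suggestion are not preserved in this thread. As a hedged sketch, the variables usually meant here are the NCCL networking ones referenced later in the discussion; the values below are assumptions and must match the VMs' actual interfaces:

```python
# Sketch only: exact values are environment-specific assumptions.
# Set these on both VMs before torch.distributed.init_process_group is called.
import os

os.environ["NCCL_SOCKET_IFNAME"] = "eno1"  # replace with the bridge/NIC name shown by ifconfig
os.environ["NCCL_DEBUG"] = "WARN"          # surface NCCL warnings in the logs
os.environ["NCCL_IB_DISABLE"] = "1"        # assumption: no InfiniBand inside the VMs
```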
Thank you for your reply, Hamid. I have tried exporting the environment variables before running, but it still gives me this timeout error. I tried setting up the environment and got:

[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1723102898088/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:318, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5

With NCCL_DEBUG=WARN it shows this error message. Any clue? Are there any more settings that I need to provide?
Hi @jonghyunL, NCCL_SOCKET_IFNAME will need to be specific to your environment. Is the bridge interface eno1 in the VM? Also, can you check that the master IP is correct? Is that the IP of the bridge? Can you provide the output of ifconfig on the VMs and check whether ping between the machines works?
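As a rough sketch of that check (the placeholder address and port are assumptions), something like this can be run on each VM once rank 0's rendezvous endpoint is listening:

```python
# Hypothetical connectivity check; adjust MASTER_ADDR/MASTER_PORT to your bridge setup.
import os
import socket

master_addr = os.environ.get("MASTER_ADDR", "192.0.2.10")   # placeholder IP
master_port = int(os.environ.get("MASTER_PORT", "29500"))

print("hostname:", socket.gethostname())
print("resolved local IP:", socket.gethostbyname(socket.gethostname()))  # may reflect /etc/hosts

# Try opening a TCP connection to the master rendezvous endpoint.
with socket.create_connection((master_addr, master_port), timeout=5) as s:
    print("reached master at", s.getpeername())
```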
Previously, when I was running with my configuration, this error came out before finishing the first iteration. There was a post regarding the same issue: NVIDIA/nccl#626.
System Info
Env: PyTorch 2.5 nightly, CUDA 12.4, Python 3.10, NVIDIA Hopper GPUs (2 GPUs), NCCL 2.21.5(?)
Information
🐛 Describe the bug
Hi, I am trying to run multi-node fine-tuning of LLaMA, where each GPU resides in a separate VM (2 VMs on a single machine, with one GPU per VM) connected by a bridge network. From a hardware research perspective, I only try running a single epoch of 200 steps for testing.
I do not have a great understanding of how distributed data parallelism works in a multi-node setting, but I came across this error message on both of my VMs.
I tried changing the timeout limit with torch.distributed.init_process_group(backend="nccl", timeout=timedelta(hours=1)), as sketched below, so that this exit barrier doesn't get triggered by the timeout. I also tried changing the barrier timeout point, but that didn't work either.
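For reference, a minimal sketch of passing that longer timeout; the TORCH_DISTRIBUTED_DEBUG setting is an added assumption for more verbose c10d diagnostics, not part of the original report:

```python
# Hedged sketch of the timeout change described above.
import os
from datetime import timedelta

import torch.distributed as dist

os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # assumption: extra debug output

dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))
```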
Can anyone help me understand what this message implies and how I can solve it?
Error logs
Expected behavior
I expected the system to perform all_reduce, but it just terminates due to the timeout.
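To isolate the problem from the fine-tuning code, a minimal all_reduce smoke test along these lines could be run on both VMs; the launch command and values are assumptions, not taken from the original report:

```python
# Minimal two-node all_reduce smoke test (sketch, assuming torchrun sets
# RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR/MASTER_PORT).
# Example launch on each VM:
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0|1> \
#            --master_addr=<bridge IP of VM0> --master_port=29500 smoke_test.py
import os
from datetime import timedelta

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# Each rank contributes a distinct tensor; after all_reduce every rank holds the global sum.
x = torch.ones(4, device="cuda") * (dist.get_rank() + 1)
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: {x.tolist()}")

dist.destroy_process_group()
```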