
Multi-Node Training Timeout Error #688

Open · 1 of 2 tasks
jonghyunL opened this issue Sep 27, 2024 · 6 comments

System Info

Env: PyTorch 2.5 nightly, CUDA 12.4, Python 3.10, NVIDIA Hopper GPUs (2 GPUs), NCCL 2.21.5(?)

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Hi, I am trying to run multi-node finetuning of LLAMA, where each GPU resides in a separate VM (2 VMs on a single machine, one GPU per VM) connected by a bridge network. Since this is for hardware research, I am only running a single epoch of 200 steps for testing.

I do not have a great understanding of how distributed data parallelism works in a multi-node setting, but I came across this error message on both of my VMs.

I tried increasing the timeout of torch.distributed.init_process_group(backend="nccl", timeout=timedelta(hours=1)) so that this exit barrier doesn't get triggered by the timeout. I also tried changing the barrier timeout point, but that didn't work either.
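
For reference, this is roughly the change I made (a minimal sketch, assuming the process group is initialized manually rather than by the recipe's own launcher logic):

from datetime import timedelta
import torch.distributed as dist

# Raise the collective timeout well above the default so a slow rendezvous
# or a slow first all_reduce does not abort the job immediately.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))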

Can anyone help me understand what this message implies, and how I can solve it?

Error logs

[screenshot of the timeout error log]

Expected behavior

I expected the system to perform all_reduce, but it just terminates due to a timeout.

@HamidShojanazeri (Contributor)

@jonghyunL there are several things that could cause this, but have you tried setting the variables below on your two VMs?

# On the secondary node (rank 1):
export MASTER_ADDR="192.168.1.1"    # Same as the primary node
export MASTER_PORT=12355            # Same port
export WORLD_SIZE=2                 # Same total number of processes
export RANK=1

# On the primary node (rank 0):
export MASTER_ADDR="192.168.1.1"    # Same as the primary node
export MASTER_PORT=12355            # Same port
export WORLD_SIZE=2                 # Same total number of processes
export RANK=0
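
For completeness, when the job is launched with torchrun these variables are normally set per worker by the launcher itself from its flags, so an equivalent launch (addresses are placeholders; the script name follows the recipe's finetuning.py) would look like:

# on the primary VM
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=192.168.1.1 --master_port=12355 finetuning.py ...
# on the second VM
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=192.168.1.1 --master_port=12355 finetuning.py ...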

@jonghyunL (Author) commented Sep 30, 2024

Thank you for your reply, Hamid. I tried exporting the env variables before running, but it still gives me this timeout error.

I tried setting up the environment with
export NCCL_SOCKET_IFNAME=eno1
but now it gives me

[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1723102898088/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:318, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Error: network not found.

With NCCL_DEBUG=WARN it shows this error message:
transport/net_socket.cc:46 NCCL WARN NET/Socket : no interface found

Any clue? Are there any more settings that I need to provide?

@mreso (Contributor) commented Oct 1, 2024

Hi @jonghyunL, NCCL_SOCKET_IFNAME will need to be specific to your environment. Is the bridge interface eno1 inside the VM? Also, can you check that the master IP is correct? Is that the IP of the bridge? Can you provide the output of ifconfig on the VMs and check whether ping between the machines works?
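
As a concrete illustration (the interface name here is a placeholder; use whichever interface ifconfig shows owning the bridge IP inside each VM):

ip -o -4 addr show                    # or ifconfig; find the interface that owns the bridge IP
export NCCL_SOCKET_IFNAME=enp0s1      # the interface on MASTER_ADDR's subnet
export GLOO_SOCKET_IFNAME=enp0s1      # if a gloo process group is also created
export NCCL_DEBUG=WARN                # surface NCCL's own diagnostics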

@jonghyunL (Author) commented Oct 1, 2024

Based on ifconfig, I set my env variable NCCL_SOCKET_IFNAME=enp0s1.
Both ping and iperf3 work well between the two VMs.
[screenshots of the ifconfig output and the ping/iperf3 results]

@jonghyunL (Author)

Previously, when I was running with this configuration:
"torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=172.20.189.64 --master_port=1234 finetuning.py --dist_checkpoint_root_folder model_checkpoint --dist_checkpoint_folder fine-tuned --model_name meta-llama/Llama-2-7b-hf --output_dir output_llama_7b/ --use_peft --peft_method lora --use_fp16 --max_train_step=10 --batch_size_training=1 --num_epoch=2 --enable_fsdp > llama_7b_native_multi2.out"

Before finishing the first iteration, this error came out:
[screenshot of the NCCL error]

There was a post regarding the same issue (NVIDIA/nccl#626), so I added export NCCL_PROTO=Simple.

Now another timeout error comes out at the first iteration.
[screenshot of the timeout error]

@jonghyunL (Author) commented Oct 2, 2024

This is also a screenshot showing torch.distributed.all_reduce working under the same environment:
[screenshot of the all_reduce test output]

This is the code used.
[screenshot of the test script]
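
Since the screenshot isn't reproduced here, the test was along these lines (an illustrative sketch of a minimal all_reduce check run under torchrun, not necessarily the exact script in the image):

import os
import torch
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and LOCAL_RANK per worker.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: {x.item()}")  # expect 2.0 with two ranks
dist.destroy_process_group()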

I'm kind of lost about what I should be doing.
