
Multi-Node Training Timeout Error #688

Open · 1 of 2 tasks
jonghyunL opened this issue Sep 27, 2024 · 6 comments

System Info

Env: PyTorch 2.5 nightly, CUDA 12.4, Python 3.10, NVIDIA Hopper GPUs (2 GPUs), NCCL 2.21.5(?)

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Hi, I am trying to run multi-node finetuning of LLAMA, where each GPU resides in a separate VM (2 VMs on a single machine, one GPU per VM) connected by a bridge network. Since this is for hardware research, I am only running a single epoch of 200 steps for testing.

I do not have a great understanding of how distributed data parallelism works in a multi-node setting, but I came across this error message on both of my VMs.

I tried increasing the timeout of torch.distributed.init_process_group(backend="nccl", timeout=timedelta(hours=1)) so that this exit barrier doesn't get triggered by the timeout. I also tried changing the barrier timeout point, but that didn't work either.
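
For reference, this is roughly the change I made (a minimal sketch, assuming the process group is initialized manually rather than by the recipe's own launcher logic):

from datetime import timedelta
import torch.distributed as dist

# Raise the collective timeout well above the default so a slow rendezvous
# or a slow first all_reduce does not abort the job immediately.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))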

Can anyone help me understand what this message implies, and how I can solve it?

Error logs

[screenshot of the timeout error log]

Expected behavior

I expected the system to perform all_reduce, but it just terminates due to a timeout.

@HamidShojanazeri (Contributor)

@jonghyunL there are several things that could cause this, but have you tried setting the variables below on your two VMs?

# On the secondary node (rank 1):
export MASTER_ADDR="192.168.1.1"    # Same as the primary node
export MASTER_PORT=12355            # Same port
export WORLD_SIZE=2                 # Same total number of processes
export RANK=1

# On the primary node (rank 0):
export MASTER_ADDR="192.168.1.1"    # Same as the primary node
export MASTER_PORT=12355            # Same port
export WORLD_SIZE=2                 # Same total number of processes
export RANK=0
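
For completeness, when the job is launched with torchrun these variables are normally set per worker by the launcher itself from its flags, so an equivalent launch (addresses are placeholders; the script name follows the recipe's finetuning.py) would look like:

# on the primary VM
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=192.168.1.1 --master_port=12355 finetuning.py ...
# on the second VM
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=192.168.1.1 --master_port=12355 finetuning.py ...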

@jonghyunL (Author) commented Sep 30, 2024

Thank you for your reply, Hamid. I tried exporting the env variables before running, but it still gives me this timeout error.

I tried setting up the environment with
export NCCL_SOCKET_IFNAME=eno1
but now it gives me

[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1723102898088/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:318, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Error: network not found.

With NCCL_DEBUG=WARN it shows this error message:
transport/net_socket.cc:46 NCCL WARN NET/Socket : no interface found

Any clue? Are there any more settings that I need to provide?

@mreso (Contributor) commented Oct 1, 2024

Hi @jonghyunL, NCCL_SOCKET_IFNAME will need to be specific to your environment. Is the bridge interface eno1 inside the VM? Also, can you check that the master IP is correct? Is that the IP of the bridge? Can you provide the output of ifconfig on the VMs and check whether ping between the machines works?
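
As a concrete illustration (the interface name here is a placeholder; use whichever interface ifconfig shows owning the bridge IP inside each VM):

ip -o -4 addr show                    # or ifconfig; find the interface that owns the bridge IP
export NCCL_SOCKET_IFNAME=enp0s1      # the interface on MASTER_ADDR's subnet
export GLOO_SOCKET_IFNAME=enp0s1      # if a gloo process group is also created
export NCCL_DEBUG=WARN                # surface NCCL's own diagnostics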

@jonghyunL (Author) commented Oct 1, 2024

Based on ifconfig, I set my env variable NCCL_SOCKET_IFNAME=enp0s1.
Both ping and iperf3 work well between the two VMs.
[screenshots of the ifconfig output and the ping/iperf3 results]

@jonghyunL (Author)

Previously, when I was running with this configuration:
"torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=172.20.189.64 --master_port=1234 finetuning.py --dist_checkpoint_root_folder model_checkpoint --dist_checkpoint_folder fine-tuned --model_name meta-llama/Llama-2-7b-hf --output_dir output_llama_7b/ --use_peft --peft_method lora --use_fp16 --max_train_step=10 --batch_size_training=1 --num_epoch=2 --enable_fsdp > llama_7b_native_multi2.out"

Before finishing the first iteration, this error came out:
[screenshot of the NCCL error]

There was a post regarding the same issue (NVIDIA/nccl#626), so I added export NCCL_PROTO=Simple.

Now another timeout error comes out at the first iteration.
[screenshot of the timeout error]

@jonghyunL (Author) commented Oct 2, 2024

This is also a screenshot showing torch.distributed.all_reduce working under the same environment:
[screenshot of the all_reduce test output]

This is the code used.
[screenshot of the test script]
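
Since the screenshot isn't reproduced here, the test was along these lines (an illustrative sketch of a minimal all_reduce check run under torchrun, not necessarily the exact script in the image):

import os
import torch
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and LOCAL_RANK per worker.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: {x.item()}")  # expect 2.0 with two ranks
dist.destroy_process_group()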

I'm kind of lost about what I should be doing.
