
[Bug] Get error when use two nodes #3510

Tian14267 opened this issue Feb 12, 2025 · 2 comments
Tian14267 commented Feb 12, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I get an error when using two nodes:

log.txt

Reproduction

# node 1
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/0_vllm/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode1 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000


# node 2
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode2 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000

Environment

The Docker image used is lmsysorg/sglang:dev.

jhinpan self-assigned this Feb 12, 2025
jhinpan commented Feb 12, 2025

According to the log, it appears that the process termination routine is happening over and over. More specifically, the sigquit_handler (in engine.py at line 333) is repeatedly calling kill_process_tree(os.getpid()), and inside that function (in utils.py at line 507) the code attempts to send the SIGQUIT signal to the current process over and over again.

A straightforward fix would be to add error handling in the kill_process_tree() function so that if it encounters an error (for example, the process no longer exists, or it detects that it is about to signal itself) it won't re-enter an infinite loop: check for these cases before sending the signal, or catch and handle the exception.
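The two guards described above could be sketched roughly like this (a minimal illustration under my own assumptions, not sglang's actual code; safe_kill_process_tree is a hypothetical name):

```python
import os
import signal

def safe_kill_process_tree(pid: int) -> bool:
    """Send SIGQUIT to `pid`, but never to ourselves and never loop on an
    already-dead process. Returns True only if a signal was actually sent."""
    # Guard 1: signalling our own pid from inside the SIGQUIT handler
    # would re-enter that handler and recurse forever.
    if pid == os.getpid():
        return False
    try:
        os.kill(pid, signal.SIGQUIT)
        return True
    except ProcessLookupError:
        # Guard 2: the process already exited; nothing left to kill,
        # so return instead of retrying.
        return False
```

The real kill_process_tree also walks the child processes of the target; the point here is only the two guards: skip self, and swallow ProcessLookupError instead of retrying.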


Tian14267 commented Feb 12, 2025


Thank you.
I added NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME settings to my setup, like this:

# Apply the same settings on all machines
export NCCL_SOCKET_IFNAME=ens8f0np0         # force NCCL to use the Ethernet NIC
export GLOO_SOCKET_IFNAME=ens8f0np0         # force Gloo to use the Ethernet NIC
# Enable NCCL debug output
export NCCL_DEBUG=INFO
# Enable Gloo debug output (PyTorch)
export GLOO_DEBUG=1
# node 1
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/0_vllm/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode1 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
# Apply the same settings on all machines
export NCCL_SOCKET_IFNAME=ens8f0np0         # force NCCL to use the Ethernet NIC
export GLOO_SOCKET_IFNAME=ens8f0np0         # force Gloo to use the Ethernet NIC
# Enable NCCL debug output
export NCCL_DEBUG=INFO
# Enable Gloo debug output (PyTorch)
export GLOO_DEBUG=1
# node 2
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode2 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000

And I get this error:

Node-1:

log-node-1.txt

Node-2:

log-node-2.txt

My ip addr output is:

(base) root@ubuntu:/data/fffan/0_experiment/3_SGLang/0_docker/1_two_nodes_allModel# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens8f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff permaddr 6c:92:cf:af:54:c0
    altname enp151s0f0np0
3: ens8f1np1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff permaddr 6c:92:cf:af:54:c1
    altname enp151s0f1np1
4: ens16f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:30 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f0
5: ens16f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:31 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f1
6: ens16f2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:32 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f2
7: ens16f3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:33 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f3
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff
    inet 10.68.27.14/24 brd 10.68.27.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::8cce:cfff:fe7f:81d6/64 scope link 
       valid_lft forever preferred_lft forever
9: br-c2260f9ba85a: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:ba:e6:a7:f8 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global br-c2260f9ba85a
       valid_lft forever preferred_lft forever
10: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:d6:7d:9e:dd brd ff:ff:ff:ff:ff:ff
    inet 172.250.0.1/20 brd 172.250.15.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:d6ff:fe7d:9edd/64 scope link 
       valid_lft forever preferred_lft forever
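
One detail visible in this output: the address 10.68.27.14/24 sits on bond0, while ens8f0np0 is only a slave of that bond and carries no IP of its own. A quick routing check (a helper I am adding for illustration; local_addr_for is not part of sglang) shows which interface's address the kernel would actually use to reach the dist-init address:

```python
import socket

def local_addr_for(peer_ip: str, port: int = 20000) -> str:
    """Return the local source IP the kernel would pick to reach peer_ip.
    A connected UDP socket performs the route lookup without sending data."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((peer_ip, port))
        return s.getsockname()[0]
    finally:
        s.close()

# On this host, local_addr_for("10.68.27.13") would report the bond0
# address (10.68.27.14), suggesting NCCL_SOCKET_IFNAME and
# GLOO_SOCKET_IFNAME may need to point at bond0 rather than the
# slave NIC ens8f0np0.
```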

Can you please take a look again?

@jhinpan
