
[Bug] Get error when use two nodes #3510

Tian14267 opened this issue Feb 12, 2025 · 2 comments
Tian14267 commented Feb 12, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I get an error when using two nodes:

log.txt

Reproduction

# node 1
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/0_vllm/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode1 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000


# node 2
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode2 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000

Environment

The Docker image used is lmsysorg/sglang:dev.

jhinpan self-assigned this Feb 12, 2025
jhinpan commented Feb 12, 2025

According to the log, it appears that the process termination routine is happening over and over. More specifically, the sigquit_handler (in engine.py at line 333) is repeatedly calling kill_process_tree(os.getpid()), and inside that function (in utils.py at line 507) the code attempts to send the SIGQUIT signal to the current process over and over again.

A straightforward fix would be to add error handling in the kill_process_tree() function so that if it encounters an error (for example, the process no longer exists, or it detects that it is about to signal itself) it won't re-enter an infinite loop: check for these cases before sending the signal, or catch and handle the exception.
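The two guards described above could be sketched roughly like this (a minimal illustration under my own assumptions, not sglang's actual code; safe_kill_process_tree is a hypothetical name):

```python
import os
import signal

def safe_kill_process_tree(pid: int) -> bool:
    """Send SIGQUIT to `pid`, but never to ourselves and never loop on an
    already-dead process. Returns True only if a signal was actually sent."""
    # Guard 1: signalling our own pid from inside the SIGQUIT handler
    # would re-enter that handler and recurse forever.
    if pid == os.getpid():
        return False
    try:
        os.kill(pid, signal.SIGQUIT)
        return True
    except ProcessLookupError:
        # Guard 2: the process already exited; nothing left to kill,
        # so return instead of retrying.
        return False
```

The real kill_process_tree also walks the child processes of the target; the point here is only the two guards: skip self, and swallow ProcessLookupError instead of retrying.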


Tian14267 commented Feb 12, 2025


Thank you.
I added NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME settings to my setup, like this:

# Apply the same settings on all machines
export NCCL_SOCKET_IFNAME=ens8f0np0         # force NCCL to use the Ethernet NIC
export GLOO_SOCKET_IFNAME=ens8f0np0         # force Gloo to use the Ethernet NIC
# Enable NCCL debug output
export NCCL_DEBUG=INFO
# Enable Gloo debug output (PyTorch)
export GLOO_DEBUG=1
# node 1
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/0_vllm/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode1 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
# Apply the same settings on all machines
export NCCL_SOCKET_IFNAME=ens8f0np0         # force NCCL to use the Ethernet NIC
export GLOO_SOCKET_IFNAME=ens8f0np0         # force Gloo to use the Ethernet NIC
# Enable NCCL debug output
export NCCL_DEBUG=INFO
# Enable Gloo debug output (PyTorch)
export GLOO_DEBUG=1
# node 2
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode2 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000

And I get this error:

Node-1:

log-node-1.txt

Node-2:

log-node-2.txt

My ip addr output is:

(base) root@ubuntu:/data/fffan/0_experiment/3_SGLang/0_docker/1_two_nodes_allModel# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens8f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff permaddr 6c:92:cf:af:54:c0
    altname enp151s0f0np0
3: ens8f1np1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff permaddr 6c:92:cf:af:54:c1
    altname enp151s0f1np1
4: ens16f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:30 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f0
5: ens16f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:31 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f1
6: ens16f2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:32 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f2
7: ens16f3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:33 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f3
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff
    inet 10.68.27.14/24 brd 10.68.27.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::8cce:cfff:fe7f:81d6/64 scope link 
       valid_lft forever preferred_lft forever
9: br-c2260f9ba85a: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:ba:e6:a7:f8 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global br-c2260f9ba85a
       valid_lft forever preferred_lft forever
10: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:d6:7d:9e:dd brd ff:ff:ff:ff:ff:ff
    inet 172.250.0.1/20 brd 172.250.15.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:d6ff:fe7d:9edd/64 scope link 
       valid_lft forever preferred_lft forever
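
One detail visible in this output: the address 10.68.27.14/24 sits on bond0, while ens8f0np0 is only a slave of that bond and carries no IP of its own. A quick routing check (a helper I am adding for illustration; local_addr_for is not part of sglang) shows which interface's address the kernel would actually use to reach the dist-init address:

```python
import socket

def local_addr_for(peer_ip: str, port: int = 20000) -> str:
    """Return the local source IP the kernel would pick to reach peer_ip.
    A connected UDP socket performs the route lookup without sending data."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((peer_ip, port))
        return s.getsockname()[0]
    finally:
        s.close()

# On this host, local_addr_for("10.68.27.13") would report the bond0
# address (10.68.27.14), suggesting NCCL_SOCKET_IFNAME and
# GLOO_SOCKET_IFNAME may need to point at bond0 rather than the
# slave NIC ens8f0np0.
```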

Can you please take a look again?

@jhinpan
