[Usage]: restarting vllm --> "WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL" #13294

Closed
AlbiRadtke opened this issue Feb 14, 2025 · 6 comments
Labels
usage How to use vllm

Comments

@AlbiRadtke

Your current environment

Hey guys :)

Since version 0.6.6 up to the current v0.7.2 I have had a slightly annoying problem. When I start vLLM on my AI server, everything works fine: the model is loaded and can be used as expected.
However, as soon as I stop my start script and want to load the same model again, or even a different model, vLLM always freezes at this point:

[W214 15:20:56.973624780 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.008814328 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.104798409 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.107680744 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.196595399 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.199089483 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.205991785 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.207522727 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
(VllmWorkerProcess pid=8968) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8969) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8968) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8969) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8971) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8967) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8971) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8967) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8970) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8972) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8973) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8970) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8972) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8973) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5

Whenever I quit vLLM beforehand, I always get this warning:
[rank0]:[W214 15:18:06.519697808 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

Unfortunately, I cannot find any examples of how to run the start script so that the shutdown is performed correctly. Can you help me with this?
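
For context, what I understand the warning to be asking for, when driving vLLM from Python directly, is roughly the following. This is only an untested sketch on my side (the model name and tensor_parallel_size just mirror my start script below), and I don't know how to do the equivalent for the API server started from a shell script:

# Untested sketch: explicit teardown when using the offline LLM API.
import gc

import torch
import torch.distributed as dist
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Pixtral-Large-Instruct-2411", tensor_parallel_size=8)
try:
    out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)
finally:
    # Drop the engine, free CUDA memory, and destroy the NCCL process group
    # before the interpreter exits, as the PyTorch warning suggests.
    del llm
    gc.collect()
    torch.cuda.empty_cache()
    if dist.is_initialized():
        dist.destroy_process_group()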

My start script (startskript.sh):

#!/bin/bash

token=50000
export HF_TOKEN=hf_myToken
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_LAUNCH_BLOCKING=0
export disable_custom_all_reduce=True

python -m vllm.entrypoints.openai.api_server \
        --model=mistralai/Pixtral-Large-Instruct-2411 \
        --config-format mistral \
        --load-format mistral \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=10' \
        --host 192.uuu.xxx.yyy \
        --port myPORT \
        --trust-remote-code \
        --device cuda \
        --tensor-parallel-size 8 \
        --gpu-memory-utilization 1 \
        --swap-space 10 \
        --max_num_seqs 3 \
        --max_num_batched_tokens $token \
        --max_model_len $token
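
The only idea I have had so far is to not launch the server directly from the shell but through a small wrapper that forwards Ctrl+C and then blocks until the server process has really exited, so that the next launch does not start while NCCL state is still being torn down. Again, this is only an untested sketch (the flag list is shortened to the ones above), and I don't know whether it is enough to avoid the hang:

# Untested sketch of a relaunch wrapper: run the API server as a child
# process, forward Ctrl+C as SIGINT, and block until it has fully exited
# before the wrapper itself returns.
import signal
import subprocess
import sys

cmd = [
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model=mistralai/Pixtral-Large-Instruct-2411",
    "--tensor-parallel-size", "8",
    # ... remaining flags from startskript.sh ...
]

proc = subprocess.Popen(cmd)
try:
    proc.wait()
except KeyboardInterrupt:
    # Ask the server to shut down and wait until it is really gone.
    proc.send_signal(signal.SIGINT)
    proc.wait()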

How would you like to use vllm

I would like to be able to reload a model with vLLM without always having to restart the whole AI machine.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
AlbiRadtke added the usage (How to use vllm) label on Feb 14, 2025
@SwarmKit

[Image attachment]

@AlbiRadtke
Author

Yes, this is the point where it gets stuck when reloading a model. Do you have a solution? :)

@AlbiRadtke
Author

Does anyone have an idea? Or should I report it as a bug?
Thank you all! :)

@W-Wuxian

Same error for me, both when installing via pip and when building from source.

@vgabbo

vgabbo commented Feb 25, 2025

Hi @AlbiRadtke! Did you solve this problem?
I have the same issue and I don't understand why other people don't seem to run into it. Maybe there is some config we are not aware of?
If you have any solutions or workarounds, that would be really helpful.

@AlbiRadtke
Author

Unfortunately, despite a lot of research, I have not found a solution, and since I am obviously not the only one with this problem, I have now reported it as bug #13836.
I will therefore close the issue here and only pursue it further in the bug report.

Best regards :)
