[Bug]: restarting vllm --> "WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL" #13836

AlbiRadtke opened this issue Feb 25, 2025 · 0 comments
Your current environment

I already posted this as issue #13294 and several users answered that they have the same problem, so it seems to be a bug :)

Hey guys :)

Since version 0.6.6 up to the current v0.7.2 I have had a slightly annoying problem. When I start up my AI server, vLLM works fine: the model is loaded and can be used as desired.
However, as soon as I end my start script and want to load the same model again, or even a different model, vLLM always freezes at this point:

[W214 15:20:56.973624780 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.008814328 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.104798409 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.107680744 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.196595399 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.199089483 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.205991785 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.207522727 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
(VllmWorkerProcess pid=8968) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8969) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8968) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8969) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8971) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8967) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8971) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8967) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8970) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8972) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8973) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8970) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8972) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8973) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5

Whenever I quit vLLM before restarting, I always get this warning:
[rank0]:[W214 15:18:06.519697808 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

Unfortunately, I cannot find any examples of how to run the start script so that the shutdown is performed correctly. Can you help me with this?
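
For illustration, this is roughly what I imagine a clean stop would have to look like (just a sketch; the PID file is something I made up, my start script does not write one yet):

#!/bin/bash
# stop_vllm.sh - ask the API server to shut down and wait for it to exit
PID=$(cat /tmp/vllm_api_server.pid)        # assumes the start script saved its PID here
kill -INT "$PID"                           # same as Ctrl-C, lets vLLM run its shutdown path
for i in $(seq 1 60); do                   # wait up to 60 s for the process to disappear
        kill -0 "$PID" 2>/dev/null || break
        sleep 1
done
# show whether any compute processes are still holding the GPUs
nvidia-smi --query-compute-apps=pid,process_name --format=csv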

How would you like to use vllm

I would like to be able to reload a model with vLLM without always having to restart the AI machine.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

🐛 Describe the bug

My start script (startskript.sh):

#!/bin/bash

token=50000
export HF_TOKEN=hf_myToken
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_LAUNCH_BLOCKING=0
export disable_custom_all_reduce=True

python -m vllm.entrypoints.openai.api_server \
        --model=mistralai/Pixtral-Large-Instruct-2411 \
        --config-format mistral \
        --load-format mistral \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=10' \
        --host 192.uuu.xxx.yyy \
        --port myPORT \
        --trust-remote-code \
        --device cuda \
        --tensor-parallel-size 8 \
        --gpu-memory-utilization 1 \
        --swap-space 10 \
        --max_num_seqs 3 \
        --max_num_batched_tokens $token \
        --max_model_len $token
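
One variant I am considering (untested, just a sketch) is to exec the server from the start script, so that a SIGTERM or Ctrl-C sent to the script is delivered directly to the vLLM process and its normal shutdown path can run before the NCCL process group is torn down:

#!/bin/bash
export HF_TOKEN=hf_myToken
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# exec replaces the shell with the API server, so signals sent to this
# script (e.g. by systemd or Ctrl-C) go straight to vLLM instead of the shell
exec python -m vllm.entrypoints.openai.api_server \
        --model=mistralai/Pixtral-Large-Instruct-2411 \
        --tensor-parallel-size 8 \
        --max_model_len 50000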
