[Usage]: restarting vllm --> "WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL" #13294

Closed
AlbiRadtke opened this issue Feb 14, 2025 · 6 comments
Labels
usage How to use vllm

Comments

@AlbiRadtke

Your current environment

Hey guys :)

Since version 0.6.6 up to the current v0.7.2 I have had a slightly annoying problem. When I start vLLM on my AI server, everything works fine: the model is loaded and can be used as expected.
However, as soon as I stop my start script and want to load the same model again, or even a different model, vLLM always freezes at this point:

[W214 15:20:56.973624780 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.008814328 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.104798409 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.107680744 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.196595399 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.199089483 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.205991785 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W214 15:20:56.207522727 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
(VllmWorkerProcess pid=8968) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8969) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8968) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8969) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8971) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8967) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8971) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8967) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8970) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8972) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8973) INFO 02-14 15:20:56 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8970) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8972) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=8973) INFO 02-14 15:20:56 pynccl.py:69] vLLM is using nccl==2.21.5

Whenever I quit vLLM beforehand, I always get this warning:
[rank0]:[W214 15:18:06.519697808 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

Unfortunately, I cannot find any examples of how to run the start script so that the shutdown is performed correctly. Can you help me with this?
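
For context, what I understand the warning to be asking for, when driving vLLM from Python directly, is roughly the following. This is only an untested sketch on my side (the model name and tensor_parallel_size just mirror my start script below), and I don't know how to do the equivalent for the API server started from a shell script:

# Untested sketch: explicit teardown when using the offline LLM API.
import gc

import torch
import torch.distributed as dist
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Pixtral-Large-Instruct-2411", tensor_parallel_size=8)
try:
    out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)
finally:
    # Drop the engine, free CUDA memory, and destroy the NCCL process group
    # before the interpreter exits, as the PyTorch warning suggests.
    del llm
    gc.collect()
    torch.cuda.empty_cache()
    if dist.is_initialized():
        dist.destroy_process_group()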

My start script (startskript.sh):

#!/bin/bash

token=50000
export HF_TOKEN=hf_myToken
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_LAUNCH_BLOCKING=0
export disable_custom_all_reduce=True

python -m vllm.entrypoints.openai.api_server \
        --model=mistralai/Pixtral-Large-Instruct-2411 \
        --config-format mistral \
        --load-format mistral \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=10' \
        --host 192.uuu.xxx.yyy \
        --port myPORT \
        --trust-remote-code \
        --device cuda \
        --tensor-parallel-size 8 \
        --gpu-memory-utilization 1 \
        --swap-space 10 \
        --max_num_seqs 3 \
        --max_num_batched_tokens $token \
        --max_model_len $token
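
The only idea I have had so far is to not launch the server directly from the shell but through a small wrapper that forwards Ctrl+C and then blocks until the server process has really exited, so that the next launch does not start while NCCL state is still being torn down. Again, this is only an untested sketch (the flag list is shortened to the ones above), and I don't know whether it is enough to avoid the hang:

# Untested sketch of a relaunch wrapper: run the API server as a child
# process, forward Ctrl+C as SIGINT, and block until it has fully exited
# before the wrapper itself returns.
import signal
import subprocess
import sys

cmd = [
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model=mistralai/Pixtral-Large-Instruct-2411",
    "--tensor-parallel-size", "8",
    # ... remaining flags from startskript.sh ...
]

proc = subprocess.Popen(cmd)
try:
    proc.wait()
except KeyboardInterrupt:
    # Ask the server to shut down and wait until it is really gone.
    proc.send_signal(signal.SIGINT)
    proc.wait()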

How would you like to use vllm

I would like to be able to reload a model with vLLM without always having to restart the whole AI machine.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
AlbiRadtke added the usage (How to use vllm) label on Feb 14, 2025
@SwarmKit

[Image attachment]

@AlbiRadtke
Author

Yes, this is the point where it gets stuck when reloading a model. Do you have a solution? :)

@AlbiRadtke
Author

Does anyone have an idea? Or should I report it as a bug?
Thank you all! :)

@W-Wuxian

Same error for me, both when installing via pip and when building from source.

@vgabbo

vgabbo commented Feb 25, 2025

Hi @AlbiRadtke! Did you solve this problem?
I have the same issue and I don't understand why other people don't seem to run into it. Maybe there is some config we are not aware of?
If you have any solutions or workarounds, that would be really helpful.

@AlbiRadtke
Author

Unfortunately, despite a lot of research, I have not found a solution, and since I am obviously not the only one with this problem, I have now reported it as bug #13836.
I will therefore close the issue here and only pursue it further in the bug report.

Best regards :)
