WARNING 02-12 03:17:26 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
WARNING 02-12 03:17:27 _custom_ops.py:20] Failed to import from vllm._C with ImportError('/usr/local/lib/python3.10/dist-packages/vllm/_C.abi3.so: undefined symbol: cuTensorMapEncodeTiled')
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
[2025-02-12 03:17:33] server_args=ServerArgs(model_path='/nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4', tokenizer_path='/nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='0.0.0.0', port=30000, mem_fraction_static=0.95, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=4, stream_interval=1, random_seed=984868094, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:33 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:44 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP2] Init torch distributed begin.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP1] Init torch distributed begin.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP3] Init torch distributed begin.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP0] Init torch distributed begin.
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-12 03:17:48 TP2] Load weight begin. avail mem=43.57 GB
[2025-02-12 03:17:48 TP1] Load weight begin. avail mem=43.57 GB
[2025-02-12 03:17:48 TP3] Load weight begin. avail mem=43.57 GB
[2025-02-12 03:17:48 TP0] Load weight begin. avail mem=43.57 GB
INFO 02-12 03:17:48 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-12 03:17:49 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-12 03:17:49 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-12 03:17:49 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 9% Completed | 1/11 [00:00<00:04, 2.35it/s]
Loading safetensors checkpoint shards: 18% Completed | 2/11 [00:00<00:04, 2.09it/s]
Loading safetensors checkpoint shards: 27% Completed | 3/11 [00:01<00:04, 1.99it/s]
Loading safetensors checkpoint shards: 36% Completed | 4/11 [00:01<00:03, 1.95it/s]
Loading safetensors checkpoint shards: 45% Completed | 5/11 [00:02<00:03, 1.94it/s]
Loading safetensors checkpoint shards: 55% Completed | 6/11 [00:03<00:02, 1.92it/s]
Loading safetensors checkpoint shards: 64% Completed | 7/11 [00:03<00:02, 1.89it/s]
Loading safetensors checkpoint shards: 73% Completed | 8/11 [00:04<00:01, 1.90it/s]
Loading safetensors checkpoint shards: 82% Completed | 9/11 [00:04<00:01, 1.84it/s]
Loading safetensors checkpoint shards: 91% Completed | 10/11 [00:05<00:00, 1.85it/s]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:05<00:00, 2.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:05<00:00, 1.99it/s]
[2025-02-12 03:17:55 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:55 TP1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:55 TP2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:55 TP3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:56 TP1] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP2] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP3] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP0] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP3] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP2] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP1] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:56 TP2] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:56 TP3] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:56 TP0] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:57] INFO: Started server process [1]
[2025-02-12 03:17:57] INFO: Waiting for application startup.
[2025-02-12 03:17:57] INFO: Application startup complete.
[2025-02-12 03:17:57] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-02-12 03:17:58] INFO: 127.0.0.1:46600 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-12 03:17:58 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_loc(o"p/su/sdre/cloodcea_la/tlitbe/nptyitohno.np3y."1:0310/:d16i)s: terror: -operation scheduled before its operandspackages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
[2025-02-12 03:18:11] INFO: 127.0.0.1:46602 - "POST /generate HTTP/1.1" 200 OK
[2025-02-12 03:18:11] The server is fired up and ready to roll!
[2025-02-12 03:18:22] INFO: 127.0.0.1:47060 - "GET /health HTTP/1.1" 200 OK
[2025-02-12 03:18:28 TP0] Prefill batch. #new-seq: 1, #new-token: 38, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-12 03:18:30 TP0] Decode batch. #running-req: 1, #token: 71, token usage: 0.00, gen throughput (token/s): 1.18, #queue-req: 0
[2025-02-12 03:18:33 TP0] Decode batch. #running-req: 1, #token: 111, token usage: 0.00, gen throughput (token/s): 13.41, #queue-req: 0
[2025-02-12 03:18:37 TP0] Decode batch. #running-req: 1, #token: 151, token usage: 0.00, gen throughput (token/s): 12.48, #queue-req: 0
[2025-02-12 03:18:40 TP0] Decode batch. #running-req: 1, #token: 191, token usage: 0.00, gen throughput (token/s): 13.10, #queue-req: 0
[2025-02-12 03:18:43 TP0] Decode batch. #running-req: 1, #token: 231, token usage: 0.00, gen throughput (token/s): 13.08, #queue-req: 0
[2025-02-12 03:18:46 TP0] Decode batch. #running-req: 1, #token: 271, token usage: 0.00, gen throughput (token/s): 13.68, #queue-req: 0
[2025-02-12 03:18:49 TP0] Decode batch. #running-req: 1, #token: 311, token usage: 0.00, gen throughput (token/s): 12.30, #queue-req: 0
According to the docker logs, the NVIDIA driver is too old for the CUDA toolkit in the container: the PyTorch UserWarning reports driver version 11070 (a CUDA 11.7-era driver). You need to update the GPU driver, or alternatively install a PyTorch build compiled against your driver's CUDA version. Once the PyTorch/CUDA versions are aligned, verify the environment is set up correctly, for example with the checks below. You can also try removing --disable-cuda-graph, if possible, so that CUDA graphs and the related optimizations are enabled (the server_args line shows disable_cuda_graph=True).
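A minimal sanity check, assuming a standard pip environment inside the container; the pip swap for the deprecated pynvml package is the one the repeated warning in the log itself recommends:

```bash
# Compare the CUDA version the driver supports with the one PyTorch was built for.
nvidia-smi
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

# The log's WARNING asks for exactly this replacement:
pip uninstall -y pynvml && pip install nvidia-ml-py
```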
There is another possible cause: with mem_fraction_static=0.95, the log shows only about 1.64 GB of free GPU memory per device after the memory pool is allocated, which is tight for a 72B-parameter model. Memory fragmentation within that small margin might be adding overhead to every decode step.
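If the headroom turns out to be the problem, one option is to relaunch with CUDA graphs enabled and a slightly lower static memory fraction. This is only a sketch built from the flags visible in the pasted server_args; 0.90 is an illustrative starting value, not a tuned one:

```bash
# Sketch: same model/TP/port/backend as in the log, but without --disable-cuda-graph
# and with a lower --mem-fraction-static to leave more free memory after the KV-cache pool.
python3 -m sglang.launch_server \
  --model-path /nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4 \
  --host 0.0.0.0 --port 30000 --tp 4 \
  --attention-backend triton \
  --mem-fraction-static 0.90   # down from 0.95 in the original server_args
```

Note that lowering the fraction shrinks max_total_num_tokens, so this trades KV-cache capacity for runtime headroom.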
docker logs:
Startup Command: