
Why is the throughput of A40 15 token/s, slower than vllm? #3508

Open
EvanSong77 opened this issue Feb 12, 2025 · 1 comment
docker logs:

WARNING 02-12 03:17:26 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
WARNING 02-12 03:17:27 _custom_ops.py:20] Failed to import from vllm._C with ImportError('/usr/local/lib/python3.10/dist-packages/vllm/_C.abi3.so: undefined symbol: cuTensorMapEncodeTiled')
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
[2025-02-12 03:17:33] server_args=ServerArgs(model_path='/nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4', tokenizer_path='/nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='0.0.0.0', port=30000, mem_fraction_static=0.95, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=4, stream_interval=1, random_seed=984868094, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, 
enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:33 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:44 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP2] Init torch distributed begin.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP1] Init torch distributed begin.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP3] Init torch distributed begin.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP0] Init torch distributed begin.
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-12 03:17:48 TP2] Load weight begin. avail mem=43.57 GB
[2025-02-12 03:17:48 TP1] Load weight begin. avail mem=43.57 GB
[2025-02-12 03:17:48 TP3] Load weight begin. avail mem=43.57 GB
[2025-02-12 03:17:48 TP0] Load weight begin. avail mem=43.57 GB
INFO 02-12 03:17:48 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-12 03:17:49 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-12 03:17:49 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-12 03:17:49 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod

Loading safetensors checkpoint shards:   0% Completed | 0/11 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:   9% Completed | 1/11 [00:00<00:04,  2.35it/s]

Loading safetensors checkpoint shards:  18% Completed | 2/11 [00:00<00:04,  2.09it/s]

Loading safetensors checkpoint shards:  27% Completed | 3/11 [00:01<00:04,  1.99it/s]

Loading safetensors checkpoint shards:  36% Completed | 4/11 [00:01<00:03,  1.95it/s]

Loading safetensors checkpoint shards:  45% Completed | 5/11 [00:02<00:03,  1.94it/s]

Loading safetensors checkpoint shards:  55% Completed | 6/11 [00:03<00:02,  1.92it/s]

Loading safetensors checkpoint shards:  64% Completed | 7/11 [00:03<00:02,  1.89it/s]

Loading safetensors checkpoint shards:  73% Completed | 8/11 [00:04<00:01,  1.90it/s]

Loading safetensors checkpoint shards:  82% Completed | 9/11 [00:04<00:01,  1.84it/s]

Loading safetensors checkpoint shards:  91% Completed | 10/11 [00:05<00:00,  1.85it/s]

Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:05<00:00,  2.14it/s]

Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:05<00:00,  1.99it/s]
[2025-02-12 03:17:55 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:55 TP1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:55 TP2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:55 TP3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:56 TP1] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP2] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP3] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP0] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP3] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP2] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP1] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:56 TP2] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:56 TP3] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:56 TP0] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:57] INFO:     Started server process [1]
[2025-02-12 03:17:57] INFO:     Waiting for application startup.
[2025-02-12 03:17:57] INFO:     Application startup complete.
[2025-02-12 03:17:57] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-02-12 03:17:58] INFO:     127.0.0.1:46600 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-12 03:17:58 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
[2025-02-12 03:18:11] INFO:     127.0.0.1:46602 - "POST /generate HTTP/1.1" 200 OK
[2025-02-12 03:18:11] The server is fired up and ready to roll!
[2025-02-12 03:18:22] INFO:     127.0.0.1:47060 - "GET /health HTTP/1.1" 200 OK
[2025-02-12 03:18:28 TP0] Prefill batch. #new-seq: 1, #new-token: 38, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-12 03:18:30 TP0] Decode batch. #running-req: 1, #token: 71, token usage: 0.00, gen throughput (token/s): 1.18, #queue-req: 0
[2025-02-12 03:18:33 TP0] Decode batch. #running-req: 1, #token: 111, token usage: 0.00, gen throughput (token/s): 13.41, #queue-req: 0
[2025-02-12 03:18:37 TP0] Decode batch. #running-req: 1, #token: 151, token usage: 0.00, gen throughput (token/s): 12.48, #queue-req: 0
[2025-02-12 03:18:40 TP0] Decode batch. #running-req: 1, #token: 191, token usage: 0.00, gen throughput (token/s): 13.10, #queue-req: 0
[2025-02-12 03:18:43 TP0] Decode batch. #running-req: 1, #token: 231, token usage: 0.00, gen throughput (token/s): 13.08, #queue-req: 0
[2025-02-12 03:18:46 TP0] Decode batch. #running-req: 1, #token: 271, token usage: 0.00, gen throughput (token/s): 13.68, #queue-req: 0
[2025-02-12 03:18:49 TP0] Decode batch. #running-req: 1, #token: 311, token usage: 0.00, gen throughput (token/s): 12.30, #queue-req: 0
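
The reported steady-state decode throughput can be averaged directly from the log lines above. A minimal sketch (the regex targets the "gen throughput (token/s)" field in SGLang's decode-batch log lines; `avg_throughput` is a hypothetical helper, not part of SGLang):

```python
import re

# Matches the "gen throughput (token/s): X" field in SGLang decode-batch log lines.
THROUGHPUT_RE = re.compile(r"gen throughput \(token/s\): ([0-9.]+)")

def avg_throughput(log_lines, skip_first=True):
    """Average the reported decode throughput; the first decode batch
    includes warmup overhead (1.18 token/s above), so skip it by default."""
    values = [float(m.group(1)) for line in log_lines
              if (m := THROUGHPUT_RE.search(line))]
    if skip_first:
        values = values[1:]
    return sum(values) / len(values) if values else 0.0

log = [
    "Decode batch. #running-req: 1, gen throughput (token/s): 1.18, #queue-req: 0",
    "Decode batch. #running-req: 1, gen throughput (token/s): 13.41, #queue-req: 0",
    "Decode batch. #running-req: 1, gen throughput (token/s): 12.48, #queue-req: 0",
    "Decode batch. #running-req: 1, gen throughput (token/s): 13.10, #queue-req: 0",
]
print(round(avg_throughput(log), 2))  # averages the 13.41 / 12.48 / 13.10 samples
```

Averaged over the full run above, the single-request decode rate hovers around 13 token/s, consistent with the "15 token/s" figure in the title.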

Startup command (docker-compose):

services:
  sglang:
    image: llms/sglang:0.4.2
    container_name: sglang
    volumes:
      - /etc/hosts:/etc/hosts
      - nfsshare:/nfsshare:ro
    restart: always
    ports:
      - 11018:30000
    entrypoint: python3 -m sglang.launch_server
    command:
      --model-path /nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4
      --tp-size 4
      --mem-fraction-static 0.95
      --disable-cuda-graph
      --host 0.0.0.0
      --port 30000
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:30000/health || exit 1"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1', '2', '7']
              capabilities: [gpu]

volumes:
  nfsshare:
    external: true
    name: nfsshare
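
To reproduce the single-request measurement against the mapped host port, a request to SGLang's native `/generate` endpoint can be built with only the standard library. A minimal sketch, assuming the compose file above maps container port 30000 to host port 11018; `build_generate_request` is a hypothetical helper:

```python
import json
from urllib import request

def build_generate_request(prompt, max_new_tokens=128, temperature=0.0,
                           url="http://localhost:11018/generate"):
    """Build a POST request for SGLang's native /generate endpoint.

    Port 11018 is the host port mapped to container port 30000 in the
    compose file above; adjust as needed.
    """
    payload = {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("Hello, my name is")
# Sending it requires the running server:
#   with request.urlopen(req) as resp:
#       print(json.load(resp))
```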
@jhinpan self-assigned this Feb 12, 2025

jhinpan (Collaborator) commented Feb 12, 2025:
According to the docker logs, you need to update the GPU driver to match the CUDA toolkit: PyTorch reports "The NVIDIA driver on your system is too old (found version 11070)". You should also align your PyTorch and CUDA versions and verify that the environment is set up correctly. In addition, try removing --disable-cuda-graph if possible and enabling advanced optimizations.

Another possible factor: the log shows only about 1.64 GB of free memory after loading, which is tight for a 72B-parameter model. Memory fragmentation may be adding overhead to every decode step.
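
On the "found version 11070" warning: CUDA reports the driver API version as an integer encoded as 1000 * major + 10 * minor (the `cudaDriverGetVersion` convention), so 11070 means the installed driver supports at most CUDA 11.7. A small sketch to decode it (`decode_cuda_driver_version` is a hypothetical helper, not part of any library):

```python
def decode_cuda_driver_version(raw):
    """Decode the integer CUDA driver API version (1000*major + 10*minor),
    the encoding used by cudaDriverGetVersion and echoed in PyTorch's warning."""
    major = raw // 1000
    minor = (raw % 1000) // 10
    return major, minor

print(decode_cuda_driver_version(11070))  # (11, 7): the driver caps out at CUDA 11.7
```

If the container's PyTorch/vLLM build targets a newer CUDA (the `undefined symbol: cuTensorMapEncodeTiled` error suggests CUDA 12 kernels), the host driver needs upgrading to match.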
