
Why is the throughput of A40 15 token/s, slower than vllm? #3508

Open
EvanSong77 opened this issue Feb 12, 2025 · 1 comment
docker logs:

WARNING 02-12 03:17:26 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
WARNING 02-12 03:17:27 _custom_ops.py:20] Failed to import from vllm._C with ImportError('/usr/local/lib/python3.10/dist-packages/vllm/_C.abi3.so: undefined symbol: cuTensorMapEncodeTiled')
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
[2025-02-12 03:17:33] server_args=ServerArgs(model_path='/nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4', tokenizer_path='/nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='0.0.0.0', port=30000, mem_fraction_static=0.95, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=4, stream_interval=1, random_seed=984868094, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, 
enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:33 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 02-12 03:17:38 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:44 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP2] Init torch distributed begin.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP1] Init torch distributed begin.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP3] Init torch distributed begin.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
INFO 02-12 03:17:45 gptq_marlin.py:108] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
[2025-02-12 03:17:45 TP0] Init torch distributed begin.
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
INFO 02-12 03:17:46 utils.py:961] Found nccl from library libnccl.so.2
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-12 03:17:48 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-12 03:17:48 TP2] Load weight begin. avail mem=43.57 GB
[2025-02-12 03:17:48 TP1] Load weight begin. avail mem=43.57 GB
[2025-02-12 03:17:48 TP3] Load weight begin. avail mem=43.57 GB
[2025-02-12 03:17:48 TP0] Load weight begin. avail mem=43.57 GB
INFO 02-12 03:17:48 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-12 03:17:49 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-12 03:17:49 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 02-12 03:17:49 gptq_marlin.py:196] Using MarlinLinearKernel for GPTQMarlinLinearMethod

Loading safetensors checkpoint shards:   0% Completed | 0/11 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:   9% Completed | 1/11 [00:00<00:04,  2.35it/s]

Loading safetensors checkpoint shards:  18% Completed | 2/11 [00:00<00:04,  2.09it/s]

Loading safetensors checkpoint shards:  27% Completed | 3/11 [00:01<00:04,  1.99it/s]

Loading safetensors checkpoint shards:  36% Completed | 4/11 [00:01<00:03,  1.95it/s]

Loading safetensors checkpoint shards:  45% Completed | 5/11 [00:02<00:03,  1.94it/s]

Loading safetensors checkpoint shards:  55% Completed | 6/11 [00:03<00:02,  1.92it/s]

Loading safetensors checkpoint shards:  64% Completed | 7/11 [00:03<00:02,  1.89it/s]

Loading safetensors checkpoint shards:  73% Completed | 8/11 [00:04<00:01,  1.90it/s]

Loading safetensors checkpoint shards:  82% Completed | 9/11 [00:04<00:01,  1.84it/s]

Loading safetensors checkpoint shards:  91% Completed | 10/11 [00:05<00:00,  1.85it/s]

Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:05<00:00,  2.14it/s]

Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:05<00:00,  1.99it/s]
[2025-02-12 03:17:55 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:55 TP1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:55 TP2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:55 TP3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=33.71 GB
[2025-02-12 03:17:56 TP1] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP2] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP3] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP0] Memory pool end. avail mem=1.64 GB
[2025-02-12 03:17:56 TP3] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP2] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2025-02-12 03:17:56 TP1] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:56 TP2] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:56 TP3] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:56 TP0] max_total_num_tokens=413096, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2025-02-12 03:17:57] INFO:     Started server process [1]
[2025-02-12 03:17:57] INFO:     Waiting for application startup.
[2025-02-12 03:17:57] INFO:     Application startup complete.
[2025-02-12 03:17:57] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-02-12 03:17:58] INFO:     127.0.0.1:46600 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-12 03:17:58 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
[2025-02-12 03:18:11] INFO:     127.0.0.1:46602 - "POST /generate HTTP/1.1" 200 OK
[2025-02-12 03:18:11] The server is fired up and ready to roll!
[2025-02-12 03:18:22] INFO:     127.0.0.1:47060 - "GET /health HTTP/1.1" 200 OK
[2025-02-12 03:18:28 TP0] Prefill batch. #new-seq: 1, #new-token: 38, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-12 03:18:30 TP0] Decode batch. #running-req: 1, #token: 71, token usage: 0.00, gen throughput (token/s): 1.18, #queue-req: 0
[2025-02-12 03:18:33 TP0] Decode batch. #running-req: 1, #token: 111, token usage: 0.00, gen throughput (token/s): 13.41, #queue-req: 0
[2025-02-12 03:18:37 TP0] Decode batch. #running-req: 1, #token: 151, token usage: 0.00, gen throughput (token/s): 12.48, #queue-req: 0
[2025-02-12 03:18:40 TP0] Decode batch. #running-req: 1, #token: 191, token usage: 0.00, gen throughput (token/s): 13.10, #queue-req: 0
[2025-02-12 03:18:43 TP0] Decode batch. #running-req: 1, #token: 231, token usage: 0.00, gen throughput (token/s): 13.08, #queue-req: 0
[2025-02-12 03:18:46 TP0] Decode batch. #running-req: 1, #token: 271, token usage: 0.00, gen throughput (token/s): 13.68, #queue-req: 0
[2025-02-12 03:18:49 TP0] Decode batch. #running-req: 1, #token: 311, token usage: 0.00, gen throughput (token/s): 12.30, #queue-req: 0
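
The reported steady-state decode throughput can be averaged directly from the log lines above. A minimal sketch (the regex targets the "gen throughput (token/s)" field in SGLang's decode-batch log lines; `avg_throughput` is a hypothetical helper, not part of SGLang):

```python
import re

# Matches the "gen throughput (token/s): X" field in SGLang decode-batch log lines.
THROUGHPUT_RE = re.compile(r"gen throughput \(token/s\): ([0-9.]+)")

def avg_throughput(log_lines, skip_first=True):
    """Average the reported decode throughput; the first decode batch
    includes warmup overhead (1.18 token/s above), so skip it by default."""
    values = [float(m.group(1)) for line in log_lines
              if (m := THROUGHPUT_RE.search(line))]
    if skip_first:
        values = values[1:]
    return sum(values) / len(values) if values else 0.0

log = [
    "Decode batch. #running-req: 1, gen throughput (token/s): 1.18, #queue-req: 0",
    "Decode batch. #running-req: 1, gen throughput (token/s): 13.41, #queue-req: 0",
    "Decode batch. #running-req: 1, gen throughput (token/s): 12.48, #queue-req: 0",
    "Decode batch. #running-req: 1, gen throughput (token/s): 13.10, #queue-req: 0",
]
print(round(avg_throughput(log), 2))  # averages the 13.41 / 12.48 / 13.10 samples
```

Averaged over the full run above, the single-request decode rate hovers around 13 token/s, consistent with the "15 token/s" figure in the title.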

Startup command (docker-compose):

services:
  sglang:
    image: llms/sglang:0.4.2
    container_name: sglang
    volumes:
      - /etc/hosts:/etc/hosts
      - nfsshare:/nfsshare:ro
    restart: always
    ports:
      - 11018:30000
    entrypoint: python3 -m sglang.launch_server
    command:
      --model-path /nfsshare/model-checkpoint/Qwen2.5-72B-Instruct-GPTQ-Int4
      --tp-size 4
      --mem-fraction-static 0.95
      --disable-cuda-graph
      --host 0.0.0.0
      --port 30000
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:30000/health || exit 1"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1', '2', '7']
              capabilities: [gpu]

volumes:
  nfsshare:
    external: true
    name: nfsshare
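
To reproduce the single-request measurement against the mapped host port, a request to SGLang's native `/generate` endpoint can be built with only the standard library. A minimal sketch, assuming the compose file above maps container port 30000 to host port 11018; `build_generate_request` is a hypothetical helper:

```python
import json
from urllib import request

def build_generate_request(prompt, max_new_tokens=128, temperature=0.0,
                           url="http://localhost:11018/generate"):
    """Build a POST request for SGLang's native /generate endpoint.

    Port 11018 is the host port mapped to container port 30000 in the
    compose file above; adjust as needed.
    """
    payload = {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("Hello, my name is")
# Sending it requires the running server:
#   with request.urlopen(req) as resp:
#       print(json.load(resp))
```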
@jhinpan self-assigned this Feb 12, 2025

jhinpan (Collaborator) commented Feb 12, 2025:
According to the docker logs, you need to update the GPU driver to match the CUDA toolkit: PyTorch reports "The NVIDIA driver on your system is too old (found version 11070)". You should also align your PyTorch and CUDA versions and verify that the environment is set up correctly. In addition, try removing --disable-cuda-graph if possible and enabling advanced optimizations.

Another possible factor: the log shows only about 1.64 GB of free memory after loading, which is tight for a 72B-parameter model. Memory fragmentation may be adding overhead to every decode step.
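
On the "found version 11070" warning: CUDA reports the driver API version as an integer encoded as 1000 * major + 10 * minor (the `cudaDriverGetVersion` convention), so 11070 means the installed driver supports at most CUDA 11.7. A small sketch to decode it (`decode_cuda_driver_version` is a hypothetical helper, not part of any library):

```python
def decode_cuda_driver_version(raw):
    """Decode the integer CUDA driver API version (1000*major + 10*minor),
    the encoding used by cudaDriverGetVersion and echoed in PyTorch's warning."""
    major = raw // 1000
    minor = (raw % 1000) // 10
    return major, minor

print(decode_cuda_driver_version(11070))  # (11, 7): the driver caps out at CUDA 11.7
```

If the container's PyTorch/vLLM build targets a newer CUDA (the `undefined symbol: cuTensorMapEncodeTiled` error suggests CUDA 12 kernels), the host driver needs upgrading to match.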
