
vLLM Production Stack: reference stack for production vLLM deployment

The vLLM Production Stack project provides a reference implementation of how to build an inference stack on top of vLLM, which allows you to:

  • 🚀 Scale from single vLLM instance to distributed vLLM deployment without changing any application code
  • 💻 Monitor the deployment through a web dashboard
  • 😄 Enjoy the performance benefits brought by request routing and KV cache offloading

Latest News:

  • 🔥 vLLM Production Stack is released! Check out our release blogs [01-22-2025]
  • ✨ Join us in the #production-stack channel of the vLLM Slack or the LMCache Slack, or fill out this interest form for a chat!

Architecture

The stack is set up using Helm and contains the following key parts:

  • Serving engine: The vLLM engines that run different LLMs
  • Request router: Directs requests to appropriate backends based on routing keys or session IDs to maximize KV cache reuse
  • Observability stack: Monitors backend metrics through Prometheus + Grafana
Architecture of the stack

Roadmap

We are actively working on this project and will release the following features soon. Please stay tuned!

  • Autoscaling based on vLLM-specific metrics
  • Support for disaggregated prefill
  • Router improvements (e.g., a more performant router written in a non-Python language, a KV-cache-aware routing algorithm, and better fault tolerance)

Deploying the stack via Helm

Prerequisites

  • A running Kubernetes (K8s) environment with GPUs (see the GPU check after this list)
    • Run cd utils && bash install-minikube-cluster.sh
    • Or follow our tutorial
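
Before deploying, it helps to confirm that the cluster actually exposes schedulable GPUs. A minimal check, assuming the NVIDIA device plugin is installed (it advertises the nvidia.com/gpu resource):

# List allocatable GPUs per node; an empty GPU column means none are schedulable
kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'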

Deployment

The vLLM Production Stack can be deployed via Helm charts. Clone the repository locally and execute the following commands for a minimal deployment:

git clone https://github.com/vllm-project/production-stack.git
cd production-stack/
sudo helm repo add llmstack-repo https://lmcache.github.io/helm/
sudo helm install llmstack llmstack-repo/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml

The deployed stack provides the same OpenAI API interface as vLLM and can be accessed through a Kubernetes service.
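
For a quick smoke test, you can port-forward the router service and send an OpenAI-style request. This is a minimal sketch: the service name and model ID below are illustrative (use kubectl get services to find the router service created by your release, and use a model your deployment actually serves):

# Forward the router service to localhost (replace the service name with yours)
kubectl port-forward svc/llmstack-router-service 30080:80

# In another terminal, query the OpenAI-compatible completions endpoint
curl -X POST http://localhost:30080/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "facebook/opt-125m", "prompt": "Once upon a time,", "max_tokens": 16}'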

To validate the installation and send a query to the stack, refer to this tutorial.

For more information about customizing the Helm chart, please refer to values.yaml and our other tutorials.
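
As a sketch of the customization workflow: write your overrides to a file (my-values.yaml is a hypothetical name; its keys must follow the schema in the chart's values.yaml) and apply it with helm upgrade:

# Apply a customized override file on top of the chart defaults
sudo helm upgrade llmstack llmstack-repo/vllm-stack -f my-values.yaml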

Uninstall

sudo helm uninstall llmstack
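
To confirm the release is gone, list the remaining Helm releases and check that the stack's pods have terminated:

sudo helm list
kubectl get pods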

Grafana Dashboard

Features

The Grafana dashboard provides the following insights (a query example follows the panel list):

  1. Available vLLM Instances: Displays the number of healthy instances.
  2. Request Latency Distribution: Visualizes end-to-end request latency.
  3. Time-to-First-Token (TTFT) Distribution: Monitors response times for token generation.
  4. Number of Running Requests: Tracks the number of active requests per instance.
  5. Number of Pending Requests: Tracks requests waiting to be processed.
  6. GPU KV Usage Percent: Monitors GPU KV cache usage.
  7. GPU KV Cache Hit Rate: Displays the hit rate for the GPU KV cache.
Grafana dashboard to monitor the deployment
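
These panels are populated from Prometheus, so the underlying metrics can also be queried directly. A minimal sketch, assuming Prometheus has been port-forwarded to localhost:9090 and the serving engines expose vLLM's standard metric names (e.g., vllm:num_requests_running):

# Number of running requests reported by each vLLM instance
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=vllm:num_requests_running'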

Configuration

See the details in observability/README.md

Router

Overview

The router ensures efficient request distribution among backends. It supports:

  • Routing to endpoints that run different models
  • Exporting observability metrics for each serving engine instance, including QPS, time-to-first-token (TTFT), number of pending/running/finished requests, and uptime
  • Automatic service discovery and fault tolerance via the Kubernetes API
  • Multiple routing algorithms:
    • Round-robin routing
    • Session-ID based routing (see the example after this list)
    • (WIP) prefix-aware routing
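
To illustrate session-ID based routing: requests carrying the same session identifier are pinned to the same backend, which improves KV cache reuse across turns. A minimal sketch; the header name x-user-id is hypothetical and should match whatever session key the router is configured to read:

# Requests with the same session header land on the same serving engine
curl -X POST http://localhost:30080/v1/completions \
  -H 'Content-Type: application/json' \
  -H 'x-user-id: alice' \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello again,", "max_tokens": 16}'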

Contributing

Contributions are welcome! Please follow the standard GitHub flow:

  1. Fork the repository.
  2. Create a feature branch.
  3. Submit a pull request with detailed descriptions.

License

This project is licensed under the MIT License. See the LICENSE file for details.


For any issues or questions, feel free to open an issue or contact the maintainers.
