Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ORCA Format KV Cache Utilization in Inference Response Header #7839

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

BenjaminBraunDev
Copy link

@BenjaminBraunDev BenjaminBraunDev commented Nov 27, 2024

What does the PR do?

This PR adds code to HTTPAPIServer::GenerateRequestClass::StartResponse inside src/http_server.cc to add both kv_cache_utilization and max_token_capacity metrics composed from the existing prometheus metrics in TensorRT-LLM Backend's nv_trt_llm_kv_cache_block_metrics metric family.

This is acomplished by parsing the serialized prometheus metrics text object provided to the Triton Sever frontend by the Triton Core libraries into a structured vector of metrics for a specific metric family.

Checklist

  • I have read the Contribution guidelines and signed the Contributor License
    Agreement
  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • I ran pre-commit locally (pre-commit install, pre-commit run --all)
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Where should the reviewer start?

Changes are contained to 2 files:

  • src/http_server.cc
  • src/http_server.h (the former's header file)

The changes start in HTTPAPIServer::GenerateRequestClass::StartResponse() where the environment variable is checked and the header is written. There are 3 other funcitons below: MetricFamilyExtractor() which parses serialized prometheus metrics into a vector of PromMetric (which have a map of their metric labels), and ExtractKVMetrics() that pulls the values from the structured metrics and calculates the composite kv metrics, and finally OrcaKVMetricHeader() which forms the metrics into an endpoint-load-metrics header in the ORCA format specified by TRITON_ORCA_METRIC_FORMAT. If there are no TensorRT-LLM Backend metrics, no metrics found for the header, or an invalid format type the header is simply not written.

The valid values for TRITON_ORCA_METRIC_FORMAT are documented in the feature request (related issue linked below) and comments in StartResponse()

Test plan:

The feature is gated behind a feature flag in the form of the TRITON_ORCA_METRIC_FORMAT environment variable. If unset, the feature is effectively disabled. Beyond that, the changes have been manually tested to not cause issue if either the queried metrics are not present (such as if TensorRT-LLM is not being used as the backend), or if the ORCA header metric type is invalid. In either case, nothing is parsed and no header is written. All code changes are wrapped in an #ifdef and are only included if metrics are enabled during the Triton Server build.

Caveats:

  1. This feature only works on Triton Inference Server running with TensorRT-LLM Backend, as otherwise the KV-cache metrics are not included in the server metrics.

  2. This change only implements the kv-cache utilization metics, but the functions it adds allows other metrics to be added easily (including metrics that don't require TensorRT-LLM Backend).

Background

This doc captures the overall requirements for model servers to integrate with llm instance gateway. More details in the Feature Request below.

Related Issues:

Screenshots

Response header before changes (or if TRITON_ORCA_METRIC_FORMAT environment variable is unset):
orca_before

Response header with TRITON_ORCA_METRIC_FORMAT="json":
orca_json

Response header with TRITON_ORCA_METRIC_FORMAT="http":
orca_http

cc @yinggeh @krishung5 @jbkyang-nvi

@BenjaminBraunDev BenjaminBraunDev changed the title ORCA Format KV Cache Utilization in Inference Response Header feat: ORCA Format KV Cache Utilization in Inference Response Header Dec 10, 2024
@BenjaminBraunDev BenjaminBraunDev marked this pull request as ready for review December 11, 2024 02:21
src/http_server.h Outdated Show resolved Hide resolved
src/http_server.h Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
@nnshah1 nnshah1 requested a review from indrajit96 December 13, 2024 03:13
@jbkyang-nvi
Copy link
Contributor

@nnshah1 @indrajit96 are we merging into the already frozen r24.10? Or should this be onto main instead?

@nnshah1
Copy link
Contributor

nnshah1 commented Dec 13, 2024

@nnshah1 @indrajit96 are we merging into the already frozen r24.10? Or should this be onto main instead?

good catch we should target main

@BenjaminBraunDev
Copy link
Author

@jbkyang-nvi @nnshah1 Is Ubuntu 24.04 required to build the main branch, with container version 24.11? I ported my changes over but am having trouble building the server image. I see in the nvidia container support matrix Ubuntu 24.04 is required for CUDA Deep-Learning base container for 24.11, but I also see in the Release 2.52.0 Known Issues that TRT-LLM backend is built from TensorRT-LLM version 0.15.0 and built out of nvcr.io/nvidia/tritonserver:24.10-py3-min.

To build, I'm using a base image of nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 as a base image (ran into errors using 24.11 base image) but I'm getting an error while building target grpc-repo without much of an error log. Is this the right base image or should I update my Ubuntu and and use 24.11?

For context the full build command I'm using is:

TRTLLM_BASE_IMAGE=nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
              --enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
              --filesystem=gcs \
              --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
              --backend=ensemble --enable-gpu --no-container-pull \
              --repoagent=checksum --cache=local --cache=redis \
              --image=base,${TRTLLM_BASE_IMAGE}

@BenjaminBraunDev
Copy link
Author

BenjaminBraunDev commented Dec 20, 2024

Follow up, I got the image built, but when trying to run it I get a version error:

/opt/tritonserver/bin/tritonserver: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.36' not found (required by /opt/tritonserver/bin/tritonserver)
/opt/tritonserver/bin/tritonserver: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /opt/tritonserver/bin/tritonserver)
/opt/tritonserver/bin/tritonserver: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /opt/tritonserver/bin/tritonserver)
/opt/tritonserver/bin/tritonserver: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /opt/tritonserver/bin/../lib/libtritonserver.so)
/opt/tritonserver/bin/tritonserver: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /opt/tritonserver/bin/../lib/libtritonserver.so)

From what I looked up I don't think I can get GLIBC_2.36 (or higher) on Ubuntu 22.04?

@nnshah1
Copy link
Contributor

nnshah1 commented Dec 21, 2024

@nv-kmcgill53 , @nvda-mesharma

After the holiday can we help with updated instructions on building? I think we are moving everything to 24.04 - but may be mistaken.

@indrajit96
Copy link
Contributor

indrajit96 commented Jan 6, 2025

@BenjaminBraunDev @liu-cong Could you add some tests to help with the review?
A few examples of tests are here
https://github.com/triton-inference-server/server/tree/main/qa/L0_http
You can also have a simple python script that does the test with triton running in the background.
Also can you post the latest build issue you are facing after rebasing it to main?

… for use in HandleGenerate to add kv_utilization and max_token_capacity to the inference request response header.
…nctionality to HTTPAPIServer::GenerateRequestClass::StartResponse() to extract metrics after inference request is processed for up-to-date metrics.
@BenjaminBraunDev BenjaminBraunDev changed the base branch from r24.10 to main January 13, 2025 18:46
@BenjaminBraunDev BenjaminBraunDev changed the base branch from main to r24.10 January 13, 2025 20:20
@BenjaminBraunDev BenjaminBraunDev changed the base branch from r24.10 to main January 13, 2025 20:56
@BenjaminBraunDev
Copy link
Author

BenjaminBraunDev commented Jan 13, 2025

NOTE: The force push here is resetting my forked r24.10 branch to the state of my forked main branch. The head branch name r24.10 is a misnomer, in reality it's the base main branch + my changes.

src/http_server.h Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
@BenjaminBraunDev
Copy link
Author

BenjaminBraunDev commented Jan 21, 2025

I've added a manual test to the qa common folder that can be run so long as triton is running on some port. For example, if you run the tritonserver container, something like this would work:

export ORCA_TEST_PATH=~/triton-inference-server-repo/qa/common/orca_header_test.py
docker run --rm -ti -v `pwd`:/mnt -w /mnt -v ~/.cache/huggingface:/root/.cache/huggingface -v ${ORCA_TEST_PATH}:/orca_header_test.py  --gpus all tritonserver:latest bash

python3 /orca_header_test.py http://localhost:8000/v2/models/ensemble/generate

Let me know if you have any more feedback for the PR. Otherwise all I need is approval to merge to main. @yinggeh @indrajit96 @nnshah1 @jbkyang-nvi

@rmccorm4
Copy link
Contributor

Closes #7865

else:
print(f"No key '{METRIC_KEY}' in header data: {data}")
return None
return data

Check warning

Code scanning / CodeQL

Unreachable code Warning

This statement is unreachable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

8 participants