Initializer

Initializer for a KServe cluster, built from shell scripts and Kubernetes YAML files.

Project Structure

  • YAML: Contains the YAML manifests for deploying KServe, Triton Inference Server, and other Kubernetes resources.
  • Shell: Contains the scripts for installation and testing (the end-to-end flow is sketched after this list).
    • main.sh: main entry point for the whole process.
      • ./main.sh run: convert checkpoints, build engines, and deploy Triton Inference Server.
      • ./main.sh test: test the availability of KServe and Triton Inference Server.
    • KServe/install.sh: installs KServe into the Kubernetes cluster.
    • KServe/test_simple.sh: simple test of KServe's availability.
    • TIS/install.sh: installs the Triton Inference Server backend into the Kubernetes cluster.
    • TIS/run.sh: runs automatically at container startup.
    • TIS/test_serve.sh: simple test of the inference service's availability.
    • TRTLLM/upload_hf_model.sh: uploads Hugging Face weights to the PVC.
    • TRTLLM/convert_weight.sh: converts Hugging Face weights to formatted TensorRT-LLM weights.
    • TRTLLM/build_engine.sh: builds optimized TensorRT-LLM engines.
    • TRTLLM/test_inference.sh: tests the TensorRT-LLM engines' availability.
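
A rough sketch of the flow that ./main.sh run drives, assuming the scripts are invoked from the repository root with their default paths (the authoritative arguments live inside the individual scripts):

# Upload the Hugging Face weights to the PVC
./TRTLLM/upload_hf_model.sh
# Convert the weights to the TensorRT-LLM checkpoint format
./TRTLLM/convert_weight.sh
# Build the optimized TensorRT-LLM engines
./TRTLLM/build_engine.sh
# Deploy the Triton Inference Server backend
./TIS/install.sh
# Verify that KServe and the inference service respond
./main.sh test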

Environment

  • Ubuntu: 22.04
  • Kubernetes cluster: v1.26.9
  • containerd: v1.7.2
  • runc: 1.1.12
  • CNI: v1.5.1
  • Istio: 1.21.3
  • Knative: v1.12.4
  • KServe: v0.13.0
  • TensorRT-LLM: release v0.10.0
  • Triton Inference Server TensorRT-LLM Backend: release v0.10.0
  • Container Image: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
  • Model: Llama-3-8B-Instruct/Llama-3-70B-Instruct
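
A quick sanity check that the cluster matches this environment (the namespace names assume the default Istio, Knative, and KServe installs):

kubectl version
kubectl get pods -n istio-system
kubectl get pods -n knative-serving
kubectl get pods -n kserve
# Confirm the GPUs are visible on the node that will run the engines
nvidia-smi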

Basic Steps

  1. Install KServe; see the KServe sub-directory.
  2. Save the model weights to a PVC; see the KServe official website (a minimal sketch follows this list).
  3. (Optional) Build the REServe image with TensorRT-LLM/Backend release v0.10.0; see Build REServe Image.
  4. Use the REServe image with TensorRT-LLM/Backend release v0.10.0; see Use REServe Image.
  5. Convert the Llama-3 Hugging Face weights to TensorRT-LLM weights and build the TensorRT engines; see Convert and Build TensorRT-LLM Engines.
  6. Deploy Triton Inference Server with the TensorRT-LLM engines; see Deploy Triton Inference Server.
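
A minimal sketch of step 2, assuming a PVC named models-pvc and the helper script from this repository; the PVC name, size, and storage class defaults are illustrative, and the KServe official website describes the full procedure:

# Create a PVC to hold the model weights (name and size are illustrative)
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi
EOF
# Copy the Hugging Face weights onto the PVC with the helper script in this repository
./TRTLLM/upload_hf_model.sh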

Build REServe Image

Build REServe Image with TensorRT-LLM/Backend release v0.10.0:

  1. Clone the repository:
git clone https://github.com/REServeLLM/tensorrtllm_backend.git
# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
  2. Build the TensorRT-LLM Backend image (contains the TensorRT-LLM and Backend components):
# Use the Dockerfile to build the backend in a container
# If the build machine needs a network proxy
DOCKER_BUILDKIT=1 docker build -t reserve-llm:latest \
                               --progress auto \
                               --network host \
                               -f dockerfile/Dockerfile.trt_llm_backend_network_proxy .
# If no network proxy is needed
DOCKER_BUILDKIT=1 docker build -t reserve-llm:latest \
                               --progress auto \
                               -f dockerfile/Dockerfile.trt_llm_backend .
  3. Run the REServe image:
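# Note: --shm-size=16g and --ulimit memlock=-1 give the container enough shared and pinned memory
# for TensorRT-LLM/NCCL; --gpus=all exposes every GPU on the host to the container.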
docker run -it -d --network=host --runtime=nvidia \
                  --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN \
                  --security-opt seccomp=unconfined \
                  --shm-size=16g --privileged --ulimit memlock=-1 \
                  --gpus=all --name=reserve \
                  reserve-llm:latest
                  
docker exec -it reserve /bin/bash
  4. Copy the latest REServe source code into the running REServe container:
docker cp REServe reserve:/code
  5. Commit and push the REServe image to the registry:
docker commit reserve harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240709
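# Push the committed image so the Kubernetes cluster can pull it (tag taken from the commit above)
docker push harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240709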

Use REServe Image

We provide a pre-built REServe image; just pull it from the registry:

docker pull harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240709

# Update the REServe Source Code
cd /code/REServe
cd Initializer
git pull
cd ../tensorrtllm_backend
git submodule update --init --recursive
git lfs install

Or you can use your own REServe image from the previous step.

Convert and Build TensorRT-LLM Engines

Run the following inside the REServe container:

cd /code/REServe/TRTLLM
./convert_weight.sh
./build_engine.sh
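
For reference, these scripts roughly wrap the standard TensorRT-LLM v0.10.0 workflow. The sketch below is illustrative only (model paths, dtype, and tensor-parallel size are assumptions); the authoritative arguments are in convert_weight.sh and build_engine.sh:

# Convert the Hugging Face checkpoint (paths and flags are illustrative)
python3 examples/llama/convert_checkpoint.py \
    --model_dir /models/Meta-Llama-3-8B-Instruct \
    --output_dir /models/llama3-8b-ckpt \
    --dtype float16 \
    --tp_size 1
# Build the optimized engine from the converted checkpoint
trtllm-build \
    --checkpoint_dir /models/llama3-8b-ckpt \
    --output_dir /models/llama3-8b-engine \
    --gemm_plugin float16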

Deploy Triton Inference Server