Initializer

Initializer for a KServe cluster, built from shell scripts and Kubernetes YAML files.

Project Structure

  • YAML: Contains the YAML manifests for deploying KServe, Triton Inference Server, and other Kubernetes resources.
  • Shell: Contains the scripts for installation and testing (the end-to-end flow is sketched after this list).
    • main.sh: main entry point for the whole process.
      • ./main.sh run: convert checkpoints, build engines, and deploy Triton Inference Server.
      • ./main.sh test: test the availability of KServe and Triton Inference Server.
    • KServe/install.sh: installs KServe into the Kubernetes cluster.
    • KServe/test_simple.sh: simple test of KServe's availability.
    • TIS/install.sh: installs the Triton Inference Server backend into the Kubernetes cluster.
    • TIS/run.sh: runs automatically at container startup.
    • TIS/test_serve.sh: simple test of the inference service's availability.
    • TRTLLM/upload_hf_model.sh: uploads Hugging Face weights to the PVC.
    • TRTLLM/convert_weight.sh: converts Hugging Face weights to formatted TensorRT-LLM weights.
    • TRTLLM/build_engine.sh: builds optimized TensorRT-LLM engines.
    • TRTLLM/test_inference.sh: tests the TensorRT-LLM engines' availability.
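
A rough sketch of the flow that ./main.sh run drives, assuming the scripts are invoked from the repository root with their default paths (the authoritative arguments live inside the individual scripts):

# Upload the Hugging Face weights to the PVC
./TRTLLM/upload_hf_model.sh
# Convert the weights to the TensorRT-LLM checkpoint format
./TRTLLM/convert_weight.sh
# Build the optimized TensorRT-LLM engines
./TRTLLM/build_engine.sh
# Deploy the Triton Inference Server backend
./TIS/install.sh
# Verify that KServe and the inference service respond
./main.sh test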

Environment

  • Ubuntu: 22.04
  • Kubernetes cluster: v1.26.9
  • containerd: v1.7.2
  • runc: 1.1.12
  • CNI: v1.5.1
  • Istio: 1.21.3
  • Knative: v1.12.4
  • KServe: v0.13.0
  • TensorRT-LLM: release v0.10.0
  • Triton Inference Server TensorRT-LLM Backend: release v0.10.0
  • Container Image: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
  • Model: Llama-3-8B-Instruct/Llama-3-70B-Instruct
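
A quick sanity check that the cluster matches this environment (the namespace names assume the default Istio, Knative, and KServe installs):

kubectl version
kubectl get pods -n istio-system
kubectl get pods -n knative-serving
kubectl get pods -n kserve
# Confirm the GPUs are visible on the node that will run the engines
nvidia-smi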

Basic Steps

  1. Install KServe; see the KServe sub-directory.
  2. Save the model weights to a PVC; see the KServe official website (a minimal sketch follows this list).
  3. (Optional) Build the REServe image with TensorRT-LLM/Backend release v0.10.0; see Build REServe Image.
  4. Use the REServe image with TensorRT-LLM/Backend release v0.10.0; see Use REServe Image.
  5. Convert the Llama-3 Hugging Face weights to TensorRT-LLM weights and build the TensorRT engines; see Convert and Build TensorRT-LLM Engines.
  6. Deploy Triton Inference Server with the TensorRT-LLM engines; see Deploy Triton Inference Server.
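
A minimal sketch of step 2, assuming a PVC named models-pvc and the helper script from this repository; the PVC name, size, and storage class defaults are illustrative, and the KServe official website describes the full procedure:

# Create a PVC to hold the model weights (name and size are illustrative)
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi
EOF
# Copy the Hugging Face weights onto the PVC with the helper script in this repository
./TRTLLM/upload_hf_model.sh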

Build REServe Image

Build REServe Image with TensorRT-LLM/Backend release v0.10.0:

  1. Clone the repository:
git clone https://github.com/REServeLLM/tensorrtllm_backend.git
# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
  2. Build the TensorRT-LLM Backend image (contains the TensorRT-LLM and Backend components):
# Use the Dockerfile to build the backend in a container
# If the build machine needs a network proxy
DOCKER_BUILDKIT=1 docker build -t reserve-llm:latest \
                               --progress auto \
                               --network host \
                               -f dockerfile/Dockerfile.trt_llm_backend_network_proxy .
# If no network proxy is needed
DOCKER_BUILDKIT=1 docker build -t reserve-llm:latest \
                               --progress auto \
                               -f dockerfile/Dockerfile.trt_llm_backend .
  3. Run the REServe image:
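# Note: --shm-size=16g and --ulimit memlock=-1 give the container enough shared and pinned memory
# for TensorRT-LLM/NCCL; --gpus=all exposes every GPU on the host to the container.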
docker run -it -d --network=host --runtime=nvidia \
                  --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN \
                  --security-opt seccomp=unconfined \
                  --shm-size=16g --privileged --ulimit memlock=-1 \
                  --gpus=all --name=reserve \
                  reserve-llm:latest
                  
docker exec -it reserve /bin/bash
  4. Copy the latest REServe source code into the running REServe container:
docker cp REServe reserve:/code
  5. Commit and push the REServe image to the registry:
docker commit reserve harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240709
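# Push the committed image so the Kubernetes cluster can pull it (tag taken from the commit above)
docker push harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240709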

Use REServe Image

We provide a pre-built REServe image; just pull it from the registry:

docker pull harbor.act.buaa.edu.cn/nvidia/reserve-llm:v20240709

# Update the REServe Source Code
cd /code/REServe
cd Initializer
git pull
cd ../tensorrtllm_backend
git submodule update --init --recursive
git lfs install

Or you can use your own REServe image from the previous step.

Convert and Build TensorRT-LLM Engines

Run the following inside the REServe container:

cd /code/REServe/TRTLLM
./convert_weight.sh
./build_engine.sh
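
For reference, these scripts roughly wrap the standard TensorRT-LLM v0.10.0 workflow. The sketch below is illustrative only (model paths, dtype, and tensor-parallel size are assumptions); the authoritative arguments are in convert_weight.sh and build_engine.sh:

# Convert the Hugging Face checkpoint (paths and flags are illustrative)
python3 examples/llama/convert_checkpoint.py \
    --model_dir /models/Meta-Llama-3-8B-Instruct \
    --output_dir /models/llama3-8b-ckpt \
    --dtype float16 \
    --tp_size 1
# Build the optimized engine from the converted checkpoint
trtllm-build \
    --checkpoint_dir /models/llama3-8b-ckpt \
    --output_dir /models/llama3-8b-engine \
    --gemm_plugin float16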

Deploy Triton Inference Server