✨ Helm Chart for OpenVINO vLLM #403
@@ -9,37 +9,91 @@ Helm chart for deploying ChatQnA service. ChatQnA depends on the following servi
- [redis-vector-db](../common/redis-vector-db/README.md)
- [reranking-usvc](../common/reranking-usvc/README.md)
- [teirerank](../common/teirerank/README.md)
- [llm-uservice](../common/llm-uservice/README.md)
- [tgi](../common/tgi/README.md)

For LLM inference, two more microservices are required. Either [TGI](https://github.com/huggingface/text-generation-inference) or [vLLM](https://github.com/vllm-project/vllm) can serve as the LLM backend. Depending on that choice, the ChatQnA application has the following additional microservices as dependencies.

1. To use **TGI** as the inference service, the following two microservices are required:

   - [llm-uservice](../common/llm-uservice/README.md)
   - [tgi](../common/tgi/README.md)

2. To use **vLLM** as the inference service, the following two microservices are required:

   - [llm-ctrl-uservice](../common/llm-ctrl-uservice/README.md)
   - [vllm](../common/vllm/README.md)

Review comment on lines +13 to +22: Ditto, why add wrappers?

Reply: This PR is from 1.0 release time, so with some old code.

Reply: Sounds good, but note that I'm testing my PR only with vLLM Gaudi version. I.e. currently both CPU and GPU/OpenVINO support need to be added / tested after it. That PR has also quite a few comment TODOs about vLLM options where some feedback would be needed / appreciated.

> **_NOTE :_** We shouldn't have both inference engines deployed. Only one of them should be set up. To achieve this, conditional flags are added to the chart dependencies. Switch off the flag corresponding to one service and switch on the other, so that all ChatQnA dependencies are set up properly.

Review comment: Why could there not be multiple inferencing engines? ChatQnA has 4 inferencing subservices for which it is already using 2 inferencing engines, TEI and TGI. And I do not see why it could not use e.g. TEI for embed + rerank, TGI for guardrails, and vLLM for LLM. Please rephrase.

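As a rough sketch of the switch the NOTE describes, the helper below maps an engine name to the pair of `--set` flags used by the install commands in this README. `engine_flags` is a hypothetical helper name and not part of the chart; only `tgi.enabled` and `vllm.enabled` come from the PR.

```bash
# Sketch only: engine_flags is a hypothetical helper, not shipped with the chart.
# It prints the two --set flags that enable one inference engine and disable the other.
engine_flags() {
  case "$1" in
    tgi)  echo "--set tgi.enabled=true --set vllm.enabled=false" ;;
    vllm) echo "--set tgi.enabled=false --set vllm.enabled=true" ;;
    *)    echo "unknown engine: $1 (expected tgi or vllm)" >&2; return 1 ;;
  esac
}

# Print the flags for the vLLM variant.
engine_flags vllm
```

The printed flags could then be appended to one of the `helm install` command lines.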
## Installing the Chart

To install the chart, run the following:
Follow these steps to install the ChatQnA chart:

1. Clone the GenAIInfra repository:

```bash
git clone https://github.com/opea-project/GenAIInfra.git
```

2. Set up the dependencies and required environment variables:

```bash
cd GenAIInfra/helm-charts/
./update_dependency.sh
helm dependency update chatqna
export HFTOKEN="insert-your-huggingface-token-here"
export MODELDIR="/mnt/opea-models"
export MODELNAME="Intel/neural-chat-7b-v3-3"
```

3. Depending on the target device for running ChatQnA, use one of the following installation commands:

```bash
# Install the chart on a Xeon machine

# If you would like to use the traditional UI, change the image as well as the containerPort within the values:
# append these at the end of the command "--set chatqna-ui.image.repository=opea/chatqna-ui,chatqna-ui.image.tag=latest,chatqna-ui.containerPort=5173"

helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME}
```

```bash
# To use Gaudi device
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-values.yaml
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-values.yaml
```

Review comment (on `# To use Gaudi device`): Now that there's support for both TGI and vLLM, all these comments here could state which one is used, e.g. like this (suggested change):

```bash
# To use Nvidia GPU
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/nv-values.yaml
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/nv-values.yaml
```

```bash
# To include guardrail component in chatqna on Xeon
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-values.yaml
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-values.yaml
```

```bash
# To include guardrail component in chatqna on Gaudi
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-gaudi-values.yaml
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-gaudi-values.yaml
```

> **_NOTE :_** The default installation uses [TGI (Text Generation Inference)](https://github.com/huggingface/text-generation-inference) as the inference engine. To use vLLM as the inference engine, see below.

```bash
# To use vLLM inference engine on Xeon device

helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set llm-ctrl-uservice.LLM_MODEL_ID=${MODELNAME} --set vllm.LLM_MODEL_ID=${MODELNAME} --set tgi.enabled=false --set vllm.enabled=true

# To use OpenVINO optimized vLLM inference engine on Xeon device

helm install -f ./chatqna/vllm-openvino-values.yaml chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set llm-ctrl-uservice.LLM_MODEL_ID=${MODELNAME} --set vllm.LLM_MODEL_ID=${MODELNAME} --set tgi.enabled=false --set vllm.enabled=true
```

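Every install command above expands `HFTOKEN`, `MODELDIR` and `MODELNAME`, so an unset variable silently produces an empty `--set` value. A minimal pre-flight check could look like the following; `require_vars` is a hypothetical helper name, not something the repository provides.

```bash
# Sketch only: require_vars is a hypothetical helper, not part of GenAIInfra.
# It returns non-zero if any of the named environment variables is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the three variables exported in step 2 before running helm install.
if require_vars HFTOKEN MODELDIR MODELNAME; then
  echo "environment looks complete"
fi
```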
### IMPORTANT NOTE

1. Make sure your `MODELDIR` exists on the node where your workload is scheduled, so that the downloaded model can be cached for later use. Otherwise, set `global.modelUseHostPath` to `null` if you don't want to cache the model.

2. If you are behind a proxy, set the `http_proxy`, `https_proxy` and `no_proxy` values when installing the chart.

Review comment on lines +95 to +96: IMHO duplicating general information to application READMEs is not maintainable, there are too many of them. Instead you could include link to general options (

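The first point can be sketched as a small helper that picks the `global.modelUseHostPath` setting depending on whether the directory exists. `model_path_flag` is a hypothetical name, and the check only says something useful when it runs on the node where the workload is scheduled.

```bash
# Sketch only: model_path_flag is a hypothetical helper, not part of the chart.
# It prints the --set flag for model caching, falling back to null when
# the given directory does not exist on this machine.
model_path_flag() {
  if [ -d "$1" ]; then
    echo "--set global.modelUseHostPath=$1"
  else
    echo "--set global.modelUseHostPath=null"
  fi
}

# Example with the default directory used in this README.
model_path_flag "${MODELDIR:-/mnt/opea-models}"
```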
## Verify

To verify the installation, run the command `kubectl get pod` to make sure all pods are running.

@@ -52,8 +106,9 @@ Run the command `kubectl port-forward svc/chatqna 8888:8888` to expose the servi

Open another terminal and run the following command to verify that the service is working:

```bash
curl http://localhost:8888/v1/chatqna \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{"messages": "What is the revenue of Nike in 2023?"}'
```

Review comment (on `-X POST \`): Why add redundant
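Pods can take a while to become ready after installation, so a single request may fail even though the deployment is fine. One way to handle this, shown here only as a sketch, is a small retry wrapper around the check; `retry` is a hypothetical helper, and the attempt count and delay are arbitrary.

```bash
# Sketch only: retry is a hypothetical helper, not part of the chart.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (needs the port-forward from above to be active):
# retry 10 5 curl -sf http://localhost:8888/v1/chatqna \
#     -H "Content-Type: application/json" \
#     -d '{"messages": "What is the revenue of Nike in 2023?"}'
```

The example relies on `curl -f`, which makes curl exit with a non-zero status on HTTP error responses, so the wrapper retries on those as well.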

@@ -71,12 +126,13 @@ Open a browser to access `http://<k8s-node-ip-address>:${port}` to play with the

## Values

| Key               | Type   | Default                       | Description                                                                            |
| ----------------- | ------ | ----------------------------- | -------------------------------------------------------------------------------------- |
| image.repository  | string | `"opea/chatqna"`              |                                                                                        |
| service.port      | string | `"8888"`                      |                                                                                        |
| tgi.LLM_MODEL_ID  | string | `"Intel/neural-chat-7b-v3-3"` | Model ID from https://huggingface.co/, or predownloaded model directory                |
| global.monitoring | bool   | false                         | Enable usage metrics for the service components. See ../monitoring.md before enabling! |

| Key                        | Type   | Default                       | Description                                                                            |
| -------------------------- | ------ | ----------------------------- | -------------------------------------------------------------------------------------- |
| image.repository           | string | `"opea/chatqna"`              |                                                                                        |
| service.port               | string | `"8888"`                      |                                                                                        |
| tgi.LLM_MODEL_ID           | string | `"Intel/neural-chat-7b-v3-3"` | Model ID from https://huggingface.co/, or predownloaded model directory                |
| vllm-openvino.LLM_MODEL_ID | string | `"Intel/neural-chat-7b-v3-3"` | Model ID from https://huggingface.co/, or predownloaded model directory                |
| global.monitoring          | bool   | false                         | Enable usage metrics for the service components. See ../monitoring.md before enabling! |

## Troubleshooting

Review comment: As this is identical to the values file, it should be a symlink, not a copy of it.

@@ -0,0 +1,25 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

tgi:
  enabled: false

vllm:
  enabled: true
  openvino_enabled: true
  image:
    repository: opea/vllm-openvino
    pullPolicy: IfNotPresent
    # Overrides the image tag whose default is the chart appVersion.
    tag: "latest"

  extraCmdArgs: []

  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3

  CUDA_GRAPHS: "0"
  VLLM_CPU_KVCACHE_SPACE: 50
  VLLM_OPENVINO_KVCACHE_SPACE: 32
  OMPI_MCA_btl_vader_single_copy_mechanism: none

  ov_command: ["/bin/bash"]
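
The review comment above asks for a symlink instead of a copied values file. A minimal sketch of that replacement follows; `link_values` is a hypothetical helper, and the `chatqna/ci/` location in the usage line is an assumption, since the file paths are not visible in this diff.

```bash
# Sketch only: link_values is a hypothetical helper, not part of the repository.
# Usage: link_values TARGET LINK
# TARGET is resolved relative to the directory that contains LINK.
link_values() {
  target="$1"
  link="$2"
  # Remove the copied file (if any), then create the relative symlink.
  rm -f "$link"
  ln -s "$target" "$link"
}

# Example with an assumed layout:
# link_values ../vllm-openvino-values.yaml chatqna/ci/vllm-openvino-values.yaml
```

A relative target keeps the link valid when the repository is cloned to a different location.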

@@ -0,0 +1,8 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

tgi:
  enabled: false

vllm:
  enabled: true

@@ -0,0 +1,21 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

vllm:
  openvino_enabled: true

Review comment (on `openvino_enabled`): Does not conform to Helm best practices: https://helm.sh/docs/chart_best_practices/values/ Should be either

  image:
    repository: opea/vllm-openvino
    pullPolicy: IfNotPresent

Review comment (on `pullPolicy`): Drop the value, it breaks CI testing for

    # Overrides the image tag whose default is the chart appVersion.
    tag: "latest"

  extraCmdArgs: []

  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3

  CUDA_GRAPHS: "0"
  VLLM_CPU_KVCACHE_SPACE: 50
  VLLM_OPENVINO_KVCACHE_SPACE: 32
  OMPI_MCA_btl_vader_single_copy_mechanism: none

  ov_command: ["/bin/bash"]
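
The naming point raised in review refers to the Helm guideline that value names start with a lowercase letter and separate words with camelCase. A rough way to spot candidates in a values file is sketched below; `find_non_camel_keys` is a hypothetical helper, and skipping keys that start with an uppercase letter (such as `LLM_MODEL_ID`) is an assumption made here because these charts use such names for environment-style settings.

```bash
# Sketch only: find_non_camel_keys is a hypothetical helper, not a Helm command.
# It prints (with line numbers) keys that start with a lowercase letter but
# contain an underscore or hyphen, e.g. openvino_enabled or ov_command.
# Keys starting with an uppercase letter are deliberately skipped (assumption).
find_non_camel_keys() {
  grep -nE '^[[:space:]]*[a-z][a-zA-Z0-9]*[_-][a-zA-Z0-9_-]*:' "$1"
}

# Example:
# find_non_camel_keys chatqna/vllm-openvino-values.yaml
```

Note that `grep` exits with status 1 when nothing matches, which in this case means no candidate keys were found.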

@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/

@@ -0,0 +1,14 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: v2
name: llm-ctrl-uservice
description: A Helm chart for LLM controller microservice which connects with vLLM microservice to provide inferences.
type: application
version: 1.0.0
appVersion: "v1.0"
dependencies:
  - name: vllm
    version: 1.0.0
    repository: file://../vllm
    condition: vllm.enabled

Review comment: Why are you adding wrappers? They were removed over a month ago for v1.1 (#474), are unnecessary, and the LLM wrapper uses a langserve component with a problematic license (opea-project/GenAIComps#264).