From 18092f3dec2e173c180c4d1b1742bf34dbc2ed73 Mon Sep 17 00:00:00 2001
From: Harsha Ramayanam
Date: Fri, 13 Sep 2024 22:52:56 -0700
Subject: [PATCH] Changes to comps/llms/text-generation/README (#678)

* Added changes per dbkinder's review

Signed-off-by: Harsha Ramayanam

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Harsha Ramayanam
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
---
 comps/llms/text-generation/README.md | 320 +++++++++++++++++++--------
 1 file changed, 230 insertions(+), 90 deletions(-)

diff --git a/comps/llms/text-generation/README.md b/comps/llms/text-generation/README.md
index 18897572a..9c4af98c1 100644
--- a/comps/llms/text-generation/README.md
+++ b/comps/llms/text-generation/README.md
@@ -6,108 +6,149 @@ A prerequisite for using this microservice is that users must have a LLM text ge
Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond initiating a TGI/vLLM/Ray service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses.

-## 🚀1. Start Microservice with Python (Option 1)
+## Validated LLM Models

-To start the LLM microservice, you need to install python packages first.
+| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | Ray |
+| --------------------------- | --------- | -------- | ---------- | --- |
+| [Intel/neural-chat-7b-v3-3] | ✓ | ✓ | ✓ | ✓ |
+| [Llama-2-7b-chat-hf] | ✓ | ✓ | ✓ | ✓ |
+| [Llama-2-70b-chat-hf] | ✓ | - | ✓ | x |
+| [Meta-Llama-3-8B-Instruct] | ✓ | ✓ | ✓ | ✓ |
+| [Meta-Llama-3-70B-Instruct] | ✓ | - | ✓ | x |
+| [Phi-3] | x | Limit 4K | Limit 4K | ✓ |

-### 1.1 Install Requirements
+## Clone OPEA GenAIComps
+
+Clone this repository at your desired location and set an environment variable for easy setup and usage throughout the instructions.

```bash
-pip install -r requirements.txt
+git clone https://github.com/opea-project/GenAIComps.git
+
+export OPEA_GENAICOMPS_ROOT=$(pwd)/GenAIComps
```

-### 1.2 Start LLM Service
+## 🚀1. Start Microservice with Python (Option 1)

-#### 1.2.1 Start TGI Service
+To start the LLM microservice, you need to install Python packages first.
+
+### 1.1 Install Requirements

```bash
-export HF_TOKEN=${your_hf_api_token}
-docker run -p 8008:80 -v ./data:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:1.4 --model-id ${your_hf_llm_model}
-```
+pip install opea-comps
+pip install -r ${OPEA_GENAICOMPS_ROOT}/comps/llms/requirements.txt

-#### 1.2.2 Start vLLM Service
+# Install the requirements for the microservice of your choice in the text-generation folder (tgi, vllm, vllm-ray, etc.)
+export MICROSERVICE_DIR=your_chosen_microservice

-```bash
-export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
-docker run -it --name vllm_service -p 8008:80 -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -v ./data:/data opea/vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model ${your_hf_llm_model} --port 80"
+pip install -r ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/${MICROSERVICE_DIR}/requirements.txt
```
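For example, if you plan to run the TGI flavor of the microservice, the placeholder above can be filled in as follows (adjust the folder name to match your choice):

```bash
# Example: pick the tgi variant and install its requirements.
export MICROSERVICE_DIR=tgi
pip install -r ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/${MICROSERVICE_DIR}/requirements.txt
```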
-### 1.2.3 Start Ray Service
+Set an environment variable `your_ip` to the IP address of the machine where you would like to consume the microservice.

```bash
-export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
-export TRUST_REMOTE_CODE=True
-docker run -it --runtime=habana --name ray_serve_service -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -p 8008:80 -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE opea/llm-ray:latest /bin/bash -c "ray start --head && python api_server_openai.py --port_number 80 --model_id_or_path ${your_hf_llm_model} --chat_processor ${your_hf_chatprocessor}"
+# For example, this command would set the IP address of your currently logged-in machine.
+export your_ip=$(hostname -I | awk '{print $1}')
```

-### 1.3 Verify the LLM Service
+### 1.2 Start LLM Service with Python Script

-#### 1.3.1 Verify the TGI Service
+#### 1.2.1 Start the TGI Service

```bash
-curl http://${your_ip}:8008/generate \
-  -X POST \
-  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
-  -H 'Content-Type: application/json'
+export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
+python ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/tgi/llm.py
```

-#### 1.3.2 Verify the vLLM Service
+#### 1.2.2 Start the vLLM Service

```bash
-curl http://${your_ip}:8008/v1/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-  "model": ${your_hf_llm_model},
-  "prompt": "What is Deep Learning?",
-  "max_tokens": 32,
-  "temperature": 0
-  }'
+export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
+python ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/llm.py
```

-#### 1.3.3 Verify the Ray Service
+#### 1.2.3 Start the Ray Service

```bash
-curl http://${your_ip}:8008/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-  "model": ${your_hf_llm_model},
-  "messages": [
-    {"role": "assistant", "content": "You are a helpful assistant."},
-    {"role": "user", "content": "What is Deep Learning?"},
-  ],
-  "max_tokens": 32,
-  "stream": True
-  }'
+export RAY_Serve_ENDPOINT="http://${your_ip}:8008"
+python ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/ray_serve/llm.py
```

-### 1.4 Start LLM Service with Python Script
+## 🚀2. Start Microservice with Docker (Option 2)
+
+You can use either a published docker image or build your own docker image with the respective microservice Dockerfile of your choice. You must create a user account with [HuggingFace] and obtain permission to use the restricted LLM models by adhering to the guidelines provided on the respective model's webpage.
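For gated models such as the Llama families listed above, request access on the model page first and then make your HuggingFace token available in the shell. A minimal sketch follows; the token value is a placeholder, and the CLI login step assumes the HuggingFace CLI is installed:

```bash
# Export the token used by the containers and scripts below.
export HF_TOKEN=${your_hf_api_token}

# Optionally cache the token locally via the HuggingFace CLI.
huggingface-cli login --token ${HF_TOKEN}
```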
-#### 1.4.1 Start the TGI Service
+### 2.1 Start LLM Service with a published image
+
+#### 2.1.1 Start TGI Service

```bash
-export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
-python text-generation/tgi/llm.py
+export HF_LLM_MODEL=${your_hf_llm_model}
+export HF_TOKEN=${your_hf_api_token}
+
+docker run \
+  -p 8008:80 \
+  -e HF_TOKEN=${HF_TOKEN} \
+  -v ./data:/data \
+  --name tgi_service \
+  --shm-size 1g \
+  ghcr.io/huggingface/text-generation-inference:1.4 \
+  --model-id ${HF_LLM_MODEL}
```

-#### 1.4.2 Start the vLLM Service
+#### 2.1.2 Start vLLM Service

```bash
-export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
-python text-generation/vllm/llm.py
+# Use the script to build the docker image as opea/vllm:cpu
+bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/build_docker_vllm.sh cpu
+
+export HF_LLM_MODEL=${your_hf_llm_model}
+export HF_TOKEN=${your_hf_api_token}
+
+docker run -it \
+  --name vllm_service \
+  -p 8008:80 \
+  -e HF_TOKEN=${HF_TOKEN} \
+  -e VLLM_CPU_KVCACHE_SPACE=40 \
+  -v ./data:/data \
+  opea/vllm:cpu \
+  --model ${HF_LLM_MODEL} \
+  --port 80
```

-#### 1.4.3 Start the Ray Service
+#### 2.1.3 Start Ray Service

```bash
-export RAY_Serve_ENDPOINT="http://${your_ip}:8008"
-python text-generation/ray_serve/llm.py
+export HF_LLM_MODEL=${your_hf_llm_model}
+export HF_CHAT_PROCESSOR=${your_hf_chatprocessor}
+export HF_TOKEN=${your_hf_api_token}
+export TRUST_REMOTE_CODE=True
+
+docker run -it \
+  --runtime=habana \
+  --name ray_serve_service \
+  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
+  --cap-add=sys_nice \
+  --ipc=host \
+  -p 8008:80 \
+  -e HF_TOKEN=$HF_TOKEN \
+  -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE \
+  opea/llm-ray:latest \
+  /bin/bash -c " \
+    ray start --head && \
+    python api_server_openai.py \
+      --port_number 80 \
+      --model_id_or_path ${HF_LLM_MODEL} \
+      --chat_processor ${HF_CHAT_PROCESSOR}"
```

-## 🚀2. Start Microservice with Docker (Option 2)
+### 2.2 Start LLM Service with an image built from source

If you start an LLM microservice with docker, the `docker_compose_llm.yaml` file will automatically start a TGI/vLLM service with docker.

-### 2.1 Setup Environment Variables
+#### 2.2.1 Setup Environment Variables

In order to start TGI and LLM services, you need to setup the following environment variables first.

@@ -120,7 +161,7 @@ export LLM_MODEL_ID=${your_hf_llm_model}
In order to start vLLM and LLM services, you need to setup the following environment variables first.

```bash
-export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
+export HF_TOKEN=${your_hf_api_token}
export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
```

@@ -128,7 +169,7 @@ export LLM_MODEL_ID=${your_hf_llm_model}
In order to start Ray serve and LLM services, you need to setup the following environment variables first.

```bash
-export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
+export HF_TOKEN=${your_hf_api_token}
export RAY_Serve_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL=${your_hf_llm_model}
export CHAT_PROCESSOR="ChatModelLlama"
```

@@ -139,8 +180,13 @@ export CHAT_PROCESSOR="ChatModelLlama"
#### 2.2.1 TGI

```bash
-cd ../../../
-docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile .
+cd ${OPEA_GENAICOMPS_ROOT}
+
+docker build \
+  -t opea/llm-tgi:latest \
+  --build-arg https_proxy=$https_proxy \
+  --build-arg http_proxy=$http_proxy \
+  -f comps/llms/text-generation/tgi/Dockerfile .
```
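A quick way to confirm the build succeeded before moving on (assuming a standard Docker setup) is to list the image you just tagged:

```bash
# An empty result here means the opea/llm-tgi build did not complete.
docker images opea/llm-tgi:latest
```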
#### 2.2.2 vLLM

Build vllm docker.

```bash
-cd text-generation/vllm/langchain/dependency
-bash build_docker_vllm.sh
+bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/langchain/dependency/build_docker_vllm.sh
```

Build microservice docker.

```bash
-cd ../../../
-docker build -t opea/llm-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/vllm/langchain/Dockerfile .
+cd ${OPEA_GENAICOMPS_ROOT}
+
+docker build \
+  -t opea/llm-vllm:latest \
+  --build-arg https_proxy=$https_proxy \
+  --build-arg http_proxy=$http_proxy \
+  -f comps/llms/text-generation/vllm/langchain/Dockerfile .
```

#### 2.2.3 Ray Serve

Build Ray Serve docker.

```bash
-cd text-generation/vllm/ray/dependency
-bash build_docker_vllmray.sh
+bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray/dependency/build_docker_vllmray.sh
```

Build microservice docker.

```bash
-cd ../../../
-docker build -t opea/llm-ray:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/vllm/ray/Dockerfile .
+cd ${OPEA_GENAICOMPS_ROOT}
+
+docker build \
+  -t opea/llm-ray:latest \
+  --build-arg https_proxy=$https_proxy \
+  --build-arg http_proxy=$http_proxy \
+  -f comps/llms/text-generation/vllm/ray/Dockerfile .
```

To start a docker container, you have two options:

- Option A: Run Docker with CLI
- Option B: Run Docker with docker compose

You can choose one as needed.

#### 2.3.1 TGI

```bash
-docker run -d --name="llm-tgi-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT -e HF_TOKEN=$HF_TOKEN opea/llm-tgi:latest
+docker run -d \
+  --name="llm-tgi-server" \
+  -p 9000:9000 \
+  --ipc=host \
+  -e http_proxy=$http_proxy \
+  -e https_proxy=$https_proxy \
+  -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT \
+  -e HF_TOKEN=$HF_TOKEN \
+  opea/llm-tgi:latest
```

#### 2.3.2 vLLM

Start vllm endpoint.

```bash
-bash launch_vllm_service.sh
+bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/langchain/dependency/launch_vllm_service.sh
```

Start vllm microservice.

```bash
-docker run --name="llm-vllm-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=${no_proxy} -e vLLM_LLM_ENDPOINT=$vLLM_LLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e LLM_MODEL_ID=$LLM_MODEL_ID opea/llm-vllm:latest
+docker run \
+  --name="llm-vllm-server" \
+  -p 9000:9000 \
+  --ipc=host \
+  -e http_proxy=$http_proxy \
+  -e https_proxy=$https_proxy \
+  -e no_proxy=${no_proxy} \
+  -e vLLM_LLM_ENDPOINT=$vLLM_LLM_ENDPOINT \
+  -e HF_TOKEN=$HF_TOKEN \
+  -e LLM_MODEL_ID=$LLM_MODEL_ID \
+  opea/llm-vllm:latest
```

#### 2.3.3 Ray Serve

Start Ray Serve endpoint.

```bash
-bash launch_ray_service.sh
+bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray/dependency/launch_vllmray.sh
```
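Before starting the microservice, it can be worth confirming that the serving endpoint launched by the script is actually running and publishing port 8008. One possible check (container names depend on the launch script, so treat this as a sketch):

```bash
# List running containers with their published ports; the endpoint should map port 8008.
docker ps --format 'table {{.Names}}\t{{.Ports}}'
```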
Start Ray Serve microservice.

```bash
-docker run -d --name="llm-ray-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e RAY_Serve_ENDPOINT=$RAY_Serve_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e LLM_MODEL=$LLM_MODEL opea/llm-ray:latest
+docker run -d \
+  --name="llm-ray-server" \
+  -p 9000:9000 \
+  --ipc=host \
+  -e http_proxy=$http_proxy \
+  -e https_proxy=$https_proxy \
+  -e RAY_Serve_ENDPOINT=$RAY_Serve_ENDPOINT \
+  -e HF_TOKEN=$HF_TOKEN \
+  -e LLM_MODEL=$LLM_MODEL \
+  opea/llm-ray:latest
```

### 2.4 Run Docker with Docker Compose (Option B)

#### 2.4.1 TGI

```bash
-cd text-generation/tgi
+cd ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/tgi
docker compose -f docker_compose_llm.yaml up -d
```

#### 2.4.2 vLLM

```bash
-cd text-generation/vllm/langchain
+cd ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/langchain
docker compose -f docker_compose_llm.yaml up -d
```

#### 2.4.3 Ray Serve

```bash
-cd text-genetation/vllm/ray
+cd ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray
docker compose -f docker_compose_llm.yaml up -d
```

@@ -251,7 +332,47 @@ curl http://${your_ip}:9000/v1/health_check\
  -H 'Content-Type: application/json'
```

-### 3.2 Consume LLM Service
+### 3.2 Verify the LLM Service
+
+#### 3.2.1 Verify the TGI Service
+
+```bash
+curl http://${your_ip}:8008/generate \
+  -X POST \
+  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
+  -H 'Content-Type: application/json'
+```
+
+#### 3.2.2 Verify the vLLM Service
+
+```bash
+curl http://${your_ip}:8008/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "${your_hf_llm_model}",
+    "prompt": "What is Deep Learning?",
+    "max_tokens": 32,
+    "temperature": 0
+  }'
+```
+
+#### 3.2.3 Verify the Ray Service
+
+```bash
+curl http://${your_ip}:8008/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "${your_hf_llm_model}",
+    "messages": [
+      {"role": "assistant", "content": "You are a helpful assistant."},
+      {"role": "user", "content": "What is Deep Learning?"}
+    ],
+    "max_tokens": 32,
+    "stream": true
+  }'
+```
+
+### 3.3 Consume LLM Service

You can set the following model parameters according to your actual needs, such as `max_new_tokens`, `streaming`.

The `streaming` parameter determines the format of the data returned by the API.

```bash
# non-streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
-  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
-  -H 'Content-Type: application/json'
+  -H 'Content-Type: application/json' \
+  -d '{
+    "query":"What is Deep Learning?",
+    "max_new_tokens":17,
+    "top_k":10,
+    "top_p":0.95,
+    "typical_p":0.95,
+    "temperature":0.01,
+    "repetition_penalty":1.03,
+    "streaming":false
+  }'
+
# streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
-  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
-  -H 'Content-Type: application/json'
+  -H 'Content-Type: application/json' \
+  -d '{
+    "query":"What is Deep Learning?",
+    "max_new_tokens":17,
+    "top_k":10,
+    "top_p":0.95,
+    "typical_p":0.95,
+    "temperature":0.01,
+    "repetition_penalty":1.03,
+    "streaming":true
+  }'
+
```
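If you issue these requests often, a small wrapper keeps the payload in one place. The sketch below reuses the non-streaming request shown above; the helper name and the default `max_new_tokens` value are illustrative only:

```bash
# Hypothetical helper around the non-streaming /v1/chat/completions request documented above.
ask_llm() {
  local question="$1"
  curl -s http://${your_ip}:9000/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d "{\"query\":\"${question}\",\"max_new_tokens\":64,\"streaming\":false}"
}

ask_llm "What is Deep Learning?"
```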
-### 4. Validated Model
+
-| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | Ray |
-| ------------------------- | --------- | -------- | ---------- | --- |
-| Intel/neural-chat-7b-v3-3 | ✓ | ✓ | ✓ | ✓ |
-| Llama-2-7b-chat-hf | ✓ | ✓ | ✓ | ✓ |
-| Llama-2-70b-chat-hf | ✓ | - | ✓ | x |
-| Meta-Llama-3-8B-Instruct | ✓ | ✓ | ✓ | ✓ |
-| Meta-Llama-3-70B-Instruct | ✓ | - | ✓ | x |
-| Phi-3 | x | Limit 4K | Limit 4K | ✓ |
+[Intel/neural-chat-7b-v3-3]: https://huggingface.co/Intel/neural-chat-7b-v3-3
+[Llama-2-7b-chat-hf]: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
+[Llama-2-70b-chat-hf]: https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
+[Meta-Llama-3-8B-Instruct]: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
+[Meta-Llama-3-70B-Instruct]: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
+[Phi-3]: https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3
+[HuggingFace]: https://huggingface.co/