Commit 66348ea
Signed-off-by: Chen Xi <[email protected]>
1 parent: 7d2cd6b
Showing 17 changed files with 987 additions and 0 deletions.
@@ -0,0 +1,146 @@
# Speculative Decoding Microservice
This microservice, designed for Speculative Decoding, processes input consisting of a query string. It constructs a prompt based on the query and documents, which is then used to perform inference with a large language model. The service delivers the inference results as output.

A prerequisite for using this microservice is that users must have an LLM text generation service already running. Users need to set the LLM service's endpoint in an environment variable. The microservice uses this endpoint to create a speculative decoding object, enabling it to communicate with the speculative decoding service for executing language model operations.

Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond starting a vLLM service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses.
## 🚀1. Start Microservice with Python (Option 1)

To start the LLM microservice, you need to install the required Python packages first.

### 1.1 Install Requirements

```bash
pip install -r requirements.txt
```
### 1.2 Start Speculative Decoding Service

#### 1.2.1 Start vLLM Service

```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
docker run -it --name vllm_service -p 8008:8008 -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -v ./data:/data opea/vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model ${your_hf_llm_model} --speculative_model ${your_speculative_model} --num_speculative_tokens ${your_speculative_tokens} --use-v2-block-manager --tensor-parallel-size 1 --port 8008"
```
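The container can take a while to download the model and start serving. As a minimal sketch (assuming the OpenAI-compatible `/v1/models` route exposed by the vLLM API server), you can poll the endpoint until it responds before moving on:

```bash
# Poll the vLLM endpoint until the OpenAI-compatible API answers (Ctrl+C to abort).
until curl -sf "http://${your_ip}:8008/v1/models" > /dev/null; do
  echo "Waiting for the vLLM service to become ready..."
  sleep 10
done
echo "vLLM service is up."
```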
### 1.3 Verify the Speculative Decoding Service

#### 1.3.1 Verify the vLLM Service

```bash
curl http://${your_ip}:8008/v1/completions \
  -H "Content-Type: application/json" \
  -d "{
  \"model\": \"${your_hf_llm_model}\",
  \"prompt\": \"What is Deep Learning?\",
  \"max_tokens\": 32,
  \"temperature\": 0
  }"
```
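If you only want the generated text, one option (assuming `jq` is installed) is to pipe the JSON response through `jq`:

```bash
# Same request as above, printing only the generated completion text.
curl -s http://${your_ip}:8008/v1/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"${your_hf_llm_model}\", \"prompt\": \"What is Deep Learning?\", \"max_tokens\": 32, \"temperature\": 0}" \
  | jq -r '.choices[0].text'
```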
### 1.4 Start Speculative Decoding Service with Python Script

#### 1.4.1 Start the Microservice

```bash
export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
python text-generation/vllm/llm.py
```
## 🚀2. Start Microservice with Docker (Option 2)

If you start the LLM microservice with Docker, the `docker_compose_spec_decode.yaml` file will automatically start a vLLM service with Docker as well.

### 2.1 Setup Environment Variables

In order to start the vLLM and LLM services, you need to set up the following environment variables first.
```bash
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
export SPEC_MODEL_ID=${your_hf_spec_model}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_PROJECT="opea/spec_decode"
```
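If you still need a value for `your_ip` used above, one common approach on a Linux host (an assumption; adjust for your environment) is:

```bash
# Pick the host's primary IP address as the value for your_ip.
export your_ip=$(hostname -I | awk '{print $1}')
echo "Using your_ip=${your_ip}"
```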
### 2.2 Build Docker Image

#### 2.2.1 vLLM

Build the vLLM Docker image.

```bash
bash build_docker_vllm.sh
```

Build the microservice Docker image.

```bash
cd ../../../../
docker build -t opea/spec_decode-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/spec_decode/text-generation/vllm/docker/Dockerfile.microservice .
```
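To confirm both images were built, a quick check (image names assumed to match the build commands above) is:

```bash
# List the freshly built images; both opea/vllm and opea/spec_decode-vllm should appear.
docker images | grep -E 'opea/(vllm|spec_decode-vllm)'
```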
To start a Docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

You can choose either one as needed.

### 2.3 Run Docker with CLI (Option A)

#### 2.3.1 vLLM

Start the vLLM endpoint.

```bash
bash launch_vllm_service.sh
```
Start the vLLM microservice.

```bash
docker run --name="llm-vllm-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=${no_proxy} -e vLLM_LLM_ENDPOINT=$vLLM_LLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e SPEC_MODEL_ID=$SPEC_MODEL_ID -e LLM_MODEL_ID=$LLM_MODEL_ID opea/spec_decode-vllm:latest
```
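If the microservice does not respond, inspecting the container logs is a good first step (container name taken from the command above):

```bash
# Follow the microservice logs; Ctrl+C stops following without stopping the container.
docker logs -f llm-vllm-server
```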
### 2.4 Run Docker with Docker Compose (Option B)

#### 2.4.1 vLLM

```bash
cd text-generation/vllm
docker compose -f docker_compose_llm.yaml up -d
```
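After Compose brings the stack up, you can check container status and logs with the usual Compose commands, for example:

```bash
# Show the services defined in the compose file and their state.
docker compose -f docker_compose_llm.yaml ps

# Tail the logs of all services in the stack.
docker compose -f docker_compose_llm.yaml logs -f
```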
## 🚀3. Consume LLM Service

### 3.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```
### 3.2 Consume LLM Service

You can set the following model parameters according to your actual needs, such as `max_new_tokens` and `streaming`.

The `streaming` parameter determines the format of the data returned by the API. With `streaming=false`, the API returns the full generated text as a single string; with `streaming=true`, it returns the text incrementally as a stream.
```bash
# non-streaming mode
curl http://${your_ip}:9000/v1/spec_decode/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
  -H 'Content-Type: application/json'

# streaming mode
curl http://${your_ip}:9000/v1/spec_decode/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```
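In streaming mode the response arrives as a sequence of chunks rather than a single JSON object. To see each chunk as soon as it is produced, you can disable curl's output buffering (a minor convenience, not a requirement; payload trimmed here for brevity):

```bash
# --no-buffer (-N) prints each streamed chunk as soon as it arrives.
curl -N http://${your_ip}:9000/v1/spec_decode/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"streaming":true}' \
  -H 'Content-Type: application/json'
```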
@@ -0,0 +1,9 @@
docarray[full]
fastapi
huggingface_hub
langchain==0.1.16
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
shortuuid
uvicorn
@@ -0,0 +1,127 @@
# vLLM Endpoint Service

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with a set of advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products). This guide provides an example of how to launch a vLLM serving endpoint on CPU and Gaudi accelerators.
## Set up environment variables

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
export vLLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
```
For gated models such as `LLAMA-2`, you will have to set the `HUGGINGFACEHUB_API_TOKEN` environment variable. Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get an access token, and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with that token.
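As an alternative to only exporting the variable (and assuming the `huggingface_hub` CLI is installed), you can log in once so that downloads of gated models are authorized:

```bash
# Authenticate the local Hugging Face client with the token obtained above.
huggingface-cli login --token "$HUGGINGFACEHUB_API_TOKEN"
```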
## Set up vLLM Service

### vLLM on CPU

First, let's enable vLLM on CPU.

#### Build docker

```bash
bash ./build_docker_vllm.sh
```
The `build_docker_vllm.sh` script accepts one parameter, `hw_mode`, to specify the hardware mode of the service; the default is `cpu`, and `gpu` can be selected instead.

#### Launch vLLM service

```bash
bash ./launch_vllm_service.sh
```

If you want to customize the port or model name, you can run:

```bash
bash ./launch_vllm_service.sh ${port_number} ${model_name}
```
### vLLM on GPU

Next, we show how to enable vLLM on GPU.

#### Build docker

```bash
bash ./build_docker_vllm.sh gpu
```

Set `hw_mode` to `gpu`.

#### Launch vLLM service on single node

For small models, a single node is sufficient.

```bash
bash ./launch_vllm_service.sh ${port_number} ${model_name} gpu 1
```

Set `hw_mode` to `gpu` and `parallel_number` to 1.
The `launch_vllm_service.sh` script accepts 7 parameters (see the example invocation after this list):

- port_number: The port number assigned to the vLLM endpoint, with the default being 8008.
- model_name: The model name used for the LLM, with the default set to `meta-llama/Meta-Llama-3-8B-Instruct`.
- hw_mode: The hardware mode used for the LLM, with the default set to `cpu`; `gpu` can be selected instead.
- parallel_number: The number of parallel nodes for `gpu` mode.
- block_size: Defaults to 128 for better performance on GPU.
- max_num_seqs: Defaults to 256 for better performance on GPU.
- max_seq_len_to_capture: Defaults to 2048 for better performance on GPU.
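For reference, a fully spelled-out invocation might look like the following, assuming the script consumes its positional arguments in the order listed above (the values are illustrative defaults, not a recommendation):

```bash
# port, model, hw_mode, parallel_number, block_size, max_num_seqs, max_seq_len_to_capture
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-8B-Instruct gpu 1 128 256 2048
```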
For more performance tuning tips, refer to [Performance tuning](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#performance-tips).
#### Launch vLLM service on multiple nodes

For large models such as `meta-llama/Meta-Llama-3-70b`, we need to launch the service on multiple nodes.

```bash
bash ./launch_vllm_service.sh ${port_number} ${model_name} hpu ${parallel_number}
```

For example, to run `meta-llama/Meta-Llama-3-70b` on 8 cards, we can use the following command.

```bash
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-70b hpu 8
```
### Query the service

You can then make requests like the one below to check the service status:

```bash
curl http://${your_ip}:8008/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "prompt": "What is Deep Learning?",
  "max_tokens": 32,
  "temperature": 0
  }'
```
## Set up OPEA microservice

Then we wrap the vLLM service into an OPEA microservice.
### Build docker

```bash
bash build_docker_microservice.sh
```

### Launch the microservice

```bash
bash launch_microservice.sh
```

### Query the microservice

```bash
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_p":0.95,"temperature":0.01,"streaming":false}' \
  -H 'Content-Type: application/json'
```
comps/spec_decode/text-generation/vllm/build_docker_microservice.sh (9 additions, 0 deletions)

@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

cd ../../../../
docker build \
  -t opea/spec_decode-vllm:latest \
  --build-arg https_proxy=$https_proxy \
  --build-arg http_proxy=$http_proxy \
  -f comps/spec_decode/text-generation/vllm/docker/Dockerfile.microservice .
comps/spec_decode/text-generation/vllm/build_docker_vllm.sh (38 additions, 0 deletions)

@@ -0,0 +1,38 @@
#!/bin/bash

# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# # Set default values
# default_hw_mode="cpu"
#
# # Assign arguments to variable
# hw_mode=${1:-$default_hw_mode}
#
# # Check if all required arguments are provided
# if [ "$#" -lt 0 ] || [ "$#" -gt 1 ]; then
#   echo "Usage: $0 [hw_mode]"
#   echo "Please customize the arguments you want to use.
#   - hw_mode: The hardware mode for the Ray Gaudi endpoint, with the default being 'cpu', and the optional selection can be 'cpu' and 'hpu'."
#   exit 1
# fi

# # Build the docker image for vLLM based on the hardware mode
# if [ "$hw_mode" = "hpu" ]; then
#   docker build -f docker/Dockerfile.hpu -t opea/vllm:hpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
# else
# change to the cpu git
git clone https://github.com/jiqing-feng/vllm.git
cd ./vllm/
docker build -f Dockerfile -t opea/vllm:gpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy