(genai-serving)=
# Serving GenAI Models

Serving a GenAI model is in essence the same as serving any other model. The main differences are in the inputs and outputs, which are usually unstructured (text or images), and in the model itself, which is usually a transformer model. With MLRun you can serve any model, including pretrained models from the Hugging Face model hub as well as models fine-tuned with MLRun.

Another common use case is to serve the model as part of an inference pipeline, where the model is one stage of a larger pipeline that includes data preprocessing, model execution, and post-processing. This is covered in the {ref}`GenAI serving graph section <genai-serving-graph>`.

## Serving using the function hub

The function hub has a serving class called `hugging_face_serving` for running Hugging Face models. The following code shows how to import the function into your project:

```python
hugging_face_serving = project.set_function("hub://hugging_face_serving")
```

Next, you can add a model to the function using the following code:

```python
hugging_face_serving.add_model(
    "mymodel",
    class_name="HuggingFaceModelServer",
    model_path="123",  # This is not used, just for enabling the process.
    task="text-generation",
    model_class="AutoModelForCausalLM",
    model_name="openai-community/gpt2",
    tokenizer_class="AutoTokenizer",
    tokenizer_name="openai-community/gpt2",
)
```

And test the model:

```python
hugging_face_mock_server = hugging_face_serving.to_mock_server()
result = hugging_face_mock_server.test(
    "/v2/models/mymodel",
    body={"inputs": ["write a short poem"]}
)
print(f"Output: {result['outputs']}")
```

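Once the model behaves as expected in the mock server, the hub function can be deployed like any other MLRun serving function. A minimal sketch, assuming the `project` object from above:

```python
# Build and deploy the hub-based serving function as a real-time endpoint
project.deploy_function(hugging_face_serving)
```
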
## Implementing your own model serving function

The following code shows how to build a simple model serving function using MLRun. The function loads a pretrained model from the Hugging Face model hub and serves it using the MLRun model server.

```{admonition} Note
This example uses the [ONNX Runtime](https://onnxruntime.ai/docs/) for illustrative purposes; you can use any other runtime within your model serving class.
To run this code, make sure to run `pip install huggingface_hub onnxruntime_genai` in your Python environment.
```

```python
import os
from typing import Any, Dict

from huggingface_hub import snapshot_download
import onnxruntime_genai as og
import mlrun


class OnnxGenaiModelServer(mlrun.serving.v2_serving.V2ModelServer):
    def __init__(
        self,
        context: mlrun.MLClientCtx,
        name: str,
        model_path: str,
        model_name: str,
        search_options: Dict = {},
        chat_template: str = "<|user|>\n{prompt} <|end|>\n<|assistant|>",
        **class_args,
    ):
        # Initialize the base server:
        super(OnnxGenaiModelServer, self).__init__(
            context=context,
            name=name,
            model_path=model_path,
            **class_args,
        )

        self.chat_template = chat_template
        self.search_options = search_options

        # Set the max length to something sensible by default, unless it is specified by the user,
        # since otherwise it will be set to the entire context length
        if "max_length" not in self.search_options:
            self.search_options["max_length"] = 2048

        # Save hub loading parameters:
        self.model_name = model_name

        # Prepare variables for future use:
        self.model_folder = None
        self.model = None
        self.tokenizer = None

    def load(self):
        # Download the model snapshot and save it to the model folder
        self.model_folder = snapshot_download(self.model_name)

        # Load the model from the model folder
        self.model = og.Model(os.path.join(self.model_folder, self.model_path))

        # Create a tokenizer using the loaded model
        self.tokenizer = og.Tokenizer(self.model)

    def predict(self, request: Dict[str, Any]) -> list:
        # Get the prompts from the inputs and apply the chat template:
        prompts = [
            self.chat_template.format(prompt=item.get("prompt"))
            for item in request["inputs"]
        ]

        # Tokenize:
        input_tokens = self.tokenizer.encode_batch(prompts)

        # Create the generation parameters:
        params = og.GeneratorParams(self.model)
        params.set_search_options(**self.search_options)
        params.input_ids = input_tokens

        # Generate output tokens:
        output_tokens = self.model.generate(params)

        # Decode output tokens to text:
        response = [
            {"prediction": self.tokenizer.decode(output), "prompt": prompt}
            for (output, prompt) in zip(output_tokens, prompts)
        ]

        return response
```

During load, the code above downloads a model from the Hugging Face hub and creates a model object and a tokenizer.

During prediction, the code collects all prompts, tokenizes them, generates the response tokens, and decodes the output tokens back to text.

If we save the code above to `src/onnx_genai_serving.py`, we can create a model serving function with the following code:

```python
import os
import mlrun

project = mlrun.get_or_create_project("genai-deployment", context="./", user_project=True)

genai_serving = project.set_function(
    "src/onnx_genai_serving.py",
    name="genai-serving",
    kind="serving",
    image="mlrun/mlrun",
    requirements=["huggingface_hub", "onnxruntime_genai"],
)

genai_serving.add_model(
    "mymodel",
    model_name="microsoft/Phi-3-mini-4k-instruct-onnx",
    model_path=os.path.join("cpu_and_mobile", "cpu-int4-rtn-block-32-acc-level-4"),
    class_name="OnnxGenaiModelServer",
)
```

The code loads a Phi-3 model. We use the CPU version here so it is easy to test and run, but you can just as easily provide a GPU-based model.

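If you do switch to a GPU build of the model, you typically also request a GPU for the serving function. The following is a minimal sketch under stated assumptions: the GPU model folder name is illustrative (check the model card for the actual layout), and running it requires the GPU flavor of the ONNX GenAI runtime and a GPU-enabled image:

```python
# Illustrative GPU variant of the same model (folder name may differ per model card)
genai_serving.add_model(
    "mymodel-gpu",
    model_name="microsoft/Phi-3-mini-4k-instruct-onnx",
    model_path=os.path.join("cuda", "cuda-int4-rtn-block-32"),
    class_name="OnnxGenaiModelServer",
)

# Request one GPU for the function's pods
genai_serving.with_limits(gpus=1)
```
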
We can test the model with the following code:

```python
mock_server = genai_serving.to_mock_server()

result = mock_server.test(
    "/v2/models/mymodel",
    body={"inputs": [{"prompt": "What is 1+1?"}]}
)
print(f"Output: {result['outputs']}")
```

A typical output would be:

```
Output: [{'prediction': '\nWhat is 1+1? \n1+1 equals 2. This is a basic arithmetic addition problem where you add one unit to another unit.', 'prompt': '<|user|>\nWhat is 1+1? <|end|>\n<|assistant|>'}]
```

To deploy the model, run:

```python
project.deploy_function(genai_serving)
```

This builds a Docker image with the required dependencies and deploys a Nuclio function.

To test the deployed model, call it through its HTTP trigger as follows:

```python
genai_serving.invoke(
    "/v2/models/mymodel",
    body={"inputs": [{"prompt": "What is 1+1?"}]}
)
```

(genai-serving-graph)=
# GenAI Realtime Serving Graph

During inference, it is common to serve a GenAI model as part of a larger pipeline that includes data preprocessing, model execution, and post-processing. This can be done with MLRun using the real-time serving pipeline feature. Prior to model inference, the context is typically enriched using a vector database, then the input is transformed to input tokens, and finally the model is executed. Pre-processing and post-processing may also include guardrails that validate the input (for example, preventing the user from asking questions that attempt to exploit the model) as well as output checks, to verify that the model does not hallucinate or include data that may not be shared.

## A basic graph

To run a model as part of a larger pipeline, use the `set_topology` method of the serving function. The following code shows how to set up a simple pipeline that includes a single step. This example is taken from the [Interactive bot demo using LLMs and MLRun](https://github.com/mlrun/demo-llm-bot), which calls the OpenAI ChatGPT model:

```python
class QueryLLM:
    def __init__(self):
        config = AppConfig()
        self.agent = build_agent(config=config)

    def do(self, event):
        try:
            agent_resp = self.agent(
                {
                    "input": event.body["question"],
                    "chat_history": messages_from_dict(event.body["chat_history"]),
                }
            )
            event.body["output"] = parse_agent_output(agent_resp=agent_resp)
        except ValueError as e:
            response = str(e)
            if not response.startswith("Could not parse LLM output: `"):
                raise e
            event.body["output"] = response.removeprefix(
                "Could not parse LLM output: `"
            ).removesuffix("`")
        return event
```

Store the code above in `src/serve_llm.py`, then create the serving function by running the following code:

```python
serving_fn = project.set_function(
    name="serve-llm",
    func="src/serve_llm.py",
    kind="serving",
    image=image,
)
graph = serving_fn.set_topology("flow", engine="async")
graph.add_step(
    name="llm",
    class_name="src.serve_llm.QueryLLM",
    full_event=True,
).respond()
```

We can now use a similar approach to add more steps to the pipeline.

## Setting up a Multi-step Inference Pipeline

The following code shows how to set up a multi-step inference pipeline using MLRun. This code is available in the [MLRun fine-tuning demo](https://github.com/mlrun/demo-llm-tuning):

```python
# Set the topology and get the graph object:
graph = serving_function.set_topology("flow", engine="async")

# Add the steps:
graph.to(handler="preprocess", name="preprocess") \
    .to("LLMModelServer",
        name="infer",
        model_args={"load_in_8bit": True,
                    "device_map": "cuda:0",
                    "trust_remote_code": True},
        tokenizer_name="tiiuae/falcon-7b",
        model_name="tiiuae/falcon-7b",
        peft_model=project.get_artifact_uri("falcon-7b-mlrun")) \
    .to(handler="postprocess", name="postprocess") \
    .to("ToxicityClassifierModelServer",
        name="toxicity-classifier",
        threshold=0.7).respond()
```

This flow is illustrated as follows:

```{mermaid}
flowchart LR
    A([start]) --> B(preprocess)
    B --> C(infer)
    C --> D(postprocess)
    D --> E(toxicity-classifier)
```

Generally, each step can be a Python function, a serving class, or a class that implements the `do` method. In this case, `LLMModelServer` and `ToxicityClassifierModelServer` are serving classes, while `preprocess` and `postprocess` are Python functions, as sketched below.

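As a rough illustration (the step and field names here are hypothetical, not taken from the demo), a function step and a `do`-style class step could look like this:

```python
# A plain Python function step: receives the event body and returns a new body
def preprocess(body: dict) -> dict:
    body["prompt"] = body["prompt"].strip()
    return body


# A class step that implements `do`: receives the full event when full_event=True
class ToxicityFilter:
    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold

    def do(self, event):
        # Mask the output if the (upstream-computed) toxicity score is too high
        if event.body.get("toxicity", 0.0) > self.threshold:
            event.body["output"] = "[filtered]"
        return event
```
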
```{admonition} Note
The {ref}`GenAI serving class<genai-serving>` example showed the simple case of deploying a single model. With real-time serving pipelines, you can run a more realistic end-to-end inference pipeline that retrieves data, runs multiple models, and filters data or results at any stage.
```

Once you have the serving pipeline, it behaves just like any other serving function, including the use of `serving_function.to_mock_server()` to test the pipeline and `project.deploy_function(serving_function)` to deploy it.

An example of calling the pipeline:

```python
generate_kwargs = {"max_length": 150, "temperature": 0.9, "top_p": 0.5, "top_k": 25, "repetition_penalty": 1.0}
response = serving_function.invoke(path="/predict", body={"prompt": "What is MLRun?", **generate_kwargs})
print(response["outputs"])
```

## Distributed pipelines

By default, all steps of the serving graph run in sequence on the same pod. It is possible to run different steps on different pods using {ref}`distributed pipelines<distributed-graph>`; typically, steps that require only a CPU run on one pod, while steps that require a GPU run on another.

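As a rough sketch of how this looks (the function name, file path, and image below are illustrative, and the exact API is described in the distributed pipelines documentation), a step can be routed to a child function that carries the GPU resources while the rest of the graph stays on the parent function:

```python
# Define a child function that runs on its own (GPU-equipped) pods
serving_function.add_child_function(
    "gpu-infer", "./src/infer.py", image="mlrun/mlrun-gpu"
)

# Route only the inference step to the child function
graph.to(handler="preprocess", name="preprocess") \
    .to("LLMModelServer", name="infer", function="gpu-infer") \
    .to(handler="postprocess", name="postprocess").respond()
```
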
(gpu-utilization)=
# GPU Utilization

GenAI models require GPUs to run, and because they are usually large, they also require a lot of GPU memory. GPU memory is limited, however, and can become a bottleneck for running large models. This section discusses techniques to improve and optimize GPU utilization during inference. The list covers some important considerations, but it is not exhaustive.

## Optimization Techniques

### Reduce model size

There are various ways to reduce the model size, starting with choosing a smaller model. For example, there are cases where a model with 7 billion parameters may be sufficient for a given task, while a model with 70 billion parameters may not provide a significant improvement in performance.

MLRun lets you use any model and automate the pipeline, which gives you the ability to test different models and see which one works best for your use case.

A common technique to reduce the model size is quantization. Quantization reduces the precision of the weights and activations of the model, which can lead to a significant reduction in memory usage and a speedup in inference time. The most common form is 8-bit quantization, which reduces the precision from 32-bit floating point to 8-bit integers. This can lead to a 4x reduction in memory usage and a significant improvement in inference time.

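As an illustration, with the Hugging Face `transformers` library, 8-bit weights can be requested at load time. This is a minimal sketch, assuming the `bitsandbytes` and `accelerate` packages are installed and a CUDA GPU is available; the model name is only an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "tiiuae/falcon-7b"  # example model

# Load the weights quantized to 8-bit integers instead of 16/32-bit floats
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
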
In some cases, quantization can lead to a significant reduction in accuracy, so it is important to test the quantized model on a validation set to ensure that accuracy is not severely impacted.

MLRun can automate the quantization process, which helps you quickly test different quantization settings and ensures that quantization happens automatically in your CI/CD pipeline.

### Attention

In deep learning models, attention mechanisms are used to focus on different parts of the input sequence. They can be computationally expensive and can be a bottleneck for running large models. One way to improve GPU utilization is to use [FlashAttention](https://github.com/Dao-AILab/flash-attention), a more efficient attention implementation that can lead to a significant speedup and memory reduction. Standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. This translates to a 10x memory saving at a sequence length of 2K, and 20x at 4K, so FlashAttention can scale to much longer sequence lengths. FlashAttention-2 offers even faster attention with better parallelism and work partitioning.

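With the Hugging Face `transformers` library, for example, FlashAttention-2 can be requested when the model is loaded. This is a sketch under stated assumptions: the `flash-attn` package is installed, the GPU and model support it, and the model name is only an example:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",                       # example model
    torch_dtype=torch.bfloat16,               # FlashAttention requires fp16/bf16
    attn_implementation="flash_attention_2",  # use the FlashAttention-2 kernels
    device_map="cuda:0",
)
```
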
## Inference Optimization

### Batch Size

Batch size is an important parameter that can have a significant impact on GPU utilization. Increasing the batch size improves GPU utilization and can speed up inference, but it also increases latency. For LLMs, static batching is less effective than dynamic batching, because not all inputs in a batch finish producing completion tokens at the same time, so the longest sequence holds up the rest of the batch. The big improvement from better batching comes not just from GPU utilization but from increased throughput.

### GPU allocation

When running multiple models, it is important to allocate GPUs dynamically, on demand. MLRun uses Nuclio for serverless functions, which can free up the GPU when the function is not running or when it scales down, leading to better overall GPU utilization.

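In practice this usually amounts to requesting the GPU on the serving function and letting it scale down when idle. A minimal sketch, assuming a serving function named `genai-serving` and a platform configured to allow scale-to-zero:

```python
serving_fn = project.get_function("genai-serving")

# Request one GPU per replica
serving_fn.with_limits(gpus=1)

# Let the function scale down when idle so the GPU is released,
# and cap the number of GPU-holding replicas
serving_fn.spec.min_replicas = 0
serving_fn.spec.max_replicas = 2
```
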
### Using CPUs

Some GenAI tasks are better suited to CPUs, such as data preprocessing, loading the model, and processing the outputs. By offloading these tasks to CPUs, you free up the GPU for running the model, which leads to better GPU utilization. Therefore, rather than running the entire pipeline on the GPU, you can run the CPU-bound tasks on CPUs and the model on the GPU, as sketched below. This usually means that the inference pipeline runs on different nodes, and MLRun can automatically distribute the pipeline across different nodes.

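One way to express this placement in MLRun is through node selection on the functions that make up the pipeline. This is only a sketch: the function names and label keys below are hypothetical and must match the labels used in your Kubernetes cluster:

```python
# Pin the GPU-bound model server to GPU nodes
gpu_fn = project.get_function("genai-serving")
gpu_fn.with_limits(gpus=1)
gpu_fn.with_node_selection(node_selector={"node-type": "gpu"})

# Keep the CPU-bound pre/post-processing function on general-purpose nodes
cpu_fn = project.get_function("genai-preprocess")  # hypothetical function name
cpu_fn.with_node_selection(node_selector={"node-type": "cpu"})
```
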
### Multiple GPUs

When multiple GPUs are available, you can use multiple workers to run the model in parallel. This can lead to better GPU utilization and a speedup in inference time. Typically, orchestrating multiple GPUs requires significant engineering effort. MLRun provides the ability to run multiple workers in parallel and automatically distributes the function code across multiple GPUs; from the user's point of view, it is as simple as setting the number of workers to run in parallel.