Commit: GenAI deployment docs

gilad-shaham committed Jun 14, 2024
1 parent 97fbcc2 commit 015f4fc
Showing 6 changed files with 346 additions and 2 deletions.
5 changes: 3 additions & 2 deletions docs/conf.py
@@ -62,6 +62,7 @@ def current_version():
"sphinx_design",
"sphinx_reredirects",
"versionwarning.extension",
"sphinxcontrib.mermaid",
]

# Add any paths that contain templates here, relative to this directory.
@@ -172,8 +173,8 @@ def current_version():
redirects = {
"runtimes/functions-architecture": "runtimes/functions.html",
"monitoring/initial-setup-configuration": "monitoring/model-monitoring-deployment.html",
"tutorials/05-batch-infer.ipynb": "tutorials/06-batch-infer.ipynb",
"tutorials/06-model-monitoring.ipynb": "tutorials/05-model-monitoring.ipynb",
"tutorials/05-batch-infer": "tutorials/06-batch-infer.html",
"tutorials/06-model-monitoring": "tutorials/05-model-monitoring.html",
}

smartquotes = False
190 changes: 190 additions & 0 deletions docs/genai/deployment/genai_serving.md
@@ -0,0 +1,190 @@
(genai-serving)=
# Serving GenAI Models

Serving a GenAI model is essentially the same as serving any other model. The main differences are the inputs and outputs, which are usually unstructured (text or images), and the model itself, which is typically a transformer model. With MLRun you can serve any model, including pretrained models from the Hugging Face model hub as well as models fine-tuned with MLRun.

Another common use case is to serve the model within an inference pipeline, where the model is one step of a larger pipeline that includes data preprocessing, model execution, and post-processing. This is covered in the {ref}`GenAI serving graph section <genai-serving-graph>`.


## Serving using the function hub

The function hub has a serving class called `hugging_face_serving` to run Hugging Face models. The following code shows how to import the function into your project:

```python
hugging_face_serving = project.set_function("hub://hugging_face_serving")
```

Next, you can add a model to the function using the following code:

```python
hugging_face_serving.add_model(
    'mymodel',
    class_name='HuggingFaceModelServer',
    model_path='123',  # This is not used, just for enabling the process.
    task="text-generation",
    model_class="AutoModelForCausalLM",
    model_name="openai-community/gpt2",
    tokenizer_class="AutoTokenizer",
    tokenizer_name="openai-community/gpt2",
)
```

Then test the model:
```python
hugging_face_mock_server = hugging_face_serving.to_mock_server()
result = hugging_face_mock_server.test(
    "/v2/models/mymodel",
    body={"inputs": ["write a short poem"]},
)
print(f"Output: {result['outputs']}")
```

## Implementing your own model serving function

The following code shows how to build a simple model serving function using MLRun. The function loads a pretrained model from the Hugging Face model hub and serves it using the MLRun model server.

```{admonition} Note
This example uses the [ONNX runtime](https://onnxruntime.ai/docs/) for illustrative purposes; you can use any other runtime within your model serving class.
To run this code, make sure to run `pip install huggingface_hub onnxruntime_genai` in your Python environment.
```


```python
import os
from typing import Any, Dict

from huggingface_hub import snapshot_download
import onnxruntime_genai as og
import mlrun


class OnnxGenaiModelServer(mlrun.serving.v2_serving.V2ModelServer):
    def __init__(
        self,
        context: mlrun.MLClientCtx,
        name: str,
        model_path: str,
        model_name: str,
        search_options: Dict = {},
        chat_template: str = "<|user|>\n{prompt} <|end|>\n<|assistant|>",
        **class_args,
    ):
        # Initialize the base server:
        super(OnnxGenaiModelServer, self).__init__(
            context=context,
            name=name,
            model_path=model_path,
            **class_args,
        )

        self.chat_template = chat_template
        self.search_options = search_options

        # Set the max length to something sensible by default, unless it is specified
        # by the user, since otherwise it will be set to the entire context length
        if "max_length" not in self.search_options:
            self.search_options["max_length"] = 2048

        # Save hub loading parameters:
        self.model_name = model_name

        # Prepare variables for future use:
        self.model_folder = None
        self.model = None
        self.tokenizer = None

    def load(self):
        # Download the model snapshot and save it to the model folder
        self.model_folder = snapshot_download(self.model_name)

        # Load the model from the model folder
        self.model = og.Model(os.path.join(self.model_folder, self.model_path))

        # Create a tokenizer using the loaded model
        self.tokenizer = og.Tokenizer(self.model)

    def predict(self, request: Dict[str, Any]) -> list:
        # Get prompts from inputs:
        prompts = [
            self.chat_template.format(prompt=input.get("prompt"))
            for input in request["inputs"]
        ]

        # Tokenize:
        input_tokens = self.tokenizer.encode_batch(prompts)

        # Create the parameters:
        params = og.GeneratorParams(self.model)
        params.set_search_options(**self.search_options)
        params.input_ids = input_tokens

        # Generate output tokens:
        output_tokens = self.model.generate(params)

        # Decode output tokens to text:
        response = [
            {"prediction": self.tokenizer.decode(output), "prompt": prompt}
            for (output, prompt) in zip(output_tokens, prompts)
        ]

        return response
```

During load, the code above downloads a model from the Hugging Face hub and creates a model object and a tokenizer.

During prediction, the code collects all prompts, tokenizes them, generates the response tokens, and decodes the output tokens to text.

If we save the code above to `src/onnx_genai_serving.py`, we can create a model serving function with the following code:

```python
import os
import mlrun

project = mlrun.get_or_create_project("genai-deployment", context="./", user_project=True)

genai_serving = project.set_function(
    "src/onnx_genai_serving.py",
    name="genai-serving",
    kind="serving",
    image="mlrun/mlrun",
    requirements=["huggingface_hub", "onnxruntime_genai"],
)

genai_serving.add_model(
    "mymodel",
    model_name="microsoft/Phi-3-mini-4k-instruct-onnx",
    model_path=os.path.join("cpu_and_mobile", "cpu-int4-rtn-block-32-acc-level-4"),
    class_name="OnnxGenaiModelServer",
)
```

The code loads a Phi-3 model. We use the CPU version here so it's easy to test and run, but you can just as easily provide a GPU-based model.

We can test the model with the following code:

```python
mock_server = genai_serving.to_mock_server()

result = mock_server.test(
    "/v2/models/mymodel",
    body={"inputs": [{"prompt": "What is 1+1?"}]},
)
print(f"Output: {result['outputs']}")
```

A typical output would be:
```
Output: [{'prediction': '\nWhat is 1+1? \n1+1 equals 2. This is a basic arithmetic addition problem where you add one unit to another unit.', 'prompt': '<|user|>\nWhat is 1+1? <|end|>\n<|assistant|>'}]
```

To deploy the model, run:
```python
project.deploy_function(genai_serving)
```

This builds a Docker image with the required dependencies and deploys a Nuclio function.

To test the deployed model, use the HTTP trigger as follows:
```python
genai_serving.invoke(
    "/v2/models/mymodel",
    body={"inputs": [{"prompt": "What is 1+1?"}]},
)
```
109 changes: 109 additions & 0 deletions docs/genai/deployment/genai_serving_graph.md
@@ -0,0 +1,109 @@
(genai-serving-graph)=
# GenAI Realtime Serving Graph

During inference, it is common to serve a GenAI model as part of a larger pipeline that includes data preprocessing, model execution, and post-processing. This can be done with MLRun using the real-time serving pipeline feature. Prior to model inference, the context is typically enriched using a vector database, then the input is transformed to input tokens, and finally the model is executed. Pre-processing and post-processing may also include guardrails that ensure the input is valid (for example, preventing the user from asking questions that attempt to exploit the model), as well as output processing that verifies the model does not hallucinate or include data that may not be shared.

## A basic graph

To run a model as part of a larger pipeline, you can use the `set_topology` method of the serving function. The following code shows how to set up a simple pipeline that includes a single step. This example is taken from the [Interactive bot demo using LLMs and MLRun](https://github.com/mlrun/demo-llm-bot), which calls the OpenAI ChatGPT model:

```python
class QueryLLM:
    def __init__(self):
        config = AppConfig()
        self.agent = build_agent(config=config)

    def do(self, event):
        try:
            agent_resp = self.agent(
                {
                    "input": event.body["question"],
                    "chat_history": messages_from_dict(event.body["chat_history"]),
                }
            )
            event.body["output"] = parse_agent_output(agent_resp=agent_resp)
        except ValueError as e:
            response = str(e)
            if not response.startswith("Could not parse LLM output: `"):
                raise e
            event.body["output"] = response.removeprefix(
                "Could not parse LLM output: `"
            ).removesuffix("`")
        return event
```

Save the code above to `src/serve_llm.py`, then create the serving function by running the following code:

```python
serving_fn = project.set_function(
    name="serve-llm",
    func="src/serve_llm.py",
    kind="serving",
    image=image,
)
graph = serving_fn.set_topology("flow", engine="async")
graph.add_step(
    name="llm",
    class_name="src.serve_llm.QueryLLM",
    full_event=True,
).respond()
```

We can now use a similar approach to add more steps to the pipeline.

## Setting up a Multi-step Inference Pipeline

The following code shows how to set up a multi-step inference pipeline using MLRun. This code is available in the [MLRun fine-tuning demo](https://github.com/mlrun/demo-llm-tuning):

```python
# Set the topology and get the graph object:
graph = serving_function.set_topology("flow", engine="async")

# Add the steps:
graph.to(handler="preprocess", name="preprocess") \
    .to(
        "LLMModelServer",
        name="infer",
        model_args={
            "load_in_8bit": True,
            "device_map": "cuda:0",
            "trust_remote_code": True,
        },
        tokenizer_name="tiiuae/falcon-7b",
        model_name="tiiuae/falcon-7b",
        peft_model=project.get_artifact_uri("falcon-7b-mlrun"),
    ) \
    .to(handler="postprocess", name="postprocess") \
    .to(
        "ToxicityClassifierModelServer",
        name="toxicity-classifier",
        threshold=0.7,
    ).respond()
```

This flow is illustrated as follows:

```{mermaid}
flowchart LR
A([start]) --> B(preprocess)
B --> C(infer)
C --> D(postprocess)
D --> E(toxicity-classifier)
```

Generally, each step can be a Python function, a serving class, or a class that implements the `do` method. In this case, `LLMModelServer` and `ToxicityClassifierModelServer` are serving classes, while `preprocess` and `postprocess` are Python functions.
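
For illustration, a minimal sketch of these two step flavors (the names and logic here are hypothetical, not taken from the demo):

```python
# A plain Python handler step: receives the event body and returns it.
def preprocess(event: dict) -> dict:
    event["prompt"] = event["prompt"].strip()
    return event


# A class-based step: the graph calls its `do` method for each event.
class Postprocess:
    def do(self, event: dict) -> dict:
        event["output"] = event.get("prediction", "").strip()
        return event
```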

```{admonition} Note
Unlike the {ref}`GenAI serving class<genai-serving>` example, which showed the simple case of deploying a single model, a real-time serving pipeline supports the more realistic scenario of an end-to-end inference pipeline that can retrieve data, run multiple models, and filter data or results.
```

Once you have the serving pipeline, it behaves just like any other serving function: use `serving_function.to_mock_server()` to test the pipeline locally and `project.deploy_function(serving_function)` to deploy it.
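
For example, a minimal sketch, assuming the `serving_function` and request body from the example above:

```python
# Test the full pipeline locally with a mock server (no deployment needed):
mock_server = serving_function.to_mock_server()
result = mock_server.test("/predict", body={"prompt": "What is MLRun?"})
print(result)

# Deploy the pipeline as a real-time serving function:
project.deploy_function(serving_function)
```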

An example of calling the pipeline:

```python
generate_kwargs = {"max_length": 150, "temperature": 0.9, "top_p": 0.5, "top_k": 25, "repetition_penalty": 1.0}
response = serving_function.invoke(path='/predict', body={"prompt": "What is MLRun?", **generate_kwargs})
print(response["outputs"])
```

## Distributed pipelines

By default, all steps of the serving graph run on the same pod, in sequence. You can run different steps on different pods using {ref}`distributed pipelines<distributed-graph>`, which typically run the CPU-bound steps on one pod and the GPU-bound steps on another.
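
As a hedged sketch, reusing the multi-step example above (the child-function name, file path, and image are illustrative), the GPU-bound inference step can be routed to a separate function:

```python
# Register an additional (child) function that hosts the GPU-bound step:
serving_function.add_child_function(
    "gpu-infer", "./src/serve_llm.py", image="mlrun/mlrun-gpu"
)

graph = serving_function.set_topology("flow", engine="async")

# CPU-bound steps run on the default function pod; the "infer" step is routed
# to the "gpu-infer" child function via the `function` argument:
graph.to(handler="preprocess", name="preprocess") \
    .to("LLMModelServer", name="infer", function="gpu-infer") \
    .to(handler="postprocess", name="postprocess").respond()
```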
42 changes: 42 additions & 0 deletions docs/genai/deployment/gpu_utilization.md
@@ -0,0 +1,42 @@
(gpu-utilization)=
# GPU Utilization

GenAI models require GPUs to run, and because they are usually large, they also require a lot of memory. GPU memory is limited and can be a bottleneck when running large models. This section discusses techniques to improve GPU utilization during inference and how to optimize it. The list covers some important considerations, but it is not exhaustive.

## Optimization Techniques

### Reduce model size

There are various ways to reduce the model size, starting by choosing a smaller model. For example, there are cases where a model with 7 billion parameters may be sufficient for a given task, while a model with 70 billion parameters may not provide a significant improvement in performance.

MLRun provides the ability to use any model and automate the pipeline. This gives you the ability to test different models and see which one works best for your use case.

A common technique to reduce the model size is quantization. Quantization reduces the precision of the weights and activations of the model, which can lead to a significant reduction in memory usage and a speedup in inference time. The most common quantization is 8-bit quantization, which reduces the precision from 32-bit floating point to 8-bit integers. This can lead to a 4x reduction in memory usage and a significant improvement in inference time.

In some cases, quantization can lead to a significant reduction in accuracy, so it is important to test the quantized model on a validation set to ensure that accuracy is not severely impacted.

MLRun can automate the quantization step, helping you quickly test different quantization settings and ensuring that quantization runs automatically in your CI/CD pipeline.
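
As a hedged illustration (not MLRun-specific, and the model name is only an example), loading a Hugging Face model with 8-bit weights typically looks like this; it assumes the `bitsandbytes` package is installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b"  # example model

# Load the weights in 8-bit precision to cut GPU memory roughly 4x compared to fp32:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```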

### Attention

In deep learning models, attention mechanisms are used to focus on different parts of the input sequence. Attention can be computationally expensive and can become a bottleneck when running large models. One way to improve GPU utilization is to use [FlashAttention](https://github.com/Dao-AILab/flash-attention), a more efficient attention implementation that yields a significant speedup and memory reduction. Standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. This translates to a 10x memory saving at a sequence length of 2K, and 20x at 4K, so FlashAttention can scale to much longer sequence lengths. FlashAttention-2 offers faster attention with better parallelism and work partitioning.
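
For example, a hedged sketch of opting into FlashAttention-2 when loading a model with Hugging Face `transformers` (requires the `flash-attn` package and a supported GPU; the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

# Request the FlashAttention-2 kernels instead of the default attention implementation:
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # FlashAttention-2 requires fp16/bf16 weights
    device_map="cuda:0",
)
```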

## Inference Optimization

### Batch Size

Batch size is an important parameter that has a significant impact on GPU utilization. Increasing the batch size improves GPU utilization and overall inference speed, but it also increases latency. For LLMs, static batching is less effective than dynamic batching, because not all inputs produce their completion tokens at the same time, so the longest request holds up the rest of the batch. The big improvement here comes not just from better GPU utilization but from increased throughput.

### GPU allocation

When running multiple models, it is important to allocate GPUs dynamically, on demand. MLRun uses Nuclio for serverless functions, which can free up the GPU when the function is not running or when it scales down, leading to better GPU utilization.
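
For example, a minimal sketch of requesting a GPU for the serving function from the {ref}`GenAI serving <genai-serving>` example (the replica values are illustrative):

```python
# Request one GPU for the serving function; Nuclio schedules it on a GPU node
# and the GPU is released when the function scales down:
genai_serving.with_limits(gpus=1)

# Control scaling so idle replicas do not hold on to GPUs:
genai_serving.spec.min_replicas = 1
genai_serving.spec.max_replicas = 2
```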

### Using CPUs

Some GenAI-related tasks are better suited to CPUs, such as data preprocessing, loading the model, and processing the outputs. By offloading these tasks to CPUs, you free up the GPU for running the model, which leads to better GPU utilization. Rather than running the entire pipeline on the GPU, run the CPU-bound tasks on CPUs and the model on the GPU. This usually means the inference pipeline runs on different nodes, and MLRun can automatically distribute the pipeline across them.
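
As a hedged sketch (the node labels and the `preprocess_fn` function are hypothetical), you can pin the CPU-bound functions and the GPU-bound model server to different node pools:

```python
# Keep the pre/post-processing function on CPU nodes:
preprocess_fn.with_node_selection(node_selector={"node-type": "cpu"})

# Run the model server on GPU nodes and give it a GPU:
genai_serving.with_node_selection(node_selector={"node-type": "gpu"})
genai_serving.with_limits(gpus=1)
```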


### Multiple GPUs

When multiple GPUs are available, you can use multiple workers to run the model in parallel, which improves GPU utilization and speeds up inference. Orchestrating multiple GPUs typically requires significant engineering effort. MLRun can run multiple workers in parallel and automatically distributes the function code across the GPUs; from the user's point of view, it is as simple as setting the number of workers to run in parallel.
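
As a hedged sketch (the values are illustrative), this can look like requesting multiple GPUs and raising the number of parallel workers on the serving function:

```python
# Request two GPUs and serve requests with multiple parallel workers:
genai_serving.with_limits(gpus=2)
genai_serving.with_http(workers=2)

project.deploy_function(genai_serving)
```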

1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -15,3 +15,4 @@ sphinx-version-warning~=1.1
# https://stackoverflow.com/questions/72441758/typeerror-descriptors-cannot-not-be-created-directly
# which is not generating the API by module pages (using PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python does not work)
protobuf~=3.20.3
sphinxcontrib-mermaid~=0.9.2
1 change: 1 addition & 0 deletions docs/serving/distributed-graph.ipynb
@@ -4,6 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"(distributed-graph)=\n",
"# Distributed (multi-function) pipeline example\n",
"\n",
"This example demonstrates how to run a pipeline that consists of multiple serverless functions (connected using streams).\n",
