Improve docs #86

Open · wants to merge 3 commits into main
26 changes: 13 additions & 13 deletions Conceptual_Guide/Part_2-improving_resource_utilization/README.md
@@ -39,12 +39,12 @@ Part-1 of this series introduced the mechanisms to set up a Triton Inference Ser
Dynamic batching, in the context of the Triton Inference Server, refers to the functionality that combines one or more inference requests into a single batch (which is created dynamically) to maximize throughput.

Dynamic batching can be enabled and configured on a per-model basis via settings in the model's `config.pbtxt`. Dynamic batching can be enabled with its default settings by adding the following to the `config.pbtxt` file:
-```
+```text proto
dynamic_batching { }
```
While Triton batches incoming requests without any delay by default, users can choose to allocate a limited delay so the scheduler can collect more inference requests for the dynamic batcher.

-```
+```text proto
dynamic_batching {
max_queue_delay_microseconds: 100
}
@@ -65,7 +65,7 @@ As observed from the above, the use of Dynamic Batching can lead to improvements

The Triton Inference Server can spin up multiple instances of the same model, which can process queries in parallel. Triton can spawn instances on the same device (GPU) or on a different device on the same node, as per the user's specifications. This customizability is especially useful when considering ensembles that contain models with different throughputs. Multiple copies of heavier models can be spawned on a separate GPU to allow for more parallel processing. This is enabled via the `instance_group` option in a model's configuration.

-```
+```text proto
instance_group [
{
count: 2
@@ -90,13 +90,13 @@ This section showcases the use of dynamic batching and concurrent model executio
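For reference, a complete `instance_group` entry also specifies the device kind and, optionally, which GPUs the instances are placed on. The block below is only a sketch with illustrative values (two instances pinned to GPU 0), not a setting prescribed by this guide:

```text proto
instance_group [
  {
    # number of execution instances of this model
    count: 2
    # place the instances on a GPU
    kind: KIND_GPU
    # device index (illustrative)
    gpus: [ 0 ]
  }
]
```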
### Getting access to the model

Let's use the `text recognition` model used in Part 1. We do need to make a minor change to the model, namely making its 0th axis a dynamic shape so that batching is possible. First, download the Text Recognition model weights. Use the NGC PyTorch container as the environment for the following.
-```
+```bash
docker run -it --gpus all -v ${PWD}:/scratch nvcr.io/nvidia/pytorch:<yy.mm>-py3
cd /scratch
wget https://www.dropbox.com/sh/j3xmli4di1zuv3s/AABzCC1KGbIRe2wRwa3diWKwa/None-ResNet-None-CTC.pth
```
Export the model as `.onnx` using the file in the `utils` folder. This file is adapted from [Baek et al. 2019](https://github.com/clovaai/deep-text-recognition-benchmark).
-```
+```python
import torch
from utils.model import STRModel

@@ -116,7 +116,7 @@ torch.onnx.export(model, trace_input, "str.onnx", verbose=True, dynamic_axes={'i
### Launching the server

As discussed in `Part 1`, a model repository is a filesystem-based repository of models and configuration schemas used by the Triton Inference Server (refer to `Part 1` for a more detailed explanation of model repositories). For this example, the model repository structure would need to be set up in the following manner:
-```
+```text
model_repository
|
|-- text_recognition
@@ -128,7 +128,7 @@ model_repository
```
This repository is a subset of the previous example. The key difference in this setup is the use of `instance_group`(s) and `dynamic_batching` in the model configuration. The additions are as follows:

-```
+```text proto
instance_group [
{
count: 2
@@ -142,7 +142,7 @@ With `instance_group` users can primarily tweak two things. First, the number of
Adding `dynamic_batching {}` will enable the use of dynamic batches. Users can also add `preferred_batch_size` and `max_queue_delay_microseconds` in the body of `dynamic_batching` to manage more efficient batching for their use case, as sketched below. Explore the [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#model-configuration) documentation for more information.
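The snippet below is a sketch of such a configuration; the preferred batch sizes and queue delay are arbitrary illustrative values, not recommendations from this guide:

```text proto
dynamic_batching {
  # try to form batches of these sizes when enough requests are queued
  preferred_batch_size: [ 4, 8 ]
  # wait up to 100 microseconds to collect additional requests before batching
  max_queue_delay_microseconds: 100
}
```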

With the model repository set up, the Triton Inference Server can be launched.
-```
+```bash
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash

tritonserver --model-repository=/models
@@ -151,19 +151,19 @@ tritonserver --model-repository=/models
### Measuring Performance

Having made some improvements to the model's serving capabilities by enabling `dynamic batching` and the use of `multiple model instances`, the next step is to measure the impact of these features. To that end, the Triton Inference Server comes packaged with the [Performance Analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md), a tool specifically designed to measure performance for Triton Inference Servers. For ease of use, it is recommended that users run it inside the same container used to run the client code in Part 1 of this series.
-```
+```bash
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash
```
In a third terminal, it is advisable to monitor GPU utilization to see if the deployment is saturating GPU resources.
-```
+```bash
watch -n0.1 nvidia-smi
```

To measure the performance gain, let's run performance analyzer on the following configurations:

* **No Dynamic Batching, single model instance**: This configuration will be the baseline measurement. To set up the Triton Server in this configuration, do not add `instance_group` or `dynamic_batching` in `config.pbtxt` and make sure to include `--gpus=1` in the `docker run` command to set up the server.

-```
+```bash
# perf_analyzer -m <model name> -b <batch size> --shape <input layer>:<input shape> --concurrency-range <lower number of request>:<higher number of request>:<step>

# Query
@@ -198,7 +198,7 @@ Request concurrency: 16
```

* **Just Dynamic Batching**: To set up the Triton Server in this configuration, add `dynamic_batching` in `config.pbtxt`.
-```
+```bash
# Query
perf_analyzer -m text_recognition -b 2 --shape input.1:1,32,100 --concurrency-range 2:16:2 --percentile=95

@@ -233,7 +233,7 @@ As each of the requests had a batch size (of 2), while the maximum batch size of

* **Dynamic Batching with multiple model instances**: To set up the Triton Server in this configuration, add `instance_group` in `config.pbtxt` and make sure to include `--gpus=1` in the `docker run` command used to set up the server. Include `dynamic_batching` per the instructions in the previous section in the model configuration. A point to note is that peak GPU utilization shot up to 74% (on an A100 in this case) while using just a single model instance with dynamic batching. Adding one more instance will definitely improve performance, but linear performance scaling will not be achieved in this case.

-```
+```bash
# Query
perf_analyzer -m text_recognition -b 2 --shape input.1:1,32,100 --concurrency-range 2:16:2 --percentile=95

@@ -70,7 +70,7 @@ With Model Analyzer users can:

Refer to Part 2 of this series to get access to the models, and refer to the Model Analyzer [installation guide](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/install.md#recommended-installation-method) for more information about installing Model Analyzer. For ease of following along, use these commands to install Model Analyzer:

-```
+```bash
sudo apt-get update && sudo apt-get install python3-pip
sudo apt-get update && sudo apt-get install wkhtmltopdf
pip3 install triton-model-analyzer
@@ -106,13 +106,13 @@ Consider the deployment of the text recognition model with a latency budget of `

Note: The config file contains the shape of the query image. Refer to the launch mode [documentation](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/launch_modes.md) for more info about the launch mode flag.
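The contents of the config file are not shown in this diff. Assuming it only needs to forward the query image shape to Perf Analyzer, a minimal `perf.yaml` sketch could look like the following (the exact schema is defined in Model Analyzer's configuration documentation, so treat the field layout as an assumption):

```yaml
# Hypothetical sketch of perf.yaml: pass the text recognition model's input shape
# through to Perf Analyzer during profiling.
perf_analyzer_flags:
  shape:
    - "input.1:1,32,100"
```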

-```
+```bash
model-analyzer profile --model-repository /workspace/model_repository --profile-models text_recognition --triton-launch-mode=local --output-model-repository-path /workspace/output/ -f perf.yaml --override-output-model-repository --latency-budget 10 --run-config-search-mode quick
```

Once the sweeps are done, users can use `report` to summarize the top configurations.

-```
+```bash
model-analyzer report --report-model-configs text_recognition_config_4,text_recognition_config_5,text_recognition_config_6 --export-path /workspace --config-file perf.yaml
```

18 changes: 9 additions & 9 deletions Conceptual_Guide/Part_4-inference_acceleration/README.md
@@ -64,7 +64,7 @@ There are three routes for users to use to convert their models to TensorRT: the

That said, there are two main steps needed. First, convert the model to a TensorRT Engine. It is recommended to use the [TensorRT Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt) to run the command.

-```
+```bash
trtexec --onnx=model.onnx \
--saveEngine=model.plan \
--explicitBatch
@@ -95,7 +95,7 @@ There are three options to accelerate the ONNX runtime: with `TensorRT` and `CUD
In general, TensorRT will provide better optimizations than the CUDA execution provider; however, this depends on the exact structure of the model, or more precisely, on the operators used in the network being accelerated. If all the operators are supported, conversion to TensorRT will yield better performance. When `TensorRT` is selected as the accelerator, all supported subgraphs are accelerated by TensorRT and the rest of the graph runs on the CUDA execution provider. Users can achieve this with the following additions to the config file.

**TensorRT acceleration**
-```
+```text proto
optimization {
execution_accelerators {
gpu_execution_accelerator : [ {
@@ -112,7 +112,7 @@ There are a few other ONNX runtime specific optimizations. Refer to this section
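For orientation, a filled-in version of the TensorRT accelerator block shown above names the `tensorrt` provider and may set precision and workspace parameters. The snippet below is a sketch with illustrative values; see the ONNX Runtime backend documentation for the authoritative options:

```text proto
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      # run supported subgraphs in FP16 (illustrative choice)
      parameters { key: "precision_mode" value: "FP16" }
      # 1 GiB TensorRT workspace (illustrative value)
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    }]
  }
}
```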

## CPU Based Acceleration
The Triton Inference Server also supports acceleration for CPU-only models with [OpenVINO](https://docs.openvino.ai/latest/index.html). In the configuration file, users can add the following to enable CPU acceleration.
-```
+```text proto
optimization {
execution_accelerators {
cpu_execution_accelerator : [{
@@ -133,7 +133,7 @@ On the other end of the spectrum, Deep Learning practitioners are drawn to Large
## Working Example
Before proceeding, please set up a model repository for the Text Recognition model used in Parts 1-3 of this series. Then, navigate to the model repository and launch two containers:

-```
+```bash
# Server Container
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd):/workspace/ -v/$(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:22.11-py3 bash

@@ -150,15 +150,15 @@ While using ONNX RT there are some [general optimizations](https://github.com/tr

With this context, let's launch the Triton Inference Server with the appropriate configuration file.

-```
+```bash
tritonserver --model-repository=/models
```
**NOTE: These benchmarks are just to illustrate the general curve of the performance gain. This is not the highest throughput obtainable via Triton, as resource utilization features haven't been enabled (e.g. dynamic batching). Refer to the Model Analyzer tutorial for the best deployment configuration once model optimizations are done.**

**NOTE**: These settings are meant to maximize throughput. Refer to the Model Analyzer tutorial, which covers managing latency requirements.

For reference, the baseline performance is as follows:
-```
+```text
Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 4191.7 infer/sec, latency 7633 usec
```
@@ -167,7 +167,7 @@ Concurrency: 2, throughput: 4191.7 infer/sec, latency 7633 usec

For this model, an exhaustive search for the best convolution algorithm is enabled. [Learn about more options](https://github.com/triton-inference-server/onnxruntime_backend#onnx-runtime-with-cuda-execution-provider-optimization).

-```
+```text proto
## Additions to Config
parameters { key: "cudnn_conv_algo_search" value: { string_value: "0" } }
parameters { key: "gpu_mem_limit" value: { string_value: "4294967200" } }
@@ -182,7 +182,7 @@ Concurrency: 2, throughput: 4257.9 infer/sec, latency 7672 usec
### ONNX RT execution on GPU w. TRT acceleration
When the TensorRT execution provider is specified, the CUDA execution provider is used as a fallback for operators not supported by TensorRT. It is recommended to use TensorRT natively if all operators are supported, as the performance boost and optimization options are considerably better. In this case, the TensorRT accelerator has been used with the lower `FP16` precision.

-```
+```text proto
## Additions to Config
optimization {
graph : {
@@ -208,7 +208,7 @@ Concurrency: 2, throughput: 11820.2 infer/sec, latency 2706 usec

Triton users can also use OpenVINO for CPU deployment. This can be enabled via the following:

-```
+```text proto
optimization { execution_accelerators {
cpu_execution_accelerator : [ {
name : "openvino"
2 changes: 1 addition & 1 deletion Conceptual_Guide/Part_5-Model_Ensembles/README.md
@@ -356,7 +356,7 @@ print(output_data)
```

Now, run the full inference pipeline by executing the following command:
-```
+```bash
python client.py
```
You should see the parsed text printed out to your console.
12 changes: 6 additions & 6 deletions Conceptual_Guide/Part_6-building_complex_pipelines/README.md
@@ -47,7 +47,7 @@ In this example, the models are being run on:
* Python Backend

Both of the models deployed on framework backends can be triggered using the following API:
-```
+```python
encoding_request = pb_utils.InferenceRequest(
model_name="text_encoder",
requested_output_names=["last_hidden_state"],
@@ -66,13 +66,13 @@ Before starting, clone this repository and navigate to the root folder. Use thre

### Step 1: Prepare the Server Environment
* First, run the Triton Inference Server Container.
-```
+```bash
# Replace yy.mm with year and month of release. Eg. 22.08
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash
```
* Next, install all the dependencies required by the models running in the Python backend, and log in with your [HuggingFace token](https://huggingface.co/settings/tokens) (an account on [HuggingFace](https://huggingface.co/) is required).

-```
+```bash
# PyTorch & Transformers Lib
pip install torch torchvision torchaudio
pip install transformers ftfy scipy accelerate
@@ -84,7 +84,7 @@ huggingface-cli login
### Step 2: Exporting and converting the models
Use the NGC PyTorch container to export and convert the models.

-```
+```bash
docker run -it --gpus all -p 8888:8888 -v ${PWD}:/mount nvcr.io/nvidia/pytorch:yy.mm-py3

pip install transformers ftfy scipy
@@ -106,13 +106,13 @@ mv encoder.onnx model_repository/text_encoder/1/model.onnx

### Step 3: Launch the Server
From the server container, launch the Triton Inference Server.
-```
+```bash
tritonserver --model-repository=/models
```

### Step 4: Run the client
Use the client container and run the client.
-```
+```bash
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash

# Client with no GUI