Improve docs #86

Open · wants to merge 3 commits into main
26 changes: 13 additions & 13 deletions Conceptual_Guide/Part_2-improving_resource_utilization/README.md
@@ -39,12 +39,12 @@ Part-1 of this series introduced the mechanisms to set up a Triton Inference Ser
Dynamic batching, in the context of the Triton Inference Server, refers to the functionality that combines one or more inference requests into a single batch (which is created dynamically) to maximize throughput.

Dynamic batching can be enabled and configured on a per-model basis via settings in the model's `config.pbtxt`. Dynamic batching can be enabled with its default settings by adding the following to the `config.pbtxt` file:
-```
+```text proto
dynamic_batching { }
```
While Triton batches incoming requests without any delay by default, users can choose to allocate a limited delay so the scheduler can collect more inference requests for the dynamic batcher.

-```
+```text proto
dynamic_batching {
max_queue_delay_microseconds: 100
}
@@ -65,7 +65,7 @@ As observed from the above, the use of Dynamic Batching can lead to improvements

The Triton Inference Server can spin up multiple instances of the same model, which can process queries in parallel. Triton can spawn instances on the same device (GPU) or on a different device on the same node, as per the user's specifications. This customizability is especially useful when considering ensembles that contain models with different throughputs. Multiple copies of heavier models can be spawned on a separate GPU to allow for more parallel processing. This is enabled via the `instance_group` option in a model's configuration.

-```
+```text proto
instance_group [
{
count: 2
@@ -90,13 +90,13 @@ This section showcases the use of dynamic batching and concurrent model executio
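For reference, a complete `instance_group` entry also specifies the device kind and, optionally, which GPUs the instances are placed on. The block below is only a sketch with illustrative values (two instances pinned to GPU 0), not a setting prescribed by this guide:

```text proto
instance_group [
  {
    # number of execution instances of this model
    count: 2
    # place the instances on a GPU
    kind: KIND_GPU
    # device index (illustrative)
    gpus: [ 0 ]
  }
]
```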
### Getting access to the model

Let's use the `text recognition` model used in Part 1. We do need to make a minor change to the model, namely making its 0th axis a dynamic shape so that batching is possible. First, download the Text Recognition model weights. Use the NGC PyTorch container as the environment for the following.
-```
+```bash
docker run -it --gpus all -v ${PWD}:/scratch nvcr.io/nvidia/pytorch:<yy.mm>-py3
cd /scratch
wget https://www.dropbox.com/sh/j3xmli4di1zuv3s/AABzCC1KGbIRe2wRwa3diWKwa/None-ResNet-None-CTC.pth
```
Export the model as `.onnx` using the file in the `utils` folder. This file is adapted from [Baek et al. 2019](https://github.com/clovaai/deep-text-recognition-benchmark).
-```
+```python
import torch
from utils.model import STRModel

@@ -116,7 +116,7 @@ torch.onnx.export(model, trace_input, "str.onnx", verbose=True, dynamic_axes={'i
### Launching the server

As discussed in `Part 1`, a model repository is a filesystem-based repository of models and configuration schemas used by the Triton Inference Server (refer to `Part 1` for a more detailed explanation of model repositories). For this example, the model repository structure would need to be set up in the following manner:
-```
+```text
model_repository
|
|-- text_recognition
@@ -128,7 +128,7 @@ model_repository
```
This repository is a subset of the previous example. The key difference in this setup is the use of `instance_group`(s) and `dynamic_batching` in the model configuration. The additions are as follows:

-```
+```text proto
instance_group [
{
count: 2
@@ -142,7 +142,7 @@ With `instance_group` users can primarily tweak two things. First, the number of
Adding `dynamic_batching {}` will enable the use of dynamic batches. Users can also add `preferred_batch_size` and `max_queue_delay_microseconds` in the body of `dynamic_batching` to manage more efficient batching for their use case, as sketched below. Explore the [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#model-configuration) documentation for more information.
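The snippet below is a sketch of such a configuration; the preferred batch sizes and queue delay are arbitrary illustrative values, not recommendations from this guide:

```text proto
dynamic_batching {
  # try to form batches of these sizes when enough requests are queued
  preferred_batch_size: [ 4, 8 ]
  # wait up to 100 microseconds to collect additional requests before batching
  max_queue_delay_microseconds: 100
}
```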

With the model repository set up, the Triton Inference Server can be launched.
-```
+```bash
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash

tritonserver --model-repository=/models
@@ -151,19 +151,19 @@ tritonserver --model-repository=/models
### Measuring Performance

Having made some improvements to the model's serving capabilities by enabling `dynamic batching` and the use of `multiple model instances`, the next step is to measure the impact of these features. To that end, the Triton Inference Server comes packaged with the [Performance Analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md), a tool specifically designed to measure performance for Triton Inference Servers. For ease of use, it is recommended that users run it inside the same container used to run the client code in Part 1 of this series.
-```
+```bash
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash
```
In a third terminal, it is advisable to monitor GPU utilization to see if the deployment is saturating GPU resources.
-```
+```bash
watch -n0.1 nvidia-smi
```

To measure the performance gain, let's run performance analyzer on the following configurations:

* **No Dynamic Batching, single model instance**: This configuration will be the baseline measurement. To set up the Triton Server in this configuration, do not add `instance_group` or `dynamic_batching` in `config.pbtxt` and make sure to include `--gpus=1` in the `docker run` command to set up the server.

-```
+```bash
# perf_analyzer -m <model name> -b <batch size> --shape <input layer>:<input shape> --concurrency-range <lower number of request>:<higher number of request>:<step>

# Query
@@ -198,7 +198,7 @@ Request concurrency: 16
```

* **Just Dynamic Batching**: To set up the Triton Server in this configuration, add `dynamic_batching` in `config.pbtxt`.
-```
+```bash
# Query
perf_analyzer -m text_recognition -b 2 --shape input.1:1,32,100 --concurrency-range 2:16:2 --percentile=95

@@ -233,7 +233,7 @@ As each of the requests had a batch size (of 2), while the maximum batch size of

* **Dynamic Batching with multiple model instances**: To set up the Triton Server in this configuration, add `instance_group` in `config.pbtxt` and make sure to include `--gpus=1` in the `docker run` command used to set up the server. Include `dynamic_batching` per the instructions in the previous section in the model configuration. A point to note is that peak GPU utilization shot up to 74% (on an A100 in this case) while using just a single model instance with dynamic batching. Adding one more instance will definitely improve performance, but linear performance scaling will not be achieved in this case.

-```
+```bash
# Query
perf_analyzer -m text_recognition -b 2 --shape input.1:1,32,100 --concurrency-range 2:16:2 --percentile=95

@@ -70,7 +70,7 @@ With Model Analyzer users can:

Refer to Part 2 of this series to get access to the models, and refer to the Model Analyzer [installation guide](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/install.md#recommended-installation-method) for more information about installing Model Analyzer. For ease of following along, use these commands to install Model Analyzer:

-```
+```bash
sudo apt-get update && sudo apt-get install python3-pip
sudo apt-get update && sudo apt-get install wkhtmltopdf
pip3 install triton-model-analyzer
@@ -106,13 +106,13 @@ Consider the deployment of the text recognition model with a latency budget of `

Note: The config file contains the shape of the query image. Refer to the launch mode [documentation](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/launch_modes.md) for more info about the launch mode flag.
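The contents of the config file are not shown in this diff. Assuming it only needs to forward the query image shape to Perf Analyzer, a minimal `perf.yaml` sketch could look like the following (the exact schema is defined in Model Analyzer's configuration documentation, so treat the field layout as an assumption):

```yaml
# Hypothetical sketch of perf.yaml: pass the text recognition model's input shape
# through to Perf Analyzer during profiling.
perf_analyzer_flags:
  shape:
    - "input.1:1,32,100"
```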

-```
+```bash
model-analyzer profile --model-repository /workspace/model_repository --profile-models text_recognition --triton-launch-mode=local --output-model-repository-path /workspace/output/ -f perf.yaml --override-output-model-repository --latency-budget 10 --run-config-search-mode quick
```

Once the sweeps are done, users can use `report` to summarize the top configurations.

-```
+```bash
model-analyzer report --report-model-configs text_recognition_config_4,text_recognition_config_5,text_recognition_config_6 --export-path /workspace --config-file perf.yaml
```

18 changes: 9 additions & 9 deletions Conceptual_Guide/Part_4-inference_acceleration/README.md
@@ -64,7 +64,7 @@ There are three routes for users to use to convert their models to TensorRT: the

That said, there are two main steps needed. First, convert the model to a TensorRT Engine. It is recommended to use the [TensorRT Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt) to run the command.

-```
+```bash
trtexec --onnx=model.onnx \
--saveEngine=model.plan \
--explicitBatch
@@ -95,7 +95,7 @@ There are three options to accelerate the ONNX runtime: with `TensorRT` and `CUD
In general, TensorRT will provide better optimizations than the CUDA execution provider; however, this depends on the exact structure of the model, or more precisely, on the operators used in the network being accelerated. If all the operators are supported, conversion to TensorRT will yield better performance. When `TensorRT` is selected as the accelerator, all supported subgraphs are accelerated by TensorRT and the rest of the graph runs on the CUDA execution provider. Users can achieve this with the following additions to the config file.

**TensorRT acceleration**
-```
+```text proto
optimization {
execution_accelerators {
gpu_execution_accelerator : [ {
@@ -112,7 +112,7 @@ There are a few other ONNX runtime specific optimizations. Refer to this section
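For orientation, a filled-in version of the TensorRT accelerator block shown above names the `tensorrt` provider and may set precision and workspace parameters. The snippet below is a sketch with illustrative values; see the ONNX Runtime backend documentation for the authoritative options:

```text proto
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      # run supported subgraphs in FP16 (illustrative choice)
      parameters { key: "precision_mode" value: "FP16" }
      # 1 GiB TensorRT workspace (illustrative value)
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    }]
  }
}
```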

## CPU Based Acceleration
The Triton Inference Server also supports acceleration for CPU-only models with [OpenVINO](https://docs.openvino.ai/latest/index.html). In the configuration file, users can add the following to enable CPU acceleration.
-```
+```text proto
optimization {
execution_accelerators {
cpu_execution_accelerator : [{
@@ -133,7 +133,7 @@ On the other end of the spectrum, Deep Learning practitioners are drawn to Large
## Working Example
Before proceeding, please set up a model repository for the Text Recognition model used in Parts 1-3 of this series. Then, navigate to the model repository and launch two containers:

-```
+```bash
# Server Container
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd):/workspace/ -v/$(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:22.11-py3 bash

@@ -150,15 +150,15 @@ While using ONNX RT there are some [general optimizations](https://github.com/tr

With this context, let's launch the Triton Inference Server with the appropriate configuration file.

-```
+```bash
tritonserver --model-repository=/models
```
**NOTE: These benchmarks are just to illustrate the general curve of the performance gain. This is not the highest throughput obtainable via Triton, as resource utilization features haven't been enabled (e.g. dynamic batching). Refer to the Model Analyzer tutorial for the best deployment configuration once model optimizations are done.**

**NOTE**: These settings are meant to maximize throughput. Refer to the Model Analyzer tutorial, which covers managing latency requirements.

For reference, the baseline performance is as follows:
-```
+```text
Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 4191.7 infer/sec, latency 7633 usec
```
@@ -167,7 +167,7 @@ Concurrency: 2, throughput: 4191.7 infer/sec, latency 7633 usec

For this model, an exhaustive search for the best convolution algorithm is enabled. [Learn about more options](https://github.com/triton-inference-server/onnxruntime_backend#onnx-runtime-with-cuda-execution-provider-optimization).

-```
+```text proto
## Additions to Config
parameters { key: "cudnn_conv_algo_search" value: { string_value: "0" } }
parameters { key: "gpu_mem_limit" value: { string_value: "4294967200" } }
@@ -182,7 +182,7 @@ Concurrency: 2, throughput: 4257.9 infer/sec, latency 7672 usec
### ONNX RT execution on GPU w. TRT acceleration
When the TensorRT execution provider is specified, the CUDA execution provider is used as a fallback for operators not supported by TensorRT. It is recommended to use TensorRT natively if all operators are supported, as the performance boost and optimization options are considerably better. In this case, the TensorRT accelerator has been used with the lower `FP16` precision.

-```
+```text proto
## Additions to Config
optimization {
graph : {
@@ -208,7 +208,7 @@ Concurrency: 2, throughput: 11820.2 infer/sec, latency 2706 usec

Triton users can also use OpenVINO for CPU deployment. This can be enabled via the following:

-```
+```text proto
optimization { execution_accelerators {
cpu_execution_accelerator : [ {
name : "openvino"
2 changes: 1 addition & 1 deletion Conceptual_Guide/Part_5-Model_Ensembles/README.md
@@ -356,7 +356,7 @@ print(output_data)
```

Now, run the full inference pipeline by executing the following command:
-```
+```bash
python client.py
```
You should see the parsed text printed out to your console.
12 changes: 6 additions & 6 deletions Conceptual_Guide/Part_6-building_complex_pipelines/README.md
@@ -47,7 +47,7 @@ In this example, the models are being run on:
* Python Backend

Both of the models deployed on framework backends can be triggered using the following API:
-```
+```python
encoding_request = pb_utils.InferenceRequest(
model_name="text_encoder",
requested_output_names=["last_hidden_state"],
@@ -66,13 +66,13 @@ Before starting, clone this repository and navigate to the root folder. Use thre

### Step 1: Prepare the Server Environment
* First, run the Triton Inference Server Container.
-```
+```bash
# Replace yy.mm with year and month of release. Eg. 22.08
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash
```
* Next, install all the dependencies required by the models running in the Python backend, and log in with your [HuggingFace token](https://huggingface.co/settings/tokens) (an account on [HuggingFace](https://huggingface.co/) is required).

-```
+```bash
# PyTorch & Transformers Lib
pip install torch torchvision torchaudio
pip install transformers ftfy scipy accelerate
@@ -84,7 +84,7 @@ huggingface-cli login
### Step 2: Exporting and converting the models
Use the NGC PyTorch container to export and convert the models.

-```
+```bash
docker run -it --gpus all -p 8888:8888 -v ${PWD}:/mount nvcr.io/nvidia/pytorch:yy.mm-py3

pip install transformers ftfy scipy
@@ -106,13 +106,13 @@ mv encoder.onnx model_repository/text_encoder/1/model.onnx

### Step 3: Launch the Server
From the server container, launch the Triton Inference Server.
-```
+```bash
tritonserver --model-repository=/models
```

### Step 4: Run the client
Use the client container and run the client.
-```
+```bash
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash

# Client with no GUI