Fix/update Triton + ORT (#116)
* fix: pin ort and trt versions

* fix: bugs and warnings

* fix: api change

* fix: remove all TRT context managers: they now trigger a warning and are no longer needed, since memory is cleared when possible.

* fix: fix padding

* fix: update triton docker image

* fix: update minimal Python version

* fix: update docker versions

* fix: fix model path for model generation

* fix: code formatting

* fix: fix question answering output

Co-authored-by: ayoub-louati <[email protected]>
pommedeterresautee and ayoub-louati authored Aug 5, 2022
1 parent 43b1449 commit d99e08e
Showing 17 changed files with 127 additions and 152 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -220,7 +220,6 @@ cython_debug/
.idea/
TensorRT/
triton_models/
*.whl
.vscode
to_delete/
.history/
4 changes: 2 additions & 2 deletions Dockerfile
@@ -1,9 +1,9 @@
FROM nvcr.io/nvidia/tritonserver:22.05-py3
FROM nvcr.io/nvidia/tritonserver:22.07-py3

# see .dockerignore to check what is transferred
COPY . ./

RUN pip3 install -U pip && \
pip3 install nvidia-pyindex && \
pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu113/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com --no-cache-dir && \
pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu116/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com --no-cache-dir && \
pip3 install sentence-transformers notebook pytorch-quantization ipywidgets
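For anyone who prefers building this image locally rather than pulling it from ghcr.io, a minimal sketch would be the following (the local tag name is arbitrary):

```shell
# build the updated image from the repository root; BuildKit is optional but speeds up the build
DOCKER_BUILDKIT=1 docker build -t transformer-deploy:local .
```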
38 changes: 18 additions & 20 deletions README.md
@@ -1,8 +1,6 @@
# Hugging Face Transformer submillisecond inference️ and deployment to production: 🤗 → 🤯

[![Documentation](https://img.shields.io/website?label=documentation&style=for-the-badge&up_message=online&url=https%3A%2F%2Fels-rd.github.io%2Ftransformer-deploy%2F)](https://els-rd.github.io/transformer-deploy/) [![tests](https://img.shields.io/github/workflow/status/ELS-RD/transformer-deploy/tests/main?label=tests&style=for-the-badge)](https://github.com/ELS-RD/transformer-deploy/actions/workflows/python-app.yml) [![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg?style=for-the-badge)](https://www.python.org/downloads/release/python-360/) [![Twitter Follow](https://img.shields.io/twitter/follow/pommedeterre33?color=orange&style=for-the-badge)](https://twitter.com/pommedeterre33)

**WARNING**: Docker image of this project is version `0.4.0` which is now few months old. Next release will be done on June/July 2022 when some dependencies of this library will be updated.
[![Documentation](https://img.shields.io/website?label=documentation&style=for-the-badge&up_message=online&url=https%3A%2F%2Fels-rd.github.io%2Ftransformer-deploy%2F)](https://els-rd.github.io/transformer-deploy/) [![tests](https://img.shields.io/github/workflow/status/ELS-RD/transformer-deploy/tests/main?label=tests&style=for-the-badge)](https://github.com/ELS-RD/transformer-deploy/actions/workflows/python-app.yml) [![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg?style=for-the-badge)](https://www.python.org/downloads/release/python-380/) [![Twitter Follow](https://img.shields.io/twitter/follow/pommedeterre33?color=orange&style=for-the-badge)](https://twitter.com/pommedeterre33)

### Optimize and deploy in **production** 🤗 Hugging Face Transformer models in a single command line.

@@ -65,7 +63,7 @@ First, clone the repo as some commands below expect to find the `demo` folder:
git clone [email protected]:ELS-RD/transformer-deploy.git
cd transformer-deploy
# docker image may take a few minutes
docker pull ghcr.io/els-rd/transformer-deploy:0.4.0
docker pull ghcr.io/els-rd/transformer-deploy:0.5.0
```

### Classification/reranking (encoder model)
@@ -79,7 +77,7 @@ This will optimize models, generate Triton configuration and Triton folder layou

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
--backend tensorrt onnx \
@@ -109,7 +107,7 @@ For production, it's advised to build your own 3-line Docker image with `transfo

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers && tritonserver --model-repository=/models"

# output:
@@ -149,7 +147,7 @@ This will optimize models, generate Triton configuration and Triton folder layou

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m \"kamalkraj/bert-base-cased-ner-conll2003\" \
--backend tensorrt onnx \
@@ -180,8 +178,8 @@ For production, it's advised to build your own 3-line Docker image with `transfo

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
bash -c "pip install transformers torch==1.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \
tritonserver --model-repository=/models"

# output:
@@ -214,7 +212,7 @@ This will optimize models, generate Triton configuration and Triton folder layou

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m \"distilbert-base-cased-distilled-squad\" \
--backend tensorrt onnx \
@@ -245,8 +243,8 @@ For production, it's advised to build your own 3-line Docker image with `transfo

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 1024m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
bash -c "pip install transformers torch==1.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \
tritonserver --model-repository=/models"

# output:
@@ -282,7 +280,7 @@ This project supports models from [sentence-transformers](https://github.com/UKP

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m \"sentence-transformers/msmarco-distilbert-cos-v5\" \
--backend tensorrt onnx \
@@ -305,7 +303,7 @@ docker run -it --rm --gpus all \

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers && tritonserver --model-repository=/models"

# output:
@@ -343,7 +341,7 @@ One point to have in mind is that Triton run:

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m gpt2 \
--backend tensorrt onnx \
@@ -373,7 +371,7 @@ To optimize models which typically don't fit twice onto a single GPU, run the sc

```shell
docker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.1 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m gpt2-medium \
--backend tensorrt onnx \
@@ -394,8 +392,8 @@ To run decoding algorithm server side, we need to install `Pytorch` on `Triton`

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 8g \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
bash -c "pip install transformers torch==1.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \
tritonserver --model-repository=/models"

# output:
@@ -427,7 +425,7 @@ You may want to tweak it regarding your needs (default is set for greedy search
You may be interested in running optimized text generation on Python directly, without using any inference server:

```shell
docker run -p 8888:8888 -v $PWD/demo/generative-model:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
docker run -p 8888:8888 -v $PWD/demo/generative-model:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root"
```

@@ -442,7 +440,7 @@ It makes it easy to use.
To play with it, open this notebook:

```shell
docker run -p 8888:8888 -v $PWD/demo/quantization:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
docker run -p 8888:8888 -v $PWD/demo/quantization:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root"
```

46 changes: 10 additions & 36 deletions demo/generative-model/t5.ipynb
@@ -132,36 +132,10 @@
}
},
"source": [
"## `ONNX Runtime` compilation\n",
"\n",
"Version 1.11.1 of `ONNX Runtime` and older have a bug which makes them much slower when most inputs are used by subgraphs of an `If` node.\n",
"Unfortunately, it's exactly what will do below, so we need to compile our own version of `ONNX Runtime` until the version 1.12 is released (in June 2022).\n",
"Code below has been tested on Ubuntu 22.04 and supposes that your machine has `CUDA` 11.4 installed.\n",
"If not, use the Docker image of this library.\n",
"\n",
"We use a specific commit of `ONNX Runtime` with a better management of `If`/`Else`/`Then` `ONNX` nodes:\n",
"\n",
"```shell\n",
"git clone --recursive https://github.com/Microsoft/onnxruntime\n",
"cd onnxruntime\n",
"git checkout -b fix 453c57f92f5294417f69d81a240484b7d59938f2\n",
"CUDACXX=/usr/local/cuda-11.4/bin/nvcc ./build.sh \\\n",
" --config Release \\\n",
" --build_wheel \\\n",
" --parallel \\\n",
" --use_cuda \\\n",
" --cuda_home /usr/local/cuda-11.4 \\\n",
" --cudnn_home /usr/lib/x86_\n",
" -linux-gnu/ \\\n",
" --skip_test\n",
"# pip install ...\n",
"# other required dependencies\n",
"# pip install nvtx seaborn\n",
"```\n",
"\n",
"On our machine, it takes around 20 minutes.\n",
"\n",
"> to clear previous compilation, delete content of `./build` folder"
"## `ONNX Runtime`\n",
"\n",
"Version 1.11.1 of `ONNX Runtime` and older have a bug which makes them much slower when most inputs are used by subgraphs of an `If` node. \n",
"Use a version >= 1.12.0 instead."
]
},
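A minimal way to satisfy this requirement before running the notebook, assuming the GPU build of `ONNX Runtime` is the one wanted, is a simple pinned install followed by a version print:

```shell
# install a recent ONNX Runtime GPU build and print the version actually picked up
pip install "onnxruntime-gpu>=1.12.0"
python -c "import onnxruntime; print(onnxruntime.__version__)"
```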
{
@@ -667,7 +641,7 @@
"keep_fp32_encoder = get_keep_fp32_nodes(onnx_model_path=encoder_model_path, get_input=get_random_input_encoder)\n",
"assert len(keep_fp32_encoder) > 0\n",
"enc_model_onnx = convert_fp16(onnx_model=encoder_model_path, nodes_to_exclude=keep_fp32_encoder)\n",
"save_onnx(proto=enc_model_onnx, model_path=encoder_fp16_model_path)\n",
"save_onnx(proto=enc_model_onnx, model_path=encoder_fp16_model_path, clean=False)\n",
"\n",
"del enc_model_onnx\n",
"torch.cuda.empty_cache()\n",
@@ -824,7 +798,7 @@
"keep_fp32_no_cache = get_keep_fp32_nodes(onnx_model_path=dec_no_cache_model_path, get_input=get_random_input_no_cache)\n",
"\n",
"onnx_model_no_cache_fp16 = convert_fp16(onnx_model=dec_no_cache_model_path, nodes_to_exclude=keep_fp32_no_cache)\n",
"save_onnx(proto=onnx_model_no_cache_fp16, model_path=dec_no_cache_fp16_model_path)\n",
"save_onnx(proto=onnx_model_no_cache_fp16, model_path=dec_no_cache_fp16_model_path, clean=False)\n",
"del onnx_model_no_cache_fp16"
]
},
@@ -943,7 +917,7 @@
"\n",
"dec_cache_model_fp32_all_nodes = add_output_nodes(model=dec_cache_model)\n",
"dec_cache_model_fp32_all_nodes_path = dec_cache_model_path + \"_all_nodes.onnx\"\n",
"save_onnx(proto=dec_cache_model_fp32_all_nodes, model_path=dec_cache_model_fp32_all_nodes_path)\n",
"save_onnx(proto=dec_cache_model_fp32_all_nodes, model_path=dec_cache_model_fp32_all_nodes_path, clean=False)\n",
"# reload after shape inference\n",
"dec_cache_model_fp32_all_nodes = onnx.load_model(f=dec_cache_model_fp32_all_nodes_path, load_external_data=False)\n",
"\n",
@@ -994,7 +968,7 @@
"dec_no_cache_model.graph.output.extend(nodes_to_be_added)\n",
"\n",
"dec_no_cache_model_fp32_all_nodes_path = dec_no_cache_model_path + \"_all_nodes.onnx\"\n",
"save_onnx(proto=dec_no_cache_model, model_path=dec_no_cache_model_fp32_all_nodes_path)\n",
"save_onnx(proto=dec_no_cache_model, model_path=dec_no_cache_model_fp32_all_nodes_path, clean=False)\n",
"\n",
"# now that each model has the same number of output nodes, we can merge them!\n",
"merge_autoregressive_model_graphs(\n",
@@ -1068,7 +1042,7 @@
"gc.collect()\n",
"\n",
"onnx_model_cache_fp16 = convert_fp16(onnx_model=dec_cache_model_path, nodes_to_exclude=keep_fp32_cache)\n",
"save_onnx(proto=onnx_model_cache_fp16, model_path=dec_cache_fp16_model_path)\n",
"save_onnx(proto=onnx_model_cache_fp16, model_path=dec_cache_fp16_model_path, clean=False)\n",
"\n",
"del onnx_model_cache_fp16\n",
"gc.collect()\n",
@@ -2029,4 +2003,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
}
18 changes: 9 additions & 9 deletions demo/infinity/README.md
@@ -58,7 +58,7 @@ and Pytorch the simplest approach (at least it's the most well known tool).
```shell
# add -v $PWD/src:/opt/tritonserver/src to apply source code modification to the container
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
--backend tensorrt onnx \
@@ -96,7 +96,7 @@ Launch `Nvidia Triton inference server`:
```shell
# add --shm-size 256m -> to have up to 4 Python backends (tokenizer) at the same time (64Mb per instance)
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
-v $PWD/triton_models:/models ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "pip install transformers && tritonserver --model-repository=/models"
```
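Before running the measurements below, it can help to confirm the server is actually up. A minimal check against Triton's standard KServe HTTP health endpoint looks like this (sketch, not part of the diff above):

```shell
# returns HTTP 200 once all models are loaded and the server is ready to accept requests
curl -sf localhost:8000/v2/health/ready && echo "Triton is ready"
```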

@@ -111,16 +111,16 @@ Measures:
```shell
# need a local installation of the package
# pip install .[GPU]
ubuntu@ip-XXX:~/transformer-deploy$ python3 demo/infinity/triton_client.py --length 16 --model tensorrt
10/31/2021 12:09:34 INFO timing [triton transformers]: mean=1.53ms, sd=0.06ms, min=1.48ms, max=1.78ms, median=1.51ms, 95p=1.66ms, 99p=1.74ms
[[-3.4355469 3.2753906]]
python3 demo/infinity/triton_client.py --length 16 --model tensorrt
# 10/31/2021 12:09:34 INFO timing [triton transformers]: mean=1.53ms, sd=0.06ms, min=1.48ms, max=1.78ms, median=1.51ms, 95p=1.66ms, 99p=1.74ms
# [[-3.4355469 3.2753906]]
```

* 128 tokens + TensorRT:
```shell
ubuntu@ip-XXX:~/transformer-deploy$ python3 demo/infinity/triton_client.py --length 128 --model tensorrt
10/31/2021 12:12:00 INFO timing [triton transformers]: mean=1.96ms, sd=0.08ms, min=1.88ms, max=2.24ms, median=1.93ms, 95p=2.17ms, 99p=2.23ms
[[-3.4589844 3.3027344]]
python3 demo/infinity/triton_client.py --length 128 --model tensorrt
# 10/31/2021 12:12:00 INFO timing [triton transformers]: mean=1.96ms, sd=0.08ms, min=1.88ms, max=2.24ms, median=1.93ms, 95p=2.17ms, 99p=2.23ms
# [[-3.4589844 3.3027344]]
```

There is also a performance analysis tool provided by Nvidia called [`perf_analyzer`](https://github.com/triton-inference-server/server/blob/main/docs/perf_analyzer.md)
@@ -157,7 +157,7 @@ Model analyzer is a powerful tool to adjust the Triton server configuration.
To run it:

```shell
docker run -it --rm --gpus all -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.1.1 \
docker run -it --rm --gpus all -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "model-analyzer profile -f /project/demo/infinity/config_analyzer.yaml"
```

2 changes: 1 addition & 1 deletion docs/run.md
@@ -44,7 +44,7 @@ convert_model -m philschmid/MiniLM-L6-H384-uncased-sst2 --backend onnx --seq-len

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers sentencepiece && tritonserver --model-repository=/models"
```

2 changes: 1 addition & 1 deletion docs/setup_local.md
@@ -47,7 +47,7 @@ cd transformer-deploy
* for CPU/GPU support:

```shell
pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu113/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com
pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu116/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com
# if you want to perform GPU quantization (recommended):
pip3 install git+ssh://[email protected]/NVIDIA/TensorRT#egg=pytorch-quantization\&subdirectory=tools/pytorch-quantization/
# if you want to accelerate dense embeddings extraction:
2 changes: 1 addition & 1 deletion requirements_cpu.txt
@@ -1 +1 @@
onnxruntime
onnxruntime==1.12.0
4 changes: 2 additions & 2 deletions requirements_gpu.txt
@@ -1,5 +1,5 @@
onnxruntime-gpu
nvidia-tensorrt
onnxruntime-gpu==1.12.0
nvidia-tensorrt==8.4.1.5
onnx_graphsurgeon
polygraphy
cupy-cuda117
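With `onnxruntime-gpu` and `nvidia-tensorrt` now pinned, a quick post-install sanity check is possible; this is a sketch that assumes both packages import cleanly in the target environment:

```shell
# both modules expose a __version__ attribute; the values should match the pins above
python -c "import onnxruntime, tensorrt; print(onnxruntime.__version__, tensorrt.__version__)"
```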
2 changes: 1 addition & 1 deletion setup.py
@@ -48,7 +48,7 @@
"GPU": extra_gpu,
"CPU": extra_cpu,
},
python_requires=">=3.6.0",
python_requires=">=3.8.0",
entry_points={
"console_scripts": [
"convert_model = transformer_deploy.convert:entrypoint",
2 changes: 1 addition & 1 deletion src/transformer_deploy/backends/pytorch_utils.py
@@ -22,7 +22,7 @@
from torch.onnx import TrainingMode
from transformers import AutoConfig, PreTrainedModel

from src.transformer_deploy.backends.onnx_utils import save_onnx
from transformer_deploy.backends.onnx_utils import save_onnx


def infer_classification_pytorch(