Fix/update Triton + ORT (#116)
* fix: pin ort and trt versions

* fix: bugs and warnings

* fix: api change

* fix: remove all TRT context managers: they now trigger a warning and are no longer needed, since memory is cleared when possible.

* fix: fix padding

* fix: update triton docker image

* fix: update minimal Python version

* fix: update docker versions

* fix: fix model path for model generation

* fix: code formatting

* fix: fix question answering output

Co-authored-by: ayoub-louati <[email protected]>
pommedeterresautee and ayoub-louati authored Aug 5, 2022
1 parent 43b1449 commit d99e08e
Showing 17 changed files with 127 additions and 152 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -220,7 +220,6 @@ cython_debug/
.idea/
TensorRT/
triton_models/
*.whl
.vscode
to_delete/
.history/
4 changes: 2 additions & 2 deletions Dockerfile
@@ -1,9 +1,9 @@
FROM nvcr.io/nvidia/tritonserver:22.05-py3
FROM nvcr.io/nvidia/tritonserver:22.07-py3

# see .dockerignore to check what is transferred
COPY . ./

RUN pip3 install -U pip && \
pip3 install nvidia-pyindex && \
pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu113/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com --no-cache-dir && \
pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu116/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com --no-cache-dir && \
pip3 install sentence-transformers notebook pytorch-quantization ipywidgets
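For anyone who prefers building this image locally rather than pulling it from ghcr.io, a minimal sketch would be the following (the local tag name is arbitrary):

```shell
# build the updated image from the repository root; BuildKit is optional but speeds up the build
DOCKER_BUILDKIT=1 docker build -t transformer-deploy:local .
```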
38 changes: 18 additions & 20 deletions README.md
@@ -1,8 +1,6 @@
# Hugging Face Transformer submillisecond inference️ and deployment to production: 🤗 → 🤯

[![Documentation](https://img.shields.io/website?label=documentation&style=for-the-badge&up_message=online&url=https%3A%2F%2Fels-rd.github.io%2Ftransformer-deploy%2F)](https://els-rd.github.io/transformer-deploy/) [![tests](https://img.shields.io/github/workflow/status/ELS-RD/transformer-deploy/tests/main?label=tests&style=for-the-badge)](https://github.com/ELS-RD/transformer-deploy/actions/workflows/python-app.yml) [![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg?style=for-the-badge)](https://www.python.org/downloads/release/python-360/) [![Twitter Follow](https://img.shields.io/twitter/follow/pommedeterre33?color=orange&style=for-the-badge)](https://twitter.com/pommedeterre33)

**WARNING**: Docker image of this project is version `0.4.0` which is now few months old. Next release will be done on June/July 2022 when some dependencies of this library will be updated.
[![Documentation](https://img.shields.io/website?label=documentation&style=for-the-badge&up_message=online&url=https%3A%2F%2Fels-rd.github.io%2Ftransformer-deploy%2F)](https://els-rd.github.io/transformer-deploy/) [![tests](https://img.shields.io/github/workflow/status/ELS-RD/transformer-deploy/tests/main?label=tests&style=for-the-badge)](https://github.com/ELS-RD/transformer-deploy/actions/workflows/python-app.yml) [![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg?style=for-the-badge)](https://www.python.org/downloads/release/python-380/) [![Twitter Follow](https://img.shields.io/twitter/follow/pommedeterre33?color=orange&style=for-the-badge)](https://twitter.com/pommedeterre33)

### Optimize and deploy in **production** 🤗 Hugging Face Transformer models in a single command line.

@@ -65,7 +63,7 @@ First, clone the repo as some commands below expect to find the `demo` folder:
git clone [email protected]:ELS-RD/transformer-deploy.git
cd transformer-deploy
# docker image may take a few minutes
docker pull ghcr.io/els-rd/transformer-deploy:0.4.0
docker pull ghcr.io/els-rd/transformer-deploy:0.5.0
```

### Classification/reranking (encoder model)
@@ -79,7 +77,7 @@ This will optimize models, generate Triton configuration and Triton folder layou

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
--backend tensorrt onnx \
@@ -109,7 +107,7 @@ For production, it's advised to build your own 3-line Docker image with `transfo

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers && tritonserver --model-repository=/models"

# output:
@@ -149,7 +147,7 @@ This will optimize models, generate Triton configuration and Triton folder layou

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m \"kamalkraj/bert-base-cased-ner-conll2003\" \
--backend tensorrt onnx \
@@ -180,8 +178,8 @@ For production, it's advised to build your own 3-line Docker image with `transfo

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
bash -c "pip install transformers torch==1.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \
tritonserver --model-repository=/models"

# output:
@@ -214,7 +212,7 @@ This will optimize models, generate Triton configuration and Triton folder layou

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m \"distilbert-base-cased-distilled-squad\" \
--backend tensorrt onnx \
@@ -245,8 +243,8 @@ For production, it's advised to build your own 3-line Docker image with `transfo

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 1024m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
bash -c "pip install transformers torch==1.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \
tritonserver --model-repository=/models"

# output:
@@ -282,7 +280,7 @@ This project supports models from [sentence-transformers](https://github.com/UKP

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m \"sentence-transformers/msmarco-distilbert-cos-v5\" \
--backend tensorrt onnx \
@@ -305,7 +303,7 @@ docker run -it --rm --gpus all \

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers && tritonserver --model-repository=/models"

# output:
@@ -343,7 +341,7 @@ One point to have in mind is that Triton run:

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m gpt2 \
--backend tensorrt onnx \
@@ -373,7 +371,7 @@ To optimize models which typically don't fit twice onto a single GPU, run the sc

```shell
docker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.1 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m gpt2-medium \
--backend tensorrt onnx \
@@ -394,8 +392,8 @@ To run decoding algorithm server side, we need to install `Pytorch` on `Triton`

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 8g \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
bash -c "pip install transformers torch==1.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers torch==1.12.0 -f https://download.pytorch.org/whl/cu116/torch_stable.html && \
tritonserver --model-repository=/models"

# output:
@@ -427,7 +425,7 @@ You may want to tweak it regarding your needs (default is set for greedy search
You may be interested in running optimized text generation on Python directly, without using any inference server:

```shell
docker run -p 8888:8888 -v $PWD/demo/generative-model:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
docker run -p 8888:8888 -v $PWD/demo/generative-model:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root"
```

@@ -442,7 +440,7 @@ It makes it easy to use.
To play with it, open this notebook:

```shell
docker run -p 8888:8888 -v $PWD/demo/quantization:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
docker run -p 8888:8888 -v $PWD/demo/quantization:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root"
```

46 changes: 10 additions & 36 deletions demo/generative-model/t5.ipynb
@@ -132,36 +132,10 @@
}
},
"source": [
"## `ONNX Runtime` compilation\n",
"\n",
"Version 1.11.1 of `ONNX Runtime` and older have a bug which makes them much slower when most inputs are used by subgraphs of an `If` node.\n",
"Unfortunately, it's exactly what will do below, so we need to compile our own version of `ONNX Runtime` until the version 1.12 is released (in June 2022).\n",
"Code below has been tested on Ubuntu 22.04 and supposes that your machine has `CUDA` 11.4 installed.\n",
"If not, use the Docker image of this library.\n",
"\n",
"We use a specific commit of `ONNX Runtime` with a better management of `If`/`Else`/`Then` `ONNX` nodes:\n",
"\n",
"```shell\n",
"git clone --recursive https://github.com/Microsoft/onnxruntime\n",
"cd onnxruntime\n",
"git checkout -b fix 453c57f92f5294417f69d81a240484b7d59938f2\n",
"CUDACXX=/usr/local/cuda-11.4/bin/nvcc ./build.sh \\\n",
" --config Release \\\n",
" --build_wheel \\\n",
" --parallel \\\n",
" --use_cuda \\\n",
" --cuda_home /usr/local/cuda-11.4 \\\n",
" --cudnn_home /usr/lib/x86_\n",
" -linux-gnu/ \\\n",
" --skip_test\n",
"# pip install ...\n",
"# other required dependencies\n",
"# pip install nvtx seaborn\n",
"```\n",
"\n",
"On our machine, it takes around 20 minutes.\n",
"\n",
"> to clear previous compilation, delete content of `./build` folder"
"## `ONNX Runtime`\n",
"\n",
"Version 1.11.1 of `ONNX Runtime` and older have a bug which makes them much slower when most inputs are used by subgraphs of an `If` node. \n",
"Use a version >= 1.12.0 instead."
]
},
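A minimal way to satisfy this requirement before running the notebook, assuming the GPU build of `ONNX Runtime` is the one wanted, is a simple pinned install followed by a version print:

```shell
# install a recent ONNX Runtime GPU build and print the version actually picked up
pip install "onnxruntime-gpu>=1.12.0"
python -c "import onnxruntime; print(onnxruntime.__version__)"
```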
{
@@ -667,7 +641,7 @@
"keep_fp32_encoder = get_keep_fp32_nodes(onnx_model_path=encoder_model_path, get_input=get_random_input_encoder)\n",
"assert len(keep_fp32_encoder) > 0\n",
"enc_model_onnx = convert_fp16(onnx_model=encoder_model_path, nodes_to_exclude=keep_fp32_encoder)\n",
"save_onnx(proto=enc_model_onnx, model_path=encoder_fp16_model_path)\n",
"save_onnx(proto=enc_model_onnx, model_path=encoder_fp16_model_path, clean=False)\n",
"\n",
"del enc_model_onnx\n",
"torch.cuda.empty_cache()\n",
@@ -824,7 +798,7 @@
"keep_fp32_no_cache = get_keep_fp32_nodes(onnx_model_path=dec_no_cache_model_path, get_input=get_random_input_no_cache)\n",
"\n",
"onnx_model_no_cache_fp16 = convert_fp16(onnx_model=dec_no_cache_model_path, nodes_to_exclude=keep_fp32_no_cache)\n",
"save_onnx(proto=onnx_model_no_cache_fp16, model_path=dec_no_cache_fp16_model_path)\n",
"save_onnx(proto=onnx_model_no_cache_fp16, model_path=dec_no_cache_fp16_model_path, clean=False)\n",
"del onnx_model_no_cache_fp16"
]
},
@@ -943,7 +917,7 @@
"\n",
"dec_cache_model_fp32_all_nodes = add_output_nodes(model=dec_cache_model)\n",
"dec_cache_model_fp32_all_nodes_path = dec_cache_model_path + \"_all_nodes.onnx\"\n",
"save_onnx(proto=dec_cache_model_fp32_all_nodes, model_path=dec_cache_model_fp32_all_nodes_path)\n",
"save_onnx(proto=dec_cache_model_fp32_all_nodes, model_path=dec_cache_model_fp32_all_nodes_path, clean=False)\n",
"# reload after shape inference\n",
"dec_cache_model_fp32_all_nodes = onnx.load_model(f=dec_cache_model_fp32_all_nodes_path, load_external_data=False)\n",
"\n",
@@ -994,7 +968,7 @@
"dec_no_cache_model.graph.output.extend(nodes_to_be_added)\n",
"\n",
"dec_no_cache_model_fp32_all_nodes_path = dec_no_cache_model_path + \"_all_nodes.onnx\"\n",
"save_onnx(proto=dec_no_cache_model, model_path=dec_no_cache_model_fp32_all_nodes_path)\n",
"save_onnx(proto=dec_no_cache_model, model_path=dec_no_cache_model_fp32_all_nodes_path, clean=False)\n",
"\n",
"# now that each model has the same number of output nodes, we can merge them!\n",
"merge_autoregressive_model_graphs(\n",
@@ -1068,7 +1042,7 @@
"gc.collect()\n",
"\n",
"onnx_model_cache_fp16 = convert_fp16(onnx_model=dec_cache_model_path, nodes_to_exclude=keep_fp32_cache)\n",
"save_onnx(proto=onnx_model_cache_fp16, model_path=dec_cache_fp16_model_path)\n",
"save_onnx(proto=onnx_model_cache_fp16, model_path=dec_cache_fp16_model_path, clean=False)\n",
"\n",
"del onnx_model_cache_fp16\n",
"gc.collect()\n",
@@ -2029,4 +2003,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
}
18 changes: 9 additions & 9 deletions demo/infinity/README.md
@@ -58,7 +58,7 @@ and Pytorch the simplest approach (at least it's the most well known tool).
```shell
# add -v $PWD/src:/opt/tritonserver/src to apply source code modification to the container
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "cd /project && \
convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
--backend tensorrt onnx \
@@ -96,7 +96,7 @@ Launch `Nvidia Triton inference server`:
```shell
# add --shm-size 256m -> to have up to 4 Python backends (tokenizer) at the same time (64Mb per instance)
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
-v $PWD/triton_models:/models ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "pip install transformers && tritonserver --model-repository=/models"
```
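Before running the measurements below, it can help to confirm the server is actually up. A minimal check against Triton's standard KServe HTTP health endpoint looks like this (sketch, not part of the diff above):

```shell
# returns HTTP 200 once all models are loaded and the server is ready to accept requests
curl -sf localhost:8000/v2/health/ready && echo "Triton is ready"
```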

@@ -111,16 +111,16 @@ Measures:
```shell
# need a local installation of the package
# pip install .[GPU]
ubuntu@ip-XXX:~/transformer-deploy$ python3 demo/infinity/triton_client.py --length 16 --model tensorrt
10/31/2021 12:09:34 INFO timing [triton transformers]: mean=1.53ms, sd=0.06ms, min=1.48ms, max=1.78ms, median=1.51ms, 95p=1.66ms, 99p=1.74ms
[[-3.4355469 3.2753906]]
python3 demo/infinity/triton_client.py --length 16 --model tensorrt
# 10/31/2021 12:09:34 INFO timing [triton transformers]: mean=1.53ms, sd=0.06ms, min=1.48ms, max=1.78ms, median=1.51ms, 95p=1.66ms, 99p=1.74ms
# [[-3.4355469 3.2753906]]
```

* 128 tokens + TensorRT:
```shell
ubuntu@ip-XXX:~/transformer-deploy$ python3 demo/infinity/triton_client.py --length 128 --model tensorrt
10/31/2021 12:12:00 INFO timing [triton transformers]: mean=1.96ms, sd=0.08ms, min=1.88ms, max=2.24ms, median=1.93ms, 95p=2.17ms, 99p=2.23ms
[[-3.4589844 3.3027344]]
python3 demo/infinity/triton_client.py --length 128 --model tensorrt
# 10/31/2021 12:12:00 INFO timing [triton transformers]: mean=1.96ms, sd=0.08ms, min=1.88ms, max=2.24ms, median=1.93ms, 95p=2.17ms, 99p=2.23ms
# [[-3.4589844 3.3027344]]
```

There is also a performance analysis tool provided by Nvidia called [`perf_analyzer`](https://github.com/triton-inference-server/server/blob/main/docs/perf_analyzer.md)
@@ -157,7 +157,7 @@ Model analyzer is a powerful tool to adjust the Triton server configuration.
To run it:

```shell
docker run -it --rm --gpus all -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.1.1 \
docker run -it --rm --gpus all -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.0 \
bash -c "model-analyzer profile -f /project/demo/infinity/config_analyzer.yaml"
```

2 changes: 1 addition & 1 deletion docs/run.md
@@ -44,7 +44,7 @@ convert_model -m philschmid/MiniLM-L6-H384-uncased-sst2 --backend onnx --seq-len

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.05-py3 \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
bash -c "pip install transformers sentencepiece && tritonserver --model-repository=/models"
```

2 changes: 1 addition & 1 deletion docs/setup_local.md
@@ -47,7 +47,7 @@ cd transformer-deploy
* for CPU/GPU support:

```shell
pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu113/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com
pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu116/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com
# if you want to perform GPU quantization (recommended):
pip3 install git+ssh://[email protected]/NVIDIA/TensorRT#egg=pytorch-quantization\&subdirectory=tools/pytorch-quantization/
# if you want to accelerate dense embeddings extraction:
2 changes: 1 addition & 1 deletion requirements_cpu.txt
@@ -1 +1 @@
onnxruntime
onnxruntime==1.12.0
4 changes: 2 additions & 2 deletions requirements_gpu.txt
@@ -1,5 +1,5 @@
onnxruntime-gpu
nvidia-tensorrt
onnxruntime-gpu==1.12.0
nvidia-tensorrt==8.4.1.5
onnx_graphsurgeon
polygraphy
cupy-cuda117
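With `onnxruntime-gpu` and `nvidia-tensorrt` now pinned, a quick post-install sanity check is possible; this is a sketch that assumes both packages import cleanly in the target environment:

```shell
# both modules expose a __version__ attribute; the values should match the pins above
python -c "import onnxruntime, tensorrt; print(onnxruntime.__version__, tensorrt.__version__)"
```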
2 changes: 1 addition & 1 deletion setup.py
@@ -48,7 +48,7 @@
"GPU": extra_gpu,
"CPU": extra_cpu,
},
python_requires=">=3.6.0",
python_requires=">=3.8.0",
entry_points={
"console_scripts": [
"convert_model = transformer_deploy.convert:entrypoint",
2 changes: 1 addition & 1 deletion src/transformer_deploy/backends/pytorch_utils.py
@@ -22,7 +22,7 @@
from torch.onnx import TrainingMode
from transformers import AutoConfig, PreTrainedModel

from src.transformer_deploy.backends.onnx_utils import save_onnx
from transformer_deploy.backends.onnx_utils import save_onnx


def infer_classification_pytorch(