From 88561571f6a0a77fe85e9f20389d422c2beac041 Mon Sep 17 00:00:00 2001
From: Damian
Date: Thu, 27 Jul 2023 18:18:59 +0000
Subject: [PATCH 1/4] initial commit

---
 src/deepsparse/transformers/README.md          | 25 +++++++++++--------
 .../transformers/pipelines/text_generation.py  |  1 +
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/src/deepsparse/transformers/README.md b/src/deepsparse/transformers/README.md
index be8df1ebd5..b88209593f 100644
--- a/src/deepsparse/transformers/README.md
+++ b/src/deepsparse/transformers/README.md
@@ -71,7 +71,7 @@ of the sparsified transformers model.
 
 If no model is specified to the `Pipeline` for a given task, the `Pipeline` will automatically
 select a pruned and quantized model for the task from the `SparseZoo` that can be used for accelerated
-inference. Note that other models in the SparseZoo will have different tradeoffs between speed, size,
+inference. Note that other models in the SparseZoo gwill have different tradeoffs between speed, size,
 and accuracy.
 
 ### HTTP Server
@@ -139,23 +139,28 @@ response.text
 >> '{"score":0.9534820914268494,"start":8,"end":14,"answer":"batman"}'
 ```
 
-### Text Generation 
-The text generation task generates a sequence of tokens given the prompt. Popular text generation LLMs (Large Language Models) are used
+### Text Generation
+The text generation task generates a sequence of tokens given the prompt. Popular text generation Large Language Models (LLMs) are used
 for the chatbots (the instruction models), code generation, text summarization, or filling out the missing text. The following example uses a sparsified text classification
-OPT model to complete the prompt
+OPT model to complete the prompt.
 
-[List of available SparseZoo Text Generation Models](
-https://sparsezoo.neuralmagic.com/?useCase=text_generation)
+#### KV Cache Injection
+Please note, that to take the full advantage of the speedups provided by the DeepSparse Engine, it is essential to run inference using a model with the KV cache support.
+If you are using one of the pre-sparsified models from SparseZoo ([list of available SparseZoo Text Generation Models](
+https://sparsezoo.neuralmagic.com/?useCase=text_generation)), you will automatically benefit from the KV cache support speedups.
+However, if you are sparsifying your custom model, you may want to add the KV cache support to your model. This will be extremely beneficial when it comes to the inference speed.
+
+For more details, please refer to the [SparseML documentation on KV cache injection](...)
 
 #### Python Pipeline
 ```python
 from deepsparse import Pipeline
 
-opt_pipeline = Pipeline.create(task="opt")
+opt_pipeline = Pipeline.create(task="opt", max_generated_tokens=32)
 
 inference = opt_pipeline("Who is the president of the United States?")
 
->> 'The president of the United States is the head of the executive branch of government...'
+>> 'The president of the United States is the head of the executive branch of government...' #TODO: Waiting for a good stub to use
 ```
 
 #### HTTP Server
@@ -163,7 +168,7 @@ Spinning up:
 ```bash
 deepsparse.server \
   task text-generation \
-  --model_path # TODO: Pending until text generation models get uploaded to SparseZoo
+  --model_path zoo:nlg/text_generation/opt-1.3b/pytorch/huggingface/pretrained/pruned50_quant-none #TODO: Waiting for a good stub to use
 ```
 
 Making a request:
@@ -177,7 +182,7 @@ obj = {"sequence": "Who is the president of the United States?"}
 response = requests.post(url, json=obj)
 response.text
 
->> 'The president of the United States is the head of the executive branch of government...'
+>> 'The president of the United States is the head of the executive branch of government...' #TODO: Waiting for a good stub to use
 ```
 
 ### Sentiment Analysis
diff --git a/src/deepsparse/transformers/pipelines/text_generation.py b/src/deepsparse/transformers/pipelines/text_generation.py
index 813a7fa700..f27ed75297 100644
--- a/src/deepsparse/transformers/pipelines/text_generation.py
+++ b/src/deepsparse/transformers/pipelines/text_generation.py
@@ -94,6 +94,7 @@ class Config:
 @Pipeline.register(
     task="text_generation",
     task_aliases=["codegen", "opt", "bloom"],
+    default_model_path="zoo:nlg/text_generation/opt-1.3b/pytorch/huggingface/pretrained/pruned50_quant-none",  # noqa E501
 )
 class TextGenerationPipeline(TransformersPipeline):
     """

From 50f45d0f75ca80f32c6a2fa822f96750590f7568 Mon Sep 17 00:00:00 2001
From: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>
Date: Thu, 27 Jul 2023 20:25:00 +0200
Subject: [PATCH 2/4] Update src/deepsparse/transformers/README.md

---
 src/deepsparse/transformers/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/deepsparse/transformers/README.md b/src/deepsparse/transformers/README.md
index b88209593f..5409b90951 100644
--- a/src/deepsparse/transformers/README.md
+++ b/src/deepsparse/transformers/README.md
@@ -71,7 +71,7 @@ of the sparsified transformers model.
 
 If no model is specified to the `Pipeline` for a given task, the `Pipeline` will automatically
 select a pruned and quantized model for the task from the `SparseZoo` that can be used for accelerated
-inference. Note that other models in the SparseZoo gwill have different tradeoffs between speed, size,
+inference. Note that other models in the SparseZoo will have different tradeoffs between speed, size,
 and accuracy.
 
 ### HTTP Server

From 11dc7c1f08e479e3fc206f32c702b07f27f9fe8a Mon Sep 17 00:00:00 2001
From: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>
Date: Thu, 27 Jul 2023 20:26:00 +0200
Subject: [PATCH 3/4] Update src/deepsparse/transformers/README.md

---
 src/deepsparse/transformers/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/deepsparse/transformers/README.md b/src/deepsparse/transformers/README.md
index 5409b90951..94944c8bce 100644
--- a/src/deepsparse/transformers/README.md
+++ b/src/deepsparse/transformers/README.md
@@ -150,7 +150,7 @@ If you are using one of the pre-sparsified models from SparseZoo ([list of avail
 https://sparsezoo.neuralmagic.com/?useCase=text_generation)), you will automatically benefit from the KV cache support speedups.
 However, if you are sparsifying your custom model, you may want to add the KV cache support to your model. This will be extremely beneficial when it comes to the inference speed.
 
-For more details, please refer to the [SparseML documentation on KV cache injection](...)
+For more details, please refer to the [SparseML documentation on KV cache injection](https://github.com/neuralmagic/sparseml/src/sparseml/exporters/README.md)
 
 #### Python Pipeline
 ```python

From 62ac6435f70fd942ed9118d39d7d5f792d7a0c88 Mon Sep 17 00:00:00 2001
From: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>
Date: Thu, 27 Jul 2023 20:27:54 +0200
Subject: [PATCH 4/4] Update src/deepsparse/transformers/README.md

---
 src/deepsparse/transformers/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/deepsparse/transformers/README.md b/src/deepsparse/transformers/README.md
index 94944c8bce..9ea81ac580 100644
--- a/src/deepsparse/transformers/README.md
+++ b/src/deepsparse/transformers/README.md
@@ -148,7 +148,7 @@ OPT model to complete the prompt.
 Please note, that to take the full advantage of the speedups provided by the DeepSparse Engine, it is essential to run inference using a model with the KV cache support.
 If you are using one of the pre-sparsified models from SparseZoo ([list of available SparseZoo Text Generation Models](
 https://sparsezoo.neuralmagic.com/?useCase=text_generation)), you will automatically benefit from the KV cache support speedups.
-However, if you are sparsifying your custom model, you may want to add the KV cache support to your model. This will be extremely beneficial when it comes to the inference speed.
+However, if you are sparsifying your custom neural network, you may want to add the KV cache support to your model. This will be extremely beneficial when it comes to inference speed.
 
 For more details, please refer to the [SparseML documentation on KV cache injection](https://github.com/neuralmagic/sparseml/src/sparseml/exporters/README.md)
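
As a quick illustration of what [PATCH 1/4] enables, here is a minimal usage sketch: once `default_model_path` is registered for the `text_generation` task, `Pipeline.create` can be called without an explicit `model_path` and should fall back to the OPT SparseZoo stub above. This assumes the stub resolves once the model is published to SparseZoo, per the remaining `#TODO` markers in the patch.

```python
from deepsparse import Pipeline

# No model_path argument: the pipeline is expected to fall back to the
# default_model_path registered for "text_generation" (and its "opt"
# alias) in PATCH 1/4. max_generated_tokens mirrors the README example.
opt_pipeline = Pipeline.create(task="opt", max_generated_tokens=32)

# Prompt completion, as in the README's Python Pipeline section.
inference = opt_pipeline("Who is the president of the United States?")
print(inference)
```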