Commit

Merge branch 'openvinotoolkit:releases/2022/1' into releases/2022/1
tsavina authored Mar 10, 2022
2 parents dce9586 + 0047db7 commit 04ff28f
Showing 14 changed files with 219 additions and 115 deletions.
2 changes: 1 addition & 1 deletion docs/OV_Runtime_UG/hetero_execution.md
@@ -103,7 +103,7 @@ If you want different devices in Hetero execution to have different device-speci

@endsphinxdirective

In the example above, `CPU` device is configured to enable profiling data, while only `GPU` device has configuration property to perform inference in `f16` precision, while CPU has default execution precision.
In the example above, the `GPU` device is configured to enable profiling data, while only the `CPU` device has a configuration property to perform inference in `f32` precision; the GPU runs with its default execution precision.
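
For illustration, here is a minimal C++ sketch of how such a per-device configuration might be passed. The `HETERO:GPU,CPU` priority string, the `model.xml` path, and the use of `ov::hint::inference_precision` are assumptions for this sketch, not the documentation's official snippet:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // hypothetical model path

    // GPU runs with profiling enabled; CPU is asked to execute in f32 precision.
    // GPU is listed first, so it is preferred for the layers it supports.
    auto compiled_model = core.compile_model(model, "HETERO:GPU,CPU",
        ov::device::properties("GPU", ov::enable_profiling(true)),
        ov::device::properties("CPU", ov::hint::inference_precision(ov::element::f32)));
    return 0;
}
```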

### Handling Difficult Topologies

15 changes: 3 additions & 12 deletions docs/OV_Runtime_UG/integrate_with_your_application.md
@@ -132,9 +130,7 @@ To learn how to change the device configuration, read the [Query device properti

### Step 3. Create an Inference Request

`ov::InferRequest` class provides methods for model inference in the OpenVINO™ Runtime.
This section demonstrates a simple pipeline, to get more information about other use cases, read the [InferRequest documentation](./ov_infer_request.md) dedicated article.
Create an infer request using the following code:
The `ov::InferRequest` class provides methods for model inference in OpenVINO™ Runtime. Create an infer request using the following code (for more details, see the [InferRequest documentation](./ov_infer_request.md)):
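
The official snippets are shown in the tabs below; as a rough, self-contained sketch (device name and model path are placeholders):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");               // placeholder model path
    auto compiled_model = core.compile_model(model, "CPU");   // placeholder device

    // Each infer request owns its input/output tensors and can be run independently.
    ov::InferRequest infer_request = compiled_model.create_infer_request();
    return 0;
}
```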

@sphinxdirective

@@ -174,7 +172,7 @@ You can use external memory to create `ov::Tensor` and use the `ov::InferRequest

### Step 5. Start Inference

OpenVINO™ Runtime supports inference in asynchronous or synchronous mode. Async API usage can improve overall frame-rate of the application, because rather than wait for inference to complete, the app can continue doing things on the host, while the accelerator is busy. You can use `ov::InferRequest::start_async()` to start model inference in the asynchronous mode and call `ov::InferRequest::wait()` to wait for the inference results:
OpenVINO™ Runtime supports inference in either synchronous or asynchronous mode. Using the Async API can improve the application's overall frame-rate: rather than waiting for inference to complete, the app can keep working on the host while the accelerator is busy. You can use `ov::InferRequest::start_async` to start model inference in asynchronous mode and call `ov::InferRequest::wait` to wait for the inference results:

@sphinxdirective

@@ -192,14 +190,7 @@ OpenVINO™ Runtime supports inference in asynchronous or synchronous mode. Asyn

@endsphinxdirective

The asynchronous mode supports two methods to get the inference results:
* `ov::InferRequest::wait_for()` - Waits until the specified timeout (in milliseconds) has elapsed or the inference result becomes available, whichever comes first.
* `ov::InferRequest::wait()` - Waits until the inference result becomes available.

Both requests are thread-safe, which means they can be called from different threads without exposing erroneous behavior or producing unpredictable results.

While the request is ongoing, all its methods except `ov::InferRequest::cancel`, `ov::InferRequest::wait` or `ov::InferRequest::wait_for` throw
the `ov::Busy` exception indicating the request is busy with computations.
This section demonstrates a simple pipeline. To get more information about other ways to perform inference, read the dedicated ["Run inference" section](./ov_infer_request.md).
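
A minimal end-to-end sketch of the asynchronous flow, assuming a single-input model at a placeholder path (the official snippets for each step are in the tabs above):

```cpp
#include <openvino/openvino.hpp>
#include <cstring>

int main() {
    ov::Core core;
    auto compiled_model = core.compile_model(core.read_model("model.xml"), "CPU");  // placeholders
    ov::InferRequest infer_request = compiled_model.create_infer_request();

    // Fill the single input tensor with application data (zeros here as a stand-in).
    ov::Tensor input = infer_request.get_input_tensor();
    std::memset(input.data(), 0, input.get_byte_size());

    infer_request.start_async();  // returns immediately; the host thread stays free
    infer_request.wait();         // blocks until the result is available

    ov::Tensor output = infer_request.get_output_tensor();
    // ... post-process `output` ...
    return 0;
}
```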

### Step 6. Process the Inference Results

32 changes: 17 additions & 15 deletions docs/OV_Runtime_UG/ov_infer_request.md
@@ -30,7 +30,7 @@ This class allows to set and get data for model inputs, outputs and run inferenc

### Synchronous mode

You can use `ov::InferRequest::infer()`, which blocks the application execution, to infer model in synchronous mode:
You can use `ov::InferRequest::infer`, which blocks the application execution, to infer a model in synchronous mode:
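
A rough sketch of the synchronous flow (model path and device name are placeholders; the official snippets follow in the tabs below):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto compiled_model = core.compile_model(core.read_model("model.xml"), "CPU");  // placeholders
    ov::InferRequest infer_request = compiled_model.create_infer_request();

    // infer() blocks the calling thread until the results are ready.
    infer_request.infer();

    ov::Tensor output = infer_request.get_output_tensor();
    // ... post-process `output` ...
    return 0;
}
```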

@sphinxdirective

@@ -50,7 +50,7 @@ You can use `ov::InferRequest::infer()`, which blocks the application execution,

### Asynchronous mode

Asynchronous mode can improve overall frame-rate of the application, because rather than wait for inference to complete, the app can continue doing things on the host, while accelerator is busy. You can use `ov::InferRequest::start_async()` to infer model in asynchronous mode:
Asynchronous mode can improve the application's overall frame-rate: rather than waiting for inference to complete, the app can keep working on the host while the accelerator is busy. You can use `ov::InferRequest::start_async` to infer a model in asynchronous mode:

@sphinxdirective

@@ -68,8 +68,8 @@ Asynchronous mode can improve overall frame-rate of the application, because rat

@endsphinxdirective

Asynchronous mode supports two ways to wait inference results:
* `ov::InferRequest::wait_for()` - specify maximum duration in milliseconds to block for. The method is blocked until the specified timeout has elapsed, or the result becomes available, whichever comes first.
Asynchronous mode offers two ways for the application to wait for inference results:
* `ov::InferRequest::wait_for` - specifies the maximum duration, in milliseconds, to block for. The method blocks until the specified timeout has elapsed or the result becomes available, whichever comes first.
@sphinxdirective

.. tab:: C++
@@ -85,7 +85,7 @@ Asynchronous mode supports two ways to wait inference results:
:fragment: [wait_for]

@endsphinxdirective
* `ov::InferRequest::wait()` - waits until inference result becomes available
* `ov::InferRequest::wait` - waits until the inference result becomes available
@sphinxdirective

.. tab:: C++
@@ -102,10 +102,9 @@ Asynchronous mode supports two ways to wait inference results:

@endsphinxdirective

Both requests are thread-safe: can be called from different threads without fearing corruption and failures.
Both methods are thread-safe.
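
As an illustration of the difference, here is a sketch of a bounded wait used as a polling loop (model path and device are placeholders, not the documentation's official snippet):

```cpp
#include <openvino/openvino.hpp>
#include <chrono>

int main() {
    ov::Core core;
    auto compiled_model = core.compile_model(core.read_model("model.xml"), "CPU");  // placeholders
    ov::InferRequest infer_request = compiled_model.create_infer_request();

    infer_request.start_async();

    // wait_for() returns true once the result is ready, false if the timeout
    // elapsed first, so the host can interleave other work between polls.
    while (!infer_request.wait_for(std::chrono::milliseconds(10))) {
        // ... do other useful work on the host ...
    }
    return 0;
}
```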

Also InferRequest provides an functionality which allows to avoid a call of `ov::InferRequest::wait()`, in order to do it, you can use `ov::InferRequest::set_callback()` method. This method allows to set callback which will be called after completing run of InferRequest, please use weak reference of infer_request (`ov::InferRequest*`, `ov::InferRequest&`, `std::weal_ptr<ov::InferRequest>` and etc) in the callback, it is needed to avoid cyclic references.
For more details please take a look too [Classification Sample Async](../../samples/cpp/classification_sample_async/README.md).
When you run several inference requests in parallel, a device can process them simultaneously, with no guarantees on the completion order. This may complicate logic based on `ov::InferRequest::wait` (unless your code needs to wait for _all_ the requests). For multi-request scenarios, consider using the `ov::InferRequest::set_callback` method to set a callback that is called upon completion of the request:

@sphinxdirective

@@ -123,7 +122,10 @@ For more details please take a look too [Classification Sample Async](../../samp

@endsphinxdirective

You can use `ov::InferRequest::cancel()` method in case if you want to cancel the current inference request:
> **NOTE**: Use a weak reference to the infer request (`ov::InferRequest*`, `ov::InferRequest&`, `std::weak_ptr<ov::InferRequest>`, etc.) in the callback. This is necessary to avoid cyclic references.
For more details, check [Classification Sample Async](../../samples/cpp/classification_sample_async/README.md).
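
The linked sample is the reference; purely as a sketch, a completion callback might look like this (model path and device are placeholders; the request is captured by reference to avoid a cyclic reference):

```cpp
#include <openvino/openvino.hpp>
#include <exception>
#include <iostream>

int main() {
    ov::Core core;
    auto compiled_model = core.compile_model(core.read_model("model.xml"), "CPU");  // placeholders
    ov::InferRequest infer_request = compiled_model.create_infer_request();

    // The callback receives a non-null exception_ptr if inference failed.
    infer_request.set_callback([&infer_request](std::exception_ptr ex) {
        if (ex) {
            try { std::rethrow_exception(ex); }
            catch (const std::exception& e) { std::cerr << e.what() << '\n'; }
            return;
        }
        ov::Tensor output = infer_request.get_output_tensor();
        // ... post-process `output`, schedule the next request, etc. ...
    });

    infer_request.start_async();
    infer_request.wait();  // only to keep this sketch from exiting before completion
    return 0;
}
```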

You can use the `ov::InferRequest::cancel` method if you want to abort execution of the current inference request:

@sphinxdirective

@@ -145,7 +147,7 @@ You can use `ov::InferRequest::cancel()` method in case if you want to cancel th

`ov::InferRequest` allows you to get input/output tensors by tensor name, index, or port, as well as without any arguments if the model has only one input or output.
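
A compact sketch of the access variants detailed in the list below (the model path, device, and tensor name are placeholders):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto compiled_model = core.compile_model(core.read_model("model.xml"), "CPU");  // placeholders
    ov::InferRequest infer_request = compiled_model.create_infer_request();

    // Single-input/single-output model: no arguments needed.
    ov::Tensor input  = infer_request.get_input_tensor();
    ov::Tensor output = infer_request.get_output_tensor();

    // By input/output index.
    ov::Tensor input0 = infer_request.get_input_tensor(0);

    // By tensor name ("data" is an example; use a name that exists in your model).
    ov::Tensor by_name = infer_request.get_tensor("data");

    // By port.
    ov::Tensor by_port = infer_request.get_tensor(compiled_model.input());
    return 0;
}
```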

* `ov::InferRequest::get_input_tensor()`, `ov::InferRequest::set_input_tensor()`, `ov::InferRequest::get_output_tensor()`, `ov::InferRequest::set_output_tensor()` methods without arguments can be used to get or set input/output tensor for model with only one input/output:
* `ov::InferRequest::get_input_tensor`, `ov::InferRequest::set_input_tensor`, `ov::InferRequest::get_output_tensor`, `ov::InferRequest::set_output_tensor` methods without arguments can be used to get or set the input/output tensor for a model with only one input/output:
@sphinxdirective

.. tab:: C++
@@ -162,7 +164,7 @@ You can use `ov::InferRequest::cancel()` method in case if you want to cancel th

@endsphinxdirective

* `ov::InferRequest::get_input_tensor()`, `ov::InferRequest::set_input_tensor()`, `ov::InferRequest::get_output_tensor()`, `ov::InferRequest::set_output_tensor()` methods with argument can be used to get or set input/output tensor by input/output index:
* `ov::InferRequest::get_input_tensor`, `ov::InferRequest::set_input_tensor`, `ov::InferRequest::get_output_tensor`, `ov::InferRequest::set_output_tensor` methods with an argument can be used to get or set the input/output tensor by input/output index:
@sphinxdirective

.. tab:: C++
@@ -179,7 +181,7 @@ You can use `ov::InferRequest::cancel()` method in case if you want to cancel th

@endsphinxdirective

* `ov::InferRequest::get_tensor()`, `ov::InferRequest::set_tensor()` methods can be used to get or set input/output tensor by tensor name:
* `ov::InferRequest::get_tensor`, `ov::InferRequest::set_tensor` methods can be used to get or set an input/output tensor by tensor name:
@sphinxdirective

.. tab:: C++
@@ -196,7 +198,7 @@ You can use `ov::InferRequest::cancel()` method in case if you want to cancel th

@endsphinxdirective

* `ov::InferRequest::get_tensor()`, `ov::InferRequest::set_tensor()` methods can be used to get or set input/output tensor by port:
* `ov::InferRequest::get_tensor`, `ov::InferRequest::set_tensor` methods can be used to get or set an input/output tensor by port:
@sphinxdirective

.. tab:: C++
@@ -218,7 +220,7 @@ You can use `ov::InferRequest::cancel()` method in case if you want to cancel th
### Cascade of models

`ov::InferRequest` can be used to organize a cascade of models. You need a separate infer request for each model.
In this case you can get output tensor from the first request using `ov::InferRequest::get_tensor()` and set it as input for the second request using `ov::InferRequest::set_tensor()`. But be careful, shared tensors across compiled models can be rewritten by the first model if the first infer request is run once again, while the second model has not started yet.
In this case, you can get the output tensor from the first request using `ov::InferRequest::get_tensor` and set it as input for the second request using `ov::InferRequest::set_tensor`. Be careful: a tensor shared across compiled models can be rewritten by the first model if the first infer request is run again while the second model has not started yet.
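
A sketch of the idea with two hypothetical models (paths and device are placeholders):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto first  = core.compile_model(core.read_model("model_1.xml"), "CPU");  // placeholders
    auto second = core.compile_model(core.read_model("model_2.xml"), "CPU");
    ov::InferRequest request_1 = first.create_infer_request();
    ov::InferRequest request_2 = second.create_infer_request();

    request_1.infer();

    // Feed the first model's output directly into the second model (no copy).
    // Do not rerun request_1 until request_2 has consumed the shared tensor.
    ov::Tensor intermediate = request_1.get_output_tensor();
    request_2.set_input_tensor(intermediate);
    request_2.infer();
    return 0;
}
```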

@sphinxdirective

@@ -238,7 +240,7 @@ In this case you can get output tensor from the first request using `ov::InferRe

### Using ROI tensors

It is possible to re-use shared input by several models. You do not need to allocate separate input tensor for a model if it processes a ROI object located inside of already allocated input of a previous model. For instance, when first model detects objects on a video frame (stored as input tensor) and second model accepts detected bounding boxes (ROI inside of the frame) as input. In this case, it is allowed to re-use pre-allocated input tensor (used by first model) by second model and just crop ROI without allocation of new memory using `ov::Tensor()` with passing of `ov::Tensor` and `ov::Coordinate` as parameters.
Several models can re-use a shared input. You do not need to allocate a separate input tensor for a model if it processes an ROI located inside an already allocated input of a previous model. For instance, the first model may detect objects in a video frame (stored as an input tensor), while the second model accepts the detected bounding boxes (ROIs inside the frame) as input. In this case, the second model can re-use the pre-allocated input tensor of the first model and simply crop the ROI, without allocating new memory, by creating an `ov::Tensor` with the original `ov::Tensor` and `ov::Coordinate` ranges as parameters.
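
A sketch of the ROI idea with hypothetical detector/classifier models; the layout, the ROI coordinates, and the assumption that the second model accepts the ROI shape are all placeholders:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto detector   = core.compile_model(core.read_model("detector.xml"), "CPU");    // placeholders
    auto classifier = core.compile_model(core.read_model("classifier.xml"), "CPU");
    ov::InferRequest det_request = detector.create_infer_request();
    ov::InferRequest cls_request = classifier.create_infer_request();

    det_request.infer();

    // Crop an ROI out of the detector's input frame without copying: the ROI tensor
    // shares memory with the original tensor. Assumes an NCHW frame of at least 1x3x96x128.
    ov::Tensor frame = det_request.get_input_tensor();
    ov::Coordinate begin{0, 0, 32, 64};   // example box: rows 32..96, cols 64..128
    ov::Coordinate end{1, 3, 96, 128};
    ov::Tensor roi(frame, begin, end);

    // Assumes the classifier input matches the ROI shape (or is dynamic).
    cls_request.set_input_tensor(roi);
    cls_request.infer();
    return 0;
}
```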

@sphinxdirective

4 changes: 2 additions & 2 deletions docs/documentation.md
@@ -26,11 +26,11 @@
:caption: Tuning for Performance
:hidden:

openvino_docs_performance_benchmarks
openvino_docs_optimization_guide_dldt_optimization_guide
openvino_docs_MO_DG_Getting_Performance_Numbers
pot_README
openvino_docs_model_optimization_guide
openvino_docs_tuning_utilities
openvino_docs_performance_benchmarks


.. toctree::
3 changes: 3 additions & 0 deletions docs/img/nncf_workflow.png
30 changes: 9 additions & 21 deletions docs/optimization_guide/dldt_optimization_guide.md
@@ -9,32 +9,20 @@ Performance means how fast the model is in deployment. Two key metrics are used

![](../img/LATENCY_VS_THROUGHPUT.svg)

Latency measures inference time (ms) required to process a single input. When it comes to batch input need to measure throughput (images per second or frames per second, FPS). To calculate throughput, divide number of frames that were processed by the processing time.
Latency measures the inference time (ms) required to process a single input. For batch input, you need to measure throughput (images per second or frames per second, FPS). To calculate throughput, divide the number of frames that were processed by the processing time: for example, 120 frames processed in 2 seconds corresponds to 60 FPS.

> **NOTE**: To get performance numbers for OpenVINO, as well as tips how to measure it and compare with native framework, check [Getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) page.
## How to Measure Performance
To get performance numbers for OpenVINO, as well as tips on how to measure them and compare with the native framework, go to the [Getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) page.

## How to Improve Performance

> **NOTE**: Make sure that your model can be successfully inferred with OpenVINO Runtime.
> **NOTE**: Make sure that your model can be successfully inferred with OpenVINO Runtime before moving on to the optimization topics.
Inside OpenVINO there are two ways how to get better performance number: during developing and deployment your model. **It is possible to combine both developing and deployment optimizations**.
Inside OpenVINO there are two ways to get better performance numbers: optimize the model itself (**model optimization**) or tune the parameters of execution (**deployment optimization**). Note that it is possible to combine both types of optimizations.

- **Developing step** includes model modification. Inside developing optimization there are three ways to optimize your model:
- **Model optimization** includes model modification, such as quantization, pruning, optimization of preprocessing, etc. For more details, refer to this [document](./model_optimization_guide.md).

- **Post-training Optimization tool** (POT) is designed to optimize the inference of deep learning models by applying special methods without model retraining or fine-tuning, like post-training quantization.
- **Deployment optimization** includes tuning inference parameters and optimizing model execution. To read more, visit the [Deployment Optimization Guide](../optimization_guide/dldt_deployment_optimization_guide.md).

- **Neural Network Compression Framework (NNCF)** provides a suite of advanced algorithms for Neural Networks inference optimization with minimal accuracy drop, available quantization, pruning and sparsity optimization algorithms.

- **Model Optimizer** implement some optimization to a model, most of them added by default, but you can configure mean/scale values, batch size RGB vs BGR input channels and other parameters to speed-up preprocess of a model ([Additional Optimization Use Cases](../MO_DG/prepare_model/Additional_Optimizations.md))

- **Deployment step** includes tuning inference parameters and optimizing model execution, to read more visit [Deployment Optimization Guide](../optimization_guide/dldt_deployment_optimization_guide.md).

More detailed workflow:

![](../img/DEVELOPMENT_FLOW_V3_crunch.svg)

To understand when to use each development optimization tool, follow this diagram:

POT is the easiest way to get optimized models and it is also really fast and usually takes several minutes depending on the model size and used HW. NNCF can be considered as an alternative or an addition when the first does not give accurate results.

![](../img/WHAT_TO_USE.svg)
## Performance Benchmarks
To estimate performance and compare performance numbers measured on various supported devices, a wide range of public models is available in the [Performance benchmarks](../benchmarks/performance_benchmarks.md) section.
34 changes: 34 additions & 0 deletions docs/optimization_guide/model_optimization_guide.md
@@ -0,0 +1,34 @@
# Model Optimization Guide {#openvino_docs_model_optimization_guide}

@sphinxdirective

.. toctree::
:maxdepth: 1
:hidden:

pot_README
docs_nncf_introduction

@endsphinxdirective

Model optimization means applying transformations to the model and the relevant data flow to improve inference performance. These transformations are performed offline and may require training and validation data. They include methods such as quantization, pruning, preprocessing optimization, etc. OpenVINO provides several tools to optimize models at different steps of model development:

- **Post-training Optimization tool [(POT)](../../tools/pot/README.md)** is designed to optimize the inference of deep learning models by applying post-training methods that do not require model retraining or fine-tuning, like post-training quantization.

- **Neural Network Compression Framework [(NNCF)](./nncf_introduction.md)** provides a suite of advanced algorithms for neural network inference optimization with minimal accuracy drop, for example, quantization and pruning algorithms.

- **Model Optimizer** implements optimizations for a model, most of them applied by default, but you can configure mean/scale values, batch size, RGB vs. BGR input channels, and other parameters to speed up preprocessing of a model ([Additional Optimization Use Cases](../MO_DG/prepare_model/Additional_Optimizations.md)).


## Detailed workflow

![](../img/DEVELOPMENT_FLOW_V3_crunch.svg)

To understand which development optimization tool you need, refer to the diagram:

POT is the easiest way to get optimized models and usually takes several minutes, depending on the model size and the hardware used. NNCF can be considered an alternative or an addition when POT does not give accurate results.

![](../img/WHAT_TO_USE.svg)

## See also
- [Deployment optimization](./dldt_deployment_optimization_guide.md)