
# High-level Performance Hints

Each of OpenVINO's supported devices offers low-level performance settings. Tweaking this detailed configuration requires deep understanding of the architecture. Also, while the performance may be optimal for a specific combination of device and inferred model, the resulting configuration is not necessarily optimal for another device or model. The OpenVINO performance hints are a new way to configure performance with portability in mind.

The hints also "reverse" the direction of the configuration in the right fashion: rather than mapping the application needs to the low-level performance settings, and keeping the associated application logic to configure each possible device separately, the idea is to express a target scenario with a single config key and let the device configure itself in response. As the hints are supported by every OpenVINO device, this is a completely portable and future-proof solution.

Previously, a certain level of automatic configuration came from the default values of the parameters. For example, the number of CPU streams was deduced from the number of CPU cores when ov::streams::AUTO (CPU_THROUGHPUT_AUTO in the pre-OpenVINO 2.0 parlance) was set. However, the resulting number of streams did not account for the actual compute requirements of the model to be inferred. The hints, in contrast, respect the actual model, so the parameters for optimal throughput are calculated for each model individually (based on its compute versus memory bandwidth requirements and the capabilities of the device).

## Performance Hints: Latency and Throughput

As discussed in the Optimization Guide, there are a few different metrics associated with inference speed. Throughput and latency are some of the most critical factors that influence the overall performance of an application.

This is why, to ease the configuration of the device, OpenVINO offers two dedicated hints, namely ov::hint::PerformanceMode::THROUGHPUT and ov::hint::PerformanceMode::LATENCY. Every OpenVINO device supports these, which makes the configuration portable and future-proof. It also allows performance configuration that is fully compatible with automatic device selection. A special ov::hint::PerformanceMode::UNDEFINED acts the same as specifying no hint.

Please also see the last section in this document on conducting performance measurements with the benchmark_app.

Notice that if performance factors other than inference time, such as memory footprint and model load/compilation time, are of concern, a typical model may take significantly longer to load with ov::hint::PerformanceMode::THROUGHPUT and consume much more memory, compared to ov::hint::PerformanceMode::LATENCY.

## Performance Hints: How It Works

Internally, every device "translates" the value of the hint to the actual performance settings. For example, ov::hint::PerformanceMode::THROUGHPUT selects the number of CPU or GPU streams. For the GPU, the optimal batch size is additionally selected and automatic batching is applied whenever possible (and if the device supports it; refer to the devices/features support matrix).

The resulting (device-specific) settings can be queried back from the instance of the ov::CompiledModel.
Notice that the benchmark_app outputs the actual settings for the THROUGHPUT hint; see the bottom of the output example:

```sh
$ benchmark_app -hint tput -d CPU -m 'path to your favorite model'
...
[Step 8/11] Setting optimal runtime parameters
[ INFO ] Device: CPU
[ INFO ]   { PERFORMANCE_HINT , THROUGHPUT }
...
[ INFO ]   { OPTIMAL_NUMBER_OF_INFER_REQUESTS , 4 }
[ INFO ]   { NUM_STREAMS , 4 }
...
```

## Using the Performance Hints: Basic API

In the example code snippet below, ov::hint::PerformanceMode::THROUGHPUT is specified for the ov::hint::performance_mode property of compile_model: @sphinxdirective

.. tab:: C++

   .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
      :language: cpp
      :fragment: [compile_model]

.. tab:: Python

   .. doxygensnippet:: docs/snippets/ov_auto_batching.py
      :language: python
      :fragment: [compile_model]

@endsphinxdirective
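For readers without the snippet sources at hand, the Python variant boils down to roughly the following. This is a minimal sketch, not the exact snippet referenced above: it assumes an installed OpenVINO (2022.1+) Python API and a hypothetical model file `model.xml`, and uses the string property key as printed by the benchmark_app:

```python
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # hypothetical model path

# Let the device translate the THROUGHPUT hint into its own optimal settings.
compiled_model = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})
```

Passing `"LATENCY"` instead selects the latency-oriented configuration; no other code changes are required.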

## Additional (Optional) Hints from the App

Let's take the example of an application that processes 4 video streams. The most future-proof way to communicate the limitation of the parallel slack is to equip the performance hint with the optional ov::hint::num_requests configuration key set to 4. As discussed previously, this will limit the batch size for the GPU and the number of inference streams for the CPU, so each device uses ov::hint::num_requests while converting the hint to the actual device configuration options: @sphinxdirective

.. tab:: C++

   .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
      :language: cpp
      :fragment: [hint_num_requests]

.. tab:: Python

   .. doxygensnippet:: docs/snippets/ov_auto_batching.py
      :language: python
      :fragment: [hint_num_requests]

@endsphinxdirective
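In Python, the optional hint can be sketched as follows (an illustrative configuration fragment, assuming the OpenVINO Python API and a hypothetical `model.xml`; `PERFORMANCE_HINT_NUM_REQUESTS` is the string form of ov::hint::num_requests):

```python
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # hypothetical model path

# Communicate the parallel slack (4 video streams) along with the hint;
# the GPU caps its batch size and the CPU its stream count accordingly.
compiled_model = core.compile_model(
    model,
    "GPU",
    {"PERFORMANCE_HINT": "THROUGHPUT", "PERFORMANCE_HINT_NUM_REQUESTS": "4"},
)
```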

## Optimal Number of Inference Requests

Using the hints assumes that the application queries the ov::optimal_number_of_infer_requests to create and run the returned number of requests simultaneously: @sphinxdirective

.. tab:: C++

   .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
      :language: cpp
      :fragment: [query_optimal_num_requests]

.. tab:: Python

   .. doxygensnippet:: docs/snippets/ov_auto_batching.py
      :language: python
      :fragment: [query_optimal_num_requests]

@endsphinxdirective

While an application is free to create more requests if needed (for example, to support asynchronous inputs population), it is very important to run at least the ov::optimal_number_of_infer_requests inference requests in parallel, for efficiency (device utilization) reasons.

Also, notice that ov::hint::PerformanceMode::LATENCY does not necessarily imply using a single inference request. For example, multi-socket CPUs can deliver as many requests (at the same minimal latency) as the machine has NUMA nodes. To make your application fully scalable, prefer to query ov::optimal_number_of_infer_requests directly.
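Put together, sizing a request pool from this property might look as follows in Python (a sketch under the same assumptions as above: OpenVINO Python API installed, hypothetical `model.xml`):

```python
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # hypothetical model path
compiled_model = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})

# Query back the device-chosen parallelism and size the request pool to match.
nireq = compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
requests = [compiled_model.create_infer_request() for _ in range(nireq)]
```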

## Prefer Async API

The API of the inference requests offers Sync and Async execution. While ov::InferRequest::infer() is inherently synchronous and simple to operate (as it serializes the execution flow in the current application thread), the Async "splits" infer() into ov::InferRequest::start_async() and the use of ov::InferRequest::wait() (or callbacks). Please consider the API examples. Although the Synchronous API can be somewhat easier to start with, prefer the Asynchronous (callbacks-based) API in production code, as it is the most general and scalable way to implement flow control for any possible number of requests (and hence both latency and throughput scenarios).
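As an illustration of the callbacks-based flow, the Python API offers AsyncInferQueue, which manages a pool of requests for you. The sketch below assumes the OpenVINO (2022.1+) Python API, a hypothetical `model.xml` with a static input shape, and dummy zero-filled inputs standing in for real frames:

```python
import numpy as np
from openvino.runtime import AsyncInferQueue, Core

core = Core()
model = core.read_model("model.xml")  # hypothetical model path
compiled_model = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})

# With jobs=0, the queue sizes its request pool from the optimal number
# of inference requests reported by the device.
queue = AsyncInferQueue(compiled_model, 0)

def on_done(request, frame_id):
    # Runs on completion; collect outputs here instead of blocking per request.
    _ = request.get_output_tensor(0).data

queue.set_callback(on_done)
for frame_id in range(8):
    data = np.zeros(compiled_model.input(0).shape, dtype=np.float32)  # dummy frame
    queue.start_async({0: data}, userdata=frame_id)
queue.wait_all()
```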

## Combining the Hints and Individual Low-Level Settings

While sacrificing portability to some extent, it is possible to combine the hints with individual device-specific settings. For example, you can let the device prepare a configuration for ov::hint::PerformanceMode::THROUGHPUT while overriding any specific value:
@sphinxdirective

.. tab:: C++

   .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
      :language: cpp
      :fragment: [hint_plus_low_level]

.. tab:: Python

   .. doxygensnippet:: docs/snippets/ov_auto_batching.py
      :language: python
      :fragment: [hint_plus_low_level]

@endsphinxdirective
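In Python this combination amounts to passing both keys in one config dictionary (an illustrative fragment under the same assumptions as the earlier sketches; the explicit stream count of 5 is an arbitrary example value):

```python
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # hypothetical model path

# Start from the THROUGHPUT hint, but pin one specific low-level value
# (here the CPU stream count), giving up some portability in return.
compiled_model = core.compile_model(
    model,
    "CPU",
    {"PERFORMANCE_HINT": "THROUGHPUT", "NUM_STREAMS": "5"},
)
```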

## Testing the Performance of the Hints with the benchmark_app

The benchmark_app, which exists in both C++ and Python versions, is the best way to evaluate the performance of the performance hints for a particular device:

- `benchmark_app -hint tput -d 'device' -m 'path to your model'`
- `benchmark_app -hint latency -d 'device' -m 'path to your model'`
- Disabling the hints to emulate the pre-hints era (highly recommended before trying the individual low-level settings, such as the number of streams as below, threads, etc.):
  - `benchmark_app -hint none -nstreams 1 -d 'device' -m 'path to your model'`

## See Also

Supported Devices