Files
openvino/docs/OV_Runtime_UG/performance_hints.md
Maciej Smyk 9df5856849 DOCS: Fix for Runtime Inference - master (#13551)
Port from #13549

Fixed the following issues:

Switching between C++ and Python docs for "Shape Inference",
Removed repetitions,
Quote background in bullet list at the beginning of "Multi-device execution",
Broken note directives,
Fixed video player size in "Inference with OpenVINO Runtime",
Standardize Additional Resources throughout Runtime Inference.
2022-10-25 12:41:33 +02:00

136 lines
8.1 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# High-level Performance Hints {#openvino_docs_OV_UG_Performance_Hints}
Even though all [supported devices](supported_plugins/Device_Plugins.md) in OpenVINO™ offer low-level performance settings, utilizing them is not recommended outside of very few cases.
The preferred way to configure performance in OpenVINO Runtime is using performance hints. This is a future-proof solution fully compatible with the [automatic device selection inference mode](./auto_device_selection.md) and designed with *portability* in mind.
The hints also set the direction of the configuration in the right order. Instead of mapping the application needs to the low-level performance settings, and keeping an associated application logic to configure each possible device separately, the hints express a target scenario with a single config key and let the *device* configure itself in response.
Previously, a certain level of automatic configuration was the result of the *default* values of the parameters. For example, the number of CPU streams was deduced from the number of CPU cores, when `ov::streams::AUTO` (`CPU_THROUGHPUT_AUTO` in the pre-API 2.0 terminology) was set. However, the resulting number of streams did not account for actual compute requirements of the model to be inferred.
The hints, in contrast, respect the actual model, so the parameters for optimal throughput are calculated for each model individually (based on its compute versus memory bandwidth requirements and capabilities of the device).
## Performance Hints: Latency and Throughput
As discussed in the [Optimization Guide](../optimization_guide/dldt_optimization_guide.md) there are a few different metrics associated with inference speed.
Throughput and latency are some of the most widely used metrics that measure the overall performance of an application.
Therefore, in order to ease the configuration of the device, OpenVINO offers two dedicated hints, namely `ov::hint::PerformanceMode::THROUGHPUT` and `ov::hint::PerformanceMode::LATENCY`.
A special `ov::hint::PerformanceMode::UNDEFINED` hint acts the same as specifying no hint.
For more information on conducting performance measurements with the `benchmark_app`, refer to the last section in this document.
Keep in mind that a typical model may take significantly more time to load with the `ov::hint::PerformanceMode::THROUGHPUT` and consume much more memory, compared to the `ov::hint::PerformanceMode::LATENCY`.
## Performance Hints: How It Works
Internally, every device "translates" the value of the hint to the actual performance settings.
For example, the `ov::hint::PerformanceMode::THROUGHPUT` selects the number of CPU or GPU streams.
Additionally, the optimal batch size is selected for the GPU and the [automatic batching](../OV_Runtime_UG/automatic_batching.md) is applied whenever possible. To check whether the device supports it, refer to the [devices/features support matrix](./supported_plugins/Device_Plugins.md) article.
The resulting (device-specific) settings can be queried back from the instance of the `ov:Compiled_Model`.
Be aware that the `benchmark_app` outputs the actual settings for the `THROUGHPUT` hint. See the example of the output below:
```
$benchmark_app -hint tput -d CPU -m 'path to your favorite model'
...
[Step 8/11] Setting optimal runtime parameters
[ INFO ] Device: CPU
[ INFO ] { PERFORMANCE_HINT , THROUGHPUT }
...
[ INFO ] { OPTIMAL_NUMBER_OF_INFER_REQUESTS , 4 }
[ INFO ] { NUM_STREAMS , 4 }
...
```
## Using the Performance Hints: Basic API
In the example code snippet below, `ov::hint::PerformanceMode::THROUGHPUT` is specified for the `ov::hint::performance_mode` property for `compile_model`:
@sphinxdirective
.. tab:: C++
.. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
:language: cpp
:fragment: [compile_model]
.. tab:: Python
.. doxygensnippet:: docs/snippets/ov_auto_batching.py
:language: python
:fragment: [compile_model]
@endsphinxdirective
## Additional (Optional) Hints from the App
For an application that processes 4 video streams, the most future-proof way to communicate the limitation of the parallel slack is to equip the performance hint with the optional `ov::hint::num_requests` configuration key set to 4.
As mentioned earlier, this will limit the batch size for the GPU and the number of inference streams for the CPU. Thus, each device uses the `ov::hint::num_requests` while converting the hint to the actual device configuration options:
@sphinxdirective
.. tab:: C++
.. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
:language: cpp
:fragment: [hint_num_requests]
.. tab:: Python
.. doxygensnippet:: docs/snippets/ov_auto_batching.py
:language: python
:fragment: [hint_num_requests]
@endsphinxdirective
## Optimal Number of Inference Requests
The hints are used on the presumption that the application queries `ov::optimal_number_of_infer_requests` to create and run the returned number of requests simultaneously:
@sphinxdirective
.. tab:: C++
.. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
:language: cpp
:fragment: [query_optimal_num_requests]
.. tab:: Python
.. doxygensnippet:: docs/snippets/ov_auto_batching.py
:language: python
:fragment: [query_optimal_num_requests]
@endsphinxdirective
While an application is free to create more requests if needed (for example to support asynchronous inputs population) **it is very important to at least run the `ov::optimal_number_of_infer_requests` of the inference requests in parallel**. It is recommended for efficiency, or device utilization, reasons.
Keep in mind that `ov::hint::PerformanceMode::LATENCY` does not necessarily imply using single inference request. For example, multi-socket CPUs can deliver as many requests at the same minimal latency as the number of NUMA nodes in the system.
To make your application fully scalable, make sure to query the `ov::optimal_number_of_infer_requests` directly.
## Prefer Async API
The API of the inference requests offers Sync and Async execution. The `ov::InferRequest::infer()` is inherently synchronous and simple to operate (as it serializes the execution flow in the current application thread). The Async "splits" the `infer()` into `ov::InferRequest::start_async()` and `ov::InferRequest::wait()` (or callbacks). For more information, refer to the [API examples](../OV_Runtime_UG/ov_infer_request.md).
Although the Synchronous API can be somewhat easier to start with, it is recommended to use the Asynchronous (callbacks-based) API in the production code. It is the most general and scalable way to implement the flow control for any possible number of requests (and thus both latency and throughput scenarios).
## Combining the Hints and Individual Low-Level Settings
While sacrificing the portability to some extent, it is possible to combine the hints with individual device-specific settings.
For example, use `ov::hint::PerformanceMode::THROUGHPUT` to prepare a general configuration and override any of its specific values:
@sphinxdirective
.. tab:: C++
.. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
:language: cpp
:fragment: [hint_plus_low_level]
.. tab:: Python
.. doxygensnippet:: docs/snippets/ov_auto_batching.py
:language: python
:fragment: [hint_plus_low_level]
@endsphinxdirective
## Testing Performance of the Hints with the Benchmark_App
The `benchmark_app`, that exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md) versions, is the best way to evaluate the functionality of the performance hints for a particular device:
- benchmark_app **-hint tput** -d 'device' -m 'path to your model'
- benchmark_app **-hint latency** -d 'device' -m 'path to your model'
- Disabling the hints to emulate the pre-hints era (highly recommended before trying the individual low-level settings, such as the number of streams as below, threads, etc):
- - benchmark_app **-hint none -nstreams 1** -d 'device' -m 'path to your model'
### Additional Resources
* [Supported Devices](./supported_plugins/Supported_Devices.md)