DOCS shift to rst - Performance Hints (#16386)

This commit is contained in:
Sebastian Golebiewski
2023-03-20 12:39:31 +01:00
committed by GitHub
parent 350f8fd95b
commit 76e60ff258
5 changed files with 179 additions and 116 deletions

View File

@@ -1,16 +1,18 @@
# Model Caching Overview {#openvino_docs_OV_UG_Model_caching_overview}
As described in the [Integrate OpenVINO™ with Your Application](integrate_with_your_application.md), a common application flow consists of the following steps:
@sphinxdirective
As described in the :doc:`Integrate OpenVINO™ with Your Application <openvino_docs_OV_UG_Integrate_OV_with_your_application>`, a common application flow consists of the following steps:
1. **Create a Core object**: First step to manage available devices and read model objects
2. **Read the Intermediate Representation**: Read an Intermediate Representation file into an object of the `ov::Model`
2. **Read the Intermediate Representation**: Read an Intermediate Representation file into an object of the `ov::Model <classov_1_1Model.html#doxid-classov-1-1-model>`__
3. **Prepare inputs and outputs**: If needed, manipulate precision, memory layout, size or color format
4. **Set configuration**: Pass device-specific loading configurations to the device
5. **Compile and Load Network to device**: Use the `ov::Core::compile_model()` method with a specific device
5. **Compile and Load Network to device**: Use the `ov::Core::compile_model() <classov_1_1Core.html#doxid-classov-1-1-core-1a46555f0803e8c29524626be08e7f5c5a>`__ method with a specific device
6. **Set input data**: Specify input tensor
@@ -18,14 +20,14 @@ As described in the [Integrate OpenVINO™ with Your Application](integrate_with
Step 5 can potentially perform several time-consuming device-specific optimizations and network compilations,
and such delays can lead to a bad user experience on application startup. To avoid this, some devices offer
import/export network capability, and it is possible to either use the [Compile tool](../../tools/compile_tool/README.md)
import/export network capability, and it is possible to either use the :doc:`Compile tool <openvino_inference_engine_tools_compile_tool_README>`
or enable model caching to export the compiled model automatically. Reusing a cached model can significantly reduce the model compilation time.
### Set "cache_dir" config option to enable model caching
Set "cache_dir" config option to enable model caching
+++++++++++++++++++++++++++++++++++++++++++++++++++++
To enable model caching, the application must specify a folder to store cached blobs, which is done like this:
@sphinxdirective
.. tab:: C++
@@ -39,23 +41,24 @@ To enable model caching, the application must specify a folder to store cached b
:language: python
:fragment: [ov:caching:part0]
@endsphinxdirective
With this code, if the device specified by `device_name` supports import/export model capability, a cached blob is automatically created inside the `/path/to/cache/dir` folder.
With this code, if the device specified by ``device_name`` supports import/export model capability, a cached blob is automatically created inside the ``/path/to/cache/dir`` folder.
If the device does not support the import/export capability, the cache is not created and no error is thrown.
Depending on your device, the total time for compiling the model on application startup can be significantly reduced.
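As a reference point, here is a minimal Python sketch of the flow above; it is an illustration only, and the ``model.xml`` path and the ``CPU`` device are placeholders rather than part of the official snippets:

.. code-block:: python

   from openvino.runtime import Core

   core = Core()
   # Point the runtime at a folder where cached blobs will be stored.
   core.set_property({"CACHE_DIR": "/path/to/cache/dir"})

   model = core.read_model("model.xml")  # placeholder IR path
   # On devices with import/export support, the first call exports the blob
   # into the cache; later runs import it and compile noticeably faster.
   compiled_model = core.compile_model(model, "CPU")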
Also note that the very first `compile_model` (when cache is not yet created) takes slightly longer time to "export" the compiled blob into a cache file:
Also note that the very first ``compile_model`` call (when the cache is not yet created) takes a slightly longer time to "export" the compiled blob into a cache file:
![](../img/caching_enabled.svg)
### Even faster: use compile_model(modelPath)
.. image:: _static/images/caching_enabled.svg
Even faster: use compile_model(modelPath)
+++++++++++++++++++++++++++++++++++++++++
In some cases, applications do not need to customize inputs and outputs every time. Such applications always
call `model = core.read_model(...)`, then `core.compile_model(model, ..)` and it can be further optimized.
call ``model = core.read_model(...)``, then ``core.compile_model(model, ..)``, and this flow can be further optimized.
For these cases, there is a more convenient API to compile the model in a single call, skipping the read step:
@sphinxdirective
.. tab:: C++
@@ -69,11 +72,9 @@ For these cases, there is a more convenient API to compile the model in a single
:language: python
:fragment: [ov:caching:part1]
@endsphinxdirective
With model caching enabled, total load time is even smaller, if `read_model` is optimized as well.
With model caching enabled, the total load time is even smaller if ``read_model`` is optimized as well.
@sphinxdirective
.. tab:: C++
@@ -87,16 +88,15 @@ With model caching enabled, total load time is even smaller, if `read_model` is
:language: python
:fragment: [ov:caching:part2]
@endsphinxdirective
![](../img/caching_times.svg)
.. image:: _static/images/caching_times.svg
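A short Python sketch of this single-call flow, under the same assumptions as above (placeholder file name and device):

.. code-block:: python

   from openvino.runtime import Core

   core = Core()
   core.set_property({"CACHE_DIR": "/path/to/cache/dir"})
   # Passing the model path directly lets the runtime skip read_model()
   # entirely when a cached blob for this model is already present.
   compiled_model = core.compile_model("model.xml", "CPU")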
### Advanced Examples
Advanced Examples
++++++++++++++++++++
Not every device supports network import/export capability. For those that don't, enabling caching has no effect.
To check in advance if a particular device supports model caching, your application can use the following code:
@sphinxdirective
.. tab:: C++
@@ -110,8 +110,9 @@ To check in advance if a particular device supports model caching, your applicat
:language: python
:fragment: [ov:caching:part3]
@endsphinxdirective
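As a rough Python illustration of such a check (the property and capability strings below follow the public API, but treat the exact handling as an assumption):

.. code-block:: python

   from openvino.runtime import Core

   core = Core()
   device_name = "GPU"  # any device of interest
   # Devices able to import/export compiled blobs report the
   # "EXPORT_IMPORT" capability; only they benefit from CACHE_DIR.
   capabilities = core.get_property(device_name, "OPTIMIZATION_CAPABILITIES")
   caching_supported = "EXPORT_IMPORT" in capabilities
   print(f"{device_name} supports model caching: {caching_supported}")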
> **NOTE**: For GPU, model caching is currently implemented as a preview feature. Before it is fully supported, kernel caching can be used in the same manner: by setting the CACHE_DIR configuration key to a folder where the cache should be stored (see the [GPU plugin documentation](supported_plugins/GPU.md)).
> To activate the preview feature of model caching, set the OV_GPU_CACHE_MODEL environment variable to 1.
.. note::
For GPU, model caching is currently implemented as a preview feature. Before it is fully supported, kernel caching can be used in the same manner: by setting the ``CACHE_DIR`` configuration key to a folder where the cache should be stored (see the :doc:`GPU plugin documentation <openvino_docs_OV_UG_supported_plugins_GPU>`). To activate the preview feature of model caching, set the ``OV_GPU_CACHE_MODEL`` environment variable to 1.
@endsphinxdirective

View File

@@ -1,47 +1,56 @@
# High-level Performance Hints {#openvino_docs_OV_UG_Performance_Hints}
Even though all [supported devices](supported_plugins/Device_Plugins.md) in OpenVINO™ offer low-level performance settings, utilizing them is not recommended outside of very few cases.
The preferred way to configure performance in OpenVINO Runtime is using performance hints. This is a future-proof solution fully compatible with the [automatic device selection inference mode](./auto_device_selection.md) and designed with *portability* in mind.
@sphinxdirective
Even though all :doc:`supported devices <openvino_docs_OV_UG_Working_with_devices>` in OpenVINO™ offer low-level performance settings, utilizing them is not recommended outside of very few cases.
The preferred way to configure performance in OpenVINO Runtime is using performance hints. This is a future-proof solution fully compatible with the :doc:`automatic device selection inference mode <openvino_docs_OV_UG_supported_plugins_AUTO>` and designed with *portability* in mind.
The hints also set the direction of the configuration in the right order. Instead of mapping the application needs to the low-level performance settings, and keeping an associated application logic to configure each possible device separately, the hints express a target scenario with a single config key and let the *device* configure itself in response.
Previously, a certain level of automatic configuration was the result of the *default* values of the parameters. For example, the number of CPU streams was deduced from the number of CPU cores, when `ov::streams::AUTO` (`CPU_THROUGHPUT_AUTO` in the pre-API 2.0 terminology) was set. However, the resulting number of streams did not account for actual compute requirements of the model to be inferred.
Previously, a certain level of automatic configuration was the result of the *default* values of the parameters. For example, the number of CPU streams was deduced from the number of CPU cores, when `ov::streams::AUTO <groupov_runtime_cpp_prop_api.html#doxid-group-ov-runtime-cpp-prop-api-1gaddb29425af71fbb6ad3379c59342ff0e>`__ (``CPU_THROUGHPUT_AUTO`` in the pre-API 2.0 terminology) was set. However, the resulting number of streams did not account for actual compute requirements of the model to be inferred.
The hints, in contrast, respect the actual model, so the parameters for optimal throughput are calculated for each model individually (based on its compute versus memory bandwidth requirements and capabilities of the device).
## Performance Hints: Latency and Throughput
As discussed in the [Optimization Guide](../optimization_guide/dldt_deployment_optimization_guide.md) there are a few different metrics associated with inference speed.
Performance Hints: Latency and Throughput
#########################################
As discussed in the :doc:`Optimization Guide <openvino_docs_deployment_optimization_guide_dldt_optimization_guide>` there are a few different metrics associated with inference speed.
Throughput and latency are some of the most widely used metrics that measure the overall performance of an application.
Therefore, in order to ease the configuration of the device, OpenVINO offers two dedicated hints, namely `ov::hint::PerformanceMode::THROUGHPUT` and `ov::hint::PerformanceMode::LATENCY`.
A special `ov::hint::PerformanceMode::UNDEFINED` hint acts the same as specifying no hint.
Therefore, in order to ease the configuration of the device, OpenVINO offers two dedicated hints, namely `ov::hint::PerformanceMode::THROUGHPUT <enumov_1_1hint_1_1PerformanceMode.html#doxid-group-ov-runtime-cpp-prop-api-1gga032aa530efa40760b79af14913d48d73a50f9b1f40c078d242af7ec323ace44b3>`__ and `ov::hint::PerformanceMode::LATENCY <enumov_1_1hint_1_1PerformanceMode.html#doxid-group-ov-runtime-cpp-prop-api-1gga032aa530efa40760b79af14913d48d73a501069dd75f76384ba18f133fdce99c2>`__.
A special `ov::hint::PerformanceMode::UNDEFINED <enumov_1_1hint_1_1PerformanceMode.html#doxid-group-ov-runtime-cpp-prop-api-1gga032aa530efa40760b79af14913d48d73a0db45d2a4141101bdfe48e3314cfbca3>`__ hint acts the same as specifying no hint.
For more information on conducting performance measurements with the `benchmark_app`, refer to the last section in this document.
For more information on conducting performance measurements with the ``benchmark_app``, refer to the last section in this document.
Keep in mind that a typical model may take significantly more time to load with the `ov::hint::PerformanceMode::THROUGHPUT` and consume much more memory, compared to the `ov::hint::PerformanceMode::LATENCY`.
Keep in mind that a typical model may take significantly more time to load with the ``ov::hint::PerformanceMode::THROUGHPUT`` hint and consume much more memory, compared to the ``ov::hint::PerformanceMode::LATENCY`` hint.
Performance Hints: How It Works
###############################
## Performance Hints: How It Works
Internally, every device "translates" the value of the hint to the actual performance settings.
For example, the `ov::hint::PerformanceMode::THROUGHPUT` selects the number of CPU or GPU streams.
Additionally, the optimal batch size is selected for the GPU and the [automatic batching](../OV_Runtime_UG/automatic_batching.md) is applied whenever possible. To check whether the device supports it, refer to the [devices/features support matrix](./supported_plugins/Device_Plugins.md) article.
For example, the ``ov::hint::PerformanceMode::THROUGHPUT`` selects the number of CPU or GPU streams.
Additionally, the optimal batch size is selected for the GPU and the :doc:`automatic batching <openvino_docs_OV_UG_Automatic_Batching>` is applied whenever possible. To check whether the device supports it, refer to the :doc:`devices/features support matrix <openvino_docs_OV_UG_Working_with_devices>` article.
The resulting (device-specific) settings can be queried back from the instance of the `ov:Compiled_Model`.
Be aware that the `benchmark_app` outputs the actual settings for the `THROUGHPUT` hint. See the example of the output below:
The resulting (device-specific) settings can be queried back from the instance of the ``ov::CompiledModel``.
Be aware that the ``benchmark_app`` outputs the actual settings for the ``THROUGHPUT`` hint. See the example of the output below:
```
$benchmark_app -hint tput -d CPU -m 'path to your favorite model'
...
[Step 8/11] Setting optimal runtime parameters
[ INFO ] Device: CPU
[ INFO ] { PERFORMANCE_HINT , THROUGHPUT }
...
[ INFO ] { OPTIMAL_NUMBER_OF_INFER_REQUESTS , 4 }
[ INFO ] { NUM_STREAMS , 4 }
...
```
.. code-block:: sh
$benchmark_app -hint tput -d CPU -m 'path to your favorite model'
...
[Step 8/11] Setting optimal runtime parameters
[ INFO ] Device: CPU
[ INFO ] { PERFORMANCE_HINT , THROUGHPUT }
...
[ INFO ] { OPTIMAL_NUMBER_OF_INFER_REQUESTS , 4 }
[ INFO ] { NUM_STREAMS , 4 }
...
Using the Performance Hints: Basic API
######################################
In the example code snippet below, ``ov::hint::PerformanceMode::THROUGHPUT`` is specified for the ``ov::hint::performance_mode`` property for ``compile_model``:
## Using the Performance Hints: Basic API
In the example code snippet below, `ov::hint::PerformanceMode::THROUGHPUT` is specified for the `ov::hint::performance_mode` property for `compile_model`:
@sphinxdirective
.. tab:: C++
@@ -55,12 +64,13 @@ In the example code snippet below, `ov::hint::PerformanceMode::THROUGHPUT` is sp
:language: python
:fragment: [compile_model]
@endsphinxdirective
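For reference, a minimal Python sketch of this call, using the string form of the property key (the IR path and device are placeholders):

.. code-block:: python

   from openvino.runtime import Core

   core = Core()
   model = core.read_model("model.xml")  # placeholder IR path
   # A single high-level hint; the device derives streams, batch size, etc.
   compiled_model = core.compile_model(
       model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"}
   )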
## Additional (Optional) Hints from the App
For an application that processes 4 video streams, the most future-proof way to communicate the limitation of the parallel slack is to equip the performance hint with the optional `ov::hint::num_requests` configuration key set to 4.
As mentioned earlier, this will limit the batch size for the GPU and the number of inference streams for the CPU. Thus, each device uses the `ov::hint::num_requests` while converting the hint to the actual device configuration options:
@sphinxdirective
Additional (Optional) Hints from the App
########################################
For an application that processes 4 video streams, the most future-proof way to communicate the limitation of the parallel slack is to equip the performance hint with the optional ``ov::hint::num_requests`` configuration key set to 4.
As mentioned earlier, this will limit the batch size for the GPU and the number of inference streams for the CPU. Thus, each device uses the ``ov::hint::num_requests`` while converting the hint to the actual device configuration options:
.. tab:: C++
@@ -74,11 +84,12 @@ As mentioned earlier, this will limit the batch size for the GPU and the number
:language: python
:fragment: [hint_num_requests]
@endsphinxdirective
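A hedged Python sketch of such a configuration; the ``PERFORMANCE_HINT_NUM_REQUESTS`` key string is the assumed string form of ``ov::hint::num_requests``, and the ``AUTO`` device is used only for illustration:

.. code-block:: python

   from openvino.runtime import Core

   core = Core()
   model = core.read_model("model.xml")  # placeholder IR path
   # Tell the device that at most 4 requests will run in parallel, so it can
   # cap the number of CPU streams or the GPU batch size accordingly.
   config = {
       "PERFORMANCE_HINT": "THROUGHPUT",
       "PERFORMANCE_HINT_NUM_REQUESTS": "4",
   }
   compiled_model = core.compile_model(model, "AUTO", config)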
## Optimal Number of Inference Requests
The hints are used on the presumption that the application queries `ov::optimal_number_of_infer_requests` to create and run the returned number of requests simultaneously:
@sphinxdirective
Optimal Number of Inference Requests
####################################
The hints are used on the presumption that the application queries ``ov::optimal_number_of_infer_requests`` to create and run the returned number of requests simultaneously:
.. tab:: C++
@@ -92,21 +103,24 @@ The hints are used on the presumption that the application queries `ov::optimal_
:language: python
:fragment: [query_optimal_num_requests]
@endsphinxdirective
While an application is free to create more requests if needed (for example to support asynchronous inputs population) **it is very important to at least run the `ov::optimal_number_of_infer_requests` of the inference requests in parallel**. It is recommended for efficiency, or device utilization, reasons.
While an application is free to create more requests if needed (for example, to support asynchronous inputs population), **it is very important to run at least the ``ov::optimal_number_of_infer_requests`` of inference requests in parallel**. This is recommended for efficiency (device utilization) reasons.
Keep in mind that `ov::hint::PerformanceMode::LATENCY` does not necessarily imply using single inference request. For example, multi-socket CPUs can deliver as many requests at the same minimal latency as the number of NUMA nodes in the system.
To make your application fully scalable, make sure to query the `ov::optimal_number_of_infer_requests` directly.
Keep in mind that ``ov::hint::PerformanceMode::LATENCY`` does not necessarily imply using a single inference request. For example, multi-socket CPUs can deliver as many requests at the same minimal latency as the number of NUMA nodes in the system.
To make your application fully scalable, make sure to query the ``ov::optimal_number_of_infer_requests`` directly.
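A minimal Python sketch of that query (variable names are illustrative):

.. code-block:: python

   from openvino.runtime import Core

   core = Core()
   model = core.read_model("model.xml")  # placeholder IR path
   compiled_model = core.compile_model(
       model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"}
   )
   # Ask how many requests the device expects to run in parallel
   # and create exactly that many.
   nireq = compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
   requests = [compiled_model.create_infer_request() for _ in range(nireq)]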
Prefer Async API
################
The API of the inference requests offers Sync and Async execution. The ``ov::InferRequest::infer()`` is inherently synchronous and simple to operate (as it serializes the execution flow in the current application thread). The Async approach "splits" the ``infer()`` into ``ov::InferRequest::start_async()`` and ``ov::InferRequest::wait()`` (or callbacks). For more information, refer to the :doc:`API examples <openvino_docs_OV_UG_Infer_request>`.
Although the Synchronous API can be somewhat easier to start with, it is recommended to use the Asynchronous (callbacks-based) API in the production code. It is the most general and scalable way to implement the flow control for any possible number of requests (and thus both latency and throughput scenarios).
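As a rough Python counterpart of this advice, the ``AsyncInferQueue`` helper wraps the callback-based flow; the input shape, the callback body and the number of iterations below are placeholders:

.. code-block:: python

   import numpy as np
   from openvino.runtime import AsyncInferQueue, Core

   core = Core()
   compiled_model = core.compile_model(
       "model.xml", "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"}
   )
   nireq = compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")

   # One queue slot per optimal request; the callback fires on completion.
   infer_queue = AsyncInferQueue(compiled_model, nireq)
   infer_queue.set_callback(lambda request, userdata: print("finished job", userdata))

   for i in range(16):  # placeholder number of inputs
       data = np.zeros((1, 3, 224, 224), dtype=np.float32)  # placeholder input
       infer_queue.start_async({0: data}, userdata=i)
   infer_queue.wait_all()

Compared to calling ``infer()`` in a loop, such a queue keeps all streams busy without blocking the application thread.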
Combining the Hints and Individual Low-Level Settings
#####################################################
## Prefer Async API
The API of the inference requests offers Sync and Async execution. The `ov::InferRequest::infer()` is inherently synchronous and simple to operate (as it serializes the execution flow in the current application thread). The Async "splits" the `infer()` into `ov::InferRequest::start_async()` and `ov::InferRequest::wait()` (or callbacks). For more information, refer to the [API examples](../OV_Runtime_UG/ov_infer_request.md).
Although the Synchronous API can be somewhat easier to start with, it is recommended to use the Asynchronous (callbacks-based) API in the production code. It is the most general and scalable way to implement the flow control for any possible number of requests (and thus both latency and throughput scenarios).
## Combining the Hints and Individual Low-Level Settings
While sacrificing the portability to some extent, it is possible to combine the hints with individual device-specific settings.
For example, use `ov::hint::PerformanceMode::THROUGHPUT` to prepare a general configuration and override any of its specific values:
@sphinxdirective
For example, use ``ov::hint::PerformanceMode::THROUGHPUT`` to prepare a general configuration and override any of its specific values:
.. tab:: C++
@@ -121,15 +135,22 @@ For example, use `ov::hint::PerformanceMode::THROUGHPUT` to prepare a general co
:fragment: [hint_plus_low_level]
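A hedged Python sketch of such a combination; the ``NUM_STREAMS`` value is an arbitrary example of a device-specific override on top of the hint:

.. code-block:: python

   from openvino.runtime import Core

   core = Core()
   model = core.read_model("model.xml")  # placeholder IR path
   # Start from the high-level hint, then pin one low-level knob explicitly.
   config = {
       "PERFORMANCE_HINT": "THROUGHPUT",
       "NUM_STREAMS": "4",
   }
   compiled_model = core.compile_model(model, "CPU", config)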
Testing Performance of the Hints with the Benchmark_App
#######################################################
The ``benchmark_app``, which exists in both :doc:`C++ <openvino_inference_engine_samples_benchmark_app_README>` and :doc:`Python <openvino_inference_engine_tools_benchmark_tool_README>` versions, is the best way to evaluate the performance hints for a particular device:
* benchmark_app **-hint tput** -d 'device' -m 'path to your model'
* benchmark_app **-hint latency** -d 'device' -m 'path to your model'
To disable the hints and emulate the pre-hints era (highly recommended before trying individual low-level settings, such as the number of streams, threads, etc.):
* benchmark_app **-hint none -nstreams 1** -d 'device' -m 'path to your model'
Additional Resources
####################
* :doc:`Supported Devices <openvino_docs_OV_UG_Working_with_devices>`
@endsphinxdirective
## Testing Performance of the Hints with the Benchmark_App
The `benchmark_app`, that exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md) versions, is the best way to evaluate the functionality of the performance hints for a particular device:
- benchmark_app **-hint tput** -d 'device' -m 'path to your model'
- benchmark_app **-hint latency** -d 'device' -m 'path to your model'
- Disabling the hints to emulate the pre-hints era (highly recommended before trying the individual low-level settings, such as the number of streams as below, threads, etc):
- - benchmark_app **-hint none -nstreams 1** -d 'device' -m 'path to your model'
### Additional Resources
* [Supported Devices](./supported_plugins/Supported_Devices.md)

View File

@@ -1,42 +1,62 @@
# Using Advanced Throughput Options: Streams and Batching {#openvino_docs_deployment_optimization_guide_tput_advanced}
## OpenVINO Streams
As explained in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common), running multiple inference requests asynchronously is important for general application efficiency.
@sphinxdirective
OpenVINO Streams
####################
As explained in the :doc:`common-optimizations section <openvino_docs_deployment_optimization_guide_common>`, running multiple inference requests asynchronously is important for general application efficiency.
Internally, every device implements a queue, which acts as a buffer, storing the inference requests until retrieved by the device at its own pace.
The devices may actually process multiple inference requests in parallel in order to improve the device utilization and overall throughput.
This configurable means of device-side parallelism is commonly referred to as **streams**.
> **NOTE**: Be aware that streams are **really executing the requests in parallel, but not in the lock step** (as the batching does), which makes the streams fully compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md), while individual requests can have different shapes.
.. note::
> **NOTE**: Most OpenVINO devices (including CPU and GPU) support the streams, yet the *optimal* number of the streams is deduced very differently. More information on this topic can be found in the section [below](@ref stream_considerations).
Be aware that streams are **really executing the requests in parallel, but not in the lock step** (as the batching does), which makes the streams fully compatible with :doc:`dynamically-shaped inputs <openvino_docs_OV_UG_DynamicShapes>`, while individual requests can have different shapes.
.. note::
Most OpenVINO devices (including CPU and GPU) support the streams, yet the *optimal* number of the streams is deduced very differently. More information on this topic can be found in the section `below <#number-of-streams-considerations>`__.
A few general considerations:
* Using the streams does increase the latency of an individual request:
* When the number of streams is not specified, a device creates a bare minimum of streams (usually, just one), as the latency-oriented case is default.
* See further tips for the optimal number of the streams [below](@ref throughput_advanced).
* When the number of streams is not specified, a device creates a bare minimum of streams (usually, just one), as the latency-oriented case is the default.
* See further tips for the optimal number of the streams `below <#choosing-the-number-of-streams-and-or-batch-size>`__.
* Streams are memory-intensive, as every stream duplicates the intermediate buffers to do inference in parallel to the rest of the streams:
* Always prefer streams over creating multiple `ov:Compiled_Model` instances for the same model, as weights memory is shared across streams, reducing the memory consumption.
* Always prefer streams over creating multiple ``ov::CompiledModel`` instances for the same model, as weights memory is shared across streams, reducing the memory consumption.
* Keep in mind that the streams also inflate the model load (compilation) time.
For efficient asynchronous execution, the streams handle the inference with a dedicated pool of threads (one thread per stream).
Each time you start inference requests (potentially from different application threads), they are actually muxed into an inference queue of the particular `ov:Compiled_Model`.
Each time you start inference requests (potentially from different application threads), they are muxed into an inference queue of the particular ``ov::CompiledModel``.
If there is a vacant stream, it pulls the request from the queue and dispatches it for on-device execution.
There are further device-specific details, like for the CPU, in the [internals](dldt_deployment_optimization_internals.md) section.
There are further device-specific details, like for the CPU, in the :doc:`internals <openvino_docs_deployment_optimization_guide_internals>` section.
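As an illustration, a hedged Python sketch of setting the number of streams explicitly (the value ``4`` is arbitrary; the more portable ``ov::streams::AUTO`` discussed below is usually preferable):

.. code-block:: python

   from openvino.runtime import Core

   core = Core()
   model = core.read_model("model.xml")  # placeholder IR path
   # Ask the CPU plugin for 4 execution streams; requests started from any
   # application thread are queued and picked up by a vacant stream.
   compiled_model = core.compile_model(model, "CPU", {"NUM_STREAMS": "4"})
   nireq = compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")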
Batching
####################
## Batching
Hardware accelerators such as GPUs are optimized for massive compute parallelism, so batching helps to saturate the device and leads to higher throughput.
While the streams (described in previous section) already help to hide the communication overheads and certain bubbles in the scheduling, running multiple OpenCL kernels simultaneously is less GPU-efficient compared to calling a kernel on the multiple inputs at once.
While the streams (described in previous section) already help to hide the communication overheads and certain bubbles in the scheduling, running multiple OpenCL kernels simultaneously is less GPU-efficient compared to calling a kernel on the multiple inputs at once.
As explained in the next section, batching is a must to achieve maximum throughput on the GPU.
There are several primary methods of using the batching to help application performance:
* Collecting the inputs explicitly on the application side and then **sending the batch requests to OpenVINO**:
* Although this gives flexibility with the possible batching strategies, the approach requires redesigning the application logic.
* **Sending individual requests**, while configuring OpenVINO to collect and perform inference on the requests in batch [automatically](../OV_Runtime_UG/automatic_batching.md).
* Although this gives flexibility with the possible batching strategies, the approach requires redesigning the application logic.
* **Sending individual requests**, while configuring OpenVINO to collect and perform inference on the requests in batch :doc:`automatically <openvino_docs_OV_UG_Automatic_Batching>`.
In both cases, the optimal batch size is very device-specific. As explained below, the optimal batch size also depends on the model, inference precision and other factors.
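A hedged Python sketch of the second option (sending individual requests while OpenVINO batches them automatically); the ``BATCH:GPU(4)`` device string and the batch size are illustrative and assume the automatic batching plugin is available:

.. code-block:: python

   from openvino.runtime import Core

   core = Core()
   model = core.read_model("model.xml")  # placeholder IR path

   # Variant A: send individual requests and let the THROUGHPUT hint
   # enable automatic batching (and streams) for the GPU.
   compiled_hint = core.compile_model(
       model, "GPU", {"PERFORMANCE_HINT": "THROUGHPUT"}
   )

   # Variant B: request automatic batching explicitly via the BATCH
   # meta-device, here with a batch size of 4.
   compiled_batch = core.compile_model(model, "BATCH:GPU(4)")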
@anchor throughput_advanced
## Choosing the Number of Streams and/or Batch Size
Choosing the Number of Streams and/or Batch Size
################################################
Predicting the inference performance is difficult and finding optimal execution parameters requires direct experiments with measurements.
Run performance testing in the scope of development, and make sure to validate overall (*end-to-end*) application performance.
@@ -46,33 +66,54 @@ In some cases, combination of streams and batching may be required to maximize t
One possible throughput optimization strategy is to **set an upper bound for latency and then increase the batch size and/or number of the streams until that tail latency is met (or the throughput is not growing anymore)**.
> **NOTE**: When playing with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md), use only the streams (no batching), as they tolerate individual requests having different shapes.
.. note::
> **NOTE**: Using the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) is the alternative, portable and future-proof option, allowing OpenVINO to find the best combination of streams and batching for a given scenario and a model.
When playing with :doc:`dynamically-shaped inputs <openvino_docs_OV_UG_DynamicShapes>`, use only the streams (no batching), as they tolerate individual requests having different shapes.
.. note::
Using the :doc:`High-Level Performance Hints <openvino_docs_OV_UG_Performance_Hints>` is the alternative, portable and future-proof option, allowing OpenVINO to find the best combination of streams and batching for a given scenario and a model.
Number of Streams Considerations
++++++++++++++++++++++++++++++++
@anchor stream_considerations
### Number of Streams Considerations
* Select the number of streams that is **less than or equal** to the number of requests that the application would be able to run simultaneously.
* To avoid wasting resources, the number of streams should be enough to meet the *average* parallel slack rather than the peak load.
* Use the `ov::streams::AUTO` as a more portable option (that also respects the underlying hardware configuration).
* Use the `ov::streams::AUTO <groupov_runtime_cpp_prop_api.html#doxid-group-ov-runtime-cpp-prop-api-1gaddb29425af71fbb6ad3379c59342ff0e>`__ as a more portable option (that also respects the underlying hardware configuration).
* It is very important to keep these streams busy, by running as many inference requests as possible (for example, start the newly-arrived inputs immediately):
* A bare minimum of requests to saturate the device can be queried as the `ov::optimal_number_of_infer_requests` of the `ov:Compiled_Model`.
* *The maximum number of streams* for the device (per model) can be queried as the `ov::range_for_streams`.
### Batch Size Considerations
* A bare minimum of requests to saturate the device can be queried as the `ov::optimal_number_of_infer_requests <groupov_runtime_cpp_prop_api.html#doxid-group-ov-runtime-cpp-prop-api-1ga087c6da667f7c3d8374aec5f6cbba027>`__ of the ``ov::CompiledModel`` (see the sketch after this list).
* *The maximum number of streams* for the device (per model) can be queried as the `ov::range_for_streams <groupov_runtime_cpp_prop_api.html#doxid-group-ov-runtime-cpp-prop-api-1ga8a5d84196f6873729167aa512c34a94a>`__.
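A hedged Python sketch of these queries, using the string forms of the property keys (device and IR path are placeholders):

.. code-block:: python

   from openvino.runtime import Core

   core = Core()
   model = core.read_model("model.xml")  # placeholder IR path
   compiled_model = core.compile_model(
       model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"}
   )
   # Bare minimum of requests needed to saturate the device with this model.
   nireq = compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
   # Range of stream counts supported by the device.
   streams_range = core.get_property("CPU", "RANGE_FOR_STREAMS")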
Batch Size Considerations
+++++++++++++++++++++++++
* Select the batch size that is **equal** to the number of requests that your application is able to run simultaneously:
* Otherwise (or if the number of "available" requests fluctuates), you may need to keep several instances of the network (reshaped to the different batch size) and select the properly sized instance in the runtime accordingly.
* For OpenVINO devices that implement a dedicated heuristic internally, the `ov::optimal_batch_size` is a *device* property (that accepts the actual model as a parameter) to query the recommended batch size for the model.
* Otherwise (or if the number of "available" requests fluctuates), you may need to keep several instances of the network (reshaped to different batch sizes) and select the properly sized instance at runtime.
* For OpenVINO devices that implement a dedicated heuristic internally, the `ov::optimal_batch_size <groupov_runtime_cpp_prop_api.html#doxid-group-ov-runtime-cpp-prop-api-1ga129bad2da2fc2a40a7d746d86fc9c68d>`__ is a *device* property (that accepts the actual model as a parameter) to query the recommended batch size for the model.
### A Few Device-specific Details
A Few Device-specific Details
+++++++++++++++++++++++++++++
* For the **GPU**:
* When the parallel slack is small, for example, only 2-4 requests executed simultaneously, then using only the streams for the GPU may suffice:
* The GPU runs 2 requests per stream, so 4 requests can be served by 2 streams.
* Alternatively, consider a single stream with 2 requests (each with a small batch size like 2), which would total the same 4 inputs in flight.
* Typically, for 4 and more requests the batching delivers better throughput.
* A batch size can be calculated as "a number of inference requests executed in parallel" divided by the "number of requests that the streams consume":
* For example, if you process 16 cameras (by 16 requests inferenced *simultaneously*) by 2 GPU streams (each can process two requests), the batch size per request is 16/(2*2)=4.
* When the parallel slack is small, for example, only 2-4 requests executed simultaneously, then using only the streams for the GPU may suffice:
* The GPU runs 2 requests per stream, so 4 requests can be served by 2 streams.
* Alternatively, consider a single stream with 2 requests (each with a small batch size like 2), which would total the same 4 inputs in flight.
* Typically, for 4 and more requests the batching delivers better throughput.
* A batch size can be calculated as "a number of inference requests executed in parallel" divided by the "number of requests that the streams consume":
* For example, if you process 16 cameras (by 16 requests inferenced *simultaneously*) by 2 GPU streams (each can process two requests), the batch size per request is 16/(2*2)=4.
* For the **CPU, always use the streams first!**:
* On high-end CPUs, using moderate (2-8) batch size *in addition* to the maximum number of streams may further improve the performance.
* On high-end CPUs, using a moderate (2-8) batch size *in addition* to the maximum number of streams may further improve the performance.
@endsphinxdirective