cherry-picking opt guide changes from the release branch (#11430)

Maxim Shevtsov 2022-04-04 19:41:17 +03:00 committed by GitHub
parent 417d75d80b
commit 500d36e1c0
7 changed files with 148 additions and 99 deletions

View File

@ -1,6 +1,9 @@
# General Optimizations {#openvino_docs_deployment_optimization_guide_common}
This chapter covers application-level optimization techniques, such as asynchronous execution to improve data pipelining, pre-processing acceleration, and so on.
While some of these techniques (e.g. pre-processing) can be specific to end-user applications, the associated performance improvements are general and should benefit any target scenario (both latency and throughput).
## Inputs Pre-Processing with OpenVINO
In many cases, a network expects a pre-processed image, so make sure you do not perform unnecessary steps in your code:
- Model Optimizer can efficiently bake the mean and normalization (scale) values into the model (for example, into the weights of the first convolution). Please see the [relevant Model Optimizer command-line options](../MO_DG/prepare_model/Additional_Optimizations.md). A runtime alternative is sketched below.
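If the mean/scale values were not baked in at conversion time, the same steps can be moved into the model at runtime with the `ov::preprocess` API, so they run on the target device rather than in the application code. A minimal sketch, assuming a single-input model; the file name and the per-channel values are placeholders:

```cpp
// A sketch, not the only way: the mean/scale steps become part of the compiled model graph.
#include <openvino/core/preprocess/pre_post_process.hpp>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");            // placeholder model

    ov::preprocess::PrePostProcessor ppp(model);           // single-input model assumed
    ppp.input().preprocess()
        .mean({123.675f, 116.28f, 103.53f})                // example per-channel mean values
        .scale({58.395f, 57.12f, 57.375f});                // example per-channel scale values
    model = ppp.build();                                   // bake the steps into the model

    auto compiled = core.compile_model(model, "CPU");
    return 0;
}
```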

View File

@ -9,36 +9,53 @@
openvino_docs_deployment_optimization_guide_common
openvino_docs_deployment_optimization_guide_latency
openvino_docs_deployment_optimization_guide_tput
openvino_docs_deployment_optimization_guide_tput_advanced
openvino_docs_deployment_optimization_guide_internals
@endsphinxdirective
## Deployment Optimizations Overview {#openvino_docs_deployment_optimization_guide_overview}
Runtime or deployment optimizations are focused on tuning the inference _parameters_ (e.g. the optimal number of requests executed simultaneously) and other means of how a model is _executed_.
As referenced in the parent [performance introduction topic](./dldt_optimization_guide.md), the [dedicated document](./model_optimization_guide.md) covers **model-level optimizations** like quantization, which unlocks the 8-bit inference. Model optimizations are the most general and help any scenario and any device (that, e.g., accelerates the quantized models). The relevant _runtime_ configuration is `ov::hint::inference_precision`, which trades accuracy for performance (e.g. by allowing fp16/bf16 execution for the layers that remain in fp32 after quantization of the original fp32 model).
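For illustration, a minimal sketch of setting this hint at compilation time; the model file, the CPU device and the bf16 choice are assumptions (pick a precision the target device actually supports):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");             // placeholder model
    // Allow the layers left in fp32 after quantization to run in bf16:
    auto compiled = core.compile_model(model, "CPU",
        ov::hint::inference_precision(ov::element::bf16));
    return 0;
}
```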
Possible optimizations should then start with defining the use case: for example, whether the target scenario emphasizes throughput over latency, like processing millions of samples in overnight jobs in data centers.
In contrast, real-time usages would likely trade off throughput to deliver the results at minimal latency. Often this is a combined scenario that targets the highest possible throughput while maintaining a specific latency threshold.
Below you can find a summary of the associated tips.
How the full-stack application uses the inference component _end-to-end_ is also important. For example, which stages need to be orchestrated? In some cases a significant part of the workload time is spent on bringing and preparing the input data. Below you can find multiple tips on connecting the data input pipeline and the model inference efficiently.
These are also common performance tricks that help both latency and throughput scenarios.
Further documents cover the associated _runtime_ performance optimization topics. Please also consider the [matrix of feature support by the individual devices](@ref features_support_matrix).
[General, application-level optimizations](dldt_deployment_optimization_common.md), and specifically:
* [Inputs Pre-processing with OpenVINO](../OV_Runtime_UG/preprocessing_overview.md)
* [Async API and 'get_tensor' Idiom](dldt_deployment_optimization_common.md)
* For variably-sized inputs, consider [dynamic shapes](../OV_Runtime_UG/ov_dynamic_shapes.md)
**Use-case specific optimizations**, such as optimizing for [latency](./dldt_deployment_optimization_latency.md) or [throughput](./dldt_deployment_optimization_tput.md)
## Writing Performance Portable Inference Application
Each of OpenVINO's [supported devices](../OV_Runtime_UG/supported_plugins/Supported_Devices.md) offers a bunch of low-level performance settings.
Tweaking this detailed configuration requires deep architecture understanding.
**If performance portability is of concern, consider using the [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) first**: they offer a portable, future-proof approach to performance configuration that does not require re-tuning when the model or the device changes.
Also, while the resulting performance may be optimal for the specific combination of the device and the model that is inferred, it is actually neither device/model-portable nor future-proof:
- Even within a family of devices (like various CPUs), a different instruction set or a different number of CPU cores would eventually make a different execution configuration optimal.
- Similarly, the optimal batch size is very much specific to the particular instance of the GPU.
- Compute vs memory-bandwidth requirements of the model, as well as the inference precision and possible quantization, also contribute to the optimal parameter selection.
- Finally, the optimal execution parameters of one device do not transparently map to another device type, for example:
   - Both the CPU and GPU devices support the notion of [streams](./dldt_deployment_optimization_tput_advanced.md), yet the optimal number of streams is deduced very differently.
To mitigate this configuration complexity, the **Performance Hints** offer high-level "presets" for **latency** and **throughput**, as detailed in the [Performance Hints usage document](../OV_Runtime_UG/performance_hints.md).
Beyond the execution _parameters_ there is also device-specific _scheduling_ that greatly affects the performance.
Specifically, GPU-oriented optimizations like batching, which combines many (potentially tens of) inputs to achieve optimal throughput, do not always map well to the CPU, as detailed, for example, in the [further internals](dldt_deployment_optimization_internals.md) sections.
The hints hide the _execution_ specifics required to saturate the device. The [internals](dldt_deployment_optimization_internals.md) sections describe the implementation details (particularly how OpenVINO implements the 'throughput' approach) for the specific devices. Keep in mind that the hints make this transparent to the application; for example, they obviate the need for explicit (application-side) batching or streams.
With the hints, it is enough to keep separate infer requests per camera or other source of input and process the requests in parallel using the Async API, as explained in the [application design considerations section](@ref throughput_app_design). The main requirement for the application to leverage the throughput is **running multiple inference requests in parallel**.
In summary, when performance _portability_ is of concern, consider the Performance Hints as a solution. You may find further details and API examples [here](../OV_Runtime_UG/performance_hints.md).
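A minimal sketch of this pattern, assuming a placeholder model file, the GPU device and four input sources; the hint takes care of streams/batching, while the application only supplies enough parallel requests:

```cpp
#include <openvino/openvino.hpp>
#include <vector>

int main() {
    ov::Core core;
    auto compiled = core.compile_model(core.read_model("model.xml"), "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    std::vector<ov::InferRequest> requests;
    for (int i = 0; i < 4; ++i)                      // e.g. one request per camera
        requests.push_back(compiled.create_infer_request());

    for (auto& request : requests) {
        // ... fill request.get_input_tensor() with the source's data ...
        request.start_async();                       // the requests run in parallel
    }
    for (auto& request : requests)
        request.wait();                              // collect the results
    return 0;
}
```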

View File

@ -1,22 +0,0 @@
# High-level Performance Hints (Presets) {#openvino_docs_deployment_optimization_guide_hints}

View File

@ -20,9 +20,9 @@ As expected, the easiest way to achieve the lowest latency is **running only one
However, some conventional "root" devices (e.g. CPU or GPU) can in fact be internally composed of several "sub-devices". In many cases, letting OpenVINO transparently leverage the "sub-devices" helps to improve the application throughput (e.g. serve multiple clients simultaneously) without degrading the latency. For example, multi-socket CPUs can deliver as many requests (at the same minimal latency) as there are NUMA nodes in the machine. Similarly, a multi-tile GPU (which is essentially multiple GPUs in a single package) can deliver multi-tile scalability with the number of inference requests, while preserving the single-tile latency.
Thus, human expertise is required to get more _throughput_ out of the device even in the inherently latency-oriented cases. OpenVINO can take over this configuration burden via the [high-level performance hints](../OV_Runtime_UG/performance_hints.md): specify `ov::hint::PerformanceMode::LATENCY` for the `ov::hint::performance_mode` property when calling `compile_model`.
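A minimal sketch of the latency preset; the model file and the device are placeholders:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    auto compiled = core.compile_model(model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));
    return 0;
}
```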
> **NOTE**: [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) are the recommended way of performance configuration, being both device-agnostic and future-proof.
When there are multiple models to be used simultaneously, consider using different devices for inferencing the different models. Finally, when multiple models are executed in parallel on a device, the additional `ov::hint::model_priority` may help to define the relative priorities of the models (refer to the documentation on the [matrix of feature support for OpenVINO devices](@ref features_support_matrix) to check whether the specific device supports the feature).
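A hedged sketch of two models sharing one device with different relative priorities; the model files and the GPU device are placeholders, and the property itself is supported only by some devices:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Hypothetical model files; the detector gets scheduled with a higher relative priority:
    auto detector = core.compile_model(core.read_model("detection.xml"), "GPU",
        ov::hint::model_priority(ov::hint::Priority::HIGH));
    auto classifier = core.compile_model(core.read_model("classification.xml"), "GPU",
        ov::hint::model_priority(ov::hint::Priority::LOW));
    return 0;
}
```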

View File

@ -1,79 +1,50 @@
# Optimizing for Throughput {#openvino_docs_deployment_optimization_guide_tput}
## General Throughput Considerations
As described in the section on the [latency-specific considerations](./dldt_deployment_optimization_latency.md), one possible use case is _delivering every single request at the minimal delay_.
Throughput, on the other hand, is about inference scenarios in which a potentially large **number of inference requests are served simultaneously to improve the device utilization**.
The associated increase in latency is not linearly dependent on the number of requests executed in parallel.
Here, a trade-off between the overall throughput and the serial performance of individual requests can be achieved with the right OpenVINO performance configuration.
## Basic and Advanced Ways of Leveraging Throughput
With OpenVINO there are two means of leveraging throughput with an individual device:
* **Basic (high-level)** flow with the [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md), which is inherently **portable and future-proof**
* **Advanced (low-level)** approach of explicit **batching** and **streams**, explained in a separate [document](dldt_deployment_optimization_tput_advanced.md).
In both cases, the application should be designed to execute multiple inference requests in parallel, as detailed in the [next section](@ref throughput_app_design).
Finally, consider the _automatic_ multi-device execution covered below.
@anchor throughput_app_design
## Throughput-Oriented Application Design
Most generally, throughput-oriented inference applications should:
* Expose substantial amounts of _input_ parallelism (e.g. process multiple video or audio sources, text documents, etc.)
* Decompose the data flow into a collection of concurrent inference requests that are aggressively scheduled to be executed in parallel
* Set up the configuration for the _device_ (e.g. as parameters of `ov::Core::compile_model`) via either the [low-level explicit options](dldt_deployment_optimization_tput_advanced.md), covered in a separate document, or the [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) (**preferable**):
@sphinxdirective
.. tab:: C++

   .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
      :language: cpp
      :fragment: [compile_model]
.. tab:: Python

   .. doxygensnippet:: docs/snippets/ov_auto_batching.py
      :language: python
      :fragment: [compile_model]
@endsphinxdirective
* Query the `ov::optimal_number_of_infer_requests` of the `ov::CompiledModel` (resulting from the compilation of the model for a device) to create the number of requests required to saturate the device
* Use the Async API with callbacks to avoid any dependency on the requests' completion order and possible device starvation, as explained in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common); see also the sketch below
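A sketch of these two points, assuming a placeholder model file and the throughput hint; error handling is omitted for brevity:

```cpp
#include <openvino/openvino.hpp>
#include <exception>
#include <vector>

int main() {
    ov::Core core;
    auto compiled = core.compile_model(core.read_model("model.xml"), "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    // Create just enough requests to saturate the device:
    uint32_t n = compiled.get_property(ov::optimal_number_of_infer_requests);
    std::vector<ov::InferRequest> requests;
    for (uint32_t i = 0; i < n; ++i)
        requests.push_back(compiled.create_infer_request());

    for (auto& request : requests) {
        request.set_callback([&request](std::exception_ptr) {
            // ... consume the outputs, refill the inputs, restart if more data is available ...
        });
        request.start_async();
    }
    for (auto& request : requests)
        request.wait();
    return 0;
}
```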
## Multi-Device Execution
OpenVINO offers automatic, [scalable multi-device inference](../OV_Runtime_UG/multi_device.md). This is a simple, _application-transparent_ way to improve the throughput. There is no need to re-architect existing applications for any explicit multi-device support: no explicit network loading to each device, no separate per-device queues, no additional logic to balance the inference requests between devices, etc. From the application point of view, it communicates with a single device that internally handles the actual machinery.
Just like with other throughput-oriented scenarios, there are two major prerequisites for optimal multi-device performance:
* Using the [Asynchronous API](@ref openvino_docs_deployment_optimization_guide_common) and [callbacks](../OV_Runtime_UG/ov_infer_request.md) in particular
* Providing the multi-device (and hence the underlying devices) with enough data to crunch. As the inference requests are naturally independent data pieces, the multi-device performs load-balancing at the “requests” (outermost) level to minimize the scheduling overhead.
Notice that the resulting performance is usually a fraction of the “ideal” (plain sum) value when the devices compete for certain resources, like the memory bandwidth which is shared between the CPU and the iGPU.
> **NOTE**: While the legacy approach of optimizing the parameters of each device separately works, the [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) allow configuring all the devices (that are part of the specific multi-device configuration) at once.
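A minimal sketch of the multi-device mode; the device list and the model file are placeholders:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    // "MULTI" transparently balances the inference requests between the listed devices:
    auto compiled = core.compile_model(model, "MULTI:GPU,CPU");
    auto request = compiled.create_infer_request();
    return 0;
}
```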

View File

@ -0,0 +1,77 @@
# Using Advanced Throughput Options: Streams and Batching {#openvino_docs_deployment_optimization_guide_tput_advanced}
## OpenVINO Streams
As detailed in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common), running multiple inference requests asynchronously is important for general application efficiency.
Internally, every device implements a queue. The queue acts as a buffer, storing the inference requests until retrieved by the device at its own pace.
The devices may actually process multiple inference requests in parallel in order to improve the device utilization and the overall throughput.
This configurable means of device-side parallelism is commonly referred to as **streams**.
> **NOTE**: Streams are **really executing the requests in parallel, but not in lock step** (as, e.g., batching does), which makes the streams fully compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md), when individual requests can have different shapes.
> **NOTE**: Most OpenVINO devices (including the CPU, GPU and VPU) support streams, yet the _optimal_ number of streams is deduced very differently; please see the dedicated section below.
A few general considerations:
* Using the streams does increase the latency of an individual request
   * When the number of streams is not specified, a device creates a bare minimum of streams (usually just one), as the latency-oriented case is the default
   * Please find further tips on the optimal number of streams [below](@ref throughput_advanced)
* Streams are memory-hungry, as every stream duplicates the intermediate buffers to do inference in parallel to the rest of the streams
   * Always prefer streams over creating multiple `ov::CompiledModel` instances for the same model, as the weights memory is shared across streams, reducing the memory consumption
* Notice that the streams also inflate the model load (compilation) time.
For efficient asynchronous execution, the streams are actually handling the inference with a special pool of threads (a thread per stream).
Each time you start inference requests (potentially from different application threads), they are actually muxed into an inference queue of the particular `ov::CompiledModel`.
If there is a vacant stream, it pops the request from the queue and actually expedites that to the on-device execution.
There are further device-specific details, e.g. for the CPU, that you may find in the [internals](dldt_deployment_optimization_internals.md) section.
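A minimal sketch of requesting a specific number of streams at compilation time; the model file, the CPU device and the count of 4 are placeholders (`ov::streams::AUTO` is the more portable alternative discussed below):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    // Explicitly request 4 streams; each stream handles its requests with its own threads:
    auto compiled = core.compile_model(model, "CPU", ov::num_streams(4));
    return 0;
}
```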
## Batching
Hardware accelerators like GPUs are optimized for massive compute parallelism, so batching helps to saturate the device and leads to higher throughput.
While the streams (described earlier) already help to hide the communication overheads and certain bubbles in the scheduling, running multiple OpenCL kernels simultaneously is less GPU-efficient compared to calling a kernel on multiple inputs at once.
As explained in the next section, batching is a must to leverage the maximum throughput on GPUs.
There are two primary ways of using batching to help application performance:
* Collecting the inputs explicitly on the application side and then _sending these batched requests to OpenVINO_
   * Although this gives flexibility with the possible batching strategies, the approach requires redesigning the application logic
* _Sending individual requests_, while configuring OpenVINO to collect and perform inference on the requests in batches [automatically](../OV_Runtime_UG/automatic_batching.md)
In both cases, the optimal batch size is very device-specific. Also, as explained below, the optimal batch size depends on the model, inference precision and other factors. Both options are sketched below.
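A hedged sketch of the two options; the model file, the GPU device and the batch size of 4 are placeholders, and the explicit variant assumes the model's batch dimension can be identified:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;

    // (1) Application-side batching: reshape the model to a batch of 4 and fill
    //     the batched input tensor in the application code.
    auto batched_model = core.read_model("model.xml");
    ov::set_batch(batched_model, 4);
    auto compiled_explicit = core.compile_model(batched_model, "GPU");

    // (2) Automatic batching: keep sending individual requests and let OpenVINO
    //     collect them into batches (here via the virtual "BATCH" device over the GPU).
    auto compiled_auto = core.compile_model(core.read_model("model.xml"), "BATCH:GPU");
    return 0;
}
```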
@anchor throughput_advanced
## Choosing the Number of Streams and/or Batch Size
Predicting the inference performance is difficult and finding optimal execution parameters requires direct experiments with measurements.
Run performance testing in the scope of development, and make sure to validate overall (end-to-end) application performance.
Different devices behave differently with the batch sizes. The optimal batch size depends on the model, inference precision and other factors.
Similarly, different devices require a different number of execution streams to saturate.
Finally, in some cases a combination of streams and batching may be required to maximize the throughput.
One possible throughput optimization strategy is to **set an upper bound for latency and then increase the batch size and/or number of the streams until that tail latency is met (or the throughput is not growing anymore)**.
Also, consider [OpenVINO Deep Learning Workbench](@ref workbench_docs_Workbench_DG_Introduction) that builds handy latency vs throughput charts, iterating over possible values of the batch size and number of streams.
> **NOTE**: When playing with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) use only the streams (no batching), as they tolerate individual requests having different shapes.
> **NOTE**: Using the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) is the alternative, portable and future-proof option, allowing OpenVINO to find the best combination of streams and batching for a given scenario and model.
### Number of Streams Considerations
* Select the number of streams that is **less than or equal to** the number of requests that your application would be able to run simultaneously
* To avoid wasting resources, the number of streams should be enough to meet the _average_ parallel slack rather than the peak load
* As a more portable option (that also respects the underlying hardware configuration), use `ov::streams::AUTO`
* It is very important to keep these streams busy by running as many inference requests as possible (e.g. start the newly-arrived inputs immediately)
   * The bare minimum of requests to saturate the device can be queried as `ov::optimal_number_of_infer_requests` of the `ov::CompiledModel`
* The _maximum number of streams_ for the device (per model) can be queried as `ov::range_for_streams`; a sketch of querying these properties follows this list
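A sketch of querying these properties, assuming a placeholder model file and the CPU device:

```cpp
#include <openvino/openvino.hpp>
#include <iostream>
#include <tuple>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");

    // Let the device pick the number of streams based on the hardware configuration:
    auto compiled = core.compile_model(model, "CPU", ov::num_streams(ov::streams::AUTO));

    // Bare minimum of requests needed to saturate the device with this configuration:
    uint32_t n_requests = compiled.get_property(ov::optimal_number_of_infer_requests);
    // Supported range of streams for the device:
    auto range = core.get_property("CPU", ov::range_for_streams);

    std::cout << "requests: " << n_requests << ", streams range: "
              << std::get<0>(range) << ".." << std::get<1>(range) << std::endl;
    return 0;
}
```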
### Batch Size Considerations
* Select the batch size that is **equal** to the number of requests that your application is able to run simultaneously
   * Otherwise (or if the number of "available" requests fluctuates), you may need to keep several instances of the network (reshaped to the different batch sizes) and select the properly sized instance at runtime
* For OpenVINO devices that implement a dedicated internal heuristic, the `ov::optimal_batch_size` is a _device_ property (that accepts the actual model as a parameter) to query the recommended batch size for the model.
### A Few Device-Specific Details
* For the **GPU**:
   * When the parallel slack is small (e.g. only 2-4 requests executed simultaneously), then using only the streams for the GPU may suffice
      * Notice that the GPU runs 2 requests per stream, so 4 requests can be served by 2 streams
      * Alternatively, consider a single stream with 2 requests (each with a small batch size like 2), which would total the same 4 inputs in flight
   * Typically, for 4 and more requests the batching delivers better throughput
   * The batch size can be calculated as the "number of inference requests executed in parallel" divided by the "number of requests that the streams consume"
      * E.g. if you process 16 cameras (by 16 requests inferenced _simultaneously_) with two GPU streams (each able to process two requests), the batch size per request is 16/(2*2)=4
* For the **CPU, always use the streams first**
   * On high-end CPUs, using a moderate (2-8) batch size _in addition_ to the maximum number of streams may further improve the performance.

View File

@ -9,11 +9,13 @@ Generally, performance means how fast the model processes the live data. Two key
![](../img/LATENCY_VS_THROUGHPUT.svg)
**Latency** measures the inference time (ms) required to process a single input. When it comes to executing multiple inputs simultaneously (e.g. via batching), the overall throughput (inferences per second, or frames per second, FPS, in the specific case of visual processing) is usually of more concern.
To calculate **throughput**, divide the number of inputs that were processed by the processing time.
## End-to-End Application Performance
It is important to separate the "pure" inference time of a neural network from the end-to-end application performance. For example, data transfers between the host and a device may unintentionally affect the performance when a host input tensor is processed on an accelerator like a dGPU.
Similarly, the input pre-processing may contribute significantly to the inference time. As detailed in the [getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) section, when drilling into _inference_ performance, one option is to measure all such items separately.
For the **end-to-end scenario**, though, consider performing the image pre-processing through OpenVINO and using the asynchronous execution as a way to amortize the communication costs like data transfers. You can find further details in the [general optimizations document](./dldt_deployment_optimization_common.md).
**First-inference latency** is another specific case (e.g. when a fast application start-up is required), where the resulting performance may well be dominated by the model loading time. Consider [model caching](../OV_Runtime_UG/Model_caching_overview.md) as a way to improve the model loading/compilation time, as sketched below.
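A minimal sketch of enabling the cache; the cache directory, the model file and the device are placeholders:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    core.set_property(ov::cache_dir("model_cache"));   // later compilations reuse the cached blob
    auto compiled = core.compile_model("model.xml", "GPU");
    return 0;
}
```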
@ -29,7 +31,8 @@ Finally, **memory footprint** restrictions is another possible concern when desi
With OpenVINO there are two primary ways of improving the inference performance, namely model- and runtime-level optimizations. **These two optimization directions are fully compatible**.
- **Model optimizations** include model modifications, such as quantization, pruning, optimization of pre-processing, etc. For more details, refer to this [document](./model_optimization_guide.md).
- Notice that the model optimizations directly improve the inference time, even without the runtime parameter tuning described below
- **Runtime (deployment) optimizations** include tuning of the model _execution_ parameters. To read more, visit the [Runtime Inference Optimizations](../optimization_guide/dldt_deployment_optimization_guide.md).