DOCS: ported changes from 2022.1 release branch (#11206)

* Extensibility guide with FE extensions and remove OV_FRAMEWORK_MAP from docs

* Rework of Extensibility Intro, adopted examples to missing OPENVINO_FRAMEWORK_MAP

* Removed OPENVINO_FRAMEWORK_MAP reference

* Frontend extension detailed documentation

* Fixed distributed snippets

* Fixed snippet inclusion in FE extension document and chapter headers

* Fixed wrong name in a snippet reference

* Fixed test for template extension due to changed number of loaded extensions

* Update docs/Extensibility_UG/frontend_extensions.md

Co-authored-by: Ivan Tikhonov <ivan.tikhonov@intel.com>

* Minor fixes in extension snippets

* Small grammar fix

Co-authored-by: Ivan Tikhonov <ivan.tikhonov@intel.com>

Co-authored-by: Ivan Tikhonov <ivan.tikhonov@intel.com>

* DOCS: transition banner (#10973)

* transition banner

* minor fix

* update transition banner

* updates

* update custom.js

* updates

* updates

* Documentation fixes (#11044)

* Benchmark app usage

* Fixed link to the devices

* More fixes

* Update docs/OV_Runtime_UG/multi_device.md

Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com>

* Removed several hardcoded links

Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com>

* Updated documentation for compile_tool (#11049)

* Added deployment guide (#11060)

* Added deployment guide

* Added local distribution

* Updates

* Fixed more indentations

* Removed obsolete code snippets (#11061)

* Removed obsolete code snippets

* NCC style

* Fixed NCC for BA

* Add a troubleshooting issue for PRC installation (#11074)

* updates

* adding gna to linux

* add missing reference

* update

* Update docs/install_guides/installing-model-dev-tools.md

Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com>

* Update docs/install_guides/installing-model-dev-tools.md

Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com>

* Update docs/install_guides/installing-model-dev-tools.md

Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com>

* Update docs/install_guides/installing-model-dev-tools.md

Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com>

* Update docs/install_guides/installing-model-dev-tools.md

Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com>

* update

* minor updates

* add gna item to yum and apt

* add gna to get started page

* update reference formatting

* merge commit

* add a troubleshooting issue

* update

* update

* fix CVS-71846

Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com>

* DOCS: fixed hardcoded links  (#11100)

* Fixes

* Use links

* applying reviewers comments to the Opt Guide (#11093)

* applying reviewrs comments

* fixed refs, more structuring (bold, bullets, etc)

* refactoring tput/latency sections

* next iteration (mostly latency), also brushed the auto-batching and other sections

* updates sync/async images

* common opts brushed

* WIP tput redesigned

* minor brushing of common and auto-batching

* Tput fully refactored

* fixed doc name in the link

* moved int8 perf counters to the right section

* fixed links

* fixed broken quotes

* fixed more links

* add ref to the internals to the TOC

* Added a note on the batch size

Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com>

* [80085] New images for docs (#11114)

* change doc structure

* fix manager tools

* fix manager tools 3 step

* fix manager tools 3 step

* new img

* new img for OV Runtime

* fix steps

* steps

* fix intendents

* change list

* fix space

* fix space

* code snippets fix

* change display

* Benchmarks 2022 1 (#11130)

* Minor fixes

* Updates for 2022.1

* Edits according to the review

* Edits according to review comments

* Edits according to review comments

* Edits according to review comments

* Fixed table

* Edits according to review comments

* Removed config for Intel® Core™ i7-11850HE

* Removed forward-tacotron-duration-prediction-241 graph

* Added resnet-18-pytorch

* Add info about Docker images in Deployment guide (#11136)

* Renamed user guides (#11137)

* fix screenshot (#11140)

* More conservative recommendations on dynamic shapes usage in docs (#11161)

* More conservative recommendations about using dynamic shapes

* Duplicated statement from C++ part to Python part of reshape doc (no semantical changes)

* Update ShapeInference.md (#11168)

* Benchmarks 2022 1 updates (#11180)

* Updated graphs

* Quick fix for TODO in Dynamic Shapes article

* Anchor link fixes

* Fixed DM config (#11199)

* DOCS: doxy sphinxtabs (#11027)

* initial implementation of doxy sphinxtabs

* fixes

* fixes

* fixes

* fixes

* fixes

* WA for ignored visibility attribute

* Fixes

Co-authored-by: Sergey Lyalin <sergey.lyalin@intel.com>
Co-authored-by: Ivan Tikhonov <ivan.tikhonov@intel.com>
Co-authored-by: Nikolay Tyukaev <nikolay.tyukaev@intel.com>
Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com>
Co-authored-by: Yuan Xu <yuan1.xu@intel.com>
Co-authored-by: Maxim Shevtsov <maxim.y.shevtsov@intel.com>
Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com>
Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
Co-authored-by: Ilya Naumov <ilya.naumov@intel.com>
Co-authored-by: Evgenya Stepyreva <evgenya.stepyreva@intel.com>
This commit is contained in:
Ilya Lavrenov
2022-03-24 22:27:29 +03:00
committed by GitHub
parent 8118147a96
commit a883dc0b85
161 changed files with 3254 additions and 3078 deletions

View File

@@ -1,68 +1,79 @@
# Optimizing for Throughput {#openvino_docs_deployment_optimization_guide_tput}
## General Throughput Considerations
As described in the section on the [latency-specific considerations](./dldt_deployment_optimization_latency.md) one possible use-case is delivering the every single request at the minimal delay.
Throughput on the other hand, is about inference scenarios in which potentially large number of inference requests are served simultaneously.
Here, the overall application throughput can be significantly improved with the right performance configuration.
Also, if the model is not already compute- or memory bandwidth-limited, the associated increase in latency is not linearly dependent on the number of requests executed in parallel.
As described in the section on the [latency-specific considerations](./dldt_deployment_optimization_latency.md) one possible use-case is delivering every single request at the minimal delay.
Throughput on the other hand, is about inference scenarios in which potentially large **number of inference requests are served simultaneously to improve the device utilization**.
With the OpenVINO there two major means of running the multiple requests simultaneously: batching and "streams", explained in this document.
Yet, different GPUs behave differently with batch sizes, just like different CPUs require different number of execution streams to maximize the throughput.
Predicting inference performance is difficult and and finding optimal execution parameters requires direct experiments measurements.
One possible throughput optimization strategy is to set an upper bound for latency and then increase the batch size or number of the streams until that tail latency is met (or the throughput is not growing anymore).
Also, consider [Deep Learning Workbench](https://docs.openvino.ai/latest/workbench_docs_Workbench_DG_Introduction.html).
Here, the overall application inference rate can be significantly improved with the right performance configuration.
Also, if the model is not already memory bandwidth-limited, the associated increase in latency is not linearly dependent on the number of requests executed in parallel.
With the OpenVINO there are two major means of processing multiple inputs simultaneously: **batching** and **streams**, explained in this document.
Finally, the [automatic multi-device execution](../OV_Runtime_UG/multi_device.md) helps to improve the throughput, please also see the section below.
While the same approach of optimizing the parameters of each device separately does work, the resulting multi-device performance is a fraction (that is different for different models) of the “ideal” (plain sum) performance.
## OpenVINO Streams
As detailed in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common) running multiple inference requests asynchronously is important for general application efficiency.
The [Asynchronous API](./dldt_deployment_optimization_common.md) is in fact the "application side" of scheduling, as every device internally implements a queue. The queue acts as a buffer, storing the inference requests until retrieved by the device at its own pace.
Further, the devices may actually process multiple inference requests in parallel in order to improve the device utilization and overall throughput. This parallelism is commonly referred as 'streams'. Some devices (like GPU) may run several requests per stream to amortize the host-side costs.
Notice that streams are **really executing the requests in parallel, but not in the lock step** (as e.g. the batching does), which makes the streams fully compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) when individual requests can have different shapes.
For efficient asynchronous execution, the streams are actually handling inference with special pool of the threads.
So each time you start inference requests (potentially from different application threads), they are actually muxed into a inference queue of the particular `ov:Compiled_Model`.
If there is a vacant stream, it pops the request from the queue and actually expedites that to the on-device execution.
The multi-streams approach is inherently throughput-oriented, as every stream requires a dedicated device memory to do inference in parallel to the rest of streams.
Although similar, the streams are always preferable compared to creating multiple `ov:Compiled_Model` instances for the same model, as weights memory is shared across streams, reducing the overall memory consumption.
Notice that the streams inflate the model load/compilation time.
Finally, using streams does increase the latency of an individual request, this is why for example the [latency hint](./dldt_deployment_optimization_hints.md) governs a device to create a bare minimum of streams (usually just one).
Please find the considerations for the optimal number of the streams in the later sections.
## Batching
Hardware accelerators like GPUs are optimized for massive compute parallelism, so the batching helps to saturate the device and leads to higher throughput.
While the streams (described) earlier already allow to hide the communication overheads and certain bubbles in the scheduling, running multiple OpenCL kernels simultaneously is less GPU-efficient, compared to calling a kernel on the multiple inputs at once.
As explained in the next section, the batching is a must to leverage maximum throughput on the GPUs.
There are two primary ways of using the batching to help application performance:
* Collecting the inputs explicitly on the application side and then _sending these batched requests to the OpenVINO_
* Although this gives flexibility with the possible batching strategies, the approach requires redesigning the application logic
* _Sending individual requests_, while configuring the OpenVINO to collect and perform inference on the requests in batch [automatically](../OV_Runtime_UG/automatic_batching.md).
In both cases, optimal batch size is very device-specific. Also as explained below, the optimal batch size depends on the model, inference precision and other factors.
## Choosing the Batch Size and Number of Streams
Predicting the inference performance is difficult and finding optimal execution parameters requires direct experiments with measurements.
One possible throughput optimization strategy is to **set an upper bound for latency and then increase the batch size or number of the streams until that tail latency is met (or the throughput is not growing anymore)**.
Also, consider [Deep Learning Workbench](@ref workbench_docs_Workbench_DG_Introduction) that builds handy latency vs throughput charts, iterating over possible values of the batch size and number of streams.
Different devices behave differently with the batch sizes. The optimal batch size depends on the model, inference precision and other factors. Similarly, different devices require different number of execution streams to maximize the throughput.
Below are general recommendations:
* For the **CPU always prefer the streams** over the batching
* Create as many streams as you application runs the requests simultaneously
* Number of streams should be enough to meet the _average_ parallel slack rather than the peak load
* _Maximum number of streams_ equals **total number of CPU cores**
* As explained in the [CPU streams internals](dldt_deployment_optimization_internals.md), the CPU cores are evenly distributed between streams, so one core per stream is the finest-grained configuration
* For the **GPU**:
* When the parallel slack is small (e.g. only 2-4 requests executed simultaneously), then using the streams for the GPU may suffice
* Notice that the GPU runs 2 request per stream
* _Maximum number of streams_ is usually 2, for more portability consider using the `ov::streams::AUTO` (`GPU_THROUGHPUT_AUTO` in the pre-OpenVINO 2.0 parlance)
* Typically, for 4 and more requests the batching delivers better throughput for the GPUs
* Batch size can be calculated as "number of inference requests executed _in parallel_" divided by the "number of requests that the streams consume"
* E.g. if you process 16 cameras (by 16 requests inferenced _simultaneously_) with 2 GPU streams (each can process 2 requests), the batch size per request is 16/(2*2)=4
> **NOTE**: When playing with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) use only the streams (no batching), as they tolerate individual requests having different shapes.
> **NOTE**: Using the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) explained in the next section, is the most portable and future-proof option, allowing the OpenVINO to find best combination of streams and batching for a given scenario and model.
## OpenVINO Hints: Selecting Optimal Execution and Parameters **Automatically**
Overall, the latency-throughput is not linearly dependent and very _device_ specific. It is also tightly integrated with _model_ characteristics.
As for the possible inference devices the scenery had already become pretty diverse, the OpenVINO has introduced the dedicated notion of the high-level performance configuration "hints" to describe the target application scenarios.
The hints are described [here](./dldt_deployment_optimization_hints.md).
**NOTE**: [OpenVINO performance hints](./dldt_deployment_optimization_hints.md) is a recommended way for performance configuration, which is both device-agnostic and future-proof.
The hints also obviates the need for explicit (application-side) batching. With the hints, the only requirement for the application is to run multiple individual requests using [Async API](./dldt_deployment_optimization_common.md) and let the OpenVINO decide whether to collect the requests and execute them in batch, streams, or both.
The rest of the document provides low-level details on the OpenVINO's low-level ways to optimize the throughput.
> **NOTE**: [OpenVINO performance hints](./dldt_deployment_optimization_hints.md) is a recommended way for performance configuration, which is both device-agnostic and future-proof.
## Low-Level Implementation Details
### OpenVINO Streams <a name="ov-streams"></a>
As detailed in the section <a href="#ov-async-api">OpenVINO Async API</a> running multiple inference requests asynchronously is important for general application efficiency.
Additionally, most devices support running multiple inference requests in parallel in order to improve the device utilization. The _level_ of the parallelism (i.e. how many requests are really executed in parallel on the device) is commonly referred as a number of 'streams'. Some devices run several requests per stream to amortize the host-side costs.
Notice that streams (that can be considered as independent queues) are really executing the requests in parallel, but not in the lock step (as e.g. the batching does), this makes the streams much more compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) when individual requests can have different shapes.
## Multi-Device Execution
OpenVINO offers _automatic_, [scalable multi-device inference](../OV_Runtime_UG/multi_device.md). This is simple _application-transparent_ way to improve the throughput. No need to re-architecture existing applications for any explicit multi-device support: no explicit network loading to each device, no separate per-device queues, no additional logic to balance the inference requests between devices, etc. From the application point of view, it is communicating to the single device that internally handles the actual machinery.
Just like with other throughput-oriented scenarios, there are two major pre-requisites for optimal multi-device performance:
* Using the [Asynchronous API](@ref openvino_docs_deployment_optimization_guide_common) and [callbacks](../OV_Runtime_UG/ov_infer_request.md) in particular
* Providing the multi-device (and hence the underlying devices) with enough data to crunch. As the inference requests are naturally independent data pieces, the multi-device performs load-balancing at the “requests” (outermost) level to minimize the scheduling overhead.
Also, notice that for efficient asynchronous execution, the streams are actually handling inference with special pool of the threads.
So each time you start inference requests (potentially from different application threads), they are actually muxed into a inference queue of the particular `ov:compiled_model`.
If there is a vacant stream, it pops the request from the queue and actually expedites that to the on-device execution.
The usage of multiple streams is an inherently throughput-oriented approach, as every stream requires a dedicated memory to operate in parallel to the rest streams (read-only data like weights are usually shared between all streams).
Also, the streams inflate the load/compilation time.
This is why the [latency hint](./dldt_deployment_optimization_hints.md) governs a device to create a bare minimum of streams (usually just one).
Finally, the streams are always preferable compared to creating multiple instances of the same model, as weights memory is shared across streams, reducing possible memory consumption.
### Throughput on the CPU: Internals <a name="cpu-streams"></a>
In order to best serve multiple inference requests simultaneously, the inference threads are grouped/pinned to the particular CPU cores, constituting the CPU streams.
This provides much better performance for the networks than batching especially for the many-core machines:
![](../img/cpu_streams_explained_1.png)
Compared with the batching, the parallelism is somewhat transposed (i.e. performed over inputs, with much less synchronization within CNN ops):
![](../img/cpu_streams_explained.png)
Notice that [high-level performance hints](../OV_Runtime_UG/performance_hints.md) allows the implementation to select the optimal number of the streams, _depending on the model compute demands_ and CPU capabilities (including [int8 inference](../OV_Runtime_UG/Int8Inference.md) hardware acceleration, number of cores, etc).
### Automatic Batching Internals <a name="ov-auto-batching"></a>
While the GPU plugin fully supports general notion of the streams, the associated performance (throughput) improvements are usually modest.
The primary reason is that, while the streams allow to hide the communication overheads and hide certain bubbles in device utilization, running multiple OpenCL kernels on the GPU simultaneously is less efficient, compared to calling a kernel on the multiple inputs at once.
When the parallel slack is small (e.g. only 2-4 requests executed simultaneously), then using the streams for the GPU may suffice. Also streams are fully compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) when individual requests can have different shapes.
Typically, for 4 and more requests the batching delivers better throughput for the GPUs. Using the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) is the most portable and future-proof option, allowing the OpenVINO to find best combination of streams and batching for a given scenario.
As explained in the section on the [automatic batching](../OV_Runtime_UG/automatic_batching.md), the feature performs on-the-fly grouping of the inference requests to improve device utilization.
The Automatic Batching relaxes the requirement for an application to saturate devices like GPU by _explicitly_ using a large batch. It performs transparent inputs gathering from
individual inference requests followed by the actual batched execution, with no programming effort from the user:
![](../img/BATCH_device.PNG)
Essentially, the Automatic Batching shifts the asynchronousity from the individual requests to the groups of requests that constitute the batches. Thus, for the execution to be efficient it is very important that the requests arrive timely, without causing a batching timeout.
Normally, the timeout should never be hit. It is rather a graceful way to handle the application exit (when the inputs are not arriving anymore, so the full batch is not possible to collect).
So if your workload experiences the timeouts (resulting in the performance drop, as the timeout value adds itself to the latency of every request), consider balancing the timeout value vs the batch size. For example in many cases having smaller timeout value and batch size may yield better performance than large batch size, but coupled with the timeout value that cannot guarantee accommodating the full number of the required requests.
Finally, following the "get_tensor idiom" section from the [general optimizations](./dldt_deployment_optimization_common.md) helps the Automatic Batching to save on inputs/outputs copies. Thus, in your application always prefer the "get" versions of the tensor data access APIs.
Notice that the resulting performance is usually a fraction of the “ideal” (plain sum) value, when the devices compete for a certain resources, like the memory-bandwidth which is shared between CPU and iGPU.
> **NOTE**: While the legacy approach of optimizing the parameters of each device separately works, the [OpenVINO performance hints](./dldt_deployment_optimization_hints.md) allow to configure all devices (that are part of the specific multi-device configuration) at once.