# Optimizing for Throughput
## General Throughput Considerations
As described in the section on latency-specific considerations, one possible use case is delivering every single request with minimal delay. Throughput, on the other hand, is about inference scenarios in which a potentially large number of inference requests are served simultaneously to improve device utilization.
Here, the overall application inference rate can be significantly improved with the right performance configuration. Also, if the model is not already memory bandwidth-limited, the associated increase in latency is not linearly dependent on the number of requests executed in parallel. With OpenVINO, there are two major means of processing multiple inputs simultaneously: batching and streams, both explained in this document.
## OpenVINO Streams
As detailed in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common), running multiple inference requests asynchronously is important for general application efficiency. The asynchronous API is in fact the "application side" of scheduling, as every device internally implements a queue. The queue acts as a buffer, storing the inference requests until they are retrieved by the device at its own pace.
Furthermore, the devices may actually process multiple inference requests in parallel in order to improve the device utilization and overall throughput. This parallelism is commonly referred to as 'streams'. Some devices (like the GPU) may run several requests per stream to amortize the host-side costs. Note that streams actually execute the requests in parallel, but not in lockstep (as, for example, batching does), which makes streams fully compatible with dynamically-shaped inputs, where individual requests can have different shapes.
For efficient asynchronous execution, the streams handle inference with a special pool of threads.
So each time you start inference requests (potentially from different application threads), they are actually multiplexed into an inference queue of the particular `ov::CompiledModel`.
If there is a vacant stream, it pops a request from the queue and dispatches it for on-device execution.
The multi-stream approach is inherently throughput-oriented, as every stream requires dedicated device memory to run inference in parallel with the rest of the streams.
Although similar in effect, streams are always preferable to creating multiple `ov::CompiledModel` instances for the same model, as the weights memory is shared across streams, reducing the overall memory consumption.
Note that streams do inflate the model load/compilation time.
Finally, using streams does increase the latency of an individual request; this is why, for example, the latency hint makes a device create a bare minimum of streams (usually just one).
Considerations for the optimal number of streams can be found in the later sections.
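Below is a minimal sketch of configuring the number of streams at compile time with the C++ API, assuming a placeholder `model.xml`; the actual number of streams should follow the considerations discussed later in this document.

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder model path

    // Let the device pick a throughput-friendly number of streams
    auto compiled_auto = core.compile_model(model, "CPU",
        ov::streams::num(ov::streams::AUTO));

    // Or request an explicit number of streams
    auto compiled_fixed = core.compile_model(model, "CPU", ov::streams::num(4));

    // The device reports how many in-flight requests keep its streams busy
    auto nireq = compiled_fixed.get_property(ov::optimal_number_of_infer_requests);
    (void)nireq;
    return 0;
}
```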
## Batching
Hardware accelerators like GPUs are optimized for massive compute parallelism, so batching helps saturate the device and leads to higher throughput.
While the streams (described earlier) already help hide the communication overheads and certain bubbles in the scheduling, running multiple OpenCL kernels simultaneously is less GPU-efficient than calling a kernel on multiple inputs at once.
As explained in the next section, batching is essential to reach maximum throughput on GPUs.
There are two primary ways of using batching to help application performance:
- Collecting the inputs explicitly on the application side and then sending these batched requests to OpenVINO (see the sketch after this list)
   - Although this gives flexibility with the possible batching strategies, the approach requires redesigning the application logic
- Sending individual requests, while configuring OpenVINO to collect and perform inference on the requests in batches automatically

In both cases, the optimal batch size is very device-specific. Also, as explained below, the optimal batch size depends on the model, inference precision and other factors.
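The following sketch contrasts the two approaches. It assumes a placeholder `model.xml` whose inputs expose a regular batch ('N') dimension; the batch value of 4 and the GPU target are illustrative only.

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;

    // Approach 1: explicit batching on the application side.
    // Reshape the model to a batch of 4 and pack 4 inputs into every request.
    auto model = core.read_model("model.xml");
    ov::set_batch(model, 4);  // requires an identifiable batch dimension in the input layout
    auto compiled_explicit = core.compile_model(model, "GPU");
    auto batched_request = compiled_explicit.create_infer_request();
    // ... copy 4 inputs into batched_request.get_input_tensor() and run inference ...

    // Approach 2: send individual (batch-1) requests and let OpenVINO decide.
    // With the THROUGHPUT hint, the runtime may apply batching internally.
    auto compiled_auto = core.compile_model(core.read_model("model.xml"), "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    auto request = compiled_auto.create_infer_request();
    // ... fill a single input and run start_async()/wait() as usual ...
    return 0;
}
```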
## Choosing the Batch Size and Number of Streams
Predicting inference performance is difficult, and finding optimal execution parameters requires direct experiments and measurements. One possible throughput optimization strategy is to set an upper bound for latency and then increase the batch size or number of streams until that tail latency is met (or the throughput stops growing). Also, consider the [Deep Learning Workbench](@ref workbench_docs_Workbench_DG_Introduction), which builds handy latency vs. throughput charts by iterating over possible values of the batch size and number of streams.
Different devices behave differently with respect to batch size. The optimal batch size depends on the model, inference precision and other factors. Similarly, different devices require a different number of execution streams to maximize throughput. Below are general recommendations:
- For the CPU, always prefer streams over batching
   - Create as many streams as your application runs requests simultaneously
   - The number of streams should be enough to meet the average parallel slack rather than the peak load
   - The maximum number of streams equals the total number of CPU cores
      - As explained in the CPU streams internals, the CPU cores are evenly distributed between streams, so one core per stream is the finest-grained configuration
- For the GPU:
   - When the parallel slack is small (e.g. only 2-4 requests executed simultaneously), using streams alone may suffice
      - Notice that the GPU runs 2 requests per stream
      - The maximum number of streams is usually 2; for more portability consider using `ov::streams::AUTO` (`GPU_THROUGHPUT_AUTO` in the pre-OpenVINO 2.0 parlance)
   - Typically, for 4 and more requests, batching delivers better throughput on the GPUs
      - The batch size can be calculated as the "number of inference requests executed in parallel" divided by the "number of requests that the streams consume"
      - E.g. if you process 16 cameras (16 requests inferenced simultaneously) with 2 GPU streams (each can process 2 requests), the batch size per request is 16/(2*2)=4 (see the sketch after this list)
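A hedged sketch of the 16-camera arithmetic above: 2 GPU streams, each consuming 2 requests, keep 4 requests in flight, so every request carries a batch of 16/(2x2) = 4 frames. It assumes a placeholder `model.xml` whose inputs expose a batch ('N') dimension and that the target GPU plugin accepts the `ov::streams::num` property.

```cpp
#include <openvino/openvino.hpp>
#include <vector>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder model path

    ov::set_batch(model, 4);  // 4 camera frames per request

    auto compiled = core.compile_model(model, "GPU", ov::streams::num(2));

    // 2 streams x 2 requests per stream = 4 requests in flight
    std::vector<ov::InferRequest> requests;
    for (int i = 0; i < 4; ++i)
        requests.push_back(compiled.create_infer_request());
    // ... distribute the 16 camera frames across the 4 batched requests
    //     and run them with start_async()/wait() ...
    return 0;
}
```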
> **NOTE**: When playing with dynamically-shaped inputs, use only the streams (no batching), as they tolerate individual requests having different shapes.
> **NOTE**: Using the high-level performance hints, explained in the next section, is the most portable and future-proof option, allowing OpenVINO to find the best combination of streams and batching for a given scenario and model.
## OpenVINO Hints: Selecting Optimal Execution and Parameters Automatically
Overall, the latency-throughput relationship is not linear and is very device-specific. It is also tightly coupled with the model characteristics. As the landscape of possible inference devices has become quite diverse, OpenVINO has introduced the dedicated notion of high-level performance configuration "hints" to describe the target application scenarios. The hints are described here.
The hints also obviate the need for explicit (application-side) batching. With the hints, the only requirement for the application is to run multiple individual requests using the Async API and let OpenVINO decide whether to collect the requests and execute them in batches, streams, or both.
> **NOTE**: OpenVINO performance hints are the recommended way of performance configuration, being both device-agnostic and future-proof.
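A minimal sketch of the hint-driven approach, assuming a placeholder `model.xml` and the GPU target; the number of in-flight requests is queried from the compiled model rather than hard-coded.

```cpp
#include <openvino/openvino.hpp>
#include <exception>
#include <vector>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder model path

    // Let OpenVINO pick the combination of streams and/or batching for throughput
    auto compiled = core.compile_model(model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    // Ask the device how many requests it wants to keep in flight
    auto nireq = compiled.get_property(ov::optimal_number_of_infer_requests);

    std::vector<ov::InferRequest> requests;
    for (uint32_t i = 0; i < nireq; ++i) {
        requests.push_back(compiled.create_infer_request());
        requests.back().set_callback([](std::exception_ptr) {
            // consume the results and feed the next input here
        });
    }
    for (auto& req : requests) {
        // ... fill the request's input tensor ...
        req.start_async();
    }
    for (auto& req : requests)
        req.wait();
    return 0;
}
```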
## Multi-Device Execution
OpenVINO offers automatic, scalable multi-device inference. This is a simple, application-transparent way to improve throughput. There is no need to re-architect existing applications for explicit multi-device support: no explicit loading of the network to each device, no separate per-device queues, no additional logic to balance inference requests between devices, and so on. From the application's point of view, it communicates with a single device that internally handles the actual machinery. Just like with other throughput-oriented scenarios, there are two major prerequisites for optimal multi-device performance:
- Using the [Asynchronous API](@ref openvino_docs_deployment_optimization_guide_common) and callbacks in particular
- Providing the multi-device (and hence the underlying devices) with enough data to crunch. As the inference requests are naturally independent data pieces, the multi-device performs load-balancing at the "requests" (outermost) level to minimize the scheduling overhead.

Notice that the resulting performance is usually a fraction of the "ideal" (plain sum) value when the devices compete for certain resources, such as the memory bandwidth shared between the CPU and iGPU.
> **NOTE**: While the legacy approach of optimizing the parameters of each device separately works, the OpenVINO performance hints allow configuring all devices (that are part of the specific multi-device configuration) at once.
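A minimal sketch of multi-device execution with a single shared hint, assuming a placeholder `model.xml` and a machine that exposes both GPU and CPU devices.

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder model path

    // One logical device that load-balances requests across GPU and CPU;
    // the THROUGHPUT hint propagates to the underlying devices.
    auto compiled = core.compile_model(model, "MULTI:GPU,CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    auto request = compiled.create_infer_request();
    // ... fill the input tensor ...
    request.start_async();
    request.wait();
    return 0;
}
```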