# Runtime Inference Optimizations
@sphinxdirective

.. toctree::
   :maxdepth: 1
   :hidden:

   openvino_docs_deployment_optimization_guide_common
   openvino_docs_deployment_optimization_guide_latency
   openvino_docs_deployment_optimization_guide_tput
   openvino_docs_deployment_optimization_guide_hints
   openvino_docs_deployment_optimization_guide_internals

@endsphinxdirective
## Deployment Optimizations Overview
Runtime or deployment optimizations focus on tuning inference parameters (e.g., the optimal number of requests executed simultaneously) and other aspects of how a model is executed.
As referenced in the parent performance introduction topic, a dedicated document covers model-level optimizations such as quantization, which unlocks 8-bit inference. Model optimizations are the most general: they help in any scenario and on any device that, for example, accelerates quantized models. The related runtime configuration is `ov::hint::inference_precision`, which allows devices to trade accuracy for performance (e.g., by allowing fp16/bf16 execution for the layers that remain in fp32 after quantization of the original fp32 model).
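As an illustration, the snippet below compiles a model while pinning the inference precision. This is a minimal C++ sketch; the IR file name `model.xml` and the CPU device are assumptions:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml"); // assumed model path

    // By default the device may lower remaining f32 layers to a faster
    // type (e.g. bf16 on recent CPUs). Pinning the hint to f32 trades
    // that performance back for accuracy:
    auto compiled = core.compile_model(
        model, "CPU", ov::hint::inference_precision(ov::element::f32));
    return 0;
}
```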
Any further optimization should start with defining the use case. For example, does the target scenario emphasize throughput over latency, as when processing millions of samples in overnight data-center jobs? In contrast, real-time usages would likely trade throughput to deliver results at minimal latency. Often the scenario is a combination: targeting the highest possible throughput while maintaining a specific latency threshold. Below you can find a summary of the associated tips.
How the full-stack application uses the inference component end-to-end is also important. For example, what stages need to be orchestrated? In some cases, a significant part of the workload time is spent on bringing and preparing the input data. Below you can find multiple tips on connecting the data input pipeline and the model inference efficiently. These are also common performance tricks that help both latency and throughput scenarios.
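One common pattern is to overlap input preparation with inference via the asynchronous API, rotating two infer requests. The sketch below illustrates this; `fill_input` and `num_frames` are hypothetical stand-ins for the application's own data pipeline, not OpenVINO APIs:

```cpp
#include <openvino/openvino.hpp>
#include <utility>

// Hypothetical stand-in for the application's data-preparation stage
// (decode, resize, layout conversion, etc.).
void fill_input(ov::InferRequest& request) {
    // e.g. write pixels into request.get_input_tensor()
}

int main() {
    ov::Core core;
    auto compiled = core.compile_model("model.xml", "CPU"); // assumed path/device

    auto current = compiled.create_infer_request();
    auto next = compiled.create_infer_request();

    const int num_frames = 100; // assumed workload size
    fill_input(current);
    current.start_async();
    for (int frame = 1; frame < num_frames; ++frame) {
        fill_input(next);   // prepare the next input on the host...
        current.wait();     // ...while the device runs the current one
        // consume current.get_output_tensor() here, then rotate requests
        std::swap(current, next);
        current.start_async();
    }
    current.wait();
    return 0;
}
```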
Further documents cover the associated runtime performance optimization topics. Please also consider the [feature support matrix of the individual devices](@ref features_support_matrix).
General, application-level optimizations, and specifically:

- For variably-sized inputs, consider dynamic shapes (see the sketch below)
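For instance, dynamic dimensions can be set before compilation so that the same compiled model accepts differently-sized inputs without recompiling per size. A minimal sketch, assuming a single-input NCHW image model in `model.xml`:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml"); // assumed model path

    // Mark the spatial dimensions as dynamic (NCHW layout assumed here):
    model->reshape(ov::PartialShape{1, 3, ov::Dimension::dynamic(),
                                    ov::Dimension::dynamic()});
    auto compiled = core.compile_model(model, "CPU");
    auto request = compiled.create_infer_request();

    // The same request now accepts inputs of varying height/width:
    ov::Tensor small(ov::element::f32, {1, 3, 224, 224});
    request.set_input_tensor(small);
    request.infer();

    ov::Tensor large(ov::element::f32, {1, 3, 448, 448});
    request.set_input_tensor(large);
    request.infer();
    return 0;
}
```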
Use-case specific optimizations along with some implementation details:

- Optimizing for throughput and latency
- OpenVINO's high-level performance hints as the portable, future-proof approach for performance configuration that does not require re-tuning when the model or device changes (see the sketch after this list)
- If performance portability is a concern, consider using the hints first
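As an example of hint-based configuration, the sketch below states only the intent (throughput vs. latency) and leaves stream, thread, and batch selection to the device plugin; the model path and device are assumptions:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml"); // assumed model path

    // Portable intent: request THROUGHPUT (or LATENCY) and let the
    // device derive the low-level parameters. Switching models or
    // devices does not require re-tuning this line:
    auto compiled = core.compile_model(
        model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    return 0;
}
```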