applying reviewers comments to the Opt Guide (#11093)
* applying reviewers comments
* fixed refs, more structuring (bold, bullets, etc)
* refactoring tput/latency sections
* next iteration (mostly latency), also brushed the auto-batching and other sections
* updates sync/async images
* common opts brushed
* WIP tput redesigned
* minor brushing of common and auto-batching
* Tput fully refactored
* fixed doc name in the link
* moved int8 perf counters to the right section
* fixed links
* fixed broken quotes
* fixed more links
* add ref to the internals to the TOC
* Added a note on the batch size

Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com>
@@ -9,7 +9,7 @@ When evaluating performance of your model with the OpenVINO Runtime, you must me
- Track separately the operations that happen outside the OpenVINO Runtime, like video decoding.

> **NOTE**: Some image pre-processing can be baked into the IR and accelerated accordingly. For more information, refer to [Embedding the Preprocessing](Additional_Optimizations.md). Also consider [_runtime_ preprocessing optimizations](../../optimization_guide/dldt_deployment_optimization_common).

> **NOTE**: Some image pre-processing can be baked into the IR and accelerated accordingly. For more information, refer to [Embedding the Preprocessing](Additional_Optimizations.md). Also consider [Runtime Optimizations of the Preprocessing](../../optimization_guide/dldt_deployment_optimization_common).

## Tip 2. Getting Credible Performance Numbers
@@ -53,17 +53,32 @@ When comparing the OpenVINO Runtime performance with the framework or another re
Further, finer-grained insights into inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs.

Both [C++](../../../samples/cpp/benchmark_app/README.md) and [Python](../../../tools/benchmark_tool/README.md) versions of the `benchmark_app` support a `-pc` command-line parameter that outputs an internal execution breakdown.

Below is an example of the CPU plugin output for a network (since the device is CPU, the layers' wall clock `realTime` and the `cpu` time are the same):

For example, below is a part of the performance counters for the quantized [TensorFlow* implementation of ResNet-50](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/resnet-50-tf) model inference on [CPU Plugin](../../OV_Runtime_UG/supported_plugins/CPU.md).

Notice that since the device is CPU, the layers' wall clock `realTime` and the `cpu` time are the same. Information about layer precision is also stored in the performance counters.
| layerName | execStatus | layerType | execType | realTime (ms) | cpuTime (ms) |
| --------------------------------------------------------- | ---------- | ------------ | -------------------- | ------------- | ------------ |
| resnet\_model/batch\_normalization\_15/FusedBatchNorm/Add | EXECUTED | Convolution | jit\_avx512\_1x1\_I8 | 0.377 | 0.377 |
| resnet\_model/conv2d\_16/Conv2D/fq\_input\_0 | NOT\_RUN | FakeQuantize | undef | 0 | 0 |
| resnet\_model/batch\_normalization\_16/FusedBatchNorm/Add | EXECUTED | Convolution | jit\_avx512\_I8 | 0.499 | 0.499 |
| resnet\_model/conv2d\_17/Conv2D/fq\_input\_0 | NOT\_RUN | FakeQuantize | undef | 0 | 0 |
| resnet\_model/batch\_normalization\_17/FusedBatchNorm/Add | EXECUTED | Convolution | jit\_avx512\_1x1\_I8 | 0.399 | 0.399 |
| resnet\_model/add\_4/fq\_input\_0 | NOT\_RUN | FakeQuantize | undef | 0 | 0 |
| resnet\_model/add\_4 | NOT\_RUN | Eltwise | undef | 0 | 0 |
| resnet\_model/add\_5/fq\_input\_1 | NOT\_RUN | FakeQuantize | undef | 0 | 0 |
The `execStatus` column of the table includes the possible values:
- `EXECUTED` - the layer was executed by a standalone primitive,
- `NOT_RUN` - the layer was not executed by a standalone primitive or was fused with another operation and executed in another layer primitive.

The `execType` column of the table includes inference primitives with specific suffixes. The layers have the following marks:
* Suffix `I8` for layers that had 8-bit data type input and were computed in 8-bit precision
* Suffix `FP32` for layers computed in 32-bit precision

All `Convolution` layers are executed in int8 precision. The remaining layers are fused into Convolutions using the post-operations optimization technique, which is described in [Internal CPU Plugin Optimizations](../../OV_Runtime_UG/supported_plugins/CPU.md).
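The same counters are also available programmatically. Below is a minimal C++ sketch (the model path is illustrative; property and field names are as in recent OpenVINO 2.0 releases) that enables profiling and prints the per-layer statistics for a single inference:

```cpp
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Enable collection of per-layer performance counters for this model
    auto compiled_model = core.compile_model("model.xml", "CPU",
                                             ov::enable_profiling(true));
    auto request = compiled_model.create_infer_request();
    request.infer();

    // Each entry roughly corresponds to a row of the table above
    for (const auto& info : request.get_profiling_info()) {
        std::cout << info.node_name << " (" << info.node_type << "): "
                  << info.real_time.count() << " us, exec_type=" << info.exec_type << "\n";
    }
    return 0;
}
```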
This contains the layer names (as seen in the IR), the layer types, and execution statistics.

```
conv1 EXECUTED layerType: Convolution realTime: 706 cpu: 706 execType: jit_avx2
conv2_1_x1 EXECUTED layerType: Convolution realTime: 137 cpu: 137 execType: jit_avx2_1x1
fc6 EXECUTED layerType: Convolution realTime: 233 cpu: 233 execType: jit_avx2_1x1
fc6_nChw8c_nchw EXECUTED layerType: Reorder realTime: 20 cpu: 20 execType: reorder
out_fc6 EXECUTED layerType: Output realTime: 3 cpu: 3 execType: unknown
relu5_9_x2 OPTIMIZED_OUT layerType: ReLU realTime: 0 cpu: 0 execType: undef
```

This contains the layer names (as seen in the IR), the layer types, and execution statistics. Notice the `OPTIMIZED_OUT` status, which indicates that the particular activation was fused into the adjacent convolution.
Both `benchmark_app` versions also support the `exec_graph_path` command-line option that instructs OpenVINO to output the same per-layer execution statistics, but in the form of a plugin-specific, [Netron-viewable](https://netron.app/) graph written to the specified file.

Notice that on some devices, collecting the execution graphs and/or performance counters may introduce a noticeable overhead.
@@ -71,4 +86,4 @@ Also, especially when performance-debugging the [latency case](../../optimizatio
Finally, the performance statistics with both performance counters and execution graphs are averaged, so such data for the [dynamically-shaped inputs](../../OV_Runtime_UG/ov_dynamic_shapes.md) should be measured carefully (ideally by isolating the specific shape and executing multiple times in a loop, to gather reliable data).

OpenVINO in general and individual plugins are heavily instrumented with Intel® Instrumentation and Tracing Technology (ITT), so another option is to compile OpenVINO from the source code with ITT enabled and use tools like [Intel® VTune™ Profiler](https://software.intel.com/en-us/vtune) to get a detailed inference performance breakdown and additional insights into the application-level performance on the timeline view.
@@ -59,30 +59,3 @@ For 8-bit integer computations, a model must be quantized. Quantized models can
![int8_flow]

## Performance Counters

Information about layer precision is stored in the performance counters that are
available from the OpenVINO Runtime API. For example, a part of the performance counters table for the quantized [TensorFlow* implementation of ResNet-50](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/resnet-50-tf) model inference on [CPU Plugin](supported_plugins/CPU.md) looks as follows:
| layerName | execStatus | layerType | execType | realTime (ms) | cpuTime (ms) |
| --------------------------------------------------------- | ---------- | ------------ | -------------------- | ------------- | ------------ |
| resnet\_model/batch\_normalization\_15/FusedBatchNorm/Add | EXECUTED | Convolution | jit\_avx512\_1x1\_I8 | 0.377 | 0.377 |
| resnet\_model/conv2d\_16/Conv2D/fq\_input\_0 | NOT\_RUN | FakeQuantize | undef | 0 | 0 |
| resnet\_model/batch\_normalization\_16/FusedBatchNorm/Add | EXECUTED | Convolution | jit\_avx512\_I8 | 0.499 | 0.499 |
| resnet\_model/conv2d\_17/Conv2D/fq\_input\_0 | NOT\_RUN | FakeQuantize | undef | 0 | 0 |
| resnet\_model/batch\_normalization\_17/FusedBatchNorm/Add | EXECUTED | Convolution | jit\_avx512\_1x1\_I8 | 0.399 | 0.399 |
| resnet\_model/add\_4/fq\_input\_0 | NOT\_RUN | FakeQuantize | undef | 0 | 0 |
| resnet\_model/add\_4 | NOT\_RUN | Eltwise | undef | 0 | 0 |
| resnet\_model/add\_5/fq\_input\_1 | NOT\_RUN | FakeQuantize | undef | 0 | 0 |
The `execStatus` column of the table includes the possible values:
- `EXECUTED` - the layer was executed by a standalone primitive,
- `NOT_RUN` - the layer was not executed by a standalone primitive or was fused with another operation and executed in another layer primitive.

The `execType` column of the table includes inference primitives with specific suffixes. The layers have the following marks:
* Suffix `I8` for layers that had 8-bit data type input and were computed in 8-bit precision
* Suffix `FP32` for layers computed in 32-bit precision

All `Convolution` layers are executed in int8 precision. The remaining layers are fused into Convolutions using the post-operations optimization technique, which is described in [Internal CPU Plugin Optimizations](supported_plugins/CPU.md).
@@ -98,6 +98,7 @@ To achieve the best performance with the Automatic Batching, the application sho
- Operate a number of inference requests that is a multiple of the batch size. In the above example, for batch size 4, the application should operate 4, 8, 12, 16, etc. requests.
- Use the requests, grouped by the batch size, together. For example, the first 4 requests are inferred, while the second group of the requests is being populated. Essentially, the Automatic Batching shifts the asynchronicity from the individual requests to the groups of requests that constitute the batches.
- Balance the 'timeout' value vs the batch size. For example, in many cases having a smaller timeout value/batch size may yield better performance than a large batch size coupled with a timeout value that is not large enough to accommodate the full number of the required requests.
- When the Automatic Batching is enabled, the 'timeout' property of the `ov::CompiledModel` can be changed at any time, even after model loading/compilation. For example, setting the value to 0 effectively disables the auto-batching, as the collection of requests would be omitted (see the sketch after this list).
- Carefully apply the auto-batching to the pipelines. For example, for the conventional video-sources->detection->classification flow, it is most beneficial to apply auto-batching to the inputs of the detection stage, whereas the resulting number of detections usually fluctuates, which makes the auto-batching less applicable for the classification stage.
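As a rough illustration of the last points, the C++ sketch below (the model path and batch size are illustrative) compiles the model via the explicit `BATCH` device and then adjusts the timeout on the already compiled model:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Explicitly wrap the GPU into the batching "virtual" device with batch size 4
    auto compiled_model = core.compile_model("model.xml", "BATCH:GPU(4)");

    // The timeout can be changed at any time, even after the model is compiled;
    // setting it to 0 effectively disables the requests' collection
    compiled_model.set_property(ov::auto_batch_timeout(0));
    return 0;
}
```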
The following are limitations of the current implementations:
@@ -119,11 +120,12 @@ Following the OpenVINO convention for devices names, the *batching* device is na
### Testing Automatic Batching Performance with the Benchmark_App

The `benchmark_app`, which exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md) versions, is the best way to evaluate the performance of the Automatic Batching:
- The most straightforward way is using the performance hints:
- - benchmark_app **-hint tput** -d GPU -m 'path to your favorite model'
- benchmark_app **-hint tput** -d GPU -m 'path to your favorite model'
- Overriding the strict rules of implicit reshaping by the batch dimension via the explicit device notion:
- - benchmark_app **-hint none -d BATCH:GPU** -m 'path to your favorite model'
- benchmark_app **-hint none -d BATCH:GPU** -m 'path to your favorite model'
- Finally, overriding the automatically-deduced batch size as well:
- - $benchmark_app -hint none -d **BATCH:GPU(16)** -m 'path to your favorite model'
- $benchmark_app -hint none -d **BATCH:GPU(16)** -m 'path to your favorite model'
- notice that some shell versions (e.g. `bash`) may require adding quotes around complex device names, i.e. -d "BATCH:GPU(16)"

The last example is also applicable to the CPU or any other device that generally supports the batched execution.
@@ -1,6 +1,6 @@
# High-level Performance Hints {#openvino_docs_OV_UG_Performance_Hints}

Each of the OpenVINO's [supported devices](supported_plugins/Supported_Devices.md) offers low-level performance settings. Tweaking this detailed configuration requires deep architecture understanding.
Each of the OpenVINO's [supported devices](supported_plugins/Device_Plugins.md) offers low-level performance settings. Tweaking this detailed configuration requires deep architecture understanding.
Also, while the performance may be optimal for the specific combination of the device and the inferred model, the resulting configuration is not necessarily optimal for another device or model.
The OpenVINO performance hints are the new way to configure the performance with the _portability_ in mind.
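For illustration, a minimal C++ sketch of applying the hints (the model path and device choices are assumptions, not part of this document):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // illustrative path

    // Latency-oriented configuration; the device selects the low-level details
    auto latency_model = core.compile_model(model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));

    // The same model tuned for throughput, optionally with the expected parallel slack
    auto tput_model = core.compile_model(model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
        ov::hint::num_requests(4));
    return 0;
}
```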
@@ -37,7 +37,7 @@ OpenVINO runtime also has several execution capabilities which work on top of ot
Devices similar to the ones we have used for benchmarking can be accessed using [Intel® DevCloud for the Edge](https://devcloud.intel.com/edge/), a remote development environment with access to Intel® hardware and the latest versions of the Intel® Distribution of the OpenVINO™ Toolkit. [Learn more](https://devcloud.intel.com/edge/get_started/devcloud/) or [Register here](https://inteliot.force.com/DevcloudForEdge/s/).

## Features support matrix
## Features support matrix {#openvino_docs_OV_UG_features_support_matrix}
The table below demonstrates support of key features by OpenVINO device plugins.

| Capability | [CPU](CPU.md) | [GPU](GPU.md) | [GNA](GNA.md) | [Arm® CPU](ARM_CPU.md) |
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c47ede993681ba3f0a3e3f4274369ee1854365b1bcd1b5cb0f649a781fdf51bd
size 6215
oid sha256:1af95a7e8f12f3e663530e6d7eb6b48633f759aa7d83459633f36655a67047e8
size 174761
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9a4fce51076df19fbca04a36d6886765771f8ffc174bebbd751bfc77d91ab1f2
size 7081
oid sha256:63b9d3bbea1efba0d30c465dcaa3552a61c5c4317d073f8993ec08f3f9db051b
size 132583
@@ -5,9 +5,9 @@
In many cases, a network expects a pre-processed image, so make sure you do not perform unnecessary steps in your code:
- Model Optimizer can efficiently bake the mean and normalization (scale) values into the model (for example, to the weights of the first convolution). Please see [relevant Model Optimizer command-line options](../MO_DG/prepare_model/Additional_Optimizations.md).
- Let the OpenVINO accelerate other means of [Image Pre-processing and Conversion](../OV_Runtime_UG/preprocessing_overview.md).
- Note that in many cases, you can directly share the (input) data with the OpenVINO, for example consider [remote tensors API of the GPU Plugin](../OV_Runtime_UG//supported_plugins/GPU_RemoteTensor_API.md).
- You can directly input data that is already in the _on-device_ memory, by using the [remote tensors API of the GPU Plugin](../OV_Runtime_UG//supported_plugins/GPU_RemoteTensor_API.md).
## Prefer OpenVINO Async API <a name="ov-async-api"></a>
## Prefer OpenVINO Async API
The API of the inference requests offers Sync and Async execution. While the `ov::InferRequest::infer()` is inherently synchronous and executes immediately (effectively serializing the execution flow in the current application thread), the Async "splits" the `infer()` into `ov::InferRequest::start_async()` and `ov::InferRequest::wait()`. Please consider the [API examples](../OV_Runtime_UG/ov_infer_request.md).

A typical use-case for the `ov::InferRequest::infer()` is running a dedicated application thread per source of inputs (e.g. a camera), so that every step (frame capture, processing, results parsing and associated logic) is kept serial within the thread.
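A minimal C++ sketch contrasting the two styles (the model path is illustrative):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto compiled_model = core.compile_model("model.xml", "CPU");  // illustrative path

    // Synchronous style: the calling thread blocks until the results are ready
    auto sync_request = compiled_model.create_infer_request();
    sync_request.infer();

    // Asynchronous style: the call returns immediately and the thread is free
    // to capture/populate the next input while the device is busy
    auto async_request = compiled_model.create_infer_request();
    async_request.start_async();
    // ... do other work here (e.g. prepare the next frame) ...
    async_request.wait();  // block only when the results are actually needed
    return 0;
}
```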
@@ -15,9 +15,9 @@ In contrast, the `ov::InferRequest::start_async()` and `ov::InferRequest::wait()
> **NOTE**: Although the Synchronous API can be somewhat easier to start with, in the production code always prefer to use the Asynchronous (callbacks-based, below) API, as it is the most general and scalable way to implement the flow control for any possible number of requests (and hence both latency and throughput scenarios).

Let's see how the OpenVINO Async API can improve overall throughput rate of the application. The key advantage of the Async approach is as follows: while a device is busy with the inference, the application can do other things in parallel (e.g. populating inputs or scheduling other requests) rather than wait for the inference to complete.
Let's see how the OpenVINO Async API can improve overall frame rate of the application. The key advantage of the Async approach is as follows: while a device is busy with the inference, the application can do other things in parallel (e.g. populating inputs or scheduling other requests) rather than wait for the current inference to complete first.

In the example below, inference is applied to the results of the video decoding. So it is possible to keep two parallel infer requests, and while the current is processed, the input frame for the next is being captured. This essentially hides the latency of capturing, so that the overall frame rate is rather determined only by the slowest part of the pipeline (decoding IR inference) and not by the sum of the stages.
In the example below, inference is applied to the results of the video decoding. So it is possible to keep two parallel infer requests, and while the current is processed, the input frame for the next is being captured. This essentially hides the latency of capturing, so that the overall frame rate is rather determined only by the slowest part of the pipeline (decoding vs inference) and not by the sum of the stages.

You can compare the pseudo-codes for the regular and async-based approaches:
@@ -36,6 +36,8 @@ You can compare the pseudo-codes for the regular and async-based approaches:
The technique can be generalized to any available parallel slack. For example, you can do inference and simultaneously encode the resulting or previous frames or run further inference, like emotion detection on top of the face detection results.
Refer to the [Object Detection C++ Demo](@ref omz_demos_object_detection_demo_cpp), [Object Detection Python Demo](@ref omz_demos_object_detection_demo_python) (latency-oriented Async API showcase) and [Benchmark App Sample](../../samples/cpp/benchmark_app/README.md) for complete examples of the Async API in action.

> **NOTE**: Using the Asynchronous API is a must for [throughput-oriented scenarios](./dldt_deployment_optimization_tput.md).

### Notes on Callbacks
Notice that the Async's `ov::InferRequest::wait()` waits for the specific request only. However, running multiple inference requests in parallel provides no guarantees on the completion order. This may complicate a possible logic based on the `ov::InferRequest::wait`. The most scalable approach is using callbacks (set via the `ov::InferRequest::set_callback`) that are executed upon completion of the request. The callback functions will be used by the OpenVINO runtime to notify on the results (or errors).
This is a more event-driven approach.
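A minimal callback-based sketch, assuming an illustrative model path; error handling is deliberately simplistic:

```cpp
#include <exception>
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto compiled_model = core.compile_model("model.xml", "CPU");  // illustrative path
    auto request = compiled_model.create_infer_request();

    // The callback is invoked by the OpenVINO runtime when the request completes
    request.set_callback([&request](std::exception_ptr error) {
        if (error) {
            std::rethrow_exception(error);  // or handle the error gracefully
        }
        // Keep the work here minimal: e.g. just parse the output and schedule the next input
        auto output = request.get_output_tensor();
        std::cout << "Done, output elements: " << output.get_size() << std::endl;
    });

    request.start_async();
    request.wait();  // in a real app, the main thread would keep doing other work instead
    return 0;
}
```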
@@ -44,8 +46,13 @@ Few important points on the callbacks:
- It is the application's responsibility to ensure that any callback function is thread-safe
- Although executed asynchronously by dedicated threads, the callbacks should NOT include heavy operations (e.g. I/O) and/or blocking calls. Keep the work done by any callback to a minimum.
## "get_tensor" Idiom <a name="new-request-based-api"></a>
## "get_tensor" Idiom
Within the OpenVINO, each device may have different internal requirements on the memory padding, alignment, etc. for intermediate tensors. The **input/output tensors** are also accessible by the application code.
As every `ov::InferRequest` is created by the particular instance of the `ov::CompiledModel` (which is already device-specific), the requirements are respected and the requests' input/output tensors are still device-friendly.
Thus:
* `get_tensor` (that offers the `data()` method to get a system-memory pointer to the tensor's content) is a recommended way to populate the inference inputs (and read back the outputs) **from/to the host memory**
* For example, for the GPU device, the inputs/outputs tensors are mapped to the host (which is fast) only when the `get_tensor` is used, while for the `set_tensor` a copy into the internal GPU structures may happen
* In contrast, when the input tensors are already in the **on-device memory** (e.g. as a result of the video-decoding), prefer the `set_tensor` as a zero-copy way to proceed
* Consider [GPU device Remote tensors API](../OV_Runtime_UG//supported_plugins/GPU_RemoteTensor_API.md).
`get_tensor` is a recommended way to populate the inference inputs (and read back the outputs), as it internally allocates the data with right padding/alignment for the device. For example, the GPU inputs/outputs tensors are mapped to the host (which is fast) only when the `get_tensor` is used, while for the `set_tensor` a copy into the internal GPU structures may happen.
Please consider the [API examples](../OV_Runtime_UG/ov_infer_request.md).
In contrast, the `set_tensor` is a preferable way to handle remote tensors, [for example with the GPU device](../OV_Runtime_UG//supported_plugins/GPU_RemoteTensor_API.md).
Please consider the [API examples](../OV_Runtime_UG/ov_infer_request.md) for `get_tensor` and `set_tensor`.
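A short sketch of the `get_tensor` idiom for host-memory inputs/outputs (the model path is illustrative and a single fp32 input/output is assumed):

```cpp
#include <algorithm>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto compiled_model = core.compile_model("model.xml", "CPU");  // illustrative path
    auto request = compiled_model.create_infer_request();

    // Write directly into the request's own input tensor, which is already
    // allocated with the device-friendly padding/alignment
    ov::Tensor input = request.get_input_tensor();
    float* data = input.data<float>();
    std::fill(data, data + input.get_size(), 0.0f);  // populate with real data here

    request.infer();

    // Read the results back the same way
    ov::Tensor output = request.get_output_tensor();
    const float* results = output.data<float>();
    (void)results;
    return 0;
}
```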
@@ -10,35 +10,35 @@
openvino_docs_deployment_optimization_guide_latency
openvino_docs_deployment_optimization_guide_tput
openvino_docs_deployment_optimization_guide_hints
openvino_docs_deployment_optimization_guide_internals

@endsphinxdirective
## Deployment Optimizations Overview {#openvino_docs_deployment_optimization_guide_overview}
Runtime or deployment optimizations focus on tuning of the inference parameters (e.g. optimal number of the requests executed simultaneously) and other means of how a model is _executed_.
Runtime or deployment optimizations are focused on tuning of the inference _parameters_ (e.g. optimal number of the requests executed simultaneously) and other means of how a model is _executed_.

Here, possible optimization should start with defining the use-case. For example, whether the target scenario emphasizes throughput over latency like processing millions of samples by overnight jobs in the data centers.
In contrast, real-time usages would likely trade off the throughput to deliver the results at minimal latency.
Often this is a combined scenario that targets highest possible throughput while maintaining a specific latency threshold.
As referenced in the parent [performance introduction topic](./dldt_optimization_guide.md), the [dedicated document](./model_optimization_guide.md) covers the **model-level optimizations** like quantization that unlocks the [int8 inference](../OV_Runtime_UG/Int8Inference.md). Model optimizations are most general and help any scenario and any device (that accelerates the quantized models). The relevant _runtime_ configuration is `ov::hint::inference_precision`, allowing the devices to trade the accuracy for the performance (e.g. by allowing the fp16/bf16 execution for the layers that remain in fp32 after quantization of the original fp32 model).

Each of the [OpenVINO supported devices](../OV_Runtime_UG/supported_plugins/Device_Plugins.md) offers low-level performance configuration. This allows leveraging the optimal model performance on the _specific_ device, but may require careful re-tuning when the model or device has changed.
**If the performance portability is of concern, consider using the [OpenVINO High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) first.**
Then, possible optimization should start with defining the use-case. For example, whether the target scenario emphasizes throughput over latency like processing millions of samples by overnight jobs in the data centers.
In contrast, real-time usages would likely trade off the throughput to deliver the results at minimal latency. Often this is a combined scenario that targets highest possible throughput while maintaining a specific latency threshold.
Below you can find a summary of the associated tips.
Finally, how the full-stack application uses the inference component _end-to-end_ is important.
For example, what are the stages that need to be orchestrated? In some cases a significant part of the workload time is spent on bringing and preparing the input data. As detailed in the section on the [general optimizations](./dldt_deployment_optimization_common.md), the inputs population can be performed asynchronously to the inference. Also, in many cases the (image) [pre-processing can be offloaded to the OpenVINO](../OV_Runtime_UG/preprocessing_overview.md). For variably-sized inputs, consider [dynamic shapes](../OV_Runtime_UG/ov_dynamic_shapes.md) to efficiently connect the data input pipeline and the model inference.
These are common performance tricks that help both latency and throughput scenarios.
How the full-stack application uses the inference component _end-to-end_ is also important. For example, what are the stages that need to be orchestrated? In some cases a significant part of the workload time is spent on bringing and preparing the input data. Below you can find multiple tips on connecting the data input pipeline and the model inference efficiently.
These are also common performance tricks that help both latency and throughput scenarios.

Similarly, the _model-level_ optimizations like [quantization that unlocks the int8 inference](../OV_Runtime_UG/Int8Inference.md) are general and help any scenario. As referenced in the [performance introduction topic](./dldt_optimization_guide.md), these are covered in the [dedicated document](./model_optimization_guide.md). Additionally, the `ov::hint::inference_precision` allows the devices to trade the accuracy for the performance at the _runtime_ (e.g. by allowing the fp16/bf16 execution for the layers that remain in fp32 after quantization of the original fp32 model).
Further documents cover the associated _runtime_ performance optimizations topics. Please also consider [matrix support of the features by the individual devices](@ref openvino_docs_OV_UG_features_support_matrix).

**General, application-level optimizations**, and specifically:

Further documents cover the _runtime_ performance optimizations topics. Please also consider [matrix support of the features by the individual devices](../OV_Runtime_UG/supported_plugins/Device_Plugins.md).
* [Inputs Pre-processing with the OpenVINO](../OV_Runtime_UG/preprocessing_overview.md)
[General, application-level optimizations](./dldt_deployment_optimization_common.md):

* Inputs Pre-processing with the OpenVINO
* [Async API and 'get_tensor' Idiom](./dldt_deployment_optimization_common.md)

* Async API and 'get_tensor' Idiom
* For variably-sized inputs, consider [dynamic shapes](../OV_Runtime_UG/ov_dynamic_shapes.md)

Use-case specific optimizations along with some implementation details:
**Use-case specific optimizations** along with some implementation details:

* Optimizing for [throughput](./dldt_deployment_optimization_tput.md) and [latency](./dldt_deployment_optimization_latency.md)

* [OpenVINO's high-level performance hints](./dldt_deployment_optimization_hints.md) as the portable, future-proof approach for performance configuration
* [OpenVINO's high-level performance hints](./dldt_deployment_optimization_hints.md) as the portable, future-proof approach for performance configuration, that does not require re-tuning when the model or device has changed.
* **If the performance portability is of concern, consider using the [hints](../OV_Runtime_UG/performance_hints.md) first.**
@@ -12,7 +12,7 @@ Also, while the resulting performance may be optimal for the specific combinatio
Beyond execution _parameters_ there are potentially many device-specific details like _scheduling_ that greatly affect the performance.
Specifically, GPU-oriented tricks like batching, which combines many (potentially tens) of input images to achieve optimal throughput, do not always map well to the CPU, as e.g. detailed in the next sections.
The hints allow hiding the _execution_ specifics required to saturate the device. For example, there is no need to explicitly combine multiple inputs into a batch to achieve good GPU performance.
Instead, it is possible to keep a separate infer request per camera or another source of input and process the requests in parallel using <a href="#ov-async-api">OpenVINO Async API</a>.
Instead, it is possible to keep a separate infer request per camera or another source of input and process the requests in parallel using Async API as explained in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common).

The only requirement for the application to leverage the throughput is **running multiple inference requests in parallel**.
OpenVINO's device-specific implementation of the hints will take care of the rest. This allows a developer to greatly simplify the app-logic.
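For illustration, a C++ sketch that lets the hint configure the device and then creates as many requests as the device reports to be optimal (the model path and device are assumptions):

```cpp
#include <vector>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto compiled_model = core.compile_model("model.xml", "GPU",  // illustrative path/device
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    // Ask the device how many requests it can efficiently run in parallel
    uint32_t n = compiled_model.get_property(ov::optimal_number_of_infer_requests);

    std::vector<ov::InferRequest> requests;
    for (uint32_t i = 0; i < n; ++i)
        requests.push_back(compiled_model.create_infer_request());

    // e.g. one request per camera/input source: populate inputs, then run them all asynchronously
    for (auto& r : requests)
        r.start_async();
    for (auto& r : requests)
        r.wait();
    return 0;
}
```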
@@ -0,0 +1,24 @@
# Further Low-Level Implementation Details {#openvino_docs_deployment_optimization_guide_internals}
## Throughput on the CPU: Internals
As explained in the [throughput-related section](./dldt_deployment_optimization_tput.md), the OpenVINO streams are a means of running multiple requests in parallel.
In order to best serve multiple inference requests executed simultaneously, the inference threads are grouped/pinned to the particular CPU cores, constituting the "CPU" streams.
This provides much better performance for the networks than batching, especially for the many-core machines:


Compared with the batching, the parallelism is somewhat transposed (i.e. performed over inputs, with much less synchronization within CNN ops):


Notice that [high-level performance hints](../OV_Runtime_UG/performance_hints.md) allow the implementation to select the optimal number of the streams, _depending on the model compute demands_ and CPU capabilities (including [int8 inference](../OV_Runtime_UG/Int8Inference.md) hardware acceleration, number of cores, etc).
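For illustration, a sketch contrasting explicit stream configuration with the hint-based one (the model path is illustrative; property names follow the OpenVINO 2.0 API):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // illustrative path

    // Explicit low-level control: request four CPU streams
    auto manual = core.compile_model(model, "CPU", ov::num_streams(4));

    // Portable alternative: let the throughput hint pick the number of streams
    auto hinted = core.compile_model(model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    return 0;
}
```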
## Automatic Batching Internals
As explained in the section on the [automatic batching](../OV_Runtime_UG/automatic_batching.md), the feature performs on-the-fly grouping of the inference requests to improve device utilization.
The Automatic Batching relaxes the requirement for an application to saturate devices like GPU by _explicitly_ using a large batch. It performs transparent inputs gathering from
individual inference requests followed by the actual batched execution, with no programming effort from the user:


Essentially, the Automatic Batching shifts the asynchronicity from the individual requests to the groups of requests that constitute the batches. Thus, for the execution to be efficient it is very important that the requests arrive timely, without causing a batching timeout.
Normally, the timeout should never be hit. It is rather a graceful way to handle the application exit (when the inputs are not arriving anymore, so it is not possible to collect the full batch).

So if your workload experiences the timeouts (resulting in the performance drop, as the timeout value adds itself to the latency of every request), consider balancing the timeout value vs the batch size. For example, in many cases having a smaller timeout value and batch size may yield better performance than a large batch size coupled with a timeout value that cannot guarantee accommodating the full number of the required requests.

Finally, following the "get_tensor idiom" section from the [general optimizations](./dldt_deployment_optimization_common.md) helps the Automatic Batching to save on inputs/outputs copies. Thus, in your application always prefer the "get" versions of the tensors' data access APIs.
@@ -12,18 +12,19 @@
## Latency Specifics
A significant fraction of applications is focused on the situations where typically a single model is loaded (and a single input is used) at a time.
This is a regular "consumer" use case and a default (also for the legacy reasons) performance setup for any OpenVINO device.
Notice that an application can create more than one request if needed (for example to support asynchronous inputs population); the question is really about how many requests are being executed in parallel.
This is a regular "consumer" use case.
While an application can create more than one request if needed (for example to support [asynchronous inputs population](./dldt_deployment_optimization_common.md)), the inference performance depends on **how many requests are being inferenced in parallel** on a device.

Similarly, when multiple models are served on the same device, it is important whether the models are executed simultaneously, or in a chain (for example in the inference pipeline).
As expected, the lowest latency is achieved with only one concurrent inference at a moment. Accordingly, any additional concurrency usually results in the latency growing fast.
As expected, the easiest way to achieve the lowest latency is **running only one concurrent inference at a moment** on the device. Accordingly, any additional concurrency usually results in the latency growing fast.
However, for example, specific configurations, like multi-socket CPUs, can deliver as high a number of requests (at the same minimal latency) as there are NUMA nodes in the machine.
Thus, human expertise is required to get the most out of the device even in the latency case. Consider using [OpenVINO high-level performance hints](../OV_Runtime_UG/performance_hints.md) instead.
However, some conventional "root" devices (e.g. CPU or GPU) can be in fact internally composed of several "sub-devices". In many cases letting OpenVINO transparently leverage the "sub-devices" helps to improve the application throughput (e.g. serve multiple clients simultaneously) without degrading the latency. For example, multi-socket CPUs can deliver as high a number of requests (at the same minimal latency) as there are NUMA nodes in the machine. Similarly, a multi-tile GPU (which is essentially multiple GPUs in a single package) can deliver a multi-tile scalability with the number of inference requests, while preserving the single-tile latency.

Thus, human expertise is required to get more _throughput_ out of the device even in the inherently latency-oriented cases. OpenVINO can take over this configuration burden via [high-level performance hints](../OV_Runtime_UG/performance_hints.md).

> **NOTE**: [OpenVINO performance hints](./dldt_deployment_optimization_hints.md) is a recommended way for performance configuration, which is both device-agnostic and future-proof.

In the case when there are multiple models to be used simultaneously, consider using different devices for inferencing the different models. Finally, when multiple models are executed in parallel on the device, using additional `ov::hint::model_priority` may help to define relative priorities of the models (please refer to the documentation on the [matrix features support for OpenVINO devices](../OV_Runtime_UG/supported_plugins/Device_Plugins.md) to check for the support of the feature by the specific device).
In the case when there are multiple models to be used simultaneously, consider using different devices for inferencing the different models. Finally, when multiple models are executed in parallel on the device, using additional `ov::hint::model_priority` may help to define relative priorities of the models (please refer to the documentation on the [matrix features support for OpenVINO devices](@ref openvino_docs_OV_UG_features_support_matrix) to check for the support of the feature by the specific device).

## First-Inference Latency and Model Load/Compile Time
There are cases when the model loading/compilation contributes heavily to the end-to-end latency.
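For illustration, a sketch of reducing that contribution with the model cache (the cache directory and model path are illustrative):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Compiled blobs are stored here; subsequent compile_model() calls for the same
    // model/device/config reuse the cache, which reduces the first-inference latency
    core.set_property(ov::cache_dir("model_cache"));  // illustrative directory

    auto compiled_model = core.compile_model("model.xml", "GPU",  // illustrative path
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));
    return 0;
}
```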
@@ -1,68 +1,79 @@
# Optimizing for Throughput {#openvino_docs_deployment_optimization_guide_tput}

## General Throughput Considerations
As described in the section on the [latency-specific considerations](./dldt_deployment_optimization_latency.md), one possible use-case is delivering every single request at the minimal delay.
Throughput, on the other hand, is about inference scenarios in which potentially large number of inference requests are served simultaneously.
Here, the overall application throughput can be significantly improved with the right performance configuration.
Also, if the model is not already compute- or memory bandwidth-limited, the associated increase in latency is not linearly dependent on the number of requests executed in parallel.
As described in the section on the [latency-specific considerations](./dldt_deployment_optimization_latency.md), one possible use-case is delivering every single request at the minimal delay.
Throughput, on the other hand, is about inference scenarios in which potentially large **number of inference requests are served simultaneously to improve the device utilization**.

With the OpenVINO there are two major means of running the multiple requests simultaneously: batching and "streams", explained in this document.
Yet, different GPUs behave differently with batch sizes, just like different CPUs require a different number of execution streams to maximize the throughput.
Predicting inference performance is difficult and finding optimal execution parameters requires direct experiments and measurements.
One possible throughput optimization strategy is to set an upper bound for latency and then increase the batch size or number of the streams until that tail latency is met (or the throughput is not growing anymore).
Also, consider [Deep Learning Workbench](@ref workbench_docs_Workbench_DG_Introduction).
Here, the overall application inference rate can be significantly improved with the right performance configuration.
Also, if the model is not already memory bandwidth-limited, the associated increase in latency is not linearly dependent on the number of requests executed in parallel.
With the OpenVINO there are two major means of processing multiple inputs simultaneously: **batching** and **streams**, explained in this document.

Finally, the [automatic multi-device execution](../OV_Runtime_UG/multi_device.md) helps to improve the throughput, please also see the section below.
While the same approach of optimizing the parameters of each device separately does work, the resulting multi-device performance is a fraction (that is different for different models) of the “ideal” (plain sum) performance.
## OpenVINO Streams
As detailed in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common), running multiple inference requests asynchronously is important for general application efficiency.
The [Asynchronous API](./dldt_deployment_optimization_common.md) is in fact the "application side" of scheduling, as every device internally implements a queue. The queue acts as a buffer, storing the inference requests until retrieved by the device at its own pace.

Overall, the latency-throughput relation is not linear and very _device_ specific. It is also tightly integrated with _model_ characteristics.
As the scenery of the possible inference devices had already become pretty diverse, the OpenVINO has introduced the dedicated notion of the high-level performance configuration "hints" to describe the target application scenarios.
The hints are described [here](dldt_deployment_optimization_hints.md).
Further, the devices may actually process multiple inference requests in parallel in order to improve the device utilization and overall throughput. This parallelism is commonly referred to as 'streams'. Some devices (like GPU) may run several requests per stream to amortize the host-side costs.
Notice that streams are **really executing the requests in parallel, but not in the lock step** (as e.g. the batching does), which makes the streams fully compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) when individual requests can have different shapes.

> **NOTE**: [OpenVINO performance hints](dldt_deployment_optimization_hints.md) is a recommended way for performance configuration, which is both device-agnostic and future-proof.

The rest of the document provides details on the OpenVINO's low-level ways to optimize the throughput.

## Low-Level Implementation Details
### OpenVINO Streams <a name="ov-streams"></a>
As detailed in the section <a href="#ov-async-api">OpenVINO Async API</a>, running multiple inference requests asynchronously is important for general application efficiency.
Additionally, most devices support running multiple inference requests in parallel in order to improve the device utilization. The _level_ of the parallelism (i.e. how many requests are really executed in parallel on the device) is commonly referred to as the number of 'streams'. Some devices run several requests per stream to amortize the host-side costs.
Notice that streams (that can be considered as independent queues) are really executing the requests in parallel, but not in the lock step (as e.g. the batching does); this makes the streams much more compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) when individual requests can have different shapes.

Also, notice that for efficient asynchronous execution, the streams are actually handling inference with a special pool of threads.
So each time you start inference requests (potentially from different application threads), they are actually muxed into an inference queue of the particular `ov::CompiledModel`.
For efficient asynchronous execution, the streams are actually handling inference with a special pool of threads.
So each time you start inference requests (potentially from different application threads), they are actually muxed into an inference queue of the particular `ov::CompiledModel`.
If there is a vacant stream, it pops the request from the queue and actually expedites that to the on-device execution.
The usage of multiple streams is an inherently throughput-oriented approach, as every stream requires a dedicated memory to operate in parallel to the rest of the streams (read-only data like weights are usually shared between all streams).
Also, the streams inflate the load/compilation time.
This is why the [latency hint](./dldt_deployment_optimization_hints.md) governs a device to create a bare minimum of streams (usually just one).
The multi-streams approach is inherently throughput-oriented, as every stream requires a dedicated device memory to do inference in parallel to the rest of the streams.
Although similar, the streams are always preferable compared to creating multiple `ov::CompiledModel` instances for the same model, as weights memory is shared across streams, reducing the overall memory consumption.
Notice that the streams inflate the model load/compilation time.
Finally, using streams does increase the latency of an individual request; this is why, for example, the [latency hint](./dldt_deployment_optimization_hints.md) governs a device to create a bare minimum of streams (usually just one).
Please find the considerations for the optimal number of the streams in the later sections.

Finally, the streams are always preferable compared to creating multiple instances of the same model, as weights memory is shared across streams, reducing possible memory consumption.
## Batching
Hardware accelerators like GPUs are optimized for massive compute parallelism, so the batching helps to saturate the device and leads to higher throughput.
While the streams (described earlier) already allow hiding the communication overheads and certain bubbles in the scheduling, running multiple OpenCL kernels simultaneously is less GPU-efficient, compared to calling a kernel on the multiple inputs at once.
As explained in the next section, the batching is a must to leverage maximum throughput on the GPUs.

### Throughput on the CPU: Internals <a name="cpu-streams"></a>
In order to best serve multiple inference requests simultaneously, the inference threads are grouped/pinned to the particular CPU cores, constituting the CPU streams.
This provides much better performance for the networks than batching, especially for the many-core machines:

There are two primary ways of using the batching to help application performance:
* Collecting the inputs explicitly on the application side and then _sending these batched requests to the OpenVINO_
   * Although this gives flexibility with the possible batching strategies, the approach requires redesigning the application logic
* _Sending individual requests_, while configuring the OpenVINO to collect and perform inference on the requests in batch [automatically](../OV_Runtime_UG/automatic_batching.md).
In both cases, the optimal batch size is very device-specific. Also, as explained below, the optimal batch size depends on the model, inference precision and other factors.
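For illustration of the first (application-side) option, a C++ sketch that reshapes the model to an assumed batch of 4 and feeds one batched input per `infer()` call (the model path is illustrative, and the model's batch dimension must be identifiable for `ov::set_batch` to work):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // illustrative path

    // Explicit (application-side) batching: reshape the model to a batch of 4
    ov::set_batch(model, 4);
    auto compiled_model = core.compile_model(model, "GPU");
    auto request = compiled_model.create_infer_request();

    ov::Tensor input = request.get_input_tensor();  // the shape now starts with 4
    // ... copy 4 pre-processed images into 'input' here ...
    request.infer();  // one call processes the whole batch
    return 0;
}
```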
Compared with the batching, the parallelism is somewhat transposed (i.e. performed over inputs, with much less synchronization within CNN ops):


## Choosing the Batch Size and Number of Streams
Predicting the inference performance is difficult and finding optimal execution parameters requires direct experiments with measurements.
One possible throughput optimization strategy is to **set an upper bound for latency and then increase the batch size or number of the streams until that tail latency is met (or the throughput is not growing anymore)**.
Also, consider [Deep Learning Workbench](@ref workbench_docs_Workbench_DG_Introduction) that builds handy latency vs throughput charts, iterating over possible values of the batch size and number of streams.
Notice that [high-level performance hints](../OV_Runtime_UG/performance_hints.md) allow the implementation to select the optimal number of the streams, _depending on the model compute demands_ and CPU capabilities (including [int8 inference](../OV_Runtime_UG/Int8Inference.md) hardware acceleration, number of cores, etc).
Different devices behave differently with the batch sizes. The optimal batch size depends on the model, inference precision and other factors. Similarly, different devices require a different number of execution streams to maximize the throughput.
Below are general recommendations:
* For the **CPU always prefer the streams** over the batching
   * Create as many streams as your application runs the requests simultaneously
   * Number of streams should be enough to meet the _average_ parallel slack rather than the peak load
   * _Maximum number of streams_ equals **total number of CPU cores**
      * As explained in the [CPU streams internals](dldt_deployment_optimization_internals.md), the CPU cores are evenly distributed between streams, so one core per stream is the finest-grained configuration
* For the **GPU**:
   * When the parallel slack is small (e.g. only 2-4 requests executed simultaneously), then using the streams for the GPU may suffice
      * Notice that the GPU runs 2 requests per stream
   * _Maximum number of streams_ is usually 2; for more portability consider using the `ov::streams::AUTO` (`GPU_THROUGHPUT_AUTO` in the pre-OpenVINO 2.0 parlance)
   * Typically, for 4 and more requests the batching delivers better throughput for the GPUs
   * Batch size can be calculated as "number of inference requests executed _in parallel_" divided by the "number of requests that the streams consume"
      * E.g. if you process 16 cameras (by 16 requests inferenced _simultaneously_) with 2 GPU streams (each can process 2 requests), the batch size per request is 16/(2*2)=4
### Automatic Batching Internals <a name="ov-auto-batching"></a>
While the GPU plugin fully supports the general notion of the streams, the associated performance (throughput) improvements are usually modest.
The primary reason is that, while the streams allow hiding the communication overheads and certain bubbles in device utilization, running multiple OpenCL kernels on the GPU simultaneously is less efficient, compared to calling a kernel on the multiple inputs at once.
> **NOTE**: When playing with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) use only the streams (no batching), as they tolerate individual requests having different shapes.

When the parallel slack is small (e.g. only 2-4 requests executed simultaneously), then using the streams for the GPU may suffice. Also streams are fully compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) when individual requests can have different shapes.
Typically, for 4 and more requests the batching delivers better throughput for the GPUs. Using the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) is the most portable and future-proof option, allowing the OpenVINO to find the best combination of streams and batching for a given scenario.
As explained in the section on the [automatic batching](../OV_Runtime_UG/automatic_batching.md), the feature performs on-the-fly grouping of the inference requests to improve device utilization.
The Automatic Batching relaxes the requirement for an application to saturate devices like GPU by _explicitly_ using a large batch. It performs transparent inputs gathering from
individual inference requests followed by the actual batched execution, with no programming effort from the user:

> **NOTE**: Using the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md), explained in the next section, is the most portable and future-proof option, allowing the OpenVINO to find the best combination of streams and batching for a given scenario and model.

Essentially, the Automatic Batching shifts the asynchronicity from the individual requests to the groups of requests that constitute the batches. Thus, for the execution to be efficient it is very important that the requests arrive timely, without causing a batching timeout.
Normally, the timeout should never be hit. It is rather a graceful way to handle the application exit (when the inputs are not arriving anymore, so it is not possible to collect the full batch).
## OpenVINO Hints: Selecting Optimal Execution and Parameters **Automatically**
Overall, the latency-throughput relation is not linear and very _device_ specific. It is also tightly integrated with _model_ characteristics.
As the scenery of the possible inference devices had already become pretty diverse, the OpenVINO has introduced the dedicated notion of the high-level performance configuration "hints" to describe the target application scenarios.
The hints are described [here](./dldt_deployment_optimization_hints.md).
So if your workload experiences the timeouts (resulting in the performance drop, as the timeout value adds itself to the latency of every request), consider balancing the timeout value vs the batch size. For example, in many cases having a smaller timeout value and batch size may yield better performance than a large batch size coupled with a timeout value that cannot guarantee accommodating the full number of the required requests.
The hints also obviate the need for explicit (application-side) batching. With the hints, the only requirement for the application is to run multiple individual requests using [Async API](./dldt_deployment_optimization_common.md) and let the OpenVINO decide whether to collect the requests and execute them in batch, streams, or both.

Finally, following the "get_tensor idiom" section from the [general optimizations](./dldt_deployment_optimization_common.md) helps the Automatic Batching to save on inputs/outputs copies. Thus, in your application always prefer the "get" versions of the tensor data access APIs.
> **NOTE**: [OpenVINO performance hints](./dldt_deployment_optimization_hints.md) is a recommended way for performance configuration, which is both device-agnostic and future-proof.

## Multi-Device Execution
OpenVINO offers _automatic_, [scalable multi-device inference](../OV_Runtime_UG/multi_device.md). This is a simple _application-transparent_ way to improve the throughput. There is no need to re-architect existing applications for any explicit multi-device support: no explicit network loading to each device, no separate per-device queues, no additional logic to balance the inference requests between devices, etc. From the application point of view, it is communicating to the single device that internally handles the actual machinery.
Just like with other throughput-oriented scenarios, there are two major pre-requisites for optimal multi-device performance:
* Using the [Asynchronous API](@ref openvino_docs_deployment_optimization_guide_common) and [callbacks](../OV_Runtime_UG/ov_infer_request.md) in particular
* Providing the multi-device (and hence the underlying devices) with enough data to crunch. As the inference requests are naturally independent data pieces, the multi-device performs load-balancing at the “requests” (outermost) level to minimize the scheduling overhead.

Notice that the resulting performance is usually a fraction of the “ideal” (plain sum) value, when the devices compete for certain resources, like the memory-bandwidth which is shared between CPU and iGPU.
> **NOTE**: While the legacy approach of optimizing the parameters of each device separately works, the [OpenVINO performance hints](./dldt_deployment_optimization_hints.md) allow configuring all devices (that are part of the specific multi-device configuration) at once.
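For illustration, a minimal sketch of the multi-device setup (the device list and model path are assumptions):

```cpp
#include <vector>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // illustrative path

    // A single "virtual" device that load-balances the inference requests across GPU and CPU
    auto compiled_model = core.compile_model(model, "MULTI:GPU,CPU");

    // Keep all underlying devices busy: create the reported optimal number of requests
    uint32_t n = compiled_model.get_property(ov::optimal_number_of_infer_requests);
    std::vector<ov::InferRequest> requests;
    for (uint32_t i = 0; i < n; ++i)
        requests.push_back(compiled_model.create_infer_request());
    return 0;
}
```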
@@ -9,15 +9,16 @@ Generally, performance means how fast the model processes the live data. Two key


Latency measures inference time (ms) required to process a single input. When it comes to executing multiple inputs simultaneously (e.g. via batching), the overall throughput (inferences per second, or frames per second, FPS, in the specific case of visual processing) is usually of more concern.
To calculate throughput, divide the number of frames that were processed by the processing time.
**Latency** measures inference time (ms) required to process a single input. When it comes to executing multiple inputs simultaneously (e.g. via batching), the overall throughput (inferences per second, or frames per second, FPS, in the specific case of visual processing) is usually of more concern.
To calculate **throughput**, divide the number of inputs that were processed by the processing time.

## End-to-End Application Performance
It is important to separate the "pure" inference time of a neural network and the end-to-end application performance. For example, data transfers between the host and a device may unintentionally affect the performance when a host input tensor is processed on the accelerator like dGPU. Similarly, the image pre-processing may also contribute significantly to the inference time. As detailed in the [getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) section, when drilling into _inference_ performance, one option is to measure all such items separately.
For the end-to-end scenario though, consider the image pre-processing thru the OpenVINO and the asynchronous execution as a way to amortize the communication costs like data transfers. You can find further details in the [general optimizations document](./dldt_deployment_optimization_common.md).
For the **end-to-end scenario** though, consider the image pre-processing thru the OpenVINO and the asynchronous execution as a way to amortize the communication costs like data transfers. You can find further details in the [general optimizations document](./dldt_deployment_optimization_common.md).

"First-inference latency" is another specific case (e.g. when fast application start-up is required) where the resulting performance may be well dominated by the model loading time. Consider [model caching](../OV_Runtime_UG/Model_caching_overview.md) as a way to improve model loading/compilation time.
**First-inference latency** is another specific case (e.g. when fast application start-up is required) where the resulting performance may be well dominated by the model loading time. Consider [model caching](../OV_Runtime_UG/Model_caching_overview.md) as a way to improve model loading/compilation time.

Finally, memory footprint restrictions are another possible concern when designing an application. While this is a motivation for the _model_ optimization techniques referenced in the next section, notice that the throughput-oriented execution is usually much more memory-hungry, as detailed in the [Deployment Optimization Guide](../optimization_guide/dldt_deployment_optimization_guide.md).
Finally, **memory footprint** restrictions are another possible concern when designing an application. While this is a motivation for the _model_ optimization techniques referenced in the next section, notice that the throughput-oriented execution is usually much more memory-hungry, as detailed in the [Runtime Inference Optimizations](../optimization_guide/dldt_deployment_optimization_guide.md).


> **NOTE**: To get performance numbers for OpenVINO, as well as tips on how to measure it and compare with a native framework, check the [Getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) page.
@@ -28,9 +29,9 @@ Finally, memory footprint restrictions is another possible concern when designin
With the OpenVINO there are two primary ways of improving the inference performance, namely model- and runtime-level optimizations. **These two optimization directions are fully compatible**.

- **Model optimization** includes model modification, such as quantization, pruning, optimization of preprocessing, etc. For more details, refer to this [document](./model_optimization_guide.md).
- **Model optimizations** include model modification, such as quantization, pruning, optimization of preprocessing, etc. For more details, refer to this [document](./model_optimization_guide.md).

- **Runtime (Deployment) optimization** includes tuning of model _execution_ parameters. To read more, visit the [Deployment Optimization Guide](../optimization_guide/dldt_deployment_optimization_guide.md).
- **Runtime (Deployment) optimizations** include tuning of model _execution_ parameters. To read more, visit the [Runtime Inference Optimizations](../optimization_guide/dldt_deployment_optimization_guide.md).

## Performance benchmarks
To estimate the performance and compare performance numbers, measured on various supported devices, a wide range of public models are available in the [Performance benchmarks](../benchmarks/performance_benchmarks.md) section.
@@ -6,7 +6,8 @@ while(true) {
// capture frame
// populate NEXT InferRequest
// start NEXT InferRequest //this call is async and returns immediately
// wait for the CURRENT InferRequest //processed in a dedicated thread

// wait for the CURRENT InferRequest
// display CURRENT result
// swap CURRENT and NEXT InferRequests
}
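For illustration, a C++ sketch of the same CURRENT/NEXT pipelining with the OpenVINO 2.0 API (the model path, frame count and the capture/display steps are placeholders):

```cpp
#include <utility>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto compiled_model = core.compile_model("model.xml", "CPU");  // illustrative path
    auto current = compiled_model.create_infer_request();
    auto next = compiled_model.create_infer_request();

    // populate CURRENT with the first frame, then start it
    current.start_async();
    for (int frame = 1; frame < 100; ++frame) {  // illustrative frame count
        // capture the next frame and populate the NEXT request, then kick it off
        next.start_async();
        // wait for the CURRENT request and display/parse its results
        current.wait();
        // swap CURRENT and NEXT for the following iteration
        std::swap(current, next);
    }
    current.wait();  // drain the last in-flight request
    return 0;
}
```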