DOCS: port changes from releases/2022/1 (#11040)
* Added migration for deployment (#10800)
  * Added migration for deployment
  * Addressed comments
* more info after the What's new Sessions' questions (#10803)
  * generalizing the optimal_batch_size vs explicit value message
  * Update docs/OV_Runtime_UG/automatic_batching.md
    Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
* Perf Hints docs and General Opt Guide refactoring (#10815)
  * Brushed the general optimization page
  * Opt GUIDE, WIP
  * perf hints doc placeholder
  * added streams and few other details
  * fixed titles, misprints, etc.
  * Perf hints
  * moving the runtime optimizations intro
  * fixed link
  * Apply suggestions from code review
    Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
  * some details on the FIL and other means when pure inference time is not the only factor
  * shuffled according to general->use-case->device-specifics flow, minor brushing
  * section on optimizing for tput and latency
  * couple of links to the features support matrix
  * Links, brushing, dedicated subsections for Latency/FIL/Tput
  * had to make the link less specific (otherwise docs compilation fails)
  * removing the Temp/Should be moved to the Opt Guide
  * shuffled the tput/latency/etc info into separate documents. Also, the following docs moved from the temp into the specific feature, general product description, or corresponding plugins:
    - openvino_docs_IE_DG_Model_caching_overview
    - openvino_docs_IE_DG_Int8Inference
    - openvino_docs_IE_DG_Bfloat16Inference
    - openvino_docs_OV_UG_NoDynamicShapes
  * fixed toc for ov_dynamic_shapes.md
  * referring the openvino_docs_IE_DG_Bfloat16Inference to avoid docs compilation errors
  * fixed main product TOC, removed ref from the second-level items
  * reviewers remarks
  * reverted the openvino_docs_OV_UG_NoDynamicShapes
  * reverting openvino_docs_IE_DG_Bfloat16Inference and openvino_docs_IE_DG_Int8Inference
  * "No dynamic shapes" to the "Dynamic shapes" as TOC
  * removed duplication
  * minor brushing
  * Caching to the next level in TOC
  * more on the perf counters (for latency and dynamic cases)
* Updated common IE pipeline infer-request section (#10844)
  * Update ov_infer_request.md
  * Apply suggestions from code review
    Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com>
* DOCS: Removed useless 4 spaces in snippets (#10870)
  * Updated snippets
  * Added link to encryption
* [DOCS] ARM CPU plugin docs (#10885)
  * initial commit: ARM_CPU.md added, ARM CPU is added to the list of supported devices
  * Update the list of supported properties
  * Update Device_Plugins.md
  * Update CODEOWNERS
  * Removed quotes in limitations section
  * NVIDIA and Android are added to the list of supported devices
  * Added See Also section and reg sign to arm
  * Added Preprocessing acceleration section
  * Updated list of supported layers
  * Added support disclaimer
  * fixed typos, updated trade and reg symbols
    Co-authored-by: Vitaly Tuzov <vitaly.tuzov@intel.com>
* Try to fix visualization (#10896)
  * New try
* Update Install&Deployment for migration guide to 22/1 (#10933)
* Getting started improvements (#10948)
* Onnx updates (#10962)
* fix broken anchors api reference (#10976)
* add ote repo (#10979)
* DOCS: Increase content width (#10995)
* Fixed compilation

Co-authored-by: Maxim Shevtsov <maxim.y.shevtsov@intel.com>
Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com>
Co-authored-by: Aleksandr Voron <aleksandr.voron@intel.com>
Co-authored-by: Vitaly Tuzov <vitaly.tuzov@intel.com>
Co-authored-by: Ilya Churaev <ilya.churaev@intel.com>
Co-authored-by: Yuan Xu <yuan1.xu@intel.com>
Co-authored-by: Victoria Yashina <victoria.yashina@intel.com>
Co-authored-by: Nikolay Tyukaev <nikolay.tyukaev@intel.com>
@@ -68,6 +68,9 @@ Jenkinsfile @openvinotoolkit/openvino-admins
/src/plugins/intel_gna/ @openvinotoolkit/openvino-ie-gna-maintainers
/src/inference/include/ie/gna/ @openvinotoolkit/openvino-ie-gna-maintainers

# IE ARM CPU:
/docs/OV_Runtime_UG/supported_plugins/ARM_CPU.md @openvinotoolkit/openvino_contrib-arm_plugin-maintainers

# IE Auto (MULTI) plugin:
/src/plugins/auto/ @openvinotoolkit/openvino-ie-auto-multi-maintainers
/src/inference/include/ie/multi-device/ @openvinotoolkit/openvino-ie-auto-multi-maintainers
@@ -46,6 +46,7 @@ endif()
set(LINKCHECKER_PY "" CACHE FILEPATH "Path to linkchecker.py for documentation check dir.")
set(ENABLE_OPENVINO_NOTEBOOKS OFF CACHE BOOL "Build with openvino notebooks")
set(OMZ_DOCS_DIR "" CACHE PATH "Path to open_model_zoo documentation dir.")
set(OTE_DOCS_DIR "" CACHE PATH "Path to training_extensions documentation dir.")
set(WORKBENCH_DOCS_DIR "" CACHE PATH "Path to workbench documentation dir.")
set(OVMS_DOCS_DIR "" CACHE PATH "Path to model server documentation dir.")
set(GRAPH_CSV_DIR "" CACHE PATH "Path to the folder containing csv data for rendering graphs.")
@@ -159,6 +160,15 @@ function(build_docs)
            --output_dir=${DOCS_BUILD_DIR}/workbench)
    endif()

    # ote doc files
    if(EXISTS "${OTE_DOCS_DIR}")
        get_filename_component(OTE_DOCS_DIR "${OTE_DOCS_DIR}" ABSOLUTE)

        list(APPEND commands COMMAND ${PYTHON_EXECUTABLE} ${DOXY_MD_FILTER}
            --input_dir=${OTE_DOCS_DIR}
            --output_dir=${DOCS_BUILD_DIR}/ote)
    endif()

    # ovms doc files
    if(EXISTS "${OVMS_DOCS_DIR}")
        get_filename_component(OVMS_DOCS_DIR "${OVMS_DOCS_DIR}" ABSOLUTE)
@@ -9,7 +9,7 @@ For more details about low-precision model representation please refer to this [
During the model load, each plugin can interpret quantization rules expressed in *FakeQuantize* operations:
- Independently, based on the definition of the *FakeQuantize* operation.
- Using a special library of low-precision transformations (LPT) which applies common rules for generic operations,
such as Convolution, Fully-Connected, Eltwise, etc., and translates "fake-quantized" models into the models with low-precision operations. For more information about the low-precision flow please refer to the following [document](../OV_Runtime_UG/Int8Inference.md).

Here we provide only a high-level overview of the interpretation rules of FakeQuantize.
At runtime each FakeQuantize can be split into two independent operations: **Quantize** and **Dequantize**.
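The Quantize/Dequantize split can be illustrated with a minimal per-element sketch of the FakeQuantize semantics (a plain-Python illustration under the standard range attributes, not OpenVINO code; the helper name is hypothetical):

```python
def fake_quantize(x, in_low, in_high, out_low, out_high, levels=256):
    # Quantize: clamp to the input range, then snap to one of `levels` grid points
    x = min(max(x, in_low), in_high)
    q = round((x - in_low) / (in_high - in_low) * (levels - 1))
    # Dequantize: map the integer grid point back to the output range
    return q / (levels - 1) * (out_high - out_low) + out_low
```

At runtime, a plugin can either execute both halves (emulating the original precision) or keep the intermediate integer `q` and run the subsequent operation in low precision.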
@@ -9,22 +9,19 @@ When evaluating performance of your model with the OpenVINO Runtime, you must me
- Track separately the operations that happen outside the OpenVINO Runtime, like video decoding.

> **NOTE**: Some image pre-processing can be baked into the IR and accelerated accordingly. For more information, refer to [Embedding the Preprocessing](Additional_Optimizations.md). Also consider [_runtime_ preprocessing optimizations](../../optimization_guide/dldt_deployment_optimization_common).

## Tip 2. Getting Credible Performance Numbers

You need to build your performance conclusions on reproducible data. Do the performance measurements with a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, use an aggregated value for the execution time in final projections:

- If the warm-up run does not help, or the execution time still varies, run a large number of iterations and average the results.
- For time values that range too much, consider the geometric mean.
- Beware of throttling and other power oddities. A device can exist in one of several power states. When optimizing your model, consider fixing the device frequency for better reproducibility of the performance data. However, the end-to-end (application) benchmarking should also be performed under real operational conditions.

Refer to the [OpenVINO Samples](../../OV_Runtime_UG/Samples_Overview.md) for code examples of the performance measurements. Almost every sample, except interactive demos, has a `-ni` option to specify the number of iterations.
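The aggregation advice above can be sketched as follows (a plain-Python measurement harness, not an OpenVINO API; `statistics.geometric_mean` requires Python 3.8+):

```python
import statistics
import time

def measure(run, n_iters=100, warmup=10):
    # Discard warm-up iterations: the first runs are almost always slower
    for _ in range(warmup):
        run()
    times = []
    for _ in range(n_iters):
        start = time.perf_counter()
        run()
        times.append(time.perf_counter() - start)
    # Arithmetic mean for stable runs; geometric mean when values range too much
    return statistics.mean(times), statistics.geometric_mean(times)
```

`run` here is any routine you benchmark, e.g. a single inference call; by the AM-GM inequality the geometric mean never exceeds the arithmetic one, so a large gap between the two signals noisy measurements.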
## Getting performance numbers using OpenVINO tool

To get performance numbers, use the dedicated [Benchmark App](../../../samples/cpp/benchmark_app/README.md) sample, which is the best way to produce a performance reference.
It has a lot of device-specific knobs, but the primary usage is as simple as:
```bash
$ ./benchmark_app -d GPU -m <model> -i <input>
```
@@ -36,35 +33,25 @@ $ ./benchmark_app -d CPU -m <model> -i <input>
```
to execute on the CPU instead.

For example, for the CPU throughput mode from the previous section, you can play with the number of streams (the `-nstreams` command-line parameter).
Try different values of the `-nstreams` argument from `1` to the number of CPU cores and find the one that provides the best performance. For example, on an 8-core CPU, compare `-nstreams 1` (which is a latency-oriented scenario) to the `2`, `4` and `8` streams. Notice that `benchmark_app` automatically queries/creates/runs the number of requests required to saturate the given number of streams.

Finally, notice that when you don't specify the number of streams with `-nstreams`, the "AUTO" value for the streams is used, e.g. for the CPU this is [CPU_THROUGHPUT_AUTO](../../OV_Runtime_UG/supported_plugins/CPU.md). You can spot the actual value behind "AUTO" for your machine in the application output.
Notice that the "AUTO" number is not necessarily optimal, so it is generally recommended to experiment either with the benchmark_app's `-nstreams` as described above, or via the [Workbench tool](@ref workbench_docs_Workbench_DG_Introduction). This also allows you to simplify the app logic, as you don't need to combine multiple inputs into a batch to achieve good CPU performance.
Instead, it is possible to keep a separate infer request per camera or another source of input and process the requests in parallel using the Async API.
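The per-source request pattern can be illustrated with plain Python threads standing in for asynchronous infer requests (the `infer` workload and the source names are hypothetical placeholders, not OpenVINO APIs):

```python
from concurrent.futures import ThreadPoolExecutor

def infer(frame):
    # Hypothetical stand-in for an asynchronous infer request on one frame
    return sum(frame)

# One in-flight request per camera (or another source of input)
sources = {"camera0": [1, 2], "camera1": [3, 4]}

with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    futures = {name: pool.submit(infer, frame) for name, frame in sources.items()}
    results = {name: f.result() for name, f in futures.items()}
```

Each source keeps its own request in flight, so no batching logic is needed on the application side.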
Each of the [OpenVINO supported devices](../../OV_Runtime_UG/supported_plugins/Supported_Devices.md) offers performance settings that have command-line equivalents in the [Benchmark App](../../../samples/cpp/benchmark_app/README.md).
While these settings provide really low-level control and allow leveraging the optimal model performance on the _specific_ device, we suggest always starting the performance evaluation with the [OpenVINO High-Level Performance Hints](../../OV_Runtime_UG/performance_hints.md) first:
- benchmark_app **-hint tput** -d 'device' -m 'path to your model'
- benchmark_app **-hint latency** -d 'device' -m 'path to your model'
## Comparing Performance with Native/Framework Code

When comparing the OpenVINO Runtime performance with the framework or another reference code, make sure that both versions are as similar as possible:

- Wrap exactly the inference execution (refer to the [Benchmark App](../../../samples/cpp/benchmark_app/README.md) for examples).
- Do not include model loading time.
- Ensure the inputs are identical for the OpenVINO Runtime and the framework. For example, beware of random values that can be used to populate the inputs.
- Consider [Image Pre-processing and Conversion](../../OV_Runtime_UG/preprocessing_overview.md), while any user-side pre-processing should be tracked separately.
- Make sure to try the same environment settings that the framework developers recommend, for example, for TensorFlow. In many cases, things that are more machine friendly, like respecting NUMA (see <a href="#cpu-checklist">CPU Checklist</a>), might work well for the OpenVINO Runtime as well.
- If applicable, use batching.
- When applicable, leverage the [Dynamic Shapes support](../../OV_Runtime_UG/ov_dynamic_shapes.md).
- If possible, demand the same accuracy. For example, TensorFlow allows `FP16` execution, so when comparing to that, make sure to test the OpenVINO Runtime with `FP16` as well.
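The first two bullets, wrapping only the inference call and excluding the model-loading time, can be sketched as follows (plain Python; `load_model` is a hypothetical placeholder for a framework- or OpenVINO-specific loader):

```python
import time

def load_model():
    time.sleep(0.05)          # hypothetical model-loading cost
    return lambda x: x * 2    # hypothetical "model"

model = load_model()              # NOT timed: model loading is excluded
start = time.perf_counter()       # the timer wraps exactly the inference execution
output = model(21)
elapsed = time.perf_counter() - start
```

The same wrapping must be applied to both the reference framework and the OpenVINO Runtime, otherwise the comparison mixes one-time costs into the steady-state numbers.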
## Using Tools <a name="using-tools"></a>

Whether you are tuning for the first time or doing advanced performance optimization, you need a tool that provides accurate insights. Intel® VTune™ Amplifier gives you the tools to mine and interpret the profiling data.

Alternatively, you can gather the raw profiling data that the samples report; the following sections provide examples of how to interpret it.

## Internal Inference Performance Counters and Execution Graphs <a name="performance-counters"></a>

Further, finer-grained insights into the inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs.
Both [C++](../../../samples/cpp/benchmark_app/README.md) and [Python](../../../tools/benchmark_tool/README.md) versions of the `benchmark_app` support a `-pc` command-line parameter that outputs an internal execution breakdown.

Below is an example of CPU plugin output for a network (since the device is CPU, the layers' wall clock `realTime` and the `cpu` time are the same):
@@ -76,58 +63,12 @@ fc6_nChw8c_nchw EXECUTED layerType: Reorder realTime: 20
out_fc6 EXECUTED layerType: Output realTime: 3 cpu: 3 execType: unknown
relu5_9_x2 OPTIMIZED_OUT layerType: ReLU realTime: 0 cpu: 0 execType: undef
```
This contains the layers' name (as seen in the IR), the layers' type and execution statistics. Notice the `OPTIMIZED_OUT`, which indicates that the particular activation was fused into the adjacent convolution. Also, the `unknown` stands for the Inference Engine specific CPU (helper) primitives that are not part of the Intel MKL-DNN.
Both benchmark_app versions also support the `exec_graph_path` command-line option that instructs OpenVINO to output the same per-layer execution statistics, but in the form of a plugin-specific [Netron-viewable](https://netron.app/) graph, written to the specified file.

Notice that on some devices, collecting the execution graphs/counters may add noticeable overhead.
Also, especially when performance-debugging the [latency case](../../optimization_guide/dldt_deployment_optimization_latency.md), notice that the counters do not reflect the time spent in the plugin/device/driver/etc. queues. If the sum of the counters is too different from the latency of an inference request, consider testing with fewer inference requests. For example, running a single [OpenVINO stream](../../optimization_guide/dldt_deployment_optimization_tput.md) with multiple requests would produce nearly identical counters to running a single inference request, yet the actual latency can be quite different.

Notice that there are some helper layers in the CPU execution breakdown which were not present in the original topology. These are automatically added by the plugin. For example, the `Reorder` re-packs the Intel MKL-DNN internal (blocked) layout to the regular plain NCHW (that the user expects as the output). As explained in the <a href="#device-specific-tips">Few Device-Specific Tips</a>, if your custom kernels introduce a lot of outstanding/expensive Reorders, consider blocked implementation for the kernels.
Finally, the performance statistics with both performance counters and execution graphs are averaged, so such data for the [dynamically-shaped inputs](../../OV_Runtime_UG/ov_dynamic_shapes.md) should be measured carefully (ideally by isolating the specific shape and executing multiple times in a loop, to gather reliable data).

Notice that in the heterogeneous cases, there will be additional information on which subgraph the statistics are about (the first subgraph is GPU, so its `cpu`/host time is really small compared to the actual `realTime`):
```
subgraph1: squeeze1x1 EXECUTED layerType: Convolution realTime: 227 cpu:3 execType: GPU
…
subgraph2: detection_out EXECUTED layerType: DetectionOutput realTime: 121 cpu:121 execType: unknown
…
```

As mentioned earlier, `unknown` here means a CPU kernel with an unknown (for example, not AVX2 or AVX512) acceleration path.
Since FPGA execution does not separate individual kernels, only bulk execution/data transfer statistics are available:

```
subgraph1: 1. input preprocessing (mean data/FPGA):EXECUTED layerType: preprocessing realTime: 129 cpu: 129
subgraph1: 2. input transfer to DDR:EXECUTED layerType: realTime: 201 cpu: 0
subgraph1: 3. FPGA execute time:EXECUTED layerType: realTime: 3808 cpu: 0
subgraph1: 4. output transfer from DDR:EXECUTED layerType: realTime: 55 cpu: 0
subgraph1: 5. FPGA output postprocessing:EXECUTED layerType: realTime: 7 cpu: 7
subgraph1: 6. softmax/copy: EXECUTED layerType: realTime: 2 cpu: 2
subgraph2: out_prob: NOT_RUN layerType: Output realTime: 0 cpu: 0
subgraph2: prob: EXECUTED layerType: SoftMax realTime: 10 cpu: 10
Total time: 4212 microseconds
```

The `softmax/copy` is a glue layer that connects the FPGA subgraph to the CPU subgraph (and copies the data).
### Intel® VTune™ Examples <a name="vtune-examples"></a>

All major performance calls of the Inference Engine are instrumented with Instrumentation and Tracing Technology APIs. This allows viewing the Inference Engine calls on the Intel® VTune™ timelines and aggregations, plus correlating them to the underlying APIs, like OpenCL. In turn, this enables a careful per-layer execution breakdown.

When choosing the Analysis type in Intel® VTune™ Amplifier, make sure to select the **Analyze user tasks, events, and counters** option:



See the [corresponding section in the Intel® VTune™ Amplifier User's Guide](https://software.intel.com/en-us/vtune-amplifier-help-task-analysis) for details.

Example of Inference Engine calls:

- On the Intel VTune Amplifier timeline.
  Notice that `Task_runNoThrow` is an Async API wrapper; it is executed in a different thread and triggers the Intel MKL-DNN execution:

  

- In the Intel VTune Amplifier **Top-down view**, grouped by the **Task Domain**.
  Notice the `Task_runNoThrow` and `MKLDNN _INFER` that are bracketing the actual Intel MKL-DNN kernels execution:

  

Similarly, you can use any GPU analysis in the Intel VTune Amplifier and get general correlation with the Inference Engine API as well as the execution breakdown for OpenCL kernels.

Just like with a regular native application, further drill-down in the counters is possible; however, this is mostly useful for <a href="#optimizing-custom-kernels">optimizing custom kernels</a>. Finally, with the Intel VTune Amplifier, the profiling is not limited to your user-level code (see the [corresponding section in the Intel® VTune™ Amplifier User's Guide](https://software.intel.com/en-us/vtune-amplifier-help-analyze-performance)).
OpenVINO in general and individual plugins are heavily instrumented with Intel® Instrumentation and Tracing Technology (ITT), so another option is to compile OpenVINO from the source code with ITT enabled and use tools like [Intel® VTune™ Profiler](https://software.intel.com/en-us/vtune) to get a detailed inference performance breakdown and additional insights into the application-level performance on the timeline view.
@@ -480,146 +480,186 @@ Some TensorFlow operations do not match to any OpenVINO operation, but are still
## ONNX Supported Operators

### Standard ONNX Operators
| Symbol Name in ONNX| Limitations|
| :----------| :----------|
| Abs | |
| Acos | |
| Acosh | |
| Add | |
| Affine | |
| And | |
| ArgMax | |
| ArgMin | |
| Asin | |
| Asinh | |
| Atan | |
| Atanh | |
| ATen | Supported only for the 'embedding_bag' operator |
| AveragePool | |
| BatchMatMul | |
| BatchNormalization | |
| Cast | |
| Ceil | |
| Clip | |
| Concat | |
| Constant | |
| ConstantFill | |
| ConstantOfShape | |
| Conv | |
| ConvTranspose | |
| Cos | |
| Cosh | |
| Crop | |
| CumSum | |
| DepthToSpace | |
| DequantizeLinear | |
| DetectionOutput (Intel experimental) | |
| Div | |
| Dropout | Not needed for inference |
| Elu | |
| Equal | |
| Erf | |
| Exp | |
| Expand | |
| ExperimentalDetectronDetectionOutput (Intel experimental) | |
| ExperimentalDetectronGenerateProposalsSingleImage (Intel experimental) | |
| ExperimentalDetectronGroupNorm (Intel experimental) | |
| ExperimentalDetectronPriorGridGenerator (Intel experimental) | |
| ExperimentalDetectronROIFeatureExtractor (Intel experimental) | |
| ExperimentalDetectronTopKROIs (Intel experimental) | |
| FakeQuantize (Intel experimental) | |
| Fill | |
| Flatten | |
| Floor | |
| GRU | |
| Gather | |
| GatherElements | Doesn't work with negative indices |
| GatherND | Doesn't work with negative indices |
| GatherTree | |
| Gemm | |
| GlobalAveragePool | |
| GlobalMaxPool | |
| Greater | |
| GreaterEqual | |
| HardSigmoid | |
| Identity | Not needed for inference |
| ImageScaler | |
| InstanceNormalization | |
| LRN | |
| LSTM | Peepholes are not supported |
| LeakyRelu | |
| Less | |
| LessEqual | |
| Log | |
| LogicalAnd | |
| LogicalOr | |
| LogSoftmax | |
| Loop | |
| LpNormalization | |
| MatMul | |
| Max | |
| MaxPool | |
| MeanVarianceNormalization | Reduction over the batch dimension is not supported, reduction over all dimensions except batch and channel ones is obligatory |
| Min | |
| Mul | |
| Neg | |
| NonMaxSuppression | |
| NonZero | |
| Not | |
| NotEqual | |
| OneHot | |
| Pad | |
| Pow | |
| PriorBox (Intel experimental) | |
| PriorBoxClustered | |
| QuantizeLinear | |
| RNN | |
| ROIAlign | |
| Range | |
| RandomUniform | Operation provides sequence from uniform distribution, but exact values won't match. |
| Reciprocal | |
| ReduceL1 | |
| ReduceL2 | |
| ReduceMax | |
| ReduceMean | |
| ReduceMin | |
| ReduceProd | |
| ReduceSum | |
| Relu | |
| Reshape | |
| Resize | Coordinate transformation mode `tf_crop_and_resize` is not supported, `nearest` mode is not supported for 5D+ inputs. |
| ReverseSequence | |
| Round | |
| Scatter | Supported if fuse-able to ScatterUpdate. MYRIAD only |
| ScatterND | |
| ScatterElements | Supported if fuse-able to ScatterUpdate. MYRIAD only |
| Select | |
| Shape | |
| Sigmoid | |
| Sign | |
| Sin | |
| Size | |
| Slice | |
| Softmax | |
| Softplus | |
| Softsign | |
| SpaceToDepth | |
| Split | |
| Sqrt | |
| Squeeze | The case when squeeze axis is not specified is not supported |
| Sub | |
| Sum | |
| Tan | |
| Tanh | |
| ThresholdedRelu | |
| TopK | |
| Transpose | |
| Unsqueeze | |
| Upsample | |
| Where | |
| Xor | |
| ONNX Operator Name |
| :----------|
| Abs |
| Acos |
| Acosh |
| And |
| ArgMin |
| ArgMax |
| Asin |
| Asinh |
| Atan |
| ATen |
| Atanh |
| AveragePool |
| BatchNormalization |
| BitShift |
| Cast |
| CastLike |
| Ceil |
| Clip |
| Concat |
| Constant |
| ConstantOfShape |
| Conv |
| ConvInteger |
| ConvTranspose |
| Compress |
| Cos |
| Cosh |
| ConstantFill |
| CumSum |
| DepthToSpace |
| DequantizeLinear |
| Div |
| Dropout |
| Einsum |
| Elu |
| Equal |
| Erf |
| Exp |
| Expand |
| EyeLike |
| Flatten |
| Floor |
| Gather |
| GatherElements |
| GatherND |
| Gemm |
| GlobalAveragePool |
| GlobalLpPool |
| GlobalMaxPool |
| Greater |
| GRU |
| Hardmax |
| HardSigmoid |
| HardSwish |
| Identity |
| If |
| ImageScaler |
| InstanceNormalization |
| LeakyRelu |
| Less |
| Log |
| LogSoftmax |
| Loop |
| LpNormalization |
| LRN |
| LSTM |
| MatMulInteger |
| MatMul |
| MaxPool |
| Max |
| Mean |
| MeanVarianceNormalization |
| Min |
| Mod |
| Mul |
| Neg |
| NonMaxSuppression |
| NonZero |
| Not |
| Or |
| OneHot |
| Pad |
| Pow |
| PRelu |
| QLinearConv |
| QLinearMatMul |
| QuantizeLinear |
| Range |
| RandomNormal |
| RandomNormalLike |
| RandomUniform |
| RandomUniformLike |
| Reciprocal |
| ReduceLogSum |
| ReduceLogSumExp |
| ReduceL1 |
| ReduceL2 |
| ReduceMax |
| ReduceMean |
| ReduceMin |
| ReduceProd |
| ReduceSum |
| ReduceSumSquare |
| Relu |
| Reshape |
| Resize |
| ReverseSequence |
| RNN |
| RoiAlign |
| Round |
| ScatterElements |
| ScatterND |
| Selu |
| Shape |
| Shrink |
| Sigmoid |
| Sign |
| Sin |
| Sinh |
| Size |
| Slice |
| Softmax |
| Softplus |
| Softsign |
| SpaceToDepth |
| Split |
| Sqrt |
| Squeeze |
| Sub |
| Sum |
| Tan |
| Tanh |
| ThresholdedRelu |
| Tile |
| TopK |
| Transpose |
| Unsqueeze |
| Where |
| Xor |
### Deprecated ONNX Operators (Supported)

| ONNX Operator Name |
| :----------|
| Affine |
| Crop |
| Scatter |
| Upsample |
### Operators From the org.openvinotoolkit Domain

| Custom ONNX Operator Name |
| :----------|
| DeformableConv2D |
| DetectionOutput |
| ExperimentalDetectronDetectionOutput |
| ExperimentalDetectronGenerateProposalsSingleImage |
| ExperimentalDetectronGroupNorm |
| ExperimentalDetectronPriorGridGenerator |
| ExperimentalDetectronROIFeatureExtractor |
| ExperimentalDetectronTopKROIs |
| FakeQuantize |
| GroupNorm |
| Normalize |
| PriorBox |
| PriorBoxClustered |
| Swish |
### Operators From the com.microsoft Domain

| Custom ONNX Operator Name |
| :----------|
| Attention |
| BiasGelu |
| EmbedLayerNormalization |
| SkipLayerNormalization |
## PaddlePaddle Supported Operators
@@ -42,7 +42,8 @@ Batching is a straightforward way of leveraging the GPU compute power and saving
|
||||
@endsphinxdirective

Alternatively, to enable Auto-Batching in legacy apps that are not akin to the notion of performance hints, you may need to use the **explicit** device notion, such as 'BATCH:GPU'. In both cases (the *throughput* hint or the explicit BATCH device), the optimal batch size selection happens automatically (the implementation queries the `ov::optimal_batch_size` property from the device, passing the model's graph as the parameter). The actual value depends on the model and device specifics, for example, the on-device memory for dGPUs.

Auto-Batching support is not limited to GPUs. However, if a device does not support `ov::optimal_batch_size` yet, it can work with Auto-Batching only when an explicit batch size is specified, for example, "BATCH:&lt;device&gt;(16)".

This _automatic batch size selection_ assumes that the application queries the `ov::optimal_number_of_infer_requests` to create and run the returned number of requests simultaneously:
@sphinxdirective
@endsphinxdirective
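The flow above can be sketched in C++ as follows. This is a minimal sketch, not the official snippet: the model path, the GPU device choice, and the input-population step are placeholder assumptions.

```cpp
#include <cstdint>
#include <vector>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // The throughput hint implicitly enables Auto-Batching on the GPU.
    auto compiled = core.compile_model("model.xml", "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    // Query how many requests the (batched) device wants to run simultaneously.
    uint32_t nireq = compiled.get_property(ov::optimal_number_of_infer_requests);

    std::vector<ov::InferRequest> requests;
    for (uint32_t i = 0; i < nireq; ++i)
        requests.push_back(compiled.create_infer_request());
    // ... populate inputs and start the whole group of requests asynchronously ...
    return 0;
}
```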
For the *explicit* usage, you can limit the batch size using "BATCH:GPU(4)", where 4 is the number of requests running in parallel.
### Other Performance Considerations

To achieve the best performance with Automatic Batching, the application should:
- Operate a number of inference requests that is a multiple of the batch size. In the above example, for batch size 4, the application should operate 4, 8, 12, 16, etc. requests.
- Use the requests, grouped by the batch size, together. For example, the first 4 requests are inferred, while the second group of requests is being populated. Essentially, Automatic Batching shifts the asynchronicity from the individual requests to the groups of requests that constitute the batches.
- Balance the 'timeout' value against the batch size. In many cases, a smaller timeout value and batch size yield better performance than a large batch size paired with a timeout value that is not large enough to accommodate the full number of required requests.
- Carefully apply auto-batching to pipelines. For example, for the conventional video-sources->detection->classification flow, it is most beneficial to apply auto-batching to the inputs of the detection stage. The resulting number of detections usually fluctuates, which makes auto-batching less applicable for the classification stage.

The following are the limitations of the current implementation:
- Although less critical for throughput-oriented scenarios, the load time with auto-batching increases by almost 2x.
- Certain networks are not safely reshape-able by the "batching" dimension (specified as 'N' in the layout terms). Also, if the batching dimension is not the zero-th one, auto-batching is not triggered _implicitly_ by the throughput hint.
  - The _explicit_ notion, for example, "BATCH:GPU", uses relaxed dimensions tracking, often making auto-batching possible. For example, this trick unlocks most **detection networks**.
  - When *forcing* auto-batching via the explicit device notion, make sure to validate the results for correctness.
- Performance improvements happen at the cost of a memory footprint growth, yet auto-batching queries the available memory (especially for dGPUs) and limits the selected batch size accordingly.
### Configuring the Automatic Batching

Following the OpenVINO convention for device names, the *batching* device is named *BATCH*. The configuration options are as follows:

| Parameter name | Parameter description | Default | Examples |
| :--- | :--- | :--- | :--- |
| "AUTO_BATCH_DEVICE" | Device name to apply the automatic batching to, with an optional batch size in brackets | N/A | "BATCH:GPU" triggers the automatic batch size selection, while "BATCH:GPU(4)" directly specifies the batch size |
| "AUTO_BATCH_TIMEOUT" | Timeout value, in ms | 1000 | You can reduce the timeout value (to avoid a performance penalty when the data arrives too unevenly), e.g. pass "100", or, in contrast, make it large enough, e.g. to accommodate input preparation (when it is a serial process) |

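The options from the table above can be set programmatically. The following is a sketch under stated assumptions (a GPU device and a local `model.xml` are placeholders; the string property keys are those listed in the table):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Shrink the batching timeout to 100 ms (see the table above), so that
    // unevenly arriving requests are not delayed for the full default 1000 ms.
    core.set_property("BATCH", {{"AUTO_BATCH_TIMEOUT", "100"}});

    // Explicit batching device notion with a directly specified batch size of 4.
    auto compiled = core.compile_model("model.xml", "BATCH:GPU(4)");
    return 0;
}
```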
### Testing Automatic Batching Performance with the Benchmark_App

The `benchmark_app`, which exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md) versions, is the best way to evaluate the performance of the Automatic Batching:
- The most straightforward way is using the performance hints:
  - `benchmark_app -hint tput -d GPU -m <path to your favorite model>`
- Overriding the strict rules of implicit reshaping by the batch dimension, via the explicit device notion:
  - `benchmark_app -hint none -d BATCH:GPU -m <path to your favorite model>`
- Finally, overriding the automatically deduced batch size as well:
  - `benchmark_app -hint none -d BATCH:GPU(16) -m <path to your favorite model>`

The last example is also applicable to the CPU or any other device that generally supports the batched execution.

### See Also

[Supported Devices](supported_plugins/Supported_Devices.md)

### Step 3. Create an Inference Request

The `ov::InferRequest` class provides methods for model inference in OpenVINO™ Runtime. Create an infer request using the following code (see the [InferRequest detailed documentation](./ov_infer_request.md) for more details):

@sphinxdirective
### Step 5. Start Inference

OpenVINO™ Runtime supports inference in either synchronous or asynchronous mode. Using the Async API can improve the application's overall frame-rate: rather than waiting for inference to complete, the app can keep working on the host while the accelerator is busy. You can use `ov::InferRequest::start_async` to start model inference in the asynchronous mode and call `ov::InferRequest::wait` to wait for the inference results:

@sphinxdirective
@endsphinxdirective

The asynchronous mode supports two methods to get the inference results:
* `ov::InferRequest::wait_for()` - Waits until the specified timeout (in milliseconds) has elapsed or the inference result becomes available, whichever comes first.
* `ov::InferRequest::wait()` - Waits until the inference result becomes available.

Both methods are thread-safe, which means they can be called from different threads without exposing erroneous behavior or producing unpredictable results.

While a request is ongoing, all its methods except `ov::InferRequest::cancel`, `ov::InferRequest::wait`, or `ov::InferRequest::wait_for` throw the `ov::Busy` exception, indicating that the request is busy with computations.
This section demonstrates a simple pipeline; to get more information about other ways to perform inference, read the dedicated ["Run inference" section](./ov_infer_request.md).
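The asynchronous flow described above can be sketched as follows. This is a minimal sketch, not the official snippet: the model path, the CPU device choice, and the input handling are placeholder assumptions.

```cpp
#include <chrono>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto compiled = core.compile_model("model.xml", "CPU");
    ov::InferRequest request = compiled.create_infer_request();
    // ... fill the input tensors here ...

    // Either block until the result is ready...
    request.start_async();
    request.wait();

    // ...or poll with a timeout, doing useful host-side work in between.
    request.start_async();
    while (!request.wait_for(std::chrono::milliseconds(10))) {
        // do other host-side work while the device is busy
    }
    return 0;
}
```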

### Step 6. Process the Inference Results

## See also

- [OpenVINO™ Runtime Preprocessing](./preprocessing_overview.md)
- [Using Encrypted Models with OpenVINO™](./protecting_model_guide.md)

[ie_api_flow_cpp]: img/BASIC_IE_API_workflow_Cpp.svg
[ie_api_use_cpp]: img/IMPLEMENT_PIPELINE_with_API_C.svg

Inference Engine API:

@sphinxdirective

.. tab:: Sync

   .. doxygensnippet:: docs/snippets/ie_common.cpp
      :language: cpp
      :fragment: [ie:inference]

.. tab:: Async

   .. doxygensnippet:: docs/snippets/ie_common.cpp
      :language: cpp

@endsphinxdirective

OpenVINO™ Runtime API 2.0:

@sphinxdirective

.. tab:: Sync

   .. doxygensnippet:: docs/snippets/ov_common.cpp
      :language: cpp
      :fragment: [ov_api_2_0:inference]

.. tab:: Async

   .. doxygensnippet:: docs/snippets/ov_common.cpp
      :language: cpp

@endsphinxdirective

# Installation & Deployment {#openvino_2_0_deployment}

"Easy to use" is one of the main concepts for OpenVINO™ API 2.0. It includes not only simplifying the migration from frameworks to OpenVINO, but also how OpenVINO is organized, how the development tools are used, and how OpenVINO-based applications are developed and deployed.

To accomplish that, we have made some changes to the installation and deployment of OpenVINO in the 2022.1 release. This guide will walk you through them.

## Installer Package Contains OpenVINO™ Runtime Only

Starting from OpenVINO 2022.1, Model Optimizer, the Post-Training Optimization tool, and Python-based Development tools such as the Open Model Zoo tools are distributed via [PyPI](https://pypi.org/project/openvino-dev/) only, and are no longer included in the OpenVINO installer package. This change has several benefits as it:

* Simplifies the user experience. In previous versions, the installation and usage of OpenVINO Development Tools differed according to the distribution type (an OpenVINO installer or PyPI).
* Ensures that dependencies are handled properly via the PIP package manager, and supports virtual environments for the development tools.

The structure of the OpenVINO 2022.1 installer package is organized as follows:

- The `runtime` folder includes headers, libraries and CMake interfaces.
- The `tools` folder contains [the compile tool](../../../tools/compile_tool/README.md), [deployment manager](../../install_guides/deployment-manager-tool.md), and a set of `requirements.txt` files with links to the corresponding versions of the `openvino-dev` package.
- The `python` folder contains the Python version of OpenVINO Runtime.

## Installing OpenVINO Development Tools via PyPI

Since OpenVINO Development Tools are no longer in the installer package, the installation process has changed too. This section describes it through a comparison with previous versions.

### For Versions Prior to 2022.1

In previous versions, OpenVINO Development Tools was a part of the main package. After the package was installed, to convert models (for example, TensorFlow), you needed to install additional dependencies by using the requirements files such as `requirements_tf.txt`, install the Post-Training Optimization tool and the Accuracy Checker tool via the `setup.py` scripts, and then use the `setupvars` scripts to make the tools available to commands such as the following:

```sh
$ mo.py -h
```

### For 2022.1 and After

Starting from OpenVINO 2022.1, you can install the development tools only from the [PyPI](https://pypi.org/project/openvino-dev/) repository, using the following command (taking TensorFlow as an example):

```sh
$ python3 -m pip install -r <INSTALL_DIR>/tools/requirements_tf.txt
```

This will install all the development tools and the additional components necessary to work with TensorFlow via the `openvino-dev` package (see **Step 4. Install the Package** on the [PyPI page](https://pypi.org/project/openvino-dev/) for the parameters of other frameworks).

Then, the tools can be used with commands like:

```sh
$ mo -h
$ pot -h
```

You don't have to install any other dependencies. For more details on the installation steps, see [Install OpenVINO Development Tools](../../install_guides/installing-model-dev-tools.md).

## Interface Changes for Building C/C++ Applications

The new OpenVINO Runtime with API 2.0 has also brought some changes for building your C/C++ applications.

### CMake Interface

The CMake interface has been changed as below:

**With Inference Engine of previous versions**:

```cmake
find_package(InferenceEngine REQUIRED)
find_package(ngraph REQUIRED)
add_executable(ie_ngraph_app main.cpp)
target_link_libraries(ie_ngraph_app PRIVATE ${InferenceEngine_LIBRARIES} ${NGRAPH_LIBRARIES})
```

**With OpenVINO Runtime 2022.1 (API 2.0)**:

```cmake
find_package(OpenVINO REQUIRED)
add_executable(ov_app main.cpp)
target_link_libraries(ov_app PRIVATE openvino::runtime)

add_executable(ov_c_app main.c)
target_link_libraries(ov_c_app PRIVATE openvino::runtime::c)
```

### Native Interfaces

To build applications without the CMake interface, you can also use the MSVC IDE, UNIX makefiles, and any other interface, which have changed as below:

**With Inference Engine of previous versions**:

@sphinxdirective

.. tab:: Include dirs

   .. code-block:: sh

      <INSTALL_DIR>/deployment_tools/inference_engine/include
      <INSTALL_DIR>/deployment_tools/ngraph/include

.. tab:: Path to libs

   .. code-block:: sh

      <INSTALL_DIR>/deployment_tools/inference_engine/lib/intel64/Release
      <INSTALL_DIR>/deployment_tools/ngraph/lib/

.. tab:: Shared libs

   .. code-block:: sh

      // UNIX systems
      inference_engine.so ngraph.so

      // Windows
      inference_engine.dll ngraph.dll

.. tab:: (Windows) .lib files

   .. code-block:: sh

      ngraph.lib
      inference_engine.lib

@endsphinxdirective

**With OpenVINO Runtime 2022.1 (API 2.0)**:

@sphinxdirective

.. tab:: Include dirs

   .. code-block:: sh

      <INSTALL_DIR>/runtime/include

.. tab:: Path to libs

   .. code-block:: sh

      <INSTALL_DIR>/runtime/lib/intel64/Release

.. tab:: Shared libs

   .. code-block:: sh

      // UNIX systems
      openvino.so

      // Windows
      openvino.dll

.. tab:: (Windows) .lib files

   .. code-block:: sh

      openvino.lib

@endsphinxdirective

## Clearer Library Structure for Deployment

OpenVINO 2022.1 has reorganized the libraries to simplify deployment. In previous versions, you had to use several libraries to perform deployment steps. Now you can just use `openvino` or `openvino_c`, based on your programming language, plus the necessary plugins to complete your task. For example, the `openvino_intel_cpu_plugin` and `openvino_ir_frontend` plugins enable you to load OpenVINO IRs and perform inference on the CPU device.

Here you can find some detailed comparisons of the library structure between OpenVINO 2022.1 and previous versions:

* A single core library with all the functionalities (`openvino` for the C++ Runtime, `openvino_c` for the Inference Engine API C interface) is used in 2022.1, instead of the previous core libraries which contained `inference_engine`, `ngraph`, `inference_engine_transformations`, and `inference_engine_lp_transformations`.
* The optional `inference_engine_preproc` preprocessing library (used if `InferenceEngine::PreProcessInfo::setColorFormat` or `InferenceEngine::PreProcessInfo::setResizeAlgorithm` is used) is renamed to `openvino_gapi_preproc` and deprecated in 2022.1. See more details on the [preprocessing capabilities of OpenVINO API 2.0](preprocessing.md).
* The libraries of plugins are renamed as below:
  * `openvino_intel_cpu_plugin` is used for the [CPU](../supported_plugins/CPU.md) device instead of `MKLDNNPlugin` in previous versions.
  * `openvino_intel_gpu_plugin` is used for the [GPU](../supported_plugins/GPU.md) device instead of `clDNNPlugin` in previous versions.
  * `openvino_auto_plugin` is used for the [Auto-Device Plugin](../auto_device_selection.md) in 2022.1.
* The plugins for reading and converting models have been changed as below:
  * `openvino_ir_frontend` is used to read IRs instead of `inference_engine_ir_reader` in previous versions.
  * `openvino_onnx_frontend` is used to read ONNX models instead of `inference_engine_onnx_reader` (with its dependencies) in previous versions.
  * `openvino_paddle_frontend` is added in 2022.1 to read PaddlePaddle models.


.. toctree::
   :maxdepth: 1
   :hidden:

   openvino_2_0_deployment
   openvino_2_0_inference_pipeline
   openvino_2_0_configure_devices
   openvino_2_0_preprocessing

### Introduction

Older versions of OpenVINO™ (prior to 2022.1) required changing the logic of applications when a user migrates from frameworks like TensorFlow, ONNX Runtime, PyTorch, PaddlePaddle, etc. The change of the application's logic is connected with the following:

- Model Optimizer changed input precisions for some inputs. For example, neural language processing models with `I64` inputs were changed to have the `I32` input element type.
- Model Optimizer changed layouts for TensorFlow models (see [Layouts in OpenVINO](../layout_overview.md)). This leads to unexpected behavior where a user needs to use a different layout for the input data compared to the framework:
- Inference Engine API (`InferenceEngine::CNNNetwork`) also applied some conversion rules for input and output precisions because of device plugin limitations.
- Users needed to specify input shapes during model conversion in Model Optimizer and work with static shapes in the application.

OpenVINO™ introduces API 2.0 to align the logic of working with models with how it is done in the frameworks: no layout and precision changes, and operating with tensor names and indices to address inputs and outputs. OpenVINO Runtime is composed of the Inference Engine API used for inference and the nGraph API targeted at working with models and operations. API 2.0 has a common structure, naming convention styles, and namespaces, and removes duplicated structures. See [How to migrate to OpenVINO API 2.0](common_inference_pipeline.md) for details.

> **NOTE**: Most importantly, your existing application can continue working with OpenVINO Runtime 2022.1 as it used to, but we recommend migrating to API 2.0 to unlock additional features like [Preprocessing](../preprocessing_overview.md) and [Dynamic shapes support](../ov_dynamic_shapes.md).

### Introducing IR v11

To support these features, OpenVINO introduced IR v11, which has been generated by Model Optimizer by default since 2022.1. A model represented in IR v11 fully matches the original model in the original framework format in terms of inputs and outputs. Also, a user does not have to specify input shapes during the conversion, so the resulting IR v11 contains `-1` to denote undefined dimensions (see [Working with dynamic shapes](../ov_dynamic_shapes.md) to fully utilize this feature, or [Changing input shapes](../ShapeInference.md) to reshape to static shapes in the application).

What is also important to mention: IR v11 is fully compatible with applications written with the Inference Engine API of older versions of OpenVINO. This is achieved by adding additional runtime information to IR v11, which is responsible for the backward-compatible behavior. So, once IR v11 is read by an old Inference Engine based application, it is internally converted to IR v10 to provide the backward-compatible behavior.

IR v11 is supported by all OpenVINO Development tools including the Post-Training Optimization tool, Benchmark app, etc.

### IR v10 Compatibility

OpenVINO API 2.0 also supports models in IR v10 for backward compatibility. So, if a user has an IR v10 model, it can be fed to OpenVINO Runtime as well (see [migration steps](common_inference_pipeline.md)).

Some OpenVINO Development Tools also support both IR v10 and IR v11 as an input:
- The Accuracy Checker also supports IR v10, but requires an additional option to denote which API is used underneath.
- The [Compile tool](../../../tools/compile_tool/README.md) compiles the model to be used in API 2.0 by default. If a user wants to use the resulting compiled blob in the Inference Engine API, the additional `ov_api_1_0` option should be passed.

The following OpenVINO tools don't support IR v10 as an input, and require generating an IR v11 from the original model with the latest version of Model Optimizer:
- Post-Training Optimization tool

> **NOTE**: If you need to quantize your IR v10 models to run with OpenVINO 2022.1, it's recommended to download and use Post-Training Optimization tool from OpenVINO 2021.4 release.

### Differences between Inference Engine and OpenVINO Runtime 2.0

The Inference Engine and nGraph APIs are not deprecated; they are fully functional and can be used in applications. However, it is highly recommended to migrate to API 2.0, because it already has additional features, and this list will be extended. The following additional features are supported by API 2.0:
- [Working with dynamic shapes](../ov_dynamic_shapes.md). The feature is quite useful for getting the best performance for NLP (Natural Language Processing) models, super-resolution models, and others which accept dynamic input shapes.
- [Preprocessing of the model](../preprocessing_overview.md) to add preprocessing operations to the inference models and fully occupy the accelerator, freeing CPU resources.

To define the differences on the API level between Inference Engine and API 2.0, let's define two types of behavior:
- **Old behavior** of OpenVINO assumes that:
  - Model Optimizer can change input element types and the order of dimensions (layouts) compared to the model from the original framework.
  - Inference Engine can override input and output element types.

The table below demonstrates which behavior, **old** or **new**, is used depending on the model source and the API used:

|Inference Engine / nGraph APIs | Old | Old | Old | Old |
|API 2.0 | Old | New | New | New |

Please look at the next transition guides to understand how to migrate Inference Engine-based applications to API 2.0:
- [Installation & Deployment](deployment_migration.md)
- [OpenVINO™ Common Inference pipeline](common_inference_pipeline.md)
- [Preprocess your model](./preprocessing.md)
- [Configure device](./configure_devices.md)

@@ -1,4 +1,4 @@
|
||||
# Running on multiple device simultaneously {#openvino_docs_OV_UG_Running_on_multiple_devices}
|
||||
# Running on multiple devices simultaneously {#openvino_docs_OV_UG_Running_on_multiple_devices}
## Introducing the Multi-Device Plugin (C++)
### See Also

[Supported Devices](supported_plugins/Supported_Devices.md)

## Performance Considerations for the Multi-Device Execution

This section covers a few recommendations for the multi-device execution (applicable to both Python and C++):
- MULTI usually performs best when the fastest device is specified first in the list of devices. This is particularly important when the request-level parallelism is not sufficient (e.g. the number of requests in flight is not enough to saturate all devices).
- Just like with any throughput-oriented execution, it is highly recommended to query the optimal number of inference requests directly from the instance of the `ov::CompiledModel`. Refer to the code of the `benchmark_app`, which exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md), for more details.
- Notice that, for example, CPU+GPU execution performs better with certain knobs, which you can find in the code of the same [Benchmark App](../../samples/cpp/benchmark_app/README.md) sample. One specific example is disabling GPU driver polling, which in turn requires multiple GPU streams to amortize the slower communication of inference completion from the device to the host.
- The multi-device logic always attempts to save on the data copies (e.g. of the inputs) between the device-agnostic, user-facing inference requests and the device-specific 'worker' requests that are actually scheduled behind the scenes. To facilitate the copy savings, it is recommended to run the requests in the order in which they were created.

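These recommendations can be sketched in C++ as follows. This is a minimal sketch under stated assumptions: the device priority list "MULTI:GPU,CPU" and the model path are illustrative placeholders.

```cpp
#include <cstdint>
#include <vector>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // The fastest device (a GPU, assumed here) goes first in the priority list.
    auto compiled = core.compile_model("model.xml", "MULTI:GPU,CPU");

    // Query the optimal number of requests from the compiled model itself,
    // rather than guessing a number.
    uint32_t nireq = compiled.get_property(ov::optimal_number_of_infer_requests);

    // Create (and later run) the requests in creation order to help MULTI
    // avoid extra input-data copies.
    std::vector<ov::InferRequest> requests;
    for (uint32_t i = 0; i < nireq; ++i)
        requests.push_back(compiled.create_infer_request());
    return 0;
}
```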
## Introducing the Multi-Device Plugin (Python)
@sphinxdirective

.. toctree::
   :maxdepth: 1
   :hidden:

   openvino_docs_Integrate_OV_with_your_application
   openvino_docs_IE_DG_ShapeInference
   openvino_docs_OV_UG_Working_with_devices
   openvino_docs_OV_Runtime_UG_Preprocessing_Overview
   openvino_docs_IE_DG_supported_plugins_AUTO
   openvino_docs_OV_UG_Running_on_multiple_devices
   openvino_docs_OV_UG_Hetero_execution
   openvino_docs_OV_UG_Performance_Hints
   openvino_docs_OV_UG_Automatic_Batching
   openvino_docs_IE_DG_network_state_intro
   openvino_docs_OV_Runtime_UG_Python_API_exclusives
   openvino_2_0_transition_guide

@endsphinxdirective
|
||||
|
||||
## Introduction
|
||||
|
||||
@@ -1,16 +0,0 @@

# Should be moved to performance / extensibility {#openvino_docs_OV_Should_be_in_performance}

@sphinxdirective

.. toctree::
   :maxdepth: 1
   :hidden:

   openvino_docs_deployment_optimization_guide_dldt_optimization_guide
   openvino_docs_IE_DG_Model_caching_overview
   openvino_docs_IE_DG_Int8Inference
   openvino_docs_OV_UG_NoDynamicShapes

@endsphinxdirective

## TEMP: should be moved to performance / extensibility guides

@@ -1,10 +1,19 @@

# Dynamic Shapes {#openvino_docs_OV_UG_DynamicShapes}

@sphinxdirective

.. toctree::
   :maxdepth: 1
   :hidden:

   openvino_docs_OV_UG_NoDynamicShapes

@endsphinxdirective

As demonstrated in the [Changing Input Shapes](ShapeInference.md) article, some models support changing their input shapes before model compilation in `Core::compile_model`.
Reshaping models provides an ability to customize the model input shape to exactly the size required by the end application.
This article explains how the model reshaping ability can be further leveraged in more dynamic scenarios.

## When to Apply Dynamic Shapes

Conventional "static" model reshaping works well when it can be done once per many model inference calls with the same shape.

@@ -30,7 +30,7 @@ This class allows to set and get data for model inputs, outputs and run inferenc

### Synchronous mode

You can use `ov::InferRequest::infer`, which blocks the application execution, to infer a model in the synchronous mode:

@sphinxdirective

@@ -50,7 +50,7 @@ You can use `ov::InferRequest::infer()`, which blocks the application execution,

### Asynchronous mode

Asynchronous mode can improve the application's overall frame rate: rather than waiting for inference to complete, the app can keep working on the host while the accelerator is busy. You can use `ov::InferRequest::start_async` to infer a model in the asynchronous mode:

@sphinxdirective

@@ -68,8 +68,8 @@ Asynchronous mode can improve overall frame-rate of the application, because rat

@endsphinxdirective

Asynchronous mode supports two ways for the application to wait for inference results:
* `ov::InferRequest::wait_for` - specifies the maximum duration in milliseconds to block the method. The method blocks until the specified time has passed or the result becomes available, whichever comes first.
@sphinxdirective

.. tab:: C++

@@ -85,7 +85,7 @@ Asynchronous mode supports two ways to wait inference results:

      :fragment: [wait_for]

@endsphinxdirective
* `ov::InferRequest::wait` - waits until the inference result becomes available
@sphinxdirective

.. tab:: C++

@@ -102,10 +102,9 @@ Asynchronous mode supports two ways to wait inference results:

@endsphinxdirective

Both methods are thread-safe.
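The difference between the two waiting styles can be illustrated with Python's standard library alone. The sketch below is NOT the OpenVINO API: `concurrent.futures` merely plays the role of the asynchronous device queue, with `future.result(timeout=...)` standing in for `wait_for` and `future.result()` for `wait`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Hypothetical stand-in for an inference job; NOT the OpenVINO API --
# this only illustrates "bounded wait" vs. "wait until ready" semantics.
def run_inference():
    time.sleep(0.2)  # simulate time spent on the device
    return "result"

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(run_inference)  # analogous to start_async()

    # Analogue of wait_for(timeout): block at most 50 ms, then give up.
    try:
        future.result(timeout=0.05)
        print("done within the timeout")
    except TimeoutError:
        print("still running, timeout elapsed first")

    # Analogue of wait(): block until the result is available.
    print(future.result())
```

Either style is valid; `wait_for` is useful when the host thread has other work to interleave between polls.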

When you are running several inference requests in parallel, a device can process them simultaneously, with no guarantees on the completion order. This may complicate any logic based on `ov::InferRequest::wait` (unless your code needs to wait for _all_ the requests). For multi-request scenarios, consider using the `ov::InferRequest::set_callback` method to set a callback which is called upon completion of the request:

@sphinxdirective

@@ -123,7 +122,10 @@ For more details please take a look too [Classification Sample Async](../../samp

@endsphinxdirective

> **NOTE**: Use a weak reference to the infer_request (`ov::InferRequest*`, `ov::InferRequest&`, `std::weak_ptr<ov::InferRequest>`, etc.) in the callback. It is necessary to avoid cyclic references.
For more details, check [Classification Sample Async](../../samples/cpp/classification_sample_async/README.md).
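The completion-callback flow itself can be sketched with Python's standard library (again, this is an illustrative analogy, NOT the OpenVINO API): every request notifies a callback when done, so the main thread never blocks in a per-request `wait` call and the completion order does not matter.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names; lists/functions stand in for tensors/inference.
results = []
results_lock = threading.Lock()
all_done = threading.Event()
NUM_REQUESTS = 4

def on_completion(future):
    # Called from a worker thread when one "request" finishes.
    with results_lock:
        results.append(future.result())
        if len(results) == NUM_REQUESTS:
            all_done.set()

def infer(job_id):
    return job_id * job_id  # stand-in for actual inference work

with ThreadPoolExecutor(max_workers=NUM_REQUESTS) as pool:
    for i in range(NUM_REQUESTS):
        pool.submit(infer, i).add_done_callback(on_completion)
    all_done.wait()  # resume only after every callback has fired

print(sorted(results))  # completion order may vary; the values do not
```

The same inversion of control applies to `ov::InferRequest::set_callback`: the application reacts to completions instead of polling each request.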

You can use the `ov::InferRequest::cancel` method if you want to abort execution of the current inference request:

@sphinxdirective

@@ -145,7 +147,7 @@ You can use `ov::InferRequest::cancel()` method in case if you want to cancel th

`ov::InferRequest` allows you to get input/output tensors by tensor name, index, or port, and without any arguments if the model has only one input or output.

* `ov::InferRequest::get_input_tensor`, `ov::InferRequest::set_input_tensor`, `ov::InferRequest::get_output_tensor`, `ov::InferRequest::set_output_tensor` methods without arguments can be used to get or set the input/output tensor for a model with only one input/output:
@sphinxdirective

.. tab:: C++

@@ -162,7 +164,7 @@ You can use `ov::InferRequest::cancel()` method in case if you want to cancel th

@endsphinxdirective

* `ov::InferRequest::get_input_tensor`, `ov::InferRequest::set_input_tensor`, `ov::InferRequest::get_output_tensor`, `ov::InferRequest::set_output_tensor` methods with an argument can be used to get or set the input/output tensor by input/output index:
@sphinxdirective

.. tab:: C++

@@ -179,7 +181,7 @@ You can use `ov::InferRequest::cancel()` method in case if you want to cancel th

@endsphinxdirective

* `ov::InferRequest::get_tensor`, `ov::InferRequest::set_tensor` methods can be used to get or set an input/output tensor by tensor name:
@sphinxdirective

.. tab:: C++

@@ -196,7 +198,7 @@ You can use `ov::InferRequest::cancel()` method in case if you want to cancel th

@endsphinxdirective

* `ov::InferRequest::get_tensor`, `ov::InferRequest::set_tensor` methods can be used to get or set an input/output tensor by port:
@sphinxdirective

.. tab:: C++

@@ -218,7 +220,7 @@ You can use `ov::InferRequest::cancel()` method in case if you want to cancel th

### Cascade of models

`ov::InferRequest` can be used to organize a cascade of models. You need a separate infer request for each model.
In this case, you can get the output tensor from the first request using `ov::InferRequest::get_tensor` and set it as input for the second request using `ov::InferRequest::set_tensor`. But be careful: tensors shared across compiled models can be rewritten by the first model if the first infer request is run again while the second model has not started yet.
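The cascade pattern, and the shared-buffer hazard it warns about, can be sketched with plain Python stand-ins (lists as "tensors", functions as "models"); this is NOT the OpenVINO API, only an illustration of the data flow.

```python
# Minimal sketch of a two-model cascade; names are hypothetical.
def model_a(inp):          # first model: produces an intermediate tensor
    return [x + 1 for x in inp]

def model_b(inp):          # second model: consumes model_a's output
    return [x * 2 for x in inp]

frame = [1, 2, 3]
intermediate = model_a(frame)   # like request_a.get_tensor() after inference
out = model_b(intermediate)     # like request_b.set_tensor() + inference
print(out)  # [4, 6, 8]

# The hazard from the text: if the buffer is shared rather than copied,
# re-running the first model overwrites data the second has not consumed yet.
shared = model_a(frame)             # the second request will read this buffer...
shared[:] = model_a([10, 20, 30])   # ...but request_a runs again and rewrites it
print(model_b(shared))  # the second model now sees the NEW data: [22, 42, 62]
```

In OpenVINO terms, the in-place rewrite corresponds to re-running the first infer request on a tensor that the second request still holds.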

@sphinxdirective

@@ -238,7 +240,7 @@ In this case you can get output tensor from the first request using `ov::InferRe

### Using ROI tensors

It is possible for several models to re-use a shared input. You do not need to allocate a separate input tensor for a model if it processes a ROI object located inside an already allocated input of a previous model. For instance, the first model may detect objects in a video frame (stored as the input tensor) while the second model accepts the detected bounding boxes (a ROI inside the frame) as input. In this case, the second model can re-use the pre-allocated input tensor (used by the first model) and just crop the ROI, without allocating new memory, by constructing an `ov::Tensor` from an existing `ov::Tensor` and `ov::Coordinate`s as parameters.
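The key idea is that the ROI "tensor" is a *view* into memory the first model already owns, not a copy. A hedged stdlib sketch of that view semantics (using `memoryview` over an `array`; NOT the OpenVINO `ov::Tensor`/`ov::Coordinate` API, and all names are hypothetical):

```python
import array

width, height = 8, 6
frame = array.array("f", range(width * height))  # pre-allocated input "tensor"

def roi_rows(buf, x0, y0, x1, y1):
    """Return per-row zero-copy views into buf for the box [x0, x1) x [y0, y1)."""
    mv = memoryview(buf)
    return [mv[y * width + x0 : y * width + x1] for y in range(y0, y1)]

rows = roi_rows(frame, 2, 1, 5, 3)  # a 3x2 bounding box inside the frame
# No data was copied: writing through the view mutates the original frame.
rows[0][0] = -1.0
print(frame[1 * width + 2])  # -1.0
```

Constructing an `ov::Tensor` from another tensor plus begin/end `ov::Coordinate`s behaves analogously: the ROI tensor aliases the parent's memory.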

@sphinxdirective

docs/OV_Runtime_UG/performance_hints.md (new file)
@@ -0,0 +1,138 @@

# High-level Performance Hints {#openvino_docs_OV_UG_Performance_Hints}

Each of OpenVINO's [supported devices](supported_plugins/Supported_Devices.md) offers low-level performance settings. Tweaking this detailed configuration requires a deep understanding of the architecture.
Also, while the resulting performance may be optimal for the specific combination of the device and the model being inferred, the configuration is not necessarily optimal for another device or model.
The OpenVINO performance hints are a new way to configure the performance with _portability_ in mind.

The hints also "reverse" the direction of configuration: rather than mapping the application needs to the low-level performance settings, and keeping the associated application logic to configure each possible device separately, the idea is to express a target scenario with a single config key and let the *device* configure itself in response.
As the hints are supported by every OpenVINO device, this is a completely portable and future-proof solution.

Previously, a certain level of automatic configuration came from the _default_ values of the parameters. For example, the number of CPU streams was deduced from the number of CPU cores when `ov::streams::AUTO` (`CPU_THROUGHPUT_AUTO` in the pre-OpenVINO 2.0 parlance) was set. However, the resulting number of streams did not account for the actual compute requirements of the model to be inferred.
The hints, in contrast, respect the actual model, so the parameters for optimal throughput are calculated for each model individually (based on its compute versus memory-bandwidth requirements and the capabilities of the device).

## Performance Hints: Latency and Throughput

As discussed in the [Optimization Guide](../optimization_guide/dldt_optimization_guide.md), there are a few different metrics associated with inference speed.
Throughput and latency are some of the most critical factors that influence the overall performance of an application.

This is why, to ease the configuration of the device, OpenVINO offers two dedicated hints, namely `ov::hint::PerformanceMode::THROUGHPUT` and `ov::hint::PerformanceMode::LATENCY`.
Every OpenVINO device supports these, which makes things portable and future-proof.
This also allows a performance configuration that is fully compatible with the [automatic device selection](./auto_device_selection.md).
The special `ov::hint::PerformanceMode::UNDEFINED` acts the same as specifying no hint.

Also, see the last section of this document on conducting performance measurements with the `benchmark_app`.

Notice that if performance factors other than inference time, such as memory footprint and model load/compilation time, are of concern, a typical model may take significantly longer to load with `ov::hint::PerformanceMode::THROUGHPUT` and consume much more memory, compared to `ov::hint::PerformanceMode::LATENCY`.

## Performance Hints: How It Works

Internally, every device "translates" the value of the hint to the actual performance settings.
For example, the `ov::hint::PerformanceMode::THROUGHPUT` selects the number of CPU or GPU streams.
For the GPU, additionally, the optimal batch size is selected and [automatic batching](../OV_Runtime_UG/automatic_batching.md) is applied whenever possible (and if the device supports it, [refer to the devices/features support matrix](./supported_plugins/Device_Plugins.md)).

The resulting (device-specific) settings can be queried back from the instance of the `ov::CompiledModel`.
Notice that the `benchmark_app` outputs the actual settings for the THROUGHPUT hint, see the bottom of the output example:

```
$ benchmark_app -hint tput -d CPU -m 'path to your favorite model'
...
[Step 8/11] Setting optimal runtime parameters
[ INFO ] Device: CPU
[ INFO ]   { PERFORMANCE_HINT , THROUGHPUT }
...
[ INFO ]   { OPTIMAL_NUMBER_OF_INFER_REQUESTS , 4 }
[ INFO ]   { NUM_STREAMS , 4 }
...
```

## Using the Performance Hints: Basic API

In the example code snippet below, `ov::hint::PerformanceMode::THROUGHPUT` is specified for the `ov::hint::performance_mode` property for `compile_model`:
@sphinxdirective

.. tab:: C++

   .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
      :language: cpp
      :fragment: [compile_model]

.. tab:: Python

   .. doxygensnippet:: docs/snippets/ov_auto_batching.py
      :language: python
      :fragment: [compile_model]

@endsphinxdirective

## Additional (Optional) Hints from the App

Let's take the example of an application that processes 4 video streams. The most future-proof way to communicate the limitation of the parallel slack is to equip the performance hint with the optional `ov::hint::num_requests` configuration key set to 4.
As discussed previously, for the GPU this will limit the batch size, and for the CPU, the number of inference streams, so each device uses the `ov::hint::num_requests` while converting the hint to the actual device configuration options:
@sphinxdirective

.. tab:: C++

   .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
      :language: cpp
      :fragment: [hint_num_requests]

.. tab:: Python

   .. doxygensnippet:: docs/snippets/ov_auto_batching.py
      :language: python
      :fragment: [hint_num_requests]

@endsphinxdirective

## Optimal Number of Inference Requests

Using the hints assumes that the application queries the `ov::optimal_number_of_infer_requests` to create and run the returned number of requests simultaneously:
@sphinxdirective

.. tab:: C++

   .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
      :language: cpp
      :fragment: [query_optimal_num_requests]

.. tab:: Python

   .. doxygensnippet:: docs/snippets/ov_auto_batching.py
      :language: python
      :fragment: [query_optimal_num_requests]

@endsphinxdirective

While an application is free to create more requests if needed (for example, to support asynchronous inputs population), **it is very important to run at least the `ov::optimal_number_of_infer_requests` inference requests in parallel**, for efficiency (device utilization) reasons.

Also, notice that `ov::hint::PerformanceMode::LATENCY` does not necessarily imply using a single inference request. For example, multi-socket CPUs can deliver as many requests (at the same minimal latency) as the machine has NUMA nodes.
To make your application fully scalable, prefer querying the `ov::optimal_number_of_infer_requests` directly.
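The "query N, then keep N requests in flight" pattern can be sketched with Python's standard library; the value 4 and the `infer` function below are illustrative stand-ins, NOT the OpenVINO API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in for the value queried via ov::optimal_number_of_infer_requests;
# 4 here is purely illustrative.
optimal_nireq = 4

def infer(frame_id):
    return frame_id % 3  # stand-in for one inference call

frames = list(range(12))
results = {}
# Keep exactly `optimal_nireq` requests in flight: the pool never runs more
# workers than that, so the "device" stays saturated but not oversubscribed.
with ThreadPoolExecutor(max_workers=optimal_nireq) as pool:
    futures = {pool.submit(infer, f): f for f in frames}
    for fut in as_completed(futures):
        results[futures[fut]] = fut.result()

print(len(results))  # 12
```

With the real API, the pool of worker threads corresponds to the pool of infer requests created from the `ov::CompiledModel`.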

## Prefer Async API

The API of the inference requests offers Sync and Async execution. While `ov::InferRequest::infer` is inherently synchronous and simple to operate (as it serializes the execution flow in the current application thread), the Async "splits" the `infer` into `ov::InferRequest::start_async` and a use of `ov::InferRequest::wait` (or callbacks). Please consider the [API examples](../OV_Runtime_UG/ov_infer_request.md).
Although the Synchronous API can be somewhat easier to start with, in production code always prefer the Asynchronous (callbacks-based) API, as it is the most general and scalable way to implement flow control for any possible number of requests (and hence both latency and throughput scenarios).

## Combining the Hints and Individual Low-Level Settings

While sacrificing portability to some extent, it is possible to combine the hints with individual device-specific settings.
For example, you can let the device prepare a configuration for `ov::hint::PerformanceMode::THROUGHPUT` while overriding any specific value:
@sphinxdirective

.. tab:: C++

   .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
      :language: cpp
      :fragment: [hint_plus_low_level]

.. tab:: Python

   .. doxygensnippet:: docs/snippets/ov_auto_batching.py
      :language: python
      :fragment: [hint_plus_low_level]

@endsphinxdirective
## Testing the Performance of the Hints with the benchmark_app

The `benchmark_app`, which exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md) versions, is the best way to evaluate the performance of the performance hints for a particular device:
- benchmark_app **-hint tput** -d 'device' -m 'path to your model'
- benchmark_app **-hint latency** -d 'device' -m 'path to your model'
- Disabling the hints to emulate the pre-hints era (highly recommended before trying the individual low-level settings, such as the number of streams as below, threads, etc.):
  - benchmark_app **-hint none -nstreams 1** -d 'device' -m 'path to your model'

### See Also

[Supported Devices](./supported_plugins/Supported_Devices.md)
docs/OV_Runtime_UG/supported_plugins/ARM_CPU.md (new file)
@@ -0,0 +1,91 @@

# Arm® CPU device {#openvino_docs_OV_UG_supported_plugins_ARM_CPU}

## Introducing the Arm® CPU Plugin

The Arm® CPU plugin is developed to enable inference of deep neural networks on Arm® CPUs, using the [Compute Library](https://github.com/ARM-software/ComputeLibrary) as a backend.

> **NOTE**: This is a community-level add-on to OpenVINO™. Intel® welcomes community participation in the OpenVINO™ ecosystem; technical questions on community forums as well as code contributions are welcome. However, this component has not undergone full release validation or qualification from Intel®, and no official support is offered.

The Arm® CPU plugin is not a part of the Intel® Distribution of OpenVINO™ toolkit and is not distributed in pre-built form. To use the plugin, it should be built from source code. The plugin build procedure is described in [How to build Arm® CPU plugin](https://github.com/openvinotoolkit/openvino_contrib/wiki/How-to-build-ARM-CPU-plugin).

The set of supported layers is defined in the [Operation set specification](https://github.com/openvinotoolkit/openvino_contrib/wiki/ARM-plugin-operation-set-specification).

## Supported inference data types

The Arm® CPU plugin supports the following data types as inference precision of internal primitives:

- Floating-point data types:
  - f32
  - f16
- Quantized data types:
  - i8

> **NOTE**: i8 support is experimental.

The [Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) can be used to print out the supported data types for all detected devices.

## Supported features

### Preprocessing acceleration

The Arm® CPU plugin supports the following accelerated preprocessing operations:
- Precision conversion:
  - u8 -> u16, s16, s32
  - u16 -> u8, u32
  - s16 -> u8, s32
  - f16 -> f32
- Transposition of tensors with dims < 5
- Interpolation of 4D tensors with no padding (`pads_begin` and `pads_end` equal 0).

The Arm® CPU plugin supports the following preprocessing operations, however they are not accelerated:
- Precision conversions that are not mentioned above
- Color conversion:
  - NV12 to RGB
  - NV12 to BGR
  - i420 to RGB
  - i420 to BGR

See the [preprocessing API guide](../preprocessing_overview.md) for more details.

## Supported properties

The plugin supports the properties listed below.

### Read-write properties

All parameters must be set before calling `ov::Core::compile_model()` in order to take effect, or passed as an additional argument to `ov::Core::compile_model()`:

- ov::enable_profiling

### Read-only properties

- ov::supported_properties
- ov::available_devices
- ov::range_for_async_infer_requests
- ov::range_for_streams
- ov::device::full_name
- ov::device::capabilities

## Known Layer Limitations

* `AvgPool` layer is supported via the arm_compute library for 4D input tensors and via the reference implementation for other cases.
* `BatchToSpace` layer supports 4D tensors only and constant nodes: `block_shape` with `N` = 1 and `C` = 1, `crops_begin` with zero values, and `crops_end` with zero values.
* `ConvertLike` layer supports the same configurations as `Convert`.
* `DepthToSpace` layer supports 4D tensors only and only the `BLOCKS_FIRST` value of the `mode` attribute.
* `Equal` does not support `broadcast` for inputs.
* `Gather` layer supports constant scalar or 1D indices axes only. The layer is supported via the arm_compute library for non-negative indices and via the reference implementation otherwise.
* `Less` does not support `broadcast` for inputs.
* `LessEqual` does not support `broadcast` for inputs.
* `LRN` layer supports `axes = {1}` or `axes = {2, 3}` only.
* `MaxPool-1` layer is supported via the arm_compute library for 4D input tensors and via the reference implementation for other cases.
* `Mod` layer is supported for f32 only.
* `MVN` layer is supported via the arm_compute library for 2D inputs with `false` values of `normalize_variance` and `across_channels`; for other cases the layer is implemented via the runtime reference.
* `Normalize` layer is supported via the arm_compute library with the `MAX` value of `eps_mode` and `axes = {2 | 3}`; for the `ADD` value of `eps_mode` the layer uses `DecomposeNormalizeL2Add`; for other cases the layer is implemented via the runtime reference.
* `NotEqual` does not support `broadcast` for inputs.
* `Pad` layer works with `pad_mode = {REFLECT | CONSTANT | SYMMETRIC}` parameters only.
* `Round` layer is supported via the arm_compute library with the `RoundMode::HALF_AWAY_FROM_ZERO` value of `mode`; for other cases the layer is implemented via the runtime reference.
* `SpaceToBatch` layer supports 4D tensors only and constant nodes: `shapes`, `pads_begin` or `pads_end` with zero paddings for batch or channels, and `shapes` with values of one for batch and channels.
* `SpaceToDepth` layer supports 4D tensors only and only the `BLOCKS_FIRST` value of the `mode` attribute.
* `StridedSlice` layer is supported via the arm_compute library for tensors with dims < 5 and zero values of `ellipsis_mask`, or zero values of `new_axis_mask` and `shrink_axis_mask`; for other cases the layer is implemented via the runtime reference.
* `FakeQuantize` layer is supported via the arm_compute library in Low Precision evaluation mode for suitable models, and via the runtime reference otherwise.

## See Also

* [How to run YOLOv4 model inference using OpenVINO™ and OpenCV on Arm®](https://opencv.org/how-to-run-yolov4-using-openvino-and-opencv-on-arm/)
* [Face recognition on Android™ using OpenVINO™ toolkit with Arm® plugin](https://opencv.org/face-recognition-on-android-using-openvino-toolkit-with-arm-plugin/)

@@ -90,7 +90,7 @@ Each stream is pinned to its own group of physical cores with respect to NUMA no

See the [optimization guide](@ref openvino_docs_deployment_optimization_guide_dldt_optimization_guide) for more details.

> **NOTE**: When it comes to latency, keep in mind that running only one stream on a multi-socket platform may introduce additional overheads on data transfer between NUMA nodes.
> In that case it is better to run inference on one socket (see [Optimizing for Throughput](../../optimization_guide/dldt_deployment_optimization_tput.md) for details).

### Dynamic shapes

The CPU plugin provides full functional support for models with dynamic shapes in terms of the opset coverage.

@@ -11,6 +11,7 @@

   openvino_docs_OV_UG_supported_plugins_GPU
   openvino_docs_IE_DG_supported_plugins_VPU
   openvino_docs_OV_UG_supported_plugins_GNA
   openvino_docs_OV_UG_supported_plugins_ARM_CPU

@endsphinxdirective

@@ -22,6 +23,7 @@ The OpenVINO Runtime provides capabilities to infer deep learning models on the
|
||||
|[GPU](GPU.md) |Intel® Graphics, including Intel® HD Graphics, Intel® UHD Graphics, Intel® Iris® Graphics, Intel® Xe Graphics, Intel® Xe MAX Graphics |
|
||||
|[VPUs](VPU.md) |Intel® Neural Compute Stick 2 powered by the Intel® Movidius™ Myriad™ X, Intel® Vision Accelerator Design with Intel® Movidius™ VPUs |
|
||||
|[GNA](GNA.md) |[Intel® Speech Enabling Developer Kit](https://www.intel.com/content/www/us/en/support/articles/000026156/boards-and-kits/smart-home.html); [Amazon Alexa\* Premium Far-Field Developer Kit](https://developer.amazon.com/en-US/alexa/alexa-voice-service/dev-kits/amazon-premium-voice); [Intel® Pentium® Silver Processors N5xxx, J5xxx and Intel® Celeron® Processors N4xxx, J4xxx (formerly codenamed Gemini Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/83915/gemini-lake.html): [Intel® Pentium® Silver J5005 Processor](https://ark.intel.com/content/www/us/en/ark/products/128984/intel-pentium-silver-j5005-processor-4m-cache-up-to-2-80-ghz.html), [Intel® Pentium® Silver N5000 Processor](https://ark.intel.com/content/www/us/en/ark/products/128990/intel-pentium-silver-n5000-processor-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® J4005 Processor](https://ark.intel.com/content/www/us/en/ark/products/128992/intel-celeron-j4005-processor-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® J4105 Processor](https://ark.intel.com/content/www/us/en/ark/products/128989/intel-celeron-j4105-processor-4m-cache-up-to-2-50-ghz.html), [Intel® Celeron® J4125 Processor](https://ark.intel.com/content/www/us/en/ark/products/197305/intel-celeron-processor-j4125-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® Processor N4100](https://ark.intel.com/content/www/us/en/ark/products/128983/intel-celeron-processor-n4100-4m-cache-up-to-2-40-ghz.html), [Intel® Celeron® Processor N4000](https://ark.intel.com/content/www/us/en/ark/products/128988/intel-celeron-processor-n4000-4m-cache-up-to-2-60-ghz.html); [Intel® Pentium® Processors N6xxx, J6xxx, Intel® Celeron® Processors N6xxx, J6xxx and Intel Atom® x6xxxxx (formerly codenamed Elkhart Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/128825/products-formerly-elkhart-lake.html); [Intel® Core™ Processors (formerly codenamed Cannon 
Lake)](https://ark.intel.com/content/www/us/en/ark/products/136863/intel-core-i3-8121u-processor-4m-cache-up-to-3-20-ghz.html); [10th Generation Intel® Core™ Processors (formerly codenamed Ice Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/74979/ice-lake.html): [Intel® Core™ i7-1065G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i71065g7-processor-8m-cache-up-to-3-90-ghz.html), [Intel® Core™ i7-1060G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/197120/intel-core-i71060g7-processor-8m-cache-up-to-3-80-ghz.html), [Intel® Core™ i5-1035G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/196591/intel-core-i51035g4-processor-6m-cache-up-to-3-70-ghz.html), [Intel® Core™ i5-1035G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/196592/intel-core-i51035g7-processor-6m-cache-up-to-3-70-ghz.html), [Intel® Core™ i5-1035G1 Processor](https://ark.intel.com/content/www/us/en/ark/products/196603/intel-core-i51035g1-processor-6m-cache-up-to-3-60-ghz.html), [Intel® Core™ i5-1030G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/197119/intel-core-i51030g7-processor-6m-cache-up-to-3-50-ghz.html), [Intel® Core™ i5-1030G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/197121/intel-core-i51030g4-processor-6m-cache-up-to-3-50-ghz.html), [Intel® Core™ i3-1005G1 Processor](https://ark.intel.com/content/www/us/en/ark/products/196588/intel-core-i31005g1-processor-4m-cache-up-to-3-40-ghz.html), [Intel® Core™ i3-1000G1 Processor](https://ark.intel.com/content/www/us/en/ark/products/197122/intel-core-i31000g1-processor-4m-cache-up-to-3-20-ghz.html), [Intel® Core™ i3-1000G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/197123/intel-core-i31000g4-processor-4m-cache-up-to-3-20-ghz.html); [11th Generation Intel® Core™ Processors (formerly codenamed Tiger 
Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/88759/tiger-lake.html); [12th Generation Intel® Core™ Processors (formerly codenamed Alder Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/147470/products-formerly-alder-lake.html)|
|[Arm® CPU](ARM_CPU.md) |Raspberry Pi™ 4 Model B, Apple® Mac mini with M1 chip, NVIDIA® Jetson Nano™, Android™ devices |

OpenVINO Runtime also has several execution capabilities which work on top of other devices:
## Features support matrix

The table below demonstrates support of key features by OpenVINO device plugins.

| Capability | [CPU](CPU.md) | [GPU](GPU.md) | [GNA](GNA.md) | [VPU](VPU.md) | [Arm® CPU](ARM_CPU.md) |
| ---------- | --- | --- | --- | --- | --- |
| [Heterogeneous execution](../hetero_execution.md) | Yes | Yes | No | ? | Yes |
| [Multi-device execution](../multi_device.md) | Yes | Yes | Partial | ? | Yes |
| [Automatic batching](../automatic_batching.md) | No | Yes | No | ? | No |
| [Multi-stream execution](@ref openvino_docs_optimization_guide_dldt_optimization_guide) | Yes | Yes | No | ? | Yes |
| [Model caching](../Model_caching_overview.md) | Yes | Partial | Yes | ? | No |
| [Dynamic shapes](../ov_dynamic_shapes.md) | Yes | Partial | No | ? | No |
| Import/Export | Yes | No | Yes | ? | No |
| [Preprocessing acceleration](../preprocessing_overview.md) | Yes | Yes | No | ? | Partial |
| [Stateful models](../network_state_intro.md) | Yes | No | Yes | ? | No |
| [Extensibility](@ref openvino_docs_Extensibility_UG_Intro) | Yes | Yes | No | ? | No |

For more details on plugin-specific feature limitations, see the corresponding plugin pages.

The floating-point precision of a GPU primitive is selected based on operation precision in the IR, except for the [compressed FP16 IR form](../../MO_DG/prepare_model/FP16_Compression.md), which is executed in FP16 precision.

> **NOTE**: Hardware acceleration for i8/u8 precision may be unavailable on some platforms. In that case, the model is executed in the floating-point precision taken from the IR. Hardware support of u8/i8 acceleration can be queried via the `ov::device::capabilities` property.

The [Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) can be used to print out the supported data types for all detected devices.

### Automatic batching
The GPU plugin is capable of reporting `ov::max_batch_size` and `ov::optimal_batch_size` metrics with respect to the current hardware platform and model. Thus, automatic batching is enabled when `ov::optimal_batch_size` is greater than 1 and `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` is set.
Alternatively, it can be enabled explicitly via the device notation, e.g. `"BATCH:GPU"`.

@sphinxdirective
:language: cpp
:fragment: [compile_model_batch_plugin]
.. tab:: Batching via throughput hint
.. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
:language: cpp

@endsphinxdirective

The behavior depends on specific parameters of the operations and hardware configuration.

## GPU Performance Checklist: Summary <a name="gpu-checklist"></a>
OpenVINO relies on OpenCL™ kernels for the GPU implementation, so many general OpenCL tips apply:
- Prefer `FP16` inference precision over `FP32`, as Model Optimizer can generate both variants, and `FP32` is the default. Also, consider [int8 inference](../Int8Inference.md).
- Try to group individual infer jobs by using [automatic batching](../automatic_batching.md).
- Consider [caching](../Model_caching_overview.md) to minimize model load time.
- If your application simultaneously runs inference on the CPU or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. You can use [CPU configuration options](./CPU.md) to limit the number of inference threads for the CPU plugin.
- Even in the GPU-only scenario, a GPU driver might occupy a CPU core with spin-loop polling for completion. If the _CPU_ utilization is a concern, consider the dedicated options referenced in this document. Note that this option might increase inference latency, so consider combining it with multiple GPU streams or [throughput performance hints](../performance_hints.md).
- When operating on media inputs, consider the [remote tensors API of the GPU Plugin](./GPU_RemoteTensor_API.md).

## See Also
* [Supported Devices](Supported_Devices.md)
* [Optimization guide](@ref openvino_docs_optimization_guide_dldt_optimization_guide)

The OpenVINO Runtime provides unique capabilities to infer deep learning models:

|[CPU plugin](CPU.md) |Intel® Xeon® with Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and AVX512_BF16, Intel® Core™ Processors with Intel® AVX2, Intel® Atom® Processors with Intel® Streaming SIMD Extensions (Intel® SSE) |
|[VPU plugins](VPU.md) (available in the Intel® Distribution of OpenVINO™ toolkit) |Intel® Neural Compute Stick 2 powered by the Intel® Movidius™ Myriad™ X, Intel® Vision Accelerator Design with Intel® Movidius™ VPUs |
|[GNA plugin](GNA.md) (available in the Intel® Distribution of OpenVINO™ toolkit) |Intel® Speech Enabling Developer Kit, Amazon Alexa* Premium Far-Field Developer Kit, Intel® Pentium® Silver J5005 Processor, Intel® Pentium® Silver N5000 Processor, Intel® Celeron® J4005 Processor, Intel® Celeron® J4105 Processor, Intel® Celeron® Processor N4100, Intel® Celeron® Processor N4000, Intel® Core™ i3-8121U Processor, Intel® Core™ i7-1065G7 Processor, Intel® Core™ i7-1060G7 Processor, Intel® Core™ i5-1035G4 Processor, Intel® Core™ i5-1035G7 Processor, Intel® Core™ i5-1035G1 Processor, Intel® Core™ i5-1030G7 Processor, Intel® Core™ i5-1030G4 Processor, Intel® Core™ i3-1005G1 Processor, Intel® Core™ i3-1000G1 Processor, Intel® Core™ i3-1000G4 Processor|
|[Arm® CPU plugin](ARM_CPU.md) (unavailable in the Intel® Distribution of OpenVINO™ toolkit) |Raspberry Pi™ 4 Model B, Apple® Mac mini with M1 chip, NVIDIA® Jetson Nano™, Android™ devices |
|[Multi-Device execution](../multi_device.md) |Multi-Device execution enables simultaneous inference of the same model on several devices in parallel |
|[Auto-Device plugin](../auto_device_selection.md) |Auto-Device plugin enables selecting an Intel® device for inference automatically |
|[Heterogeneous plugin](../hetero_execution.md) |Heterogeneous execution enables automatic inference splitting between several devices (for example, if a device does not [support a certain operation](#supported-layers)) |

The table below shows the plugin libraries and additional dependencies for Linux, Windows, and macOS:

| Plugin | Library name for Linux | Dependency libraries for Linux | Library name for Windows | Dependency libraries for Windows | Library name for macOS | Dependency libraries for macOS |
| ------ | ---------------------- | ------------------------------ | ------------------------ | -------------------------------- | ---------------------- | ------------------------------ |
| MYRIAD | `libopenvino_intel_myriad_plugin.so` | `libusb.so` | `openvino_intel_myriad_plugin.dll` | `usb.dll` | `libopenvino_intel_myriad_plugin.so` | `libusb.dylib` |
| HDDL | `libintel_hddl_plugin.so` | `libbsl.so`, `libhddlapi.so`, `libmvnc-hddl.so` | `intel_hddl_plugin.dll` | `bsl.dll`, `hddlapi.dll`, `json-c.dll`, `libcrypto-1_1-x64.dll`, `libssl-1_1-x64.dll`, `mvnc-hddl.dll` | Is not supported | - |
| GNA | `libopenvino_intel_gna_plugin.so` | `libgna.so` | `openvino_intel_gna_plugin.dll` | `gna.dll` | Is not supported | - |
| Arm® CPU | `libopenvino_arm_cpu_plugin.so` | - | Is not supported | - | `libopenvino_arm_cpu_plugin.so` | - |
| HETERO | `libopenvino_hetero_plugin.so` | Same as for selected plugins | `openvino_hetero_plugin.dll` | Same as for selected plugins | `libopenvino_hetero_plugin.so` | Same as for selected plugins |
| MULTI | `libopenvino_auto_plugin.so` | Same as for selected plugins | `openvino_auto_plugin.dll` | Same as for selected plugins | `libopenvino_auto_plugin.so` | Same as for selected plugins |
| AUTO | `libopenvino_auto_plugin.so` | Same as for selected plugins | `openvino_auto_plugin.dll` | Same as for selected plugins | `libopenvino_auto_plugin.so` | Same as for selected plugins |

### Supported Model Formats

|Plugin             |FP32                    |FP16                    |I8                      |
|:------------------|:----------------------:|:----------------------:|:----------------------:|
|CPU plugin         |Supported and preferred |Supported               |Supported               |
|GPU plugin         |Supported               |Supported and preferred |Supported               |
|VPU plugins        |Not supported           |Supported               |Not supported           |
|GNA plugin         |Supported               |Supported               |Not supported           |
|Arm® CPU plugin    |Supported and preferred |Supported               |Supported (partially)   |

For [Multi-Device](../multi_device.md) and [Heterogeneous](../hetero_execution.md) executions,
the supported model formats depend on the actual underlying devices. _Generally, FP16 is preferable as it is the most ubiquitous and performant_.

### Supported Input Precision

|Plugin             |FP32      |FP16           |U8             |U16            |I8            |I16            |
|:------------------|:--------:|:-------------:|:-------------:|:-------------:|:------------:|:-------------:|
|CPU plugin         |Supported |Not supported  |Supported      |Supported      |Not supported |Supported      |
|GPU plugin         |Supported |Supported\*    |Supported\*    |Supported\*    |Not supported |Supported\*    |
|VPU plugins        |Supported |Supported      |Supported      |Not supported  |Not supported |Not supported  |
|GNA plugin         |Supported |Not supported  |Supported      |Not supported  |Supported     |Supported      |
|Arm® CPU plugin    |Supported |Supported      |Supported      |Supported      |Not supported |Not supported  |

<br>\* - Supported via `SetBlob` only; `GetBlob` returns FP32.<br>
For [Multi-Device](../multi_device.md) and [Heterogeneous](../hetero_execution.md) executions,
the supported input precision depends on the actual underlying devices.

### Supported Output Precision

|Plugin             |FP32      |FP16          |
|:------------------|:--------:|:------------:|
|CPU plugin         |Supported |Not supported |
|GPU plugin         |Supported |Supported     |
|VPU plugins        |Supported |Supported     |
|GNA plugin         |Supported |Not supported |
|Arm® CPU plugin    |Supported |Supported     |

For [Multi-Device](../multi_device.md) and [Heterogeneous](../hetero_execution.md) executions,
the supported output precision depends on the actual underlying devices. _Generally, FP32 is preferable as it is most ubiquitous_.

### Supported Input Layout

|Plugin             |NCDHW         |NCHW          |NHWC          |NC            |
|:------------------|:------------:|:------------:|:------------:|:------------:|
|CPU plugin         |Supported     |Supported     |Supported     |Supported     |
|GPU plugin         |Supported     |Supported     |Supported     |Supported     |
|VPU plugins        |Supported     |Supported     |Supported     |Supported     |
|GNA plugin         |Not supported |Supported     |Supported     |Supported     |
|Arm® CPU plugin    |Not supported |Supported     |Supported     |Supported     |

### Supported Output Layout

### Supported Layers
The following layers are supported by the plugins and by the [Shape Inference feature](../ShapeInference.md):

| Layers | GPU | CPU | VPU | GNA | Arm® CPU | ShapeInfer |
|:-------------------------------|:-------------:|:-------------:|:-------------:|:-------------:|:-----------------:|:-------------:|
| Abs | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
| Acos | Supported | Supported\*\* | Not Supported | Not Supported | Supported\*\*\*\* | Supported |
| Acosh | Supported | Supported\*\* | Not Supported | Not Supported | Supported\*\*\*\* | Supported |
| Activation-Clamp | Supported | Supported\*\*\* | Supported | Supported | Supported | Supported |
| Activation-ELU | Supported | Supported\*\*\* | Supported | Not Supported | Supported | Supported |
| Activation-Exp | Supported | Supported\*\*\* | Supported | Supported | Supported | Supported |
| Activation-Leaky ReLU | Supported | Supported\*\*\* | Supported | Supported | Not Supported | Supported |
| Activation-Not | Supported | Supported\*\*\* | Supported | Not Supported | Not Supported | Supported |
| Activation-PReLU | Supported | Supported\*\*\* | Supported | Not Supported | Supported | Supported |
| Activation-ReLU | Supported | Supported\*\*\* | Supported | Supported | Supported | Supported |
| Activation-ReLU6 | Supported | Supported\*\*\* | Supported | Not Supported | Not Supported | Supported |
| Activation-Sigmoid/Logistic | Supported | Supported\*\*\* | Supported | Supported | Supported | Supported |
| Activation-TanH | Supported | Supported\*\*\* | Supported | Supported | Supported | Supported |
| ArgMax | Supported | Supported\*\* | Supported | Not Supported | Not Supported | Supported |
| Asin | Supported | Supported\*\* | Not Supported | Not Supported | Supported\*\*\*\* | Supported |
| Asinh | Supported | Supported\*\* | Not Supported | Not Supported | Supported\*\*\*\* | Supported |
| Atan | Supported | Supported\*\* | Not Supported | Not Supported | Supported\*\*\*\* | Supported |
| Atanh | Supported | Supported\*\* | Not Supported | Not Supported | Supported\*\*\*\* | Supported |
| BatchNormalization | Supported | Supported | Supported | Not Supported | Supported | Supported |
| BinaryConvolution | Supported | Supported | Not Supported | Not Supported | Not Supported | Supported |
| Broadcast | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
| Ceil | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
| Concat | Supported | Supported\*\*\* | Supported | Supported | Supported | Supported |
| Const | Supported | Supported | Supported | Supported | Supported | Not Supported |
| Convolution-Dilated | Supported | Supported | Supported | Not Supported | Supported | Supported |
| Convolution-Dilated 3D | Supported | Supported | Not Supported | Not Supported | Not Supported | Not Supported |
| Convolution-Grouped | Supported | Supported | Supported | Not Supported | Supported | Supported |
| Convolution-Grouped 3D | Supported | Supported | Not Supported | Not Supported | Not Supported | Not Supported |
| Convolution-Ordinary | Supported | Supported | Supported | Supported\* | Supported | Supported |
| Convolution-Ordinary 3D | Supported | Supported | Not Supported | Not Supported | Not Supported | Not Supported |
| Cos | Supported | Supported\*\* | Not Supported | Not Supported | Supported\*\*\*\* | Supported |
| Cosh | Supported | Supported\*\* | Not Supported | Not Supported | Supported\*\*\*\* | Supported |
| Crop | Supported | Supported | Supported | Supported | Not Supported | Supported |
| CTCGreedyDecoder | Supported\*\* | Supported\*\* | Supported\* | Not Supported | Supported\*\*\*\* | Supported |
| Deconvolution | Supported | Supported | Supported | Not Supported | Not Supported | Supported |
| Deconvolution 3D | Supported | Supported | Not Supported | Not Supported | Not Supported | Not Supported |
| DeformableConvolution | Supported | Supported | Not Supported | Not Supported | Not Supported | Supported |
| DepthToSpace | Supported | Supported\*\* | Not Supported | Not Supported | Supported\* | Supported |
| DetectionOutput | Supported | Supported\*\* | Supported\* | Not Supported | Supported\*\*\*\* | Supported |
| Eltwise-And | Supported | Supported\*\*\* | Supported | Not Supported | Supported | Supported |
| Eltwise-Add | Supported | Supported\*\*\* | Supported | Not Supported | Supported | Supported |
| Eltwise-Div | Supported | Supported\*\*\* | Supported | Not Supported | Supported | Supported |
| Eltwise-Equal | Supported | Supported\*\*\* | Supported | Not Supported | Supported\* | Supported |
| Eltwise-FloorMod | Supported | Supported\*\*\* | Supported | Not Supported | Supported\*\*\*\* | Supported |
| Eltwise-Greater | Supported | Supported\*\*\* | Supported | Not Supported | Supported | Supported |
| Eltwise-GreaterEqual | Supported | Supported\*\*\* | Supported | Not Supported | Supported | Supported |
| Eltwise-Less | Supported | Supported\*\*\* | Supported | Not Supported | Supported\* | Supported |
| Eltwise-LessEqual | Supported | Supported\*\*\* | Supported | Not Supported | Supported\* | Supported |
| Eltwise-LogicalAnd | Supported | Supported\*\*\* | Supported | Not Supported | Supported | Supported |
| Eltwise-LogicalOr | Supported | Supported\*\*\* | Supported | Not Supported | Supported | Supported |
| Eltwise-LogicalXor | Supported | Supported\*\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| Eltwise-Max | Supported |Supported\*\*\*| Supported | Not Supported | Supported | Supported |
|
||||
| Eltwise-Min | Supported |Supported\*\*\*| Supported | Not Supported | Supported | Supported |
|
||||
| Eltwise-Mul | Supported |Supported\*\*\*| Supported | Supported | Supported | Supported |
|
||||
| Eltwise-NotEqual | Supported |Supported\*\*\*| Supported | Not Supported | Supported\* | Supported |
|
||||
| Eltwise-Pow | Supported |Supported\*\*\*| Supported | Not Supported | Supported | Supported |
|
||||
| Eltwise-Prod | Supported |Supported\*\*\*| Supported | Supported | Not Supported | Supported |
|
||||
| Eltwise-SquaredDiff | Supported |Supported\*\*\*| Supported | Not Supported | Supported | Supported |
|
||||
| Eltwise-Sub | Supported |Supported\*\*\*| Supported | Supported | Supported | Supported |
|
||||
| Eltwise-Sum | Supported |Supported\*\*\*| Supported | Supported |Supported\*\*\*\*| Supported |
|
||||
| Erf | Supported | Supported\*\* | Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| Exp | Supported | Supported | Supported | Supported | Supported | Supported |
|
||||
| FakeQuantize | Not Supported | Supported | Not Supported | Not Supported | Supported\* | Supported |
|
||||
| Fill | Not Supported | Supported\*\* | Not Supported | Not Supported | Not Supported | Supported |
|
||||
| Flatten | Supported | Supported | Supported | Not Supported | Not Supported | Supported |
|
||||
| Floor | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| FullyConnected (Inner Product) | Supported |Supported\*\*\*| Supported | Supported | Supported | Supported |
|
||||
| Gather | Supported | Supported\*\* | Supported | Not Supported | Supported\* | Supported |
|
||||
| GatherTree | Not Supported | Supported\*\* | Not Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| Gemm | Supported | Supported | Supported | Not Supported | Not Supported | Supported |
|
||||
| GRN | Supported\*\* | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| HardSigmoid | Supported | Supported\*\* | Not Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| Interp | Supported\*\* | Supported\*\* | Supported | Not Supported | Supported\* | Supported\* |
|
||||
| Log | Supported | Supported\*\* | Supported | Supported | Supported | Supported |
|
||||
| LRN (Norm) | Supported | Supported | Supported | Not Supported | Supported\* | Supported |
|
||||
| LSTMCell | Supported | Supported | Supported | Supported | Supported | Not Supported |
|
||||
| GRUCell | Supported | Supported | Not Supported | Not Supported | Supported | Not Supported |
|
||||
| RNNCell | Supported | Supported | Not Supported | Not Supported | Supported | Not Supported |
|
||||
| LSTMSequence | Supported | Supported | Supported | Not Supported |Supported\*\*\*\*| Not Supported |
|
||||
| GRUSequence | Supported | Supported | Not Supported | Not Supported |Supported\*\*\*\*| Not Supported |
|
||||
| RNNSequence | Supported | Supported | Not Supported | Not Supported |Supported\*\*\*\*| Not Supported |
|
||||
| LogSoftmax | Supported | Supported\*\* | Not Supported | Not Supported | Supported | Not Supported |
|
||||
| Memory | Not Supported | Supported | Not Supported | Supported | Not Supported | Supported |
|
||||
| MVN | Supported | Supported\*\* | Supported\* | Not Supported | Supported\* | Supported |
|
||||
| Neg | Supported | Supported\*\* | Not Supported | Not Supported | Supported | Supported |
|
||||
| NonMaxSuppression | Not Supported | Supported\*\* | Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| Normalize | Supported | Supported\*\* | Supported\* | Not Supported | Supported\* | Supported |
|
||||
| OneHot | Supported | Supported\*\* | Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| Pad | Supported | Supported\*\* | Supported\* | Not Supported | Supported\* | Supported |
|
||||
| Permute | Supported | Supported | Supported | Supported\* | Not Supported | Supported |
|
||||
| Pooling(AVG,MAX) | Supported | Supported | Supported | Supported | Supported | Supported |
|
||||
| Pooling(AVG,MAX) 3D | Supported | Supported | Not Supported | Not Supported | Supported\* | Not Supported |
|
||||
| Power | Supported | Supported\*\* | Supported | Supported\* | Supported | Supported |
|
||||
| PowerFile | Not Supported | Supported\*\* | Not Supported | Not Supported | Not Supported | Not Supported |
|
||||
| PriorBox | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| PriorBoxClustered | Supported\*\* | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| Proposal | Supported | Supported\*\* | Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| PSROIPooling | Supported | Supported\*\* | Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| Range | Not Supported | Supported\*\* | Not Supported | Not Supported | Not Supported | Supported |
|
||||
| Reciprocal | Supported | Supported\*\* | Not Supported | Not Supported | Not Supported | Supported |
|
||||
| ReduceAnd | Supported | Supported\*\* | Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| ReduceL1 | Supported | Supported\*\* | Not Supported | Not Supported | Supported | Supported |
|
||||
| ReduceL2 | Supported | Supported\*\* | Not Supported | Not Supported | Supported | Supported |
|
||||
| ReduceLogSum | Supported | Supported\*\* | Not Supported | Not Supported | Supported | Supported |
|
||||
| ReduceLogSumExp | Supported | Supported\*\* | Not Supported | Not Supported | Not Supported | Supported |
|
||||
| ReduceMax | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| ReduceMean | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| ReduceMin | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| ReduceOr | Supported | Supported\*\* | Not Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| ReduceProd | Supported | Supported\*\* | Not Supported | Not Supported | Supported | Supported |
|
||||
| ReduceSum | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| ReduceSumSquare | Supported | Supported\*\* | Not Supported | Not Supported | Not Supported | Supported |
|
||||
| RegionYolo | Supported | Supported\*\* | Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| ReorgYolo | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| Resample | Supported | Supported\*\* | Supported | Not Supported | Not Supported | Supported |
|
||||
| Reshape | Supported |Supported\*\*\*| Supported | Supported | Supported | Supported\* |
|
||||
| ReverseSequence | Supported | Supported\*\* | Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| RNN | Not Supported | Supported | Supported | Not Supported | Supported | Not Supported |
|
||||
| ROIPooling | Supported\* | Supported | Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| ScaleShift | Supported |Supported\*\*\*| Supported\* | Supported | Not Supported | Supported |
|
||||
| ScatterUpdate | Not Supported | Supported\*\* | Supported | Not Supported | Not Supported | Supported |
|
||||
| Select | Supported | Supported | Supported | Not Supported | Supported | Supported |
|
||||
| Selu | Supported | Supported\*\* | Not Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| ShuffleChannels | Supported | Supported\*\* | Not Supported | Not Supported | Supported | Supported |
|
||||
| Sign | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| Sin | Supported | Supported\*\* | Not Supported | Not Supported | Supported | Supported |
|
||||
| Sinh | Supported | Supported\*\* | Not Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| SimplerNMS | Supported | Supported\*\* | Not Supported | Not Supported | Not Supported | Supported |
|
||||
| Slice | Supported |Supported\*\*\*| Supported | Supported | Not Supported | Supported |
|
||||
| SoftMax | Supported |Supported\*\*\*| Supported | Not Supported | Supported | Supported |
|
||||
| Softplus | Supported | Supported\*\* | Supported | Not Supported | Supported | Supported |
|
||||
| Softsign | Supported | Supported\*\* | Not Supported | Supported | Not Supported | Supported |
|
||||
| SpaceToDepth | Not Supported | Supported\*\* | Not Supported | Not Supported | Supported\* | Supported |
|
||||
| SpatialTransformer | Not Supported | Supported\*\* | Not Supported | Not Supported | Not Supported | Supported |
|
||||
| Split | Supported |Supported\*\*\*| Supported | Supported | Supported | Supported |
|
||||
| Squeeze | Supported | Supported\*\* | Supported | Supported | Supported | Supported |
|
||||
| StridedSlice | Supported | Supported\*\* | Supported | Not Supported | Supported\* | Supported |
|
||||
| Tan | Supported | Supported\*\* | Not Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| TensorIterator | Not Supported | Supported | Supported | Supported | Supported | Not Supported |
|
||||
| Tile | Supported\*\* |Supported\*\*\*| Supported | Not Supported | Supported | Supported |
|
||||
| TopK | Supported | Supported\*\* | Supported | Not Supported |Supported\*\*\*\*| Supported |
|
||||
| Unpooling | Supported | Not Supported | Not Supported | Not Supported | Not Supported | Not Supported |
|
||||
| Unsqueeze | Supported | Supported\*\* | Supported | Supported | Supported | Supported |
|
||||
| Upsampling | Supported | Not Supported | Not Supported | Not Supported | Not Supported | Not Supported |
|
||||
|
||||
\*- support is limited to specific parameters. Refer to the "Known Layers Limitations" section for the corresponding device [from the list of supported devices](Supported_Devices.md).

\*\*- support is implemented via the [Extensibility mechanism](../../Extensibility_UG/Intro.md).

\*\*\*- supports NCDHW layout.

\*\*\*\*- support is implemented via runtime reference.
docs/_static/css/custom.css
@@ -77,3 +77,9 @@ div.highlight {
    width:100%;
    color: #fff;
}

@media (min-width: 1200px) {
    .container, .container-lg, .container-md, .container-sm, .container-xl {
        max-width: 1800px;
    }
}
@@ -104,6 +104,12 @@ repositories = {
        'github_version': 'master',
        'host_url': 'https://github.com'
    },
    'ote': {
        'github_user': 'openvinotoolkit',
        'github_repo': 'training_extensions',
        'github_version': 'develop',
        'host_url': 'https://github.com'
    },
    'open_model_zoo': {
        'github_user': 'openvinotoolkit',
        'github_repo': 'open_model_zoo',
@@ -29,6 +29,7 @@
   openvino_docs_optimization_guide_dldt_optimization_guide
   openvino_docs_MO_DG_Getting_Performance_Numbers
   openvino_docs_model_optimization_guide
   openvino_docs_deployment_optimization_guide_dldt_optimization_guide
   openvino_docs_tuning_utilities
   openvino_docs_performance_benchmarks
@@ -61,6 +62,7 @@
   :hidden:

   ovms_what_is_openvino_model_server
   ote_documentation
   ovsa_get_started

.. toctree::
@@ -98,16 +98,16 @@ function getItemRefTargetString(item)
end

local s =
    ".. index:: pair: " .. item.memberKind .. "; " .. item.name .. "\n" ..
    ".. _doxid-" .. item.id .. ":\n"
    ".. _doxid-" .. item.id .. ":\n" ..
    ".. index:: pair: " .. item.memberKind .. "; " .. item.name .. "\n"

if item.isSubGroupHead then
    for j = 1, #item.subGroupSlaveArray do
        slaveItem = item.subGroupSlaveArray[j]

        s = s ..
            ".. index:: pair: " .. slaveItem.memberKind .. "; " .. slaveItem.name .. "\n" ..
            ".. _doxid-" .. slaveItem.id .. ":\n"
            ".. _doxid-" .. slaveItem.id .. ":\n" ..
            ".. index:: pair: " .. slaveItem.memberKind .. "; " .. slaveItem.name .. "\n"
    end
end
docs/img/BATCH_device.PNG
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d1461f042894cd61c2812f12ffa566e1723fdd16a1ee8398321e58d309143475
size 123115
@@ -78,7 +78,7 @@ OpenVINO™ Documentation
   </a>
   <a href="openvino_docs_optimization_guide_dldt_optimization_guide.html" >
      <h3>Tune & Optimize </h3>
      <p> Use quantization, pruning, and sparsity algorithms to make your application as efficient as possible. </p>
      <p> Model-level (e.g. quantization) and Runtime (i.e. application) -level optimizations to make your inference as fast as possible. </p>
   </a>
   <a href="openvino_docs_performance_benchmarks.html" >
      <h3>Performance<br /> Benchmarks </h3>
@@ -30,17 +30,17 @@ The complete list of supported hardware is available in the [Release Notes](http
   sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
   ```
   > **NOTE**: You might need to install GnuPG: `sudo apt-get install gnupg`

2. Add the repository via the following command:
   @sphinxdirective

   .. tab:: Ubuntu 18
   .. tab:: On Ubuntu 18

      .. code-block:: sh

         echo "deb https://apt.repos.intel.com/openvino/2022 bionic main" | sudo tee /etc/apt/sources.list.d/intel-openvino-2022.list

   .. tab:: Ubuntu 20
   .. tab:: On Ubuntu 20

      .. code-block:: sh
@@ -53,12 +53,12 @@ The complete list of supported hardware is available in the [Release Notes](http
   ```sh
   sudo apt update
   ```

4. Verify that the APT repository is properly set up. Run the apt-cache command to see a list of all available OpenVINO packages and components:
   ```sh
   apt-cache search openvino
   ```

### Step 2: Install OpenVINO Runtime Using the APT Package Manager
@@ -12,20 +12,20 @@ From the 2022.1 release, the OpenVINO installation package has been separated in

### Decide What to Install

**If you have already finished your model development and want to deploy your applications on various devices, install OpenVINO Runtime**, which contains a set of libraries for an easy inference integration into your applications and supports heterogeneous execution across Intel® CPU and Intel® GPU hardware.
**If you have already finished your model development and want to deploy your applications on various devices, [install OpenVINO Runtime](installing-openvino-runtime.md)**, which contains a set of libraries for an easy inference integration into your applications and supports heterogeneous execution across Intel® CPU and Intel® GPU hardware.

**If you want to download, convert, optimize and tune pre-trained deep learning models**, [install OpenVINO Development Tools](installing-model-dev-tools.md), which provides the following tools:
**If you want to download a model from [Open Model Zoo](../model_zoo.md), convert it to [OpenVINO IR](../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md), [optimize](../optimization_guide/model_optimization_guide.md) and tune pre-trained deep learning models**, [install OpenVINO Development Tools](installing-model-dev-tools.md), which provides the following tools:

* Model Optimizer
* Post-Training Optimization Tool
* Benchmark Tool
* Accuracy Checker and Annotation Converter
* Post-Training Optimization Tool
* Model Downloader and other Open Model Zoo tools

### Choose Your Installation Method

For Python developers, you can [install OpenVINO from PyPI](installing-openvino-pip.md), which contains both OpenVINO Runtime and Development Tools and requires fewer steps.

For C++ developers, you may choose one of the following installation options to install OpenVINO Runtime on your specific operating system:
@@ -36,21 +36,21 @@ The `hddldaemon` is a system service, a binary executable that is run to manage

### Conventions Used in This Document

`<IE>` refers to the following default OpenVINO™ Inference Engine directories:
`<OV>` refers to the following default OpenVINO™ Runtime directories:
- **Linux:**
  ```
  /opt/intel/openvino_2022/inference_engine
  /opt/intel/openvino_2022/runtime
  ```
- **Windows:**
  ```
  C:\Program Files (x86)\IntelSWTools\openvino\inference_engine
  C:\Program Files (x86)\IntelSWTools\openvino\runtime
  ```

If you have installed OpenVINO™ in a different directory on your system, you will need to enter your unique directory path.

### Configuration File Location

`<IE>\external\hddl\config\hddl_service.config`
`<OV>\3rdparty\hddl\config\hddl_service.config`

### Service Configuration File Settings
@@ -0,0 +1,51 @@
# General Optimizations {#openvino_docs_deployment_optimization_guide_common}

## Inputs Pre-processing with OpenVINO

In many cases, a network expects a pre-processed image, so make sure you do not perform unnecessary steps in your code:
- Model Optimizer can efficiently bake the mean and normalization (scale) values into the model (for example, into the weights of the first convolution). See the [relevant Model Optimizer command-line options](../MO_DG/prepare_model/Additional_Optimizations.md).
- Let OpenVINO accelerate other means of [Image Pre-processing and Conversion](../OV_Runtime_UG/preprocessing_overview.md).
- Note that in many cases, you can directly share the (input) data with OpenVINO; for example, consider the [remote tensors API of the GPU Plugin](../OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md).

## Prefer OpenVINO Async API <a name="ov-async-api"></a>
The API of the inference requests offers Sync and Async execution. While `ov::InferRequest::infer()` is inherently synchronous and executes immediately (effectively serializing the execution flow in the current application thread), the Async approach "splits" the `infer()` into `ov::InferRequest::start_async()` and `ov::InferRequest::wait()`. See the [API examples](../OV_Runtime_UG/ov_infer_request.md).

A typical use case for `ov::InferRequest::infer()` is running a dedicated application thread per source of inputs (e.g. a camera), so that every step (frame capture, processing, results parsing and associated logic) is kept serial within the thread.
In contrast, `ov::InferRequest::start_async()` and `ov::InferRequest::wait()` allow the application to continue its activities and poll or wait for the inference completion only when really needed. So one reason for using asynchronous code is _efficiency_.

**NOTE**: Although the Synchronous API can be somewhat easier to start with, always prefer the Asynchronous (callbacks-based, see below) API in production code, as it is the most general and scalable way to implement flow control for any possible number of requests (and hence for both latency and throughput scenarios).

Let's see how the OpenVINO Async API can improve the overall throughput of an application. The key advantage of the Async approach is as follows: while a device is busy with the inference, the application can do other things in parallel (e.g. populating inputs or scheduling other requests) rather than wait for the current inference to complete.

In the example below, inference is applied to the results of the video decoding. So it is possible to keep two parallel infer requests, and while the current one is processed, the input frame for the next one is being captured. This essentially hides the latency of capturing, so that the overall frame rate is determined only by the slowest part of the pipeline (decoding vs inference) and not by the sum of the stages.

You can compare the pseudo-code for the regular and async-based approaches:

- In the regular way, the frame is captured with OpenCV and then immediately processed:<br>

@snippet snippets/dldt_optimization_guide8.cpp part8

![Intel&reg; VTune&trade; screenshot](../img/vtune_regular.png)

- In the "true" async mode, the `NEXT` request is populated in the main (application) thread, while the `CURRENT` request is processed:<br>

@snippet snippets/dldt_optimization_guide9.cpp part9

![Intel&reg; VTune&trade; screenshot](../img/vtune_async.png)
The technique can be generalized to any available parallel slack. For example, you can do inference and simultaneously encode the resulting or previous frames, or run further inference, like emotion detection on top of the face detection results.
Refer to the [Object Detection C++ Demo](@ref omz_demos_object_detection_demo_cpp), [Object Detection Python Demo](@ref omz_demos_object_detection_demo_python) (latency-oriented Async API showcase) and [Benchmark App Sample](../../samples/cpp/benchmark_app/README.md) for complete examples of the Async API in action.

### Notes on Callbacks
Notice that the Async's `ov::InferRequest::wait()` waits for the specific request only. However, running multiple inference requests in parallel provides no guarantees on the completion order. This may complicate any logic based on `ov::InferRequest::wait`. The most scalable approach is using callbacks (set via `ov::InferRequest::set_callback`) that are executed upon completion of the request. The callback functions are used by the OpenVINO runtime to notify on the results (or errors); this is a more event-driven approach.

A few important points on callbacks:
- It is the application's responsibility to ensure that any callback function is thread-safe.
- Although executed asynchronously by dedicated threads, the callbacks should NOT include heavy operations (e.g. I/O) and/or blocking calls. Keep the work done by any callback to a minimum.
## "get_tensor" Idiom <a name="new-request-based-api"></a>

`get_tensor` is the recommended way to populate the inference inputs (and read back the outputs), as it internally allocates the data with the right padding/alignment for the device. For example, the GPU inputs/outputs tensors are mapped to the host (which is fast) only when `get_tensor` is used, while for `set_tensor` a copy into the internal GPU structures may happen.
See the [API examples](../OV_Runtime_UG/ov_infer_request.md).
In contrast, `set_tensor` is the preferable way to handle remote tensors, [for example with the GPU device](../OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md).
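The copy-avoidance rationale can be sketched in plain C++. This is a conceptual illustration, not the OpenVINO API: `RuntimeTensor` is a hypothetical stand-in for a runtime-owned buffer allocated with device-friendly padding/alignment.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a tensor whose storage the runtime allocated with
// the padding/alignment the device expects.
struct RuntimeTensor {
    explicit RuntimeTensor(std::size_t n) : storage(n) {}
    float* data() { return storage.data(); }  // analog of get_tensor().data<float>()
    std::vector<float> storage;
};

// "get_tensor" style: produce the input directly into the runtime-owned buffer,
// so no extra copy is needed.
void fill_in_place(RuntimeTensor& t) {
    float* p = t.data();
    for (std::size_t i = 0; i < t.storage.size(); ++i)
        p[i] = static_cast<float>(i);
}

// "set_tensor" style: the input lives in a user-owned buffer first, and an
// extra copy into the runtime-owned storage may happen afterwards.
void fill_via_copy(RuntimeTensor& t, const std::vector<float>& user_buffer) {
    std::memcpy(t.data(), user_buffer.data(), user_buffer.size() * sizeof(float));
}
```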
@@ -1,303 +1,44 @@
|
||||
# Deployment Optimization Guide {#openvino_docs_deployment_optimization_guide_dldt_optimization_guide}
|
||||
# Runtime Inference Optimizations {#openvino_docs_deployment_optimization_guide_dldt_optimization_guide}
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:hidden:
|
||||
|
||||
openvino_docs_deployment_optimization_guide_dldt_optimization_guide_additional
|
||||
|
||||
openvino_docs_deployment_optimization_guide_common
|
||||
openvino_docs_deployment_optimization_guide_latency
|
||||
openvino_docs_deployment_optimization_guide_tput
|
||||
openvino_docs_deployment_optimization_guide_hints
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
To optimize your performance results during runtime step it is possible to experiment with:
|
||||
## Deployment Optimizations Overview {#openvino_docs_deployment_optimization_guide_overview}
|
||||
Runtime or deployment optimizations focus is tuning of the inference parameters (e.g. optimal number of the requests executed simultaneously) and other means of how a model is _executed_.
|
||||
|
||||
* Preprocess
|
||||
Here, possible optimization should start with defining the use-case. For example, whether the target scenario emphasizes throughput over latency like processing millions of samples by overnight jobs in the data centers.
|
||||
In contrast, real-time usages would likely trade off the throughput to deliver the results at minimal latency.
|
||||
Often this is a combined scenario that targets highest possible throughput while maintaining a specific latency threshold.
|
||||
|
||||
* Throughput mode
|
||||
Each of the [OpenVINO supported devices](../OV_Runtime_UG/supported_plugins/Device_Plugins.md) offers low-level performance configuration. This allows to leverage the optimal model performance on the _specific_ device, but may require careful re-tuning when the model or device has changed.
|
||||
**If the performance portability is of concern, consider using the [OpenVINO High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) first.**
|
||||
|
||||
* Async API
|
||||
Finally, how the full-stack application uses the inference component _end-to-end_ is important.
|
||||
For example, what are the stages that needs to be orchestrated? In some cases a significant part of the workload time is spent on bringing and preparing the input data. As detailed in the section on the [general optimizations](./dldt_deployment_optimization_common.md), the inputs population can be performed asynchronously to the inference. Also, in many cases the (image) [pre-processing can be offloaded to the OpenVINO](../OV_Runtime_UG/preprocessing_overview.md). For variably-sized inputs, consider [dynamic shapes](../OV_Runtime_UG/ov_dynamic_shapes.md) to efficiently connect the data input pipeline and the model inference.
|
||||
These are common performance tricks that help both latency and throughput scenarios.
|
||||
|
||||
* Lowering inference precision
|
||||
Similarly, the _model-level_ optimizations like [quantization that unlocks the int8 inference](../OV_Runtime_UG/Int8Inference.md) are general and help any scenario. As referenced in the [performance introduction topic](./dldt_optimization_guide.md), these are covered in the [dedicated document](./model_optimization_guide.md). Additionally, the `ov::hint::inference_precision` allows the devices to trade the accuracy for the performance at the _runtime_ (e.g. by allowing the fp16/bf16 execution for the layers that remain in fp32 after quantization of the original fp32 model).
|
||||
|
||||
Further documents cover the _runtime_ performance optimizations topics. Please also consider [matrix support of the features by the individual devices](../OV_Runtime_UG/supported_plugins/Device_Plugins.md).
|
||||
|
||||
* Device optimization
|
||||
[General, application-level optimizations](./dldt_deployment_optimization_common.md):
|
||||
|
||||
* Inputs Pre-processing with the OpenVINO
|
||||
|
||||
* Combination of devices
|
||||
* Async API and 'get_tensor' Idiom
|
||||
|
||||
## Preprocess
|
||||
|
||||
### Letting the Inference Engine Accelerate Image Pre-processing/Conversion <a name="image-preprocessing"></a>
|
||||
|
||||
In many cases, a network expects a pre-processed image, so make sure you do not perform unnecessary steps in your code:
|
||||
- Model Optimizer can efficiently bake the mean and normalization (scale) values into the model (for example, weights of the first convolution). See <a href="#mo-knobs-related-to-performance">Model Optimizer Knobs Related to Performance</a>.
|
||||
- If regular 8-bit per channel images are your native media (for instance, decoded frames), do not convert to the `FP32` on your side, as this is something that plugins can accelerate. Use the `InferenceEngine::Precision::U8` as your input format:<br>
|
||||
|
||||
@snippet snippets/dldt_optimization_guide1.cpp part1
|
||||
|
||||
Note that in many cases, you can directly share the (input) data with the Inference Engine.
|
||||
|
||||
## Throughput Mode
|
||||
|
||||
One way to increase computational efficiency is batching, which combines many (potentially tens) of input images to achieve optimal throughput. Internally, the execution resources are split/pinned into execution *streams*. Using this feature gains much better performance for the networks that originally are not scaled well with a number of threads (for example, lightweight topologies). This is especially pronounced for the many-core server machines.
|
||||
|
||||

|
||||
|
||||
Run the Benchmark App and play with number of infer requests running in parallel, next section. Try different values of the -nstreams argument from 1 to a number of CPU cores and find one that provides the best performance.
|
||||
|
||||
The throughput mode relaxes the requirement to saturate the CPU by using a large batch: running multiple independent inference requests in parallel often gives much better performance, than using a batch only. This allows you to simplify the app-logic, as you don't need to combine multiple inputs into a batch to achieve good CPU performance. Instead, it is possible to keep a separate infer request per camera or another source of input and process the requests in parallel using Async API.
## Inference Engine Async API

The Inference Engine Async API can improve the overall frame rate of the application. While the accelerator is busy with inference, the application can continue doing things on the host rather than waiting for the inference to complete.

In the example below, inference is applied to the results of video decoding, so it is possible to keep two parallel infer requests: while the current one is processed, the input frame for the next one is being captured. This essentially hides the latency of capturing, so that the overall frame rate is determined only by the slowest part of the pipeline (decoding or inference) and not by the sum of the stages.

You can compare the pseudo-code for the regular and async-based approaches:

- In the regular way, the frame is captured with OpenCV and then immediately processed:<br>

@snippet snippets/dldt_optimization_guide8.cpp part8

- In the "true" async mode, the `NEXT` request is populated in the main (application) thread, while the `CURRENT` request is processed:<br>

@snippet snippets/dldt_optimization_guide9.cpp part9

The technique can be generalized to any available parallel slack. For example, you can do inference and simultaneously encode the resulting or previous frames, or run further inference, like emotion detection on top of the face detection results.
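The double-buffered `NEXT`/`CURRENT` flow can be sketched with standard C++ futures. This is a conceptual sketch only; `capture_frame` and `infer` are hypothetical stand-ins for OpenCV capture and the `StartAsync`/`Wait` calls.

```cpp
#include <future>
#include <vector>

// Hypothetical stand-ins for the pipeline stages.
static int capture_frame(int index) { return index; }
static int infer(int frame) { return frame + 100; }

// "True" async mode: while the CURRENT frame is being inferred,
// the NEXT frame is captured in the application thread.
std::vector<int> run_pipeline(int num_frames) {
    std::vector<int> results;
    if (num_frames == 0) return results;
    int current = capture_frame(0);
    for (int i = 1; i <= num_frames; ++i) {
        // start inference of the CURRENT frame (analogous to StartAsync)
        auto pending = std::async(std::launch::async, infer, current);
        // overlap: capture the NEXT frame while inference is in flight
        if (i < num_frames) current = capture_frame(i);
        results.push_back(pending.get());  // analogous to Wait()
    }
    return results;
}
```

Because capture and inference overlap, the loop time is bounded by the slower of the two stages, not their sum.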
There are important performance caveats though: the tasks that run in parallel should try to avoid oversubscribing the shared compute resources. For example, if the inference is performed on the HDDL and the CPU is essentially idle, it makes sense to do things on the CPU in parallel. However, multiple infer requests can oversubscribe that. Notice that heterogeneous execution can implicitly use the CPU; refer to <a href="#heterogeneity">Heterogeneity</a>.

Also, if the inference is performed on the graphics processing unit (GPU), there is little gain in, for example, encoding the resulting video on the same GPU in parallel, because the device is already busy.

Refer to the [Object Detection C++ Demo](@ref omz_demos_object_detection_demo_cpp), [Object Detection Python Demo](@ref omz_demos_object_detection_demo_python) (latency-oriented Async API showcase) and [Benchmark App Sample](../../samples/cpp/benchmark_app/README.md) (which has both latency and throughput-oriented modes) for complete examples of the Async API in action.

### Request-Based API and “GetBlob” Idiom <a name="new-request-based-api"></a>

The Infer Request based API offers two types of request: Sync and Async. The Sync is considered below. The Async splits the (synchronous) `Infer` into `StartAsync` and `Wait` (see <a href="#ie-async-api">Inference Engine Async API</a>).

More importantly, an infer request encapsulates the reference to the “executable” network and the actual inputs/outputs. When you load the network to the plugin, you get a reference to the executable network (you may consider it a queue). Actual infer requests are created by the executable network:

@snippet snippets/dldt_optimization_guide6.cpp part6

`GetBlob` is the recommended way to communicate with the network, as it internally allocates the data with the right padding/alignment for the device. For example, the GPU inputs/outputs blobs are mapped to the host (which is fast) when `GetBlob` is used. But if you call `SetBlob`, a copy (from/to the blob you have set) into the internal GPU plugin structures will happen.
### Performance Aspects of Running Multiple Requests Simultaneously <a name="running-multiple-requests-simultaneously"></a>

If your application simultaneously executes multiple infer requests:

- For the CPU, the best solution is to use the <a href="#cpu-streams">CPU "throughput" mode</a>.
- If latency is of more concern, you can try the `EXCLUSIVE_ASYNC_REQUESTS` [configuration option](../OV_Runtime_UG/supported_plugins/CPU.md) that limits the number of the simultaneously executed requests for all (executable) networks that share the specific device to just one:

@snippet snippets/dldt_optimization_guide7.cpp part7

For more information on the executable networks notation, see <a href="#new-request-based-api">Request-Based API and “GetBlob” Idiom</a>.

- The heterogeneous device uses the `EXCLUSIVE_ASYNC_REQUESTS` by default.
- The `KEY_EXCLUSIVE_ASYNC_REQUESTS` option affects only the device queues of the individual application.
- For the GPU, the actual work is serialized by a plugin and/or a driver anyway.
- Finally, for <a href="#myriad">any VPU flavor</a>, using multiple requests is a must for achieving good throughput.

In the Inference Engine, there is no notion of request priorities. It is left to the application side (for example, not queuing a low-priority infer request while a higher-priority one is waiting). Notice that this requires additional logic to synchronize between executable networks (queues) in your application code.
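The application-side prioritization mentioned above can be sketched as follows. This is a minimal, hypothetical illustration (not an Inference Engine API): low-priority jobs wait in a queue and are submitted only when no high-priority job is pending.

```cpp
#include <deque>
#include <vector>

// A minimal sketch of application-side request prioritization: the engine
// itself has no priorities, so the app decides the submission order.
struct PriorityGate {
    std::deque<int> high, low;
    std::vector<int> submitted;  // order in which jobs reach the device queue

    void enqueue(int job, bool high_priority) {
        (high_priority ? high : low).push_back(job);
    }
    // Drain the queues, always preferring high-priority jobs.
    void drain() {
        while (!high.empty() || !low.empty()) {
            std::deque<int>& q = high.empty() ? low : high;
            submitted.push_back(q.front());
            q.pop_front();
        }
    }
};
```

In a real application, "submitting" a job would mean calling `StartAsync` on the corresponding infer request; the synchronization between queues is the extra logic the text refers to.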
## Automatic Lowering of the Inference Precision

Inference precision directly affects the performance.

The Model Optimizer can produce an IR with different precisions. For example, an FP16 IR initially targets VPU and GPU devices, while, for the CPU, an FP16 IR is typically up-scaled to the regular FP32 automatically upon loading. But notice that further device-specific inference precision settings are available, for example, [8-bit integer](../OV_Runtime_UG/Int8Inference.md) or [bfloat16](../OV_Runtime_UG/supported_plugins/CPU.md) inference, which is specific to the CPU (see below).
Note that for the [Multi-Device execution](../OV_Runtime_UG/multi_device.md) that supports automatic inference on multiple devices in parallel, you can use an FP16 IR (no need for FP32).
You can find more information, including the preferred data types for specific devices, in the [Supported Devices](../OV_Runtime_UG/supported_plugins/Supported_Devices.md) document.

By default, plugins enable optimizations that allow lower precision if the acceptable range of accuracy is preserved.
For example, for a CPU that supports the AVX512_BF16 instructions, an FP16/FP32 model is converted to a [bfloat16](../OV_Runtime_UG/supported_plugins/CPU.md) IR to accelerate inference.

To compare the associated speedup, run the example command below to disable this feature on a CPU device with AVX512_BF16 support and get regular FP32 execution:

```sh
$ benchmark_app -m <model.xml> -enforcebf16=false
```

Notice that for quantized (e.g. INT8) models, the bfloat16 calculations (of the layers that remain in FP32) are disabled by default.
Refer to the [CPU Plugin documentation](../OV_Runtime_UG/supported_plugins/CPU.md) for more details.

Similarly, the GPU device automatically executes FP16 for the layers that remain in FP16 in the quantized models (assuming that the FP16 model was quantized).
Refer to the `ENABLE_FP16_FOR_QUANTIZED_MODELS` key in the [GPU Plugin documentation](../OV_Runtime_UG/supported_plugins/GPU.md).
## Device Optimizations

The Inference Engine supports several target devices (CPU, GPU, Intel® Movidius™ Myriad™ 2 VPU, Intel® Movidius™ Myriad™ X VPU, and Intel® Vision Accelerator Design with Intel® Movidius™ Vision Processing Units (VPU)), and each of them has a corresponding plugin. If you want to optimize for a specific device, keep in mind the following tips to increase performance.

### CPU Checklist <a name="cpu-checklist"></a>

The CPU plugin relies entirely on the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) for the acceleration of major primitives, for example, Convolutions or FullyConnected.

The only hint you can get from that is how the major primitives are accelerated (and you cannot change this). For example, on Core machines, you should see variations of `jit_avx2` when inspecting the <a href="#performance-counters">internal inference performance counters</a> (with an additional `_int8` postfix for [int8 inference](../OV_Runtime_UG/Int8Inference.md)). If you are an advanced user, you can further trace the CPU execution with Intel® VTune™ (see <a href="#vtune-examples">Intel® VTune™ Examples</a>).

Internally, the Inference Engine has a threading abstraction level, which allows for compiling the [open source version](https://github.com/opencv/dldt) with either Intel® Threading Building Blocks (Intel® TBB), which is now the default, or OpenMP* as an alternative parallelism solution. When using inference on the CPU, it is particularly important to align the threading model with the rest of your application (and any third-party libraries that you use) to avoid oversubscription. For more information, see the <a href="#note-on-app-level-threading">Note on the App-Level Threading</a> section.

Since R1 2019, the OpenVINO™ toolkit comes pre-compiled with Intel TBB, so any OpenMP* API or environment settings (like `OMP_NUM_THREADS`) have no effect.
Certain tweaks (like the number of threads used for inference on the CPU) are still possible via [CPU configuration options](../OV_Runtime_UG/supported_plugins/CPU.md).
Finally, the OpenVINO CPU inference is NUMA-aware; refer to the <a href="#note-on-numa">Tips for inference on NUMA systems</a> section.

Other general recommendations:
- Usually, batching improves CPU performance. However, the need to gather frames in the batch might complicate the application logic. Instead, you can keep a separate infer request per camera or other source of input and process the requests in parallel. For more information, see the next section.
- If your application simultaneously performs inference of multiple models on the same CPU, make sure you do not oversubscribe the machine. See <a href="#running-multiple-requests-simultaneously">Performance Aspects of Running Multiple Requests Simultaneously</a> for more information.
- Notice that the heterogeneous execution might implicitly load the CPU. For details, refer to the <a href="#heterogeneity">Heterogeneity</a> section.
- Consider [8-bit integer inference on the CPU](../OV_Runtime_UG/Int8Inference.md).
#### Throughput Mode for CPU <a name="cpu-streams"></a>
Unlike most accelerators, the CPU is perceived as an inherently latency-oriented device.
In fact, OpenVINO does support the "throughput" mode for the CPU, which allows the Inference Engine to efficiently run multiple inference requests on the CPU simultaneously, greatly improving the overall throughput.

Internally, the execution resources are split/pinned into execution "streams".
This feature usually provides much better performance for the networks than batching. This is especially true for the many-core server machines.

Compared with the batching, the parallelism is somewhat transposed (i.e. performed over inputs, and much less within CNN ops).

Try the [Benchmark App](../../samples/cpp/benchmark_app/README.md) sample and play with the number of streams running in parallel. The rule of thumb is to try values up to the number of CPU cores on your machine.
For example, on an 8-core CPU, compare `-nstreams 1` (which is a legacy, latency-oriented scenario) to 2, 4, and 8 streams.

In addition, you can play with the batch size to find the throughput sweet spot.

If your application is hard or impossible to change in accordance with the multiple-requests logic, consider the "multiple-instance" trick to improve the throughput:
- For multi-socket execution, it is recommended to set [`KEY_CPU_THREADS_NUM`](../OV_Runtime_UG/supported_plugins/CPU.md) to the number of cores per socket, and run as many instances of the application as you have sockets.
- Similarly, for extremely lightweight networks (running faster than 1ms) and/or many-core machines (16+ cores), try limiting the number of CPU inference threads to just the number of physical cores (or fewer), while trying to saturate the machine with running multiple instances of the application.
### GPU Checklist <a name="gpu-checklist"></a>

The Inference Engine relies on the [Compute Library for Deep Neural Networks (clDNN)](https://01.org/cldnn) for Convolutional Neural Networks acceleration on Intel® GPUs. Internally, clDNN uses OpenCL™ to implement the kernels. Thus, many general OpenCL tips apply:

- Prefer `FP16` over `FP32`, as the Model Optimizer can generate both variants, and the `FP32` is the default.
- Try to group individual infer jobs by using batches.
- Notice that using the GPU introduces a one-time overhead (on the order of a few seconds) of compiling the OpenCL kernels. The compilation happens upon loading the network to the GPU plugin and does not affect the inference time.
- If your application simultaneously uses inference on the CPU or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. You can use [CPU configuration options](../OV_Runtime_UG/supported_plugins/CPU.md) to limit the number of inference threads for the CPU plugin.
- In the GPU-only scenario, a GPU driver might occupy a CPU core with spin-looped polling for completion. If the _CPU_ utilization is a concern, consider the `KEY_CLDNN_PLUGIN_THROTTLE` configuration option. Notice that while disabling the polling, this option might reduce the GPU performance, so usually it is used together with multiple [GPU streams](../OV_Runtime_UG/supported_plugins/GPU.md).

> **NOTE**: See the [Benchmark App Sample](../../samples/cpp/benchmark_app/README.md) code for a usage example.

### Intel® Movidius™ Myriad™ X Visual Processing Unit and Intel® Vision Accelerator Design with Intel® Movidius™ VPUs <a name="myriad"></a>

Since the Intel® Movidius™ Myriad™ X Visual Processing Unit (like the Intel® Movidius™ Myriad™ 2 VPU) communicates with the host over USB, a minimum of four infer requests in flight is recommended to hide the data transfer costs. See <a href="#new-request-based-api">Request-Based API and “GetBlob” Idiom</a> and the [Benchmark App Sample](../../samples/cpp/benchmark_app/README.md) for more information.

Intel® Vision Accelerator Design with Intel® Movidius™ VPUs requires keeping at least 32 inference requests in flight to fully saturate the device.
## Heterogeneity <a name="heterogeneity"></a>

Heterogeneous execution (constituted by the dedicated Inference Engine [“Hetero” device](../OV_Runtime_UG/hetero_execution.md)) enables scheduling a network inference across multiple devices.

### Typical Heterogeneous Scenarios of Concern <a name="heterogeneous-scenarios-of-concern"></a>

The primary reasons for executing a network in heterogeneous mode are as follows:

- Calculate the heaviest pieces of the network with an accelerator while falling back to the CPU for the layers that are not supported by the accelerator.<br>
This is particularly useful when certain custom (user) kernels are implemented only for the CPU (and are much harder or even impossible to implement for the accelerator).

- Use all available compute devices more efficiently, for example, by running branches of the network on different devices.

### Heterogeneous Flow <a name="heterogeneous-flow"></a>

The execution through the heterogeneous plugin has three distinct steps:

1. **Applying affinity setting for the layers**, that is, binding them to the devices.

    - This can be done automatically using *fallback priorities*, or on a *per-layer* basis.

    - The affinity setting is made before loading the network to the (heterogeneous) plugin, so this is always a **static** setup with respect to execution.

2. **Loading a network to the heterogeneous plugin**, which internally splits the network into subgraphs.<br>
You can check the decisions the plugin makes; see <a href="#analyzing-heterogeneous-execution">Analyzing the Heterogeneous Execution</a>.

3. **Executing the infer requests**. From the user’s side, this looks identical to the single-device case, while internally, the subgraphs are executed by the actual plugins/devices.

Performance benefits of the heterogeneous execution depend heavily on the communication granularity between devices. If transmitting/converting data from one device to another takes more time than the execution itself, the heterogeneous approach makes little or no sense. Using Intel® VTune™ helps to visualize the execution flow on a timeline (see <a href="#vtune-examples">Intel® VTune™ Examples</a>).

Similarly, if there are too many subgraphs, the synchronization and data transfers might consume the entire performance gain. In some cases, you can define the (coarser) affinity manually to avoid sending data back and forth many times during one inference.
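The granularity argument can be made concrete with a back-of-the-envelope model. This is purely illustrative (not an OpenVINO API; all fields and numbers are assumptions): offloading a subgraph pays off only while the per-boundary transfer/conversion cost stays well below the compute time saved.

```cpp
// Illustrative cost model for deciding whether offloading a subgraph to an
// accelerator is worthwhile, given device-to-device transfer costs.
struct OffloadEstimate {
    double cpu_time_ms;       // time of the subgraph on the CPU
    double accel_time_ms;     // time of the subgraph on the accelerator
    double transfer_time_ms;  // data transfer/conversion cost per boundary
    int boundaries;           // device-to-device crossings per inference
};

bool offload_is_worthwhile(const OffloadEstimate& e) {
    double accel_total = e.accel_time_ms + e.boundaries * e.transfer_time_ms;
    return accel_total < e.cpu_time_ms;
}
```

With many small subgraphs, `boundaries` grows and the transfer term dominates, which is exactly why a coarser manual affinity can help.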
The general affinity “rule of thumb” is to keep computationally-intensive kernels on the accelerator, and "glue" or helper kernels on the CPU. Notice that this includes granularity considerations. For example, running some custom activation (that comes after every accelerator-equipped convolution) on the CPU might result in performance degradation due to too many data type and/or layout conversions, even though the activation itself can be extremely fast. In this case, it might make sense to consider implementing the kernel for the accelerator (see <a href="#optimizing-custom-kernels">Optimizing Custom Kernels</a>). The conversions typically manifest themselves as outstanding (compared to CPU-only execution) 'Reorder' entries (see <a href="#performance-counters">Internal Inference Performance Counters</a>).

For general details on the heterogeneous mode, refer to the [Heterogeneous execution guide](../OV_Runtime_UG/hetero_execution.md).
### Trying the Heterogeneous Plugin with Inference Engine Samples <a name="heterogeneous-plugin-with-samples"></a>

Every Inference Engine sample supports the `-d` (device) option.

For example, here is a command to run the [Classification Sample Async](../../samples/cpp/classification_sample_async/README.md):

```sh
./classification_sample_async -m <path_to_model>/Model.xml -i <path_to_pictures>/picture.jpg -d HETERO:GPU,CPU
```

where:

- `HETERO` stands for the Heterogeneous plugin.
- `GPU,CPU` points to the fallback policy with first priority on the GPU and further fallback to the CPU.

You can specify more than two devices: `-d HETERO:HDDL,GPU,CPU`.

### General Tips on GPU/CPU Execution <a name="tips-on-gpu-cpu-execution"></a>

The following tips provide general guidance on optimizing execution on GPU/CPU devices:

- Generally, GPU performance is better on heavy kernels (like Convolutions) and with large inputs. So if the network inference time is already small (~1ms of execution time), using the GPU is unlikely to give a boost.

- A typical strategy is to test the CPU-only and GPU-only scenarios first (with samples, this is plain `-d CPU` or `-d GPU`). If there are specific kernels that are not supported by the GPU, the best option to try is `HETERO:GPU,CPU`, which automatically applies the default splitting (based on the plugins' layers support). Then, you can play with the manual affinity settings (for example, to further minimize the number of subgraphs).

- The general affinity “rule of thumb” is to keep computationally-intensive kernels on the accelerator, and "glue" (or helper) kernels on the CPU. Notice that this includes granularity considerations. For example, running some (custom) activation on the CPU would result in too many conversions.

- It is advised to do <a href="#analyzing-heterogeneous-execution">performance analysis</a> to determine “hotspot” kernels, which should be the first candidates for offloading. At the same time, it is often more efficient to offload some reasonably sized sequence of kernels, rather than individual kernels, to minimize scheduling and other run-time overheads.

- Notice that the GPU can be busy with other tasks (like rendering). Similarly, the CPU can be in charge of the general OS routines and other application threads (see <a href="#note-on-app-level-threading">Note on the App-Level Threading</a>). Also, a high interrupt rate due to many subgraphs can raise the frequency of one device and drag the frequency of another down.

- Device performance can be affected by dynamic frequency scaling. For example, running long kernels on both devices simultaneously might eventually result in one or both devices stopping use of the Intel® Turbo Boost Technology. This might result in an overall performance decrease, even compared to a single-device scenario.

- Mixing the `FP16` (GPU) and `FP32` (CPU) execution results in conversions and, thus, performance issues. If you are seeing a lot of heavy outstanding (compared to the CPU-only execution) Reorders, consider implementing actual GPU kernels. Refer to <a href="#performance-counters">Internal Inference Performance Counters</a> for more information.
### Analyzing Heterogeneous Execution <a name="analyzing-heterogeneous-execution"></a>

There is a dedicated configuration option that enables dumping a visualization of the subgraphs created by the heterogeneous mode; see the code example in the [Heterogeneous execution guide](../OV_Runtime_UG/hetero_execution.md).

After enabling the configuration key, the heterogeneous plugin generates two files:

- `hetero_affinity.dot` - per-layer affinities. This file is generated only if the default fallback policy was executed (otherwise you have set the affinities yourself, so you know them).
- `hetero_subgraphs.dot` - affinities per subgraph. This file is written to disk during execution of `Core::LoadNetwork` for the heterogeneous flow.

You can use the GraphViz\* utility or `.dot` converters (for example, to `.png` or `.pdf`), like xdot\*, available on Linux\* OS with `sudo apt-get install xdot`.

You can also use performance data (in the [Benchmark App](../../samples/cpp/benchmark_app/README.md), it is the `-pc` option) to get performance data on each subgraph. Again, refer to the [Heterogeneous execution guide](../OV_Runtime_UG/hetero_execution.md) and to <a href="#performance-counters">Internal Inference Performance Counters</a> for general counters information.
## Multi-Device Execution <a name="multi-device-optimizations"></a>
The OpenVINO™ toolkit supports automatic multi-device execution; see the [Multi-Device execution](../OV_Runtime_UG/multi_device.md) description.
The next chapter covers device-specific tips, while this section covers a few recommendations for the multi-device execution:

- MULTI usually performs best when the fastest device is specified first in the list of devices. This is particularly important when the parallelism is not sufficient (e.g. the number of requests in flight is not enough to saturate all devices).
- It is highly recommended to query the optimal number of inference requests directly from the instance of the ExecutableNetwork (resulting from the LoadNetwork call with the specific multi-device configuration as a parameter). Refer to the code of the [Benchmark App](../../samples/cpp/benchmark_app/README.md) sample for details.
- Notice that, for example, CPU+GPU execution performs better with certain knobs, which you can find in the code of the same [Benchmark App](../../samples/cpp/benchmark_app/README.md) sample. One specific example is disabling the GPU driver polling, which in turn requires multiple GPU streams (already the default for the GPU) to amortize the slower signaling of inference completion from the device to the host.
- The multi-device logic always attempts to save on the data copies (e.g. of the inputs) between the device-agnostic, user-facing inference requests and the device-specific 'worker' requests that are actually scheduled behind the scenes. To facilitate the copy savings, it is recommended to start the requests in the order in which they were created (with ExecutableNetwork's CreateInferRequest).

Refer to the [Deployment Optimization Guide Additional Configurations](dldt_deployment_optimization_guide_additional.md) to read more about performance during the deployment step and to learn about threading, working with multi-socket CPUs, and basic interoperability with other APIs.

Use-case specific optimizations along with some implementation details:

* Optimizing for [throughput](./dldt_deployment_optimization_tput.md) and [latency](./dldt_deployment_optimization_latency.md)

* [OpenVINO's high-level performance hints](./dldt_deployment_optimization_hints.md) as the portable, future-proof approach for performance configuration
# Deployment Optimization Guide Additional Configurations {#openvino_docs_deployment_optimization_guide_dldt_optimization_guide_additional}

To optimize your performance results during the runtime step, you can experiment with:

* multi-socket CPUs

* threading

* basic interoperability with other APIs


## Best Latency on the Multi-Socket CPUs
Note that when latency is of concern, there are additional tips for multi-socket systems.
When the input is limited to a single image, the only way to achieve the best latency is to limit execution to a single socket.
The reason is that a single image is simply not enough to saturate more than one socket, and NUMA overheads might dominate the execution time.
Below is an example command line that limits execution to a single socket using numactl for the best *latency* value (assuming a machine with 28 physical cores per socket):
```
$ numactl -m 0 --physcpubind 0-27 benchmark_app -m <model.xml> -api sync -nthreads 28
```
Note that if you have more than one input, running as many inference requests as you have NUMA nodes (or sockets) usually gives the same best latency as a single request on a single socket, but much higher throughput. Assuming a machine with two NUMA nodes:
```
$ benchmark_app -m <model.xml> -nstreams 2
```
The number of NUMA nodes on the machine can be queried via `lscpu`.
Please see more on NUMA support in the [Optimization Guide](../OV_Runtime_UG/multi_device.md).

## Threading

- As explained in the <a href="#cpu-checklist">CPU Checklist</a> section, by default the Inference Engine uses Intel TBB as a parallel engine. Thus, any OpenVINO-internal threading (including CPU inference) uses the same thread pool, provided by the TBB. But there are also other threads in your application, so oversubscription is possible at the application level:
- The rule of thumb is that you should try to have the overall number of active threads in your application equal to the number of cores in your machine. Keep in mind the spare core(s) that the OpenCL driver under the GPU plugin might also need.
- One specific workaround to limit the number of threads for the Inference Engine is using the [CPU configuration options](../OV_Runtime_UG/supported_plugins/CPU.md).
- To avoid further oversubscription, use the same threading model in all modules/libraries that your application uses. Notice that third-party components might bring their own threading. For example, using the Inference Engine, which is now compiled with TBB by default, might lead to [performance troubles](https://www.threadingbuildingblocks.org/docs/help/reference/appendices/known_issues/interoperability.html) when mixed in the same app with another computationally-intensive library compiled with OpenMP. You can try to compile the [open source version](https://github.com/opencv/dldt) of the Inference Engine to use OpenMP as well. But notice that in general, TBB offers much better composability than other threading solutions.
- If your code (or third-party libraries) uses GNU OpenMP, the Intel® OpenMP (if you have recompiled the Inference Engine with it) must be initialized first. This can be achieved by linking your application with the Intel OpenMP instead of GNU OpenMP, or by using `LD_PRELOAD` on Linux* OS.
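The "active threads ≈ cores" rule of thumb above can be sketched as a small budget helper. This is an illustrative sketch only (the function and its reservation policy are assumptions, not an OpenVINO API): it subtracts the application's own threads and any reserved spare cores (e.g. for the OpenCL driver) from the core count.

```cpp
#include <thread>

// Rule-of-thumb thread budget: keep the total number of active threads close
// to the number of cores, reserving spare core(s) for e.g. the OpenCL driver.
unsigned inference_thread_budget(unsigned app_threads, unsigned reserved_cores) {
    unsigned cores = std::thread::hardware_concurrency();  // may report 0
    if (cores == 0) cores = 1;
    unsigned taken = app_threads + reserved_cores;
    return cores > taken ? cores - taken : 1;  // always leave at least one
}
```

The resulting number would then be passed to the CPU plugin via the thread-count configuration option mentioned above.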
## Basic Interoperability with Other APIs <a name="basic-interoperability-with-other-apis"></a>

The general approach for sharing data between the Inference Engine and media/graphics APIs like Intel® Media Server Studio (Intel® MSS) is based on sharing the *system* memory. That is, in your code, you should map or copy the data from the API to the CPU address space first.

For Intel MSS, it is recommended to perform a viable pre-processing, for example, crop/resize, and then convert to RGB again with the [Video Processing Procedures (VPP)](https://software.intel.com/en-us/node/696108). Then lock the result and create an Inference Engine blob on top of that. The resulting pointer can be used for `SetBlob`:

@snippet snippets/dldt_optimization_guide2.cpp part2

**WARNING**: The `InferenceEngine::NHWC` layout is not supported natively by most InferenceEngine plugins, so internal conversion might happen.

@snippet snippets/dldt_optimization_guide3.cpp part3

Alternatively, you can use RGBP (planar RGB) output from Intel MSS. This allows wrapping the (locked) result as regular NCHW, which is generally friendly to most plugins (unlike NHWC). Then you can use it with `SetBlob` just like in the previous example:

@snippet snippets/dldt_optimization_guide4.cpp part4

The only downside of this approach is that the VPP conversion to RGBP is not hardware accelerated (it is performed on the GPU EUs). Also, it is available only on Linux.

## OpenCV* Interoperability Example <a name="opencv-interoperability"></a>

Unlike APIs that use dedicated address space and/or special data layouts (for instance, compressed OpenGL* textures), regular OpenCV data objects like `cv::Mat` reside in the conventional system memory. That is, the memory can actually be shared with the Inference Engine, and only data ownership needs to be transferred.

Again, if the OpenCV and Inference Engine layouts match, the data can be wrapped as an Inference Engine (input/output) blob. Notice that by default, the Inference Engine accepts **planar** and **not interleaved** inputs in NCHW, so the NHWC (which is exactly the interleaved layout) should be specified explicitly:

**WARNING**: The `InferenceEngine::NHWC` layout is not supported natively by most InferenceEngine plugins, so internal conversion might happen.

@snippet snippets/dldt_optimization_guide5.cpp part5

Notice that the original `cv::Mat`/blobs cannot be used simultaneously by the application and the Inference Engine. Alternatively, the data that the pointer references can be copied to unlock the original data and return ownership to the original API.
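The copy alternative can be sketched with a plain deep copy. This is a hedged, library-agnostic sketch (`detach_copy` is a hypothetical helper, not an OpenCV or Inference Engine function): the bytes referenced by a raw pointer (e.g. a `cv::Mat`'s `data`) are duplicated so the original buffer can be unlocked while inference keeps its own copy.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Deep-copy the pixel data referenced by a raw pointer so the original
// buffer can be returned to the source API; the copy owns its own memory.
std::vector<unsigned char> detach_copy(const unsigned char* data,
                                       std::size_t size_bytes) {
    std::vector<unsigned char> owned(size_bytes);
    if (size_bytes != 0)
        std::memcpy(owned.data(), data, size_bytes);
    return owned;  // independent of the original buffer's lifetime
}
```

After the copy, mutating or releasing the original buffer does not affect the data the inference side holds.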

To learn more about optimizations during the developing step, visit the [Deployment Optimization Guide](dldt_deployment_optimization_guide.md) page.

# High-level Performance Hints (Presets) {#openvino_docs_deployment_optimization_guide_hints}

Traditionally, each of OpenVINO's [supported devices](../OV_Runtime_UG/supported_plugins/Supported_Devices.md) offers a number of low-level performance settings.
Tweaking this detailed configuration requires deep understanding of the device architecture.
Also, while the resulting performance may be optimal for the specific combination of the device and the model being inferred, it is not portable across devices or models, nor is it future-proof:
- Even within a family of devices (like various CPUs), differences such as the number of CPU cores can make a different execution configuration optimal.
- Similarly, the optimal batch size is very much specific to the particular instance of the GPU.
- Compute vs memory-bandwidth requirements of the model being inferred, as well as inference precision, possible model quantization and other factors add more unknowns to the resulting performance equation.
- Finally, the optimal execution parameters of one device do not transparently map to another device type, for example:
   - Both the CPU and GPU devices support the notion of 'streams' (i.e. inference instances that are executed in parallel, see `ov::num_streams`), yet the optimal number of streams is deduced very differently.

Beyond execution _parameters_ there are potentially many device-specific details like _scheduling_ that greatly affect the performance.
Specifically, GPU-oriented tricks like batching, which combines many (potentially tens of) input images to achieve optimal throughput, do not always map well to the CPU, as e.g. detailed in the next sections.
The hints hide the _execution_ specifics required to saturate the device. For example, there is no need to explicitly combine multiple inputs into a batch to achieve good GPU performance.
Instead, it is possible to keep a separate infer request per camera or other source of input and process the requests in parallel using the <a href="#ov-async-api">OpenVINO Async API</a>.

The only requirement for the application to leverage the throughput is to **run multiple inference requests in parallel**.
OpenVINO's device-specific implementation of the hints takes care of the rest. This allows a developer to greatly simplify the application logic.

In summary, when the performance _portability_ is of concern, consider the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md).
Below you can find the implementation details (particularly how OpenVINO implements the 'throughput' approach) for the specific devices.
Keep in mind that while different throughput-oriented scheduling approaches ([like batching or other means of executing individual inference requests](./dldt_deployment_optimization_tput.md)) can work together, the hints make these decisions transparent to the application.
@@ -0,0 +1,35 @@
# Optimizing for Latency {#openvino_docs_deployment_optimization_guide_latency}

@sphinxdirective

.. toctree::
   :maxdepth: 1
   :hidden:

   openvino_docs_IE_DG_Model_caching_overview

@endsphinxdirective
## Latency Specifics
A significant fraction of applications are focused on the situations where typically a single model is loaded (and a single input is used) at a time.
This is a regular "consumer" use case and the default (also for legacy reasons) performance setup for any OpenVINO device.
Notice that an application can create more than one request if needed (for example, to support asynchronous inputs population); the question is really about how many requests are executed in parallel.

Similarly, when multiple models are served on the same device, it is important whether the models are executed simultaneously or in a chain (for example, in an inference pipeline).
As expected, the lowest latency is achieved with only one concurrent inference at a moment. Accordingly, any additional concurrency usually results in the latency growing fast.

However, some specific configurations, like multi-socket CPUs, can deliver as many requests (at the same minimal latency) as there are NUMA nodes in the machine.
Thus, human expertise is required to get the most out of the device even in the latency case. Consider using [OpenVINO high-level performance hints](../OV_Runtime_UG/performance_hints.md) instead.

**NOTE**: [OpenVINO performance hints](./dldt_deployment_optimization_hints.md) are the recommended way of performance configuration, which is both device-agnostic and future-proof.

When there are multiple models to be used simultaneously, consider using different devices for inferencing the different models. Finally, when multiple models are executed in parallel on a device, the additional `ov::hint::model_priority` may help to define relative priorities of the models (refer to the documentation on the [matrix features support for OpenVINO devices](../OV_Runtime_UG/supported_plugins/Device_Plugins.md) to check whether a specific device supports the feature).

## First-Inference Latency and Model Load/Compile Time
There are cases when the model loading/compilation contributes heavily to the end-to-end latency.
For example, when the model is used exactly once, or when, due to on-device memory limitations, the model is unloaded (to free the memory for another inference) and reloaded at some cadence.

Such a "first-inference latency" scenario may pose an additional limitation on the model load/compilation time, as inference accelerators (other than the CPU) usually require a certain level of model compilation upon loading.
The [model caching](../OV_Runtime_UG/Model_caching_overview.md) is a way to amortize the loading/compilation time over multiple application runs. If model caching is not possible (for example, it requires write permissions for the application), the CPU almost exclusively offers the fastest model load time. Also, consider using the [AUTO device](../OV_Runtime_UG/auto_device_selection.md). It allows the CPU to be used transparently for inference while the actual accelerator loads the model (upon which the inference hot-swapping happens automatically).
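As a back-of-the-envelope illustration of how caching amortizes the compilation cost over repeated application runs, consider the sketch below (all timings are made-up placeholders, not measurements of any particular device):

```python
def total_time_s(runs, compile_s, cached_load_s, infer_s, caching=True):
    # The first run always pays the full compilation cost (and populates the cache).
    first_run = compile_s + infer_s
    # Subsequent runs either load from the cache or recompile from scratch.
    per_later_run = (cached_load_s if caching else compile_s) + infer_s
    return first_run + (runs - 1) * per_later_run

if __name__ == "__main__":
    # 10 application runs with a hypothetical 30 s compile, 2 s cached load, 1 s inference
    print(total_time_s(10, compile_s=30.0, cached_load_s=2.0, infer_s=1.0))
    print(total_time_s(10, compile_s=30.0, cached_load_s=2.0, infer_s=1.0, caching=False))
```

The larger the gap between compile time and cached-load time, and the more runs there are, the bigger the payoff.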

Finally, notice that any [throughput-oriented options](./dldt_deployment_optimization_tput.md) may significantly increase the model load/compilation time.
68
docs/optimization_guide/dldt_deployment_optimization_tput.md
Normal file
@@ -0,0 +1,68 @@

# Optimizing for Throughput {#openvino_docs_deployment_optimization_guide_tput}

## General Throughput Considerations
As described in the section on the [latency-specific considerations](./dldt_deployment_optimization_latency.md), one possible use case is delivering every single request at minimal delay.
Throughput, on the other hand, is about inference scenarios in which a potentially large number of inference requests are served simultaneously.
Here, the overall application throughput can be significantly improved with the right performance configuration.
Also, if the model is not already compute- or memory-bandwidth-limited, the associated increase in latency is not linearly dependent on the number of requests executed in parallel.

With OpenVINO there are two major means of running multiple requests simultaneously: batching and "streams", both explained in this document.
Yet, different GPUs behave differently with batch sizes, just like different CPUs require a different number of execution streams to maximize the throughput.
Predicting inference performance is difficult, and finding optimal execution parameters requires direct experiments and measurements.
One possible throughput optimization strategy is to set an upper bound for the latency and then increase the batch size and/or number of streams until that tail latency is met (or the throughput stops growing).
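This search can be sketched in a few lines. The linear latency model below is a purely hypothetical stand-in; on real hardware you would substitute actual measurements (e.g. from a benchmarking run):

```python
def latency_ms(batch_size, fixed_ms=5.0, per_input_ms=2.0):
    # Hypothetical cost model: fixed overhead plus a per-input cost.
    # On a real device, replace this with a measurement.
    return fixed_ms + per_input_ms * batch_size

def throughput_fps(batch_size):
    # Inputs completed per second at the given batch size.
    return batch_size / (latency_ms(batch_size) / 1000.0)

def largest_batch_within(latency_bound_ms, max_batch=256):
    # Grow the batch until the (tail) latency bound would be violated.
    best = 1
    for b in range(1, max_batch + 1):
        if latency_ms(b) > latency_bound_ms:
            break
        best = b
    return best

if __name__ == "__main__":
    b = largest_batch_within(33.0)  # e.g. a ~30 FPS latency budget
    print(b, round(throughput_fps(b), 1))
```

The same loop applies to the number of streams, or to a joint sweep over both parameters.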

Also, consider the [Deep Learning Workbench](https://docs.openvino.ai/latest/workbench_docs_Workbench_DG_Introduction.html).

Finally, the [automatic multi-device execution](../OV_Runtime_UG/multi_device.md) helps to improve the throughput; see also the section below.
While the same approach of optimizing the parameters of each device separately does work, the resulting multi-device performance is a fraction (different for different models) of the "ideal" (plain sum) performance.

Overall, the latency-throughput relationship is not linear and is very _device_ specific. It is also tightly coupled with the _model_ characteristics.
As the range of possible inference devices has become quite diverse, OpenVINO has introduced the dedicated notion of high-level performance configuration "hints" to describe the target application scenarios.
The hints are described [here](./dldt_deployment_optimization_hints.md).

**NOTE**: [OpenVINO performance hints](./dldt_deployment_optimization_hints.md) are the recommended way of performance configuration, which is both device-agnostic and future-proof.

The rest of this document provides low-level details on OpenVINO's mechanisms for optimizing the throughput.

## Low-Level Implementation Details
### OpenVINO Streams <a name="ov-streams"></a>
As detailed in the section <a href="#ov-async-api">OpenVINO Async API</a>, running multiple inference requests asynchronously is important for general application efficiency.
Additionally, most devices support running multiple inference requests in parallel in order to improve the device utilization. The _level_ of the parallelism (i.e. how many requests are really executed in parallel on the device) is commonly referred to as the number of 'streams'. Some devices run several requests per stream to amortize the host-side costs.
Notice that streams (which can be considered as independent queues) really execute the requests in parallel, but not in lockstep (as e.g. the batching does); this makes the streams much more compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md), when individual requests can have different shapes.

Also, notice that for efficient asynchronous execution, the streams actually handle the inference with a special pool of threads.
So each time you start inference requests (potentially from different application threads), they are actually muxed into an inference queue of the particular `ov::compiled_model`.
If there is a vacant stream, it pops the request from the queue and actually dispatches it for the on-device execution.
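The queue-and-vacant-stream mechanics above can be modeled with a tiny scheduling simulation (the request durations are arbitrary placeholders; in the real runtime the streams are thread pools, not abstract servers):

```python
import heapq

def makespan(num_streams, request_durations):
    # Each stream is modeled by the time at which it becomes vacant;
    # the next queued request always goes to the earliest-vacant stream.
    vacant_at = [0.0] * num_streams
    heapq.heapify(vacant_at)
    for duration in request_durations:
        start = heapq.heappop(vacant_at)   # a vacant stream pops the request
        heapq.heappush(vacant_at, start + duration)
    return max(vacant_at)  # time when the last request finishes

if __name__ == "__main__":
    requests = [1.0, 1.0, 1.0, 1.0]
    print(makespan(1, requests))  # all requests serialized
    print(makespan(2, requests))  # two requests in flight at a time
```

Note that unlike batching, nothing here requires the requests to have identical shapes or to start together.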

The usage of multiple streams is an inherently throughput-oriented approach, as every stream requires dedicated memory to operate in parallel to the rest of the streams (read-only data like weights are usually shared between all streams).
Also, the streams inflate the model load/compilation time.
This is why the [latency hint](./dldt_deployment_optimization_hints.md) governs a device to create a bare minimum of streams (usually just one).

Finally, the streams are always preferable to creating multiple instances of the same model, as the weights memory is shared across the streams, reducing the possible memory consumption.
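A simple arithmetic sketch of that memory argument (the sizes below are arbitrary placeholders, not real model footprints):

```python
def memory_with_streams(weights_mb, scratch_mb, n):
    # Read-only weights are shared; each stream only adds its own scratch memory.
    return weights_mb + n * scratch_mb

def memory_with_instances(weights_mb, scratch_mb, n):
    # Separate model instances duplicate the weights as well.
    return n * (weights_mb + scratch_mb)

if __name__ == "__main__":
    # 100 MB of weights, 10 MB of per-execution scratch, 4-way parallelism
    print(memory_with_streams(100, 10, 4))
    print(memory_with_instances(100, 10, 4))
```

The gap widens with the number of parallel executions and with the weights-to-scratch ratio.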

### Throughput on the CPU: Internals <a name="cpu-streams"></a>
In order to best serve multiple inference requests simultaneously, the inference threads are grouped/pinned to particular CPU cores, constituting the CPU streams.
This provides much better performance for the networks than batching, especially for many-core machines:



Compared with batching, the parallelism is somewhat transposed (i.e. performed over inputs, with much less synchronization within CNN ops):



Notice that [high-level performance hints](../OV_Runtime_UG/performance_hints.md) allow the implementation to select the optimal number of streams, _depending on the model compute demands_ and CPU capabilities (including [int8 inference](../OV_Runtime_UG/Int8Inference.md) hardware acceleration, number of cores, etc).

### Automatic Batching Internals <a name="ov-auto-batching"></a>
While the GPU plugin fully supports the general notion of streams, the associated performance (throughput) improvements are usually modest.
The primary reason is that, while streams allow hiding the communication overheads and certain bubbles in device utilization, running multiple OpenCL kernels on the GPU simultaneously is less efficient than calling a kernel on multiple inputs at once.

When the parallel slack is small (e.g. only 2-4 requests executed simultaneously), using the streams for the GPU may suffice. Also, streams are fully compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md), when individual requests can have different shapes.
Typically, for 4 and more requests the batching delivers better throughput for the GPUs. Using the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) is the most portable and future-proof option, allowing OpenVINO to find the best combination of streams and batching for a given scenario.
As explained in the section on the [automatic batching](../OV_Runtime_UG/automatic_batching.md), the feature performs on-the-fly grouping of the inference requests to improve device utilization.
The Automatic Batching relaxes the requirement for an application to saturate devices like GPU by _explicitly_ using a large batch. It performs transparent input gathering from individual inference requests, followed by the actual batched execution, with no programming effort from the user:



Essentially, the Automatic Batching shifts the asynchronicity from the individual requests to the groups of requests that constitute the batches. Thus, for the execution to be efficient, it is very important that the requests arrive in a timely fashion, without causing a batching timeout.
Normally, the timeout should never be hit. It is rather a graceful way to handle the application exit (when the inputs are not arriving anymore, so the full batch cannot be collected).

So if your workload experiences timeouts (resulting in a performance drop, as the timeout value adds to the latency of every request), consider balancing the timeout value against the batch size. For example, in many cases a smaller timeout value and batch size may yield better performance than a large batch size coupled with a timeout value that cannot guarantee accommodating all the required requests.
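The interplay between the timeout and the batch size can be sketched as a small simulation of a batch collector. The arrival pattern and the flush rule below are illustrative assumptions, not the plugin's actual implementation:

```python
def added_latency(arrival_times, batch_size, timeout):
    """Per-request extra latency introduced by batch collection.

    A batch is dispatched when it is full, or when `timeout` elapses
    after the first request of that batch arrived.
    """
    extra, batch_start, batch = [], None, []
    for t in sorted(arrival_times):
        # flush an underfull batch if the timeout expired before this arrival
        if batch and t - batch_start >= timeout:
            dispatch = batch_start + timeout
            extra += [dispatch - a for a in batch]
            batch, batch_start = [], None
        if not batch:
            batch_start = t
        batch.append(t)
        if len(batch) == batch_size:
            extra += [t - a for a in batch]  # dispatched immediately when full
            batch, batch_start = [], None
    if batch:  # a trailing underfull batch waits for the full timeout
        dispatch = batch_start + timeout
        extra += [dispatch - a for a in batch]
    return extra

if __name__ == "__main__":
    arrivals = [0.0, 1.0, 2.0, 3.0, 4.0]
    # the first four requests fill a batch quickly; the fifth alone pays the timeout
    print(added_latency(arrivals, batch_size=4, timeout=10.0))
```

Playing with `batch_size` and `timeout` against your actual arrival pattern shows exactly the trade-off described above: a large batch with a long timeout punishes stragglers, while a smaller batch bounds the worst-case wait.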

Finally, following the "get_tensor idiom" section from the [general optimizations](./dldt_deployment_optimization_common.md) helps the Automatic Batching to save on input/output copies. Thus, in your application, always prefer the "get" versions of the tensor data access APIs.
@@ -1,28 +1,36 @@
# Performance Optimization Guide {#openvino_docs_optimization_guide_dldt_optimization_guide}
# Introduction to Performance Optimization {#openvino_docs_optimization_guide_dldt_optimization_guide}
Before exploring possible optimization techniques, let us first define what the inference performance is and how to measure it.
Notice that reported inference performance often tends to focus on the speed of execution.
In fact, there are at least four connected factors: accuracy, throughput, latency, and efficiency. The rest of the document discusses how to balance these key factors.

Before exploring optimization techniques, let us first define what performance is and how it is measured.

## What Is Performance

Performance means how fast the model is in deployment. Two key metrics are used to measure performance: latency and throughput.
## What Is Inference Performance
Generally, performance means how fast the model processes the live data. Two key metrics are used to measure the performance: latency and throughput, which are fundamentally interconnected.



Latency measures inference time (ms) required to process a single input. When it comes to batch input need to measure throughput (images per second or frames per second, FPS). To calculate throughput, divide the number of frames that were processed by the processing time.
Latency measures the inference time (ms) required to process a single input. When it comes to executing multiple inputs simultaneously (e.g. via batching), the overall throughput (inferences per second, or frames per second, FPS, in the specific case of visual processing) is usually of more concern.
To calculate throughput, divide the number of frames that were processed by the processing time.
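These two definitions can be captured directly in code:

```python
def latency_ms(start_s, end_s):
    # Time to process a single input, reported in milliseconds.
    return (end_s - start_s) * 1000.0

def throughput_fps(frames_processed, processing_time_s):
    # Number of frames processed divided by the processing time.
    return frames_processed / processing_time_s

if __name__ == "__main__":
    print(latency_ms(0.0, 0.02))      # a 20 ms single-input inference
    print(throughput_fps(300, 10.0))  # 300 frames in 10 s -> 30 FPS
```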

## How to measure performance
To get performance numbers for OpenVINO, as well as tips on how to measure it and compare with a native framework, go to the [Getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) page.
It is important to separate the "pure" inference time of a neural network from the end-to-end application performance. For example, data transfers between the host and a device may unintentionally affect the performance when a host input tensor is processed on an accelerator like a dGPU. Similarly, the image pre-processing may also contribute significantly to the inference time. As detailed in the [getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) section, when drilling into _inference_ performance, one option is to measure all such items separately.
For the end-to-end scenario, though, consider performing the image pre-processing through OpenVINO, and use asynchronous execution as a way to amortize the communication costs like data transfers. You can find further details in the [general optimizations document](./dldt_deployment_optimization_common.md).

"First-inference latency" is another specific case (e.g. when fast application start-up is required) where the resulting performance may well be dominated by the model loading time. Consider [model caching](../OV_Runtime_UG/Model_caching_overview.md) as a way to improve the model loading/compilation time.

Finally, memory footprint restrictions are another possible concern when designing an application. While this is a motivation for the _model_ optimization techniques referenced in the next section, notice that the throughput-oriented execution is usually much more memory-hungry, as detailed in the [Deployment Optimization Guide](../optimization_guide/dldt_deployment_optimization_guide.md).

> **NOTE**: To get performance numbers for OpenVINO, as well as tips on how to measure it and compare with a native framework, check the [Getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) page.

## How to Improve Performance
## Improving the Performance: Model vs Runtime Optimizations

> **NOTE**: Make sure that your model can be successfully inferred with OpenVINO Inference Engine before reffering to the optimization topic.
> **NOTE**: Make sure that your model can be successfully inferred with OpenVINO Runtime.

Inside OpenVINO there are two ways how to get better performance numbers: optimize the model, which is called **model optimization** or tune parameters of execution, which is also **deployment optimization**. Note, that it is possible to combine both types of optimizations.
With OpenVINO there are two primary ways of improving the inference performance, namely model- and runtime-level optimizations. **These two optimization directions are fully compatible**.

- **Model optimization** includes model modifications, such as quantization, pruning, optimization of preprocessing, etc. For more details, refer to this [document](./model_optimization_guide.md).

- **Runtime (Deployment) optimization** includes tuning of model _execution_ parameters. To read more, visit the [Deployment Optimization Guide](../optimization_guide/dldt_deployment_optimization_guide.md).

## Performance benchmarks
To estimate the performance and compare performance numbers, measured on various supported devices, a wide range of public models are available at [Perforance benchmarks](../benchmarks/performance_benchmarks.md) section.
To estimate the performance and compare performance numbers, measured on various supported devices, a wide range of public models are available at the [Performance benchmarks](../benchmarks/performance_benchmarks.md) section.
@@ -8,6 +8,7 @@

   pot_README
   docs_nncf_introduction
   openvino_docs_IE_DG_Int8Inference

@endsphinxdirective
@@ -11,7 +11,8 @@ REPOSITORIES = [
    'openvino',
    'omz',
    'pot'
    'ovms'
    'ovms',
    'ote'
]
@@ -65,7 +65,7 @@ class DoxyMDFilter:
        """
        for link in self.md_links:
            link_path = self.parent_folder.joinpath(link).resolve()
            if os.path.exists(link_path):
            if os.path.exists(link_path) and link_path in self.file_to_label_mapping:
                self.content = self.content.replace(link, '@ref ' + self.file_to_label_mapping[link_path])
            else:
                rel_path = os.path.relpath(link_path, self.input_dir).replace('\\', '/')
@@ -72,6 +72,11 @@ def pytest_addoption(parser):
|
||||
action="store_true",
|
||||
default=False,
|
||||
help='Include link check for ovms')
|
||||
parser.addoption(
|
||||
'--include_ote',
|
||||
action="store_true",
|
||||
default=False,
|
||||
help='Include link check for ote')
|
||||
|
||||
|
||||
def read_lists(configs):
|
||||
@@ -90,7 +95,7 @@ def read_lists(configs):
def pytest_generate_tests(metafunc):
    """ Generate tests depending on command line options
    """
    exclude_links = {'open_model_zoo', 'workbench', 'pot', 'gst', 'omz', 'ovms'}
    exclude_links = {'open_model_zoo', 'workbench', 'pot', 'gst', 'omz', 'ovms', 'ote'}
    if metafunc.config.getoption('include_omz'):
        exclude_links.remove('open_model_zoo')
        exclude_links.remove('omz')
@@ -102,6 +107,8 @@ def pytest_generate_tests(metafunc):
        exclude_links.remove('gst')
    if metafunc.config.getoption('include_ovms'):
        exclude_links.remove('ovms')
    if metafunc.config.getoption('include_ote'):
        exclude_links.remove('ote')

    # warnings to ignore
    suppress_warnings = read_lists(metafunc.config.getoption('suppress_warnings'))
@@ -1,7 +1,6 @@
|
||||
#include <ie_core.hpp>
|
||||
|
||||
int main() {
|
||||
using namespace InferenceEngine;
|
||||
//! [part9]
|
||||
while(true) {
|
||||
// capture frame
|
||||
|
||||
@@ -55,6 +55,9 @@ int main() {
|
||||
//! [ie:inference]
|
||||
|
||||
//! [ie:start_async_and_wait]
|
||||
// NOTE: For demonstration purposes we are trying to set callback
|
||||
// which restarts inference inside one more time, so two inferences happen here
|
||||
|
||||
// Start inference without blocking current thread
|
||||
auto restart_once = true;
|
||||
infer_request.SetCompletionCallback<std::function<void(InferenceEngine::InferRequest, InferenceEngine::StatusCode)>>(
|
||||
@@ -72,11 +75,11 @@ int main() {
|
||||
}
|
||||
});
|
||||
infer_request.StartAsync();
|
||||
// Get inference status
|
||||
// Get inference status immediately
|
||||
InferenceEngine::StatusCode status = infer_request.Wait(InferenceEngine::InferRequest::STATUS_ONLY);
|
||||
// Wait for 1 miliseconds
|
||||
// Wait for 1 milisecond
|
||||
status = infer_request.Wait(1);
|
||||
// Wait for inference complition
|
||||
// Wait for inference completion
|
||||
infer_request.Wait(InferenceEngine::InferRequest::RESULT_READY);
|
||||
//! [ie:start_async_and_wait]
|
||||
|
||||
|
||||
@@ -3,39 +3,52 @@
int main() {
    ov::Core core;
    auto model = core.read_model("sample.xml");
    {
    //! [compile_model]
    {
    auto compiled_model = core.compile_model(model, "GPU", ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    }
    auto compiled_model = core.compile_model(model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    //! [compile_model]
    }

    {
    //! [compile_model_no_auto_batching]
    {
    // disabling the automatic batching
    // leaving intact other configurations options that the device selects for the 'throughput' hint
    auto compiled_model = core.compile_model(model, "GPU", {ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
                                                           ov::hint::allow_auto_batching(false)});
    }
    // disabling the automatic batching
    // leaving intact other configurations options that the device selects for the 'throughput' hint
    auto compiled_model = core.compile_model(model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
        ov::hint::allow_auto_batching(false));
    //! [compile_model_no_auto_batching]

    //! [query_optimal_num_requests]
    {
    // when the batch size is automatically selected by the implementation
    // it is important to query/create and run the sufficient #requests
    auto compiled_model = core.compile_model(model, "GPU", ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    auto num_requests = compiled_model.get_property(ov::optimal_number_of_infer_requests);
    }
    //! [query_optimal_num_requests]

    //! [hint_num_requests]
    {
    // limiting the available parallel slack for the 'throughput' hint via the ov::hint::num_requests
    // so that certain parameters (like selected batch size) are automatically accommodated accordingly
    auto compiled_model = core.compile_model(model, "GPU", {ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
                                                           ov::hint::num_requests(4)});
    //! [query_optimal_num_requests]
    // when the batch size is automatically selected by the implementation
    // it is important to query/create and run the sufficient #requests
    auto compiled_model = core.compile_model(model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    auto num_requests = compiled_model.get_property(ov::optimal_number_of_infer_requests);
    //! [query_optimal_num_requests]
    }

    {
    //! [hint_num_requests]
    // limiting the available parallel slack for the 'throughput' hint via the ov::hint::num_requests
    // so that certain parameters (like selected batch size) are automatically accommodated accordingly
    auto compiled_model = core.compile_model(model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
        ov::hint::num_requests(4));
    //! [hint_num_requests]
    }

    //! [hint_plus_low_level]
    {
    // high-level performance hints are compatible with low-level device-specific settings
    auto compiled_model = core.compile_model(model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
        ov::inference_num_threads(4));
    }
    //! [hint_plus_low_level]

    return 0;
}
@@ -31,3 +31,11 @@ config = {"PERFORMANCE_HINT": "THROUGHPUT",
# so that certain parameters (like selected batch size) are automatically accommodated accordingly
compiled_model = core.compile_model(model, "GPU", config)
# [hint_num_requests]

# [hint_plus_low_level]
config = {"PERFORMANCE_HINT": "THROUGHPUT",
          "INFERENCE_NUM_THREADS": "4"}
# limiting the available parallel slack for the 'throughput'
# so that certain parameters (like selected batch size) are automatically accommodated accordingly
compiled_model = core.compile_model(model, "CPU", config)
# [hint_plus_low_level]
@@ -80,6 +80,9 @@ int main() {
    //! [ov_api_2_0:inference]

    //! [ov_api_2_0:start_async_and_wait]
    // NOTE: For demonstration purposes we are trying to set callback
    // which restarts inference inside one more time, so two inferences happen here

    auto restart_once = true;
    infer_request.set_callback([&, restart_once] (std::exception_ptr exception_ptr) mutable {
        if (exception_ptr) {
@@ -97,11 +100,11 @@ int main() {
    });
    // Start inference without blocking current thread
    infer_request.start_async();
    // Get inference status
    // Get inference status immediately
    bool status = infer_request.wait_for(std::chrono::milliseconds{0});
    // Wait for one miliseconds
    // Wait for one millisecond
    status = infer_request.wait_for(std::chrono::milliseconds{1});
    // Wait for inference complition
    // Wait for inference completion
    infer_request.wait();
    //! [ov_api_2_0:start_async_and_wait]
@@ -31,25 +31,26 @@ for (auto&& node : model->get_ops()) {
auto compiled_model = core.compile_model(model, device);
//! [fix_automatic_affinities]

//! [compile_model]
{
auto compiled_model = core.compile_model(model, "HETERO:GPU,CPU");
// or with ov::device::priorities with multiple args
compiled_model = core.compile_model(model, "HETERO", ov::device::priorities("GPU", "CPU"));
// or with ov::device::priorities with a single argument
compiled_model = core.compile_model(model, "HETERO", ov::device::priorities("GPU,CPU"));
}
//! [compile_model]
auto compiled_model = core.compile_model(model, "HETERO:GPU,CPU");
// or with ov::device::priorities with multiple args
compiled_model = core.compile_model(model, "HETERO", ov::device::priorities("GPU", "CPU"));
// or with ov::device::priorities with a single argument
compiled_model = core.compile_model(model, "HETERO", ov::device::priorities("GPU,CPU"));
//! [compile_model]
}

{
//! [configure_fallback_devices]
auto compiled_model = core.compile_model(model, "HETERO",
    // GPU with fallback to CPU
    ov::device::priorities("GPU", "CPU"),
    // profiling is enabled only for GPU
    ov::device::properties("GPU", ov::enable_profiling(true)),
    // FP32 inference precision only for CPU
    ov::device::properties("CPU", ov::hint::inference_precision(ov::element::f32))
);
auto compiled_model = core.compile_model(model, "HETERO",
    // GPU with fallback to CPU
    ov::device::priorities("GPU", "CPU"),
    // profiling is enabled only for GPU
    ov::device::properties("GPU", ov::enable_profiling(true)),
    // FP32 inference precision only for CPU
    ov::device::properties("CPU", ov::hint::inference_precision(ov::element::f32))
);
//! [configure_fallback_devices]
}
return 0;
@@ -31,6 +31,8 @@ compiled_model = core.compile_model(model, device)

#! [compile_model]
compiled_model = core.compile_model(model, device_name="HETERO:GPU,CPU")
# device priorities via configuration property
compiled_model = core.compile_model(model, device_name="HETERO", config={"MULTI_DEVICE_PRIORITIES": "GPU,CPU"})
#! [compile_model]

#! [configure_fallback_devices]
@@ -4,52 +4,55 @@

#include <openvino/core/layout.hpp>

int main() {
ov::Layout layout;
//! [ov:layout:simple]
layout = ov::Layout("NHWC");
//! [ov:layout:simple]

//! [ov:layout:complex]
// Each dimension has name separated by comma, layout is wrapped with square brackets
layout = ov::Layout("[time,temperature,humidity]");
//! [ov:layout:complex]

//! [ov:layout:partially_defined]
// First dimension is batch, 4th is 'channels'. Others are not important for us
layout = ov::Layout("N??C");
// Or the same using advanced syntax
layout = ov::Layout("[n,?,?,c]");
//! [ov:layout:partially_defined]

//! [ov:layout:dynamic]
// First dimension is 'batch' others are whatever
layout = ov::Layout("N...");

// Second dimension is 'channels' others are whatever
layout = ov::Layout("?C...");

// Last dimension is 'channels' others are whatever
layout = ov::Layout("...C");
//! [ov:layout:dynamic]

//! [ov:layout:predefined]
// returns 0 for batch
ov::layout::batch_idx("NCDHW");

// returns 1 for channels
ov::layout::channels_idx("NCDHW");

// returns 2 for depth
ov::layout::depth_idx("NCDHW");

// returns -2 for height
ov::layout::height_idx("...HW");

// returns -1 for width
ov::layout::width_idx("...HW");
//! [ov:layout:predefined]

//! [ov:layout:dump]
layout = ov::Layout("NCHW");
std::cout << layout.to_string(); // prints [N,C,H,W]
//! [ov:layout:dump]

return 0;
}

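The snippet's comments show the convention behind the predefined index helpers: for a fully named layout such as `"NCDHW"` the helpers return a position counted from the front, while for a layout starting with `"..."` the position is counted from the end and returned as a negative index. A plain-Python sketch of that convention (a hypothetical `dim_index` helper for illustration, not the OpenVINO implementation):

```python
def dim_index(layout: str, name: str) -> int:
    """Return the index of dimension `name` in a layout string such as
    'NCDHW' or '...HW'. When the layout starts with '...', the position
    is counted from the end and returned as a negative index, mirroring
    ov::layout::height_idx and friends. Sketch only."""
    if layout.startswith("..."):
        tail = layout[3:]
        return tail.index(name) - len(tail)
    return layout.index(name)

assert dim_index("NCDHW", "N") == 0   # batch
assert dim_index("NCDHW", "C") == 1   # channels
assert dim_index("...HW", "H") == -2  # height, counted from the end
assert dim_index("...HW", "W") == -1  # width
```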
@@ -6,71 +6,78 @@

#include <openvino/core/preprocess/pre_post_process.hpp>

void ppp_input_1(ov::preprocess::PrePostProcessor& ppp) {
//! [ov:preprocess:input_1]
ppp.input() // no index/name is needed if model has one input
  .preprocess().scale(50.f);

ppp.output() // same for output
  .postprocess().convert_element_type(ov::element::u8);
//! [ov:preprocess:input_1]

//! [ov:preprocess:mean_scale]
ppp.input("input").preprocess().mean(128).scale(127);
//! [ov:preprocess:mean_scale]

//! [ov:preprocess:mean_scale_array]
// Suppose model's shape is {1, 3, 224, 224}
ppp.input("input").model().set_layout("NCHW"); // N=1, C=3, H=224, W=224
// Mean/Scale has 3 values which matches with C=3
ppp.input("input").preprocess()
  .mean({103.94, 116.78, 123.68}).scale({57.21, 57.45, 57.73});
//! [ov:preprocess:mean_scale_array]

//! [ov:preprocess:convert_element_type]
// First define data type for your tensor
ppp.input("input").tensor().set_element_type(ov::element::u8);

// Then define preprocessing step
ppp.input("input").preprocess().convert_element_type(ov::element::f32);

// If conversion is needed to `model's` element type, 'f32' can be omitted
ppp.input("input").preprocess().convert_element_type();
//! [ov:preprocess:convert_element_type]

//! [ov:preprocess:convert_layout]
// First define layout for your tensor
ppp.input("input").tensor().set_layout("NHWC");

// Then define layout of model
ppp.input("input").model().set_layout("NCHW");

std::cout << ppp; // Will print 'implicit layout conversion step'
//! [ov:preprocess:convert_layout]

//! [ov:preprocess:convert_layout_2]
ppp.input("input").tensor().set_shape({1, 480, 640, 3});
// Model expects shape {1, 3, 480, 640}
ppp.input("input").preprocess().convert_layout({0, 3, 1, 2});
// 0 -> 0; 3 -> 1; 1 -> 2; 2 -> 3
//! [ov:preprocess:convert_layout_2]

//! [ov:preprocess:resize_1]
ppp.input("input").tensor().set_shape({1, 3, 960, 1280});
ppp.input("input").model().set_layout("??HW");
ppp.input("input").preprocess().resize(ov::preprocess::ResizeAlgorithm::RESIZE_LINEAR, 480, 640);
//! [ov:preprocess:resize_1]

//! [ov:preprocess:resize_2]
ppp.input("input").tensor().set_shape({1, 3, 960, 1280});
ppp.input("input").model().set_layout("??HW"); // Model accepts {1, 3, 480, 640} shape
// Resize to model's dimension
ppp.input("input").preprocess().resize(ov::preprocess::ResizeAlgorithm::RESIZE_LINEAR);
//! [ov:preprocess:resize_2]

//! [ov:preprocess:convert_color_1]
ppp.input("input").tensor().set_color_format(ov::preprocess::ColorFormat::BGR);
ppp.input("input").preprocess().convert_color(ov::preprocess::ColorFormat::RGB);
//! [ov:preprocess:convert_color_1]

//! [ov:preprocess:convert_color_2]
// This will split original `input` to 2 separate inputs: `input/y' and 'input/uv'
ppp.input("input").tensor().set_color_format(ov::preprocess::ColorFormat::NV12_TWO_PLANES);
ppp.input("input").preprocess().convert_color(ov::preprocess::ColorFormat::RGB);
std::cout << ppp; // Dump preprocessing steps to see what will happen
//! [ov:preprocess:convert_color_2]
}

void ppp_input_2(ov::preprocess::PrePostProcessor& ppp) {
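In the `convert_layout({0, 3, 1, 2})` snippet above, each entry of the argument names which source dimension feeds that output position, so an NHWC tensor `{1, 480, 640, 3}` is rearranged into the NCHW shape `{1, 3, 480, 640}` the model expects. A plain-Python sketch of that shape permutation (illustration of the convention only, not the OpenVINO implementation):

```python
def convert_layout(shape, order):
    """Permute a shape the way convert_layout({0, 3, 1, 2}) does:
    output dimension i takes its size from input dimension order[i]."""
    return tuple(shape[src] for src in order)

# NHWC tensor rearranged for an NCHW model:
assert convert_layout((1, 480, 640, 3), (0, 3, 1, 2)) == (1, 3, 480, 640)
```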