[GPU] GPU plugin docs (#10734)

This commit is contained in:
Vladimir Paramuzov 2022-03-10 15:01:52 +03:00 committed by GitHub
parent a8a2640fb7
commit 09246e2db8
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
26 changed files with 1340 additions and 583 deletions

View File

@ -47,6 +47,9 @@ Jenkinsfile @openvinotoolkit/openvino-admins
/src/inference/include/ie/cldnn/ @openvinotoolkit/openvino-ie-gpu-maintainers @openvinotoolkit/openvino-ie-gpu-developers
/src/inference/include/openvino/runtime/intel_gpu/ @openvinotoolkit/openvino-ie-gpu-maintainers @openvinotoolkit/openvino-ie-gpu-developers
/src/plugins/intel_gpu/ @openvinotoolkit/openvino-ie-gpu-maintainers @openvinotoolkit/openvino-ie-gpu-developers
/docs/snippets/gpu/ @openvinotoolkit/openvino-ie-gpu-maintainers @openvinotoolkit/openvino-ie-gpu-developers
/docs/OV_Runtime_UG/supported_plugins/GPU.md @openvinotoolkit/openvino-ie-gpu-maintainers @openvinotoolkit/openvino-ie-gpu-developers
/docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md @openvinotoolkit/openvino-ie-gpu-maintainers @openvinotoolkit/openvino-ie-gpu-developers
# IE VPU:
/src/plugins/intel_myriad @openvinotoolkit/openvino-ie-vpu-maintainers

View File

@ -0,0 +1,229 @@
# How to Implement Custom GPU Operations {#openvino_docs_Extensibility_UG_GPU}
To enable operations not supported by OpenVINO out of the box, you may need an extension for the OpenVINO operation set and a custom kernel for the device you target. This page describes custom kernel support for the GPU device.
The GPU codepath abstracts many details about OpenCL\*. You need to provide the kernel code in OpenCL C and an XML configuration file that connects the kernel and its parameters to the parameters of the operation.
There are two options for using the custom operation configuration file:
* Include a section with your kernels into the automatically-loaded `<lib_path>/cldnn_global_custom_kernels/cldnn_global_custom_kernels.xml` file.
* Call the `ov::Core::set_property()` method from your application with the `"CONFIG_FILE"` key and the configuration file name as its value before compiling a model that uses custom operations on the GPU plugin:
@snippet snippets/gpu/custom_kernels_api.cpp part0
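Since the snippet referenced above is not included in this diff, here is a minimal sketch of such a call, assuming a placeholder configuration file name:
```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Register the custom kernel configuration for the GPU device
    // before compiling a model that uses the custom operations.
    // "custom_layer_example.xml" is a placeholder path.
    core.set_property("GPU", {{"CONFIG_FILE", "custom_layer_example.xml"}});
    auto compiled_model = core.compile_model("model_with_custom_op.xml", "GPU");
    return 0;
}
```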
All OpenVINO samples, except the trivial `hello_classification`, and most Open Model Zoo demos
feature a dedicated command-line option `-c` to load custom kernels. For example, to load custom operations for the classification sample, run the command below:
```sh
$ ./classification_sample -m <path_to_model>/bvlc_alexnet_fp16.xml -i ./validation_set/daily/227x227/apron.bmp -d GPU \
  -c <absolute_path_to_config>/custom_layer_example.xml
```
## Configuration File Format <a name="config-file-format"></a>
The configuration file is expected to follow the `.xml` file structure
with a node of the type `CustomLayer` for every custom operation you provide.
The definitions described in the sections below use the following notations:
Notation | Description
---|---
(0/1) | Can have zero or one instance of this node or attribute
(1) | Must have only one instance of this node or attribute
(0+) | Can have any number of instances of this node or attribute
(1+) | Can have one or more instances of this node or attribute
### CustomLayer Node and Sub-Node Structure
`CustomLayer` node contains the entire configuration for a single custom operation.
| Attribute Name |\# | Description |
|-----|-----|-----|
| `name` | (1) | The name of the operation type to be used. This name should be identical to the type used in the IR.|
| `type` | (1) | Must be `SimpleGPU`. |
| `version` | (1) | Must be `1`. |
**Sub-nodes**: `Kernel` (1), `Buffers` (1), `CompilerOptions` (0+),
`WorkSizes` (0/1)
### Kernel Node and Sub-Node Structure
`Kernel` node contains all kernel source code configuration.
**Sub-nodes**: `Source` (1+), `Define` (0+)
### Source Node and Sub-Node Structure
`Source` node points to a single OpenCL source file.
| Attribute Name | \# |Description|
|-----|-----|-----|
| `filename` | (1) | Name of the file containing OpenCL source code. Note that the path is relative to your executable. Multiple source nodes will have their sources concatenated in order. |
**Sub-nodes**: None
### Define Node and Sub-Node Structure
`Define` node configures a single `#&zwj;define` instruction to be added to
the sources during compilation (JIT).
| Attribute Name | \# | Description |
|------|-------|------|
| `name` | (1) | The name of the defined JIT. For static constants, this can include the value as well, which is taken as a string. |
| `param` | (0/1) | This parameter value is used as the value of this JIT definition. |
| `type` | (0/1) | The parameter type. Accepted values: `int`, `float`, and `int[]`, `float[]` for arrays. |
| `default` | (0/1) | The default value to be used if the specified parameters are missing from the operation in the IR. |
**Sub-nodes:** None
The resulting JIT has the following form:
`#&zwj;define [name] [type] [value/default]`.
### Buffers Node and Sub-Node Structure
`Buffers` node configures all input/output buffers for the OpenCL entry
function. The node itself has no attributes.
**Sub-nodes:** `Data` (0+), `Tensor` (1+)
### Data Node and Sub-Node Structure
`Data` node configures a single input with static data, for example,
weights or biases.
| Attribute Name | \# | Description |
|----|-----|------|
| `name` | (1) | Name of a blob attached to an operation in the IR |
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to |
**Sub-nodes**: None
### Tensor Node and Sub-Node Structure
`Tensor` node configures a single input or output tensor.
| Attribute Name | \# | Description |
|------|-------|-------|
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to. |
| `type` | (1) | `input` or `output` |
| `port-index` | (1) | 0-based index in the operation input/output ports in the IR |
| `format` | (0/1) | Data layout declaration for the tensor. Accepted values: `BFYX`, `BYXF`, `YXFB`, `FYXB`, and same values in all lowercase. Default value: `BFYX` |
### CompilerOptions Node and Sub-Node Structure
`CompilerOptions` node configures the compilation flags for the OpenCL
sources.
| Attribute Name | \# | Description |
|--------|-----|------|
| `options` | (1) | Options string to be passed to the OpenCL compiler |
**Sub-nodes**: None
### WorkSizes Node and Sub-Node Structure
`WorkSizes` node configures the global/local work sizes to be used when
queuing an OpenCL program for execution.
| Attribute Name | \# | Description |
|-----|------|-----|
| `global`<br>`local` | (0/1)<br>(0/1) | An array of up to three integers or formulas for defining OpenCL work-sizes to be used during execution.<br> The formulas can use the values of the B,F,Y,X dimensions and contain the operators: +,-,/,\*,%. All operators are evaluated in integer arithmetic. <br>Default value: `global="B*F*Y*X" local=""` |
| `dim` | (0/1) | A tensor to take the work-size from. Accepted values: `input N`, `output`, where `N` is an index of input tensor starting with 0. Default value: `output` |
**Sub-nodes**: None
## Example Configuration File
The following code sample provides an example configuration file in XML
format. For information on the configuration file structure, see
[Configuration File Format](#config-file-format).
```xml
<CustomLayer name="ReLU" type="SimpleGPU" version="1">
<Kernel entry="example_relu_kernel">
<Source filename="custom_layer_kernel.cl"/>
<Define name="neg_slope" type="float" param="negative_slope" default="0.0"/>
</Kernel>
<Buffers>
<Tensor arg-index="0" type="input" port-index="0" format="BFYX"/>
<Tensor arg-index="1" type="output" port-index="0" format="BFYX"/>
</Buffers>
<CompilerOptions options="-cl-mad-enable"/>
<WorkSizes global="X,Y,B*F"/>
</CustomLayer>
```
## Built-In Definitions for Custom Layers
The following table includes definitions that are attached before
user sources.
For an example, see [Example Kernel](#example-kernel).
| Name | Value |
|---|---|
| `NUM_INPUTS` | Number of the input tensors bound to this kernel |
| `GLOBAL_WORKSIZE` | An array of global work sizes used to execute this kernel |
| `GLOBAL_WORKSIZE_SIZE` | The size of the `GLOBAL_WORKSIZE` array |
| `LOCAL_WORKSIZE` | An array of local work sizes used to execute this kernel |
| `LOCAL_WORKSIZE_SIZE` | The size of the `LOCAL_WORKSIZE` array |
| `<TENSOR>_DIMS`| An array of the tensor dimension sizes. Always ordered as `BFYX` |
| `<TENSOR>_DIMS_SIZE`| The size of the `<TENSOR>_DIMS` array.|
| `<TENSOR>_TYPE`| The datatype of the tensor: `float`, `half`, or `char`|
| `<TENSOR>_FORMAT_<TENSOR_FORMAT>` | The format of the tensor: BFYX, BYXF, YXFB, FYXB, or ANY. The format is concatenated to the defined name. You can use the tensor format to define codepaths in your code with `#&zwj;ifdef/#&zwj;endif`. |
| `<TENSOR>_LOWER_PADDING` | An array of padding elements used for the tensor dimensions before they start. Always ordered as BFYX.|
| `<TENSOR>_LOWER_PADDING_SIZE` | The size of the `<TENSOR>_LOWER_PADDING` array |
| `<TENSOR>_UPPER_PADDING` | An array of padding elements used for the tensor dimensions after they end. Always ordered as BFYX. |
| `<TENSOR>_UPPER_PADDING_SIZE` | The size of the `<TENSOR>_UPPER_PADDING` array |
| `<TENSOR>_PITCHES` | The offset (in elements) between adjacent elements in each dimension. Always ordered as BFYX.|
| `<TENSOR>_PITCHES_SIZE`| The size of the `<TENSOR>_PITCHES` array |
| `<TENSOR>_OFFSET`| The number of elements from the start of the tensor to the first valid element, bypassing the lower padding. |
All `<TENSOR>` values are automatically defined for every tensor
bound to this operation, such as `INPUT0`, `INPUT1`, and `OUTPUT0`, as shown
in the following example:
```c
#define INPUT0_DIMS_SIZE 4
#define INPUT0_DIMS (int []){ 1,96,55,55, }
```
## Example Kernel<a name="example-kernel"></a>
```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void example_relu_kernel(
const __global INPUT0_TYPE* input0,
__global OUTPUT0_TYPE* output)
{
const uint idx = get_global_id(0);
const uint idy = get_global_id(1);
const uint idbf = get_global_id(2); // batches*features, as OpenCL supports 3D nd-ranges only
const uint feature = idbf % OUTPUT0_DIMS[1];
const uint batch = idbf / OUTPUT0_DIMS[1];
//notice that pitches are in elements, not in bytes!
const uint in_id = batch*INPUT0_PITCHES[0] + feature*INPUT0_PITCHES[1] + idy*INPUT0_PITCHES[2] + idx*INPUT0_PITCHES[3] + INPUT0_OFFSET;
const uint out_id = batch*OUTPUT0_PITCHES[0] + feature*OUTPUT0_PITCHES[1] + idy*OUTPUT0_PITCHES[2] + idx*OUTPUT0_PITCHES[3] + OUTPUT0_OFFSET;
INPUT0_TYPE value = input0[in_id];
// neg_slope (which is non-zero for leaky ReLU) is put automatically as #define, refer to the config xml
output[out_id] = value < 0 ? value * neg_slope : value;
}
```
> **NOTE**: As described in the previous section, all items like
> `INPUT0_TYPE` are actually defined as OpenCL (pre-)compiler inputs by
> OpenVINO for efficiency reasons. See [Debugging
> Tips](#debugging-tips) for information on debugging the results.
## Debugging Tips<a name="debugging-tips"></a>
* **Using `printf` in the OpenCL™ Kernels**.
To debug the specific values, you can use `printf` in your kernels.
However, be careful not to output excessively, as this could generate
too much data. The `printf` output is buffered and limited in size, so
your output may be truncated to fit the buffer. Also, because of
buffering, you get the entire buffer of output only when the
execution ends.<br>
For more information, refer to the [printf
Function](https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/printfFunction.html).

View File

@ -7,6 +7,7 @@
:hidden:
openvino_docs_Extensibility_UG_add_openvino_ops
openvino_docs_Extensibility_UG_GPU
@endsphinxdirective

View File

@ -19,9 +19,9 @@ The OpenVINO Runtime provides capabilities to infer deep learning models on the
| Plugin | Device types |
|------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
|[CPU plugin](CPU.md) |Intel&reg; Xeon&reg; with Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and AVX512_BF16, Intel&reg; Core&trade; Processors with Intel&reg; AVX2, Intel&reg; Atom&reg; Processors with Intel® Streaming SIMD Extensions (Intel® SSE) |
|[GPU plugin](GPU.md) |Intel&reg; Processor Graphics, including Intel&reg; HD Graphics and Intel&reg; Iris&reg; Graphics |
|[VPU plugins](VPU.md) (available in the Intel® Distribution of OpenVINO™ toolkit) |Intel® Neural Compute Stick 2 powered by the Intel® Movidius™ Myriad™ X, Intel® Vision Accelerator Design with Intel® Movidius™ VPUs |
|[GNA plugin](GNA.md) (available in the Intel® Distribution of OpenVINO™ toolkit) |Intel&reg; Speech Enabling Developer Kit, Amazon Alexa* Premium Far-Field Developer Kit, Intel&reg; Pentium&reg; Silver J5005 Processor, Intel&reg; Pentium&reg; Silver N5000 Processor, Intel&reg; Celeron&reg; J4005 Processor, Intel&reg; Celeron&reg; J4105 Processor, Intel&reg; Celeron&reg; Processor N4100, Intel&reg; Celeron&reg; Processor N4000, Intel&reg; Core&trade; i3-8121U Processor, Intel&reg; Core&trade; i7-1065G7 Processor, Intel&reg; Core&trade; i7-1060G7 Processor, Intel&reg; Core&trade; i5-1035G4 Processor, Intel&reg; Core&trade; i5-1035G7 Processor, Intel&reg; Core&trade; i5-1035G1 Processor, Intel&reg; Core&trade; i5-1030G7 Processor, Intel&reg; Core&trade; i5-1030G4 Processor, Intel&reg; Core&trade; i3-1005G1 Processor, Intel&reg; Core&trade; i3-1000G1 Processor, Intel&reg; Core&trade; i3-1000G4 Processor|
|[GPU plugin](GPU.md) |Intel® Graphics, including Intel® HD Graphics, Intel® UHD Graphics, Intel® Iris® Graphics, Intel® Xe Graphics, Intel® Xe MAX Graphics |
|[VPU plugins](VPU.md) |Intel® Neural Compute Stick 2 powered by the Intel® Movidius™ Myriad™ X, Intel® Vision Accelerator Design with Intel® Movidius™ VPUs |
|[GNA plugin](GNA.md) |Intel&reg; Speech Enabling Developer Kit, Amazon Alexa* Premium Far-Field Developer Kit, Intel&reg; Pentium&reg; Silver J5005 Processor, Intel&reg; Pentium&reg; Silver N5000 Processor, Intel&reg; Celeron&reg; J4005 Processor, Intel&reg; Celeron&reg; J4105 Processor, Intel&reg; Celeron&reg; Processor N4100, Intel&reg; Celeron&reg; Processor N4000, Intel&reg; Core&trade; i3-8121U Processor, Intel&reg; Core&trade; i7-1065G7 Processor, Intel&reg; Core&trade; i7-1060G7 Processor, Intel&reg; Core&trade; i5-1035G4 Processor, Intel&reg; Core&trade; i5-1035G7 Processor, Intel&reg; Core&trade; i5-1035G1 Processor, Intel&reg; Core&trade; i5-1030G7 Processor, Intel&reg; Core&trade; i5-1030G4 Processor, Intel&reg; Core&trade; i3-1005G1 Processor, Intel&reg; Core&trade; i3-1000G1 Processor, Intel&reg; Core&trade; i3-1000G4 Processor|
OpenVINO runtime also has several execution capabilities which work on top of other devices:
@ -34,3 +34,21 @@ OpenVINO runtime also has several execution capabilities which work on top of ot
Devices similar to the ones we have used for benchmarking can be accessed using [Intel® DevCloud for the Edge](https://devcloud.intel.com/edge/), a remote development environment with access to Intel® hardware and the latest versions of the Intel® Distribution of the OpenVINO™ Toolkit. [Learn more](https://devcloud.intel.com/edge/get_started/devcloud/) or [Register here](https://inteliot.force.com/DevcloudForEdge/s/).
## Features support matrix
The table below demonstrates support of key features by OpenVINO device plugins.
| Capability | CPU | [GPU](./GPU.md) | GNA | VPU |
| ---------- | --- | --- | --- | --- |
| [Heterogeneous execution](../hetero_execution.md)| Yes | Yes | ? | ? |
| [Multi-device execution](../multi_device.md) | Yes | Yes | ? | ? |
| [Automatic batching](../automatic_batching.md) | No | Yes | ? | ? |
| [Multi-stream execution](@ref openvino_docs_optimization_guide_dldt_optimization_guide) | Yes | Yes | ? | ? |
| [Models caching](../Model_caching_overview.md) | Yes | Partial | ? | ? |
| [Dynamic shapes](../ov_dynamic_shapes.md) | Yes | Partial | ? | ? |
| Import/Export | Yes | No | ? | ? |
| [Preprocessing acceleration](../preprocessing_overview.md) | Yes | Yes | ? | ? |
| [Stateful models](../network_state_intro.md) | Yes | No | ? | ? |
| [Extensibility](@ref openvino_docs_Extensibility_UG_Intro) | Yes | Yes | ? | ? |
For more details on plugin-specific feature limitations, see the corresponding plugin pages.

View File

@ -6,21 +6,27 @@
:maxdepth: 1
:hidden:
openvino_docs_OV_UG_supported_plugins_GPU_RemoteBlob_API
openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API
@endsphinxdirective
The GPU plugin uses the Intel® Compute Library for Deep Neural Networks (clDNN) to infer deep neural networks.
clDNN is an open source performance library for Deep Learning (DL) applications intended for acceleration of Deep Learning Inference on Intel® Processor Graphics including Intel® HD Graphics, Intel® Iris® Graphics, Intel® Iris® Xe Graphics, and Intel® Iris® Xe MAX graphics.
For an in-depth description of clDNN, see [OpenVINO Runtime GPU plugin source files](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_gpu/) and [Accelerate Deep Learning Inference with Intel® Processor Graphics](https://software.intel.com/en-us/articles/accelerating-deep-learning-inference-with-intel-processor-graphics).
The GPU plugin is an OpenCL-based plugin for inference of deep neural networks on Intel GPUs, both integrated and discrete.
For an in-depth description of the GPU plugin, see:
- [GPU plugin developers documentation](https://github.com/openvinotoolkit/openvino/wiki/GPUPluginDevelopersDocs)
- [OpenVINO Runtime GPU plugin source files](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_gpu/)
- [Accelerate Deep Learning Inference with Intel® Processor Graphics](https://software.intel.com/en-us/articles/accelerating-deep-learning-inference-with-intel-processor-graphics).
The GPU plugin is a part of the Intel® Distribution of OpenVINO™ toolkit.
See the [GPU configuration page](@ref openvino_docs_install_guides_configurations_for_intel_gpu) for details on how to configure a machine to use the GPU plugin.
## Device Naming Convention
* Devices are enumerated as "GPU.X" where `X={0, 1, 2,...}`. Only Intel® GPU devices are considered.
* If the system has an integrated GPU, it always has id=0 ("GPU.0").
* Devices are enumerated as `"GPU.X"` where `X={0, 1, 2,...}`. Only Intel® GPU devices are considered.
* If the system has an integrated GPU, it always has id=0 (`"GPU.0"`).
* Other GPUs have undefined order that depends on the GPU driver.
* "GPU" is an alias for "GPU.0"
* `"GPU"` is an alias for `"GPU.0"`
* If the system doesn't have an integrated GPU, then devices are enumerated starting from 0.
* For GPUs with a multi-tile architecture (multiple sub-devices in OpenCL terms), a specific tile may be addressed as `"GPU.X.Y"`, where `X,Y={0, 1, 2,...}`, `X` is the id of the GPU device, and `Y` is the id of the tile within device `X`
For demonstration purposes, see the [Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) that can print out the list of available devices with associated indices. Below is an example output (truncated to the device names only):
@ -36,122 +42,180 @@ Available devices:
Device: HDDL
```
## Optimizations
Then the device name can be passed to the `ov::Core::compile_model()` method:
The plugin supports algorithms that fuse several operations into one optimized operation. Refer to the sections below for details.
@sphinxdirective
> **NOTE**: For operation descriptions, see the [IR Notation Reference](../../ops/opset.md).
.. tab:: Running on default device
### Fusing Convolution and Simple Layers
.. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
:language: cpp
:fragment: [compile_model_default_gpu]
Merge of a Convolution layer and any of the simple layers listed below:
- Activation: ReLU, ELU, Sigmoid, Clamp, and others
- Depthwise: ScaleShift, PReLU
- FakeQuantize
.. tab:: Running on specific GPU
> **NOTE**: You can have any number and order of simple layers.
.. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
:language: cpp
:fragment: [compile_model_gpu_with_id]
A combination of a Convolution layer and simple layers results in a single fused layer called
*Convolution*:
![conv_simple_01]
.. tab:: Running on specific tile
.. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
:language: cpp
:fragment: [compile_model_gpu_with_id_and_tile]
@endsphinxdirective
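The referenced snippets are not shown in this diff; a minimal sketch of the three cases above, assuming a placeholder model path, might look like:
```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");

    // Default GPU device ("GPU" is an alias for "GPU.0").
    auto compiled_default = core.compile_model(model, "GPU");

    // Specific GPU device selected by index.
    auto compiled_gpu1 = core.compile_model(model, "GPU.1");

    // Specific tile of a multi-tile GPU device.
    auto compiled_tile = core.compile_model(model, "GPU.1.0");
    return 0;
}
```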
## Supported inference data types
GPU plugin supports the following data types as inference precision of internal primitives:
- Floating-point data types:
- f32
- f16
- Quantized data types:
- u8
- i8
- u1
Selected precision of each primitive depends on the operation precision in IR, quantization primitives, and available hardware capabilities.
The u1/u8/i8 data types are used for quantized operations only, i.e., they are not selected automatically for non-quantized operations.
See the [low-precision optimization guide](@ref pot_docs_LowPrecisionOptimizationGuide) for more details on how to get a quantized model.
Floating-point precision of a GPU primitive is selected based on operation precision in the IR, except for the [compressed f16 IR form](../model_representation.md), which is executed in f16 precision.
> **NOTE**: Hardware acceleration for i8/u8 precision may be unavailable on some platforms. In that case, the model is executed in the floating-point precision taken from the IR. Hardware support of u8/i8 acceleration can be queried via the `ov::device::capabilities` property.
[Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) can be used to print out supported data types for all detected devices.
## Supported features
### Multi-device execution
If a machine has multiple GPUs (for example, an integrated and a discrete Intel GPU), then any supported model can be executed on all GPUs simultaneously.
This can be achieved by specifying `"MULTI:GPU.1,GPU.0"` as a target device.
@snippet snippets/gpu/compile_model.cpp compile_model_multi
See [Multi-device execution page](../multi_device.md) for more details.
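For illustration, a minimal sketch of such a call (the model path is a placeholder; device ids follow the naming convention above):
```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    // Run the same model on the discrete (GPU.1) and integrated (GPU.0) GPUs simultaneously.
    auto compiled_model = core.compile_model(model, "MULTI:GPU.1,GPU.0");
    return 0;
}
```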
### Automatic batching
The GPU plugin is capable of reporting the `ov::max_batch_size` and `ov::optimal_batch_size` metrics with respect to the current hardware platform and model.
Thus, automatic batching can be applied in cases when `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` is set
or the device is specified as `"BATCH:GPU"`.
@sphinxdirective
.. tab:: Batching via BATCH plugin
.. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
:language: cpp
:fragment: [compile_model_batch_plugin]
.. tab:: Batching via throughput hint
.. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
:language: cpp
:fragment: [compile_model_auto_batch]
@endsphinxdirective
See [Automatic batching page](../automatic_batching.md) for more details.
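A hedged sketch of the two approaches described above (the model path is a placeholder):
```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");

    // Explicit batching via the BATCH virtual device.
    auto compiled_batch = core.compile_model(model, "BATCH:GPU");

    // Implicit batching via the throughput performance hint.
    auto compiled_tput = core.compile_model(model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    return 0;
}
```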
### Multi-stream execution
If either the `ov::num_streams(n_streams)` property with `n_streams > 1` or the `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` property is set for the GPU plugin,
then multiple streams are created for the model. In the case of the GPU plugin, each stream has its own host thread and an associated OpenCL queue,
which means that incoming infer requests can be processed simultaneously.
> **NOTE**: Simultaneous scheduling of kernels to different queues doesn't mean that the kernels are actually executed in parallel on the GPU device. The actual behavior depends on the hardware architecture, and in some cases the execution may be serialized inside the GPU driver.
When multiple inferences of the same model need to be executed in parallel, the multi-stream feature is preferable to multiple instances of the model or application,
since the implementation of streams in the GPU plugin supports weights memory sharing across streams, so memory consumption may be lower compared to the other approaches.
See [optimization guide](@ref openvino_docs_deployment_optimization_guide_dldt_optimization_guide) for more details.
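For reference, a minimal sketch of enabling multiple streams explicitly (the stream count and model path are arbitrary examples):
```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    // Two streams -> two host threads and two OpenCL queues for this model.
    auto compiled_model = core.compile_model(model, "GPU", ov::num_streams(2));
    return 0;
}
```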
### Dynamic shapes
The GPU plugin supports dynamic shapes for the batch dimension only (specified as `N` in the [layout terms](../layout_overview.md)) with a fixed upper bound. Any other dynamic dimensions are unsupported. Internally, the GPU plugin creates
`log2(N)` (where `N` is the upper bound for the batch dimension) low-level execution graphs for batch sizes equal to powers of 2, to emulate dynamic behavior, so that an incoming infer request with a specific batch size is executed via a minimal combination of internal networks.
For example, batch size 33 may be executed via 2 internal networks with batch sizes 32 and 1.
> **NOTE**: Such an approach requires much more memory, and the overall model compilation time is significantly longer compared to the static batch scenario.
The code snippet below demonstrates how to use dynamic batch in simple scenarios:
@snippet snippets/gpu/dynamic_batch.cpp dynamic_batch
See [dynamic shapes guide](../ov_dynamic_shapes.md) for more details.
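A minimal sketch of setting a bounded dynamic batch, assuming a single-input 4D NCHW model and an upper bound of 32 (both are illustrative assumptions):
```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    // Batch dimension is dynamic with an upper bound of 32; other dimensions stay static.
    model->reshape(ov::PartialShape{ov::Dimension(1, 32), 3, 224, 224});
    auto compiled_model = core.compile_model(model, "GPU");
    return 0;
}
```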
### Preprocessing acceleration
The GPU plugin has the following additional preprocessing options:
- `ov::intel_gpu::memory_type::surface` and `ov::intel_gpu::memory_type::buffer` values for the `ov::preprocess::InputTensorInfo::set_memory_type()` preprocessing method. These values are intended to provide a hint to the plugin about the type of input tensors that will be set at runtime, so that proper kernels are generated.
@snippet snippets/gpu/preprocessing.cpp init_preproc
With such preprocessing, the GPU plugin expects `ov::intel_gpu::ocl::ClImage2DTensor` (or a derived class) to be passed for each NV12 plane via the `ov::InferRequest::set_tensor()` or `ov::InferRequest::set_tensors()` methods.
Refer to [RemoteTensor API](./GPU_RemoteTensor_API.md) for usage examples.
See [preprocessing API guide](../preprocessing_overview.md) for more details.
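The referenced snippet is not shown here; a hedged sketch of the surface memory-type hint, assuming an NV12 two-plane input with `"y"`/`"uv"` plane names and an NCHW model layout:
```cpp
#include <openvino/openvino.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>
#include <openvino/runtime/intel_gpu/remote_properties.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");

    ov::preprocess::PrePostProcessor ppp(model);
    ppp.input().tensor()
        .set_element_type(ov::element::u8)
        .set_color_format(ov::preprocess::ColorFormat::NV12_TWO_PLANES, {"y", "uv"})
        .set_memory_type(ov::intel_gpu::memory_type::surface);  // hint: inputs will be GPU surfaces
    ppp.input().preprocess().convert_color(ov::preprocess::ColorFormat::BGR);
    ppp.input().model().set_layout("NCHW");
    model = ppp.build();

    auto compiled_model = core.compile_model(model, "GPU");
    return 0;
}
```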
### Models caching
The cache for the GPU plugin may be enabled via the common OpenVINO `ov::cache_dir` property. The GPU plugin implementation supports only caching of compiled kernels,
so all plugin-specific model transformations are executed on each `ov::Core::compile_model()` call regardless of the `cache_dir` option. Still, since
kernel compilation is a bottleneck in the model loading process, a significant load time reduction can be achieved with the `ov::cache_dir` property enabled.
See [Model caching overview page](../Model_caching_overview.md) for more details.
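A minimal sketch (the cache directory path is a placeholder):
```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Enable caching of compiled GPU kernels; the directory is an example path.
    core.set_property(ov::cache_dir("/tmp/ov_gpu_cache"));
    // The first compile_model() call populates the cache;
    // subsequent calls reuse the precompiled kernels and load much faster.
    auto compiled_model = core.compile_model("model.xml", "GPU");
    return 0;
}
```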
### Extensibility
See [GPU Extensibility](@ref openvino_docs_Extensibility_UG_GPU) page.
### GPU context and memory sharing via RemoteTensor API
See [RemoteTensor API of GPU Plugin](GPU_RemoteTensor_API.md).
### Fusing Pooling and FakeQuantize Layers
## Supported properties
The plugin supports the properties listed below.
A combination of Pooling and FakeQuantize layers results in a single fused layer called *Pooling*:
![pooling_fakequant_01]
### Read-write properties
All parameters must be set before calling `ov::Core::compile_model()` in order to take effect, or passed as an additional argument to `ov::Core::compile_model()`.
### Fusing Activation Layers
- ov::cache_dir
- ov::enable_profiling
- ov::hint::model_priority
- ov::hint::performance_mode
- ov::hint::num_requests
- ov::num_streams
- ov::compilation_num_threads
- ov::device::id
- ov::intel_gpu::hint::host_task_priority
- ov::intel_gpu::hint::queue_priority
- ov::intel_gpu::hint::queue_throttle
- ov::intel_gpu::enable_loop_unrolling
Given the linear pattern, an Activation layer can be fused into other layers:
### Read-only properties
- ov::supported_properties
- ov::available_devices
- ov::range_for_async_infer_requests
- ov::range_for_streams
- ov::optimal_batch_size
- ov::max_batch_size
- ov::device::full_name
- ov::device::type
- ov::device::gops
- ov::device::capabilities
- ov::intel_gpu::device_total_mem_size
- ov::intel_gpu::uarch_version
- ov::intel_gpu::execution_units_count
- ov::intel_gpu::memory_statistics
![fullyconnected_activation_01]
## Limitations
In some cases, the GPU plugin may implicitly execute several primitives on the CPU using internal implementations, which may lead to an increase in CPU utilization.
Below is the list of such operations:
- Proposal
- NonMaxSuppression
- DetectionOutput
### Fusing Convolution and Sum Layers
A combination of Convolution, Simple, and Eltwise layers with the sum operation results in a single layer called *Convolution*:
![conv_sum_relu_01]
### Fusing a Group of Convolutions
If a topology contains the following pipeline, a GPU plugin merges Split, Convolution, and Concatenation layers into a single Convolution layer with the group parameter:
> **NOTE**: Parameters of the Convolution layers must coincide.
![group_convolutions_01]
### Optimizing Layers Out
The following layers are optimized out under certain conditions:
* Crop
* Concatenate
* Reshape
* Flatten
* Split
* Copy
### Load-Time Execution
Some layers are executed during the load time, not during the inference. One of such layers is PriorBox.
## CPU Executed Layers
The following layers are not accelerated on the GPU and executed on the host CPU instead:
* Proposal
* NonMaxSuppression
* PriorBox
* DetectionOutput
## Supported Configuration Parameters
The plugin supports the configuration parameters listed below.
All parameters must be set before calling <code>InferenceEngine::Core::LoadNetwork()</code> in order to take effect.
When specifying key values as raw strings (that is, when using Python API), omit the `KEY_` prefix.
| Parameter Name | Parameter Values | Default | Description |
|---------------------|-----------------------------|-----------------|-----------------------------------------------------------|
| `KEY_CACHE_DIR` | `"<cache_dir>"` | `""` | Specifies a directory where compiled OCL binaries can be cached. First model loading generates the cache, and all subsequent LoadNetwork calls use precompiled kernels which significantly improves load time. If empty - caching is disabled |
| `KEY_PERF_COUNT` | `YES` / `NO` | `NO` | Collect performance counters during inference |
| `KEY_CONFIG_FILE` | `"<file1> [<file2> ...]"` | `""` | Load custom layer configuration files |
| `KEY_GPU_HOST_`<br>`TASK_PRIORITY` | `GPU_HOST_TASK_PRIORITY_<HIGH\|MEDIUM\|LOW>` | `GPU_HOST_TASK_PRIORITY_MEDIUM` | This key instructs the GPU plugin which cpu core type of TBB affinity used in load network. <br> This option has 3 types of levels: HIGH, LOW, and ANY. It is only affected on Hybrid CPUs. <br>- LOW - instructs the GPU Plugin to use LITTLE cores if they are available <br>- MEDIUM (DEFAULT) - instructs the GPU Plugin to use any available cores (BIG or LITTLE cores) <br>- HIGH - instructs the GPU Plugin to use BIG cores if they are available |
| `KEY_GPU_PLUGIN_`<br>`PRIORITY` | `<0-3>` | `0` | OpenCL queue priority (before usage, make sure your OpenCL driver supports appropriate extension)<br> Higher value means higher priority for OpenCL queue. 0 disables the setting. **Deprecated**. Please use KEY_GPU_MODEL_PRIORITY |
| `KEY_GPU_PLUGIN_`<br>`THROTTLE` | `<0-3>` | `2` | OpenCL queue throttling (before usage, make sure your OpenCL driver supports appropriate extension)<br> Lower value means lower driver thread priority and longer sleep time for it. Has no effect if the driver does not support required hint. |
| `KEY_CLDNN_ENABLE_`<br>`FP16_FOR_QUANTIZED_`<br>`MODELS` | `YES` / `NO` | `YES` | Allows using FP16+INT8 mixed precision mode, so non-quantized parts of a model will be executed in FP16 precision for FP16 IR. Does not affect quantized FP32 IRs |
| `KEY_GPU_NV12_`<br>`TWO_INPUTS` | `YES` / `NO` | `NO` | Controls preprocessing logic for nv12 input. If it's set to YES, then device graph will expect that user will set biplanar nv12 blob as input which will be directly passed to device execution graph. Otherwise, preprocessing via GAPI is used to convert NV12->BGR, thus the GPU graph has to expect a single input |
| `KEY_GPU_THROUGHPUT_`<br>`STREAMS` | `KEY_GPU_THROUGHPUT_AUTO`, or positive integer| 1 | Specifies a number of GPU "execution" streams for the throughput mode (upper bound for a number of inference requests that can be executed simultaneously).<br>This option is can be used to decrease GPU stall time by providing more effective load from several streams. Increasing the number of streams usually is more effective for smaller topologies or smaller input sizes. Note that your application should provide enough parallel slack (e.g. running many inference requests) to leverage full GPU bandwidth. Additional streams consume several times more GPU memory, so make sure the system has enough memory available to suit parallel stream execution. Multiple streams might also put additional load on CPU. If CPU load increases, it can be regulated by setting an appropriate `KEY_GPU_PLUGIN_THROTTLE` option value (see above). If your target system has relatively weak CPU, keep throttling low. <br>The default value is 1, which implies latency-oriented behavior.<br>`KEY_GPU_THROUGHPUT_AUTO` creates bare minimum of streams to improve the performance; this is the most portable option if you are not sure how many resources your target machine has (and what would be the optimal number of streams). <br> A positive integer value creates the requested number of streams. |
| `KEY_EXCLUSIVE_ASYNC_`<br>`REQUESTS` | `YES` / `NO` | `NO` | Forces async requests (also from different executable networks) to execute serially.|
| `KEY_GPU_MAX_NUM_`<br>`THREADS` | `integer value` | `maximum # of HW threads available in host environment` | Specifies the number of CPU threads that can be used for GPU engine, e.g, JIT compilation of GPU kernels or cpu kernel processing within GPU plugin. The default value is set as the number of maximum available threads in host environment to minimize the time for LoadNetwork, where the GPU kernel build time occupies a large portion. Note that if the specified value is larger than the maximum available # of threads or less than zero, it is set as maximum available # of threads. It can be specified with a smaller number than the available HW threads according to the usage scenario, e.g., when the user wants to assign more CPU threads while GPU plugin is running. Note that setting this value with lower number will affect not only the network loading time but also the cpu layers of GPU networks that are optimized with multi-threading. |
| `KEY_GPU_ENABLE_`<br>`LOOP_UNROLLING` | `YES` / `NO` | `YES` | Enables recurrent layers such as TensorIterator or Loop with fixed iteration count to be unrolled. It is turned on by default. Turning this key on will achieve better inference performance for loops with not too many iteration counts (less than 16, as a rule of thumb). Turning this key off will achieve better performance for both graph loading time and inference time with many iteration counts (greater than 16). Note that turning this key on will increase the graph loading time in proportion to the iteration counts. Thus, this key should be turned off if graph loading time is considered to be most important target to optimize. |
| `KEY_CLDNN_PLUGIN_`<br>`PRIORITY` | `<0-3>` | `0` | OpenCL queue priority (before usage, make sure your OpenCL driver supports appropriate extension)<br> Higher value means higher priority for OpenCL queue. 0 disables the setting. **Deprecated**. Please use KEY_GPU_MODEL_PRIORITY |
| `KEY_CLDNN_PLUGIN_`<br>`THROTTLE` | `<0-3>` | `0` | OpenCL queue throttling (before usage, make sure your OpenCL driver supports appropriate extension)<br> Lower value means lower driver thread priority and longer sleep time for it. 0 disables the setting. **Deprecated**. Please use KEY_GPU_PLUGIN_THROTTLE |
| `KEY_CLDNN_GRAPH_`<br>`DUMPS_DIR` | `"<dump_dir>"` | `""` | clDNN graph optimizer stages dump output directory (in GraphViz format) **Deprecated**. Will be removed in the next release |
| `KEY_CLDNN_SOURCES_`<br>`DUMPS_DIR` | `"<dump_dir>"` | `""` | Final optimized clDNN OpenCL sources dump output directory. **Deprecated**. Will be removed in the next release |
| `KEY_DUMP_KERNELS` | `YES` / `NO` | `NO` | Dump the final kernels used for custom layers. **Deprecated**. Will be removed in the next release |
| `KEY_TUNING_MODE` | `TUNING_DISABLED` <br /> `TUNING_CREATE` <br /> `TUNING_USE_EXISTING` | `TUNING_DISABLED` | Disable inference kernel tuning <br /> Create tuning file (expect much longer runtime) <br /> Use an existing tuning file. **Deprecated**. Will be removed in the next release |
| `KEY_TUNING_FILE` | `"<filename>"` | `""` | Tuning file to create / use. **Deprecated**. Will be removed in the next release |
## Querying GPU-specific metric keys
* MEMORY_STATISTICS : Returns overall memory statistics of `GPU` device allocated by engine with allocation types. If the network has `TensorIterator` or `Loop` operation which is not unrolled, there will be additional allocation at the first inference phase. In such a case, querying for `MEMORY_STATISTICS` should be done after first inference for more accurate result. The code below demonstrates how to query overall memory statistics of `GPU` device:
@snippet snippets/GPU_Metric0.cpp part0
* MAX_BATCH_SIZE : Returns maximum batch size for a given network which is not only executable but also does not lose performance due to the memory swap impact. Note that the returned value may not be aligned to a power of 2. Also, MODEL_PTR is the required option for this metric since the available max batch size depends on the model size. If the MODEL_PTR is not given, it will return 1. The example code to set the required and optional configs for this metric is available in the following snippet:
@snippet snippets/GPU_Metric1.cpp part1
* OPTIMAL_BATCH_SIZE : Returns _optimal_ batch size for a given network on the given GPU device. The returned value is aligned to power of 2. Also, MODEL_PTR is the required option for this metric since the optimal batch size highly depends on the model. If the MODEL_PTR is not given, the value of 1 is returned. The example code to set the required and optional configs for this metric is available in the following snippet:
@snippet snippets/GPU_Metric1.cpp part2
## GPU Context and Video Memory Sharing RemoteBlob API
See [RemoteBlob API of GPU Plugin](GPU_RemoteBlob_API.md)
The behavior depends on specific parameters of the operations and hardware configuration.
## See Also
* [Supported Devices](Supported_Devices.md)
[conv_simple_01]: ../img/conv_simple_01.png
[pooling_fakequant_01]: ../img/pooling_fakequant_01.png
[fullyconnected_activation_01]: ../img/fullyconnected_activation_01.png
[group_convolutions_01]: ../img/group_convolutions_01.png
[conv_sum_relu_01]: ../img/conv_sum_relu_01.png
* [Optimization guide](@ref openvino_docs_optimization_guide_dldt_optimization_guide)
* [GPU plugin developers documentation](https://github.com/openvinotoolkit/openvino/wiki/GPUPluginDevelopersDocs)

View File

@ -1,141 +0,0 @@
Remote Blob API of GPU Plugin {#openvino_docs_OV_UG_supported_plugins_GPU_RemoteBlob_API}
================================
The GPU plugin implementation of the `RemoteContext` and `RemoteBlob` interfaces supports GPU
pipeline developers who need video memory sharing and interoperability with existing native APIs
such as OpenCL\*, Microsoft DirectX\*, or VAAPI\*.
Using these interfaces allows you to avoid any memory copy overhead when plugging the OpenVINO™ inference
into an existing GPU pipeline. It also enables OpenCL kernels participating in the pipeline to become
native buffer consumers or producers of the OpenVINO™ inference.
Since the GPU plugin works on top of the clDNN library, the functionality above is also implemented
using OpenCL and its sharing extensions provided by Intel®.
There are two interoperability scenarios supported by the Remote Blob API:
* GPU plugin context and memory objects can be constructed from low-level device, display, or memory
handles and used to create the OpenVINO™ `ExecutableNetwork` or `Blob` class.
* OpenCL context or buffer handles can be obtained from existing GPU plugin objects, and used in OpenCL processing.
Class and function declarations for the API are defined in the following files:
* Windows\*: `gpu/gpu_context_api_ocl.hpp` and `gpu/gpu_context_api_dx.hpp`
* Linux\*: `gpu/gpu_context_api_ocl.hpp` and `gpu/gpu_context_api_va.hpp`
The most common way to enable the interaction of your application with the Remote Blob API is to use user-side utility classes
and functions that consume or produce native handles directly.
## Execution Context User-Side Wrappers
GPU plugin classes that implement the `RemoteContext` interface are responsible for context sharing.
Obtaining a pointer to a context object is the first step of sharing pipeline objects.
The context object of the GPU plugin directly wraps OpenCL context, setting a scope for sharing
`ExecutableNetwork` and `RemoteBlob` objects.
To create such objects within user context, explicitly provide the context to the plugin using the
`make_shared_context()` overloaded function. Depending on the platform, the function accepts the
`cl_context` handle, the pointer to the `ID3D11Device` interface, or the `VADisplay` handle, and
returns a smart pointer to the `RemoteContext` plugin object.
If you do not provide any user context, the plugin uses its default internal context.
The plugin attempts to use the same internal context object as long as plugin options are kept the same.
Therefore, all ExecutableNetwork objects created during this time share the same context.
Once the plugin options are changed, the internal context is replaced by the new one.
To request the current default context of the plugin, call the `GetDefaultContext()` method of the core engine.
To request the internal context of the given `ExecutableNetwork`, use the `GetContext()` method.
## Shared Blob User-Side Wrappers
The classes that implement the `RemoteBlob` interface are both wrappers for native API
memory handles (which can be obtained from them at any time) and act just like regular OpenVINO™
`Blob` objects.
Once you obtain the context, you can use it to compile a new `ExecutableNetwork` or create `RemoteBlob`
objects.
For network compilation, use a dedicated flavor of `LoadNetwork()`, which accepts the context as an
additional parameter.
To create a shared blob from a native memory handle, use `make_shared_blob()` overloaded functions
that can accept the `cl::Buffer`, `cl::Image2D`, `cl_mem` handles, and either `ID3D11Buffer`,
`ID3D11Texture2D` pointers or the `VASurfaceID` handle.
All `make_shared_blob()` flavors return a smart pointer to the `Blob` object, which can be directly
passed to the `SetBlob() `method of an inference request object.
## Direct NV12 video surface input
To support the direct consumption of a hardware video decoder output, plugin accepts two-plane video
surfaces as arguments for the `make_shared_blob_nv12()` function, which creates an `NV12Blob` object
and returns a smart pointer to it, which is cast to `Blob::Ptr`.
To ensure that the plugin generates the correct execution graph for the NV12 dual-plane input, set
the `CLDNNConfigParams::KEY_CLDNN_NV12_TWO_INPUTS` plugin configuration flag to `PluginConfigParams::YES`.
## Context & queue sharing
GPU plugin supports creation of shared context from `cl_command_queue` handle. In that case
opencl context handle is extracted from given queue via OpenCL™ API, and the queue itself is used inside
the plugin for further execution of inference primitives. Sharing of the queue changes behavior of `StartAsync()`
method to guarantee that submission of inference primitives into given queue is finished before
returning of control back to calling thread.
This sharing mechanism allows to do pipeline synchronization on app side and avoid blocking of host thread
on waiting for completion of inference. Pseudocode may look as follows:
@snippet snippets/GPU_RemoteBlob_API3.cpp part0
### Limitations
- Some primitives in GPU plugin may block host thread on waiting for previous primitives before adding its kernels
to the command queue. In such cases `StartAsync()` call takes much more time to return control to the calling thread
as internally it waits for partial or full network completion.
Examples of operations: Loop, TensorIterator, DetectionOutput, NonMaxSuppression
- Synchronization of pre/post processing jobs and inference pipeline inside shared queue is the user responsibility
- Throughput mode is not available when queue sharing is used, i.e. only single stream can be used for each executable network.
## Low-Level Methods and Their Parameter Description
The high-level wrappers above bring a direct dependency on native APIs to the user program.
If you want to avoid the dependency, you still can directly use the `CreateContext()`,
`CreateBlob()`, and `getParams()` methods.
On this level, native handles are re-interpreted as void pointers and all arguments are passed
using `std::map` containers that are filled with `std::string, InferenceEngine::Parameter` pairs.
Two types of map entries are possible: descriptor and container. The first map entry is a
descriptor, which sets the expected structure and possible parameter values of the map.
**Parameter Map Entries**
| Key Name | Description and Possible Parameter Values |
|----------------|---------------------------------------------------------------------|
| `CONTEXT_TYPE` | Describes the type of the shared context in a map. Can be `OCL` (for pure OpenCL context) or `VA_SHARED` (for context shared with a video decoding device). |
| `OCL_CONTEXT` | Contains the OpenCL context handle. |
| `OCL_QUEUE` | Contains the OpenCL queue handle if queue sharing is needed. |
| `VA_DEVICE` | Contains the native video decoding device handle. Can be `VADisplay` or `ID3D11Device` (a pointer). |
| `SHARED_MEM_TYPE` | Describes the type of the shared memory buffer in a map. Can be `OCL_BUFFER` (clBuffer), `OCL_IMAGE2D` (clImage2D), `VA_SURFACE()`, or `DX_BUFFER`. |
| `MEM_HANDLE` | Contains the OpenCL memory handle. |
| `DEV_OBJECT_HANDLE` | Contains the native video decoder surface handle. |
| `VA_PLANE` | Contains the NV12 video decoder surface plane index. Can be `0` or `1`. |
> **NOTE**: To initialize the entry key and value, use the `GPU_PARAM_KEY()` or `GPU_PARAM_VALUE()` macro.
## Examples
Refer to the sections below to see pseudo-code of usage examples.
> **NOTE**: For low-level parameter usage examples, see the source code of user-side wrappers from the include files mentioned above.
### OpenCL Kernel Execution on a Shared Buffer
This example uses the OpenCL context obtained from an executable network object.
@snippet snippets/GPU_RemoteBlob_API0.cpp part0
### Running GPU Plugin Inference within User-Supplied Shared Context
@snippet snippets/GPU_RemoteBlob_API1.cpp part1
### Direct Consuming of the NV12 VAAPI Video Decoder Surface on Linux
@snippet snippets/GPU_RemoteBlob_API2.cpp part2
## See Also
* InferenceEngine::Core
* InferenceEngine::RemoteBlob

View File

@ -0,0 +1,324 @@
Remote Tensor API of GPU Plugin {#openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API}
================================
The GPU plugin implementation of the `ov::RemoteContext` and `ov::RemoteTensor` interfaces supports GPU
pipeline developers who need video memory sharing and interoperability with existing native APIs
such as OpenCL\*, Microsoft DirectX\*, or VAAPI\*.
Using these interfaces allows you to avoid any memory copy overhead when plugging the OpenVINO™ inference
into an existing GPU pipeline. It also enables OpenCL kernels participating in the pipeline to become
native buffer consumers or producers of the OpenVINO™ inference.
There are two interoperability scenarios supported by the Remote Tensor API:
* GPU plugin context and memory objects can be constructed from low-level device, display, or memory
handles and used to create the OpenVINO™ `ov::CompiledModel` or `ov::Tensor` objects.
* OpenCL context or buffer handles can be obtained from existing GPU plugin objects, and used in OpenCL processing on the application side.
Class and function declarations for the API are defined in the following files:
* Windows\*: `openvino/runtime/intel_gpu/ocl/ocl.hpp` and `openvino/runtime/intel_gpu/ocl/dx.hpp`
* Linux\*: `openvino/runtime/intel_gpu/ocl/ocl.hpp` and `openvino/runtime/intel_gpu/ocl/va.hpp`
The most common way to enable the interaction of your application with the Remote Tensor API is to use user-side utility classes
and functions that consume or produce native handles directly.
## Context sharing between application and GPU plugin
GPU plugin classes that implement the `ov::RemoteContext` interface are responsible for context sharing.
Obtaining a context object is the first step of sharing pipeline objects.
The context object of the GPU plugin directly wraps OpenCL context, setting a scope for sharing
`ov::CompiledModel` and `ov::RemoteTensor` objects. An `ov::RemoteContext` object can be either created on top of
an existing handle from a native API or retrieved from the GPU plugin.
Once you obtain the context, you can use it to compile a new `ov::CompiledModel` or create `ov::RemoteTensor`
objects.
For network compilation, use a dedicated flavor of `ov::Core::compile_model()`, which accepts the context as an
additional parameter.
### Creation of RemoteContext from native handle
To create an `ov::RemoteContext` object for a user context, explicitly provide the context to the plugin using the constructor for one
of the `ov::RemoteContext` derived classes.
@sphinxdirective
.. tab:: Linux
.. tab:: Create from cl_context
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [context_from_cl_context]
.. tab:: Create from cl_queue
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [context_from_cl_queue]
.. tab:: Create from VADisplay
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [context_from_va_display]
.. tab:: Windows
.. tab:: Create from cl_context
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [context_from_cl_context]
.. tab:: Create from cl_queue
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [context_from_cl_queue]
.. tab:: Create from ID3D11Device
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [context_from_d3d_device]
@endsphinxdirective
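The referenced snippets are not visible in this diff; a hedged sketch of the `cl_context` case, where `get_my_cl_context()` is a hypothetical application-side helper:
```cpp
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>

// Hypothetical helper returning an OpenCL context owned by the application.
cl_context get_my_cl_context();

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");

    // Wrap the application's OpenCL context into a GPU plugin remote context.
    ov::intel_gpu::ocl::ClContext gpu_context(core, get_my_cl_context());

    // Compile the model within the shared context.
    auto compiled_model = core.compile_model(model, gpu_context);
    return 0;
}
```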
### Getting RemoteContext from the plugin
If you do not provide any user context, the plugin uses its default internal context.
The plugin attempts to use the same internal context object as long as plugin options are kept the same.
Therefore, all `ov::CompiledModel` objects created during this time share the same context.
Once the plugin options are changed, the internal context is replaced by the new one.
To request the current default context of the plugin, use one of the following methods:
@sphinxdirective
.. tab:: Get context from Core
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [default_context_from_core]
.. tab:: Get context from CompiledModel
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [default_context_from_model]
@endsphinxdirective
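A brief sketch of both retrieval paths, assuming a placeholder model path:
```cpp
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    auto compiled_model = core.compile_model(model, "GPU");

    // Default context of the GPU plugin, obtained from the Core object.
    auto ctx_from_core = core.get_default_context("GPU").as<ov::intel_gpu::ocl::ClContext>();

    // Context associated with a particular compiled model.
    auto ctx_from_model = compiled_model.get_context().as<ov::intel_gpu::ocl::ClContext>();
    return 0;
}
```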
## Memory sharing between application and GPU plugin
The classes that implement the `ov::RemoteTensor` interface are the wrappers for native API
memory handles (which can be obtained from them at any time).
To create a shared tensor from a native memory handle, use the dedicated `create_tensor` or `create_tensor_nv12` methods
of the `ov::RemoteContext` sub-classes.
`ov::intel_gpu::ocl::ClContext` has multiple overloads of `create_tensor` methods which allow wrapping pre-allocated native handles with an `ov::RemoteTensor`
object or requesting the plugin to allocate specific device memory. See the code snippets below for more details.
@sphinxdirective
.. tab:: Wrap native handles
.. tab:: USM pointer
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [wrap_usm_pointer]
.. tab:: cl_mem
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [wrap_cl_mem]
.. tab:: cl::Buffer
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [wrap_cl_buffer]
.. tab:: cl::Image2D
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [wrap_cl_image]
.. tab:: biplanar NV12 surface
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [wrap_nv12_surface]
.. tab:: Allocate device memory
.. tab:: USM host memory
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [allocate_usm_host]
.. tab:: USM device memory
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [allocate_usm_device]
.. tab:: cl::Buffer
.. doxygensnippet:: docs/snippets/gpu/remote_objects_creation.cpp
:language: cpp
:fragment: [allocate_cl_buffer]
@endsphinxdirective
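For reference, a hedged sketch of wrapping an existing `cl_mem` handle and of asking the plugin to allocate USM host memory; `get_my_cl_buffer()` is a hypothetical application helper:
```cpp
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>

// Hypothetical helper returning a pre-allocated OpenCL buffer.
cl_mem get_my_cl_buffer();

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    auto compiled_model = core.compile_model(model, "GPU");
    auto gpu_context = compiled_model.get_context().as<ov::intel_gpu::ocl::ClContext>();

    auto input = model->input();
    // Wrap an existing cl_mem handle into a remote tensor (no copy is made).
    auto shared_tensor = gpu_context.create_tensor(input.get_element_type(), input.get_shape(), get_my_cl_buffer());

    // Alternatively, request the plugin to allocate USM host memory.
    auto usm_host_tensor = gpu_context.create_usm_host_tensor(input.get_element_type(), input.get_shape());
    return 0;
}
```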
`ov::intel_gpu::ocl::D3DContext` and `ov::intel_gpu::ocl::VAContext` classes are derived from `ov::intel_gpu::ocl::ClContext`,
thus they provide the functionality described above and extend it
to allow creation of `ov::RemoteTensor` objects from `ID3D11Buffer`, `ID3D11Texture2D` pointers or the `VASurfaceID` handle, respectively.
## Direct NV12 video surface input
To support the direct consumption of a hardware video decoder output, the plugin accepts two-plane video
surfaces as arguments for the `create_tensor_nv12()` function, which creates a pair of `ov::RemoteTensor`
objects representing the Y and UV planes.
To ensure that the plugin generates the correct execution graph for the NV12 dual-plane input, static preprocessing
should be added before model compilation:
@snippet snippets/gpu/preprocessing.cpp init_preproc
Since `ov::intel_gpu::ocl::ClImage2DTensor` (and derived classes) doesn't support batched surfaces, in cases when batching and surface sharing are required
at the same time, the user needs to set inputs via the `ov::InferRequest::set_tensors` method with a vector of shared surfaces for each plane:
@sphinxdirective
.. tab:: Single batch
.. doxygensnippet:: docs/snippets/gpu/preprocessing.cpp
:language: cpp
:fragment: [single_batch]
.. tab:: Multiple batches
.. doxygensnippet:: docs/snippets/gpu/preprocessing.cpp
:language: cpp
:fragment: [batched_case]
@endsphinxdirective
The I420 color format can be processed in a similar way.
## Context & queue sharing
The GPU plugin supports creation of a shared context from a `cl_command_queue` handle. In that case,
the OpenCL context handle is extracted from the given queue via the OpenCL™ API, and the queue itself is used inside
the plugin for further execution of inference primitives. Sharing the queue changes the behavior of the `ov::InferRequest::start_async()`
method to guarantee that submission of inference primitives into the given queue is finished before
control is returned to the calling thread.
This sharing mechanism allows you to perform pipeline synchronization on the application side and avoid blocking the host thread
while waiting for the completion of inference. Pseudocode may look as follows:
@sphinxdirective
.. raw:: html
<div class="collapsible-section" data-title="Queue and context sharing example">
@endsphinxdirective
@snippet snippets/gpu/queue_sharing.cpp queue_sharing
@sphinxdirective
.. raw:: html
</div>
@endsphinxdirective
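A condensed sketch of the queue-sharing setup, where `get_my_cl_queue()` is a hypothetical application helper:
```cpp
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>

// Hypothetical helper returning the application's OpenCL command queue.
cl_command_queue get_my_cl_queue();

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");

    // The OpenCL context is extracted from the queue; the queue itself is reused by the plugin.
    ov::intel_gpu::ocl::ClContext gpu_context(core, get_my_cl_queue());
    auto compiled_model = core.compile_model(model, gpu_context);
    auto infer_request = compiled_model.create_infer_request();

    // With a shared queue, start_async() returns only after all inference
    // primitives have been submitted to that queue.
    infer_request.start_async();
    infer_request.wait();
    return 0;
}
```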
### Limitations
- Some primitives in the GPU plugin may block the host thread while waiting for previous primitives before adding their kernels
to the command queue. In such cases, the `ov::InferRequest::start_async()` call takes much more time to return control to the calling thread
as internally it waits for partial or full network completion.
Examples of such operations: Loop, TensorIterator, DetectionOutput, NonMaxSuppression
- Synchronization of pre/post-processing jobs and the inference pipeline inside a shared queue is the user's responsibility
- Throughput mode is not available when queue sharing is used, i.e., only a single stream can be used for each compiled model.
## Low-Level Methods for RemoteContext and RemoteTensor creation
The high-level wrappers above bring a direct dependency on native APIs to the user program.
If you want to avoid the dependency, you can still directly use the `ov::Core::create_context()`,
`ov::RemoteContext::create_tensor()`, and `ov::RemoteContext::get_params()` methods.
On this level, native handles are re-interpreted as void pointers and all arguments are passed
using `ov::AnyMap` containers that are filled with `std::string, ov::Any` pairs.
Two types of map entries are possible: descriptor and container. The first map entry is a
descriptor, which sets the expected structure and possible parameter values of the map.
Refer to `openvino/runtime/intel_gpu/remote_properties.hpp` header file for possible low-level properties and their description.
## Examples
Refer to the sections below to see pseudo-code of usage examples.
> **NOTE**: For low-level parameter usage examples, see the source code of user-side wrappers from the include files mentioned above.
@sphinxdirective
.. raw:: html
<div class="collapsible-section" data-title="OpenCL Kernel Execution on a Shared Buffer">
@endsphinxdirective
This example uses the OpenCL context obtained from a compiled model object.
@snippet snippets/gpu/context_sharing.cpp context_sharing_get_from_ov
@sphinxdirective
.. raw:: html
</div>
@endsphinxdirective
@sphinxdirective
.. raw:: html
<div class="collapsible-section" data-title="Running GPU Plugin Inference within User-Supplied Shared Context">
@endsphinxdirective
@snippet snippets/gpu/context_sharing.cpp context_sharing_user_handle
@sphinxdirective
.. raw:: html
</div>
@endsphinxdirective
@sphinxdirective
.. raw:: html
<div class="collapsible-section" data-title="Direct Consuming of the NV12 VAAPI Video Decoder Surface on Linux">
@endsphinxdirective
@snippet snippets/gpu/context_sharing_va.cpp context_sharing_va
@sphinxdirective
.. raw:: html
</div>
@endsphinxdirective
## See Also
* ov::Core
* ov::RemoteTensor

View File

@ -13,8 +13,8 @@ The OpenVINO Runtime provides unique capabilities to infer deep learning models
|[CPU plugin](CPU.md) |Intel&reg; Xeon&reg; with Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and AVX512_BF16, Intel&reg; Core&trade; Processors with Intel&reg; AVX2, Intel&reg; Atom&reg; Processors with Intel® Streaming SIMD Extensions (Intel® SSE) |
|[VPU plugins](VPU.md) (available in the Intel® Distribution of OpenVINO™ toolkit) |Intel® Neural Compute Stick 2 powered by the Intel® Movidius™ Myriad™ X, Intel® Vision Accelerator Design with Intel® Movidius™ VPUs |
|[GNA plugin](GNA.md) (available in the Intel® Distribution of OpenVINO™ toolkit) |Intel&reg; Speech Enabling Developer Kit, Amazon Alexa* Premium Far-Field Developer Kit, Intel&reg; Pentium&reg; Silver J5005 Processor, Intel&reg; Pentium&reg; Silver N5000 Processor, Intel&reg; Celeron&reg; J4005 Processor, Intel&reg; Celeron&reg; J4105 Processor, Intel&reg; Celeron&reg; Processor N4100, Intel&reg; Celeron&reg; Processor N4000, Intel&reg; Core&trade; i3-8121U Processor, Intel&reg; Core&trade; i7-1065G7 Processor, Intel&reg; Core&trade; i7-1060G7 Processor, Intel&reg; Core&trade; i5-1035G4 Processor, Intel&reg; Core&trade; i5-1035G7 Processor, Intel&reg; Core&trade; i5-1035G1 Processor, Intel&reg; Core&trade; i5-1030G7 Processor, Intel&reg; Core&trade; i5-1030G4 Processor, Intel&reg; Core&trade; i3-1005G1 Processor, Intel&reg; Core&trade; i3-1000G1 Processor, Intel&reg; Core&trade; i3-1000G4 Processor|
|[Multi-Device execution](../multi_device.md) |Multi-Device execution enables simultaneous inference of the same model on several devices in parallel |
|[Auto-Device plugin](../auto_device_selection.md) |Auto-Device plugin enables selecting Intel&reg; device for inference automatically |
|[Heterogeneous plugin](../hetero_execution.md) |Heterogeneous execution enables automatic inference splitting between several devices (for example if a device doesn't [support certain operation](#supported-layers)). |
Devices similar to the ones we have used for benchmarking can be accessed using [Intel® DevCloud for the Edge](https://devcloud.intel.com/edge/), a remote development environment with access to Intel® hardware and the latest versions of the Intel® Distribution of the OpenVINO™ Toolkit. [Learn more](https://devcloud.intel.com/edge/get_started/devcloud/) or [Register here](https://inteliot.force.com/DevcloudForEdge/s/).
@ -69,10 +69,9 @@ For example, the CHW value at index (c,h,w) is physically located at index (c\*H
|Plugin |FP32 |FP16 |I8 |
|:-------------|:----------------------:|:----------------------:|:----------------------:|
|CPU plugin |Supported and preferred |Supported |Supported |
|GPU plugin |Supported |Supported and preferred |Supported\* |
|GPU plugin |Supported |Supported and preferred |Supported |
|VPU plugins |Not supported |Supported |Not supported |
|GNA plugin |Supported |Supported |Not supported |
<br>\* - currently, only limited set of topologies might benefit from enabling I8 model on GPU<br>
For [Multi-Device](../multi_device.md) and [Heterogeneous](../hetero_execution.md) executions
the supported model formats depend on the actual underlying devices. _Generally, FP16 is preferable as it is most ubiquitous and performant_.

View File

@ -4,15 +4,55 @@
set(TARGET_NAME ie_docs_snippets)
file(GLOB SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/*.cpp")
file(GLOB SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/*.cpp"
"${CMAKE_CURRENT_SOURCE_DIR}/gpu/*.cpp")
# remove OpenCL related sources
# TODO: fix compilation of OpenCL files
if(NOT CLDNN__IOCL_ICD_INCDIRS OR TRUE)
list(REMOVE_ITEM SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/GPU_RemoteBlob_API0.cpp"
"${CMAKE_CURRENT_SOURCE_DIR}/GPU_RemoteBlob_API1.cpp"
"${CMAKE_CURRENT_SOURCE_DIR}/GPU_RemoteBlob_API2.cpp"
"${CMAKE_CURRENT_SOURCE_DIR}/GPU_RemoteBlob_API3.cpp")
find_package(OpenCL)
find_path(OpenCL_HPP_INCLUDE_DIR
NAMES
CL/cl2.hpp OpenCL/cl2.hpp
HINTS
${opencl_root_hints}
ENV "PROGRAMFILES(X86)"
ENV AMDAPPSDKROOT
ENV INTELOCLSDKROOT
ENV NVSDKCOMPUTE_ROOT
ENV CUDA_PATH
ENV ATISTREAMSDKROOT
ENV OCL_ROOT
PATH_SUFFIXES
include
OpenCL/common/inc
"AMD APP/include")
if(TARGET OpenCL)
# Use OpenCL CPP headers from sources if present
set(OpenCL_HEADERS ${OPENCL_HEADERS_DIR})
set(OpenCL_LIB "OpenCL")
elseif(OpenCL_HPP_INCLUDE_DIR)
# Append OpenCL CPP headers to C headers and use both
set(OpenCL_HEADERS ${OpenCL_INCLUDE_DIR} ${OpenCL_HPP_INCLUDE_DIR})
set(OpenCL_LIB "OpenCL::OpenCL")
endif()
# remove GPU remote snippets if OCL hasn't been found
if (NOT (OpenCL_FOUND AND OpenCL_HEADERS))
list(REMOVE_ITEM SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/gpu/context_sharing_va.cpp"
"${CMAKE_CURRENT_SOURCE_DIR}/gpu/context_sharing.cpp"
"${CMAKE_CURRENT_SOURCE_DIR}/gpu/preprocessing.cpp"
"${CMAKE_CURRENT_SOURCE_DIR}/gpu/queue_sharing.cpp"
"${CMAKE_CURRENT_SOURCE_DIR}/gpu/remote_objects_creation.cpp")
endif()
# try to find VA libraries
find_package(PkgConfig QUIET)
if(PkgConfig_FOUND)
pkg_search_module(LIBVA QUIET libva)
endif()
# TODO: pkg_search_module finds libva not in sysroot
if(ANDROID)
set(LIBVA_FOUND OFF CACHE BOOL "" FORCE)
endif()
# remove OpenCV related sources
@ -37,8 +77,20 @@ list(REMOVE_ITEM SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/dldt_optimization_guide2.c
add_library(${TARGET_NAME} STATIC ${SOURCES})
target_include_directories(${TARGET_NAME} PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}/../template_extension/new/")
if(CLDNN__IOCL_ICD_INCDIRS)
target_include_directories(${TARGET_NAME} SYSTEM PRIVATE ${CLDNN__IOCL_ICD_INCDIRS})
if(OpenCL_FOUND AND OpenCL_HEADERS)
target_link_libraries(${TARGET_NAME} PRIVATE ${OpenCL_LIB})
target_include_directories(${TARGET_NAME} PRIVATE ${OpenCL_HEADERS})
if(LIBVA_FOUND)
target_compile_definitions(${TARGET_NAME} PRIVATE ENABLE_LIBVA)
target_include_directories(${TARGET_NAME} PRIVATE ${LIBVA_INCLUDE_DIRS})
target_link_libraries(${TARGET_NAME} PRIVATE ${LIBVA_LINK_LIBRARIES})
endif()
if(WIN32)
target_compile_definitions(${TARGET_NAME} PRIVATE ENABLE_DX11)
target_link_libraries(${TARGET_NAME} PRIVATE d3d11 dxgi)
endif()
endif()
if(OpenCV_FOUND)

View File

@ -1,41 +0,0 @@
#include <ie_core.hpp>
int main() {
int FLAGS_bl = 1;
auto imagesData = std::vector<std::string>(2);
auto imagesData2 = std::vector<std::string>(4);
//! [part0]
int dynBatchLimit = FLAGS_bl; //take dynamic batch limit from command line option
// Read network model
InferenceEngine::Core core;
InferenceEngine::CNNNetwork network = core.ReadNetwork("sample.xml");
// enable dynamic batching and prepare for setting max batch limit
const std::map<std::string, std::string> dyn_config =
{ { InferenceEngine::PluginConfigParams::KEY_DYN_BATCH_ENABLED, InferenceEngine::PluginConfigParams::YES } };
network.setBatchSize(dynBatchLimit);
// create executable network and infer request
auto executable_network = core.LoadNetwork(network, "CPU", dyn_config);
auto infer_request = executable_network.CreateInferRequest();
// ...
// process a set of images
// dynamically set batch size for subsequent Infer() calls of this request
size_t batchSize = imagesData.size();
infer_request.SetBatch(batchSize);
infer_request.Infer();
// ...
// process another set of images
batchSize = imagesData2.size();
infer_request.SetBatch(batchSize);
infer_request.Infer();
//! [part0]
return 0;
}

View File

@ -1,12 +0,0 @@
#include <ie_core.hpp>
int main() {
using namespace InferenceEngine;
//! [part0]
InferenceEngine::Core core;
// Load GPU Extensions
core.SetConfig({ { InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE, "<path_to_the_xml_file>" } }, "GPU");
//! [part0]
return 0;
}

View File

@ -1,12 +0,0 @@
#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/properties.hpp>
int main() {
//! [part0]
ov::Core core;
auto model = core.read_model("sample.xml");
auto compiledModel = core.compile_model(model, "GPU");
std::map<std::string, uint64_t> statistics_map = core.get_property("GPU", ov::intel_gpu::memory_statistics);
//! [part0]
return 0;
}

View File

@ -1,25 +0,0 @@
#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/properties.hpp>
int main() {
//! [part1]
ov::Core core;
std::shared_ptr<ov::Model> model = core.read_model("network.xml");
uint32_t n_streams = 2;
int64_t available_device_mem_size = 3221225472;
ov::AnyMap options = {
ov::hint::model(model), // Required. Set the address of the target network. If this is not set, the MAX_BATCH_SIZE returns 1.
ov::num_streams(n_streams), // Optional. Set only when you want to estimate max batch size for a specific throughtput streams. Default is 1 or throughtput streams set by set_property.
ov::intel_gpu::hint::available_device_mem(available_device_mem_size) // Optional. Set only when you want to limit the available device mem size.
};
uint32_t max_batch_size = core.get_property("GPU", ov::max_batch_size, options);
//! [part1]
//! [part2]
// This is not entirely GPU-specific property (so common `ov::` property is used rather than `ov::intel_gpu::` below),
// but the GPU is the only device that supports that at the moment.
// For the GPU, the property already accommodates limitation for the on-device memory that the MAX_BATCH_SIZE poses.
// so OPTIMAL_BATCH_SIZE is always less than MAX_BATCH_SIZE. Unlike the latter it is also aligned to the power of 2.
uint32_t optimal_batch_size = core.get_property("GPU", ov::optimal_batch_size, options);
//! [part2]
}

View File

@ -1,60 +0,0 @@
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <ie_core.hpp>
#include <CL/cl2.hpp>
#include <gpu/gpu_context_api_ocl.hpp>
int main() {
using namespace InferenceEngine;
//! [part0]
// ...
// initialize the core and load the network
InferenceEngine::Core ie;
auto net = ie.ReadNetwork("network.xml");
auto exec_net = ie.LoadNetwork(net, "GPU");
// obtain the RemoteContext pointer from the executable network object
auto cldnn_context = exec_net.GetContext();
// obtain the OpenCL context handle from the RemoteContext,
// get device info and create a queue
cl::Context ctx = std::dynamic_pointer_cast<cl::Context>(cldnn_context);
_device = cl::Device(_context.getInfo<CL_CONTEXT_DEVICES>()[0].get(), true);
cl::CommandQueue _queue;
cl_command_queue_properties props = CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
_queue = cl::CommandQueue(_context, _device, props);
// create the OpenCL buffer within the obtained context
cl::Buffer shared_buffer(ctx, CL_MEM_READ_WRITE, image_size * num_channels, NULL, &err);
// wrap the buffer into RemoteBlob
auto shared_blob = gpu::make_shared_blob(input_info->getTensorDesc(), cldnn_context, shared_buffer);
// ...
// execute user kernel
cl::Kernel kernel(program, kernelName.c_str());
kernel.setArg(0, shared_buffer);
queue.enqueueNDRangeKernel(kernel,
cl::NDRange(0),
cl::NDRange(image_size),
cl::NDRange(1),
0, // wait events *
&profileEvent);
queue.finish();
// ...
// pass results to the inference
inf_req_shared.SetBlob(input_name, shared_blob);
inf_req_shared.Infer();
//! [part0]
return 0;
}

View File

@ -1,32 +0,0 @@
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <ie_core.hpp>
#include <CL/cl2.hpp>
#include <gpu/gpu_context_api_ocl.hpp>
int main() {
using namespace InferenceEngine;
//! [part1]
// ...
cl::Context ctx = get_my_OpenCL_context();
// share the context with GPU plugin and compile ExecutableNetwork
auto remote_context = gpu::make_shared_context(ie, "GPU", ocl_instance->_context.get());
auto exec_net_shared = ie.LoadNetwork(net, remote_context);
auto inf_req_shared = exec_net_shared.CreateInferRequest();
// ...
// do OpenCL processing stuff
// ...
// run the inference
inf_req_shared.Infer();
//! [part1]
return 0;
}

View File

@ -1,54 +0,0 @@
#include <ie_core.hpp>
#include <gpu/gpu_context_api_va.hpp>
#include <gpu/gpu_config.hpp>
int main() {
using namespace InferenceEngine;
//! [part2]
// ...
// initialize the objects
CNNNetwork network = ie.ReadNetwork(xmlFileName, binFileName);
// ...
auto inputInfoItem = *inputInfo.begin();
inputInfoItem.second->setPrecision(Precision::U8);
inputInfoItem.second->setLayout(Layout::NCHW);
inputInfoItem.second->getPreProcess().setColorFormat(ColorFormat::NV12);
VADisplay disp = get_VA_Device();
// create the shared context object
auto shared_va_context = gpu::make_shared_context(ie, "GPU", disp);
// compile network within a shared context
ExecutableNetwork executable_network = ie.LoadNetwork(network,
shared_va_context,
{ { GPUConfigParams::KEY_GPU_NV12_TWO_INPUTS,
PluginConfigParams::YES } });
// decode/inference loop
for (int i = 0; i < nframes; i++) {
// ...
// execute decoding and obtain decoded surface handle
decoder.DecodeFrame();
VASurfaceID va_surface = decoder.get_VA_output_surface();
// ...
//wrap decoder output into RemoteBlobs and set it as inference input
auto nv12_blob = gpu::make_shared_blob_nv12(ieInHeight,
ieInWidth,
shared_va_context,
va_surface
);
inferRequests[currentFrame].SetBlob(input_name, nv12_blob);
inferRequests[currentFrame].StartAsync();
inferRequests[prevFrame].Wait(InferenceEngine::InferRequest::WaitMode::RESULT_READY);
}
//! [part2]
return 0;
}

View File

@ -1,76 +0,0 @@
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <ie_core.hpp>
#include <CL/cl2.hpp>
#include <gpu/gpu_context_api_ocl.hpp>
int main() {
using namespace InferenceEngine;
//! [part0]
// ...
// initialize the core and read the network
InferenceEngine::Core ie;
auto net = ie.ReadNetwork("network.xml");
// initialize opencl context and create queue
cl::Context ctx = get_my_OpenCL_context();
cl::CommandQueue queue = get_my_OpenCL_queue();
// share the queue with GPU plugin and compile ExecutableNetwork
auto remote_context = gpu::make_shared_context(ie, "GPU", queue.get());
auto exec_net_shared = ie.LoadNetwork(net, remote_context);
// create the OpenCL buffers within the context
cl::Buffer shared_in_buffer(ctx, CL_MEM_READ_WRITE, image_size * num_channels, NULL, &err);
cl::Buffer shared_out_buffer(ctx, CL_MEM_READ_WRITE, image_size * num_channels, NULL, &err);
// wrap in and out buffers into RemoteBlob and set them to infer request
auto shared_in_blob = gpu::make_shared_blob(input_info->getTensorDesc(), remote_context, shared_in_buffer);
auto shared_out_blob = gpu::make_shared_blob(out_data->getTensorDesc(), remote_context, shared_out_buffer);
auto infer_request = exec_net_shared.CreateInferRequest();
infer_request.SetBlob(input_name, shared_in_blob);
infer_request.SetBlob(output_name, shared_out_blob);
// ...
// execute user kernel
cl::Kernel kernel_preproc(program, kernel_name_preproc.c_str());
kernel_preproc.setArg(0, shared_in_buffer);
queue.enqueueNDRangeKernel(kernel_preproc,
cl::NDRange(0),
cl::NDRange(image_size),
cl::NDRange(1),
nullptr, // wait events *
&profileEvent);
// Blocking clFinish() call is not required, but this barrier is added to the queue to guarantee that user kernel is finished
// before any inference primitive is started
queue.enqueueBarrierWithWaitList(nullptr, nullptr);
// ...
// pass results to the inference
// since the remote context is created with queue sharing, StartAsync() guarantees that scheduling is finished
infer_request.StartAsync();
// execute some postprocessing kernel.
// infer_request.Wait() is not called, synchonization between inference and post-processing is done via
// enqueueBarrierWithWaitList call.
cl::Kernel kernel_postproc(program, kernel_name_postproc.c_str());
kernel_postproc.setArg(0, shared_out_buffer);
queue.enqueueBarrierWithWaitList(nullptr, nullptr);
queue.enqueueNDRangeKernel(kernel_postproc,
cl::NDRange(0),
cl::NDRange(image_size),
cl::NDRange(1),
nullptr, // wait events *
&profileEvent);
// Wait for pipeline completion
queue.finish();
//! [part0]
return 0;
}

View File

@ -0,0 +1,54 @@
#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/properties.hpp>
int main() {
{
//! [compile_model_default_gpu]
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "GPU");
//! [compile_model_default_gpu]
}
{
//! [compile_model_gpu_with_id]
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "GPU.1");
//! [compile_model_gpu_with_id]
}
{
//! [compile_model_gpu_with_id_and_tile]
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "GPU.1.0");
//! [compile_model_gpu_with_id_and_tile]
}
{
//! [compile_model_multi]
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "MULTI:GPU.1,GPU.0");
//! [compile_model_multi]
}
{
//! [compile_model_batch_plugin]
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "BATCH:GPU");
//! [compile_model_batch_plugin]
}
{
//! [compile_model_auto_batch]
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "GPU", ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
//! [compile_model_auto_batch]
}
return 0;
}

View File

@ -0,0 +1,77 @@
#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>
cl::Context get_ocl_context(); // a function which returns cl context created on the app side
int main() {
{
//! [context_sharing_get_from_ov]
// ...
// initialize the core and load the network
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "GPU");
auto infer_request = compiled_model.create_infer_request();
// obtain the RemoteContext from the compiled model object and cast it to ClContext
auto gpu_context = compiled_model.get_context().as<ov::intel_gpu::ocl::ClContext>();
// obtain the OpenCL context handle from the RemoteContext,
// get device info and create a queue
cl::Context cl_context = gpu_context;
cl::Device device = cl::Device(cl_context.getInfo<CL_CONTEXT_DEVICES>()[0].get(), true);
cl_command_queue_properties props = CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
cl::CommandQueue queue = cl::CommandQueue(cl_context, device, props);
// create the OpenCL buffer within the obtained context
auto input = model->get_parameters().at(0);
auto input_size = ov::shape_size(input->get_shape());
cl_int err;
cl::Buffer shared_buffer(cl_context, CL_MEM_READ_WRITE, input_size, NULL, &err);
// wrap the buffer into a RemoteTensor
auto shared_blob = gpu_context.create_tensor(input->get_element_type(), input->get_shape(), shared_buffer);
// ...
// execute user kernel
cl::Program program;
cl::Kernel kernel(program, "user_kernel");
kernel.setArg(0, shared_buffer);
queue.enqueueNDRangeKernel(kernel,
cl::NDRange(0),
cl::NDRange(input_size),
cl::NDRange(1),
nullptr,
nullptr);
queue.finish();
// ...
// pass results to the inference
infer_request.set_tensor(input, shared_blob);
infer_request.infer();
//! [context_sharing_get_from_ov]
}
{
//! [context_sharing_user_handle]
cl::Context ctx = get_ocl_context();
ov::Core core;
auto model = core.read_model("model.xml");
// share the context with the GPU plugin and compile the model
auto remote_context = ov::intel_gpu::ocl::ClContext(core, ctx.get());
auto exec_net_shared = core.compile_model(model, remote_context);
auto inf_req_shared = exec_net_shared.create_infer_request();
// ...
// do OpenCL processing stuff
// ...
// run the inference
inf_req_shared.infer();
//! [context_sharing_user_handle]
}
return 0;
}

View File

@ -0,0 +1,57 @@
#ifdef ENABLE_LIBVA
#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/ocl/va.hpp>
#include <openvino/runtime/intel_gpu/properties.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>
VADisplay get_va_display();
VASurfaceID decode_va_surface();
int main() {
// initialize the objects
ov::Core core;
auto model = core.read_model("model.xml");
// ...
//! [context_sharing_va]
// ...
using namespace ov::preprocess;
auto p = PrePostProcessor(model);
p.input().tensor().set_element_type(ov::element::u8)
.set_color_format(ov::preprocess::ColorFormat::NV12_TWO_PLANES, {"y", "uv"})
.set_memory_type(ov::intel_gpu::memory_type::surface);
p.input().preprocess().convert_color(ov::preprocess::ColorFormat::BGR);
p.input().model().set_layout("NCHW");
model = p.build();
VADisplay disp = get_va_display();
// create the shared context object
auto shared_va_context = ov::intel_gpu::ocl::VAContext(core, disp);
// compile model within a shared context
auto compiled_model = core.compile_model(model, shared_va_context);
auto input = model->get_parameters().at(0);
size_t width = 1024;
size_t height = 768;
// execute decoding and obtain decoded surface handle
VASurfaceID va_surface = decode_va_surface();
// ...
// wrap decoder output into RemoteTensors and set them as inference input
auto nv12_blob = shared_va_context.create_tensor_nv12(height, width, va_surface);
auto infer_request = compiled_model.create_infer_request();
infer_request.set_tensor("y", nv12_blob.first);
infer_request.set_tensor("uv", nv12_blob.second);
infer_request.start_async();
infer_request.wait();
//! [context_sharing_va]
return 0;
}
#endif // ENABLE_LIBVA

View File

@ -0,0 +1,11 @@
#include <openvino/runtime/core.hpp>
int main() {
//! [part0]
ov::Core core;
// Load GPU Extensions
core.set_property("GPU", {{ CONFIG_KEY(CONFIG_FILE), "<path_to_the_xml_file>" }});
//! [part0]
return 0;
}

View File

@ -0,0 +1,34 @@
#include <openvino/runtime/core.hpp>
int main() {
size_t C = 3;
size_t H = 224;
size_t W = 224;
//! [dynamic_batch]
// Read model
ov::Core core;
auto model = core.read_model("model.xml");
model->reshape({{ov::Dimension(1, 10), ov::Dimension(C), ov::Dimension(H), ov::Dimension(W)}}); // {1..10, C, H, W}
// compile model and create infer request
auto compiled_model = core.compile_model(model, "GPU");
auto infer_request = compiled_model.create_infer_request();
auto input = model->get_parameters().at(0);
// ...
// create input tensor with specific batch size
ov::Tensor input_tensor(input->get_element_type(), {2, C, H, W});
// ...
infer_request.set_tensor(input, input_tensor);
infer_request.infer();
//! [dynamic_batch]
return 0;
}

View File

@ -0,0 +1,54 @@
#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>
#include <openvino/runtime/intel_gpu/properties.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>
ov::intel_gpu::ocl::ClImage2DTensor get_y_tensor();
ov::intel_gpu::ocl::ClImage2DTensor get_uv_tensor();
int main() {
ov::Core core;
auto model = core.read_model("model.xml");
//! [init_preproc]
using namespace ov::preprocess;
auto p = PrePostProcessor(model);
p.input().tensor().set_element_type(ov::element::u8)
.set_color_format(ov::preprocess::ColorFormat::NV12_TWO_PLANES, {"y", "uv"})
.set_memory_type(ov::intel_gpu::memory_type::surface);
p.input().preprocess().convert_color(ov::preprocess::ColorFormat::BGR);
p.input().model().set_layout("NCHW");
auto model_with_preproc = p.build();
//! [init_preproc]
auto compiled_model = core.compile_model(model_with_preproc, "GPU");
auto context = compiled_model.get_context().as<ov::intel_gpu::ocl::ClContext>();
auto input = model->get_parameters().at(0);
auto infer_request = compiled_model.create_infer_request();
{
//! [single_batch]
ov::intel_gpu::ocl::ClImage2DTensor y_tensor = get_y_tensor();
ov::intel_gpu::ocl::ClImage2DTensor uv_tensor = get_uv_tensor();
infer_request.set_tensor("y", y_tensor);
infer_request.set_tensor("uv", uv_tensor);
infer_request.infer();
//! [single_batch]
}
{
auto y_tensor_0 = get_y_tensor();
auto y_tensor_1 = get_y_tensor();
auto uv_tensor_0 = get_uv_tensor();
auto uv_tensor_1 = get_uv_tensor();
//! [batched_case]
std::vector<ov::Tensor> y_tensors = {y_tensor_0, y_tensor_1};
std::vector<ov::Tensor> uv_tensors = {uv_tensor_0, uv_tensor_1};
infer_request.set_tensors("y", y_tensors);
infer_request.set_tensors("uv", uv_tensors);
infer_request.infer();
//! [batched_case]
}
return 0;
}

View File

@ -0,0 +1,79 @@
#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>
cl::CommandQueue get_ocl_queue(); // a function which returns cl queue created on the app side
cl::Context get_ocl_context(); // a function which returns cl context created on the app side
int main() {
//! [queue_sharing]
// ...
// initialize the core and read the model
ov::Core core;
auto model = core.read_model("model.xml");
// get opencl queue object
cl::CommandQueue queue = get_ocl_queue();
cl::Context cl_context = get_ocl_context();
// share the queue with GPU plugin and compile model
auto remote_context = ov::intel_gpu::ocl::ClContext(core, queue.get());
auto exec_net_shared = core.compile_model(model, remote_context);
auto input = model->get_parameters().at(0);
auto input_size = ov::shape_size(input->get_shape());
auto output = model->get_results().at(0);
auto output_size = ov::shape_size(output->get_shape());
cl_int err;
// create the OpenCL buffers within the context
cl::Buffer shared_in_buffer(cl_context, CL_MEM_READ_WRITE, input_size, NULL, &err);
cl::Buffer shared_out_buffer(cl_context, CL_MEM_READ_WRITE, output_size, NULL, &err);
// wrap in and out buffers into RemoteTensor and set them to infer request
auto shared_in_blob = remote_context.create_tensor(input->get_element_type(), input->get_shape(), shared_in_buffer);
auto shared_out_blob = remote_context.create_tensor(output->get_element_type(), output->get_shape(), shared_out_buffer);
auto infer_request = exec_net_shared.create_infer_request();
infer_request.set_tensor(input, shared_in_blob);
infer_request.set_tensor(output, shared_out_blob);
// ...
// execute user kernel
cl::Program program;
cl::Kernel kernel_preproc(program, "user_kernel_preproc");
kernel_preproc.setArg(0, shared_in_buffer);
queue.enqueueNDRangeKernel(kernel_preproc,
cl::NDRange(0),
cl::NDRange(input_size),
cl::NDRange(1),
nullptr,
nullptr);
// Blocking clFinish() call is not required, but this barrier is added to the queue to guarantee that user kernel is finished
// before any inference primitive is started
queue.enqueueBarrierWithWaitList(nullptr, nullptr);
// ...
// pass results to the inference
// since the remote context is created with queue sharing, start_async() guarantees that scheduling is finished
infer_request.start_async();
// execute some postprocessing kernel.
// infer_request.wait() is not called, synchronization between inference and post-processing is done via
// enqueueBarrierWithWaitList call.
cl::Kernel kernel_postproc(program, "user_kernel_postproc");
kernel_postproc.setArg(0, shared_out_buffer);
queue.enqueueBarrierWithWaitList(nullptr, nullptr);
queue.enqueueNDRangeKernel(kernel_postproc,
cl::NDRange(0),
cl::NDRange(output_size),
cl::NDRange(1),
nullptr,
nullptr);
// Wait for pipeline completion
queue.finish();
//! [queue_sharing]
return 0;
}

View File

@ -0,0 +1,154 @@
#include <openvino/runtime/core.hpp>
#include <openvino/runtime/intel_gpu/properties.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>
#ifdef WIN32
#include <openvino/runtime/intel_gpu/ocl/dx.hpp>
#elif defined(ENABLE_LIBVA)
#include <openvino/runtime/intel_gpu/ocl/va.hpp>
#endif
void* allocate_usm_buffer(size_t size);
cl_mem allocate_cl_mem(size_t size);
cl_context get_cl_context();
cl_command_queue get_cl_queue();
cl::Buffer allocate_buffer(size_t size);
cl::Image2D allocate_image(size_t size);
#ifdef WIN32
ID3D11Device* get_d3d_device();
#elif defined(ENABLE_LIBVA)
VADisplay get_va_display();
#endif
int main() {
ov::Core core;
auto model = core.read_model("model.xml");
auto input = model->get_parameters().at(0);
auto input_size = ov::shape_size(input->get_shape());
auto compiled_model = core.compile_model(model, "GPU");
auto gpu_context = compiled_model.get_context().as<ov::intel_gpu::ocl::ClContext>();
auto in_element_type = input->get_element_type();
auto in_shape = input->get_shape();
{
//! [wrap_usm_pointer]
void* shared_buffer = allocate_usm_buffer(input_size);
auto remote_tensor = gpu_context.create_tensor(in_element_type, in_shape, shared_buffer);
//! [wrap_usm_pointer]
}
{
//! [wrap_cl_mem]
cl_mem shared_buffer = allocate_cl_mem(input_size);
auto remote_tensor = gpu_context.create_tensor(in_element_type, in_shape, shared_buffer);
//! [wrap_cl_mem]
}
{
//! [wrap_cl_buffer]
cl::Buffer shared_buffer = allocate_buffer(input_size);
auto remote_tensor = gpu_context.create_tensor(in_element_type, in_shape, shared_buffer);
//! [wrap_cl_buffer]
}
{
//! [wrap_cl_image]
cl::Image2D shared_buffer = allocate_image(input_size);
auto remote_tensor = gpu_context.create_tensor(in_element_type, in_shape, shared_buffer);
//! [wrap_cl_image]
}
{
//! [allocate_usm_device]
auto remote_tensor = gpu_context.create_usm_device_tensor(in_element_type, in_shape);
// Extract raw usm pointer from remote tensor
void* usm_ptr = remote_tensor.get();
//! [allocate_usm_device]
}
{
//! [allocate_usm_host]
ov::intel_gpu::ocl::USMTensor remote_tensor = gpu_context.create_usm_host_tensor(in_element_type, in_shape);
// Extract raw usm pointer from remote tensor
void* usm_ptr = remote_tensor.get();
//! [allocate_usm_host]
}
{
//! [allocate_cl_buffer]
ov::RemoteTensor remote_tensor = gpu_context.create_tensor(in_element_type, in_shape);
// Cast from base to derived class and extract ocl memory handle
auto buffer_tensor = remote_tensor.as<ov::intel_gpu::ocl::ClBufferTensor>();
cl_mem handle = buffer_tensor.get();
//! [allocate_cl_buffer]
}
{
size_t width = 1024;
size_t height = 768;
size_t y_plane_size = width*height;
size_t uv_plane_size = width*height / 2;
//! [wrap_nv12_surface]
cl::Image2D y_plane_surface = allocate_image(y_plane_size);
cl::Image2D uv_plane_surface = allocate_image(uv_plane_size);
auto remote_tensor = gpu_context.create_tensor_nv12(y_plane_surface, uv_plane_surface);
auto y_tensor = remote_tensor.first;
auto uv_tensor = remote_tensor.second;
//! [wrap_nv12_surface]
}
{
//! [context_from_cl_context]
cl_context ctx = get_cl_context();
ov::intel_gpu::ocl::ClContext gpu_context(core, ctx);
//! [context_from_cl_context]
}
{
//! [context_from_cl_queue]
cl_command_queue queue = get_cl_queue();
ov::intel_gpu::ocl::ClContext gpu_context(core, queue);
//! [context_from_cl_queue]
}
#ifdef WIN32
{
//! [context_from_d3d_device]
ID3D11Device* device = get_d3d_device();
ov::intel_gpu::ocl::D3DContext gpu_context(core, device);
//! [context_from_d3d_device]
}
#elif defined(ENABLE_LIBVA)
{
//! [context_from_va_display]
VADisplay display = get_va_display();
ov::intel_gpu::ocl::VAContext gpu_context(core, display);
//! [context_from_va_display]
}
#endif
{
//! [default_context_from_core]
auto gpu_context = core.get_default_context("GPU").as<ov::intel_gpu::ocl::ClContext>();
// Extract ocl context handle from RemoteContext
cl_context context_handle = gpu_context.get();
//! [default_context_from_core]
}
{
//! [default_context_from_model]
auto gpu_context = compiled_model.get_context().as<ov::intel_gpu::ocl::ClContext>();
// Extract ocl context handle from RemoteContext
cl_context context_handle = gpu_context.get();
//! [default_context_from_model]
}
return 0;
}

View File

@ -92,18 +92,18 @@ static constexpr Property<ov::hint::Priority> host_task_priority{"GPU_HOST_TASK_
static constexpr Property<int64_t> available_device_mem{"AVAILABLE_DEVICE_MEM_SIZE"};
} // namespace hint
namespace memory_type {
/**
* @brief These keys instruct the GPU plugin to use surface/buffer memory type.
*/
namespace memory_type {
static constexpr auto surface = "GPU_SURFACE"; //!< Native video decoder surface
static constexpr auto buffer = "GPU_BUFFER"; //!< OpenCL buffer
} // namespace memory_type
namespace capability {
/**
* @brief Possible return value for ov::device::capabilities property
*/
namespace capability {
constexpr static const auto HW_MATMUL = "GPU_HW_MATMUL"; //!< Device has hardware block for matrix multiplication
} // namespace capability
} // namespace intel_gpu