DOCS: Proofreading OpenVINO Extensibility for 22.2 (#14032)

This commit is contained in:
Maciej Smyk
2022-12-13 11:19:14 +01:00
committed by GitHub
parent ec21e6906b
commit 608d002402
3 changed files with 119 additions and 136 deletions

View File

@@ -1,6 +1,6 @@
# How to Implement Custom GPU Operations {#openvino_docs_Extensibility_UG_GPU}
To enable operations not supported by OpenVINO out of the box, you may need an extension for an OpenVINO operation set, and a custom kernel for the device you will target. This page describes custom kernel support for the GPU device.
To enable operations not supported by OpenVINO out of the box, you may need an extension for the OpenVINO operation set, and a custom kernel for the device you will target. This article describes custom kernel support for the GPU device.
The GPU codepath abstracts many details about OpenCL. You need to provide the kernel code in OpenCL C and an XML configuration file that connects the kernel and its parameters to the parameters of the operation.
@@ -30,7 +30,7 @@ $ ./classification_sample -m <path_to_model>/bvlc_alexnet_fp16.xml -i ./validati
## Configuration File Format <a name="config-file-format"></a>
The configuration file is expected to follow the `.xml` file structure
with a node of the `CustomLayer` type for every custom operation you provide.
with a node of the type `CustomLayer` for every custom operation you provide.
The definitions described in the sections below use the following notations:
@@ -43,44 +43,44 @@ Notation | Description
### CustomLayer Node and Sub-Node Structure
`CustomLayer` node contains the entire configuration for a single custom operation.
The `CustomLayer` node contains the entire configuration for a single custom operation.
| Attribute Name |\# | Description |
|-----|-----|-----|
| `name` | (1) | The name of the operation type to be used. This name should be identical to the type used in the IR.|
| `type` | (1) | Must be `SimpleGPU`. |
| `version` | (1) | Must be `1`. |
| `name` | (1) | The name of the operation type to be used. This name should be identical to the type used in the OpenVINO IR.|
| `type` | (1) | Must be `SimpleGPU`. |
| `version` | (1) | Must be `1`. |
**Sub-nodes**: `Kernel` (1), `Buffers` (1), `CompilerOptions` (0+),
`WorkSizes` (0/1)
### Kernel Node and Sub-Node Structure
`Kernel` node contains all kernel source code configuration.
The `Kernel` node contains all kernel source code configuration.
**Sub-nodes**: `Source` (1+), `Define` (0+)
### Source Node and Sub-Node Structure
`Source` node points to a single OpenCL source file.
The `Source` node points to a single OpenCL source file.
| Attribute Name | \# |Description|
|-----|-----|-----|
| `filename` | (1) | Name of the file containing OpenCL source code. Note that the path is relative to your executable. Multiple source nodes will have their sources concatenated in order. |
| `filename` | (1) | Name of the file containing OpenCL source code. The path is relative to your executable. Multiple source nodes will have their sources concatenated in order. |
**Sub-nodes**: None
### Define Node and Sub-Node Structure
`Define` node configures a single `#&zwj;define` instruction to be added to
The `Define` node configures a single `#&zwj;define` instruction to be added to
the sources during compilation (JIT).
| Attribute Name | \# | Description |
|------|-------|------|
| `name` | (1) | The name of the defined JIT. For static constants, this can include the value as well, which is taken as a string. |
| `param` | (0/1) | This parameter value is used as the value of this JIT definition. |
| `type` | (0/1) | The parameter type. Accepted values: `int`, `float`, and `int[]`, `float[]` for arrays. |
| `default` | (0/1) | The default value to be used if the specified parameters are missing from the operation in the IR. |
| `name` | (1) | The name of the defined JIT. For static constants, this can include the value as well, which is taken as a string. |
| `param` | (0/1) | This parameter value is used as the value of this JIT definition. |
| `type` | (0/1) | The parameter type. Accepted values: `int`, `float`, and `int[]`, `float[]` for arrays. |
| `default` | (0/1) | The default value to be used if the specified parameters are missing from the operation in the OpenVINO IR. |
**Sub-nodes:** None
@@ -89,37 +89,37 @@ The resulting JIT has the following form:
### Buffers Node and Sub-Node Structure
`Buffers` node configures all input/output buffers for the OpenCL entry
The `Buffers` node configures all input/output buffers for the OpenCL entry
function. No buffers node structure exists.
**Sub-nodes:** `Data` (0+), `Tensor` (1+)
### Data Node and Sub-Node Structure
`Data` node configures a single input with static data, for example,
The `Data` node configures a single input with static data, for example,
weights or biases.
| Attribute Name | \# | Description |
|----|-----|------|
| `name` | (1) | Name of a blob attached to an operation in the IR |
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to |
| `name` | (1) | Name of a blob attached to an operation in the OpenVINO IR. |
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to. |
**Sub-nodes**: None
### Tensor Node and Sub-Node Structure
`Tensor` node configures a single input or output tensor.
The `Tensor` node configures a single input or output tensor.
| Attribute Name | \# | Description |
|------|-------|-------|
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to. |
| `type` | (1) | `input` or `output` |
| `port-index` | (1) | 0-based index in the operation input/output ports in the IR |
| `format` | (0/1) | Data layout declaration for the tensor. Accepted values: `BFYX`, `BYXF`, `YXFB`, `FYXB`, and same values in all lowercase. Default value: `BFYX` |
| `port-index` | (1) | 0-based index in the operation input/output ports in the OpenVINO IR |
| `format` | (0/1) | Data layout declaration for the tensor. Accepted values: `BFYX`, `BYXF`, `YXFB`, `FYXB` (also in lowercase). The default value: `BFYX` |
### CompilerOptions Node and Sub-Node Structure
`CompilerOptions` node configures the compilation flags for the OpenCL
The `CompilerOptions` node configures the compilation flags for the OpenCL
sources.
| Attribute Name | \# | Description |
@@ -130,20 +130,20 @@ sources.
### WorkSizes Node and Sub-Node Structure
`WorkSizes` node configures the global/local work sizes to be used when
The `WorkSizes` node configures the global/local work sizes to be used when
queuing an OpenCL program for execution.
| Attribute Name | \# | Description |
|-----|------|-----|
| `global`<br>`local` | (0/1)<br>(0/1) | An array of up to three integers or formulas for defining OpenCL work-sizes to be used during execution.<br> The formulas can use the values of the B,F,Y,X dimensions and contain the operators: +,-,/,\*,%. All operators are evaluated in integer arithmetic. <br>Default value: `global="B*F*Y*X" local=""` |
| `dim` | (0/1) | A tensor to take the work-size from. Accepted values: `input N`, `output`, where `N` is an index of input tensor starting with 0. Default value: `output` |
| `dim` | (0/1) | A tensor to take the work-size from. Accepted values: `input N`, `output`, where `N` is an index of input tensor starting with 0. The default value: `output` |
**Sub-nodes**: None
## Example Configuration File
The following code sample provides an example configuration file in XML
format. For information on the configuration file structure, see
format. For information on the configuration file structure, see the
[Configuration File Format](#config-file-format).
```xml
<CustomLayer name="ReLU" type="SimpleGPU" version="1">
@@ -169,22 +169,22 @@ For an example, see [Example Kernel](#example-kernel).
| Name | Value |
|---|---|
| `NUM_INPUTS` | Number of the input tensors bound to this kernel |
| `GLOBAL_WORKSIZE` | An array of global work sizes used to execute this kernel |
| `GLOBAL_WORKSIZE_SIZE` | The size of the `GLOBAL_WORKSIZE` array |
| `LOCAL_WORKSIZE` | An array of local work sizes used to execute this kernel |
| `LOCAL_WORKSIZE_SIZE` | The size of the `LOCAL_WORKSIZE` array |
| `<TENSOR>_DIMS`| An array of the tensor dimension sizes. Always ordered as `BFYX` |
| `NUM_INPUTS` | Number of the input tensors bound to this kernel. |
| `GLOBAL_WORKSIZE` | An array of global work sizes used to execute this kernel. |
| `GLOBAL_WORKSIZE_SIZE` | The size of the `GLOBAL_WORKSIZE` array. |
| `LOCAL_WORKSIZE` | An array of local work sizes used to execute this kernel. |
| `LOCAL_WORKSIZE_SIZE` | The size of the `LOCAL_WORKSIZE` array. |
| `<TENSOR>_DIMS`| An array of the tensor dimension sizes. Always ordered as `BFYX`. |
| `<TENSOR>_DIMS_SIZE`| The size of the `<TENSOR>_DIMS` array.|
| `<TENSOR>_TYPE`| The datatype of the tensor: `float`, `half`, or `char`|
| `<TENSOR>_TYPE`| The datatype of the tensor: `float`, `half`, or `char`. |
| `<TENSOR>_FORMAT_<TENSOR_FORMAT>` | The format of the tensor, BFYX, BYXF, YXFB, FYXB, or ANY. The format is concatenated to the defined name. You can use the tensor format to define codepaths in your code with `#&zwj;ifdef/#&zwj;endif`. |
| `<TENSOR>_LOWER_PADDING` | An array of padding elements used for the tensor dimensions before they start. Always ordered as BFYX.|
| `<TENSOR>_LOWER_PADDING_SIZE` | The size of the `<TENSOR>_LOWER_PADDING` array |
| `<TENSOR>_LOWER_PADDING_SIZE` | The size of the `<TENSOR>_LOWER_PADDING` array. |
| `<TENSOR>_UPPER_PADDING` | An array of padding elements used for the tensor dimensions after they end. Always ordered as BFYX. |
| `<TENSOR>_UPPER_PADDING_SIZE` | The size of the `<TENSOR>_UPPER_PADDING` array |
| `<TENSOR>_PITCHES` | The offset (in elements) between adjacent elements in each dimension. Always ordered as BFYX.|
| `<TENSOR>_PITCHES_SIZE`| The size of the `<TENSOR>_PITCHES` array |
| `<TENSOR>_OFFSET`| The number of elements from the start of the tensor to the first valid element, bypassing the lower padding. |
| `<TENSOR>_UPPER_PADDING_SIZE` | The size of the `<TENSOR>_UPPER_PADDING` array. |
| `<TENSOR>_PITCHES` | The offset (in elements) between adjacent elements in each dimension. Always ordered as BFYX. |
| `<TENSOR>_PITCHES_SIZE`| The size of the `<TENSOR>_PITCHES` array. |
| `<TENSOR>_OFFSET`| The number of elements from the start of the tensor to the first valid element, bypassing the lower padding. |
All `<TENSOR>` values are automatically defined for every tensor
bound to this operation, such as `INPUT0`, `INPUT1`, and `OUTPUT0`, as shown
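For illustration only, a minimal sketch of a kernel consuming these automatically generated definitions might look as follows. The kernel name and the scaling computation are placeholders; the `INPUT0_*` and `OUTPUT0_*` macros are the ones listed in the table above, and the work sizes are assumed to be ordered as X, Y, B*F:

```cpp
__kernel void example_scale_kernel(
    const __global INPUT0_TYPE*  input0,
          __global OUTPUT0_TYPE* output)
{
    // Work sizes are assumed to be X, Y, B*F in this sketch.
    const uint idx  = get_global_id(0);
    const uint idy  = get_global_id(1);
    const uint idbf = get_global_id(2); // batches * features

    const uint feature = idbf % INPUT0_DIMS[1]; // DIMS are always ordered as BFYX
    const uint batch   = idbf / INPUT0_DIMS[1];

    // Pitches are in elements; OFFSET skips the lower padding.
    const uint in_id  = batch * INPUT0_PITCHES[0] + feature * INPUT0_PITCHES[1]
                      + idy * INPUT0_PITCHES[2] + idx * INPUT0_PITCHES[3] + INPUT0_OFFSET;
    const uint out_id = batch * OUTPUT0_PITCHES[0] + feature * OUTPUT0_PITCHES[1]
                      + idy * OUTPUT0_PITCHES[2] + idx * OUTPUT0_PITCHES[3] + OUTPUT0_OFFSET;

    output[out_id] = input0[in_id] * (INPUT0_TYPE)2; // placeholder computation
}
```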
@@ -219,7 +219,7 @@ __kernel void example_relu_kernel(
```
> **NOTE**: As described in the previous section, all items like
> **NOTE**: As described in the previous section, all items such as
> `INPUT0_TYPE` are actually defined as OpenCL (pre-)compiler inputs by
> OpenVINO for efficiency reasons. See the [Debugging
> Tips](#debugging-tips) below for information on debugging the results.
@@ -234,5 +234,4 @@ your output can be truncated to fit the buffer. Also, because of
buffering, you actually get an entire buffer of output when the
execution ends.<br>
For more information, refer to the [printf
Function](https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/printfFunction.html).
For more information, refer to the [printf Function](https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/printfFunction.html).
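As a rough, hypothetical sketch (the kernel name and logic are placeholders), a kernel instrumented with `printf` for debugging could look like this:

```cpp
__kernel void debug_copy_kernel(
    const __global INPUT0_TYPE*  input0,
          __global OUTPUT0_TYPE* output)
{
    const uint id = get_global_id(0);
    // Print from a single work item only, to avoid overflowing the printf buffer.
    if (id == 0)
        printf("INPUT0_DIMS = %d %d %d %d\n",
               (int)INPUT0_DIMS[0], (int)INPUT0_DIMS[1],
               (int)INPUT0_DIMS[2], (int)INPUT0_DIMS[3]);
    output[id] = input0[id];
}
```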

View File

@@ -19,62 +19,61 @@ TensorFlow, PyTorch, ONNX, PaddlePaddle, Apache MXNet, Caffe, and Kaldi. The lis
each of the supported frameworks. To see the operations supported by your framework, refer to
[Supported Framework Operations](../MO_DG/prepare_model/Supported_Frameworks_Layers.md).
Custom operations, that is those not included in the list, are not recognized by OpenVINO out-of-the-box. The need for a custom operation may appear in two main cases:
Custom operations, that is, operations not included in the list, are not recognized by OpenVINO out-of-the-box. The need for a custom operation may appear in two main cases:
1. A regular framework operation that is new or rarely used, which is why it hasn't been implemented in OpenVINO yet.
1. A new or rarely used regular framework operation is not supported in OpenVINO yet.
2. A new user operation that was created for some specific model topology by a model author using framework extension capabilities.
2. A new user operation that was created for some specific model topology by the author of the model using framework extension capabilities.
Importing models with such operations requires additional steps. This guide illustrates the workflow for running inference on models featuring custom operations, allowing you to plug in your own implementation for them. OpenVINO Extensibility API lets you add support for those custom operations and use one implementation for Model Optimizer and OpenVINO Runtime.
Importing models with such operations requires additional steps. This guide illustrates the workflow for running inference on models featuring custom operations. This allows plugging in your own implementation for them. OpenVINO Extensibility API enables adding support for those custom operations and using one implementation for Model Optimizer and OpenVINO Runtime.
Defining a new custom operation basically consist of two parts:
Defining a new custom operation basically consists of two parts:
1. Definition of operation semantics in OpenVINO, the code that describes how this operation should be inferred consuming input tensor(s) and producing output tensor(s). How to implement execution kernels for [GPU](./GPU_Extensibility.md) and [VPU](./VPU_Extensibility.md) is described in separate guides.
1. Definition of operation semantics in OpenVINO, the code that describes how this operation should be inferred consuming input tensor(s) and producing output tensor(s). The implementation of execution kernels for [GPU](./GPU_Extensibility.md) and [VPU](./VPU_Extensibility.md) is described in separate guides.
2. Mapping rule that facilitates conversion of framework operation representation to OpenVINO defined operation semantics.
The first part is required for inference, the second part is required for successful import of a model containing such operations from the original framework model format. There are several options to implement each part, the next sections will describe them in detail.
The first part is required for inference. The second part is required for successful import of a model containing such operations from the original framework model format. There are several options to implement each part. The following sections will describe them in detail.
## Definition of Operation Semantics
If the custom operation can be mathematically represented as a combination of existing OpenVINO operations and such decomposition gives desired performance, then low-level operation implementation is not required. Refer to the latest OpenVINO operation set, when deciding feasibility of such decomposition. You can use any valid combination of existing operations. The next section of this document describes the way to map a custom operation.
If the custom operation can be mathematically represented as a combination of existing OpenVINO operations and such decomposition gives the desired performance, then a low-level operation implementation is not required. When deciding on the feasibility of such decomposition, refer to the latest OpenVINO operation set. You can use any valid combination of existing operations. How to map a custom operation is described in the next section of this document.
If such decomposition is not possible or appears too bulky with a large number of constituent operations that do not perform well, then a new class for the custom operation should be implemented, as described in the [Custom Operation Guide](add_openvino_ops.md).
If such decomposition is not possible or appears too bulky, with too many constituent operations that do not perform well, then a new class for the custom operation should be implemented, as described in the [Custom Operation Guide](add_openvino_ops.md).
Prefer implementing a custom operation class if you already have a generic C++ implementation of operation kernel. Otherwise try to decompose the operation first as described above and then after verifying correctness of inference and resulting performance, optionally invest to implementing bare metal C++ implementation.
You might prefer implementing a custom operation class if you already have a generic C++ implementation of the operation kernel. Otherwise, try to decompose the operation first, as described above. Then, after verifying correctness of inference and the resulting performance, you may move on to an optional bare-metal C++ implementation.
## Mapping from Framework Operation
Depending on model format used for import, mapping of custom operation is implemented differently, choose one of:
Mapping of a custom operation is implemented differently, depending on the model format used for import. You may choose one of the following:
1. If model is represented in ONNX (including models exported from Pytorch in ONNX) or PaddlePaddle formats, then one of the classes from [Frontend Extension API](frontend_extensions.md) should be used. It consists of several classes available in C++ which can be used with Model Optimizer `--extensions` option or when model is imported directly to OpenVINO run-time using read_model method. Python API is also available for run-time model importing.
1. If a model is represented in the ONNX (including models exported from PyTorch to ONNX) or PaddlePaddle format, then one of the classes from [Frontend Extension API](frontend_extensions.md) should be used. It consists of several classes available in C++, which can be used with the `--extensions` option in Model Optimizer or when a model is imported directly to OpenVINO Runtime using the `read_model` method. Python API is also available for runtime model import.
2. If model is represented in TensorFlow, Caffe, Kaldi or MXNet formats, then [Model Optimizer Extensions](../MO_DG/prepare_model/customize_model_optimizer/Customize_Model_Optimizer.md) should be used. This approach is available for model conversion in Model Optimizer only.
2. If a model is represented in the TensorFlow, Caffe, Kaldi or MXNet formats, then [Model Optimizer Extensions](../MO_DG/prepare_model/customize_model_optimizer/Customize_Model_Optimizer.md) should be used. This approach is available for model conversion in Model Optimizer only.
The existence of two approaches is explained by the two different types of frontends used for model conversion in OpenVINO: new frontends (ONNX, PaddlePaddle) and legacy frontends (TensorFlow, Caffe, Kaldi, and Apache MXNet). Model Optimizer can use both types of frontends, in contrast to the direct import of a model with the `read_model` method, which can use new frontends only. Follow the appropriate guide referenced above to implement mappings, depending on the framework frontend.
If you are implementing extensions for ONNX or PaddlePaddle new frontends and plan to use Model Optimizer `--extension` option for model conversion, then the extensions should be
If you are implementing extensions for new ONNX or PaddlePaddle frontends and plan to use the `--extensions` option in Model Optimizer for model conversion, then the extensions should be:
1. Implemented in C++ only
1. Implemented in C++ only.
2. Compiled as a separate shared library (see details how to do that later in this guide).
2. Compiled as a separate shared library (see details on how to do this further in this guide).
You cannot write new frontend extensions using Python API if you plan to use them with Model Optimizer.
Model Optimizer does not support new frontend extensions written in Python API.
Remaining part of this guide uses Frontend Extension API applicable for new frontends.
The remaining part of this guide describes the application of Frontend Extension API for new frontends.
## Registering Extensions
A custom operation class and a new mapping frontend extension class object should be registered to be usable in OpenVINO runtime.
> **NOTE**: This documentation is written based on the [Template extension](https://github.com/openvinotoolkit/openvino/tree/master/docs/template_extension/new), which demonstrates extension development details based on minimalistic `Identity` operation that is a placeholder for your real custom operation. You can review the complete code, which is fully compliable, to see how it works.
> **NOTE**: This documentation is derived from the [Template extension](https://github.com/openvinotoolkit/openvino/tree/master/src/core/template_extension/new), which demonstrates the details of extension development. It is based on a minimalistic `Identity` operation that serves as a placeholder for your real custom operation. Review the complete, fully compilable code to see how it works.
To load the extensions to the `ov::Core` object, use the `ov::Core::add_extension` method, this method allows to load library with extensions or extensions from the code.
Use the `ov::Core::add_extension` method to load the extensions to the `ov::Core` object. This method allows loading a library with extensions or loading extensions defined in the code.
### Load extensions to core
### Load Extensions to Core
Extensions can be loaded from code with `ov::Core::add_extension` method:
Extensions can be loaded from code with the `ov::Core::add_extension` method:
@sphinxtabset
@@ -92,7 +91,7 @@ Extensions can be loaded from code with `ov::Core::add_extension` method:
@endsphinxtabset
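For orientation only, here is a minimal, hypothetical C++ sketch of both approaches; the `TemplateExtension::Identity` class comes from the template extension referenced above, and the header and library file names are placeholders:

```cpp
#include <openvino/openvino.hpp>
#include "identity.hpp" // hypothetical header declaring TemplateExtension::Identity

int main() {
    ov::Core core;

    // Register the custom operation type directly from code.
    core.add_extension<TemplateExtension::Identity>();

    // Alternatively, load a pre-built shared library that bundles the extensions.
    core.add_extension("libopenvino_template_extension.so");

    // The custom operation can now be used when reading a model.
    auto model = core.read_model("model_with_identity.xml");
    return 0;
}
```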
`Identity` is custom operation class defined in [Custom Operation Guide](add_openvino_ops.md). This is enough to enable reading IR which uses `Identity` extension operation emitted by Model Optimizer. To be able to load original model directly to the runtime, you need to add also a mapping extension:
`Identity` is a custom operation class defined in the [Custom Operation Guide](add_openvino_ops.md). This is sufficient to enable reading an OpenVINO IR which uses the `Identity` extension operation emitted by Model Optimizer. To load the original model directly into the runtime, also add a mapping extension:
@sphinxdirective
@@ -110,32 +109,34 @@ Extensions can be loaded from code with `ov::Core::add_extension` method:
@endsphinxdirective
When Python API is used there is no way to implement a custom OpenVINO operation. Also, even if custom OpenVINO operation is implemented in C++ and loaded to the runtime through a shared library, there is still no way to add a frontend mapping extension that refers to this custom operation. Use C++ shared library approach to implement both operations semantics and framework mapping in this case.
When the Python API is used, there is no way to implement a custom OpenVINO operation. Even if a custom OpenVINO operation is implemented in C++ and loaded into the runtime by a shared library, there is still no way to add a frontend mapping extension that refers to this custom operation. In this case, use the C++ shared library approach to implement both the operation semantics and the framework mapping.
You still can use Python for operation mapping and decomposition in case if operations from the standard OpenVINO operation set is used only.
Python can still be used to map and decompose operations when only operations from the standard OpenVINO operation set are used.
### Create library with extensions
### Create a Library with Extensions
You need to create extension library in the following cases:
- Convert model with custom operations in Model Optimizer
- Load model with custom operations in Python application. It is applicable for both framework model and IR.
- Loading models with custom operations in tools that support loading extensions from a library, for example `benchmark_app`.
An extension library should be created in the following cases:
If you want to create an extension library, for example in order to load these extensions to the Model Optimizer, you need to do next steps:
Create an entry point for extension library. OpenVINO™ provides an `OPENVINO_CREATE_EXTENSIONS()` macro, which allows to define an entry point to a library with OpenVINO™ Extensions.
This macro should have a vector of all OpenVINO™ Extensions as an argument.
- Conversion of a model with custom operations in Model Optimizer.
- Loading a model with custom operations in a Python application. This applies to both the framework model and the OpenVINO IR.
- Loading models with custom operations in tools that support loading extensions from a library, for example the `benchmark_app`.
Based on that, the declaration of an extension class can look as follows:
To create an extension library, for example, to load the extensions into Model Optimizer, perform the following:
1. Create an entry point for the extension library. OpenVINO provides the `OPENVINO_CREATE_EXTENSIONS()` macro, which allows you to define an entry point to a library with OpenVINO Extensions.
This macro should have a vector of all OpenVINO Extensions as an argument.
Based on that, the declaration of an extension class might look like the following:
@snippet template_extension/new/ov_extension.cpp ov_extension:entry_point
To configure the build of your extension library, use the following CMake script:
2. Configure the build of your extension library, using the following CMake script:
@snippet template_extension/new/CMakeLists.txt cmake:extension
This CMake script finds the OpenVINO using the `find_package` CMake command.
This CMake script finds OpenVINO, using the `find_package` CMake command.
To build the extension library, run the commands below:
3. Build the extension library, running the commands below:
```sh
$ cd docs/template_extension/new
@@ -145,7 +146,7 @@ $ cmake -DOpenVINO_DIR=<OpenVINO_DIR> ../
$ cmake --build .
```
After the build you can use path to your extension library to load your extensions to OpenVINO Runtime:
4. After the build, you may use the path to your extension library to load your extensions to OpenVINO Runtime:
@sphinxtabset
@@ -168,4 +169,3 @@ After the build you can use path to your extension library to load your extensio
* [OpenVINO Transformations](./ov_transformations.md)
* [Using OpenVINO Runtime Samples](../OV_Runtime_UG/Samples_Overview.md)
* [Hello Shape Infer SSD sample](../../samples/cpp/hello_reshape_ssd/README.md)

View File

@@ -2,9 +2,10 @@
To enable operations not supported by OpenVINO™ out of the box, you need a custom extension for Model Optimizer, a custom nGraph operation set, and a custom kernel for the device you will target. This page describes custom kernel support for one of the VPUs, the Intel® Neural Compute Stick 2 device, which uses the MYRIAD device plugin.
> **NOTES:**
> * OpenCL\* custom layer support is available in the preview mode.
> **NOTE:**
> * OpenCL custom layer support is available in the preview mode.
> * This section assumes you are familiar with developing kernels using OpenCL.
To customize your topology with an OpenCL layer, carry out the tasks described on this page:
1. Write and compile your OpenCL code with the standalone offline OpenCL compiler (`clc`).
@@ -13,9 +14,9 @@ To customize your topology with an OpenCL layer, carry out the tasks described o
## Compile OpenCL code for VPU (Intel® Neural Compute Stick 2)
> **NOTE**: OpenCL compiler, targeting Intel® Neural Compute Stick 2 for the SHAVE* processor only, is redistributed with OpenVINO.
OpenCL support is provided by ComputeAorta* and is distributed under a license agreement between Intel® and Codeplay* Software Ltd.
The OpenCL toolchain for the Intel® Neural Compute Stick 2 supports offline compilation only, so first compile OpenCL C code using the standalone `clc` compiler. You can find the compiler binary at `<INSTALL_DIR>/tools/cl_compiler`.
> **NOTE**: OpenCL compiler, targeting Intel® Neural Compute Stick 2 for the SHAVE processor only, is redistributed with OpenVINO.
OpenCL support is provided by ComputeAorta and is distributed under a license agreement between Intel® and Codeplay Software Ltd.
The OpenCL toolchain for the Intel® Neural Compute Stick 2 supports offline compilation only. Start by compiling the OpenCL C code using the standalone `clc` compiler. You can find the compiler binary at `<INSTALL_DIR>/tools/cl_compiler`.
> **NOTE**: By design, custom OpenCL layers support any OpenCL kernels written assuming OpenCL version 1.2. They also support the half float extension and are optimized for this type, because it is a native type for Intel® Movidius™ VPUs.
1. Prior to running a compilation, make sure that the following variables are set:
@@ -63,7 +64,7 @@ Each custom layer is described with the `CustomLayer` node. It has the following
- Node `Source` must contain the following attributes:
- `filename` – The path to a compiled binary relative to the XML configuration file.
- Sub-node `Parameters` – Describes parameter bindings. For more information, see the description below.
- Sub-node `WorkSizes` – Describes local and global work group sizes and the source for dimension deduction as a pair `direction,port`. In the example above, the work group is described relatively to the dimension of the input tensor that comes through port 0 in the IR. `global` and `local` work group configurations support any simple math expressions with +,-,\*,/, and () from `B`(batch), `Y`(height), `X`(width) and `F`(channels).
- Sub-node `WorkSizes` – Describes local and global work group sizes and the source for dimension deduction as a pair `direction,port`. In the example above, the work group is described relative to the dimension of the input tensor that comes through port 0 in the OpenVINO IR. Work group configurations, namely `global` and `local`, support any simple math expressions with +,-,\*,/, and () from `B` (batch), `Y` (height), `X` (width), and `F` (channels).
- Sub-node `Where` – Allows customizing bindings with the `key="value"` attribute. For example, to substitute only 3x3 convolutions, write `<Where kernel="3,3"/>` in the binding xml.
A parameter description supports `Tensor` nodes of one of the tensor types (`input`, `output`, `input_buffer`, `output_buffer`, or `data`), as well as `Scalar` or `Data` nodes, and has the following format:
@@ -77,7 +78,7 @@ Each custom layer is described with the `CustomLayer` node. It has the following
- `type` – Node type: `input_buffer` or `output_buffer`. Use the appropriate type to bind multiple kernels that correspond to different stages of the same layer.
- `port-index` – The unique identifier to bind by.
- `dim` – The dim source with the same `direction,port` format used for `WorkSizes` bindings.
- `size` – Amount of bytes needed. Current expression syntax supports only expression over dimensions of over selected input/output tensor or constants and might be expended in the future.
- `size` – The amount of bytes needed. The current expression syntax supports only expressions over dimensions of the selected input/output tensor or constants and might be extended in the future.
Here is an example of multi-stage MVN layer binding:
```xml
@@ -107,7 +108,7 @@ Each custom layer is described with the `CustomLayer` node. It has the following
<WorkSizes dim="output,0" global="((Y+7)/8)*8,F,1" local="8,1,1"/>
</CustomLayer>
```
- Each `Tensor` node that has the type `data` must contain the following attributes:
- Each `Tensor` node that has the `data` type must contain the following attributes:
- `source` – A name of the blob as it is in the IR. Typical example is `weights` for convolution.
- `format` – Specifies the channel order in the tensor. Optional conversion layers are generated if the custom layer format does not match.
```xml
@@ -133,7 +134,7 @@ Each custom layer is described with the `CustomLayer` node. It has the following
- Each `Data` node must contain the following attributes:
- `arg-name` – The name of a kernel parameter in the kernel signature.
- `type` – Node type. Currently, `local_data` is the only supported value, which defines a buffer allocated in fast local on-chip memory. It is limited to 100KB for all `__local` and
`__private` arrays defined inside the kernel as well as all `__local` parameters passed to the kernel. Note that a manual-DMA extension requires double buffering.
`__private` arrays defined inside the kernel as well as all `__local` parameters passed to the kernel. A manual-DMA extension requires double buffering.
If the custom layer is detected to run out of local memory, the inference fails.
- `dim` – The dim source with the same `direction,port` format used for `WorkSizes` bindings.
- `size` – The amount of bytes needed. The current expression syntax supports only expressions over dimensions of the selected input/output tensor or constants and may be extended in the future.
@@ -158,14 +159,13 @@ Each custom layer is described with the `CustomLayer` node. It has the following
## Pass Configuration File to OpenVINO™ Runtime
> **NOTE**: If both native and custom layer implementations are present, the custom kernel has a priority over the native one.
Before loading the network that features the custom layers, provide a separate configuration file and load it using the ov::Core::set_property() method with the "CONFIG_KEY" key and the configuration file name as a value before loading the network that uses custom operations to the plugin:
Before loading the network that features the custom layers, provide a separate configuration file and load it using the `ov::Core::set_property()` method, with the "CONFIG_KEY" key and the configuration file name as its value:
@snippet docs/snippets/vpu/custom_op.cpp part0
## Optimizing Kernels with OpenCL for VPU (Intel® Neural Compute Stick 2)
This section provides optimization guidelines on writing custom layers with OpenCL for VPU devices. Knowledge about general OpenCL
programming model and OpenCL kernel language is assumed and not a subject of this section. The OpenCL model mapping to VPU is described in the table below.
This section provides optimization guidelines on writing custom layers with OpenCL for VPU devices. Knowledge of the general OpenCL programming model and the OpenCL kernel language is assumed and is not a subject of this section. The OpenCL model mapping to VPU is described in the table below.
| OpenCL Model | VPU Mapping|
|-----|----|
@@ -175,41 +175,33 @@ programming model and OpenCL kernel language is assumed and not a subject of thi
| Global memory | Mapped to DDR, used to pass execution preserved parameters for inputs, outputs, and blobs |
| Work group | Executed on a single SHAVE core iterating over multiple work items |
Note that by the OpenCL specification, the work group execution order is not specified. This means that it is your
responsibility to ensure that race conditions among work groups are not introduced. Custom layer runtime spits evenly
work grid among available compute resources and executes them in an arbitrary order. This static scheduling approach works best if the load is evenly spread out across work groups, which is a typical case for Deep Learning kernels. The following guidelines are recommended to use for work group partitioning:
The work group execution order is not defined in the OpenCL specification. This means it is your responsibility to ensure that race conditions among work groups are not introduced. The custom layer runtime distributes the work grid evenly among available compute resources and executes the work groups in an arbitrary order. This static scheduling approach works best if the load is evenly spread out across work groups, which is a typical case for Deep Learning kernels. The following guidelines are recommended for work group partitioning:
1. Split work evenly across work groups.
1. Distribute work evenly across work groups.
2. Adjust work group granularity to maintain an equal workload for all compute cores.
3. Set the maximum number of cores using the `max-shaves` attribute for the `CustomLayer` node. This keeps more resources for the rest of the topology. It is also useful if the kernel scalability has reached its limits, which may happen while optimizing memory-bound kernels or kernels with poor parallelization.
4. Try an alternate data layout (`BFXY`/`BYXF`) for the kernel if it improves work group partitioning or data access patterns.
Consider not just specific layer boost, but full topology performance because data conversion layers would be automatically inserted
as appropriate.
4. Try an alternate data layout (`BFXY`/`BYXF`) for the kernel to see if it improves work group partitioning or data access patterns.
Consider not just the boost for a specific layer, but also the full topology performance, because data conversion layers will be automatically inserted as appropriate.
The offline OpenCL compiler (`clc`) features automatic vectorization over `get_global_id(0)` usage, if uniform access is detected.
For example, the kernel below could be automatically vectorized:
```cpp
__kernel void cvtf32f16(__global float* restrict inImage, __global half* restrict outImage,
float scale, float bais)
float scale, float bias)
{
int idx = get_global_id(0) + get_global_id(1) * get_global_size(0) + get_global_id(2) * get_global_size(0) * get_global_size(1);
outImage[idx] = convert_half(inImage[idx]*scale+bais);
outImage[idx] = convert_half(inImage[idx]*scale+bias);
}
```
However, this work-group based vectorizer (WGV) conflicts with the default LLVM vectorizer based on superword level parallelism
(SLP) for the current compiler version. Manual vectorization is recommended to provide the best performance for non-uniform code
patterns. WGV works if and only if vector types are not used in the code.
However, this work-group based vectorizer (WGV) conflicts with the default LLVM vectorizer based on superword level parallelism (SLP) for the current compiler version. Manual vectorization is recommended to provide the best performance for non-uniform code patterns. WGV works if and only if vector types are not used in the code.
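For illustration, a manually vectorized variant of the `cvtf32f16` kernel above might look like the sketch below. It assumes the total number of elements is a multiple of 8 and that the global work size is reduced accordingly, so each work item converts 8 consecutive values:

```cpp
__kernel void cvtf32f16_vec8(__global const float* restrict inImage,
                             __global half* restrict outImage,
                             float scale, float bias)
{
    int idx = get_global_id(0) + get_global_id(1) * get_global_size(0)
            + get_global_id(2) * get_global_size(0) * get_global_size(1);

    // Explicit vector types disable WGV, so vectorization is done by hand here.
    float8 v = vload8(idx, inImage);               // reads elements [idx*8 .. idx*8+7]
    vstore_half8(v * scale + bias, idx, outImage); // converts and stores 8 halves
}
```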
Here is a short list of optimization tips:
1. Help auto-vectorizer ensure non-aliasing pointers for kernel parameters by putting `restrict` where possible.
- This can give a performance boost, especially for kernels with unrolling, like `ocl_grn` from the example below.
- Place `restrict` markers for kernels with manually vectorized codes. In the `ocl_grn` kernel below, the unrolled version without `restrict` is up to 20% slower than the most optimal one, which combines unrolling and `restrict`.
2. Put `#&zwj;pragma unroll N` to your loop header. The compiler does not trigger unrolling by default, so it is your responsibility to
annotate the code with pragmas as appropriate. The `ocl_grn` version with `#&zwj;pragma unroll 4` is up to 50% faster, most of which comes from unrolling the first loop, because LLVM, in general, is better in scheduling 3-stage loops (load-compute-store), while the fist loop
`variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]);` is only 2-stage (load-compute). Pay
attention to unrolling such cases first. Unrolling factor is loop-dependent. Choose the smallest number that
still improves performance as an optimum between the kernel size and execution speed. For this specific kernel, changing the unroll factor from `4` to `6` results in the same performance, so unrolling factor equal to 4 is an optimum. For Intel® Neural Compute Stick 2, unrolling is conjugated with the automatic software pipelining for load, store, and compute stages:
1. Help the auto-vectorizer ensure non-aliasing pointers for kernel parameters by putting the `restrict` markers where possible.
- This can give a performance boost, especially for kernels with unrolling, like the `ocl_grn` kernel from the example below.
- Place `restrict` markers for kernels with manually vectorized code. In the `ocl_grn` kernel below, the unrolled version without `restrict` is up to 20% slower than the most optimal one, which combines both unrolling and `restrict`.
2. Add `#&zwj;pragma unroll N` to your loop header. The compiler does not trigger unrolling by default, so it is your responsibility to annotate the code with pragmas as appropriate. The `ocl_grn` version with `#&zwj;pragma unroll 4` is up to 50% faster, most of which comes from unrolling the first loop, because LLVM, in general, is better in scheduling 3-stage loops (load-compute-store), while the first loop
`variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]);` is only 2-stage (load-compute). Pay attention to unrolling such cases first. The unrolling factor is loop-dependent. Choose the smallest number that still improves performance as an optimum between the kernel size and execution speed. For this specific kernel, changing the unroll factor from `4` to `6` results in the same performance, so an unrolling factor of `4` is the optimum. For Intel Neural Compute Stick 2, unrolling is combined with the automatic software pipelining for load, store, and compute stages:
```cpp
__kernel void ocl_grn(__global const half* restrict src_data, __global half* restrict dst_data, int C, float bias)
{
@@ -227,7 +219,7 @@ __kernel void ocl_grn(__global const half* restrict src_data, __global half* res
dst_data[c*H*W + y*W + x] = (half)((float)src_data[c*H*W + y*W + x] * variance);
}
```
To check the efficiency of WGV, you can compare performance of the kernel above with the kernel below, which is manually vectorized over width:
To check the efficiency of WGV, compare performance of the kernel above with the kernel below, which is manually vectorized over width:
```cpp
__kernel void ocl_grn_line(__global const half* restrict src_data, __global half* restrict dst_data, int C, int W, float bias)
{
@@ -267,19 +259,14 @@ __kernel void ocl_grn_line(__global const half* restrict src_data, __global hal
```
Both versions perform the same, but the second one has more complex code.
3. If it is easy to predict the work group size, you can also use the `reqd_work_group_size` kernel attribute to ask the compiler
to unroll the code up to the local size of the work group. Note that if the kernel is actually executed with the
different work group configuration, the result is undefined.
3. If it is easy to predict the work group size, use the `reqd_work_group_size` kernel attribute to ask the compiler to unroll the code up to the local size of the work group. If the kernel is actually executed with a different work group configuration, the result is undefined.
4. Prefer to use the `half` compute if it keeps reasonable accuracy. 16-bit float is a native type for Intel® Neural Compute Stick 2, most of the functions `half_*` are mapped to a single hardware instruction.
4. Prefer to use the `half` compute if it keeps reasonable accuracy. A 16-bit float is a native type for Intel Neural Compute Stick 2; most of the `half_*` functions are mapped to a single hardware instruction.
Use the standard `native_*` functions for the rest of the types, as in the sketch below.
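As a purely illustrative sketch (the kernel and its activation are made up for this example), the reduced-precision builtins can be used like this:

```cpp
__kernel void fast_activation(__global const half* restrict src,
                              __global half* restrict dst,
                              float scale)
{
    int i = get_global_id(0);
    float x = (float)src[i] * scale;
    // half_exp trades accuracy for speed; the native_* builtins play the same
    // role for the remaining types and functions.
    dst[i] = convert_half(1.0f / (1.0f + half_exp(-x)));
}
```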
5. Prefer to use the `convert_half` function over `vstore_half` if conversion to 32-bit float is required. `convert_half` is mapped to a single hardware instruction. For the `cvtf32f16` kernel above, the line `outImage[idx] = convert_half(inImage[idx]*scale+bais);` is eight times slower than the code with `vstore_half`.
5. Prefer to use the `convert_half` function over the `vstore_half` if conversion to 32-bit float is required. The `convert_half` function is mapped to a single hardware instruction. For the `cvtf32f16` kernel above, the `outImage[idx] = convert_half(inImage[idx]*scale+bias);` code is eight times slower than the code with `vstore_half`.
6. Mind early exits. Early exit can be extremely costly for the current version of the `clc` compiler due to conflicts with the
auto-vectorizer. The generic advice would be to setup local size by `x` dimension equal to inputs or/and outputs width.
If it is impossible to define the work grid that exactly matches inputs or/and outputs to eliminate checks, for example,
`if (get_global_id(0) >= width) return`, use line-wise kernel variant with manual vectorization.
6. Be aware of early exits, as they can be extremely costly for the current version of the `clc` compiler due to conflicts with the auto-vectorizer. It is recommended to set up the local size along the `x` dimension equal to the input and/or output width. If it is impossible to define a work grid that exactly matches the inputs and/or outputs, so that checks such as `if (get_global_id(0) >= width) return` cannot be eliminated, use a line-wise kernel variant with manual vectorization.
The kernel example below demonstrates the impact of early exits on kernel performance.
```cpp
// Initial version
@@ -302,8 +289,8 @@ The kernel example below demonstrates the impact of early exits on kernel perfor
}
```
This `reorg` kernel is auto-vectorizable, but the input for the YOLO v2 topology is `NCHW=<1,64,26,26>`, which is not a multiple of the vector width (`8` for the `half` data type). As a result, the Inference Engine does not select the auto-vectorized kernel.
To compare performance of auto-vectorized and scalar version of the kernel, change the input size to`NCHW=<1,64,26,32>`. This enables the auto-vectorized version to be selected by the Inference Engine and can give you about 30% uplift.
Since the auto-vectorized version is faster, it makes sense to enable it for the YOLO v2 topology input size by setting the local size multiple of vector, for example, 32, and adjust global sizes accordingly. As a result, the execution work grid exceeds actual input dimension, so out-of-bound checks should be inserted. See the updated kernel version below:
To compare performance of auto-vectorized and scalar version of the kernel, change the input size to `NCHW=<1,64,26,32>`. This enables the auto-vectorized version to be selected by the Inference Engine and can give you about 30% uplift.
Since the auto-vectorized version is faster, it is recommended to enable it for the YOLO v2 topology input size by setting the local size to a multiple of the vector width, for example, `32`, and adjusting the global sizes accordingly. As a result, the execution work grid exceeds the actual input dimension, so out-of-bound checks should be inserted. See the updated kernel version below:
```cpp
// Version with out-of-bound checks added
__kernel void reorg(const __global half* restrict src, __global half* restrict out, int W, int stride)
@@ -324,7 +311,7 @@ Since the auto-vectorized version is faster, it makes sense to enable it for the
out[W*H*c + W*h + w] = src[W2*H2*c2 + W2*h2 + w2];
}
```
This code performs the same as the initial kernel above (scalar) due to branching overhead. If you replace min/max expression `w = min(w, W-1);` with `if (w >= W) return;`, runtime increases up to 2x against to code without branching (initial version).<br>
This code performs the same as the initial kernel above (scalar) due to branching overhead. If the `w = min(w, W-1);` min/max expression is replaced with `if (w >= W) return;`, the runtime increases up to 2x compared to the code without branching (the initial version).<br>
If branching is inevitable for your element-based kernel, it is recommended to change the scheme to line-based. See the kernel variant below:
```cpp
// Line-wise version
@@ -347,8 +334,8 @@ __kernel void reorg(const __global half* restrict src, __global half* restrict o
}
```
This decreases the execution time up to 40% against the best performing vectorized kernel without early exits (initial version).
7. Reuse computations among work items by using line-based kernels or sharing values though `__local` memory.
8. Improve data access locality. Most of custom kernels are memory bound while convolution and fully connected layers are hardware-implemented. The code below demonstrates a further optimized version of the `reorg` kernel unrolled by `stride`:
7. Reuse computations among work items by using line-based kernels or sharing values through the `__local` memory.
8. Improve data access locality. Most custom kernels are memory bound, while convolution and fully connected layers are hardware-implemented. The code below demonstrates a further optimized version of the `reorg` kernel unrolled by the `stride`:
```cpp
// Unrolled line-wise version
__kernel void reorg_unrolled_by_stride(const __global half* restrict src, __global half* restrict dst,
@@ -366,14 +353,11 @@ This decreases the execution time up to 40% against the best performing vectoriz
dst[W*H*C2*(stride_y*stride+stride_x) + W*H*c2 + W*h + w] = src[W2*H2*c2 + W2*h*stride + W2*stride_y + w2 + stride_x];
}
```
`scr` data in this case loaded only once. As the result, the cycle count drops up to 45% against the line-wise version.
The `src` data in this case is loaded only once. As a result, the cycle count drops up to 45% against the line-wise version.
9. Copy data from `__dlobal` to `__local` or `__private` memory if the data is accessed more than once. Access to
`__dlobal` memory is orders of magnitude slower than access to `__local`/`__private` due to statically scheduled pipeline, which
stalls completely on memory access without any prefetch. The same recommendation is applicable for scalar load/store
from/to a `__blobal` pointer since work-group copying could be done in a vector fashion.
9. Copy data from `__global` to `__local` or `__private` memory if the data is accessed more than once. Access to `__global` memory is orders of magnitude slower than access to `__local`/`__private` memory due to the statically scheduled pipeline, which stalls completely on memory access without any prefetch. The same recommendation applies to scalar load/store from/to a `__global` pointer, since work-group copying could be done in a vector fashion, as sketched below.
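The following minimal sketch stages a row in `__local` memory before reuse. The kernel name is hypothetical, and it assumes a local size of (W, 1, 1) with `W` not exceeding the `line` buffer size:

```cpp
__kernel void smooth_row_local(__global const half* restrict src,
                               __global half* restrict dst,
                               int W)
{
    __local half line[1024];            // assumes W <= 1024
    const int x = get_local_id(0);
    const int y = get_group_id(1);

    // Each work item copies its element from slow DDR into fast local memory once.
    line[x] = src[y * W + x];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Repeated accesses then hit local memory instead of DDR.
    const float left   = (x > 0)     ? (float)line[x - 1] : (float)line[x];
    const float center = (float)line[x];
    const float right  = (x < W - 1) ? (float)line[x + 1] : (float)line[x];
    dst[y * W + x] = (half)((left + center + right) / 3.0f);
}
```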
10. Use a manual DMA extension. Local (on-chip) memory throughput is up to 24x higher than DDR throughput. Starting from OpenVINO 2020.1, VPU OpenCL features manual-DMA kernel extension to copy sub-tensor used by work group into local memory and performing compute without DDR evolved. Here is the simple GRN kernel implementation that runs over DDR. Local size is in the form (width of the input tensor, 1, 1) to define a large enough work group to get code automatically vectorized and unrolled, while global size is (width of the input tensor, height of the input tensor, 1):
10. Use a manual DMA extension. Local (on-chip) memory throughput is up to 24x higher than DDR throughput. Since the OpenVINO 2020.1 release, VPU OpenCL features a manual-DMA kernel extension to copy the sub-tensor used by a work group into local memory and perform the compute without DDR involved. Here is a simple GRN kernel implementation that runs over DDR. The local size is in the form (width of the input tensor, 1, 1) to define a work group large enough to get the code automatically vectorized and unrolled, while the global size is (width of the input tensor, height of the input tensor, 1):
```cpp
__kernel void grn_NCHW(
__global const half* restrict src_data,
@@ -398,7 +382,7 @@ from/to a `__blobal` pointer since work-group copying could be done in a vector
}
```
This kernel can be rewritten to introduce special data binding `__dma_preload` and `__dma_postwrite intrinsics`. This means that instead of one kernel, a group of three kernels should be implemented: `kernelName`, `__dma_preload_kernelName`, and `__dma_postwrite_kernelName`. `__dma_preload_kernelName` for a particular work group `n` is guaranteed to be executed before the `n`-th work group itself, while `__dma_postwrite_kernelName` is guaranteed to be executed after a corresponding work group. You can define one of those functions that are intended to be used to copy data from-to `__global` and `__local` memory. The syntactics requires exact functional signature match. The example below illustrates how to prepare your kernel for manual-DMA.
This kernel can be rewritten to introduce special data binding with the `__dma_preload` and `__dma_postwrite` intrinsics. This means that instead of one kernel, a group of three kernels should be implemented: `kernelName`, `__dma_preload_kernelName`, and `__dma_postwrite_kernelName`. The `__dma_preload_kernelName` kernel for a particular work group `n` is guaranteed to be executed before the `n`-th work group itself, while the `__dma_postwrite_kernelName` kernel is guaranteed to be executed after the corresponding work group. Either of those functions may be defined and used to copy data between `__global` and `__local` memory. The syntax requires an exact function signature match. The example below illustrates how to prepare your kernel for manual-DMA.
```cpp
__kernel void __dma_preload_grn_NCHW(
@@ -557,9 +541,9 @@ __kernel void grn_NCHW(
}
```
Note the `get_local_size` and `get_local_id` usage inside the kernel. 21x speedup is expected for a kernel on enet-curbs setup because it was completely limited by memory usage.
> **NOTE**: Pay attention to the `get_local_size` and `get_local_id` usage inside the kernel. A 21x speedup is expected for this kernel on the enet-curbs setup, since it is completely limited by memory usage.
An alternative method to using DMA is to use work item copy extension. Those functions are executed inside a kernel and requires work groups equal to single work item.
An alternative to using DMA is the work item copy extension. Those functions are executed inside a kernel and require work groups equal to a single work item.
Here is the list of supported work item functions:
```cpp