Review Inference Engine Extensibility docs (#4267)
# Custom nGraph Operation {#openvino_docs_IE_DG_Extensibility_DG_AddingNGraphOps}

Inference Engine Extension API enables you to register operation sets (opsets) with custom nGraph operations to support models with operations which OpenVINO™ does not support out-of-the-box.

## Operation Class

To add your custom nGraph operation, create a new class that extends `ngraph::Op`:

1. Add the `NGRAPH_RTTI_DECLARATION` and `NGRAPH_RTTI_DEFINITION` macros which define a `NodeTypeInfo` object that identifies the type of the operation to the graph users and helps with dynamic type resolution. The type info of an nGraph operation currently consists of a string identifier and a version number, but this may change in the future.

2. Implement constructors that optionally take the operation inputs and attributes as parameters.

3. Override the shape inference method `validate_and_infer_types`. This method is called multiple times during graph manipulations to determine the shapes and element types of the outputs of the operation. To access the input shapes and input element types, use the `get_input_partial_shape()` and `get_input_element_type()` methods of `ngraph::Node`. Set the inferred shape and element type of the output using `set_output_type`.

4. Override the `clone_with_new_inputs` method, which enables graph manipulation routines to create copies of this operation and connect it to different nodes during optimization.

5. Override the `visit_attributes` method, which enables serialization and deserialization of operation attributes. An `AttributeVisitor` is passed to the method, and the implementation is expected to walk over all the attributes in the op using the type-aware `on_attribute` helper. Helpers are already implemented for standard C++ types like `int64_t`, `float`, `bool`, `vector`, and for existing nGraph defined types.
6. Override `evaluate`, which is an optional method that enables the application of constant folding if there is a custom operation on the constant branch.
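
The role of `evaluate` in constant folding can be illustrated with a standalone sketch. This is not the nGraph API: the `Node` struct, `try_fold`, and the toy "add an attribute" operation are illustrative names we introduce here. The idea is that when an operation sits on a constant branch, the optimizer can evaluate it at graph-construction time and replace it with a precomputed constant.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <vector>

// Toy graph node: `constant` is set when the node is a Constant.
struct Node {
    std::optional<int64_t> constant;
    std::vector<std::shared_ptr<Node>> inputs;
    int64_t add_attr = 0;  // mirrors the `add` attribute from the text
};

// Toy "evaluate": the operation adds its attribute to its single input.
int64_t evaluate_add_op(const Node& op, int64_t input_value) {
    return input_value + op.add_attr;
}

// Constant folding: if the input is a constant, replace the operation with a
// new Constant node carrying the precomputed value; otherwise keep the node.
std::shared_ptr<Node> try_fold(const std::shared_ptr<Node>& op) {
    if (op->inputs.size() == 1 && op->inputs[0]->constant.has_value()) {
        auto folded = std::make_shared<Node>();
        folded->constant = evaluate_add_op(*op, *op->inputs[0]->constant);
        return folded;
    }
    return op;  // not on a constant branch
}
```

In the real API, `ngraph::Node::evaluate` receives input and output tensors instead of this toy integer value, but the folding decision follows the same pattern.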

Based on that, declaration of an operation class can look as follows:

@snippet template_extension/op.hpp op:header
The provided implementation has several fields:

* `add` of type `int64_t` is an attribute of a custom operation.
* `type_info` of type `ngraph::NodeTypeInfo` defines the type and version of an operation.

### Operation Constructors

An nGraph operation has two constructors:
* Default constructor, which enables you to create an operation without attributes
* Constructor that creates and validates an operation with specified inputs and attributes

@snippet template_extension/op.cpp op:ctor
### `validate_and_infer_types()`

The `ngraph::Node::validate_and_infer_types` method validates operation attributes and calculates output shapes using the attributes of the operation.

@snippet template_extension/op.cpp op:validate
### `clone_with_new_inputs()`

The `ngraph::Node::clone_with_new_inputs` method creates a copy of the nGraph operation with new inputs.

@snippet template_extension/op.cpp op:copy
### `visit_attributes()`

The `ngraph::Node::visit_attributes` method enables you to visit all operation attributes.

@snippet template_extension/op.cpp op:visit_attributes
### `evaluate()`

The `ngraph::Node::evaluate` method enables you to apply constant folding to an operation.

@snippet template_extension/op.cpp op:evaluate

To add custom operations to the [Extension](Extension.md) class, create an operation set with custom operations and implement the InferenceEngine::IExtension::getOpSets method.

This method returns a map of opsets that exist in the extension library.

nGraph provides an opset mechanism to group operations into clusters. Different opsets distinguish between different versions of one operation.

When specifying opset names, follow the rules below:
* Use unique opset names.
# How to Implement Custom CPU Operations {#openvino_docs_IE_DG_Extensibility_DG_CPU_Kernel}

The primary means of CPU performance in the Inference Engine is the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), and new CPU kernels extend the Inference Engine plugin for Intel MKL-DNN. Implementing InferenceEngine::ILayerExecImpl defines a general CPU-side extension. There are no Intel MKL-DNN specifics in the way you need to implement a kernel.

## Implementation Class
The provided implementation has several fields:

* `add` of the type `int64_t` is an attribute of a custom operation.
* `inShape` of the type `ngraph::Shape` is an input shape.
* `outShape` of the type `ngraph::Shape` is an output shape.
* `error` of the type `std::string` is a field to handle errors from a constructor.

### Constructor of Implementation

An implementation constructor checks parameters of an nGraph operation, stores required attributes, and stores an error message in the case of an error.

@snippet template_extension/cpu_kernel.cpp cpu_implementation:ctor
### `getSupportedConfigurations`

The InferenceEngine::ILayerExecImpl::getSupportedConfigurations method returns all supported configuration formats (input/output tensor layouts) for your implementation. To specify formats of data, use InferenceEngine::TensorDesc. Refer to the [Memory Primitives](../Memory_primitives.md) section for instructions.

@snippet template_extension/cpu_kernel.cpp cpu_implementation:getSupportedConfigurations
# Custom ONNX* Operators {#openvino_docs_IE_DG_Extensibility_DG_Custom_ONNX_Ops}

The ONNX\* importer provides a mechanism to register custom ONNX operators based on predefined or custom nGraph operations.
The function responsible for registering a new operator is called `ngraph::onnx_import::register_operator` and is defined in `onnx_import/onnx_utils.hpp`.

## Register Custom ONNX Operator Based on Predefined nGraph Operations

The steps below explain how to register a custom ONNX operator, for example, CustomRelu, in a domain called `com.example`.
CustomRelu is defined as follows:
```
x >= 0 => f(x) = x * alpha
x < 0 => f(x) = x * beta
```
where `alpha` and `beta` are float constants.
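
As a plain C++ reference for the semantics above (this standalone function is ours, independent of nGraph and the ONNX importer):

```cpp
// CustomRelu semantics from the definition above:
//   x >= 0 => f(x) = x * alpha
//   x <  0 => f(x) = x * beta
float custom_relu(float x, float alpha, float beta) {
    return x >= 0.0f ? x * alpha : x * beta;
}
```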

1. Include headers:
@snippet onnx_custom_op/onnx_custom_op.cpp onnx_custom_op:headers

2. Register the `CustomRelu` operator:
@snippet onnx_custom_op/onnx_custom_op.cpp onnx_custom_op:register_operator

The `register_operator` function takes four arguments: `op_type`, opset version, domain, and a function object.
The function object is a user-defined function that takes `ngraph::onnx_import::Node` as an input and based on that, returns a graph with nGraph operations.
The `ngraph::onnx_import::Node` class represents a node in an ONNX model. It provides functions to fetch input node(s) using `get_ng_inputs`, attribute value using `get_attribute_value`, and many more. See `onnx_import/core/node.hpp` for full class declaration.

New operator registration must happen before an ONNX model is read. For example, if a model uses the `CustomRelu` operator, call `register_operator("CustomRelu", ...)` before InferenceEngine::Core::ReadNetwork.
Reregistering ONNX operators within the same process is supported. If you register an existing operator, you get a warning.

The example below demonstrates a model that requires the previously created `CustomRelu` operator:
@snippet onnx_custom_op/onnx_custom_op.cpp onnx_custom_op:model

To create a graph with nGraph operations, visit [Custom nGraph Operations](AddingNGraphOps.md).
For a complete list of predefined nGraph operators, visit [Available Operations Sets](../../ops/opset.md).

If you do not need an operator anymore, unregister it by calling `unregister_operator`. The function takes three arguments: `op_type`, `version`, and `domain`.
@snippet onnx_custom_op/onnx_custom_op.cpp onnx_custom_op:unregister_operator
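
The register/unregister lifecycle described above can be modeled with a toy standalone registry. The struct and method names below mirror the importer's arguments but are our own sketch, not the real `ngraph::onnx_import` API; in particular, `register_op` returns a flag where the real importer prints a warning on re-registration.

```cpp
#include <map>
#include <string>
#include <tuple>

// Key mirrors register_operator's identifying arguments: (op_type, version, domain).
using OpKey = std::tuple<std::string, int, std::string>;

struct OperatorRegistry {
    std::map<OpKey, std::string> ops;  // value stands in for the function object

    // Returns true if an existing operator was replaced; re-registration is
    // allowed (the real importer warns but still re-registers).
    bool register_op(const std::string& type, int version,
                     const std::string& domain, const std::string& fn) {
        OpKey key{type, version, domain};
        bool replaced = ops.count(key) > 0;
        ops[key] = fn;
        return replaced;
    }

    void unregister_op(const std::string& type, int version,
                       const std::string& domain) {
        ops.erase(OpKey{type, version, domain});
    }

    bool is_registered(const std::string& type, int version,
                       const std::string& domain) const {
        return ops.count(OpKey{type, version, domain}) > 0;
    }
};
```

The key point the sketch captures: lookups during model reading only succeed for operators registered earlier under the same type, version, and domain, and unregistering removes exactly that entry.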

## Register Custom ONNX Operator Based on Custom nGraph Operations

The same principles apply when registering a custom ONNX operator based on custom nGraph operations.
This example shows how to register a custom ONNX operator based on `Operation` presented in [this tutorial](AddingNGraphOps.md), which is used in [TemplateExtension](Extension.md).
@snippet template_extension/extension.cpp extension:ctor

Here, the `register_operator` function is called in the constructor of Extension. The constructor makes sure that the function is called before InferenceEngine::Core::ReadNetwork, because InferenceEngine::Core::AddExtension must be called before a model with a custom operator is read.

The example below demonstrates how to unregister an operator from the destructor of Extension:
@snippet template_extension/extension.cpp extension:dtor

> **NOTE**: It is mandatory to unregister a custom ONNX operator if it is defined in a dynamic shared library.

## Requirements for Building with CMake

A program that uses the `register_operator` functionality requires `ngraph` and `onnx_importer` libraries in addition to the Inference Engine.
The `onnx_importer` is a component of the `ngraph` package, so `find_package(ngraph REQUIRED COMPONENTS onnx_importer)` is sufficient to find both.
The `ngraph` package exposes two variables, `${NGRAPH_LIBRARIES}` and `${ONNX_IMPORTER_LIBRARIES}`, which reference the `ngraph` and `onnx_importer` libraries.
Those variables need to be passed to the `target_link_libraries` command in the CMakeLists.txt file.

See CMakeLists.txt below for reference:
@snippet onnx_custom_op/CMakeLists.txt cmake:onnx_custom_op

# Extension Library {#openvino_docs_IE_DG_Extensibility_DG_Extension}

Inference Engine provides an InferenceEngine::IExtension interface, which defines the interface for Inference Engine Extension libraries.
Inherit all extension libraries from this interface. The example below contains an implementation of two operations: `Template`
used as an example in this document and `FFT` used as a more complex example from the [Custom Operations Guide](../../HOWTO/Custom_Layers_Guide.md).

> **NOTE**: `FFT` operation is implemented using the OpenCV library functions `cv::dft` and `cv::idft`.

Based on that, the declaration of an extension class can look as follows:

@snippet template_extension/extension.hpp extension:header

The extension library should contain and export the InferenceEngine::CreateExtension method, which creates an `Extension` class:

@snippet template_extension/extension.cpp extension:CreateExtension

Also, an `Extension` object should implement the following methods:

* InferenceEngine::IExtension::Release deletes an extension object.

* InferenceEngine::IExtension::GetVersion returns information about the version of the library.

@snippet template_extension/extension.cpp extension:GetVersion
Implement the InferenceEngine::IExtension::getOpSets method if the extension contains custom layers.
Read [Custom nGraph Operation](AddingNGraphOps.md) for more information.

To integrate execution kernels to the extension library, read [How to Implement Custom CPU Operations](CPU_Kernel.md).
To register a custom ONNX\* operator to the extension library, read [Custom ONNX Operators](Custom_ONNX_Ops.md).
# How to Implement Custom GPU Operations {#openvino_docs_IE_DG_Extensibility_DG_GPU_Kernel}

The GPU codepath abstracts many details about OpenCL\*. You need to provide the kernel code in OpenCL C and the configuration file that connects the kernel and its parameters to the parameters of the operation.

There are two options for using the custom operation configuration file:

* Include a section with your kernels into the global automatically-loaded `cldnn_global_custom_kernels/cldnn_global_custom_kernels.xml` file, which is hosted in the `<INSTALL_DIR>/deployment_tools/inference_engine/bin/intel64/{Debug/Release}` folder
* Call the `InferenceEngine::Core::SetConfig()` method from your application with the `InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE` key and the configuration file name as a value before loading the network that uses custom operations to the plugin:

@snippet snippets/GPU_Kernel.cpp part0

All Inference Engine samples, except the trivial `hello_classification`,
feature a dedicated command-line option `-c` to load custom kernels. For example, to load custom operations for the classification sample, run the command below:
```sh
$ ./classification_sample -m <path_to_model>/bvlc_alexnet_fp16.xml -i ./validation_set/daily/227x227/apron.bmp -d GPU
```

The definitions described in the sections below use the following notations:

Notation | Description
---|---
(0/1) | Can have zero or one instance of this node or attribute
(1) | Must have only one instance of this node or attribute
(0+) | Can have any number of instances of this node or attribute
(1+) | Can have one or more instances of this node or attribute

### CustomLayer Node and Sub-Node Structure

`CustomLayer` node contains the entire configuration for a single custom operation.

**Sub-nodes**: `Kernel` (1), `Buffers` (1), `CompilerOptions` (0+),
`WorkSizes` (0/1)

### Kernel Node and Sub-Node Structure

`Kernel` node contains all kernel source code configuration. No kernel
node structure exists.
**Sub-nodes**: `Source` (1+), `Define` (0+)

### Source Node and Sub-Node Structure

`Source` node points to a single OpenCL source file.

| Attribute Name | \# | Description |
|-----|-----|-----|
| `filename` | (1) | Name of the file containing OpenCL source code. Note that the path is relative to your executable. Multiple source nodes will have their sources concatenated in order. |

**Sub-nodes**: None

### Define Node and Sub-Node Structure

`Define` node configures a single `#‍define` instruction to be added to
the sources during compilation (JIT).

| Attribute Name | \# | Description |
|------|-------|------|
| `name` | (1) | The name of the defined JIT. For static constants, this can include the value as well, which is taken as a string. |
| `param` | (0/1) | This parameter value is used as the value of this JIT definition. |
| `type` | (0/1) | The parameter type. Accepted values: `int`, `float`, and `int[]`, `float[]` for arrays. |
| `default` | (0/1) | The default value to be used if the specified parameter is missing from the operation in the IR. |

**Sub-nodes:** None

The resulting JIT has the following form:
`#‍define [name] [type] [value/default]`.

### Buffers Node and Sub-Node Structure

`Buffers` node configures all input/output buffers for the OpenCL entry
function. No buffers node structure exists.
**Sub-nodes:** `Data` (0+), `Tensor` (1+)

### Data Node and Sub-Node Structure

`Data` node configures a single input with static data, for example,
weights or biases.

| Attribute Name | \# | Description |
|----|-----|------|
| `name` | (1) | Name of a blob attached to an operation in the IR |
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to |
**Sub-nodes**: None

### Tensor Node and Sub-Node Structure

`Tensor` node configures a single input or output tensor.

| Attribute Name | \# | Description |
|-----|-----|-----|
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to. |
| `type` | (1) | `input` or `output` |
| `port-index` | (1) | 0-based index in the operation input/output ports in the IR |
| `format` | (0/1) | Data layout declaration for the tensor. Accepted values: `BFYX`, `BYXF`, `YXFB`, `FYXB`, and same values in all lowercase. Default value: `BFYX` |

### CompilerOptions Node and Sub-Node Structure

`CompilerOptions` node configures the compilation flags for the OpenCL
sources.
**Sub-nodes**: None

### WorkSizes Node and Sub-Node Structure

`WorkSizes` node configures the global/local work sizes to be used when
queuing an OpenCL program for execution.

| Attribute Name | \# | Description |
|-----|------|-----|
| `global`<br>`local` | (0/1)<br>(0/1) | An array of up to three integers or formulas for defining OpenCL work-sizes to be used during execution.<br> The formulas can use the values of the B,F,Y,X dimensions and contain the operators: +,-,/,\*,%. All operators are evaluated in integer arithmetic. <br>Default value: `global=”B*F*Y*X” local=””` |
| `dim` | (0/1) | A tensor to take the work-size from. Accepted values: `input N`, `output`, where `N` is an index of input tensor starting with 0. Default value: `output` |
**Sub-nodes**: None
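
The default `global=”B*F*Y*X”` formula above simply multiplies the output tensor dimensions, yielding one OpenCL work-item per output element. A minimal standalone sketch (the function name and the BFYX array convention are ours, not part of the Inference Engine API):

```cpp
#include <array>
#include <cstddef>

// Dimensions of a tensor in BFYX order: {batch, feature, Y, X}.
using BfyxDims = std::array<std::size_t, 4>;

// Default global work size, equivalent to the formula "B*F*Y*X":
// one OpenCL work-item per output element.
std::size_t default_global_work_size(const BfyxDims& d) {
    return d[0] * d[1] * d[2] * d[3];
}
```

For example, an output of shape B=1, F=16, Y=8, X=8 yields 1024 work-items.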
## Example Configuration File

The following code sample provides an example configuration file in the
`.xml` format. For information on the configuration file structure, see
[Configuration File Format](#config-file-format).
```xml
<CustomLayer name="ReLU" type="SimpleGPU" version="1">
</CustomLayer>
```
## Built-In Definitions for Custom Layers

The following table includes definitions that are attached before
the user sources, where `<TENSOR>` is the actual input and output, for
example, `INPUT0` or `OUTPUT0`.
For an example, see [Example Kernel](#example-kernel).

| Name | Description |
|---|---|
| `<TENSOR>_UPPER_PADDING_SIZE` | The size of the `<TENSOR>_UPPER_PADDING` array |
| `<TENSOR>_PITCHES` | The number of elements between adjacent elements in each dimension. Always ordered as BFYX.|
| `<TENSOR>_PITCHES_SIZE`| The size of the `<TENSOR>_PITCHES` array |
| `<TENSOR>_OFFSET`| The number of elements from the start of the tensor to the first valid element, bypassing the lower padding. |
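
The pitch and offset definitions above can be made concrete with a standalone sketch for a dense BFYX layout (the function names and the dense-stride assumption are ours; real tensors may have padded strides):

```cpp
#include <array>
#include <cstddef>

using BfyxDims = std::array<std::size_t, 4>;  // {B, F, Y, X}

// Pitches in BFYX order for a dense layout: the number of elements between
// adjacent elements in each dimension (mirrors <TENSOR>_PITCHES).
BfyxDims bfyx_pitches(const BfyxDims& dims) {
    BfyxDims p{};
    p[3] = 1;                            // X: adjacent elements are contiguous
    p[2] = dims[3];                      // Y: one row of X elements
    p[1] = dims[2] * dims[3];            // F: one Y*X plane
    p[0] = dims[1] * dims[2] * dims[3];  // B: one F*Y*X volume
    return p;
}

// Offset: number of elements from the start of the tensor to the first valid
// element, bypassing the lower padding (mirrors <TENSOR>_OFFSET).
std::size_t bfyx_offset(const BfyxDims& pitches, const BfyxDims& lower_padding) {
    std::size_t off = 0;
    for (std::size_t d = 0; d < 4; ++d)
        off += lower_padding[d] * pitches[d];
    return off;
}
```

For dims {1, 3, 2, 4} the pitches are {24, 8, 4, 1}; with a lower padding of one element in Y and X the offset is 5.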

All `<TENSOR>` values are automatically defined for every tensor
bound to this operation, such as `INPUT0`, `INPUT1`, and `OUTPUT0`, as shown
in the following example:

```sh
#define INPUT0_DIMS_SIZE 4
__kernel void example_relu_kernel(
}
```

> **NOTE:** As described in the previous section, all things like
> `INPUT0_TYPE` are actually defined as OpenCL (pre-)compiler inputs by
> the Inference Engine for efficiency reasons. See [Debugging
> Tips](#debugging-tips) for information on debugging the results.
# Inference Engine Extensibility Mechanism {#openvino_docs_IE_DG_Extensibility_DG_Intro}

Inference Engine Extensibility API enables you to add support of custom operations to the Inference Engine.
An extension should contain operation sets with custom operations and execution kernels for custom operations.
Physically, an extension library can be represented as a dynamic library exporting the single `CreateExtension` function
that creates a new extension instance.
|
||||
|
||||
Extensibility library can be loaded to the `InferenceEngine::Core` object using the
|
||||
To load the Extensibility library to the `InferenceEngine::Core` object, use the
|
||||
`InferenceEngine::Core::AddExtension` method.
|
||||
|
||||
## Inference Engine Extension Library

Inference Engine Extension dynamic library contains the following components:

* [Extension Library](Extension.md):
    - Contains custom operation sets.
    - Provides CPU implementations for custom operations.
* [Custom nGraph Operation](AddingNGraphOps.md):
    - Enables the use of `InferenceEngine::Core::ReadNetwork` to read Intermediate Representation (IR) with unsupported operations.
    - Enables the creation of `ngraph::Function` with unsupported operations.
    - Provides a shape inference mechanism for custom operations.

> **NOTE**: This documentation is written based on the `Template extension`, which demonstrates extension
development details. Find the complete code of the `Template extension`, which is fully compilable and up-to-date,
The following pages describe how to integrate custom _kernels_ into the Inference Engine:

* [Introduction to development of custom GPU kernels](GPU_Kernel.md)
* [Introduction to development of custom VPU kernels](VPU_Kernel.md)

## See Also

* [Build an extension library using CMake*](Building.md)
* [Using Inference Engine Samples](../Samples_Overview.md)
* [Hello Shape Infer SSD sample](../../../inference-engine/samples/hello_reshape_ssd/README.md)

# How to Implement Custom Layers for VPU (Intel® Neural Compute Stick 2) {#openvino_docs_IE_DG_Extensibility_DG_VPU_Kernel}

> **NOTES:**
> * OpenCL\* custom layer support is available in the preview mode.
> * This section assumes you are familiar with developing kernels using OpenCL.

To customize your topology with an OpenCL layer, follow the steps below:

1. Write and compile your OpenCL code with the standalone offline OpenCL compiler (`clc`).
2. Write a configuration file to bind the OpenCL kernel to the topology file (`.xml`) of the model IR.
3. Pass the configuration file to the Inference Engine with the model IR.

## Compile OpenCL code for VPU (Intel® Neural Compute Stick 2)

> **NOTE:** The OpenCL compiler, targeting Intel® Neural Compute Stick 2 for the SHAVE* processor only, is redistributed with OpenVINO.
OpenCL support is provided by ComputeAorta*, and is distributed under a license agreement between Intel® and Codeplay* Software Ltd.

The OpenCL toolchain for the Intel® Neural Compute Stick 2 supports offline compilation only, so first compile OpenCL C code using the standalone `clc` compiler. You can find the compiler binary at `<INSTALL_DIR>/deployment_tools/tools/cl_compiler`.

> **NOTE:** By design, custom OpenCL layers support any OpenCL kernels written with the assumption of OpenCL version 1.2. They also support the half float
extension and are optimized for this type, because it is a native type for Intel® Movidius™ VPUs.

1. Prior to running a compilation, make sure that the following variables are set:
   * `SHAVE_MA2X8XLIBS_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/lib/`
Each custom layer is described with the `CustomLayer` node. It has the following nodes and attributes:
- Root node `CustomLayer` contains the following attributes:
  - `name` – (Required) The name of the Inference Engine layer to bind the kernel with.
  - `type` and `version` – (Required) Reserved for future use. Set them to `MVCL` and `1` respectively.
  - `max-shaves` – (Optional) The maximum number of SHAVE cores that should be dedicated for the layer. It is useful for debugging concurrency issues or for saving resources if a memory-bound kernel does not scale well with the number of cores, so more resources can be left for the rest of a topology.
- Sub-node `Kernel` must contain the following attributes:
  - `entry` – The name of your kernel function as you defined it in a source file. In the example above, it is `reorg_nhwc`.
- Node `Source` must contain the following attributes:
  - `filename` – The path to a compiled binary relative to the `.xml` binding file.
- Sub-node `Parameters` – Describes parameter bindings. For more information, see the description below.
- Sub-node `WorkSizes` – Describes local and global work group sizes and the source for dimension deduction as a pair `direction,port`. In the example above, the work group is described relative to the dimension of the input tensor that comes through port 0 in the IR. `global` and `local` work group configurations support any simple math expressions with +, -, \*, /, and () over `B` (batch), `Y` (height), `X` (width), and `F` (channels).
- Sub-node `Where` – Enables customization of bindings with the `key="value"` attribute. For example, to substitute only 3x3 convolutions, write `<Where kernel="3,3"/>` in the binding xml.

Parameter description supports `Tensor` nodes of one of the tensor types `input`, `output`, `input_buffer`, `output_buffer`, or `data`, as well as `Scalar` and `Data` nodes, and has the following format:
- Each `Tensor` node of `input` or `output` type must contain the following attributes:
  - `arg-name` – The name of a kernel parameter in the kernel signature.
  - `type` – Node type: `input` or `output` as in the IR.
  - `port-index` – The number of the input/output port as in the IR.
  - `format` – The channel order in the tensor. Optional conversion layers are generated if the custom layer format is not compatible with formats of neighboring layers. `BFXY`, `BYXF`, and `ANY` formats are supported currently.
- Each `Tensor` node of `input_buffer` or `output_buffer` type must contain the following attributes:
  - `arg-name` – The name of a kernel parameter in the kernel signature.
  - `type` – Node type: `input_buffer` or `output_buffer`. Use the appropriate type to bind multiple kernels that correspond to different stages of the same layer.
  - `port-index` – The unique identifier to bind by.
  - `dim` – The dim source with the same `direction,port` format used for `WorkSizes` bindings.
- Each `Tensor` node that has the type `data` must contain the following attributes:
  - `source` – A name of the blob as it is in the IR. A typical example is `weights` for convolution.
  - `format` – Specifies the channel order in the tensor. Optional conversion layers are generated if the custom layer format is not compatible with formats of neighboring layers.

```xml
<CustomLayer name="BinaryConvolution" type="MVCL" version="1">
    <!-- ... binding details elided ... -->
</CustomLayer>
```
- Each `Scalar` node must contain the following attributes:
  - `arg-name` – The name of a kernel parameter in the kernel signature.
  - `type` – `int` or `float` value. It is used for correct argument extraction from IR parameters.
  - `source` – Contains the name of the parameter in the IR file or input/output (`I`/`O`, `In`/`On`, where `n` is a port number) followed by dimension `B` (batch), `Y` (height), `X` (width), or `F` (channels).

- Each `Data` node must contain the following attributes:
  - `arg-name` – The name of a kernel parameter in the kernel signature.
  - `type` – Node type. Currently, `local_data` is the only supported value, which defines a buffer allocated in fast local on-chip memory. It is limited to 100KB for all `__local` and `__private` arrays defined inside the kernel as well as all `__local` parameters passed to the kernel. Note that a manual-DMA extension requires double buffering. If the custom layer is detected to run out of local memory, the inference fails.
  - `dim` – The dim source with the same `direction,port` format used for `WorkSizes` bindings.
  - `size` – The amount of bytes needed. The current expression syntax supports only expressions over the dimensions of the selected input/output tensor, or constants, and may be extended in the future.
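Putting the nodes described above together, here is a sketch of a complete binding description for the `reorg_nhwc` kernel mentioned earlier. It is illustrative only: the layer name, binary file name, scalar parameter, and work sizes are assumptions, not taken from a real model; only the node and attribute names follow the description above.

```xml
<!-- Hypothetical binding: names and sizes are for illustration only -->
<CustomLayer name="ReorgYolo" type="MVCL" version="1">
    <Kernel entry="reorg_nhwc">
        <Source filename="reorg.bin"/>
    </Kernel>
    <Parameters>
        <Tensor arg-name="src"    type="input"  port-index="0" format="BYXF"/>
        <Tensor arg-name="out"    type="output" port-index="0" format="BYXF"/>
        <Scalar arg-name="stride" type="int"    source="stride"/>
    </Parameters>
    <WorkSizes dim="input,0" global="Y,1,1" local="1,1,1"/>
</CustomLayer>
```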
```cpp
networkConfig["VPU_CUSTOM_LAYERS"] = "/path/to/your/customLayers.xml";
auto exeNetwork = core.LoadNetwork(cnnNetwork, "MYRIAD", networkConfig);
```

## Optimizing Kernels with OpenCL for VPU (Intel® Neural Compute Stick 2)

This section provides optimization guidelines on writing custom layers with OpenCL for VPU devices. Familiarity with the general OpenCL programming model and the OpenCL kernel language is assumed and is not covered in this section. The OpenCL model mapping to VPU is described in the table below.
1. Split work evenly across work groups.
2. Adjust work group granularity to maintain an equal workload for all compute cores.
3. Set the maximum number of cores using the `max-shaves` attribute for the `CustomLayer` node. This keeps more resources for the rest of the topology. It is also useful if the kernel scalability has reached its limits, which may happen while optimizing memory-bound kernels or kernels with poor parallelization.
4. Try an alternate data layout (`BFXY`/`BYXF`) for the kernel if it improves work group partitioning or data access patterns. Consider not just a specific layer boost, but full topology performance, because data conversion layers would be automatically inserted as appropriate.

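Tips 1 and 2 above amount to simple integer arithmetic when planning the grid. The helper below is our own sketch (not an Inference Engine or OpenCL API) of splitting `total` units of work across `cores` compute cores so that no core receives more than one extra unit:

```cpp
#include <cassert>

// Split `total` units of work across `cores` cores as evenly as possible:
// every core gets total / cores units, and the first total % cores cores
// each get one extra unit.
int rows_for_core(int total, int cores, int core) {
    int base  = total / cores;
    int extra = total % cores;
    return base + (core < extra ? 1 : 0);
}
```

For example, 26 rows on 8 SHAVE cores gives the first two cores 4 rows and the remaining six cores 3 rows, so the worst-case imbalance is a single row.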
The offline OpenCL compiler (`clc`) features automatic vectorization over `get_global_id(0)` usage if uniform access is detected.
WGV works if and only if vector types are not used in the code.

Here is a short list of optimization tips:

1. Help the auto-vectorizer ensure non-aliasing pointers for kernel parameters by putting `restrict` where possible.
   - This can give a performance boost, especially for kernels with unrolling, like `ocl_grn` from the example below.
   - Place `restrict` markers for kernels with manually vectorized codes. In the `ocl_grn` kernel below, the unrolled version without `restrict` is up to 20% slower than the most optimal one, which combines unrolling and `restrict`.
2. Put `#‍pragma unroll N` to your loop header. The compiler does not trigger unrolling by default, so it is your responsibility to annotate the code with pragmas as appropriate. The `ocl_grn` version with `#‍pragma unroll 4` is up to 50% faster, most of which comes from unrolling the first loop, because LLVM, in general, is better at scheduling 3-stage loops (load-compute-store), while the first loop `variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]);` is only 2-stage (load-compute). Pay attention to unrolling such cases first. The unrolling factor is loop-dependent. Choose the smallest number that still improves performance as an optimum between the kernel size and execution speed. For this specific kernel, changing the unroll factor from `4` to `6` results in the same performance, so an unrolling factor of 4 is optimal. For Intel® Neural Compute Stick 2, unrolling is conjugated with the automatic software pipelining for load, store, and compute stages:
```cpp
// ... the unrolled and manually vectorized versions of the `ocl_grn_line` kernel are elided here ...
```
Both versions perform the same, but the second one has more complex code.

3. If it is easy to predict the work group size, you can also use the `reqd_work_group_size` kernel attribute to ask the compiler to unroll the code up to the local size of the work group. Note that if the kernel is actually executed with a different work group configuration, the result is undefined.

4. Prefer to use the `half` compute if it keeps reasonable accuracy. 16-bit float is a native type for Intel® Neural Compute Stick 2, and most of the `half_*` functions are mapped to a single hardware instruction. Use the standard `native_*` functions for the rest of the types.

5. Prefer to use the `convert_half` function over `vstore_half` if conversion to 32-bit float is required. `convert_half` is mapped to a single hardware instruction. For the `cvtf32f16` kernel above, the line `outImage[idx] = convert_half(inImage[idx]*scale+bais);` is eight times slower than the code with `vstore_half`.

6. Mind early exits. An early exit can be extremely costly for the current version of the `clc` compiler due to conflicts with the auto-vectorizer. The generic advice is to set up the local size in the `x` dimension equal to the input and/or output width. If it is impossible to define a work grid that exactly matches inputs and/or outputs to eliminate checks such as `if (get_global_id(0) >= width) return`, use a line-wise kernel variant with manual vectorization.
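When the grid cannot match the width exactly, the usual alternative (applied to the `reorg` example later in this section) is to round the global size up to a multiple of the vector width and guard the tail work items with a bounds check. A hedged host-side sketch of that arithmetic, with our own helper name:

```cpp
#include <cassert>

// Round `value` up to the nearest multiple of `multiple` (for example, the
// vector width, which is 8 for the half data type), so the auto-vectorized
// kernel can be selected. Work items past `value` must then early-return in
// the kernel after a check such as: if (get_global_id(0) >= width) return;
int round_up(int value, int multiple) {
    return (value + multiple - 1) / multiple * multiple;
}
```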
The kernel example below demonstrates the impact of early exits on kernel performance:

```cpp
// ... earlier lines of the kernel elided ...
    out[W*H*c + W*h + w] = src[W2*H2*c2 + W2*h2 + w2];
}
```
This `reorg` kernel is auto-vectorizable, but an input for the YOLO v2 topology is `NCHW=<1,64,26,26>`, which is not a multiple of the vector width (`8` for the `half` data type). As a result, the Inference Engine does not select the auto-vectorized kernel.
To compare performance of the auto-vectorized and scalar versions of the kernel, change the input size to `NCHW=<1,64,26,32>`. This enables the auto-vectorized version to be selected by the Inference Engine and can give you about 30% uplift.
Since the auto-vectorized version is faster, it makes sense to enable it for the YOLO v2 topology input size by setting the local size to a multiple of the vector width, for example, 32, and adjusting the global sizes accordingly. As a result, the execution work grid exceeds the actual input dimension, so out-of-bound checks should be inserted. See the updated kernel version below:
```cpp
// Version with out-of-bound checks added
__kernel void reorg(const __global half* restrict src, __global half* restrict out, int W, int stride)
{
    /* ... kernel body elided ... */
}
```
This kernel can be rewritten to introduce the special data binding `__dma_preload` and `__dma_postwrite` intrinsics. This means that instead of one kernel, a group of three kernels should be implemented: `kernelName`, `__dma_preload_kernelName`, and `__dma_postwrite_kernelName`. `__dma_preload_kernelName` for a particular work group `n` is guaranteed to be executed before the `n`-th work group itself, while `__dma_postwrite_kernelName` is guaranteed to be executed after a corresponding work group. You can define one of those functions, which are intended to be used to copy data from/to `__global` and `__local` memory. The syntax requires an exact function signature match. The example below illustrates how to prepare your kernel for manual-DMA.
```cpp
__kernel void __dma_preload_grn_NCHW(
    __global const half* restrict src,
    /* ... remaining parameters elided ... */ )
{
    /* ... body elided ... */
}
```

Note the `get_local_size` and `get_local_id` usage inside the kernel. A 21x speedup is expected for a kernel on the enet-curbs setup because it was completely limited by memory usage.

An alternative method of using DMA is the work item copy extension. These functions are executed inside a kernel and require work groups equal to a single work item.