Files
openvino/docs/Extensibility_UG/GPU_Extensibility.md

240 lines
11 KiB
Markdown
Raw Normal View History

2022-03-10 15:01:52 +03:00
# How to Implement Custom GPU Operations {#openvino_docs_Extensibility_UG_GPU}
To enable operations not supported by OpenVINO out of the box, you may need an extension for an OpenVINO operation set, and a custom kernel for the device you will target. This page describes custom kernel support for the GPU device.
2022-03-10 15:01:52 +03:00
The GPU codepath abstracts many details about OpenCL. You need to provide the kernel code in OpenCL C and an XML configuration file that connects the kernel and its parameters to the parameters of the operation.
2022-03-10 15:01:52 +03:00
There are two options for using the custom operation configuration file:
* Include a section with your kernels into the automatically-loaded `<lib_path>/cldnn_global_custom_kernels/cldnn_global_custom_kernels.xml` file.
* Call the `ov::Core::set_property()` method from your application with the `"CONFIG_FILE"` key and the configuration file name as a value before loading the network that uses custom operations to the plugin:
@sphinxtabset
@sphinxtab{C++}
@snippet docs/snippets/gpu/custom_kernels_api.cpp part0
@endsphinxtab
@sphinxtab{Python}
@snippet docs/snippets/gpu/custom_kernels_api.py part0
@endsphinxtab
@endsphinxtabset
2022-03-10 15:01:52 +03:00
All OpenVINO samples, except the trivial `hello_classification`, and most Open Model Zoo demos
feature a dedicated command-line option `-c` to load custom kernels. For example, to load custom operations for the classification sample, run the command below:
```sh
$ ./classification_sample -m <path_to_model>/bvlc_alexnet_fp16.xml -i ./validation_set/daily/227x227/apron.bmp -d GPU
-c <absolute_path_to_config>/custom_layer_example.xml
```
## Configuration File Format <a name="config-file-format"></a>
The configuration file is expected to follow the `.xml` file structure
with a node of the `CustomLayer` type for every custom operation you provide.
2022-03-10 15:01:52 +03:00
The definitions described in the sections below use the following notations:
Notation | Description
---|---
(0/1) | Can have zero or one instance of this node or attribute
(1) | Must have only one instance of this node or attribute
(0+) | Can have any number of instances of this node or attribute
(1+) | Can have one or more instances of this node or attribute
### CustomLayer Node and Sub-Node Structure
`CustomLayer` node contains the entire configuration for a single custom operation.
| Attribute Name |\# | Description |
|-----|-----|-----|
| `name` | (1) | The name of the operation type to be used. This name should be identical to the type used in the IR.|
| `type` | (1) | Must be `SimpleGPU`. |
| `version` | (1) | Must be `1`. |
**Sub-nodes**: `Kernel` (1), `Buffers` (1), `CompilerOptions` (0+),
`WorkSizes` (0/1)
### Kernel Node and Sub-Node Structure
`Kernel` node contains all kernel source code configuration.
**Sub-nodes**: `Source` (1+), `Define` (0+)
### Source Node and Sub-Node Structure
`Source` node points to a single OpenCL source file.
| Attribute Name | \# |Description|
|-----|-----|-----|
| `filename` | (1) | Name of the file containing OpenCL source code. Note that the path is relative to your executable. Multiple source nodes will have their sources concatenated in order. |
**Sub-nodes**: None
### Define Node and Sub-Node Structure
`Define` node configures a single `#&zwj;define` instruction to be added to
the sources during compilation (JIT).
| Attribute Name | \# | Description |
|------|-------|------|
| `name` | (1) | The name of the defined JIT. For static constants, this can include the value as well, which is taken as a string. |
| `param` | (0/1) | This parameter value is used as the value of this JIT definition. |
| `type` | (0/1) | The parameter type. Accepted values: `int`, `float`, and `int[]`, `float[]` for arrays. |
| `default` | (0/1) | The default value to be used if the specified parameters are missing from the operation in the IR. |
**Sub-nodes:** None
The resulting JIT has the following form:
`#&zwj;define [name] [type] [value/default]`.
### Buffers Node and Sub-Node Structure
`Buffers` node configures all input/output buffers for the OpenCL entry
function. No buffers node structure exists.
**Sub-nodes:** `Data` (0+), `Tensor` (1+)
### Data Node and Sub-Node Structure
`Data` node configures a single input with static data, for example,
weights or biases.
| Attribute Name | \# | Description |
|----|-----|------|
| `name` | (1) | Name of a blob attached to an operation in the IR |
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to |
**Sub-nodes**: None
### Tensor Node and Sub-Node Structure
`Tensor` node configures a single input or output tensor.
| Attribute Name | \# | Description |
|------|-------|-------|
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to. |
| `type` | (1) | `input` or `output` |
| `port-index` | (1) | 0-based index in the operation input/output ports in the IR |
| `format` | (0/1) | Data layout declaration for the tensor. Accepted values: `BFYX`, `BYXF`, `YXFB`, `FYXB`, and same values in all lowercase. Default value: `BFYX` |
### CompilerOptions Node and Sub-Node Structure
`CompilerOptions` node configures the compilation flags for the OpenCL
sources.
| Attribute Name | \# | Description |
|--------|-----|------|
| `options` | (1) | Options string to be passed to the OpenCL compiler |
**Sub-nodes**: None
### WorkSizes Node and Sub-Node Structure
`WorkSizes` node configures the global/local work sizes to be used when
queuing an OpenCL program for execution.
| Attribute Name | \# | Description |
|-----|------|-----|
| `global`<br>`local` | (0/1)<br>(0/1) | An array of up to three integers or formulas for defining OpenCL work-sizes to be used during execution.<br> The formulas can use the values of the B,F,Y,X dimensions and contain the operators: +,-,/,\*,%. All operators are evaluated in integer arithmetic. <br>Default value: `global=”B*F*Y*X” local=””` |
| `dim` | (0/1) | A tensor to take the work-size from. Accepted values: `input N`, `output`, where `N` is an index of input tensor starting with 0. Default value: `output` |
**Sub-nodes**: None
## Example Configuration File
The following code sample provides an example configuration file in XML
format. For information on the configuration file structure, see
[Configuration File Format](#config-file-format).
```xml
<CustomLayer name="ReLU" type="SimpleGPU" version="1">
<Kernel entry="example_relu_kernel">
<Source filename="custom_layer_kernel.cl"/>
<Define name="neg_slope" type="float" param="negative_slope" default="0.0"/>
</Kernel>
<Buffers>
<Tensor arg-index="0" type="input" port-index="0" format="BFYX"/>
<Tensor arg-index="1" type="output" port-index="0" format="BFYX"/>
</Buffers>
<CompilerOptions options="-cl-mad-enable"/>
<WorkSizes global="X,Y,B*F"/>
</CustomLayer>
```
## Built-In Definitions for Custom Layers
The following table includes definitions that are attached before
user sources.
For an example, see [Example Kernel](#example-kernel).
| Name | Value |
|---|---|
| `NUM_INPUTS` | Number of the input tensors bound to this kernel |
| `GLOBAL_WORKSIZE` | An array of global work sizes used to execute this kernel |
| `GLOBAL_WORKSIZE_SIZE` | The size of the `GLOBAL_WORKSIZE` array |
| `LOCAL_WORKSIZE` | An array of local work sizes used to execute this kernel |
| `LOCAL_WORKSIZE_SIZE` | The size of the `LOCAL_WORKSIZE` array |
| `<TENSOR>_DIMS`| An array of the tensor dimension sizes. Always ordered as `BFYX` |
| `<TENSOR>_DIMS_SIZE`| The size of the `<TENSOR>_DIMS` array.|
| `<TENSOR>_TYPE`| The datatype of the tensor: `float`, `half`, or `char`|
| `<TENSOR>_FORMAT_<TENSOR_FORMAT>` | The format of the tensor, BFYX, BYXF, YXFB , FYXB, or ANY. The format is concatenated to the defined name. You can use the tensor format to define codepaths in your code with `#&zwj;ifdef/#&zwj;endif`. |
| `<TENSOR>_LOWER_PADDING` | An array of padding elements used for the tensor dimensions before they start. Always ordered as BFYX.|
| `<TENSOR>_LOWER_PADDING_SIZE` | The size of the `<TENSOR>_LOWER_PADDING` array |
| `<TENSOR>_UPPER_PADDING` | An array of padding elements used for the tensor dimensions after they end. Always ordered as BFYX. |
| `<TENSOR>_UPPER_PADDING_SIZE` | The size of the `<TENSOR>_UPPER_PADDING` array |
| `<TENSOR>_PITCHES` | The offset (in elements) between adjacent elements in each dimension. Always ordered as BFYX.|
| `<TENSOR>_PITCHES_SIZE`| The size of the `<TENSOR>_PITCHES` array |
| `<TENSOR>_OFFSET`| The number of elements from the start of the tensor to the first valid element, bypassing the lower padding. |
All `<TENSOR>` values are automatically defined for every tensor
bound to this operation, such as `INPUT0`, `INPUT1`, and `OUTPUT0`, as shown
in the following example:
```c
#define INPUT0_DIMS_SIZE 4
#define INPUT0_DIMS (int []){ 1,96,55,55, }
```
## Example Kernel<a name="example-kernel"></a>
```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void example_relu_kernel(
const __global INPUT0_TYPE* input0,
__global OUTPUT0_TYPE* output)
{
const uint idx = get_global_id(0);
const uint idy = get_global_id(1);
const uint idbf = get_global_id(2); // batches*features, as OpenCL supports 3D nd-ranges only
const uint feature = idbf % OUTPUT0_DIMS[1];
const uint batch = idbf / OUTPUT0_DIMS[1];
//notice that pitches are in elements, not in bytes!
const uint in_id = batch*INPUT0_PITCHES[0] + feature*INPUT0_PITCHES[1] + idy*INPUT0_PITCHES[2] + idx*INPUT0_PITCHES[3] + INPUT0_OFFSET;
const uint out_id = batch*OUTPUT0_PITCHES[0] + feature*OUTPUT0_PITCHES[1] + idy*OUTPUT0_PITCHES[2] + idx*OUTPUT0_PITCHES[3] + OUTPUT0_OFFSET;
INPUT0_TYPE value = input0[in_id];
// neg_slope (which is non-zero for leaky ReLU) is put automatically as #define, refer to the config xml
output[out_id] = value < 0 ? value * neg_slope : value;
}
```
> **NOTE**: As described in the previous section, all items like
> `INPUT0_TYPE` are actually defined as OpenCL (pre-)compiler inputs by
> OpenVINO for efficiency reasons. See [Debugging
2022-03-10 15:01:52 +03:00
> Tips](#debugging-tips) for information on debugging the results.
## Debugging Tips<a name="debugging-tips"></a>
* **Using `printf` in the OpenCL™ Kernels**.
To debug the specific values, you can use `printf` in your kernels.
However, be careful not to output excessively, which
could generate too much data. The `printf` output is typical, so
your output can be truncated to fit the buffer. Also, because of
buffering, you actually get an entire buffer of output when the
execution ends.<br>
For more information, refer to the [printf
Function](https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/printfFunction.html).