[DOCS] Proofreading developer documentation moved from wiki. (#15886)

Minor stylistic and grammar corrections. Fixing links.

* Apply suggestions from code review

Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
## Key Contacts

For assistance regarding snippets, contact a member of the [openvino-ie-cpu-maintainers](https://github.com/orgs/openvinotoolkit/teams/openvino-ie-cpu-maintainers) group.

* [SnippetS design guide](./docs/snippets_design_guide.md)
* [CPU target for SnippetS code generator](./docs/snippets_cpu_target.md)
## See also

* [OpenVINO™ README](../../../README.md)
* [OpenVINO Core Components](../../README.md)
* [Developer documentation](../../../docs/dev/index.md)

# CPU Target for SnippetS Code Generator

Snippets in its first generation can be seen as a generalization over a generic eltwise node. The first generation of snippets does not have integration with oneDNN, and the patterns it supports should be kept orthogonal to what is fused with post-ops.

See the example of POC CPU implementation [here](https://github.com/openvinotoolkit/openvino/pull/2824).

The first 8 kernel parameters are passed via a structure that is unpacked inside a kernel into registers. The rest are passed through the stack.

The loop trip count should be placed in a GP register, as well as the work amount. Moreover, you need to load all the parameters into GP registers. If you assume that you have enough registers, it can be done before the loop body.

```
auto param0 = abi_params[0];
// ...
auto work_amount = abi_params[3];
```

## Memory operations

A load could be Vector, Scalar, or Broadcast. Only the native vector size for an architecture is supported (for example, 16 on AVX-512).

A memory operation also generates post increments for the pointer it uses.

- `MemoryEmitter`
- `StoreEmitter`

Tensor data can be passed with strides.

`Data` corresponds to a constant table and wraps this entity for the CPU.

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO SnippetS](../README.md)
* [OpenVINO Core Components](../../../README.md)
* [Developer documentation](../../../../docs/dev/index.md)

# SnippetS Design Guide

This document describes the design and rationale for a snippets code generator. Implementation of code functionality is located [here](https://github.com/openvinotoolkit/openvino/tree/master/src/common/snippets). A proposal for CPU backend integration is [here](https://github.com/openvinotoolkit/openvino/pull/2824).

## Rationale

Core **CNN operators (convolution, gemm, fully connected) are limited by compute, the rest is memory bound**. Math approximations (like transcendental functions) are rare in emerging workloads and could be treated with the same machinery. **Snippets are designed to optimize topology for memory**, while leaving compute-intensive kernels for backend developers.

The **potential speedup is proportional to the shrink in memory-walked bytes**. Therefore, you can transform the problem into a task of optimizing for memory walks, whatever pattern a snippet has and whatever operations it contains. The number of memory walks should be less than or equal to that of handcrafted optimizations. This guarantees performance improvements over the previous approach (excluding corner cases caused by cache effects). *The shrinkage factor might be encoded into some cost function in a future evolution of the code generator*. The snippets generator provides diagnostics to estimate this shrinkage factor with the `ngraph::snippets::op::Subgraph::print_statistics(bool verbose)` member.

The SnippetS generator is designed for back-end developers. The main purpose of inventing the snippets code generator is an **operator fusion**, **register allocation** and **target kernel generation** decomposition. This allows modifications (like new fusion support) and feature extensions (like new operation support) to be done in a single point of modification and avoids combinatorial explosion for fusions/types/architectures, etc.

Creating a full-fledged compiler or using existing compiler infrastructure (like LLVM & MLIR) is superfluous at this point of evolution. The aim is to provide a **flexible and performant framework for operation fusions**, leaving micro optimizations (for example, instruction scheduling) to the backend H/W.

There are no plans to invent a DSL for SnippetS. A DSL gives users more flexibility to express uncommon operations. However, the shift towards an approach that encodes topologies with elementary operations followed by smart enough fusions is already expressive and performant enough.

**Snippet** is a compiled compute **kernel** generated from a subgraph using the SnippetS code generator for a specific architecture with a **scheduling domain**. Using this scheduling domain and calling convention, a backend can execute generated compute kernels. For the first generation, snippets are **statically scheduled towards the output domain**. Multi-output snippets are supported if all outputs are broadcast-compatible in the sense that the domains for all outputs can be broadcasted from one root domain that defines the snippet schedule. It is a subject of extension for future generations.

nGraph is used as the highest level IR for subgraph representation and lowering transformations. **Opset1** is a base operation set for code generation. The aim is to **keep the minimal possible and sufficient operation set** (or ISA) and keep it **RISC-like** (memory and compute decomposed).

**One subgraph corresponds to one snippet**. Operations which cannot be scheduled by a single schedule should not be placed in the same subgraph. A snippet is somewhat conceptually close to an OpenCL kernel, without a restriction to express only embarrassingly parallel tasks.

A **subgraph**, once extracted from the full topology IR, is **treated as an operation and data flow descriptor in scalar notation** (similar to OpenCL/CUDA). Tensor sizes are used for defining the scheduling domain and detecting broadcasts/reductions.

Operations are split into 3 groups: **layout-oblivious (LOO), layout-aware(-tolerant), and layout-dependent(-specific)**. **Layout-oblivious** operation semantics and implementation are completely agnostic to the specific layout in which tensors are placed in memory; for example, element-wise math and ReLU fall into this category. The implementation of a **layout-aware** operation depends on the layout of input/output tensors; for example, convolutions and other block-wise kernels or layout repacks. **Layout-specific** operation semantics and implementation depend on the layout; for example, the Yolo region. Patterns to fuse are constructed in terms of the taxonomy above.

## Design

Code generation is split into 2 phases: **tokenization** and **lowering**.

### Tokenization

Tokenization runs on a full topology nGraph function inside a specific plugin in a stage of common transformations. The input of tokenization is a topology graph. The output is a modified topology graph with `ngraph::snippets::op::Subgraph` operations installed. Each subgraph contains an nGraph function (called **body**) which holds a part of the original topology that is legal for snippet generation (it can be scheduled with a single schedule).

A procedure of finding subgraphs suitable for code generation is called **tokenization**. During tokenization, the topology tree is split into subgraphs with the same greedy approach which is used for parsing an input stream of characters into tokens. It may also be seen as, and modified into, a basic block construction problem, since there is a leader and potentially terminators. See the example of implementation [here](https://github.com/openvinotoolkit/openvino/blob/master/src/common/snippets/src/pass/collapse_subgraph.cpp).

Tokenization has an advantage over the pattern matching approach (used in traditional and MLIR-based compilers) since it can handle arbitrary patterns of operations. Pattern matching deduces a specific configuration of operations to translate to another one, more suitable for the target machine or further lowering. This means that relations between operations are fixed. Tokenization, on the other hand, has the only limitation on specific operation types which are **suitable and profitable** to fuse, respecting original topology correctness (keeping it a directed acyclic graph).

The extracted body comes to a plug-in wrapped as a composite `Subgraph` operation which is seen as a black box from a plugin standpoint and can participate in any plugin-specific subroutines (for example, layout assignment, memory allocation, etc.).

### Supported subgraph patterns

A subgraph accepts an arbitrary number of inputs and outputs. There is a 1:1 mapping between external (subgraph node’s) and internal (body) parameter indexes.

A pattern here is an exact subgraph configuration (nodes and edges between them). **The first generation of snippets supports only layout-oblivious operations which may have broadcast on inputs and broadcast-compatible outputs**. For example, the shapes `<1, 42, 17, 31>`, `<1, 42, 17, 1>`, and `<1, 42, 1, 31>` are considered broadcast-compatible. A layout-oblivious operation with multiple outputs serves as a snippet leader and forms a new subgraph. The most beneficial patterns are subgraphs with complex control flow but a minimal number of inputs and outputs. For example, GeLU has a 5x shrinkage factor from the original unfused subgraph in the number of bytes walked. The subgraph below could be considered an example of such a subgraph. The leader detection procedure aims to find such subgraphs.

```mermaid
flowchart LR
class nodeA3 steel1
```

Operations are greedily added to the subgraph as long as:
1. A new operation does not introduce a loop in a topology function.
1. The number of inputs and outputs satisfies target criteria.
1. The operation is not a predecessor of topology output.
1. The resulting subgraph can be scheduled (all outputs are broadcast-compatible).

If a potential subgraph fails any of the criteria above, the procedure continues to find a new leader.

### Lowering

Lowering is a sequence of subgraph (snippet body) traversal passes to generate a …

#### Common optimizations

Constants are treated as inputs for a subgraph, with an exception for scalar cases (since they do not need to be scheduled). `snippets::op::Scalar` is used to represent this kind of constant.

If such a Scalar comes as a second input of a Power operation, it is replaced with `snippets::op::PowerStatic`.

#### Canonicalization

The goal of this step is to apply target-independent and schedule-related optimizations and to make a snippet **schedulable**.

##### Domain normalization

All input and output shapes are normalized to 6D for future schedule generation. If shape propagation fails or leads to inconsistent output shapes, an exception is raised.

The layout assigned by user code and passed to a `generate` function is propagated through a subgraph on this step as well. The layout is passed to the `generate` function as a `BlockedShapeVector`, which is a `std::vector<BlockedShape>`, while `BlockedShape` is a `std::tuple<ngraph::Shape, ngraph::AxisVector, ngraph::element::Type>`. For example, if a backend supports the `NCHW16c` layout and a tensor has a size of `<1, 42, 17, 31>` and holds single precision floating point, this structure should be `std::make_tuple(ngraph::Shape {1, 3, 17, 31, 16}, ngraph::AxisVector {0, 1, 2, 3, 1}, ngraph::element::f32);`. This allows generic layout representation.
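
A minimal sketch of building this descriptor with the types quoted above (the include paths and the `using` aliases are assumptions of this sketch):

```cpp
#include <tuple>
#include <vector>
#include <ngraph/shape.hpp>
#include <ngraph/axis_vector.hpp>
#include <ngraph/type/element_type.hpp>

using BlockedShape = std::tuple<ngraph::Shape, ngraph::AxisVector, ngraph::element::Type>;
using BlockedShapeVector = std::vector<BlockedShape>;

// NCHW16c descriptor for a <1, 42, 17, 31> f32 tensor:
// the C = 42 axis is split into ceil(42 / 16) = 3 outer blocks plus an inner
// block of 16, so both the second and the last dimensions map to axis 1.
BlockedShape nchw16c = std::make_tuple(
    ngraph::Shape{1, 3, 17, 31, 16},
    ngraph::AxisVector{0, 1, 2, 3, 1},
    ngraph::element::f32);

BlockedShapeVector layouts{nchw16c};
```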

##### Dialect conversion

The goal of this step is to transform a subgraph (body function) into a form suitable for code generation. The input for this step is a subgraph in a canonical form; the output is a subgraph in the snippets dialect.

A snippet or a kernel is formed around the subgraph body in a sequence of traversal steps. Let us walk through these steps with the smallest possible subgraph, which contains a single `[Add]` operation.

When a subgraph is extracted during tokenization, Parameters and Results are explicitly inserted into its body to form a complete nGraph Function.

```mermaid
flowchart LR
class nodeA8 steel1
class nodeA1,nodeA3 steel1
```

This function represents operation dependencies in scalar (similar to OpenCL) notation, while the shapes of tensors are used to generate schedules. At this point, kernel-schedule decomposition is made (similar to Halide/OpenCL/TVM).

###### Explicit memory operations

As a next step, explicit memory operations are placed for each input and output. The `InsertLoad` and `InsertStore` passes derive from `MatcherPass`.

```mermaid
flowchart LR
class nodeA8 carbon1
class nodeA1,nodeA3,nodeA6,nodeA7 steel1
```

By default, memory operations assume vector memory access. If scalar access is needed, the special `ReplaceLoadsWithScalarLoads` and `ReplaceStoresWithScalarStores` passes should be executed.

###### Explicit broadcast

For each operation in the body function, inputs are checked against broadcasting. When Parameters are to be broadcasted, an explicit broadcast operation is generated. For example, with `<1, 42, 17, 31>` and `<1, 42, 17, 1>` for the subgraph above, the resulting subgraph will be:

```mermaid
flowchart LR
nodeA1("Parameter\n<1, 42, 17, 1>") --> nodeA6("Load\n<1, 42, 17, 1>")
nodeA6("Load\n<1, 42, 17, 1>") --> nodeA9("BroadcastMove\n<1, 42, 17, 31>")
nodeA9("BroadcastMove\n<1, 42, 17, 31>") --> nodeA2(Add)
nodeA3("Parameter\n<1, 42, 17, 31>") --> nodeA7("Load\n<1, 42, 17, 31>")
nodeA7("Load\n<1, 42, 17, 31>") ---> nodeA2(Add)
classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626
class nodeA2 daisy1
class nodeA5 moss1
class nodeA8,nodeA9 carbon1
class nodeA1,nodeA3,nodeA6,nodeA7 steel1
```

If a Load followed by a Broadcast is detected, this pair is replaced by a single BroadcastLoad instruction:

```mermaid
flowchart LR
class nodeA8 carbon1
class nodeA1,nodeA3,nodeA6,nodeA7 steel1
```

Broadcast and regular streaming vector loads are possible from the same pointer. A BroadcastLoad should always go before a streaming load. A BroadcastLoad for a dimension other than the most varying one is not generated; however, it affects the generated schedule.

#### Target-specific optimizations

Target developers can plug some specific optimizations into the code generation pipeline.

#### Register allocation

A canonicalized subgraph in the snippets dialect forms a basic block or region inside a snippet (kernel). Registers are allocated globally for the whole subgraph. Since all operations for a subgraph are assumed to be vector, only vector registers are allocated for the first generation of SnippetS. A linear scan register allocation algorithm is used. The register allocator is implemented as the `ngraph::snippets::pass::AssignRegisters` function pass and stores allocated registers for each node into `rt_info`. `rt_info` for a node holds a register for the Node's output. *However, this part should be refactored better, either to become target independent or to use a target-specific abstraction to acquire a new register.*
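
As a reference for the linear scan algorithm mentioned above, here is a self-contained sketch over abstract live intervals; it is not the `AssignRegisters` pass itself and, like the first generation of SnippetS, it assumes enough registers are available (no spilling):

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <utility>
#include <vector>

struct Interval { int start, end, id; };

// Returns interval id -> register index. Assumes free registers never run
// out, matching the "enough registers" assumption stated earlier.
std::map<int, int> linear_scan(std::vector<Interval> intervals, int num_regs) {
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });
    std::map<int, int> assignment;
    std::set<int> free_regs;
    for (int r = 0; r < num_regs; ++r) free_regs.insert(r);
    std::set<std::pair<int, int>> active;  // (end point, register)
    for (const auto& iv : intervals) {
        // Expire intervals that ended before this one starts.
        while (!active.empty() && active.begin()->first < iv.start) {
            free_regs.insert(active.begin()->second);
            active.erase(active.begin());
        }
        int reg = *free_regs.begin();      // no-spill assumption
        free_regs.erase(free_regs.begin());
        assignment[iv.id] = reg;
        active.emplace(iv.end, reg);
    }
    return assignment;
}
```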

#### Schedule generation

The goal of this step is to transform subgraphs in a scalar notation into kernel functions callable from user code. The `Kernel` and `Tile` operations are introduced for this purpose. Each of these operations has a constructor from a code region described as a collection of operation and operand pairs: `Kernel(const std::vector<std::pair<std::shared_ptr<ngraph::snippets::Emitter>, ngraph::snippets::RegInfo>>& region);`.

The example above comes to the following hierarchical IR. If the scope is limited to layout-oblivious operations with broadcasting support, a `Tile` can be generated as a single loop over the most varying dimension. The second `Tile` is generated to handle tails and can be omitted if not needed. A special pass replaces vector memory operations with scalar versions for the tail subgraph.

```mermaid
graph LR
class nodeD1 no-stroke
```

Where:
* `Kernel` is a collection of the tiles, corresponds to a Subgraph node, and is responsible for function signature generation. It calls generators for all tiles and data sections.
* `Tile` contains a single subgraph body, either vector or scalar.
* `Data` corresponds to a data section aggregated for all nodes in all `Tile`’s subgraphs.

#### Target code emission

Target code emission is table based. A target is responsible for filling the `jitters` table field in the `Generator` class.

```
std::map<const ngraph::DiscreteTypeInfo, std::function<std::shared_ptr<Emitter>(std::shared_ptr<ngraph::Node>)>> jitters;
```

An OpenVINO plugin is treated as a target for snippets.

Each nGraph node is mapped to a converter function which creates an `Emitter` from the node. Each specific emitter should extend the `Emitter` class. It is used to map the node to target code and has the `emit_code` and `emit_data` methods. The `emit_data` method is used during data section generation. All operations from the snippets dialect which are legal for code generation should be expressed as operations derived from nGraph Op, as well as an `Emitter`-derived `snippets::Emitter` class which knows how to translate this Op to a target-specific ISA (for example, xbyak is a JIT backend for the CPU plugin).

For minimal code generator support, a target should provide emitters for the following operations:

* `Kernel`
* `Tile`
* `Store`
* `ScalarStore`

Once a schedule is generated, target code is emitted from a kernel in the `Generator::generate` method by executing the `Kernel::emit_code` function, since `Kernel` and `Tile` represent the hierarchical IR.

##### Dialect extensibility

A target can potentially extend the snippets dialect with a target-specific operation for code emission. It should implement the following (see the sketch after the list):

* an nGraph operation (for example, `class FMA : public ngraph::op::Op`)
* an Emitter for the operation (for example, `class FmaEmitter : public Emitter`)
* registration of the pair in the `jitters` map
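
A structural sketch of these three pieces, assuming a hypothetical `FMA` operation; the RTTI and constructor boilerplate is omitted, and `get_type_info_static()` is assumed to come from the omitted RTTI macro:

```cpp
// Hypothetical target-specific operation; RTTI/clone boilerplate omitted.
class FMA : public ngraph::op::Op {
public:
    // constructor, validate_and_infer_types(), clone_with_new_inputs(), ...
};

// Hypothetical emitter that translates FMA to the target ISA.
class FmaEmitter : public Emitter {
public:
    explicit FmaEmitter(const std::shared_ptr<ngraph::Node>& node);
    // emit_code() and emit_data() overrides go here
};

// Register the pair in the jitters table
// (inside the target's Generator setup, for example its constructor):
jitters[FMA::get_type_info_static()] = [](std::shared_ptr<ngraph::Node> n) {
    return std::make_shared<FmaEmitter>(n);
};
```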
### Calling convention

Parameters for a generated snippet are split into schedule-invariant and schedule-dependent. Schedule-invariant parameters include pointers to input/output tensors and strides for each of them with the same rank as the scheduling domain.
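
For illustration, the schedule-invariant part could be modeled as follows (a sketch; the struct and field names are assumptions, not the actual snippet ABI):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Schedule-invariant arguments: one data pointer per input/output tensor,
// plus per-tensor strides whose rank R matches the scheduling domain.
template <std::size_t R>
struct ScheduleInvariantArgs {
    std::vector<const void*> input_ptrs;
    std::vector<void*> output_ptrs;
    std::vector<std::array<std::int64_t, R>> strides;  // one entry per tensor
};
```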

### Diagnostics

#### Reference mode

A subgraph can be executed with nGraph references if no generator is present.

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO SnippetS](../README.md)
* [OpenVINO Core Components](../../../README.md)
* [Developer documentation](../../../../docs/dev/index.md)

# OpenVINO Hetero Plugin Design Overview

## Subgraphs selection

Algorithm:

For each plugin:

1. Select a *root* node:
   * A node not in a previously constructed subgraph
   * Affinity is equal to the plugin name
2. Select a node adjacent to any node in the current subgraph which is not on the *rejected* list:
   * If there are no such nodes, **end**
3. Verify that the selected node has the same affinity.
4. Add the node to the subgraph if the check has been successful; otherwise, add it to the *rejected* list.
5. Check the global condition:
   * Nodes in the *rejected* list can never be added to the subgraph
   * Nodes not in the subgraph and not in the *rejected* list can possibly be added later
   * Check the subgraph topology (the only check now is whether there are no indirect subgraph self-references)
6. If the global condition has failed, remove the last node from the subgraph, add it to the *rejected* list, and go to step 5.
   * A rollback can happen multiple times here because the *rejected* list changes every time
7. Go to step 2.
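
The selection loop above can be modeled with a small self-contained toy. All names here are illustrative; the real plugin operates on an ngraph function, and the global-condition check with rollback (steps 5 and 6) is omitted for brevity:

```cpp
#include <set>
#include <string>
#include <vector>

// Toy graph: each node has an affinity string and an adjacency set.
struct ToyGraph {
    std::vector<std::string> affinity;   // affinity[node]
    std::vector<std::set<int>> adj;      // adj[node] = neighbors
};

// Grow one subgraph for `plugin` from `root` (steps 1-4 and 7);
// `used` holds nodes already claimed by previously built subgraphs.
std::set<int> grow_subgraph(const ToyGraph& g, const std::string& plugin,
                            const std::set<int>& used, int root) {
    std::set<int> subgraph{root};
    std::set<int> rejected;
    for (;;) {
        int next = -1;
        for (int n : subgraph) {         // step 2: find an adjacent candidate
            for (int m : g.adj[n]) {
                if (!subgraph.count(m) && !rejected.count(m) && !used.count(m)) {
                    next = m;
                    break;
                }
            }
            if (next != -1) break;
        }
        if (next == -1) break;           // step 2: no such nodes -> end
        if (g.affinity[next] == plugin)  // step 3: same affinity?
            subgraph.insert(next);       // step 4: add to subgraph
        else
            rejected.insert(next);       // step 4: or reject
    }
    return subgraph;
}
```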

```mermaid
graph TD;
6-->7;
```

Nodes [1,2,3,5,6,7] are supported by the plugin, [4] is not.

Possible roots: [1,2,3,5,6,7]
1. Select root [1]

4. Merge [5]
   * Subgraph: [1,2,3,5]
   * Rejected: []
   * Global condition: There are possible self-references through node [4], but they are not known yet, ok

5. Merge [6]
   * Subgraph: [1,2,3,5,6]
   * Rejected: []
   * Global condition: There are possible self-references through node [4], but they are not known yet, ok
6. Merge [7]
   * Subgraph: [1,2,3,5,6,7]
   * Rejected: []
   * Global condition: There are possible self-references through node [4], but they are not known yet, ok
7. Failed to merge [4]
   * Subgraph: [1,2,3,5,6,7]
   * Rejected: [4]
   * Global condition: There are self-references through node [4], reject
8. Rollback [7]
   * Subgraph: [1,2,3,5,6]
   * Rejected: [4,7]
   * Global condition: There are self-references through node [4], reject
9. Rollback [6]
   * Subgraph: [1,2,3,5]
   * Rejected: [4,6,7]
   * Global condition: There are self-references through node [4], reject
10. Rollback [5]
    * Subgraph: [1,2,3]
    * Rejected: [4,5,6,7]

Possible roots: [5,6,7]
5. Merge [2]
   * Subgraph: [2,3,5,6,7]
   * Rejected: []
   * Global condition: There are possible self-references through node [4], but they are not known yet, ok

6. Failed to merge [4]
   * Subgraph: [2,3,5,6,7]
   * Rejected: [4]
   * Global condition: There are self-references through node [4], reject
7. Rollback [2]
   * Subgraph: [3,5,6,7]
   * Rejected: [2,4]

Possible roots: [] (no roots), **END**

Subgraphs: [1,2,3], [3,5,6,7]

Select the best subgraph:
* When there are multiple subgraphs, the larger one ([3,5,6,7]) is **always** selected.

Repeat the previous steps with the remaining nodes [1,2].

## Subgraphs self reference detection

1. For each node in a network, build a list of reachable nodes (transitive closure).
2. For each pair of nodes in a subgraph, find `path` nodes (nodes through which one node in the pair can reach the other):
   * assume `src` is one node in the pair and `dst` is the other
   * get all nodes reachable from `src`
   * among those nodes, find the nodes through which you can reach `dst`; these will be the `path` nodes
3. Results for pairs are cached.
4. Check whether there is an intersection between the `path` nodes set and the rejected nodes set for each pair of nodes in a subgraph.
5. If an intersection happens, a self-reference occurs, and the subgraph is invalid.
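
A compact sketch of steps 4 and 5, assuming the reachability and `path` sets have already been computed (the function and set names are illustrative):

```cpp
#include <algorithm>
#include <iterator>
#include <set>

// The subgraph is invalid if some `path` between two of its nodes goes
// through a rejected node: collapsing the subgraph would then create an
// indirect self-reference.
bool has_self_reference(const std::set<int>& path_nodes,
                        const std::set<int>& rejected_nodes) {
    std::set<int> common;
    std::set_intersection(path_nodes.begin(), path_nodes.end(),
                          rejected_nodes.begin(), rejected_nodes.end(),
                          std::inserter(common, common.begin()));
    return !common.empty();
}
```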
## See also

* [OpenVINO™ README](../../../README.md)
* [OpenVINO Core Components](../../README.md)
* [OpenVINO Plugins](../README.md)
* [Developer documentation](../../../docs/dev/index.md)

## Key Contacts

For assistance regarding CPU, contact a member of the [openvino-ie-cpu-maintainers](https://github.com/orgs/openvinotoolkit/teams/openvino-ie-cpu-maintainers) group.

## Components

CPU Plugin contains the following components:

* [docs](./docs/) - developer documentation pages for the component.
* [src](./src/) - sources of the core component.
* [tests](./tests/) - tests for OpenVINO Plugin components.
* [thirdparty](./thirdparty/) - third-party modules.
* [tools](./tools/) - tools and helpers for OpenVINO Plugin components.

## Tutorials

* [Internal CPU Plugin Optimizations](./docs/internal_cpu_plugin_optimization.md)
## See also

* [OpenVINO™ README](../../../README.md)
* [OpenVINO Core Components](../../README.md)
* [OpenVINO Plugins](../README.md)

Intel SDE can be used for emulating CPU architecture, checking for AVX/SSE transitions, bad pointers and data misalignment, etc.

It also supports debugging within emulation.

In general, the tool can be used for all kinds of troubleshooting activities except performance analysis.

See [Documentation](https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html) for more information.

- Running _cpuFuncTests_ on some old architecture, for example Sandy Bridge:

```sh
/path/to/sde -snd -- ./cpuFuncTests
```

- Count AVX/SSE transitions for the current host:

```sh
/path/to/sde -ast -- ./benchmark_app -m path/to/model.xml
```

> **NOTE**: The best way to check for AVX/SSE transitions is to run within Alder Lake emulation:

```sh
/path/to/sde -adl -- ./benchmark_app -m path/to/model.xml
```

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

# CPU Plugin Debug Capabilities

The page describes a list of useful debug features, controlled by environment variables.

They can be activated at runtime and might be used for analyzing issues, getting more context, comparing execution results, etc.

To have CPU debug capabilities available at runtime, use the following CMake option when building the plugin:
* `ENABLE_DEBUG_CAPS`. The default is `OFF`.
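
For example, a typical configure-and-build sequence with debug capabilities enabled might look like this (the directory layout is an assumption of this sketch):

```sh
cd openvino && mkdir -p build && cd build
cmake -DENABLE_DEBUG_CAPS=ON ..
cmake --build . --parallel
```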

The following debug capabilities are available with the latest OpenVINO:

- [Verbose mode](../src/docs/verbose.md)
- [Blob dumping](../src/docs/blob_dumping.md)
- [Graph serialization](../src/docs/graph_serialization.md)

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

## Fusing Convolution and Sum Layers

A combination of convolution, simple, and Eltwise layers with the sum operation results in a single layer called *Convolution*:

```mermaid
flowchart TD
```

CPU plugin removes a Power layer from a topology if it has the following parameters:
- <b>offset</b> = 0

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

# Performance Analysis Using ITT Counters

## Contents

For performance analysis, follow the steps below:

### Intel SEAPI

#### Example of running the tool:

```sh
python ~/tools/IntelSEAPI/runtool/sea_runtool.py -o trace -f gt ! ./benchmark_app -niter 1 -nireq 1 -nstreams 1 -api sync -m ./resnet-50-pytorch/resnet-50-pytorch.xml
```

#### Mandatory parameters:
* `-o trace` – output file name

The generated file can be opened in Google Chrome using the "chrome://tracing" URL.

### Intel VTune Profiler

#### Example of running the tool:

```sh
vtune -collect hotspots -k sampling-mode=hw -k enable-stack-collection=true -k stack-size=0 -k sampling-interval=0.5 -- ./benchmark_app -nthreads=1 -api sync -niter 1 -nireq 1 -m ./resnet-50-pytorch/resnet-50-pytorch.xml
```

#### Mandatory parameters:
* `-collect hotspots`

The generated file can be opened with the VTune client.

Use the API defined in the [openvino/itt](https://docs.openvinotoolkit.org/latest/itt_2include_2openvino_2itt_8hpp.html) module.

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)
* [OpenVINO GPU Plugin](../README.md)
* [Developer documentation](../../../../docs/dev/index.md)

# CPU Plugin Runtime Parameters Cache

## Checklist for the runtime cache implementation

1. Determine what data will be cached. It is commonly recommended to use the Executor concept that represents a junction of the executable code, usually a JIT-generated kernel, with some precomputed algorithm parameters.

2. Provide a key that uniquely identifies the cached value as a function of dynamically changing parameters, that is, shapes, dynamic input that determines the algorithm parameters, etc. To be used in a hash table, the key must have the following static interface:

```cpp
struct KeyType {
    size_t hash() const;
    bool operator==(const KeyType& rhs) const;
};
```
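
As an illustration of this interface, a minimal key for a hypothetical executor that depends only on an input shape could look like this (the `ShapeKey` name, its members, and the boost-style hash-combine recipe are assumptions of this sketch):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct ShapeKey {
    std::vector<std::size_t> dims;

    std::size_t hash() const {
        std::size_t seed = 0;
        for (std::size_t d : dims) {
            // boost-style hash combine; any good mixing function works here
            seed ^= std::hash<std::size_t>()(d) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
    bool operator==(const ShapeKey& rhs) const {
        return dims == rhs.dims;
    }
};
```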

3. Provide a builder, that is, a callable object of the following signature:

```cpp
ValueType build(const KeyType& key);
```

The `ValueType` is a type to be cached (for example, a shared pointer to an Executor object). Remember that in the current cache implementation, a default constructed `ValueType()` object is considered empty. Therefore, it is better to use `std::shared_ptr` as the `ValueType`. The builder instantiates a specific type of cached entity from the `key`, so the `key` completely defines the cached data. The builder is used to create the `ValueType` object in case of a cache miss.

4. Refactor the specific implementation of the `prepareParams()` method to extract the cached object construction logic (for example, the algorithm parameters recalculation and JIT kernel generation) into the builder.

5. Add the key generation code into the `prepareParams()` method to query the cache.

6. Implement cache usage as follows:

```cpp
void prepareParams() override {
    ... // code that prepares parameters for the key
    // ... (key creation and cache query)
    execPtr = result.first;
}
```

7. To provide smoke testing of these changes, add repeated shapes to the "target shapes" part of the corresponding single layer test definition:

```cpp
{ // dynamic case description: each pair per input has {{dynamic shape}, {{static shape case1}, {static shape case2}, ...}}
    // ...
    {{-1, -1, 5}, {{10, 10, 5}, {5, 5, 5}, {10, 10, 5}}} // input 1
},
```

**Note that placing two identical target shapes one after another does not trigger the cache,** since another optimization based on the fact that the shapes have not been changed takes place. For example, the following test definition does not properly test the cache:

```cpp
{ // the shape infer and params preparation stages will be skipped for the second target shapes combination since the shapes are not changed
    {{-1, -1, -1}, {{5, 5, 5}, {5, 5, 5}}}, // input 0
    // ...
},
```

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

## Key Contacts

For assistance regarding GPU, contact a member of the [openvino-ie-gpu-maintainers](https://github.com/orgs/openvinotoolkit/teams/openvino-ie-gpu-maintainers) group.

## Components

GPU Plugin contains the following components:

* [GPU plugin unit test](./docs/gpu_plugin_unit_test.md)

## Attached licenses

GPU plugin uses 3<sup>rd</sup>-party components licensed under the following licenses:

- *googletest* under [Google License](https://github.com/google/googletest/blob/master/googletest/LICENSE)
- *OpenCL™ ICD and C++ Wrapper* under [Khronos™ License](https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/LICENSE.txt)
- *RapidJSON* under [Tencent License](https://github.com/Tencent/rapidjson/blob/master/license.txt)

## Support

To report issues and make suggestions, see [GitHub issues](https://github.com/openvinotoolkit/openvino/issues).

## How to Contribute

Community contributions to GPU plugin are highly welcome. If you have a suggestion on how to improve the library:

- Share your proposal via [GitHub issues](https://github.com/openvinotoolkit/openvino/issues)
- Ensure you can build the product and run all the tests with your patch
- In case of a larger feature, create a test
- Submit a [pull request](https://github.com/openvinotoolkit/openvino/pulls)

We will review your contribution and, if any additional fixes or modifications are necessary, we may provide feedback to guide you. Once your pull request has been approved, it will be merged into our GitHub repository.
## System Requirements
|
||||
|
||||
GPU plugin supports Intel® HD Graphics, Intel® Iris® Graphics and Intel® Arc™ Graphics and is optimized for Gen9-Gen12LP, Gen12HP architectures
|
||||
|
||||
GPU plugin currently uses OpenCL™ with multiple Intel OpenCL™ extensions and requires Intel® Graphics Driver to run.
|
||||
|
||||
@@ -1,4 +1,4 @@
# Basic data structures of GPU graph and overall flow
# Basic Data Structures of GPU Graph and Overall Flow

## Overall graph data structure
<a name="fig1"></a>
@@ -60,23 +60,23 @@ d1 ..> d2 : Dependency
```

There are three levels of abstraction in the graph structures used in the GPU plugin: *topology*, *program*, *network*. <br>
The above <a href="#fig1">figure</a> presents the overall data structures.
The above <a href="#fig1">figure</a> presents the overall data structures.

First, the original model should be presented as a corresponding *topology*, which is consisting of primitives and their connections. It can be regarded as a simple graph structure representing the original model.
First, the original model should be presented as a corresponding *topology*, which consists of primitives and their connections. It can be regarded as a simple graph structure representing the original model.

Then the topology is to be converted to a *program*, which is consisting of *program_nodes* corresponding to the original primitives and their connections.
Then the topology is to be converted to a *program*, which consists of *program_nodes* corresponding to the original primitives and their connections.
Here, the majority of the transformations and optimizations are performed on the *program*.
Also, the *primitive_impl* is created for each *program_node* at this stage, which holds the selected kernels for each *program_node* and the required information to run the kernels such as work group sizes and kernel arguments, etc. The final source code of the kernels are decided and compiled at this stage, too.
Note that a *program* is common for the streams, i.e., there is only one *program* created for all the streams.
Also, the *primitive_impl* is created for each *program_node* at this stage, which holds the selected kernels for each *program_node* and the required information to run the kernels, such as work group sizes and kernel arguments, etc. The final source code of the kernels is decided and compiled at this stage, too.
Note that a *program* is common for the streams, that is, there is only one *program* created for all the streams.

Once the *program* is finalized, then the *network* is built from the *program* for each stream.
A *network* is consisting of primitive instances (a.k.a *primitive_inst*) that contains the required memory allocations for the kernels.
Then finally we can run the *network* by running the network::execute().
Once the *program* is finalized, the *network* is built from the *program* for each stream.
A *network* consists of primitive instances (*primitive_inst*) that contain the required memory allocations for the kernels.
Finally, you can run the *network* using the `network::execute()` method.

The more detailed description of each component is to be described in the below sections.
A more detailed description of each component is given in the sections below.

## primitive
## primitive
```cpp
struct primitive {
...
@@ -87,16 +87,16 @@ struct primitive {
...
};
```
A *primitive* is the primary representation of an operation in gpu plugin, which comprises a graph structure, i.e., the *topology*. A *primitive* is to be created for a layer operation in the original model and holds the basic information about the operation, such as required input, output, attributes, as well as its own id, a.k.a *primitive_id*. Here, the *primitive_id* is a unique string id assigned to each *primitive* throughout the processing. <br>
A *primitive* is the primary representation of an operation in GPU plugin, which comprises a graph structure, that is, the *topology*. A *primitive* is to be created for a layer operation in the original model and holds the basic information about the operation, such as required input, output, attributes, as well as its own id (*primitive_id*). Here, the *primitive_id* is a unique string id assigned to each *primitive* throughout the processing. <br>

The APIs of the available primitives can be found [here](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_gpu/include/intel_gpu/primitives).<br>
See the APIs of the available [primitives](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_gpu/include/intel_gpu/primitives).<br>

An example creation of a arg_max_min primitive:
An example of creating an `arg_max_min` primitive:
```cpp
cldnn::arg_max_min top_k_prim = cldnn::arg_max_min("top_k", { "input" }, arg_max_min::max, top_k, arg_max_min::y, arg_max_min::sort_by_values, false, "", padding(), data_types::f32);
```

In GPU plugin, the *primitives* are converted from ngraph operations, which can be found [here](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_gpu/src/plugin/ops).
In GPU plugin, the *primitives* are converted from ngraph [operations](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_gpu/src/plugin/ops).

## topology
```cpp
@@ -107,9 +107,9 @@ struct topology{
};
```

A *topology* is a graph structure consisting of *primitives* and their connections. Here a connection is defined by input primitives assigned to a primitive.
A *topology* is a graph structure consisting of *primitives* and their connections. Here, a connection is defined by input primitives assigned to a primitive.

A simple example of creation of a topology, which is consisting of two poolings, one concatenation of the poolings, and a reorder primitive, is shown as follows:
A simple example of creating a topology, which consists of two poolings, one concatenation of the poolings, and a reorder primitive, is as follows:
```cpp
auto input0 = engine.allocate_memory({data_types::i8, format::bfyx, {1, 1, 8, 3}});
auto input1 = engine.allocate_memory({data_types::i8, format::bfyx, {1, 1, 8, 3}});
@@ -127,9 +127,9 @@ topology topology(input_layout("input0", input0->get_layout()),
reorder("reorder", "concat", reorder_layout));
```

In the above example, "pool0" is the *primitive_id* of the first pooling, and "input0" is the *primitive_id* of the input primitive of it. The latter parameters such as pooling_mode::max, {1, 1, 2, 2}, {1, 1, 1, 1} are other properties for pooling primitive, pooling_mode, tensor size, stride, respectively.
In the example above, "pool0" is the *primitive_id* of the first pooling, and "input0" is the *primitive_id* of its input primitive. The remaining parameters, `pooling_mode::max, {1, 1, 2, 2}, {1, 1, 1, 1}`, specify other properties of the pooling primitive: the pooling mode, tensor size, and stride, respectively.

Note that topology is created from ngraph representation in the gpu plugin. Manual definition of a topology shown in the above snippet is usually for unittest purpose.
Note that a topology is created from the ngraph representation in the GPU plugin. The manual definition of a topology shown in the snippet above is usually for unit testing purposes.

## program_node (impl)

@@ -147,14 +147,15 @@ struct program_node {
...
};
```
A program is consisting of program_nodes which are created from primitives. ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L353)) A program_node is created by a factory for each primitive type, i.e., primitive_type, which is associated to each primitive as type ([link](https://github.com/openvinotoolkit/openvino/blob/173f328c53d39dd42ecdb9de9e04f9d2c266683f/src/plugins/intel_gpu/include/intel_gpu/primitives/primitive.hpp#L79)). (Note that this primitive_type is used to create primitive_inst or call choose_impl too.)
A program consists of *program_nodes*, which are created from primitives. ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L353)) A *program_node* is created by a factory for each *primitive type*, that is, *primitive_type*, which is associated with each primitive as a type ([link](https://github.com/openvinotoolkit/openvino/blob/173f328c53d39dd42ecdb9de9e04f9d2c266683f/src/plugins/intel_gpu/include/intel_gpu/primitives/primitive.hpp#L79)). Note that this *primitive_type* is also used to create a *primitive_inst* or to call *choose_impl*.

Basically a program_node holds the following information which is to be decided throughout the transformation / optimization processes in a program:
* layout : output layout of a program_node. ([impl](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/include/intel_gpu/runtime/layout.hpp))
* dependencies : a list of program_nodes whose outputs are used by the current program_node as the inputs
* memory dependencies : a list of program_nodes, the live ranges of the outputs of them overlaps with that of the current program_node
* fused operations : fused operations to the current program_node
* selected impl : The primitive_impl object which holds the information for the selected kernel required to run it, such as the selected kernels, work group size, etc. Also this object has the methods to set kernel arguments for a primitive_inst and execute the kernel by enqueueing it to the command queue.
A *program_node* holds the following information, which is decided throughout the transformation / optimization processes in a program:

* layout: output layout of a *program_node*. ([impl](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/include/intel_gpu/runtime/layout.hpp))
* dependencies: a list of *program_nodes*, the outputs of which are used by the current *program_node* as the inputs
* memory dependencies: a list of *program_nodes*, the live ranges of whose outputs overlap with that of the current *program_node*
* fused operations: operations fused into the current *program_node*
* selected impl: The *primitive_impl* object, which holds the information required to run the selected kernel, such as the selected kernels, work group size, etc. Also, this object has the methods to set kernel arguments for a *primitive_inst* and execute the kernel by enqueueing it to the command queue.

## program (impl)

@@ -174,16 +175,16 @@ struct program {
```
The major tasks that are done while building a program are as follows:
([ref](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L433))
* Init graph : Create an initial program consisting of program_nodes built from a given topology
* Optimization (Major optimizations will be dealt with from another section TBD)
* pre-optimization ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L474)): Optimizations done before graph_compilation. Notable passes are as follows:
* prepare_primitive_fusing : decision of fusing
* reorder_inputs : decision of preferred layout / impl (ocl vs onednn) and adding reorders w.r.t the decision
* post-optimization ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L437)) Optimizations done after graph_compilation <br>
* post_optimize_weights : Add reorder for the weights toward preferred formats (as generic nodes) <br>
* propagate_constants : Transfer and reorder original weight data to the generic_nodes created at post_optimize_weights. Here, note that the constant propagation is doing weight reorder by running actual network (w/ is_internal = true). To this end, a temporal program is created/built/run within this pass. <br>
* Init graph: Create an initial program consisting of *program_nodes* built from a given topology.
* Optimization (major optimizations will be described in another section, TBD)
* pre-optimization ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L474)): Optimizations done before *graph_compilation*. Notable passes are as follows:
* *prepare_primitive_fusing*: decision of fusing
* *reorder_inputs*: decision of preferred layout / impl (ocl vs onednn) and adding reorders according to that decision
* post-optimization ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L437)): Optimizations done after *graph_compilation* <br>
* *post_optimize_weights*: Add reorders for the weights toward preferred formats (as generic nodes) <br>
* *propagate_constants*: Transfer and reorder original weight data to the *generic_nodes* created at *post_optimize_weights*. Note that the constant propagation performs a weight reorder by running the actual network (with `is_internal = true`). To this end, a temporary program is created, built, and run within this pass. <br>

* Kernel selection and graph compilations ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L436)) : Select best kernel for the program_node and create the impl (i.e., primitive_impl), and collect the kernel source code strings to the kernels_cache.
* Kernel selection and graph compilations ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L436)): Select the best kernel for the *program_node*, create the impl (that is, *primitive_impl*), and collect the kernel source code strings into the *kernels_cache*.
* Kernel compilation ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L451)): JIT compilation of the collected kernels. Currently, nine kernels are combined into a batch and compiled at a time. Also, the batches are compiled in parallel. See [here](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/runtime/kernels_cache.cpp#L400).

## primitive_inst (impl)
@@ -203,12 +204,12 @@ class primitive_inst {
...
};
```
Once all processing at a program level is finished, a network is to be built from the program.
primitive_inst is the basic component comprising a network.
While each primitive_inst object is still associated to the corresponding program_node, it holds the required memory objects such as output memory objects and intermediate memory objects that are to be used by that node. A brief description for the two kinds of memory allocated for a primitive_inst is as follows:
Once all processing at a program level has been finished, a network is to be built from the program.
The *primitive_inst* is the basic component comprising a network.
While each *primitive_inst* object is still associated with the corresponding *program_node*, it holds the required memory objects, such as output memory objects and intermediate memory objects that are to be used by that node. A brief description of the two kinds of memory allocated for a *primitive_inst* is as follows:

* output memory : An output memory of a primitive_inst is allocated at the creation of each primitive_inst ([impl](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L210)), unless its output is reusing the input memory or the node is a mutable data to be used as a 2nd output. The general output tensors are allocated by the memory pool, so that the memory could be reused by other nodes when it is not needed. (Note that constants data are not reusable and should retain the own memory, so that they could be shared by multiple streams. More descriptions about memory pool will be given by dedicated section (TBD)).
* intermediate memory ([impl](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L215)): Some kernels requires intermediate memories in addition to the input/output memories such as [detection_output](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/kernel_selector/core/actual_kernels/detection_output/detection_output_kernel_ref.cpp#L155). The allocation happens after all primitive_insts are finished ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/network.cpp#L592)), since it needs to be processed in a processing_order to use the predecessors' allocation information while the creation of primitive_inst is done in a order sorted by memory_size.
* output memory: An output memory of a *primitive_inst* is allocated at the creation of each *primitive_inst* ([impl](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L210)), unless its output is reusing the input memory or the node is a mutable data to be used as a second output. The general output tensors are allocated by the memory pool, so that the memory could be reused by other nodes when it is not needed. Note that constant data is not reusable and should retain its own memory so that it could be shared by multiple streams. A more detailed description of the memory pool will be given in the dedicated section (TBD).
* intermediate memory ([impl](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L215)): Some kernels require intermediate memories in addition to the input/output memories, such as [detection_output](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/kernel_selector/core/actual_kernels/detection_output/detection_output_kernel_ref.cpp#L155). The allocation happens after the creation of all *primitive_insts* is finished ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/network.cpp#L592)), since it needs to be processed in a *processing_order* to use the predecessors' allocation information, while the creation of *primitive_inst* is done in an order sorted by *memory_size*.

## network (impl)
```cpp
@@ -230,14 +231,15 @@ struct network {
void allocate_primitives();
};
```
When a network is built, the comprising primitives are allocated and dependencies among them are set ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/network.cpp#L259)).
When a network is built, the comprising primitives are allocated and dependencies among them are set ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/network.cpp#L259)).

The major processes done while a network is executed are as follows ([impl]( https://github.com/openvinotoolkit/openvino/blob/3de428c7139fef69e37b406c3490c26b67b48026/src/plugins/intel_gpu/src/graph/network.cpp#L663)) :
* set arguments of the primitives (i.e., set the [kernel_args](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/include/intel_gpu/runtime/kernel_args.hpp) required for running the kernels such as input/output memory address)
The major processes performed while a network is executed are as follows ([impl]( https://github.com/openvinotoolkit/openvino/blob/3de428c7139fef69e37b406c3490c26b67b48026/src/plugins/intel_gpu/src/graph/network.cpp#L663)):
* set arguments of the primitives (that is, set the [kernel_args](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/include/intel_gpu/runtime/kernel_args.hpp) required for running the kernels, such as the input/output memory addresses)

* [execute primitives](https://github.com/openvinotoolkit/openvino/blob/3de428c7139fef69e37b406c3490c26b67b48026/src/plugins/intel_gpu/src/graph/network.cpp#L849) : Execute each primitives, i.e., enqueue the kernels to the context queue.
* [execute primitives](https://github.com/openvinotoolkit/openvino/blob/3de428c7139fef69e37b406c3490c26b67b48026/src/plugins/intel_gpu/src/graph/network.cpp#L849): Execute each primitive, that is, enqueue the kernels to the context queue.
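
For illustration, a minimal sketch of this build-and-execute flow through the clDNN-style API is shown below. It reuses `engine`, `topology`, `input0`, and `input1` from the topology example above; the constructor and helper signatures are assumptions and may differ between plugin versions.

```cpp
// Hypothetical sketch: build a network from the topology above and execute it.
// Constructing the network builds the program (optimization + kernel compilation).
cldnn::network network(engine, topology);

network.set_input_data("input0", input0);   // bind input memory to the input_layout primitives
network.set_input_data("input1", input1);

// execute() sets the kernel arguments and enqueues the kernels to the queue
auto outputs = network.execute();

// network_output::get_memory() waits for completion before exposing the buffer
auto output_mem = outputs.at("reorder").get_memory();
```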

## See Also

## See also
* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

@@ -1,31 +1,33 @@
# Execution of Inference

Network execution happens when user calls `inferRequest->infer()` or `inferRequest->start_async()`. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/samples/cpp/benchmark_app/main.cpp#L929)
Network execution is triggered when the `inferRequest->infer()` or `inferRequest->start_async()` methods are called. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/samples/cpp/benchmark_app/main.cpp#L929)

In high level, all we need to do is enqueuing OCL kernels with buffers. For that purpose, we need to find the `cldnn::network` instance as it contains the required buffers for execution. [(link)](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/basic_data_structures.md#network-impl) `CPUStreamExecutor` is holding streams and the stream corresponds to the `cldnn::network` structure. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/inference/src/threading/ie_cpu_streams_executor.cpp#L263)
At a high level, all that is required is to enqueue OCL kernels with buffers. For that purpose, you need to find the `cldnn::network` instance, as it contains the required buffers for execution. [(link)](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/basic_data_structures.md#network-impl) `CPUStreamExecutor` holds streams, and each stream corresponds to the `cldnn::network` structure. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/inference/src/threading/ie_cpu_streams_executor.cpp#L263)

The main body of network execution is `cldnn::network::execute_impl`. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/graph/network.cpp#L663) In this function, `set_arguments()` is called to set OpenCL arguments and `execute_primitive` is called to enqueue kernels to the OCL queue.
In case of synchronous API call(i.e. `inferRequest->infer()`), waiting for completion of kernels is also required. It is called from `cldnn::network_output::get_memory()` function. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/include/intel_gpu/graph/network.hpp#L31)
In case of a synchronous API call (that is, `inferRequest->infer()`), waiting for the completion of kernels is also required. It is called from the `cldnn::network_output::get_memory()` function. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/include/intel_gpu/graph/network.hpp#L31)
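
For reference, below is a minimal sketch of both call styles using the Inference Engine C++ API; the model path is a placeholder.

```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork("model.xml");           // placeholder model path
    auto exec_network = core.LoadNetwork(network, "GPU");   // builds cldnn::network instances per stream
    auto infer_request = exec_network.CreateInferRequest();

    // Synchronous call: returns after the kernels have completed
    infer_request.Infer();

    // Asynchronous call: enqueues now, waits explicitly later
    infer_request.StartAsync();
    infer_request.Wait(InferenceEngine::InferRequest::WaitMode::RESULT_READY);
    return 0;
}
```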

## Optimized-out node

During graph compilation [(link)](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/graph_optimization_passes.md), some nodes may be optimized out.

For example, concat operation may be executed _implicitly_, or in other words, concat may be _optimized out_. Implicit concat is possible when the input of concat can put the output tensor directly into the result tensor of concat.
For example, a concat operation may be executed _implicitly_, or in other words, concat may be _optimized out_. Implicit concat is possible when the nodes producing the concat inputs can put their output tensors directly into the resulting tensor of concat.

In such case, we don't remove the node in the graph for integrity of node connection. Concat layer is just marked as **optimized-out** and not executed during runtime. [(src)](https://github.com/openvinotoolkit/openvino/blob/dc6e5c51ee4bfb8a26a02ebd7a899aa6a8eeb239/src/plugins/intel_gpu/src/graph/impls/ocl/primitive_base.hpp#L155)
In such a case, the node is not removed from the graph, to keep the node connections intact. The concat layer is just marked as **optimized-out** and not executed during runtime. [(src)](https://github.com/openvinotoolkit/openvino/blob/dc6e5c51ee4bfb8a26a02ebd7a899aa6a8eeb239/src/plugins/intel_gpu/src/graph/impls/ocl/primitive_base.hpp#L155)

## Dumping layer in/out buffer during execution
`cldnn::network::execute_impl` also contains some logic to dump layer in/out buffers for debugging purpose. As it is related to memory usage, it deserves some description, too.
The `cldnn::network::execute_impl` function also contains some logic to dump layer in/out buffers for debugging purposes. As it is related to memory usage, it deserves some description, too.

In order to dump buffers, we need to wait for the moment that the kernel is about to be called(for source buffer) or just called(for destination buffer). In other moments, we don't have the layer's buffer as the buffers are reused from memory pool. [(link)](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/memory_allocation_gpu_plugin.md#memory-dependency-and-memory-pool)
To dump buffers, you need to wait for the moment when the kernel is about to be called (for the source buffer) or has just been called (for the destination buffer). At other moments, you do not have the layer's buffer, as the buffers are reused from the memory pool. [(link)](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/memory_allocation_gpu_plugin.md#memory-dependency-and-memory-pool)

`get_stream().finish()` is called firstly as we need to be synchronous with kernel execution. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/graph/network.cpp#L712) Then we can access the buffer. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/graph/network.cpp#L114) This access varies depending on the kind of buffer. If it is `usm_host` or `usm_shared`, it is just accessed directly. If it is `usm_device`, it is accessed after copying the data into host memory because host cannot access `usm_device` directly. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L312) If it is ocl memory, we map this into host memory. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L46)
The `get_stream().finish()` function is called first, as you need to be synchronous with kernel execution. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/graph/network.cpp#L712) Then, you can access the buffer. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/graph/network.cpp#L114) This access varies depending on the kind of buffer. If it is `usm_host` or `usm_shared`, it is just accessed directly. If it is `usm_device`, it is accessed after copying the data into host memory because the host cannot access `usm_device` directly. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L312) If it is OCL memory, you map it into host memory. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L46)
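
As a hedged sketch, the buffer access described above can be done with the `mem_lock` helper from the intel_gpu runtime; the variable names here are assumptions, and the exact template parameters may vary between versions.

```cpp
// Hypothetical sketch: read back a layer buffer after get_stream().finish().
// mem_lock locks the memory for host access (mapping OCL memory or copying
// usm_device data as needed) and releases it on destruction.
#include <iostream>

void dump_values(cldnn::memory::ptr mem, cldnn::stream& stream) {
    cldnn::mem_lock<float, cldnn::mem_lock_type::read> lock(mem, stream);
    for (size_t i = 0; i < lock.size(); ++i) {
        std::cout << lock[i] << "\n";   // one value per line, as in the dump files
    }
}
```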

Typical network execution happens with `usm_host` for network input and output and `usm_device` for the buffers inside the network.

For usage of this dumping feature, please see [link](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/gpu_debug_utils.md#layer-inout-buffer-dumps).
For usage of this dumping feature, see this [link](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/gpu_debug_utils.md#layer-inout-buffer-dumps).

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

@@ -1,21 +1,23 @@
# GPU plugin debug utils
# GPU Plugin Debug Utils

This document is a list of useful debug features / tricks that might be used to find the root cause of performance / functional issues. Some of them
are available by default, but some others might require plugin recompilation.

## Debug Config
`Debug_config` is an infra structure that contains number of easy-to-use debugging features. It has various control parameters. You can check list of parameters from the source code `cldnn::debug_configuration`.

`Debug_config` is an infrastructure that contains several easy-to-use debugging features. It has various control parameters, which you can check in the source code of `cldnn::debug_configuration`.

### How to use it
First, this feature should be enabled from cmake configuration `ENABLE_DEBUG_CAPS`. When openvino is released, it is turned off by default.
The parameters should be set from environment variable when calling inference engine API.

First, this feature should be enabled from cmake configuration `ENABLE_DEBUG_CAPS`. When OpenVINO is released, it is turned off by default.
The parameters should be set from an environment variable when calling the inference engine API.

```
$ OV_GPU_Verbose=1 ./benchmark_app ... # Run benchmark_app with OV_GPU_Verbose option
$ OV_GPU_DumpLayersPath="cldnn/" ./benchmark_app ... # Run benchmark_app and store intermediate buffers into cldnn/ directory.
```

For Windows OS, please use below syntax.
For Windows OS, use the following syntax:

```
Windows Power Shell:
@@ -28,38 +30,42 @@ Windows cmd.exe:
```

### Options syntax

The plugin can parse different naming styles for debug options:
1. `OV_GPU_SOME_OPTION`
2. `OV_GPU_SomeOption`

Behavior when both versions are specified is not defined.

Some options also allow multiple prefixes: `OV` and `OV_GPU`. `OV` prefix is intended to be used for options common for all OpenVINO components. In case if an option is set twice with different prefixes, then `OV_GPU` has higher priority.
Some options also allow multiple prefixes: `OV` and `OV_GPU`. The `OV` prefix is intended to be used for options common for all OpenVINO components. When an option is set twice with different prefixes, `OV_GPU` has higher priority.

### List of parameters (There are actually more than this, please see OV_GPU_Help result)
### List of parameters

* `OV_GPU_Help`: Show help message of debug config.
* `OV_GPU_Verbose`: Verbose execution. Currently, Verbose=1 and 2 are supported.
* `OV_GPU_PrintMultiKernelPerf`: Print kernel latency for multi-kernel primitives. This is turned on by setting 1. Execution time is printed.
* `OV_GPU_DisableUsm`: Disable the usage of usm (unified shared memory). This is turned on by setting 1.
* `OV_GPU_DisableOnednn`: Disable onednn for discrete GPU (no effect for integrated GPU)
* `OV_GPU_DumpGraphs`: Dump optimized graph into the path that this variable points. This is turned on by setting the destination path into this variable.
* `OV_GPU_DumpSources`: Dump opencl sources
* `OV_GPU_DumpLayersPath`: Enable intermediate buffer dump and store the tensors. This is turned on by setting the destination path into this variable. You can check the exact layer name from `OV_GPU_Verbose=1`.
* `OV_GPU_DumpLayers`: Dump intermediate buffers only for the layers that this variable specifies. Multiple layers can be specified with space delimiter. Dump feature should be enabled through `OV_GPU_DumpLayersPath`
* `OV_GPU_DumpLayersResult`: Dump output buffers of result layers only
* `OV_GPU_DumpLayersDstOnly`: When dumping intermediate buffer, dump destination buffer only. This is turned on by setting 1.
* `OV_GPU_DumpLayersLimitBatch`: Limit the size of batch to dump
* `OV_GPU_DryRunPath`: Dry run and serialize execution graph into the specified path
* `OV_GPU_BaseBatchForMemEstimation`: Base batch size to be used in memory estimation
* `OV_GPU_AfterProc`: Run inference after the specified process PIDs are finished, separated by space. Supported on only on linux.
* `OV_GPU_SerialCompile`: Serialize creating primitives and compiling kernels
* `OV_GPU_ForceImplType`: Force implementation type of a target primitive or layer. [primitive or layout_name]:[impl_type] For primitives, fc:onednn, fc:ocl, do:cpu, do:ocl, reduce:ocl and reduce:onednn are supported
* `OV_GPU_MaxKernelsPerBatch`: Maximum number of kernels in a batch during compiling kernels
This is a part of the full list. To get all parameters, see the `OV_GPU_Help` result.

* `OV_GPU_Help`: Shows help message of debug config.
* `OV_GPU_Verbose`: Verbose execution. Currently, `Verbose=1` and `2` are supported.
* `OV_GPU_PrintMultiKernelPerf`: Prints kernel latency for multi-kernel primitives. This is turned on by setting `1`. Execution time is printed.
* `OV_GPU_DisableUsm`: Disables the usage of usm (unified shared memory). This is turned on by setting `1`.
* `OV_GPU_DisableOnednn`: Disables oneDNN for discrete GPU (no effect for integrated GPU).
* `OV_GPU_DumpGraphs`: Dumps an optimized graph into the path that this variable points to. This is turned on by setting the destination path into this variable.
* `OV_GPU_DumpSources`: Dumps OpenCL sources.
* `OV_GPU_DumpLayersPath`: Enables intermediate buffer dumping and stores the tensors. This is turned on by setting the destination path into this variable. You can check the exact layer name from `OV_GPU_Verbose=1`.
* `OV_GPU_DumpLayers`: Dumps intermediate buffers only for the layers that this variable specifies. Multiple layers can be specified with a space delimiter. Dump feature should be enabled through `OV_GPU_DumpLayersPath`.
* `OV_GPU_DumpLayersResult`: Dumps output buffers of result layers only.
* `OV_GPU_DumpLayersDstOnly`: When dumping an intermediate buffer, dumps the destination buffer only. This is turned on by setting `1`.
* `OV_GPU_DumpLayersLimitBatch`: Limits the size of a batch to dump.
* `OV_GPU_DryRunPath`: Dry runs and serializes the execution graph into the specified path.
* `OV_GPU_BaseBatchForMemEstimation`: Base batch size to be used in memory estimation.
* `OV_GPU_AfterProc`: Runs inference after the specified process PIDs are finished, separated by space. Supported only on Linux.
* `OV_GPU_SerialCompile`: Serializes creating primitives and compiling kernels.
* `OV_GPU_ForceImplType`: Forces implementation type of a target primitive or layer, in the format `[primitive or layout_name]:[impl_type]`. For primitives, `fc:onednn`, `fc:ocl`, `do:cpu`, `do:ocl`, `reduce:ocl`, and `reduce:onednn` are supported.
* `OV_GPU_MaxKernelsPerBatch`: Maximum number of kernels in a batch during compiling kernels.

## Dump execution graph
The execution graph (also known as runtime graph) is a device specific graph after all transformations applied by the plugin. It's a very useful
feature for performance analysis and it allows to find a source of performance regressions quickly. Execution graph can be retrieved from the plugin

The execution graph (also known as a runtime graph) is a device-specific graph after all transformations applied by the plugin. It is a very useful
feature for performance analysis and it allows finding a source of performance regressions quickly. The execution graph can be retrieved from the plugin
using the `GetExecGraphInfo()` method of `InferenceEngine::ExecutableNetwork` and then serialized as a usual IR:
```cpp
ExecutableNetwork exeNetwork;
@@ -68,8 +74,8 @@ using `GetExecGraphInfo()` method of `InferenceEngine::ExecutableNetwork` and th
execGraphInfo.serialize("/path/to/serialized/exec/graph.xml");
```

The capability to retrieve execution graph and store it on the disk is integrated into `benchmark_app`. The execution graph can be simply dumped
by setting additional parameter `-exec_graph_path exec_graph.xml` for `benchmark_app`. Output `xml` file has a format similar to usual IR, but contains
The capability to retrieve the execution graph and store it on the disk is integrated into `benchmark_app`. The execution graph can be simply dumped
by setting an additional parameter `-exec_graph_path exec_graph.xml` for `benchmark_app`. The output `xml` file has a format similar to the usual IR, but contains
execution nodes with some runtime info such as:
- Execution time of each node
- Mapping between nodes in the final device-specific graph and the original input graph operations
@@ -78,7 +84,7 @@ execution nodes with some runtime info such as:
- Primitive type
- Inference precision

Typical node in GPU execution graph looks as follows:
A typical node in the GPU execution graph looks as follows:
```
<layer id="0" name="convolution" type="Convolution">
<data execOrder="1" execTimeMcs="500" originalLayersNames="convolution,relu" outputLayouts="b_fs_yx_fsv16" outputPrecisions="FP16" primitiveType="convolution_gpu_bfyx_to_bfyx_f16" />
@@ -101,24 +107,24 @@ Typical node in GPU execution graph looks as follows:
</layer>
```

Most of the data here is very handy for the performance analysis. For example, for each node you can check that:
- Nodes fusion works as expected on given models (i.e. some node is missing in execution graph and it's name is a part of `originalLayersNames` list for some other node)
Most of the data here is very handy for performance analysis. For example, for each node you can check whether:
- Nodes fusion works as expected on given models (that is, some node is missing in the execution graph and its name is a part of `originalLayersNames` list for some other node)
- Input and output layouts of a node are optimal in each case
- Input and output precisions are valid in each case
- The node used expected kernel for execution
- And the most important: actual execution time of each operation
- The node used the expected kernel for execution
- And most important: the actual execution time of each operation

This graph can be visualized using the Netron tool and all these properties can be analyzed there.

Note: execution time collection for each primitive requires `CONFIG_KEY(PERF_COUNT)` to be enabled (`benchmark_app` does it automatically), thus the overall model execution time is usually much worse in such use cases.
> **NOTE**: execution time collection for each primitive requires `CONFIG_KEY(PERF_COUNT)` to be enabled (`benchmark_app` does it automatically). Therefore, the overall model execution time is usually much worse in such use cases.

## Performance counters

This feature is a simplified version of execution graph as it provides much less information, but it might be more suitable for quick analysis and some kind of
This feature is a simplified version of the execution graph as it provides much less information, but it might be more suitable for quick analysis and some kind of
processing with scripts.

Performance counters can be retrieved from each `InferenceEngine::InferRequest` object using the `getPerformanceCounts()` method. This feature is also integrated
into `benchmark_app` and the counters can be printed to cout using `-pc` parameter.
into `benchmark_app`, and the counters can be printed to `cout` using the `-pc` parameter.

The format looks as follows:

@@ -135,17 +141,16 @@ relu OPTIMIZED_OUT layerType: ReLU realTime: 0
Total time: 53877 microseconds
```

So it allows to quickly check execution time of some operation on the device and make sure that correct primitive is used. Also, the output can be easily
converted into .csv format and then used to collect any kind of statistics (e.g. execution time distribution by layer types).
So it allows you to quickly check the execution time of some operation on the device and make sure that the correct primitive is used. Also, the output can be easily converted into the *.csv* format and then used to collect any kind of statistics (for example, execution time distribution by layer types).
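
For example, the counters can also be retrieved programmatically; this is a minimal sketch using the method mentioned above, assuming an `infer_request` that has already run.

```cpp
#include <inference_engine.hpp>
#include <iostream>

// Hypothetical sketch: print the per-primitive execution times after an inference.
void print_perf_counts(InferenceEngine::InferRequest& infer_request) {
    auto perf = infer_request.GetPerformanceCounts();   // map<layer name, profile info>
    for (const auto& entry : perf) {
        const auto& info = entry.second;
        std::cout << entry.first << " " << info.exec_type << " "
                  << info.realTime_uSec << " us" << std::endl;
    }
}
```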

## Graph dumps

intel_gpu plugin allows to dump some info about intermediate stages in graph optimizer.
*Intel_GPU* plugin allows you to dump some info about intermediate stages in the graph optimizer.

* You can dump graphs with `OV_GPU_DumpGraphs` of debug config. For the usage of debug config, please see [link](#debug-config).
* You can dump graphs with `OV_GPU_DumpGraphs` of debug config. For the usage of debug config, see the [link](#debug-config).

* Alternative, you can also enable the dumps from the application source code:
clDNN plugin has the special internal config option `graph_dumps_dir` which can be set from the user app via plugin config:
* Alternatively, you can also enable the dumps from the application source code:
clDNN plugin has the special internal config option `graph_dumps_dir`, which can be set from the user app via plugin config:
```cpp
Core ie;
std::map<std::string, std::string> device_config;
@@ -153,7 +158,7 @@ device_config[CLDNN_CONFIG_KEY(GRAPH_DUMPS_DIR)] = "/some/existing/path/";
ie.SetConfig(device_config, "GPU");
```

For each stage it dumps:
For each stage, it dumps:
```
- cldnn_program_${program_id}_${stage_id}_${stage_name}.graph - graph saved in dot format which can be visualized via graphviz tool
- cldnn_program_${program_id}_${stage_id}_${stage_name}.info - graph in text format
@@ -162,16 +167,16 @@ For each stage it dumps:
- ${program_id}_${stage_id}_${stage_name}.xml - graph in a format of execution graph
```

Main graph usually has `program_id = 0`, graphs with other `program_id` values are usually created internally for constant propagation or some other purposes.
The main graph usually has `program_id = 0`. Graphs with other `program_id` values are usually created internally for constant propagation or some other purposes.

## Sources dumps

Since intel_gpu source tree contains only *templates* of the OpenCL™ kernels, it's quite important to get full kernels source code.
Since *Intel_GPU* source tree contains only *templates* of the OpenCL™ kernels, it is quite important to get the full kernel source code.

* You can use `OV_GPU_DumpSources` of debug config. For the usage of debug config, please see [link](#debug-config).
* You can use `OV_GPU_DumpSources` of debug config. For the usage of debug config, see [link](#debug-config).

* You can also dump OpenCL source code by changing OpenVINO source code:
clDNN plugin has the special internal config option `sources_dumps_dir` which can be set from the user app via plugin config:
clDNN plugin has the special internal config option `sources_dumps_dir`, which can be set from the user app via plugin config:
```cpp
Core ie;
std::map<std::string, std::string> device_config;
@@ -184,12 +189,12 @@ When this key is enabled, the plugin dumps multiple files with the following nam
clDNN_program_${program_id}_part_${bucket_id}.cl
```

Note: `program_id` here might differ from `program_id` for the graph dumps as it's just a static counter for enumerating incoming programs.
> **Note**: `program_id` here might differ from `program_id` for the graph dumps, as it is just a static counter for enumerating incoming programs.

Each file contains a bucket of kernels that are compiled together. In case of any compilation errors, intel_gpu plugin will append compiler output
in the end of corresponding source file.
Each file contains a bucket of kernels that are compiled together. In case of any compilation errors, *Intel_GPU* plugin will append compiler output
to the end of the corresponding source file.

If you want to find some specific layer, then you'll need to use Debug/RelWithDebInfo build or modify base jitter method to append `LayerID` in release build:
To find a specific layer, use a Debug/RelWithDebInfo build or modify the base jitter method to append `LayerID` in the release build:
```cpp
// inference-engine/thirdparty/clDNN/kernel_selector/core/kernel_base.cpp
JitConstants KernelBase::MakeBaseParamsJitConstants(const base_params& params) const {
@@ -200,19 +205,19 @@ JitConstants KernelBase::MakeBaseParamsJitConstants(const base_params& params) c
}
```

When source is dumped, it actually contains huge amount of macros(`#define`). For readability, you can run c preprocessor to apply the macros.
When the source is dumped, it contains a huge number of macros (`#define`). For readability, you can run the C preprocessor to apply the macros.

`$ cpp dumped_source.cl > clean_source.cl`


## Layer in/out buffer dumps

In some cases you might want to get actual values in each layer execution to compare it with some reference blob. In order to do that we have
`OV_GPU_DumpLayersPath` option in debug config. For the usage of debug config, please see [link](#debug-config).
In some cases, you might want to get actual values in each layer execution to compare it with some reference blob. To do that, choose the
`OV_GPU_DumpLayersPath` option in debug config. For the usage of debug config, see [link](#debug-config).

As a prerequisite, enable ENABLE_DEBUG_CAPS from cmake configuration.
As a prerequisite, enable `ENABLE_DEBUG_CAPS` from the cmake configuration.

Then, check runtime layer name by executing benchmark_app with OV_GPU_Verbose=1. It is better to be checked with this than through IR because this may be slightly different. OV_GPU_Verbose=1 will show log of execution of each layer.
Then, check the runtime layer name by executing *benchmark_app* with `OV_GPU_Verbose=1`. It is better to check it with `OV_GPU_Verbose=1` than through IR, because the name may be slightly different. `OV_GPU_Verbose=1` will show the log of execution of each layer.

```
# As a prerequisite, enable ENABLE_DEBUG_CAPS from cmake configuration.
@@ -221,30 +226,31 @@ export OV_GPU_DumpLayers="layer_name_to_dump1 layer_name_to_dump2"
export OV_GPU_DumpLayersDstOnly=1 # Set as 1 when you want to dump dest buff only
```

Dump files have the following naming:
Dump files are named in the following convention:
```
${layer_name_with_underscores}_${src/dst}_${port_id}.txt
```

Each file contains single buffer in common planar format (`bfyx`, `bfzyx` or `bfwzyx`) where each value is stored on a separate line. The first line in the file constains buffer description, e.g:
Each file contains a single buffer in a common planar format (`bfyx`, `bfzyx`, or `bfwzyx`), where each value is stored on a separate line. The first line in the file contains a buffer description, for example:
```
shape: [b:1, f:1280, x:1, y:1, z:1, w:1, g:1] (count: 1280, original format: b_fs_yx_fsv16)
```

For accuracy troubleshoot, you may want to compare the GPU plugin result against CPU plugin result. For CPU dump, see [Blob dumping](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/docs/blob_dumping.md)
For troubleshooting the accuracy, you may want to compare the results of GPU plugin and CPU plugin. For CPU dump, see [Blob dumping](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/docs/blob_dumping.md).


## Run int8 model on gen9 HW
## Run int8 model on Gen9 HW

As gen9 hw doesn't have hardware acceleration, low precision transformations are disabled by default, thus quantized networks are executed in full precision (fp16 or fp32) with explicit execution of quantize operations.
If you don't have gen12 HW, but want to debug network's accuracy or performance of simple operations (which doesn't require dp4a support), then you can enable low precision pipeline on gen9 using one of the following ways:
1. Add `{PluginConfigInternalParams::KEY_LP_TRANSFORMS_MODE, PluginConfigParams::YES}` option to the plugin config
As Gen9 HW does not have hardware acceleration, low-precision transformations are disabled by default. Therefore, quantized networks are executed in full precision (FP16 or FP32), with explicit execution of quantize operations.
If you do not have Gen12 HW, but want to debug the network's accuracy or performance of simple operations (which does not require dp4a support), then you can enable the low-precision pipeline on Gen9, with one of the following approaches:
1. Add `{PluginConfigInternalParams::KEY_LP_TRANSFORMS_MODE, PluginConfigParams::YES}` option to the plugin config.
2. Enforce `supports_imad = true` [here](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/src/gpu/device_info.cpp#L226)
3. Enforce `conf.enableInt8 = true` [here](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/src/cldnn_engine/cldnn_engine.cpp#L366)

After that the plugin will run exactly the same scope of transformations as on gen12HW and generate similar kernels (small difference is possible due to different EUs count)
After that, the plugin will run exactly the same scope of transformations as on Gen12 HW and generate similar kernels (a small difference is possible due to a different EU count).

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

@@ -1,18 +1,18 @@
# GPU kernels implementation overview
# GPU Kernels Implementation Overview

As mentioned in [GPU plugin structure](./source_code_structure.md), kernels for the GPU plugin are located in the `src/plugins/intel_gpu/src/kernel_selector` folder.

For each operation we usually have multiple kernels that can support different parameters and/or optimized for different scenarios.
For each operation, there are usually multiple kernels that can support different parameters and/or are optimized for different scenarios.

Each operation has 3 major entities in kernel selector:
- Operation-specific `kernel_selector` instance
- Operation parameters descriptor
- Kernels themselves, with a set of heuristics inside for optimal selection

## Kernel selector instance
For each operation we create kernel_selector class derived from `kernel_selector_base`. Basically, this class is needed to specify available kernels
for given operation. Each kernel selector is used as singleton. For example:
## Kernel selector instance

For each operation, you create a `kernel_selector` class derived from `kernel_selector_base`. Basically, this class is needed to specify available kernels
for a given operation. Each kernel selector is used as a singleton. For example:

```cpp
class mvn_kernel_selector : public kernel_selector_base {
@@ -57,7 +57,7 @@ auto best_kernels = kernel_selector.GetBestKernels(mvn_params, mvn_optional_para

## Operation parameters

The parameters of operation for kernel_selector are defined in corresponding `${op_name}_params` class which is derived from `base_params`. For example:
The parameters of operation for `kernel_selector` are defined in the corresponding `${op_name}_params` class, which is derived from `base_params`. For example:
```cpp
struct mvn_params : public base_params {
mvn_params() : base_params(KernelType::MVN) {}
@@ -79,9 +79,9 @@ struct mvn_params : public base_params {
};
```

The derived class should parameterize base class with specific `KernelType` and add operation-specific parameters. The only method that must be implemented
is `GetParamsKey()` which is used as a quick check for kernels applicability for current parameters, i.e. we take `ParamsKey` object calculated for input
operation parameters and `ParamsKey` object for each kernel, so we can compare them and discard the kernels that don't support current parameters.
The derived class should parameterize the base class with a specific `KernelType` and add operation-specific parameters. The only method that must be implemented
is `GetParamsKey()`, which is used as a quick check of kernel applicability for the current parameters. In other words, you take a `ParamsKey` object calculated for the input
operation parameters and a `ParamsKey` object for each kernel. Then, you can compare them and discard the kernels that do not support the current parameters.
`ParamsKey` is implemented as a set of bit masks, so the applicability check is quite simple:
```cpp
const ParamsKey implKey = some_implementation->GetSupportedKey();
@@ -97,15 +97,15 @@ if (!((implKey.mask & paramsKey.mask) == paramsKey.mask))

Each kernel must specify the following things:
- Input parameters checks
- `GetSupportedKey()` method implementation which returns `ParamsKey` object for current implementation
- `Validate()` method that do more complex checks (optional)
- Dispatch data (global/local workgroup sizes, scheduling algorithm, etc)
- `GetSupportedKey()` method implementation, which returns `ParamsKey` object for current implementation.
- `Validate()` method, that does more complex checks (optional).
- Dispatch data (global/local workgroup sizes, scheduling algorithm, etc.)
- Kernel name - must be passed to the base class c-tor
- Kernel arguments specification - description of each argument in the corresponding OpenCL™ kernel
- Additional JIT constants required for kernel - set of macro definitions that must be added to thi kernel template to make full specialization for given params
- Supported fused operations (if any) - a list of supported operations that can be fused into current kernel
- Additional JIT constants required for kernel - set of macro definitions that must be added to the kernel template to make full specialization for given params
- Supported fused operations (if any) - a list of supported operations that can be fused into the current kernel.

Let's have a look at the key methods of each kernel implementation:
Key methods of each kernel implementation are as follows:

```cpp
class MVNKernelRef : public MVNKernelBase {
@@ -132,6 +132,7 @@ protected:
```

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

@@ -1,4 +1,4 @@
# GPU memory formats
# GPU Memory Formats

The memory format descriptor in GPU plugin usually uses the following letters:
- `b` - batch
@@ -8,9 +8,9 @@ The memory format descriptor in GPU plugin usually uses the following letters:
- `o` - output channels (for weights layout only)
- `g` - groups (for weights layout only)

The combination of the characters above defines tensor format, i.e. the actual layout of tensor values in memory buffer. For example:
The combination of the characters above defines the tensor format, that is, the actual layout of tensor values in a memory buffer. For example:
`bfyx` format means that the tensor has 4 dimensions in planar layout and `x` coordinate changes faster than `y`, `y` - faster than `f`, and so on.
It means that for tensor with size `[b: 2; f: 2; y: 2; x: 2]` we have a linear memory buffer with `size=16` where:
It means that for a tensor with size `[b: 2; f: 2; y: 2; x: 2]`, there is a linear memory buffer with `size=16`, where:
```
i = 0 => [b=0; f=0; y=0; x=0];
i = 1 => [b=0; f=0; y=0; x=1];
@@ -37,19 +37,19 @@ i = 14 => [b=1; f=1; y=1; x=0];
i = 15 => [b=1; f=1; y=1; x=1];
```
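To make the mapping concrete, the linear offset for a planar `bfyx` buffer can be computed with simple strided arithmetic. The helper below is an illustrative sketch, not a function from the plugin sources:

```cpp
#include <cstddef>

// Hypothetical helper: linear offset of element [b, f, y, x] in a planar
// bfyx buffer of shape [B, F, Y, X]; x changes fastest, b changes slowest.
inline std::size_t offset_bfyx(std::size_t b, std::size_t f, std::size_t y, std::size_t x,
                               std::size_t F, std::size_t Y, std::size_t X) {
    return ((b * F + f) * Y + y) * X + x;
}

// For the [b: 2; f: 2; y: 2; x: 2] tensor above:
// offset_bfyx(0, 0, 0, 1, 2, 2, 2) == 1 and offset_bfyx(1, 1, 1, 1, 2, 2, 2) == 15.
```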
Usually, planar memory formats are not very efficient for DNN operations, so GPU plugin has plenty *blocked* format. Blocking means that we take some tensor dimension
and put blocks of adjacent elements closer in memory (in the format with single blocking they are stored linearly in the memory). Consider the most widely used
blocked format in GPU plugin: `b_fs_yx_fsv16`. First of all, let's understand what these additional letters mean. We have `b`, `f`, `y`, `x` dimensions here, so
this is 4D tensor.
Usually, planar memory formats are not very efficient for DNN operations, so GPU plugin has plenty of *blocked* formats. Blocking means that you take some tensor dimension
and put blocks of adjacent elements closer in memory (in the format with a single blocking, they are stored linearly in the memory). Consider the most widely used
blocked format in GPU plugin: `b_fs_yx_fsv16`. First of all, let's understand what these additional letters mean. There are `b`, `f`, `y`, `x` dimensions here, so
this is a 4D tensor.
`fs=CeilDiv(f, block_size)`; `fs` means `feature slice` - the blocked dimension.
The block size is specified in the format name: `fsv16` - `block_size = 16`, blocked dimension is `f`; `fsv` means `feature slice vector`
The block size is specified in the format name: `fsv16` - `block_size = 16`, the blocked dimension is `f`; `fsv` means `feature slice vector`.
Just like with any other layout, the coordinate of the rightmost dimension (`fsv`) is changed first, then coordinate to the left (`x`), and so on.

Note: if the original `f` dimension is not divisible by block size (16 in this case), then it's aligned up to the first divisible value. These pad values
> **Note**: If the original `f` dimension is not divisible by block size (`16` in this case), then it is aligned up to the first divisible value. These pad values
are filled with zeroes.

Let's look at the changes with the tensor above if we reorder it into `b_fs_yx_fsv16` format:
1. Actual buffer size becomes `[b: 2; f: 16; y: 2; x: 2]`, and total size = 128
When you reorder the tensor above into `b_fs_yx_fsv16` format, changes are as follows:
1. Actual buffer size becomes `[b: 2; f: 16; y: 2; x: 2]`, and total size equals 128.
2. The order of elements in memory changes:
```
// first batch
@@ -106,6 +106,7 @@ i = 127 => [b=1; f=15; y=1; x=1] == [b=1; fs=0; y=1; x=1; fsv=15];
All formats used by GPU plugin are specified in `src/plugins/intel_gpu/include/intel_gpu/runtime/format.hpp` file. Most of the formats there follow the notation above.
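The same arithmetic extends to the blocked case. The sketch below is again an illustrative helper, not plugin code; it computes the offset in a `b_fs_yx_fsv16` buffer and reproduces the `i = 127` entry from the dump above:

```cpp
#include <cstddef>

// Hypothetical helper: linear offset of element [b, f, y, x] in a
// b_fs_yx_fsv16 buffer; the f dimension is blocked with block size 16.
inline std::size_t offset_b_fs_yx_fsv16(std::size_t b, std::size_t f, std::size_t y, std::size_t x,
                                        std::size_t F, std::size_t Y, std::size_t X) {
    const std::size_t fsv_size = 16;
    const std::size_t fs_count = (F + fsv_size - 1) / fsv_size; // fs dimension: CeilDiv(F, 16)
    const std::size_t fs  = f / fsv_size;                       // feature slice index
    const std::size_t fsv = f % fsv_size;                       // index inside the slice
    return (((b * fs_count + fs) * Y + y) * X + x) * fsv_size + fsv;
}

// For the tensor above (f aligned up from 2 to 16, total buffer size 128):
// offset_b_fs_yx_fsv16(1, 15, 1, 1, 2, 2, 2) == 127, matching i = 127 in the dump.
```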
## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

@@ -1,8 +1,8 @@
# Driver issues troubleshooting
# Driver Issues Troubleshooting

If you see errors like "[CLDNN ERROR]. clGetPlatformIDs error -1001" when running OpenVINO samples / demos, then most likely you have some issues with OpenCL runtime on your machine. This document contains several hints on what to check and how to troubleshoot such kind of issues.
If you see errors like `[CLDNN ERROR]. clGetPlatformIDs error -1001` when running OpenVINO samples / demos, then most likely you have some issues with OpenCL runtime on your machine. This document contains several hints on what to check and how to troubleshoot such issues.

In order to make sure that OpenCL runtime is functional on your machine, you can use [clinfo](https://github.com/Oblomov/clinfo) tool. On many linux distributives it can be installed via package manager. If it's not available for your system, it can be easily built from sources.
To make sure that OpenCL runtime is functional on your machine, you can use the [clinfo](https://github.com/Oblomov/clinfo) tool. On many Linux distributions it can be installed via the package manager. If it is not available for your system, it can be easily built from sources.

Example of clinfo output:
```
@@ -23,26 +23,30 @@ Number of devices 1
Device Type GPU
```
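If `clinfo` is not available, the same basic check can be reproduced with a few lines of OpenCL host code. The snippet below is a minimal sketch that relies only on the standard `clGetPlatformIDs` call (link with `-lOpenCL`):

```cpp
#include <CL/cl.h>
#include <cstdio>

// Minimal probe: checks that the OpenCL ICD loader and at least one platform
// are present. An error such as -1001 (CL_PLATFORM_NOT_FOUND_KHR) here points
// to a missing or broken OpenCL runtime rather than an OpenVINO problem.
int main() {
    cl_uint num_platforms = 0;
    cl_int err = clGetPlatformIDs(0, nullptr, &num_platforms);
    if (err != CL_SUCCESS || num_platforms == 0) {
        std::printf("clGetPlatformIDs failed: error %d, platforms %u\n",
                    static_cast<int>(err), num_platforms);
        return 1;
    }
    std::printf("Found %u OpenCL platform(s)\n", num_platforms);
    return 0;
}
```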
## 1. Make sure that you have a GPU on your system

Some Intel® CPUs might not have an integrated GPU, so if you want to run OpenVINO on iGPU, go to [ark.intel website](https://ark.intel.com/) and make sure that your CPU has it.

## 2. Make sure that OpenCL® Runtime is installed
On Windows OpenCL runtime is a part of the GPU driver, but on linux it should be installed separately. For the installation tips please refer to [OpenVINO docs](https://docs.openvino.ai/latest/openvino_docs_install_guides_installing_openvino_linux_header.html) and [OpenCL Compute Runtime docs](https://github.com/intel/compute-runtime/tree/master/opencl/doc).
To get support of Intel® Iris® Xe MAX Graphics with Linux please follow [driver installation guide](https://dgpu-docs.intel.com/devices/iris-xe-max-graphics/index.html)

OpenCL runtime is a part of the GPU driver on Windows, but on Linux it should be installed separately. For the installation tips, refer to [OpenVINO docs](https://docs.openvino.ai/latest/openvino_docs_install_guides_installing_openvino_linux_header.html) and [OpenCL Compute Runtime docs](https://github.com/intel/compute-runtime/tree/master/opencl/doc).
To get support for Intel® Iris® Xe MAX Graphics with Linux, follow the [driver installation guide](https://dgpu-docs.intel.com/devices/iris-xe-max-graphics/index.html).

## 3. Make sure that the user has all required permissions to work with the GPU device

Add the current Linux user to the `video` group:
```
sudo usermod -a -G video "$(whoami)"
```

## 4. Make sure that iGPU is enabled

```
$ cat /sys/devices/pci0000\:00/0000\:00\:02.0/enable
1
```

## 5. Make sure that "/etc/OpenCL/vendors/intel.icd" contain proper paths to the OpenCL driver
## 5. Make sure that "/etc/OpenCL/vendors/intel.icd" contains proper paths to the OpenCL driver

```
$ cat /etc/OpenCL/vendors/intel.icd
/usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
@@ -50,12 +54,15 @@ $ cat /etc/OpenCL/vendors/intel.icd
> **Note**: The path to the runtime lib may vary in different driver versions.

## 6. Use LD_DEBUG=libs to trace loaded libraries

For more details, see [OpenCL on Linux](https://github.com/bashbaug/OpenCLPapers/blob/markdown/OpenCLOnLinux.md).

## 7. If you are using dGPU with XMX, ensure that HW_MATMUL feature is recognized
Openvino contains hello_query_device sample application: [link](https://docs.openvino.ai/latest/openvino_inference_engine_ie_bridges_python_sample_hello_query_device_README.html)

OpenVINO contains the *hello_query_device* sample application: [link](https://docs.openvino.ai/latest/openvino_inference_engine_ie_bridges_python_sample_hello_query_device_README.html)

With this sample, you can check whether the Intel XMX (Xe Matrix Extension) feature is properly recognized or not. This is a hardware feature to accelerate matrix operations and is available on some discrete GPUs.

```
$ ./hello_query_device.py
...
@@ -68,9 +75,9 @@ install them from [OpenCL Git](https://github.com/KhronosGroup/OpenCL-Headers)

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)
* [OpenVINO GPU Plugin](../README.md)
* [Developer documentation](../../../../docs/dev/index.md)
@@ -1,29 +1,29 @@
# GPU plugin operations enabling flow
# GPU Plugin Operations Enabling Flow

## Terminology

* **NGraph operation**: Building block of neural networks, such as convolution or pooling.
* **(clDNN) Primitive**: Basic NN operation that was defined in clDNN. One primitive is usually mapped to one ngraph operation, but graph compilation may cause the mapping not to be 1-to-1.
* **Kernel**: Actual body of execution in GPU. It also refers to specific implementations of **Primitive** for GPU, such as `convolution_gpu_winograd_2x3_s1.cl`. Usually, single kernel fulfills the operation of single primitive, but several kernels may be used to support one primitive.
* **Unittest**: Single-layer test within cldnn.
* **Kernel**: Actual body of execution in GPU. It also refers to specific implementations of **Primitive** for GPU, such as `convolution_gpu_winograd_2x3_s1.cl`. Usually, a single kernel fulfills the operation of a single primitive, but several kernels may be used to support one primitive.
* **Unittest**: Single-layer test within clDNN.
* **Functional test**: Single-layer test in IE.

<br>

## Adding new primitive

1. Understand the new operation.
* Review the [ngraph operation spec](https://github.com/openvinotoolkit/openvino/tree/master/docs/ops)
* IE operations (a.k.a. primitive or NN-layer) are defined by ngraph.
* You can check ngraph reference implementation of the primitive as well
* e.g. [Scatter Elements Update in nGraph](https://github.com/openvinotoolkit/openvino/blob/master/src/core/reference/include/ngraph/runtime/reference/scatter_elements_update.hpp)
* For example, [Scatter Elements Update in nGraph](https://github.com/openvinotoolkit/openvino/blob/master/src/core/reference/include/ngraph/runtime/reference/scatter_elements_update.hpp)

1. Try to find existing primitive that fully or partially covers this operation.
* It is also possible to transform the network so that the missing primitive is covered by an existing primitive.
* e.g. [Replace reduce with pooling](https://github.com/openvinotoolkit/openvino/blob/23808f46f7b5d464fd649ad278f253eec12721b3/inference-engine/src/cldnn_engine/cldnn_engine.cpp#L205)
* For example, [replace reduce with pooling](https://github.com/openvinotoolkit/openvino/blob/23808f46f7b5d464fd649ad278f253eec12721b3/inference-engine/src/cldnn_engine/cldnn_engine.cpp#L205).

1. Add new / extend existing clDNN primitive according to the operation spec.
1. This phase is to enable primitive within clDNN library, without exposing it to IE.
1. Implement **reference parallel kernel** that supports all parameters of the operation and all input/output data types and layouts.

1. Add new / extend existing cldnn primitive according to the operation spec.
1. This phase is to enable primitive within cldnn library, without exposing it to IE.
1. Implement **reference parallel kernel** that supports all parameters of the operation and all input/output data types and layouts

| File | Description |
|------|-------------|
| [scatter_elements_update_ref.cl](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/kernel_selector/cl_kernels/scatter_elements_update_ref.cl) | OpenCL Kernel body. For more detail, please see [How to write OCL kernel](#writing-ocl-kernel) section |
@@ -31,18 +31,18 @@
| [scatter_elements_update_kernel_selector.(cpp,h)](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/kernel_selector/kernels/scatter_update/scatter_elements_update_kernel_selector.cpp) | Kernel selector for a primitive |
| [register_gpu.(cpp,hpp)](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/src/gpu/register_gpu.cpp) | Primitive registration |
| [scatter_elements_update_gpu.cpp](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/src/gpu/scatter_elements_update_gpu.cpp) | Primitive registration, input spec |
| [scatter_elements_update_inst.h](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/graph/include/scatter_elements_update_inst.h) | Node type declaration for cldnn program |
| [scatter_elements_update_inst.h](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/graph/include/scatter_elements_update_inst.h) | Node type declaration for clDNN program |
| [clDNN/src/scatter_elements_update.cpp](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/graph/scatter_elements_update.cpp) | Code for scatter_elements_update_inst.h |
| [clDNN/api/cldnn/primitives/scatter_elements_update.hpp](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/include/intel_gpu/primitives/scatter_elements_update.hpp) | clDNN primitive definition |
| [common_types.h](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/kernel_selector/common_types.h) | Enum declaration for KernelType and arguments |

1. Add unit tests for the new operation
1. Add unit tests for the new operation.

| File | Description |
|------|-------------|
| [scatter_elements_update_gpu_test.cpp](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/scatter_elements_update_gpu_test.cpp) | Unittest for layer |

* Need to add reference code or expected result for checking the result.
* You need to add reference code or expected result for checking the result.

* You can also specify the kernel with `force_implementations` in case the primitive contains multiple kernels. A usage sketch follows the snippet below.
```
@@ -54,31 +54,31 @@
...
```
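For reference, forcing an implementation in a unit test typically looks like the following sketch. The node name `conv_prim` and the chosen format are illustrative, and an empty kernel name lets the selector pick any kernel with that format:

```cpp
// Sketch: force the "conv_prim" node to use a b_fs_yx_fsv16 implementation
// when building the network in a unit test (names are illustrative).
build_options bo;
bo.set_option(build_option::optimize_data(true));
implementation_desc conv_impl = { format::b_fs_yx_fsv16, "" };
bo.set_option(build_option::force_implementations({ { "conv_prim", conv_impl } }));
network net(engine, topology, bo);
```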
* This unit test is built into `clDNN_unit_tests`. It is a gtest application.
* This unit test is built into `clDNN_unit_tests`. It is a `gtest` application.
```
# Show list of test cases
openvino/bin/intel64/Debug$ ./clDNN_unit_tests64 --gtest_list_tests
# Run test
openvino/bin/intel64/Debug$ ./clDNN_unit_tests64 --gtest_filter=scatter_elements_update_gpu_fp16.*
```

* Test scope needs to be comprehensive, but not wasteful. These tests run for every PRs in CI. Let's save the planet.

* Test scope needs to be comprehensive, but not wasteful. These tests run for every PR in CI. Let's save the planet.

1. Support layer fusion, if applicable
* It is usually easy to fuse some layers, such as scale, activation, quantize and eltwise, into previous layer. This fusing rule can be added to `prepare_primitive_fusing::fuse_simple_primitives`.
* It is usually easy to fuse some layers, such as *scale*, *activation*, *quantize*, and *eltwise*, into the previous layer. This fusing rule can be added to `prepare_primitive_fusing::fuse_simple_primitives`.
* `fuse_simple_primitives` is called during [graph compilation phase](https://github.com/openvinotoolkit/openvino/blob/71c50c224964bf8c24378d16f015d74e2c1e1ce8/inference-engine/thirdparty/clDNN/src/program.cpp#L430)
* You can see general description of layer fusion [here](https://docs.openvinotoolkit.org/latest/openvino_docs_IE_DG_supported_plugins_CL_DNN.html#optimizations)
* See the general description of layer fusion [here](https://docs.openvinotoolkit.org/latest/openvino_docs_IE_DG_supported_plugins_CL_DNN.html#optimizations)
* Unit tests for layer fusion are placed in a single file: [fusings_gpu_test.cpp](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/tests/test_cases/fusings_gpu_test.cpp). It is also compiled into `clDNN_unit_tests`.
* Code for fused layers is generated with `jitter`. It is created as `FUSED_OPS..` macro in OCL code. This generation logic is in `KernelBase::MakeFusedOpsJitConstants`.

1. Add / update factory for this operation in the GPU plugin to use new primitive in inference-engine
1. Add / update factory for this operation in the GPU plugin to use new primitive in inference-engine.

| File | Description |
|------|-------------|
| [cldnn_engine/ops/scatter_elements_update.cpp](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/src/cldnn_engine/ops/scatter_elements_update.cpp) | Instantiation from cldnn plugin for IE |
| [cldnn_engine/ops/scatter_elements_update.cpp](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/src/cldnn_engine/ops/scatter_elements_update.cpp) | Instantiation from clDNN plugin for IE |
| [cldnn_primitives_list.hpp](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/src/cldnn_engine/cldnn_primitives_list.hpp) | Registration for primitives |

1. Add functional single layer tests for the operation and try to cover most of the difference use cases of this operation
1. Add functional single-layer tests for the operation and try to cover most of the different use cases of this operation.

| File | Description |
|------|-------------|
@@ -86,32 +86,31 @@

* It is possible to use ngraph reference code for result validation.
* This is compiled into `gpuFuncTests`. It is also a `gtest` application.
* Please also review the [general guideline of test infrastructure](https://github.com/openvinotoolkit/openvino/wiki/InferenceEngineTestsInfrastructure)
* Also, review the [general guideline of test infrastructure](https://github.com/openvinotoolkit/openvino/blob/master/docs/IE_PLUGIN_DG/PluginTesting.md).

1. [Optional] If there are existing IRs with this operation, try to run the full model(s) to be sure that it's correctly processed within the context
1. [Optional] If there are existing IRs with this operation, try to run the full model(s) to be sure that it is correctly processed within the context.

1. [Optional] If there are existing IRs with this operation, try to run the full model(s) and estimate performance impact from this operation on total model execution time
1. [Optional] If there are existing IRs with this operation, try to run the full model(s) and estimate performance impact from this operation on total model execution time.

1. Create PR with your changes
1. Create a PR with your changes.
* If you are an `OpenVINO` group member on GitHub, CI will be triggered.
* Please review the [OpenVINO contribution guide](https://github.com/openvinotoolkit/openvino/blob/master/CONTRIBUTING.md).

<br>
* Review the [OpenVINO contribution guide](https://github.com/openvinotoolkit/openvino/blob/master/CONTRIBUTING.md).

## Adding new kernel for an existing primitive
* The process is quite similar to previous one. You can skip already existing steps.
* Main work is adding new kernel and registering it from kernel selector.
* You may need to add unit test for that new kernel. Specific kernel can be chosen with `build_option::force_implementations`.
* It is not possible to specify kernel from functional test(IE).

<br>
* The process is quite similar to the previous one. You can skip already existing steps.
* Main work is adding a new kernel and registering it from the kernel selector.
* You may need to add a unit test for that new kernel. A specific kernel can be chosen with `build_option::force_implementations`.
* It is not possible to specify a kernel from a functional test (IE).

## Writing OCL kernel

### Jitter
In GPU OCL kernels, many conditional statements are processed with `#ifdef` so that it can be handled during compile-time. The definitions are created with `jitter.cpp`. It is set during graph compilation. You can see generated macros following the steps in [source dumps](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/gpu_debug_utils.md#sources-dumps).

In GPU OCL kernels, many conditional statements are processed with `#ifdef` so that they can be handled during compile-time. The definitions are created with `jitter.cpp`. It is set during graph compilation. You can see the generated macros by following the steps in [source dumps](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/gpu_debug_utils.md#sources-dumps).

Jitter also contains run-time parameters such as input and output size.
Additional macros can be defined from host-code of kernel itself. For example, see below code snippet. It passes `SUB_GROUP_SIZE` through macro definition through jitter.
Additional macros can be defined from the host-code of a kernel itself. For example, see the code snippet below. It passes `SUB_GROUP_SIZE` as a macro definition through jitter.
```
// GetJitConstants method of the kernel
const size_t sub_group_size = 16;
@@ -120,17 +119,22 @@ Additional macros can be defined from host-code of kernel itself. For example, s
```
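The elided part of the snippet above typically just adds the constant to the returned `JitConstants` set. A hedged sketch, assuming the `MakeJitConstant` helper from `jitter.h`, could look like this:

```cpp
// Sketch: pass SUB_GROUP_SIZE to the OCL kernel as a JIT macro definition.
JitConstants jit = MakeBaseParamsJitConstants(params);
const size_t sub_group_size = 16;
jit.AddConstant(MakeJitConstant("SUB_GROUP_SIZE", sub_group_size));
```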
### Accessing input and output tensor
Jitter generates macros for index calculations. With these macros, you can program ocl kernel in a layout-agnostic way. If you use the macro `${TENSOR_NAME}_GET_INDEX`, you can get 1d-index from tensor coordinate whether the format is planar(such as `bfyx` or `byxf`) or blocked.(such as `b_fs_yx_fsv16`). You can check [source code for GET_INDEX macro](https://github.com/openvinotoolkit/openvino/blob/7f8d3aa63899a3e3362c95eb7d1b04a5899660bd/inference-engine/thirdparty/clDNN/kernel_selector/core/common/jitter.cpp#L313).

Jitter generates macros for index calculations. With these macros, you can program an OCL kernel in a layout-agnostic way. If you use the macro `${TENSOR_NAME}_GET_INDEX`, you can get a 1d-index from a tensor coordinate whether the format is planar (such as `bfyx` or `byxf`) or blocked (such as `b_fs_yx_fsv16`). You can check [source code for GET_INDEX macro](https://github.com/openvinotoolkit/openvino/blob/7f8d3aa63899a3e3362c95eb7d1b04a5899660bd/inference-engine/thirdparty/clDNN/kernel_selector/core/common/jitter.cpp#L313).
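In kernel code, this usually reduces to computing the flat index once and reusing it for loads and stores. The fragment below is a sketch of an element-wise kernel body: the `INPUT0_GET_INDEX` / `OUTPUT_GET_INDEX` names follow the `${TENSOR_NAME}_GET_INDEX` convention, and the dispatch layout (y and x flattened into the third work dimension) is an assumption:

```
// Sketch: layout-agnostic copy of one element per work item.
const uint b = get_global_id(0);
const uint f = get_global_id(1);
const uint y = get_global_id(2) / OUTPUT_SIZE_X; // assumed dispatch: y*x flattened
const uint x = get_global_id(2) % OUTPUT_SIZE_X;
const uint in_idx  = INPUT0_GET_INDEX(b, f, y, x);
const uint out_idx = OUTPUT_GET_INDEX(b, f, y, x);
output[out_idx] = input[in_idx];
```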
### Layout support

If a kernel is not performance-critical, you can support `bfyx`, `bfzyx` and `bfwzyx` only for layout. Those are default layouts. As an optimized format, `b_fs_yx_fsv16`, `b_fs_yx_fsv4` or `byxf` can be used as well.
[General description of layout can be found here](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/gpu_memory_formats.md) and [header file is here](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/api/tensor.hpp)

[General description of layout can be found here](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/docs/gpu_memory_formats.md) and [header file is here](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/api/tensor.hpp).

### Layer fusion

When layers are fused, `jitter` will create macros to generate code for fused layers. It is realized into `FUSED_OPS..` in OCL kernel. You can understand the usage from other kernels.
There is a [comment that describes layer fusion](https://github.com/openvinotoolkit/openvino/blob/7f8d3aa63899a3e3362c95eb7d1b04a5899660bd/inference-engine/thirdparty/clDNN/kernel_selector/core/kernel_selector_params.h#L521).

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)
@@ -1,14 +1,14 @@
# GPU plugin unit test
# GPU Plugin Unit Test

GPU plugin has two type tests: first one is functional tests and second one is unit tests.
GPU plugin has two types of tests: functional and unit tests. This article is about the latter.

- The functional test is testing single layer, behavior, sub graph and low precision transformation on inference engine level for various layout and data types such as fp16 and fp32.
- The unit test is testing cldnn primitive and core type modules on GPU plugin level. Unlike functional test, it is possible to test by explicitly specifying the format of the input such as `bfyx` or `b_fs_yx_fsv16`. This documentation is about this type of test.
- The functional test is testing a single layer, behavior, subgraph and low-precision transformation on inference engine level for various layouts and data types, such as FP16 and FP32.
- The unit test is testing clDNN primitive and core-type modules on GPU plugin level. Unlike the functional test, it is possible to test by explicitly specifying the format of the input, such as `bfyx` or `b_fs_yx_fsv16`.

# Structure of unit test
# Structure of a unit test

Intel GPU unit test (aka clDNN unit test) is a set of unit tests each of which is for testing all primitives, fusions and fundamental core types of GPU plugin.
There are 4 sub categories of unit tests as below.
Intel GPU unit test (aka clDNN unit test) is a set of unit tests, each of which is for testing all primitives, fusions, and fundamental core types of GPU plugin.
There are four subcategories of unit tests as below.

```bash
openvino/src/plugins/intel_gpu/tests - root of Intel GPU unit test
@@ -19,42 +19,45 @@ openvino/src/plugins/intel_gpu/tests - root of Intel GPU unit test
```
- ### fusions
- Fusion is an algorithm that fuse several operations into one optimized operation. For example, two nodes of `conv -> relu` may be fused into single node of `conv`.

- Fusion is an algorithm that fuses several operations into one optimized operation. For example, two nodes of `conv -> relu` may be fused into a single node of `conv`.
- Fusion unit tests check whether the fusion is done as expected.
- fusion_test_common.cpp
- The base class for fusing test, i.e., [BaseFusingTest](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/fusions/fusion_test_common.hpp#L19), is implemented here. It tests whether the fusing is successful or not by comparing the execution results of the two networks, one is the fused network, the other is non fused network for same topology.
- [BaseFusingTest](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/fusions/fusion_test_common.hpp#L19) has an important method called *`compare()`*.
- *`compare()`* method has the following three tasks
- The base class for a fusing test, that is, [BaseFusingTest](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/fusions/fusion_test_common.hpp#L19), is implemented here. It tests whether the fusing is successful or not by comparing the execution results of the two networks: one is the fused network, the other is a non-fused network for the same topology.
- [BaseFusingTest](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/fusions/fusion_test_common.hpp#L19) has an important method called `compare()`.
- The `compare()` method has the following three tasks:
- Execute two networks (fused network and not fused network)
- Compare the actual number of executed primitives with the expected number of executed primitives in test params
- Compare the actual number of executed primitives with the expected number of executed primitives in test params
- Compare the results between fused network and non fused network
- eltwise_fusing_test.cpp
- Check whether or not eltwise is fused to other primitives as expected
- Checks whether or not *eltwise* is fused to other primitives as expected
- [primitive_name]_fusion_test.cpp
- Check that nodes such as eltwise or activation are fusing to the [primitive_name] as expected
- Checks that nodes such as *eltwise* or *activation* are fusing to the [primitive_name] as expected
- Details on how to add each instance are described [below](#fusions-1).

- ### test_cases
- It is mainly checking that cldnn primitives and topology creation are working as designed
- It also checks configurations for OpenCL functionalities such as cl_cache, cl_mem allocation and cl_command_queue modes

- ### module_tests
- Unit tests for fundamental core modules such as ocl_user_events, format, layout, and usm memory
- Check ocl_user_event is working as expected
- Check all format is converted to the string and trait
- Check various layouts are created as expected
- Check usm_host and usm device memory buffer creation and read/write functionality
- It is mainly checking whether clDNN primitives and topology creation are working as designed.
- It also checks configurations for OpenCL functionalities such as *cl_cache*, *cl_mem allocation* and *cl_command_queue* modes

- ### module_tests

- Unit tests for fundamental core modules such as `ocl_user_events`, format, layout, and USM memory:
- check whether `ocl_user_event` is working as expected,
- check whether all format is converted to the string and trait,
- check whether various layouts are created as expected,
- check `usm_host` and USM device memory buffer creation and read/write functionality.

- ### test_utils
- Defined base functions of unit test such as *`get_test_engine()`* which returns `cldnn::engine`
- Utility functions such as Float16, random_gen and uniform_quantized_real_distribution

- Define base functions of a unit test, such as `get_test_engine()`, which returns `cldnn::engine`
- Utility functions such as `Float16`, `random_gen` and `uniform_quantized_real_distribution`
# How to run unit tests

## Build unit test

1. Turn on `ENABLE_TESTS` and `ENABLE_CLDNN_TESTS` in cmake option
1. Turn on `ENABLE_TESTS` and `ENABLE_CLDNN_TESTS` in cmake option:

```bash
cmake -DCMAKE_BUILD_TYPE=Release \
@@ -69,21 +72,19 @@
make clDNN_unit_tests
```

3. You can find _`clDNN_unit_tests64`_ in bin directory after build

3. You can find `clDNN_unit_tests64` in the *bin* directory after build

## Run unit test

You can run _`clDNN_unit_tests64`_ in bin directory which is the output of openvino build
You can run _`clDNN_unit_tests64`_ in the *bin* directory, which is the output of the OpenVINO build

If you want to run specific unit test, you can use gtest_filter option as follows:
If you want to run a specific unit test, you can use the `gtest_filter` option as follows:

```
./clDNN_unit_tests64 --gtest_filter='*filter_name*'
```

Then, you can get the result like this
Then, you can get a result similar to:

```bash
openvino/bin/intel64/Release$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD
@@ -101,34 +102,33 @@ Note: Google Test filter = *fusings_gpu/conv_fp32_reorder_fsv16_to_bfyx.basic/0*
[ PASSED ] 1 test.
```

# How to create new test case

## TEST and TEST_P (GoogleTest macros)

GPU unit tests are using 2 types of test macros(**TEST** and **TEST_P**) in [GoogleTest (aka gtest)](https://google.github.io/googletest/)
GPU unit tests are using two types of test macros (**TEST** and **TEST_P**) in [GoogleTest (aka gtest)](https://google.github.io/googletest/)

- ### **TEST**
- **TEST** is the simple test case macro.
- To make test-case using **TEST**, define an individual test named *`TestName`* in the test suite *`TestSuiteName`*
- **TEST** is a simple test case macro.
- To make a test-case using **TEST**, define an individual test named `TestName` in the test suite `TestSuiteName`

```
TEST(TestSuiteName, TestName) {
  ... test body ...
}
```
- The test body can be any code under test. To determine the outcomes within the test body, use assertion such as *`EXPECT_EQ`* and *`ASSERT_NE`*.

- The test body can be any code under test. To determine the outcome within the test body, use assertion types, such as `EXPECT_EQ` and `ASSERT_NE`. A minimal example is shown below.
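Putting this together with the helpers from `test_utils`, a minimal **TEST** case might look like the following sketch (the test and suite names are illustrative; `get_test_engine()` and `allocate_memory` are the helpers described in this article):

```c++
// Sketch: a minimal TEST that allocates a bfyx buffer and checks its layout.
TEST(memory_basic, allocate_bfyx_buffer) {
    auto& engine = get_test_engine();  // returns cldnn::engine
    auto input = engine.allocate_memory({ data_types::f32, format::bfyx, { 1, 1, 5, 4 } });
    ASSERT_NE(input, nullptr);
    EXPECT_EQ(input->get_layout().format, format::bfyx);
}
```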
- ### **TEST_P**
- **TEST_P** is used to set test case using test parameter sets
- To make test-case using **TEST_P**, define an individual value-parameterized test named *`TestName`* that uses the test fixture class *`TestFixtureName`* which is the test suite name
- **TEST_P** is used to set a test case using test parameter sets
- To make a test case using **TEST_P**, define an individual value-parameterized test named `TestName` that uses the test fixture class `TestFixtureName`, which is the test suite name:

```
TEST_P(TestFixtureName, TestName) {
  ... statements ...
}
```
- Then, instantiates the value-parameterized test suite *`TestSuiteName`* which is defined defined with **TEST_P**
- Then, instantiate the value-parameterized test suite `TestSuiteName`, which is defined with **TEST_P**
```c++
INSTANTIATE_TEST_SUITE_P(InstantiationName, TestSuiteName, param_generator)
```
@@ -136,29 +136,28 @@ GPU unit tests are using 2 types of test macros(**TEST** and **TEST_P**) in [G

## module_test and test_cases

- module_test and test_cases are testing GPU plugin using both **TEST_P** and **TEST**.
- Please refer to [the fusion test](#fusions-1) for the test case based on **TEST_P**
- *module_test* and *test_cases* are testing GPU plugin using both **TEST_P** and **TEST**.
- Refer to [the fusion test](#fusions-1) for the test case based on **TEST_P**
- **TEST** checks the test result by comparing the execution results with expected values after running a network created from the target topology.
- It is important to generate test input and expected output result in **TEST**
- You can create input data and expected output data using the 3 following ways:
- Generate simple input data and calculate the expected output data from input data manually like [basic_deformable_convolution_def_group1_2](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/convolution_gpu_test.cpp#L254)
- Generate random input and get the expected output using reference function which is made in the test codes like [mvn_test_across_channels_outside_sqrt_bfyx](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/mvn_gpu_test.cpp#L108)
- Generate random input and get the expected output from another reference kernel which is existed in cldnn kernels like [mvn_random_test_bsv32](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/mvn_gpu_test.cpp#L793)
- You can create input data and expected output data using these three approaches:
- Generate simple input data and calculate the expected output data from input data manually, like [basic_deformable_convolution_def_group1_2](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/convolution_gpu_test.cpp#L254)
- Generate random input and get the expected output using a reference function implemented in the test code, like [mvn_test_across_channels_outside_sqrt_bfyx](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/mvn_gpu_test.cpp#L108)
- Generate random input and get the expected output from another reference kernel which exists in clDNN kernels, like [mvn_random_test_bsv32](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/mvn_gpu_test.cpp#L793)

- When you allocate input data, please keep in mind that the layout order in *`engine.allocation_memory`* is not *`bfyx`* but *`bfxy`*. i.e., example, if input is {1,1,4,5}, the layout should be below
- When you allocate input data, keep in mind that the layout order in `engine.allocate_memory` is not `bfyx` but `bfxy`. For example, if input is `{1,1,4,5}`, the layout should be as below:

```c++
auto input = engine.allocate_memory({ data_types::f32, format::bfyx, { 1, 1, 5, 4 } });
```

## fusions

- It is implemented based on **TEST_P** because there are many cases where multiple layouts are tested in the same topology
- If the fusing test class is already existed, you can use it. otherwise, you should make new fusing test class which is inherited [BaseFusingTest](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/fusions/fusion_test_common.hpp#L19)
- The new fusing test class should create `execute()` method which creates fused / non fused networks and calls *`compare`* method after setting input
- Create test case using **TEST_P**
- You can make the desired networks using create_topologies.
- It is implemented based on **TEST_P** because there are many cases where multiple layouts are tested in the same topology.
- If the fusing test class already exists, you can use it. Otherwise, you should make a new fusing test class, which inherits [BaseFusingTest](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/fusions/fusion_test_common.hpp#L19).
- The new fusing test class should create the `execute()` method, which creates fused / non-fused networks and calls the `compare` method after setting input.
- Create a test case using **TEST_P**:
- You can make the desired networks using `create_topologies`.
```mermaid
flowchart LR
nodeA1(bias) --> nodeA2(conv_prim)
@@ -186,7 +185,7 @@ class nodeA3 moss1
class nodeA8 steel1
class nodeA4,nodeA1,nodeA6,nodeA9,nodeA11 carbon1
```
- For example, if you design the networks like the one above, you can make the test code as follow
- For example, if you design the networks like the one above, you can make the test code as follows:

```c++
class conv_fp32_multi_eltwise_4_clamp : public ConvFusingTest {};
@@ -218,12 +217,12 @@ class nodeA4,nodeA1,nodeA6,nodeA9,nodeA11 carbon1

```

- If you want to change some node's layout format to specific format, you can change it using *`build_option::force_implementations`*.
- In the sample codes, *`conv_prim`* is set to *`format::b_fs_yx_fsv16`* by *`build_option::force_implementations`*
- *`tolerance`* is used as to threshold to check whether or not output result are same between fused network and non fused network in *`compare`* function.
- After the test case is implemented, use `INSTANTIATE_TEST_SUITE_P` to set the test suite for each parameter case as follows.
- Check all variables in *`convolution_test_params`* to make `CASE_CONV_FP32_2`.
- In *`convolution_test_params`*, all tensor, format, and data_types are used in common in all convolution fusing tests. So you can define `CASE_CONV_FP32_2` with all variables except *`expected_fused_primitives`* and *`expected_not_fused_primitives`*
- If you want to change some node's layout format to a specific format, you can change it using `build_option::force_implementations`.
- In the sample codes, `conv_prim` is set to `format::b_fs_yx_fsv16` by `build_option::force_implementations`.
- `tolerance` is used as a threshold to check whether or not the output results are the same between a fused network and a non-fused network in the `compare` function.
- After the test case is implemented, use `INSTANTIATE_TEST_SUITE_P` to set the test suite for each parameter case as follows.
- Check all variables in `convolution_test_params` to make `CASE_CONV_FP32_2`.
- In `convolution_test_params`, all tensor, format, and `data_types` are used in common in all convolution fusing tests. Therefore, you can define `CASE_CONV_FP32_2` with all variables except `expected_fused_primitives` and `expected_not_fused_primitives`.

```c++
struct convolution_test_params {
@@ -256,6 +255,7 @@ INSTANTIATE_TEST_SUITE_P(fusings_gpu, conv_fp32_scale, ::testing::ValuesIn(std::
```

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)
@@ -1,25 +1,26 @@
# Graph Optimization Passes

Graph optimization is a collection of optimization passes that happens to convert a general network description into a network-description-for-GPU-execution. It happens in the constructor of `cldnn::program`. In other words, the input of graph optimization is `topology`[(link)](./basic_data_structures.md#topology) and output is `program`[(link)](./basic_data_structures.md#program-impl--).
Graph optimization is a collection of optimization passes that convert a general network description into a network-description-for-GPU-execution. It happens in the constructor of `cldnn::program`. In other words, the input of graph optimization is `topology` [(link)](./basic_data_structures.md#topology) and the output is `program` [(link)](./basic_data_structures.md#program-impl--).

The transformation from original graph into the final graph is quite complicated. The steps are divided into smaller pieces(`pass`). The purpose of this documentation is not to explain every step in detail, but to explain key steps.
The transformation from the original graph into the final graph is quite complicated. The steps are divided into smaller pieces (`pass`). The purpose of this documentation is not to explain every step in detail, but to explain key steps.

For debugging purpose, you can dump the optimized graph after each step. Please see this [link](./gpu_debug_utils.md#graph-dumps) for detail.
For debugging purposes, you can dump the optimized graph after each step. See this [article](./gpu_debug_utils.md#graph-dumps) for details.

Note: The optimization passes runs in sequence and the prefixed number indicates the sequence. However, this sequence number might change in the future.
> **Note**: The optimization passes run in sequence and the prefixed number indicates the sequence. However, the sequence number might change in the future.

* **00_init**: First step of the optimization. If you want to see first cldnn graph, you can check this. It collects network output node information and set node processing order.
* **08_prepare_primitive_fusing**: Fuse post-operations into other primitives. For example, relu is fused into convolution. Element-wise add operation can usually be fused into predecessor, too. The layout for the primitive is not chosen at this point yet, and we don't know which kernel will be chosen for the primitive. However, support for post-operation is dependent on the chosen kernel. That is why this pass contains some logic to guess the layout.
* **09_reorder_inputs**: Select layout format for each primitives. This is done by calling `layout_optimizer::get_preferred_format` function which returns preferred format for a node(or “any” which means that format must be propagated from adjacent nodes if possible). Then it propagate formats for nodes with “any” preferred format to minimize local reorders. After propagating formats, it inserts actual reorders nodes into the graph. As a result of this pass, we get quite complicated graph with many _redundant_ reorders. It will be removed from `remove_redundant_reorders`.
* **17_remove_redundant_reorders**: This pass is about removing reorder, but it has two conceptual purpose. First one is removing _redundant_ reorders. For example, when the network contains a pattern like `reorder - reorder - reorder`, it can be shrunk into single `reorder`. Second one is about supporting cross-layout operation of primitive. For example, when a `convolution` needs to receive `bfyx` input and to generate `b_fs_yx_fsv16` output, the initial graph from `reorder_inputs` looks like this: `data(bfyx) --> reorder(b_fs_yx_fsv16) --> convolution(b_fs_yx_fsv16)`. This pass looks for such pattern and removes the reorder to generate cross-layout graph for the target convolution: `data(bfyx) --> convolution(b_fs_yx_fsv16)`
* **19_prepare_buffer_fusing**: This pass is for implicit concat or implicit crop. Implicit concat is about removing `concatenation` primitive when two predecessors can put result into the target buffer of concat directly. For example, if two convolution results are concatenated along f-axis and the layout is bfyx format and b=1, we can just remove concat primitive and manipulate the output address of the convolutions to point proper locations.
* **20_add_required_reorders**: This pass tries to keep graph consistency and add reorder if current format is not supported by a node. It checks if current input format is present in `implementation_map<op_t>` defined in `<op_type>_gpu.cpp` file. If it is not defined, this pass tries to change layout to one of the most common format [bfyx, yxfb, byxf] and picks the first supported format.
* **21_add_onednn_optimization_attributes**: This pass generates onednn attributes for post operation[(link)](https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html#post-ops-and-attributes). OpenVINO gpu plugin(a.k.a. cldnn) has a set of defined post operations and it requires some transformation to map those into onednn post-operations.
* **22_compile_graph**: This pass creates `primitive_impl` through kernel selector. In this pass, the kernel for each node is chosen. For onednn primitives, OpenCL code is compiled in this stage. For cldnn primitives, OpenCL code will be compiled after all passes.
* **26_propagate_constants**: This pass reorders weights for convolution, deconvolution and FC to a required format. As kernel is chosen in `compile_graph` stage, it is now known that some reordering is required for weights. It is because the weights are stored in a simple planar format in IR, but other format is usually required for optimized convolution(or deconv, FC). In order to reorder weights, this pass creates a simple graph that receives weights and generates reordered weights. We get the reordered weights by executing the network and the reordered weights are inserted back into the original graph.
* **31_oooq_memory_dependencies**: In GPU, device memory is a limited resource and it is not necessary to keep all the intermediate results when inferencing a network. Therefore, the buffer is reused when the content is not needed anymore. However, it is necessary to take it into consideration that intel_gpu plugin is using out-of-order queue. As we are not sure the exact sequence of execution, there is additional limitation of reusing buffer. For example, in case of multi-branch structure like inception, there is no direct dependencies between the branches except for the common ancestor. However, in OOOQ execution mode, as we are not sure the sequence of execution in inception module, it is necessary not to reuse the buffer from one branch by another branch. Such _implicit dependency_ information is processed in this pass.
* **00_init**: First step of the optimization. If you want to see the first clDNN graph, you can check this. It collects network output node information and sets node processing order.
* **08_prepare_primitive_fusing**: Fuse post-operations into other primitives. For example, *ReLU* is fused into convolution. Element-wise *add* operation can usually be fused into predecessor, too. The layout for the primitive is not chosen at this point yet, and you do not know which kernel will be chosen for the primitive. However, support for post-operation is dependent on the chosen kernel. That is why this pass contains some logic to guess the layout.
* **09_reorder_inputs**: Select the layout format for each primitive. This is done by calling `layout_optimizer::get_preferred_format` function, which returns preferred format for a node (or “any” which means that the format must be propagated from adjacent nodes if possible). Then it propagates formats for nodes with “any” preferred format to minimize local reorders. After propagating formats, it inserts actual reorder nodes into the graph. The result of this pass is a quite complicated graph with many _redundant_ reorders. They will be removed by `remove_redundant_reorders`.
* **17_remove_redundant_reorders**: This pass is about removing reorder, but it has two conceptual purposes. First one is removing _redundant_ reorders. For example, when the network contains a pattern like `reorder - reorder - reorder`, it can be shrunk into a single `reorder`. Second one is about supporting cross-layout operation of a primitive. For example, when a `convolution` needs to receive `bfyx` input and to generate `b_fs_yx_fsv16` output, the initial graph from `reorder_inputs` looks as follows: `data(bfyx) --> reorder(b_fs_yx_fsv16) --> convolution(b_fs_yx_fsv16)`. This pass looks for such a pattern and removes the reorder to generate a cross-layout graph for the target convolution: `data(bfyx) --> convolution(b_fs_yx_fsv16)`
* **19_prepare_buffer_fusing**: This pass is for implicit concat or implicit crop. Implicit concat is about removing `concatenation` primitive when two predecessors can put result into the target buffer of concat directly. For example, if two convolution results are concatenated along f-axis and the layout is `bfyx` format and `b=1`, you can just remove concat primitive and manipulate the output address of the convolutions to point to proper locations.
* **20_add_required_reorders**: This pass tries to keep graph consistency and add reorder if current format is not supported by a node. It checks if the current input format is present in `implementation_map<op_t>`, defined in `<op_type>_gpu.cpp` file. If it is not defined, this pass tries to change layout to one of the most common format `[bfyx, yxfb, byxf]` and picks the first supported format.
* **21_add_onednn_optimization_attributes**: This pass generates oneDNN attributes for post operation [(link)](https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html#post-ops-and-attributes). OpenVINO GPU plugin (clDNN) has a set of defined post operations and it requires some transformation to map those into oneDNN post-operations.
* **22_compile_graph**: This pass creates `primitive_impl` through the kernel selector. In this pass, the kernel for each node is chosen. For oneDNN primitives, OpenCL code is compiled in this stage. For clDNN primitives, OpenCL code will be compiled after all passes.
* **26_propagate_constants**: This pass reorders weights for convolution, deconvolution and FC to a required format. As the kernel is chosen in `compile_graph` stage, it is now known that some reordering is required for the weights. It is because the weights are stored in a simple planar format in IR, but other format is usually required for optimized convolution (or deconv, FC). To reorder weights, this pass creates a simple graph that receives weights and generates reordered weights. You get the reordered weights by executing the network and the reordered weights are inserted back into the original graph.
* **31_oooq_memory_dependencies**: In GPU, device memory is a limited resource and it is not necessary to keep all the intermediate results when inferencing a network. Therefore, the buffer is reused when the content is not needed anymore. However, it is necessary to take it into consideration that `Intel_GPU` plugin is using out-of-order queue. As you are not sure about the exact sequence of execution, there is an additional limitation of reusing the buffer. For example, in case of a multi-branch structure like inception, there are no direct dependencies between the branches except for the common ancestor. However, in OOOQ execution mode, as you are not sure about the sequence of execution in inception module, it is necessary not to reuse the buffer from one branch by another branch. Such _implicit dependency_ information is processed in this pass.

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)
@@ -1,23 +1,26 @@
# Memory allocation in GPU plugin
# Memory Allocation in GPU Plugin

## Allocation types
GPU plugin supports 4 types of memory allocation as below. Note that the prefix `usm_` indicates the allocation type using Intel Unified Shared Memory (USM) extension for OpenCL. For more detailed information about the USM extension, refer to [this](https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_unified_shared_memory.html) page.
* `cl_mem` : Standard OpenCL cl_mem allocation
* `usm_host` : Allocated in host memory and accessible by both of host and device. Not migratable.

GPU plugin supports four types of memory allocation, as listed below. Note that the prefix `usm_` indicates the allocation type using the Intel Unified Shared Memory (USM) extension for OpenCL. For more detailed information about the USM extension, refer to [this](https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_unified_shared_memory.html) page.
* `cl_mem` : Standard OpenCL cl_mem allocation.
* `usm_host` : Allocated in host memory and accessible by both host and device. Non-migratable.
* `usm_shared` : Allocated in both host and device memory and accessible by both host and device. The memories are automatically migrated on demand.
* `usm_device` : Allocated in device memory and accessible only by the device which owns the memory. Not migratable.
* `usm_device` : Allocated in device memory and accessible only by the device which owns the memory. Non-migratable.

Note that there are a few restrictions on memory allocation:

* Allocation of single memory object should not exceed the available device memory size, i.e., the value obtained by `CL_DEVICE_GLOBAL_MEM_SIZE`.
* The sum of all memory objects required to execute a kernel (i.e., the sum of inputs and outputs of a kernel) should not exceed the target available memory. For example, if you want to allocate a memory object to the device memory, the above restrictions should be satisfied against the device memory. Otherwise, the memory object should be allocated on the host memory.
* Allocation of a single memory object should not exceed the available device memory size, that is, the value obtained by `CL_DEVICE_GLOBAL_MEM_SIZE`.
* The sum of all memory objects required to execute a kernel (that is, the sum of inputs and outputs of a kernel) should not exceed the target available memory. For example, if you want to allocate a memory object to the device memory, the above restrictions should be satisfied against the device memory. Otherwise, the memory object should be allocated on the host memory.
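
The device memory size referenced by these restrictions can be queried with standard OpenCL, as in this minimal sketch (error handling omitted, the first GPU device is assumed):

```cpp
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform = nullptr;
    cl_device_id device = nullptr;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // The value that bounds the size of a single memory object allocation.
    cl_ulong global_mem_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem_size), &global_mem_size, nullptr);
    std::printf("CL_DEVICE_GLOBAL_MEM_SIZE = %llu bytes\n",
                (unsigned long long)global_mem_size);
    return 0;
}
```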

## Memory allocation API
In GPU plugin, the allocation for each allocation type can be done with [engine::allocate_memory](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/include/intel_gpu/runtime/engine.hpp#L51), which
calls the corresponding memory object wrapper for each allocation type: [gpu_buffer](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L35), [gpu_usm](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L291).

## Dump memory allocation history
The memory allocation history is being managed by the `engine`, which can be dumped by setting the environment variable `OV_GPU_Verbose=1` if the OpenVino is built with the cmake configuration `ENABLE_DEBUG_CAPS=ON`.
In GPU plugin, the allocation for each allocation type can be done with [engine::allocate_memory](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/include/intel_gpu/runtime/engine.hpp#L51), which
calls the corresponding memory object wrapper for each allocation type: [gpu_buffer](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L35), [gpu_usm](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L291).
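
The following self-contained mock illustrates that dispatch idea only; the type and function names are assumptions, not the real `engine` interface (see the links above for the actual code):

```cpp
#include <cstddef>
#include <memory>

// Mocked stand-ins for the real cldnn types (names are assumptions).
enum class allocation_type { cl_mem, usm_host, usm_shared, usm_device };

struct memory { virtual ~memory() = default; };
struct gpu_buffer : memory {};  // would wrap a cl_mem object
struct gpu_usm    : memory {};  // would wrap a USM pointer

std::unique_ptr<memory> allocate_memory(std::size_t bytes, allocation_type type) {
    (void)bytes;  // a real engine would pass the layout/size to the wrapper
    if (type == allocation_type::cl_mem)
        return std::make_unique<gpu_buffer>();
    return std::make_unique<gpu_usm>();  // all usm_* types share the USM wrapper
}

int main() {
    auto buf = allocate_memory(1024, allocation_type::usm_device);
    (void)buf;
    return 0;
}
```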

## Dump memory allocation history

The memory allocation history is managed by the `engine` and can be dumped by setting the environment variable `OV_GPU_Verbose=1`, if OpenVINO is built with the cmake configuration `ENABLE_DEBUG_CAPS=ON`.
```cpp
...
GPU_Debug: Allocate 58982400 bytes of usm_host allocation type (current=117969612; max=117969612)
@@ -26,26 +29,28 @@ GPU_Debug: Allocate 44236800 bytes of usm_host allocation type (current=16220641
GPU_Debug: Allocate 14873856 bytes of usm_device allocation type (current=59500236; max=59500236)
...
```
Here, `current` denotes the total allocated memory amount at that moment while `max` denotes the peak record of the total memory allocation until that moment.
Here, `current` denotes the total amount of allocated memory at that moment, while `max` denotes the peak record of the total memory allocation until that moment.

## Allocated memory objects
The typical memory allocation performed in the GPU plugin can be categorized as follows:
* `Constant memory allocation`: In GPU plugin, constant data are held by the `data` primitives and the required memory objects are [allocated](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/plugin/ops/constant.cpp#L181) and assigned at the creation of the data primitive. First, it is allocated on the host memory and the constant data are copied from the corresponding blob in ngraph. Once all the transformation and optimization processes in `cldnn::program` is finished and the user nodes of those data are known as the GPU operations using the device memory, then the memory is reallocated on the device memory and the constants data are copied to there (i.e., [transferred](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/graph/program.cpp#L457)). Note that constant data are shared within batches and streams.
* `Output memory allocation`: A memory object to store the output result of each primitive is created at the creation of each primitive_inst ([link](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L263)), except the cases when the output is reusing the input memory. Note that the creation of a primitive_inst is done in an descending order of the output memory size for achieving better memory reusing efficiency.

* `Intermediate memory allocation`: Some primitives such as _detection_output_ and _non_max_suppression_ consisting of multiple kernels require intermediate memories to exchange data b/w those kernels. The allocation of such intermediate memories happens after all allocation for primitive_insts are finished ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/network.cpp#L592)), since it needs to be processed in a processing order to use the predecessors' allocation information to decide whether to allocate it on device memory or not by checking the memory allocation restriction described above.
The typical memory allocation performed in the GPU plugin can be categorized as follows:
* `Constant memory allocation`: In GPU plugin, constant data are held by the `data` primitives and the required memory objects are [allocated](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/plugin/ops/constant.cpp#L181) and assigned at the creation of the data primitive. First, it is allocated on the host memory and the constant data are copied from the corresponding blob in ngraph. Once all the transformation and optimization processes in `cldnn::program` are finished and the user nodes of the data are known as the GPU operations using the device memory, the memory is reallocated on the device memory and the constant data is copied there (that is, [transferred](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/graph/program.cpp#L457)). Note that constant data is shared within batches and streams.
* `Output memory allocation`: A memory object to store the output result of each primitive is created at the creation of each `primitive_inst` ([link](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L263)), except when the output is reusing the input memory. Note that the creation of a `primitive_inst` is done in descending order of the output memory size to achieve better memory reuse efficiency.

* `Intermediate memory allocation`: Some primitives, such as _detection_output_ and _non_max_suppression_, consisting of multiple kernels require intermediate memories to exchange data between those kernels. The allocation of such intermediate memories happens after all allocations for `primitive_insts` are finished ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/network.cpp#L592)), since they need to be processed in processing order: the predecessors' allocation information is used to decide whether to allocate each one on device memory by checking the memory allocation restrictions described above.

## Memory dependency and memory pool
In GPU plugin, multiple memory objects can be allocated at a same address, when there is no dependency between the users of them. For example, a memory region of a program_node _A_'s output memory can be allocated for another program_node _B_'s output, if the output of _A_ is no longer used by any other program_node, when the result of the _B_ is to be stored. This mechanism is realized by the following two parts;
1. `Memory dependency` : memory_dependencies of a program_node is set by the memory dependency passes. There are two kinds of memory dependency passes as follows:
* `basic_memory_dependencies` : Assuming an in-order-queue execution, this pass adds dependencies to a program_node, which are deduced by checking its direct input and output nodes only.
* `oooq_memory_dependencies` : Assuming an out-of-order-queue execution, this pass adds dependencies to all pair of program_nodes that can potentially be executed at the same time.
2. `Memory pool` : The GPU plugin has a [memory pool](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/include/intel_gpu/runtime/memory_pool.hpp) which is responsible for the decision of allocation or reuse for an allocation request. This memory_pool utilizes the memory dependencies set by the above two passes in the decision of reuse of not. Note that each `cldnn::network` has its own `memory_pool`.

In GPU plugin, multiple memory objects can be allocated at the same address when there is no dependency between their users. For example, a memory region of a `program_node` _A_'s output memory can be allocated for another `program_node` _B_'s output, if the output of _A_ is no longer used by any other `program_node` when the result of _B_ is to be stored. This mechanism is realized by the following two parts:
1. `Memory dependency` : `memory_dependencies` of a `program_node` are set by the memory dependency passes. There are two kinds of memory dependency passes:
* `basic_memory_dependencies` : Assuming an in-order-queue execution, this pass adds dependencies to a `program_node`, which are deduced by checking its direct input and output nodes only.
* `oooq_memory_dependencies` : Assuming an out-of-order-queue execution, this pass adds dependencies to all pairs of `program_nodes` that can potentially be executed at the same time.
2. `Memory pool` : The GPU plugin has a [memory pool](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/include/intel_gpu/runtime/memory_pool.hpp), which is responsible for the decision of allocation or reuse for an allocation request. This `memory_pool` utilizes the memory dependencies set by the above two passes when deciding whether to reuse memory or not. Note that each `cldnn::network` has its own `memory_pool`.
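
A minimal sketch of the reuse decision under the stated assumptions (simplified data structures and assumed names, not the actual `memory_pool` code): a node may take over a buffer only if none of the buffer's users appears among its memory dependencies.

```cpp
#include <set>
#include <string>

// Simplified stand-ins: node ids and per-node memory dependencies
// (the real pool tracks program_node objects, not strings).
using node_id = std::string;

bool can_reuse(const std::set<node_id>& buffer_users,
               const std::set<node_id>& requester_mem_deps) {
    // Reuse is forbidden if any current user of the buffer may still be
    // alive (in OOOQ mode, potentially running) relative to the requester.
    for (const auto& user : buffer_users)
        if (requester_mem_deps.count(user))
            return false;
    return true;
}

int main() {
    std::set<node_id> users = {"conv1"};
    std::set<node_id> deps  = {"conv1", "pool1"};  // set by the dependency passes
    return can_reuse(users, deps) ? 0 : 1;         // returns 1: no reuse allowed
}
```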

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)
* [OpenVINO GPU Plugin](../README.md)
* [Developer documentation](../../../../docs/dev/index.md)

@@ -1,6 +1,6 @@
# GPU plugin workflow
# GPU Plugin Workflow

The simplified workflow in the GPU plugin is shown on the picture below (click on image for higher resolution):
The simplified workflow in the GPU plugin is shown in the diagram below (click it for higher resolution):

```mermaid
classDiagram
@@ -147,6 +147,7 @@ class `intel_gpu::device_query` {Detects available devices for given backend}
```

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

@@ -1,20 +1,20 @@
# GPU plugin structure
# GPU Plugin Structure

Historically GPU plugin was built on top of standalone [clDNN library](https://github.com/intel/clDNN) for DNNs inference on Intel® GPUs,
Historically, GPU plugin was built on top of standalone [clDNN library](https://github.com/intel/clDNN) for DNNs inference on Intel® GPUs,
but at some point clDNN became a part of OpenVINO, so now it is a part of the overall GPU plugin code. Intel® Arc™ Graphics Xe-HPG is supported
via embedding of the [oneDNN library](https://github.com/oneapi-src/oneDNN).

OpenVINO GPU plugin is responsible for:
1. [IE Plugin API](https://docs.openvino.ai/latest/openvino_docs_ie_plugin_dg_overview.html) implementation.
2. Translation of model from common IE semantic (ov::Function) into plugin specific one (cldnn::topology) which is then compiled into
gpu graph representation (cldnn::network).
2. Translation of a model from common IE semantic (`ov::Function`) into plugin-specific one (`cldnn::topology`), which is then compiled into
GPU graph representation (`cldnn::network`).
3. Implementation of OpenVINO operation set for Intel® GPU.
4. Device specific graph transformations.
4. Device-specific graph transformations.
5. Memory allocation and management logic.
6. Processing of incoming InferRequests using clDNN objects.
6. Processing of incoming InferRequests, using clDNN objects.
7. Actual execution on GPU device.

As Intel GPU Plugin source code structure is shown below:
Intel GPU Plugin source code structure is shown below:
<pre>
src/plugins/intel_gpu - root GPU plugin folder
├── include
@@ -49,19 +49,20 @@ src/plugins/intel_gpu - root GPU plugin folder
└── rapidjson - thirdparty <a href="https://github.com/Tencent/rapidjson">RapidJSON</a> lib for reading json files (cache.json)
</pre>

One last thing that is worth mentioning is functional tests which is located in the following location:
It is worth mentioning the functional tests, which are located in:
```
src/tests/functional/plugin/gpu
```
Most of the tests are reused across plugins, and each plugin only need to add test instances with some specific parameters.
Most of the tests are reused across plugins, and each plugin only needs to add the test instances with some specific parameters.
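
As an illustration, a hedged sketch of how a plugin typically instantiates a shared parameterized test with gtest (the fixture and test names below are hypothetical, not taken from the repository):

```cpp
#include <gtest/gtest.h>
#include <string>

// A stand-in for a shared test definition from a common test library.
class SharedActivationTest : public ::testing::TestWithParam<std::string> {};

TEST_P(SharedActivationTest, Inference) {
    const std::string device = GetParam();
    EXPECT_FALSE(device.empty());  // a real shared test would run inference here
}

// The GPU plugin adds only the instantiation with its own parameters.
INSTANTIATE_TEST_SUITE_P(smoke_GPU, SharedActivationTest,
                         ::testing::Values(std::string("GPU")));
```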

Shared tests are located here:
Shared tests are located in:
```
src/tests/functional/plugin/shared <--- test definitions
src/tests/functional/plugin/gpu/shared_tests_instances <--- instances for GPU plugin
```

## See also

* [OpenVINO™ README](../../../../README.md)
* [OpenVINO Core Components](../../../README.md)
* [OpenVINO Plugins](../../README.md)

@@ -1,7 +1,7 @@
# Inference Engine Test Infrastructure

This is the OpenVINO Inference Engine testing framework. The OpenVINO Inference Engine test system contains:
* **Unit tests**
* **Unit tests**
This test type is used for detailed testing of each software instance (including internal classes with their methods)
within the tested modules (Inference Engine and Plugins). The following rules are **required** for Unit
Tests development:
@@ -9,50 +9,51 @@ This is OpenVINO Inference Engine testing framework. OpenVINO Inference Engine t
* Unit test folder for a particular module should replicate `SRC` folder layout of the corresponding tested module to
allow further developers to get a better understanding of which part of the software is already covered by unit tests and where
to add new tests if needed.
> **Example**: We have `network_serializer.h` and `network_serializer.cpp` files within the `src` folder of the
tested Inference Engine module. Then, new `network_serializer_test.cpp` file should be created within the root of
> **Example**: There are `network_serializer.h` and `network_serializer.cpp` files within the `src` folder of the
tested Inference Engine module. Then, a new `network_serializer_test.cpp` file should be created within the root of
the Unit Test folder for this module. This test file should cover all the classes and methods from the original
files.

> **Example**: We have `ie_reshaper.cpp` within the `src/shape_infer` subfolder of the tested module. In this case
new `shape_infer` subfolder should be created within the the root of the Unit Test folder for this module. And new

> **Example**: There is the `ie_reshaper.cpp` file within the `src/shape_infer` subfolder of the tested module. In this case,
a new `shape_infer` subfolder should be created within the root of the Unit Test folder for this module. And a new
`ie_reshaper_test.cpp` file should be created within this newly created subfolder. This test file should cover all
the classes and methods from the original file.

* Each Unit Test should cover the only target classes and methods. If needed, all external interface components should

* Each Unit Test should cover only the target classes and methods. If needed, all external interface components should
be mocked. There are common mock objects provided within the common Unit Test Utilities to stub the general
Inference Engine API classes.
> **Example**: We have `cnn_network_impl.hpp` and `cnn_network_impl.cpp` files within the `src` folder of the tested
module. In this case, new `cnn_network_impl_test.cpp` file should be created and it should contain tests on
> **Example**: There are `cnn_network_impl.hpp` and `cnn_network_impl.cpp` files within the `src` folder of the tested
module. In this case, a new `cnn_network_impl_test.cpp` file should be created and it should contain tests on
`CNNNetworkImpl` class only.

* It's not prohibited to have several test files for the same file from the tested module.
* It's not prohibited to create a separate test file for a specific classes or functions (not for the whole file).
* It is not prohibited to have several test files for the same file from the tested module.
* It is not prohibited to create a separate test file for specific classes or functions (not for the whole file).

* **Functional tests**
* **Functional tests**
This test type is used to verify the public Inference Engine API. There are the following types of functional tests:
* `inference_engine_tests` are plugin-independent tests. Used to verify Inference Engine API methods which don't
involve any plugin runtime. E.g. `network_reader`, `network_serializer`, `precision` tests.
* `plugin_tests` are plugin-dependent tests. These tests require plugin runtime to be executed during testing. E.g.
any tests using `ExecutableNetwork`, `InferRequest` API can only be implemented within this test group.
* `inference_engine_tests` are plugin-independent tests. They are used to verify Inference Engine API methods that do not
involve any plugin runtime. The examples are: `network_reader`, `network_serializer`, and `precision` tests.
* `plugin_tests` are plugin-dependent tests. These tests require plugin runtime to be executed during testing. For example,
any tests using `ExecutableNetwork`, `InferRequest` API can only be implemented within this test group.

> **Example**: Any new test on creating of a CNNNetwork object and checking of its output info should be included to
to the Inference Engine Functional tests suite. But any new test containing reading of a network and loading it to a
> **Example**: Any new test on creating a CNNNetwork object and checking its output info should be included to
the Inference Engine Functional tests suite. However, any new test containing reading of a network and loading it to a
specified plugin is always a plugin test.

The following rules are **required** for Functional Tests development:
* All Functional tests are separated into different executables for the Inference Engine and each plugin.
* Pre-converted IR files must not be used within the new Functional Tests. Tested models should be generated during
the tests execution. The main method to generate a required model is building the required NGraph function and
creating of a CNNNetwork using it. If a required layer is not covered by Ngraph it's allowed to build IR file using
`xml_net_builder` utility (please refer to the `ir_net.hpp` file). IR XML files hardcoded as strings within the test
creating a CNNNetwork using it. If a required layer is not covered by Ngraph, it is allowed to build an IR file using
`xml_net_builder` utility (refer to the `ir_net.hpp` file). IR XML files hardcoded as strings within the test
code should not be used.
* All the plugin test cases are parameterized with (at least) the device name and included in the common
`funcSharedTests` static library. This library is linked to the Plugin Test binaries. And all the plugin
developers just add required test instantiations based on the linked test definitions to their own test binary. It should
be done to make all the **shared** test cases always visible and available to instantiate by other plugins.
be done to make all the **shared** test cases always visible and available to instantiate by other plugins.

> **NOTE**: Any new plugin test case should be added to the common test definitions library
(`funcSharedTests`) within the OpenVINO repository first. And then this test case can be instantiated with the
(`funcSharedTests`) within the OpenVINO repository first. Then, this test case can be instantiated with the
required parameters inside the plugin's own test binary which links this shared tests library.

> **NOTE**: `funcSharedTests` library is added to the developer package and available for closed source
@@ -60,15 +61,17 @@ This is OpenVINO Inference Engine testing framework. OpenVINO Inference Engine t
* All the inference engine functional test cases are defined and instantiated within the single test binary. These
test cases are not implemented as a separate library and not available for instantiations outside this binary.

* **Inference Engine tests utilities**
* **Inference Engine tests utilities**
The set of utilities which are used by the Inference Engine Functional and Unit tests. Different helper functions,
blob comparators, OS specific constants, etc are implemented within the utilities.
blob comparators, OS-specific constants, etc. are implemented within the utilities.
Internal namespaces (for example, `CommonTestUtils::`, `FuncTestUtils::` or `UnitTestUtils::`) must be used to
separate utilities by domains.

> **NOTE**: All the utilities libraries are added to the developer package and available for closed source
development.

## See also
## See also

* [OpenVINO™ README](../../README.md)
* [OpenVINO Core Components](../README.md)
* [Developer documentation](../../docs/dev/index.md)

@@ -1,9 +1,11 @@
# Conformance test runner
# Conformance Test Runner

## Description

Conformance suites certify plugin functionality using a set of tests with parameters independent of plugin specifics. There are two types of conformance validation.

### API Conformance

The suite checks the following OpenVINO API entities in a plugin implementation:
* plugin
* compiled model (executable network)
@@ -11,24 +13,24 @@ The suite checks the following OpenVINO API entities in a plugin implementation:
Also, there are test instantiations to validate hardware plugin functionality via software plugins (for example, MULTI, HETERO, etc.) for the entities.

The other part of the API conformance suite is QueryModel validation:
* `ReadIR_queryModel` tests validate the `query_model` API using a simple single operation graph (Conformance IR) based on model parameters.
* `ReadIR_queryModel` tests validate the `query_model` API, using a simple single operation graph (Conformance IR) based on model parameters.
* `OpImplCheck` tests are simple synthetic checks of `query_model`; they set the implementation status for each operation.

A result of the `apiConformanceTests` run is two xml files: `report_api.xml` and `report_opset.xml`. The first one shows OpenVINO API entities' test statistics for each OpenVINO API entity, such as passed/failed/crashed/skipped/hanging, tests number, pass rates, and implementation status. The second one demonstrates the `query_model` results for each operation.

A result of the `apiConformanceTests` run is two *xml* files: `report_api.xml` and `report_opset.xml`. The first one shows OpenVINO API entities' test statistics for each OpenVINO API entity, such as `passed/failed/crashed/skipped/hanging`, tests number, pass rates, and implementation status. The second one demonstrates the `query_model` results for each operation.

### Opset Conformance

The suite validates an OpenVINO operation plugin implementation, using simple single operation graphs (Conformance IR) taken from models. The plugin inference output is compared with the reference.

The suite contains:
The suite contains:
* `ReadIR_compareWithRefs` set allows reading IRs from folders recursively, inferring them, and comparing plugin results with the reference.
* `OpImplCheckTest` set checks an operation plugin implementation status, using a simple synthetic single operation graph (`Implemented`/`Not implemented`). The suite checks only `compile_model` without comparison with the reference.
* `OpImplCheckTest` set checks an operation plugin implementation status, using a simple synthetic single operation graph (`Implemented`/`Not implemented`). The suite checks only `compile_model` without comparison with the reference.

A result of the `conformanceTests` run is the `report_opset.xml` file. It shows test statistics, such as pass rate, passed, crashed, skipped, and failed tests, and plugin implementation status per operation for devices.

## How to build
Run the following command in build directory:

Run the following commands in the build directory:
1. Generate CMake project:
```
cmake -DENABLE_TESTS=ON -DENABLE_FUNCTIONAL_TESTS=ON ..
@@ -43,129 +45,128 @@ Run the following command in build directory:
```
make --jobs=$(nproc --all) lib_plugin_name
```

## How to run using [simple conformance runner](./../../../../ie_test_utils/functional_test_utils/layer_tests_summary/run_conformance.py)

There is a simple python runner to complete the whole conformance pipeline locally. Some steps can be excluded from the pipeline by command-line parameter configuration.

### The conformance pipeline steps:
1. (Optional) Download models/conformance IR via URL / copy archieve to working directory / verify dirs / check list-files.

1. (Optional) Download models/conformance IR via URL / copy archive to working directory / verify dirs / check list-files.
2. (Optional) Run `SubgraphDumper` to generate a simple single op graph based on models, or download the `conformance_ir` folder (if `-s=1`).
3. Run conformance test executable files.
4. Generate conformance reports.

### Command-line arguments

The script has the following arguments:
* `-h, --help` show this help message and exit
* `-m MODELS_PATH, --models_path MODELS_PATH`
Path to the directory/ies containing models to dump subgraph (the default way is to download conformance IR). It may be directory, archieve file, .lst file or http link to download something . If `--s=0`, specify the Conformance IRs directoryy
Path to the directory/ies containing models to dump subgraph (the default method is to download conformance IR). It may be a directory, an archive file, an `.lst` file, or a URL to download some data. If `--s=0`, specify the Conformance IRs directory.
* `-d DEVICE, --device DEVICE`
Specify the target device. The default value is CPU
Specify the target device. The default value is `CPU`.
* `-ov OV_PATH, --ov_path OV_PATH`
OV repo path. The default way is try to find the absolute path of OV repo (by using script path)
OV repo path. The default method is to try to find the absolute path of OV repo (by using the script path).
* `-w WORKING_DIR, --working_dir WORKING_DIR`
Specify a working directory to save all artifacts, such as reports, models, conformance_irs, etc.
Specify a working directory to save all artifacts, such as reports, models, `conformance_irs`, etc.
* `-t TYPE, --type TYPE`
Specify conformance type: `OP` or `API`. The default value is `OP`
Specify conformance type: `OP` or `API`. The default value is `OP`.
* `-s DUMP_CONFORMANCE, --dump_conformance DUMP_CONFORMANCE`
Set '1' if you want to create Conformance IRs from custom/downloaded models. In other cases, set `0`. The default value is '1'
Set `1` if you want to create Conformance IRs from custom/downloaded models. In other cases, set `0`. The default value is `1`.
* `-j WORKERS, --workers WORKERS`
Specify number of workers to run in parallel. The default value is CPU count - 1
Specify the number of workers to run in parallel. The default value is `CPU count - 1`.
* `--gtest_filter GTEST_FILTER`
Specify gtest filter to apply when running test. E.g. *Add*:*BinaryConv*. The default value is None
Specify gtest filter to apply when running a test. For example, `*Add*:*BinaryConv*`. The default value is `None`.
* `-c OV_CONFIG_PATH, --ov_config_path OV_CONFIG_PATH`
Specify path to file contains plugin config
Specify the path to a file that contains the plugin config.
* `-sh SHAPE_MODE, --shape_mode SHAPE_MODE`
Specify shape mode for conformance. Default value is ``. Possible values: `static`, `dynamic`, ``
Specify shape mode for conformance. The default value is ``. Possible values: `static`, `dynamic`, ``.

> **NOTE**:
> All arguments are optional and have default values to reproduce OMZ conformance results in a default way.
> **NOTE**: All arguments are optional and their default values reproduce the OMZ conformance results.

> **NOTE**:
> The approach can be used as custom model scope validator!
> **NOTE**: The approach can be used as a custom model scope validator!

## Examples of usage:
1. Use the default way to reproduce opset conformance results for OMZ on GPU:

1. Use the default method to reproduce opset conformance results for OMZ on GPU:
```
python3 run_conformance.py -d GPU
```
```
2. Use the conformance pipeline to check new models support (as IRs) on the CPU plugin and save results to a custom directory:
```
python3 run_conformance.py -m /path/to/new/model_irs -s=1 -w /path/to/working/dir -d CPU
```
3. Use custom OV build to check GNA conformance using pre-generated conformance_irs:
```
3. Use custom OV build to check GNA conformance, using pre-generated `conformance_irs`:
```
python3 run_conformance.py -m /path/to/conformance_irs -s=0 -ov /path/to/ov_repo_on_custom_branch -d GNA
```

> **IMPORTANT NOTE:**
> If you need to debug some conformance tests, use the binary run as the default method. If you want to get conformance results or reproduce CI behavior, use the simple python runner.
```

> **IMPORTANT NOTE:** If you need to debug some conformance tests, use the binary run as the default method. If you want to get conformance results or reproduce CI behavior, use the simple python runner.

## How to generate Conformance IRs set

Run the following commands:
1. Clone the [`Open Model Zoo repo`](https://github.com/openvinotoolkit/open_model_zoo) or prepare a custom model scope.
2. Download all models using [Downloader tool](https://github.com/openvinotoolkit/open_model_zoo/blob/master/tools/model_tools/downloader.py) from the repo.
3. Convert downloaded models to IR files using [Converter tool](https://github.com/openvinotoolkit/open_model_zoo/blob/master/tools/model_tools/converter.py) from the repo.
3. Convert downloaded models to IR files, using [Converter tool](https://github.com/openvinotoolkit/open_model_zoo/blob/master/tools/model_tools/converter.py) from the repo.
4. Run [Subgraph dumper](./../subgraphs_dumper/README.md) to collect unique operation set from the models.

## How to run operation conformance suite

The target is able to take the following command-line arguments:
* `-h` prints target command-line options with description.
* `--device` specifies target device.
* `--input_folders` specifies the input folders with IRs or '.lst' file contains IRs path. Delimiter is `,` symbol.
* `--plugin_lib_name` is name of plugin library. The example is `openvino_intel_cpu_plugin`. Use only with unregistered in IE Core devices.
* `--disable_test_config` allows to ignore all skipped tests with the exception of `DISABLED_` prefix using.
* `--skip_config_path` allows to specify paths to files contain regular expressions list to skip tests. [Examples](./op_conformance_runner/skip_configs)
* `--config_path` allows to specify path to file contains plugin config. [Example](./op_conformance_runner/config/config_example.txt)
* `--extend_report` allows not to re-write device results to the report (add results of this run to the existing). Mutually exclusive with --report_unique_name.
* `--report_unique_name` allows to save report with unique name (report_pid_timestamp.xml). Mutually exclusive with --extend_report.
* `--save_report_timeout` allows to try to save report in cycle using timeout (in seconds).
* `--output_folder` Paths to the output folder to save report.
* `--extract_body` allows to count extracted operation bodies to report.
* `--shape_mode` Optional. Allows to run `static`, `dynamic` or both scenarios. Default value is empty string allows to run both scenarios. Possible values
* `--input_folders` specifies the input folders with IRs or `.lst` file. It contains paths, separated by a comma `,`.
* `--plugin_lib_name` is a name of a plugin library. The example is `openvino_intel_cpu_plugin`. Use it only with devices that are not registered in IE Core.
* `--disable_test_config` allows ignoring all skipped tests, except those with the `DISABLED_` prefix.
* `--skip_config_path` allows specifying paths to files that contain lists of regular expressions for tests to skip. [Examples](./op_conformance_runner/skip_configs/skip_config_example.lst)
* `--config_path` allows specifying the path to a file that contains plugin config. [Example](./op_conformance_runner/config/config_example.txt)
* `--extend_report` allows you not to re-write device results to the report (add results of this run to the existing one). Mutually exclusive with `--report_unique_name`.
* `--report_unique_name` allows you to save a report with a unique name (`report_pid_timestamp.xml`). Mutually exclusive with `--extend_report`.
* `--save_report_timeout` allows trying to save a report in a cycle, using a timeout (in seconds).
* `--output_folder` specifies the path to the output folder to save a report.
* `--extract_body` allows you to count extracted operation bodies to a report.
* `--shape_mode` is optional. It allows you to run `static`, `dynamic`, or both scenarios. The default value is an empty string, which allows running both scenarios. Possible values
are `static`, `dynamic`, ``.
* `--test_timeout` Setup timeout for each test in seconds, default timeout 900seconds (15 minutes).
* `--test_timeout` specifies setup timeout for each test in seconds. The default timeout is 900 seconds (15 minutes).
* All `gtest` command-line parameters

> **NOTE**:
>
> Using of [`parallel_runner`](./../../../../ie_test_utils/functional_test_utils/layer_tests_summary/run_parallel.py) tool to run a conformance suite helps to report crashed tests and collect correct statistic after unexpected crashes.
> The tool is able to work in 2 modes:
> * one test is run in separate thread (first run, as the output the cache will be saved as a custom file)
> * similar load time per one worker based on test execution time. May contain different test count per worker
>
>
> Using [`parallel_runner`](./../../../../ie_test_utils/functional_test_utils/layer_tests_summary/run_parallel.py) tool to run a conformance suite helps to report crashed tests and collect correct statistics after unexpected crashes.
> The tool is able to work in two modes:
> * one test is run in a separate thread (on the first run, the cache is saved as a custom output file).
> * similar load time per one worker based on test execution time. May contain different test count per worker.
>
> The example of usage is:
> ```
> python3 run_parallel.py -e=/path/to/openvino/bin/intel64/Debug/conformanceTests -d .
> --gtest_filter=*Add*:*BinaryConv* -- --input_folders=/path/to/ir_1,/path/to/ir_2 --device=CPU
> python3 run_parallel.py -e=/path/to/openvino/bin/intel64/Debug/conformanceTests -d .
> --gtest_filter=*Add*:*BinaryConv* -- --input_folders=/path/to/ir_1,/path/to/ir_2 --device=CPU
> --report_unique_name --output_folder=/path/to/temp_output_report_folder
> ```
> All arguments after the `--` symbol are forwarded to the `conformanceTests` target.
>
>
> If you use the `--report_unique_name` argument, run
> [the merge xml script](./../../../../ie_test_utils/functional_test_utils/layer_tests_summary/merge_xmls.py)
> to aggregate the results to one xml file. Check command-line arguments with `--help` before running the command.
> [the merge xml script](./../../../../ie_test_utils/functional_test_utils/layer_tests_summary/merge_xmls.py)
> to aggregate the results to one *xml* file. Check command-line arguments with `--help` before running the command.
> The example of usage is:
> ```
> python3 merge_xmls.py --input_folders=/path/to/temp_output_report_folder --output_folder=/path/to/output_report_folder --output_filename=report_aggregated
> ```

## How to create operation conformance report

Run [the summarize script](./../../../../ie_test_utils/functional_test_utils/layer_tests_summary/summarize.py) to generate `html` and `csv` reports. Check command-line arguments with `--help` before running the command.
The example of using the script is:
```
python3 summarize.py --xml /opt/repo/infrastructure-master/thirdparty/gtest-parallel/report.xml --out /opt/repo/infrastructure-master/thirdparty/gtest-parallel/
```
> **NOTE**:
>
> Please, do not forget to copy [styles folder](./../../../../ie_test_utils/functional_test_utils/layer_tests_summary/template) to the output directory. It
> helps to provide report with the filters and other usable features.
> **NOTE**: Remember to copy [styles folder](./../../../../ie_test_utils/functional_test_utils/layer_tests_summary/template) to the output directory. It helps to provide a report with filters and other useful features.

The report contains statistics based on conformance results and filter fields at the top of the page.

## See also
## See Also

* [OpenVINO™ README](../../../../../../README.md)
* [OpenVINO Core Components](../../../../../README.md)
* [Developer documentation](../../../../../../docs/dev/index.md)