From 70cb829992638ae5a42422f2f40f8f92d5b3b52b Mon Sep 17 00:00:00 2001 From: Maciej Smyk Date: Thu, 16 Feb 2023 08:03:11 +0100 Subject: [PATCH] [DOCS] Move of developer documentation from wiki to md documents - master (#15372) * CPU Plugin README creation * debug capabilities * Update debug_capabilities.md * performance_analysis_ITT_counters * cpu-emulation * runtime_parameters_cache * Update README.md * internal_cpu_plugin_optimization * See Also update for CPU Plugin * See Also update for CPU Plugin 2 * intel_gpu * Update README.md * source code structure & See Also update for CPU plugin * Update README.md * See also update * basic_data_structure * memory_allocation_gpu_plugin * Update memory_allocation_gpu_plugin.md * simplified workflow * graph optimization passes * execution_of_inference * GPU Plugin * GPU Plugin fix * Snippets * Update README.md * Update README.md * fixes * Snippets fix * Update README.md * component description * Key Contacts * Apply suggestions from code review Co-authored-by: Ilya Churaev * Update src/plugins/intel_gpu/README.md * Update src/plugins/intel_cpu/docs/internal_cpu_plugin_optimization.md * Update src/plugins/intel_cpu/docs/internal_cpu_plugin_optimization.md * Update src/plugins/intel_cpu/docs/internal_cpu_plugin_optimization.md * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review Text graphs to mermaid * Update src/plugins/intel_gpu/docs/simplified_workflow.md * Delete ov_intel_gpu_plugin_diagram.png Removal of ov_intel_gpu_plugin_diagram.png file as the mermaid version is replacing it. * Apply suggestions from code review * Update src/common/snippets/README.md --------- Co-authored-by: Sebastian Golebiewski Co-authored-by: Ilya Churaev --- src/common/snippets/README.md | 13 + .../snippets/docs/snippets_cpu_target.md | 57 ++++ .../snippets/docs/snippets_design_guide.md | 301 ++++++++++++++++++ src/plugins/README.md | 10 +- src/plugins/hetero/README.md | 141 ++++++++ src/plugins/intel_cpu/README.md | 29 ++ src/plugins/intel_cpu/docs/cpu_emulation.md | 37 +++ .../intel_cpu/docs/debug_capabilities.md | 21 ++ .../docs/internal_cpu_plugin_optimization.md | 223 +++++++++++++ .../docs/performance_analysis_ITT_counters.md | 57 ++++ .../docs/runtime_parameters_cache.md | 54 ++++ src/plugins/intel_gpu/README.md | 55 +++- .../intel_gpu/docs/basic_data_structures.md | 245 ++++++++++++++ .../intel_gpu/docs/execution_of_inference.md | 33 ++ src/plugins/intel_gpu/docs/gpu_debug_utils.md | 252 +++++++++++++++ src/plugins/intel_gpu/docs/gpu_kernels.md | 139 ++++++++ .../intel_gpu/docs/gpu_memory_formats.md | 113 +++++++ .../docs/gpu_plugin_driver_troubleshooting.md | 71 +++++ .../intel_gpu/docs/gpu_plugin_ops_enabling.md | 138 ++++++++ .../intel_gpu/docs/gpu_plugin_unit_test.md | 263 +++++++++++++++ .../docs/graph_optimization_passes.md | 27 ++ .../docs/memory_allocation_gpu_plugin.md | 51 +++ .../intel_gpu/docs/simplified_workflow.md | 154 +++++++++ .../intel_gpu/docs/source_code_structure.md | 68 ++++ src/tests/README.md | 9 +- .../plugin/conformance/test_runner/README.md | 10 +- 26 files changed, 2552 insertions(+), 19 deletions(-) create mode 100644 src/common/snippets/README.md create mode 100644 src/common/snippets/docs/snippets_cpu_target.md create mode 100644 src/common/snippets/docs/snippets_design_guide.md create mode 100644 src/plugins/hetero/README.md create mode 100644 src/plugins/intel_cpu/README.md create mode 100644 src/plugins/intel_cpu/docs/cpu_emulation.md create mode 100644 
src/plugins/intel_cpu/docs/debug_capabilities.md create mode 100644 src/plugins/intel_cpu/docs/internal_cpu_plugin_optimization.md create mode 100644 src/plugins/intel_cpu/docs/performance_analysis_ITT_counters.md create mode 100644 src/plugins/intel_cpu/docs/runtime_parameters_cache.md create mode 100644 src/plugins/intel_gpu/docs/basic_data_structures.md create mode 100644 src/plugins/intel_gpu/docs/execution_of_inference.md create mode 100644 src/plugins/intel_gpu/docs/gpu_debug_utils.md create mode 100644 src/plugins/intel_gpu/docs/gpu_kernels.md create mode 100644 src/plugins/intel_gpu/docs/gpu_memory_formats.md create mode 100644 src/plugins/intel_gpu/docs/gpu_plugin_driver_troubleshooting.md create mode 100644 src/plugins/intel_gpu/docs/gpu_plugin_ops_enabling.md create mode 100644 src/plugins/intel_gpu/docs/gpu_plugin_unit_test.md create mode 100644 src/plugins/intel_gpu/docs/graph_optimization_passes.md create mode 100644 src/plugins/intel_gpu/docs/memory_allocation_gpu_plugin.md create mode 100644 src/plugins/intel_gpu/docs/simplified_workflow.md create mode 100644 src/plugins/intel_gpu/docs/source_code_structure.md diff --git a/src/common/snippets/README.md b/src/common/snippets/README.md new file mode 100644 index 00000000000..eca770a584c --- /dev/null +++ b/src/common/snippets/README.md @@ -0,0 +1,13 @@ +# SnippetS + +## Key Contacts + +Please contact a member of [openvino-ie-cpu-maintainers](https://github.com/orgs/openvinotoolkit/teams/openvino-ie-cpu-maintainers) group, for assistance regarding snippets. + +* [SnippetS design guide](./docs/snippets_design_guide.md) +* [CPU target for SnippetS code generator](./docs/snippets_cpu_target.md) + +## See also + * [OpenVINO™ README](../../../README.md) + * [OpenVINO Core Components](../../README.md) + * [Developer documentation](../../../docs/dev/index.md) \ No newline at end of file diff --git a/src/common/snippets/docs/snippets_cpu_target.md b/src/common/snippets/docs/snippets_cpu_target.md new file mode 100644 index 00000000000..04b70f7df87 --- /dev/null +++ b/src/common/snippets/docs/snippets_cpu_target.md @@ -0,0 +1,57 @@ +# CPU target for SnippetS code generator + +Snippets in its first generation can be seen as a generalization over generic eltwise node. First generation of snippets has lack of integration with oneDNN and so patterns it supports should be kept orthogonal to what is fused with post-ops. + +POC CPU implementation could be found [here](https://github.com/openvinotoolkit/openvino/pull/2824) + +First 8 kernel parameters are passed by structure which is unpacked inside a kernel into the registers. The rest are passed through the stack. + +Loop trip count should be placed to some GP register, as well as work amount. Moreover, we need to load all the parameters into GP registers. If we assume that we have enough registers than it can be done before the loop body. + +``` +auto param0 = abi_params[0]; +auto param1 = abi_params[1]; +auto result = abi_params[2]; + +auto work_amount = abi_params[3]; +``` + +## Memory operations + +Load could be Vector, Scalar and Broadcast. Only native vector size for an architecture is supported (e.g. 16 on AVX-512) + +Memory operation also generates post increments for the pointer it uses. + +- `MemoryEmitter` + - `StoreEmitter` + - `ScalarStoreEmitter` + - `LoadEmitter` (post increment) + - `BroadcastLoadEmitter` + - `ScalarLoadEmitter` (post increment) + +## Tensor blocking + +All inputs and outputs should be the same layout. 
Re-layout operations are not included in the snippets dialect. Since current scope is limited to layout-oblivious operations no specific handling for blocking is required. Extending dialect with re-layout operations is a subject of further benchmarking. The following memory representation is assumed. + +``` + offset domain margin ++-------+-------------------------------+----------+ +| | | | +| | | | +| | | | +| | | | ++-------+-------------------------------+----------+ +``` + +Tensor data can be passed with strides. + +## Data section + +`Data` corresponds to a constant table and wraps this entity for the CPU. + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO SnippetS](../README.md) + * [OpenVINO Core Components](../../../README.md) + * [Developer documentation](../../../../docs/dev/index.md) + \ No newline at end of file diff --git a/src/common/snippets/docs/snippets_design_guide.md b/src/common/snippets/docs/snippets_design_guide.md new file mode 100644 index 00000000000..01b005b20e4 --- /dev/null +++ b/src/common/snippets/docs/snippets_design_guide.md @@ -0,0 +1,301 @@ +# SnippetS design guide +This document describes the design and rationale for snippets code generator. Implementation of code functionality is located [here](https://github.com/openvinotoolkit/openvino/tree/master/src/common/snippets). Proposal for CPU backend integration is [here](https://github.com/openvinotoolkit/openvino/pull/2824). + +## Rationale + +We believe that core **CNN operators (convolution, gemm, fully connected) are limited by compute, the rest is memory bound**. Math approximations (like transcendental functions) are rare in emerging workloads and could be treated with the same machinery. **Snippets are designed to optimize topology for memory**, while leaving compute intensive kernels for backend developers. + +We believe **potential speedup is proportional to shrink in memory-walked bytes**. So we can transform the problem to a task to optimize for memory walks, whatever pattern snippet has and operations it contains. Number of memory walks should be less or equal to handcrafted optimizations. This guarantees performance improvements over the previous approach (excluding corner cases caused by cache effects). *Shrinkage factor might be encoded to some cost function in future evolution of code generator*. Snippets generator provides diagnostics to estimate this shrinkage factor with `ngraph::snippets::op::Subgraph::print_statistics(bool verbose)` member. + +We design SnippetS generator for back-end developers. The main purpose of inventing snippets code generator is an **operator fusion**, **register allocation** and **target kernel generation** decomposition. This allows modifications (like new fusion support) and feature extensions (like new operation support) to be done in a single point of modification and avoid combinatorial explosion for fusions/types/architectures etc. + +We believe that creating a full-fledged compiler or usage of existing compiler infrastructure (like LLVM & MLIR) is superfluous at this point of evelition. We aim to provide a **flexible and performant framework for operation fusions**, leaving micro optimizations (e.g. instruction scheduling) to the backend H/W. + +We do not aim to invent a DSL for SnippetS and would like to keep it this way. DSL gives users more flexibility to express uncommon operations. 
However, the shift towards an approach to encode topologies with elementary operations followed by smart enough fusions is already expressive and performant enough. + +**Snippet** is a compiled compute **kernel** generated from a subgraph using SnippetS code generator for specific architecture with a **scheduling domain**. Using this scheduling domain and calling convention backend can execute generated compute kernels. For the first generation, snippets are **statically scheduled towards the output domain**. Multi-output snippets are supported if all outputs are broadcast-compatible in a sense that domains for all outputs can be broadcasted from one root domain which defines snippet schedule. It’s a subject of extension for future generations. + +We use nGraph as the highest level IR for subgraph representation and lowering transformations. **Opset1** is a base operation set for code generation. We aim to **keep the minimal possible and sufficient operation set** (or ISA) and keep it **RISC-like** (memory and compute decomposed). + +**One subgraph corresponds to one snippet**. Operations which cannot be scheduled by a single schedule should not be placed in the same subgraph. Snippet somewhat conceptually close to OpenCL kernel without a restriction to express only embarrassingly parallel tasks. +**Subgraph** once extracted from full topology IR is **treated as an operation and data flow descriptor in scalar notation** (similar to OpenCL/CUDA). Tensor sizes are used for defining scheduling domain and detecting broadcasts/reductions. + +We split operations into 3 groups: **layout-oblivious (LOO), layout-aware(-tolerant) and layout-dependent**. **Layout-oblivious** operation semantics and implementation are completely agnostic to a specific layout in which tensors are placed in memory. For example, elements-wise math and ReLU does in this category. Implementation **layout-aware** operation depends on the layout of input/output tensors. For example, convolutions and other block-wise kernels or layout repaks. For **layout-specific** operation semantics and implementation depends on the layout. For example, the Yolo region. Patterns to fuse constructed in terms of taxonomy above. + +## Design + +Code generation is split into 2 phases, **tokenization** and **lowering**. + +### Tokenization + +Tokenization runs on full topology nGraph function inside a specific plugin in a stage of common transformations. Input of tokenization is a topology graph. Output is a modified topology graph with `ngraph::snippets::op::Subgraph` operations installed. Each subgraph contains nGraph function (called **body**) which holds a part of original topology legal for snippet generation (can be scheduled with a single schedule) + +Procedure of finding subgraphs suitable for code generation is called **tokenization**, meaning that we split the topology tree into subgraphs in the same greedy approach which is used for parsing input stream of characters into the tokens. It also could be seen as and modified into a basic block construction problem, since we also find a leader and potentially terminators. Implementation can be found [here](https://github.com/openvinotoolkit/openvino/blob/master/src/common/snippets/src/pass/collapse_subgraph.cpp). + +Tokenization has an advantage over the pattern matching approach (used in traditional and MLIR-based compilers) since it can handle arbitrary patterns of operations. 
Pattern matching deduces specific configuration of operations to translate to another one, more suitable for target machine or further lowering. This means that relations between operations are fixed. Tokenization on the other hand has the only limitation on specific operation types which are **suitable and profitable** to fuse with respect to original topology correctness (keeping it as a direct acyclic graph). + +The extracted body comes to a plug-in wrapped as a composite `Subgraph` operation which is seen as a block box from a plugin standpoint and can participate in any plugin specific subroutines (e.g. layout assignment, memory allocation, etc.). + +### Supported subgraph patterns + +Subgraph accepts arbitrary numbers of inputs and outputs. There is 1:1 mapping for external (subgraph node’s) and internal (body) parameters indexes. + +Pattern here is an exact subgraph configuration (nodes and edges between them). **The first generation of snippets supports only layout-oblivious operations which may have broadcast on inputs and broadcast-compatible outputs**. For example Shapes `<1, 42, 17, 31>`, `<1, 42, 17, 1>` and `<1, 42, 1, 31>` are considered as broadcast-compatible. Layout-oblivious operation with multiple outputs as a snippet leader and forms a new subgraph. The most beneficial patterns are subgraphs with complex control flow but minimal number of inputs/and outputs. For example, GeLU has a 5x shrinkage factor from original unfused subgraph in number of bytes walked. Subgraph below could be considered as an example of such a subgraph. Leader detection procedure aims to find such subgraphs. + +```mermaid + flowchart LR + nodeA1(...) --> nodeA2(Add) + nodeA2(Add) --> nodeA3(Add) + nodeA2(Add) --> nodeA5(Multiply) + nodeA3(Add) --> nodeA4(Clamp) + nodeA4(Clamp) --> nodeA5(Multiply) + nodeA5(Multiply) --> nodeA6(...) +classDef no-bg-color fill:none,stroke-width:0px +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +class nodeA1,nodeA6 no-bg-color +class nodeA2,nodeA3 daisy1 +class nodeA4,nodeA5 steel1 +class nodeA3 steel1 +``` + +Operations are greedily added to the subgraph until +1. New operation doesn’t introduce a loop in a topology function. +1. Number of inputs and outputs satisfies target criteria. +1. Operation is not a predecessor of topology output. +1. Resulting subgraph can be scheduled (all outputs are broadcast-compatible). + +If a potential subgraph doesn’t meet any of criteria above, the procedure continues to find a new leader. + +### Lowering + +Lowering is a sequence of subgraph (snippet body) traversal passes to generate a compute kernel out of subgraphs of operations extracted by tokenization. + +1. Common optimizations +1. Canonicalization + 1. Domain normalization + 1. Conversion to snippets dialect +1. Target-specific optimizations +1. Register allocation +1. Schedule generation +1. Target code emission + +#### Common optimizations + +Constants are treated as inputs for a subgraph with an exception for scalar cases (since we don’t need to schedule them). `snippets::op::Scalar` is used to represent this kind of constants. + +If such Scalar comes as a second input of Power operation, it’s replaced with `snippets::op::PowerStatic`. + +#### Canonicalization + +The goal of this step is to apply target independent and schedule related optimizations and to make snippet **schedulable**. 
+ +##### Domain normalization + +All input and output shapes are normalized to 6D for future schedule generation. If shape propagation fails or leads to inconsistent output shapes an exception is raised. + +Layout assigned by user code and passed to a `generate` function is propagated through subgraph on this step as well. Layout is passed to a generate function as a `BlockedShapeVector` which is a `std::vector` , while `BlockedShape` is `std::tuple`. For example, if backend supports `NCHW16c` layout and tensor has size of `<1, 42, 17, 31>` and hold single precision floating point this structure should be `std::make_tuple(ngraph::Shape {1, 3, 17, 31, 16}, ngraph::AxisVector {0, 1, 2, 3, 1}, ngraph::element::f32);`. This allows generic layout representation. + +##### Dialect conversion + +The goal for this step is to transform a subgraph (body function) into a form possible to code generation. Input for this step is subgraph in a canonical form output is a subgraph in snippets dialect. + +Snippet or kernel is formed around the subgraph body in a sequence of traversal steps. Let’s walk through these steps with the smallest possible subgraph which contains out of single `[Add]` operation. + +While we extract subgraphs with the tokenization part we explicitly insert Parameters and Results to its body to form a complete nGraph Function. + +```mermaid +flowchart LR + nodeA1(Parameter) --> nodeA2(Add) + nodeA3(Parameter) --> nodeA2(Add) + nodeA2(Add) --> nodeA5(Result) +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +class nodeA2 daisy1 +class nodeA5 moss1 +class nodeA8 steel1 +class nodeA1,nodeA3 steel1 +``` + +This function represents operation dependencies in scalar (similar to OpenCL) notation while shapes of tensors are used to generate schedules. At this point kernel-schedule decomposition is made (similar to Halide/OpenCL/TVM) + +###### Explicit memory operations + +As a next step explicit memory operations are placed for each input and output. `InsertLoad` and `InsertStore` passes derived from `MatcherPass`. + +```mermaid +flowchart LR + nodeA1(Parameter) --> nodeA6(Load) + nodeA6(Load) --> nodeA2(Add) + nodeA3(Parameter) --> nodeA7(Load) + nodeA7(Load) --> nodeA2(Add) + nodeA2(Add) --> nodeA8(Store) + nodeA8(Store) --> nodeA5(Result) +classDef carbon1 fill:#E9E9E9, stroke: #AEAEAE, color: #262626 +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +class nodeA2 daisy1 +class nodeA5 moss1 +class nodeA8 carbon1 +class nodeA1,nodeA3,nodeA6,nodeA7 steel1 +``` + +By default, memory operations assumes vector memory access, if scalar access is needed special passes `ReplaceLoadsWithScalarLoads` and `ReplaceStoresWithScalarStores` should be executed. + +###### Explicit broadcast + +For each operation in body function inputs are checked against broadcasting. In case of parameters to be broadcasted explicit broadcast operation is generated. 
For example, if for the subgraph above we have `<1, 42, 17, 31>` and `<1, 42, 17, 1>` resulting subgraph is going to be + +```mermaid +flowchart LR + nodeA1("Parameter\n<1, 42, 17, 1>") --> node6("Load\n<1, 42, 17, 1>") + node6("Load\n<1, 42, 17, 1>") --> nodeA9("BroadcastMove\n<1, 42, 17, 31>") + nodeA9("BroadcastMove\n<1, 42, 17, 31>") --> nodeA2(Add) + nodeA3("Parameter\n<1, 42, 17, 31>") --> nodeA7("Load\n<1, 42, 17, 31>") + nodeA7("Load\n<1, 42, 17, 31>") ---> nodeA2(Add) + nodeA2(Add) --> nodeA8("Store\n<1, 42, 17, 31>") + nodeA8("Store\n<1, 42, 17, 31>") --> nodeA5("Result\n<1, 42, 17, 31>") +classDef carbon1 fill:#E9E9E9, stroke: #AEAEAE, color: #262626 +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +class nodeA2 daisy1 +class nodeA5 moss1 +class nodeA8,nodeA9 carbon1 +class nodeA1,nodeA3,node6,nodeA7 steel1 +``` + +If load followed by broadcast is detected then this pair is replaced by a single Broadcast load instruction. Like the following + +```mermaid +flowchart LR + nodeA1(Parameter) --> nodeA6(BroadcastLoad) + nodeA6(BroadcastLoad) --> nodeA2(Add) + nodeA3(Parameter) --> nodeA7(Load) + nodeA7(Load) --> nodeA2(Add) + nodeA2(Add) --> nodeA8(Store) + nodeA8(Store) --> nodeA5(Result) +classDef carbon1 fill:#E9E9E9, stroke: #AEAEAE, color: #262626 +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +class nodeA2 daisy1 +class nodeA5 moss1 +class nodeA8 carbon1 +class nodeA1,nodeA3,nodeA6,nodeA7 steel1 +``` + +Broadcast and regular streaming vector load is possible from the same pointer. Broadcast load should always go before streaming load. Broadcast load for non the most varying dimension is not generated, however it affects the generated schedule. + +#### Target-specific optimizations + +Target developers can plug in to the code generation pipeline some specific optimizations with passing `ngraph::pass::Manager` into `generate` function of `subgraph`. **Passes are executed on subgraph in canonical form converted to a snippet dialect**. + +*It might be also extended to provide an interface for target independent optimizations in future* + +#### Register allocation + +Canonicalized subgraph in a snippets dialect forms a basic block or region inside a snippet (kernel). Registers are allocated globally for the whole subgraph. Since all operations for a subgraph are assumed to be vector, only vector registers are allocated for the first generation of SnippetS. Linear scan register allocation algorithm is used. Register allocator is implemented as a function pass `ngraph::snippets::pass::AssignRegisters` and store allocated registers for each node into `rt_info`. `rt_info` for a node holds a register for Node's output. *However, this part should be refactored batter, either to become target independent or use target specific abstraction to acquire a new register* + +#### Schedule generation + +The goal of this step is to transform subgraphs in a scalar notation into kernel functions callable from user code. `Kernel` and `Tile` operations are introduced for this purpose. Each of this operation has a constructor from code region described as a collection of operation and operands pairs `Kernel(const std::vector, ngraph::snippets::RegInfo>>& region);`. 
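The inline `Kernel` signature above lost its template arguments during conversion. A hedged reconstruction (assuming `RegInfo` is the pair of input/output register index vectors produced by `AssignRegisters`) looks roughly as follows:

```cpp
// Assumed reconstruction of the constructor referenced above, not a verbatim copy.
// RegInfo is taken to be the registers assigned to a node: {input regs, output regs}.
using RegInfo = std::pair<std::vector<size_t>, std::vector<size_t>>;

// A code region is the list of lowered operations (as Emitters) together with
// their allocated registers; Kernel (and, analogously, Tile) is built from it.
Kernel(const std::vector<std::pair<std::shared_ptr<Emitter>, RegInfo>>& region);
```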
+ +If we return to example above this comes to a following hierarchical IR. If we limit scope to layout oblivious operations with broadcasting support, tile could be generated as a single loop over the most warning dimension. The second `Tile` is generated to handle tails and can be omitted if not needed. Special pass replaces memory operations on vector to scalar versions for tail subgraph. + +```mermaid +graph LR +subgraph subgraphD1[ ] +nodeD1(Data) +end +subgraph subgraphC1[Kernel] +direction LR +subgraph subgraphA1[Tile] +nodeA1(Parameter) --> nodeA6(Load) +nodeA6(Load) --> nodeA2(Add) +nodeA3(Parameter) --> nodeA7(Load) +nodeA7(Load) --> nodeA2(Add) +nodeA2(Add) --> nodeA8(Store) +nodeA8(Store) --> nodeA5(Result) +end +subgraph subgraphB1[Tile] +nodeB1(Parameter) --> nodeB6(ScalarLoad) +nodeB6(ScalarLoad) --> nodeB2(Add) +nodeB3(Parameter) --> nodeB7(ScalarLoad) +nodeB7(ScalarLoad) --> nodeB2(Add) +nodeB2(Add) --> nodeB8(ScalarStore) +nodeB8(ScalarStore) --> nodeB5(Result) +end +end +classDef no-stroke fill:none,stroke-width:0px +classDef no-bg-color fill:none,stroke-width:1px,stroke:#86B3CA +classDef carbon1 fill:#E9E9E9, stroke: #AEAEAE, color: #262626 +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +class subgraphC1,subgraphA1,subgraphB1,subgraphD1 no-bg-color +class nodeA2,nodeB2 daisy1 +class nodeA5,nodeB5 moss1 +class nodeA8,nodeB8 carbon1 +class nodeA1,nodeA3,nodeA6,nodeA7,nodeB1,nodeB3,nodeB6,nodeB7 steel1 +class nodeD1 no-stroke +``` + +Where +* `Kernel` constants a collection of the tiles, corresponds to a Subgraph node and responsible for function signature generation, calls generators for all tiles and data sections +* `Tile` contains single subgraph body, vector or scalar +* `Data` corresponds to data section aggregated for all nodes in all Tile’s subgraphs + +#### Target code emission + +Target code emission is table based. Target is responsible for filling `jitters` table field in `Generator` class. + +``` +std::map(std::shared_ptr)>> jitters; +``` + +##### Interface with a target + +An OpenVINO plugin is treated as a target for snippets. + +Each nGraph node is mapped to a convertor function which creates `Emitter` form this node. Each specific emitter should extend from `Emitter`. It is used to map this node to target code and has `emit_code` and `emit_data` methods. `emit_data` is used during data section generation. All operations from snippets dialect which are legal for code generation should be expressed as operations derived from nGraph Op as well as Emitter derived snippets::Emitter class which knows how to translate this Op to Target specific ISA. (ex. xbyak is a jit backend for CPU plugin). + +For minimal code generator support target should provide emitters for the following operations + +* `Kernel` +* `Tile` +* `Data` +* `Load` +* `ScalarLoad` +* `BroadcastLoad` +* `Store` +* `ScalarStore` + +Once a schedule is generated, target code is emitted from a kernel in Generator::generate method by executing Kernel::emit_code function. Since Kernel and Tile represents hierarchical + +##### Dialect extensibility + +Target can potentially extend snippets dialect with target specific operation for code emission. It should implement: + +* nGraph operation (ex. `class FMA : public ngraph::op::Op`) +* Emitter for this operation (ex. 
`class FmaEmitter : public Emitter` ) +* register this pair in `jitters` map + +### Calling convention + +Parameters for a generated snippet are split into schedule-invariant and schedule-dependent. Schedule-invariant parameters include pointers to input/output tensors and strides for each of them with the same rank as scheduling domain. + +### Diagnostics + +#### Reference mode + +Subgraph can be executed with nGraph references if no generator is present. + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO SnippetS](../README.md) + * [OpenVINO Core Components](../../../README.md) + * [Developer documentation](../../../../docs/dev/index.md) + diff --git a/src/plugins/README.md b/src/plugins/README.md index d5c6dfa63d1..dacb94bce36 100644 --- a/src/plugins/README.md +++ b/src/plugins/README.md @@ -4,13 +4,13 @@ OpenVINO Plugins provide support for hardware devices. The list of supported plugins: - * [auto](./auto) + * [auto](./auto/README.md) * [auto_batch](./auto_batch) - * [hetero](./hetero) - * [intel_cpu](./intel_cpu) + * [hetero](./hetero/README.md) + * [intel_cpu](./intel_cpu/README.md) * [intel_gna](./intel_gna) - * [intel_gpu](./intel_gpu) - * [template](./template) + * [intel_gpu](./intel_gpu/README.md) + * [template](./template/README.md) ## See also * [OpenVINO™ README](../../README.md) diff --git a/src/plugins/hetero/README.md b/src/plugins/hetero/README.md new file mode 100644 index 00000000000..952013815bc --- /dev/null +++ b/src/plugins/hetero/README.md @@ -0,0 +1,141 @@ +# OpenVINO Hetero plugin design overview + +## Subgraphs selection + +Algorithm: + +For each plugin +1. Select *root* node + * Node not in subgraph previously constructed + * Affinity is equal to plugin name +2. Select adjacent node to any node in already subgraph which is not in rejected list + * if there are no such nodes **end** +3. Check selected node has same affinity +4. Add node to subgraph if check was successful or add to rejected list otherwise +5. Check global condition + * Nodes in rejected list can never be added to subgraph + * Nodes not in subgraph and not in rejected list can possibly be added later + * Check subgraph topology (the only check now is there are no indirect subgraph self-references) +6. If global condition was failed remove last node from subgraph, add it to rejected list and go to step 5 + * we can rollback multiple times here because rejected list is changed every time +7. Go to step 2 + +Example: +```mermaid +graph TD; + 1-->2; + 2-->3; + 2-->4; + 3-->5; + 4-->5; + 5-->6; + 6-->7; +``` + +Nodes [1,2,3,5,6,7] are supported in plugin, [4] is not + +Possible roots: [1,2,3,5,6,7] +1. Select root [1] + * Subgraph: [1] + * Rejected: [] + * Global condition: ok +2. Merge [2] + * Subgraph: [1,2] + * Rejected: [] + * Global condition: ok +3. Merge [3] + * Subgraph: [1,2,3] + * Rejected: [] + * Global condition: ok +4. Merge [5] + * Subgraph: [1,2,3,5] + * Rejected: [] + * Global condition: There is possible self-references through node [4] but we do not know yet, ok +5. Merge [6] + * Subgraph: [1,2,3,5,6] + * Rejected: [] + * Global condition: There is possible self-references through node [4] but we do not know yet, ok +6. Merge [7] + * Subgraph: [1,2,3,5,6,7] + * Rejected: [] + * Global condition: There is possible self-references through node [4] but we do not know yet, ok +7. Failed to merge [4] + * Subgraph: [1,2,3,5,6,7] + * Rejected: [4] + * Global condition: There is self-references through node [4], reject +8. 
Rollback [7] + * Subgraph: [1,2,3,5,6] + * Rejected: [4,7] + * Global condition: There is self-references through node [4], reject +9. Rollback [6] + * Subgraph: [1,2,3,5] + * Rejected: [4,6,7] + * Global condition: There is self-references through node [4], reject +10. Rollback [5] + * Subgraph: [1,2,3] + * Rejected: [4,5,6,7] + * Global condition: ok +11. There are nodes to merge **end** + +Possible roots: [5,6,7] +1. Select root [5] + * Subgraph: [5] + * Rejected: [] + * Global condition: ok +2. Merge [6] + * Subgraph: [5,6] + * Rejected: [] + * Global condition: ok +3. Merge [7] + * Subgraph: [5,6,7] + * Rejected: [] + * Global condition: ok +4. Merge [3] + * Subgraph: [3,5,6,7] + * Rejected: [] + * Global condition: ok +5. Merge [2] + * Subgraph: [2,3,5,6,7] + * Rejected: [] + * Global condition: There is possible self-references through node [4] but we do not know yet, ok +6. Failed to merge [4] + * Subgraph: [2,3,5,6,7] + * Rejected: [4] + * Global condition: There is self-references through node [4], reject +7. Rollback [2] + * Subgraph: [3,5,6,7] + * Rejected: [2,4] + * Global condition: ok +8. There are nodes to merge **end** + +Possible roots: [] no roots, **END** + +Subgraphs: [1,2,3], [3,5,6,7] + +Select best subgraph: +* When we have multiple subgraphs larger ([3,5,6,7]) is always selected, always + +Repeat previous steps with remaining nodes [1,2] + +The final result is: +* First plugin: [3,5,6,7], [1,2] +* Second plugin: [4] + + +## Subgraphs self reference detection + +1. For each node in network build a list of reachable node (transitive closure) +2. For each pair of nodes in subgraph find `path` nodes (nodes through one node in pair reachable to other) + * assume `src` - one node in pair, `dst` - other node in pair + * get all nodes reachable from `src` + * in those nodes find nodes through you can reach `dst` those will be our `path` node +3. Results for pairs is cached. +4. Check if there intersection between `path` nodes set and rejected nodes set for each nodes pair in subgraph +5. In case of intersection we have a self-reference and subgraph is invalid + +## See also + * [OpenVINO™ README](../../../README.md) + * [OpenVINO Core Components](../../README.md) + * [OpenVINO Plugins](../README.md) + * [Developer documentation](../../../docs/dev/index.md) + \ No newline at end of file diff --git a/src/plugins/intel_cpu/README.md b/src/plugins/intel_cpu/README.md new file mode 100644 index 00000000000..f7afe70ab15 --- /dev/null +++ b/src/plugins/intel_cpu/README.md @@ -0,0 +1,29 @@ +# OpenVINO Intel CPU Plugin + +## Key Contacts + +Please contact a member of [openvino-ie-cpu-maintainers](https://github.com/orgs/openvinotoolkit/teams/openvino-ie-cpu-maintainers) group, for assistance regarding CPU. + +## Components + +CPU Plugin contains the following components: + +* [docs](./docs/) - contains developer documentation pages for the component. +* [src](./src/) - folder contains sources of the core component. +* [tests](./tests/) - contains tests for OpenVINO Plugin components. +* [thirdparty](./thirdparty/) - contains third-party modules. +* [tools](./tools/) - contains tools and helpers for OpenVINO Plugin components. 
+ +## Tutorials + +* [Debug capabilities](./docs/debug_capabilities.md) +* [Performance analysis using ITT counters](./docs/performance_analysis_ITT_counters.md) +* [Intel Software Development Emulator (CPU emulation)](./docs/cpu_emulation.md) +* [Runtime parameters cache](./docs/runtime_parameters_cache.md) +* [Internal CPU Plugin Optimizations](./docs/internal_cpu_plugin_optimization.md) + +## See also + * [OpenVINO™ README](../../../README.md) + * [OpenVINO Core Components](../../README.md) + * [OpenVINO Plugins](../README.md) + * [Developer documentation](../../../docs/dev/index.md) \ No newline at end of file diff --git a/src/plugins/intel_cpu/docs/cpu_emulation.md b/src/plugins/intel_cpu/docs/cpu_emulation.md new file mode 100644 index 00000000000..d431eda5d0d --- /dev/null +++ b/src/plugins/intel_cpu/docs/cpu_emulation.md @@ -0,0 +1,37 @@ +# Intel Software Development Emulator + +Intel SDE can be used for emulating CPU architecture, checking for AVX/SSE transitions, bad pointers and data misalignment, etc. + +Also supports debugging within emulation. + +In general the tool can be used for all kind of troubleshooting activities except performance analysis. + +See [Documentation](https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html) for more information + +## Usage examples: + +- Emulating Sapphire Rapids CPU for _benchmark_app_ together with blob dumping, for example to debug some accuracy issue: + +```sh +OV_CPU_BLOB_DUMP_FORMAT=TEXT OV_CPU_BLOB_DUMP_NODE_TYPE=Convolution \ +/path/to/sde -spr -- ./benchmark_app --niter 1 --nstreams 1 -m path/to/model.xml +``` + +- Running _cpuFuncTests_ on some old architecture, for example Sandy Bridge: + +`/path/to/sde -snd -- ./cpuFuncTests` + +- Count AVX/SSE transitions for the current host: + +`/path/to/sde -ast -- ./benchmark_app -m path/to/model.xml` + +> **NOTE**: Best way to check for AVX/SSE transitions is to run within Alder Lake emulation: + +`/path/to/sde -adl -- ./benchmark_app -m path/to/model.xml` + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO CPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) diff --git a/src/plugins/intel_cpu/docs/debug_capabilities.md b/src/plugins/intel_cpu/docs/debug_capabilities.md new file mode 100644 index 00000000000..6ae506fb4f6 --- /dev/null +++ b/src/plugins/intel_cpu/docs/debug_capabilities.md @@ -0,0 +1,21 @@ +# CPU Plugin debug capabilities + +The page describes list of useful debug features, controlled by environment variables. + +They can be activated at runtime and might be used for analyzing issues, getting more context, comparing execution results, etc. + +To have CPU debug capabilities available at runtime the following CMake option should be used when building the plugin: +* `ENABLE_DEBUG_CAPS`. 
Default is `OFF` + +The following debug capabilities are available with the latest openvino: + +- [Verbose mode](../src/docs/verbose.md) +- [Blob dumping](../src/docs/blob_dumping.md) +- [Graph serialization](../src/docs/graph_serialization.md) + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) \ No newline at end of file diff --git a/src/plugins/intel_cpu/docs/internal_cpu_plugin_optimization.md b/src/plugins/intel_cpu/docs/internal_cpu_plugin_optimization.md new file mode 100644 index 00000000000..377792a6dc9 --- /dev/null +++ b/src/plugins/intel_cpu/docs/internal_cpu_plugin_optimization.md @@ -0,0 +1,223 @@ +# Internal CPU Plugin Optimizations + +The CPU plugin supports several graph optimization algorithms, such as fusing or removing layers. +Refer to the sections below for details. + +> **NOTE**: For layer descriptions, see the [IR Notation Reference](https://docs.openvino.ai/latest/openvino_docs_ops_opset.html). + + +## Fusing Convolution and Simple Layers + +Merge of a convolution layer and any of the simple layers listed below: +- Activation: ReLU, ELU, Sigmoid, Clamp +- Depthwise: ScaleShift, PReLU +- FakeQuantize + +> **NOTE**: You can have any number and order of simple layers. + +A combination of a convolution layer and simple layers results in a single fused layer called +*Convolution*: + +```mermaid +flowchart TD + subgraph subgraphA1[Runtime Graph] + direction TB + nodeA1(Input) --> nodeA2(Convolution) + nodeA2(Convolution) --> nodeA3(Output) + end + subgraph subgraphB1[Original Graph] + direction TB + nodeB1(Input) --> nodeB2(Convolution) + nodeB2(Convolution) --> nodeB3(Simple Layer) + nodeB3(Simple Layer) --> nodeB4(...) + nodeB4(...) 
--> nodeB5(Simple Layer) + nodeB5(Simple Layer) --> nodeB6(Output) + end +classDef no-bg-color fill:none,stroke-width:0px +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +class subgraphA1,subgraphB1,nodeB4 no-bg-color +class nodeA2 daisy1 +class nodeB1,nodeB6,nodeA1,nodeA3 moss1 +class nodeB2,nodeB3,nodeB5, steel1 +``` + +## Fusing Pooling and FakeQuantize Layers + +A combination of Pooling and FakeQuantize layers results in a single fused layer called *Pooling*: + +```mermaid +flowchart TD + subgraph subgraphA1[Runtime Graph] + direction TB + nodeA1(Input) --> nodeA2(Pooling) + nodeA2(Pooling) --> nodeA3(Output) + end + subgraph subgraphB1[Original Graph] + direction TB + nodeB1(Input) --> nodeB2("Pooling [Average]") + nodeB2("Pooling [Average]") --> nodeB3(Fake Quantize) + nodeB3(Fake Quantize) --> nodeB4(Output) + end +classDef no-bg-color fill:none,stroke-width:0px +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +class subgraphA1,subgraphB1 no-bg-color +class nodeA2 daisy1 +class nodeB1,nodeB4,nodeA1,nodeA3 moss1 +class nodeB2,nodeB3 steel1 +``` +## Fusing FullyConnected and Activation Layers + +A combination of FullyConnected and Activation layers results in a single fused layer called +*FullyConnected*: + +```mermaid +flowchart TD + subgraph subgraphA1[Runtime Graph] + direction TB + nodeA1(Input) --> nodeA2(FullyConnected) + nodeA2(FullyConnected) --> nodeA3(Output) + end + subgraph subgraphB1[Original Graph] + direction TB + nodeB1(Input) --> nodeB2(FullyConnected) + nodeB2(FullyConnected) --> nodeB3("Activation [ReLU]") + nodeB3("Activation [ReLU]") --> nodeB4(Output) + end +classDef no-bg-color fill:none,stroke-width:0px +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +class subgraphA1,subgraphB1 no-bg-color +class nodeA2 daisy1 +class nodeB1,nodeB4,nodeA1,nodeA3 moss1 +class nodeB2,nodeB3 steel1 +``` +## Fusing Convolution and Depthwise Convolution Layers Grouped with Simple Layers + +> **NOTE**: This pattern is possible only on CPUs with support of Streaming SIMD Extensions 4.2 +> (SSE 4.2) and Intel AVX2 Instruction Set Architecture (ISA). + +A combination of a group of a Convolution (or Binary Convolution) layer and simple layers and a group of a Depthwise Convolution +layer and simple layers results in a single layer called *Convolution* (or *Binary Convolution*): +> **NOTE**: Depthwise convolution layers should have the same values for the `group`, input channels, and output channels parameters. + +```mermaid +flowchart TD + subgraph subgraphA1[Runtime Graph] + direction TB + nodeA1(Input) --> nodeA2(Convolution) + nodeA2(Convolution) --> nodeA3(Output) + end + subgraph subgraphB1[Original Graph] + direction TB + nodeB1(Input) --> nodeB2(Convolution) + nodeB2(Convolution) --> nodeB3(Simple Layer) + nodeB3(Simple Layer) --> nodeB4(...) + nodeB4(...) --> nodeB5(Simple Layer) + nodeB5(Simple Layer) --> nodeB6(Depthwise \n Convolution) + nodeB6(Depthwise \n Convolution) --> nodeB7(Simple Layer) + nodeB7(Simple Layer) --> nodeB8(...) + nodeB8(...) 
--> nodeB9(Simple Layer) + nodeB9(Simple Layer) --> nodeB10(Output) + end +classDef no-bg-color fill:none,stroke-width:0px +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +class subgraphA1,subgraphB1,nodeB4,nodeB8 no-bg-color +class nodeA2 daisy1 +class nodeB1,nodeA1,nodeA3,nodeB10 moss1 +class nodeB2,nodeB3,nodeB5,nodeB6,nodeB7,nodeB9 steel1 +``` +## Fusing Convolution and Sum Layers + +A combination of convolution, simple, and Eltwise layers with the sum operation results in a single layer called *Convolution*: + +```mermaid +flowchart TD + subgraph subgraphA1[Runtime Graph] + direction TB + nodeA1(Input) --> nodeA4(Any Layer) + nodeA4(Any Layer) --> nodeA2(Convolution) + nodeA5(Input2) ---> nodeA2(Convolution) + nodeA2(Convolution) --> nodeA3(Output) + end + subgraph subgraphB1[Original Graph] + direction TB + nodeB1(Input1) --> nodeB7(Any Layer) + nodeB7(Any Layer) -----> nodeB2("Eltwise[op=sum]") + nodeB8(Input) --> nodeB9(Convolution) + nodeB9(Convolution) --> nodeB10(Simple Layer) + nodeB10(Simple Layer) --> nodeB11(...) + nodeB11(...) --> nodeB12(Simple Layer) + nodeB12(Simple Layer) --> nodeB2("Eltwise[op=sum]") + nodeB2("Eltwise[op=sum]") --> nodeB3(Simple Layer) + nodeB3(Simple Layer) --> nodeB4(...) + nodeB4(...) --> nodeB5(Simple Layer) + nodeB5(Simple Layer) --> nodeB6(Output) + end +classDef no-bg-color fill:none,stroke-width:0px +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +classDef coral1 fill:#FFB6B9, stroke: #FF848A, color: #262626 +class subgraphA1,subgraphB1,nodeB4,nodeB11 no-bg-color +class nodeA2 daisy1 +class nodeB1,nodeA5,nodeA1,nodeA3,nodeB6,nodeB8 moss1 +class nodeB3,nodeB5,nodeA4,nodeB7,nodeB9,nodeB10,nodeB12 steel1 +class nodeB2 coral1 +``` +## Fusing a Group of Convolutions + +If a topology contains the following pipeline, a CPU plugin merges split, convolution, and concatenation layers into a single convolution layer with the group parameter: + +```mermaid +flowchart TD + subgraph subgraphA1[Runtime Graph] + direction TB + nodeA1(Input) --> nodeA2(Convolution) + nodeA2(Convolution) --> nodeA3(Output) + end + subgraph subgraphB1[Original Graph] + direction TB + nodeB1(Input) --> nodeB2(Split) + nodeB2(Split) --> nodeB6(Convolution1) + nodeB6(Convolution1) --> nodeB4(Concatenation) + nodeB2(Split) --> nodeB3(Convolution3) + nodeB2(Split) --> nodeB7(Convolution2) + nodeB7(Convolution2) --> nodeB4(Concatenation) + nodeB3(Convolution3) --> nodeB4(Concatenation) + nodeB4(Concatenation) --> nodeB5(Output) + + end +classDef no-bg-color fill:none,stroke-width:0px +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +classDef coral-tint-2 fill:#FFB6B9, stroke: #FF848A, color: #262626 +class subgraphA1,subgraphB1 no-bg-color +class nodeB4,nodeB2 coral-tint-2 +class nodeA2 daisy1 +class nodeB1,nodeA1,nodeA3,nodeB5 moss1 +class nodeB3,nodeB6,nodeB7 steel1 +``` +> **NOTE**: Parameters of the convolution layers must coincide. 
+ + +## Removing a Power Layer + +CPU plugin removes a Power layer from a topology if it has the following parameters: + - power = 1 + - scale = 1 + - offset = 0 + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) \ No newline at end of file diff --git a/src/plugins/intel_cpu/docs/performance_analysis_ITT_counters.md b/src/plugins/intel_cpu/docs/performance_analysis_ITT_counters.md new file mode 100644 index 00000000000..263b043dd12 --- /dev/null +++ b/src/plugins/intel_cpu/docs/performance_analysis_ITT_counters.md @@ -0,0 +1,57 @@ +# Performance analysis using ITT counters + +## Contents + +- [Introduction](#introduction) +- [Performance analysis](#performance-analysis) +- [Adding new ITT counters](#adding-new-itt-counters) + +## Introduction + +OpenVINO has a powerful capabilities for performance analysis of the key stages, such as read time and load time. Most of the modules and features have been tagged with [Intel ITT](https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/api-support/instrumentation-and-tracing-technology-apis.html) counters, which allows us to measure the performance of these components. + +## Performance analysis + +For performance analysis, follow the steps below: +1. Run the CMake tool with the following option: `-DENABLE_PROFILING_ITT=ON` and build OpenVINO. +2. Choose the tool for statistics collection using ITT counters. + 1. [Intel SEAPI](https://github.com/vladislav-volkov/IntelSEAPI) should be built from sources. See the [Readme](https://github.com/vladislav-volkov/IntelSEAPI/blob/master/README.txt) file for details. + 2. [Intel Vtune Profiler](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html) +3. Run OpenVINO project with performance analysis tool. + +### Intel SEAPI + +#### Example of tool run: +`python ~/tools/IntelSEAPI/runtool/sea_runtool.py -o trace -f gt ! ./benchmark_app -niter 1 -nireq 1 -nstreams 1 -api sync -m ./resnet-50-pytorch/resnest-50-pytorch.xml` + +#### Mandatory parameters: +* -o trace – output file name +* -f gt - statistics type to be generated (Google traces) + +#### Generated artifacts: +`trace.pid-21725-0.json` +Generated file can be opened with google chrome using "chrome://tracing" URL. + +### Intel Vtune Profiler + +#### Example of tool run: +`vtune -collect hotspots -k sampling-mode=hw -k enable-stack-collection=true -k stack-size=0 -k sampling-interval=0.5 -- ./benchmark_app -nthreads=1 -api sync -niter 1 -nireq 1 -m ./resnet-50-pytorch/resnet-50-pytorch.xml` + +#### Mandatory parameters: +* -collect hotspots + +#### Generated artifacts: +`r000hs` +Generated file can be opened with Vtune client. + +## Adding new ITT counters + +Use API defined in [openvino/itt](https://docs.openvinotoolkit.org/latest/itt_2include_2openvino_2itt_8hpp.html) module. 
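As an illustration (the domain and function names below are hypothetical, not part of the existing code base), a new counter is typically added by declaring an ITT domain once and wrapping the code to be measured in a scoped task:

```cpp
#include <openvino/itt.hpp>

// Hypothetical domain for the instrumented module; declare it once.
namespace my_module {
namespace itt {
namespace domains {
    OV_ITT_DOMAIN(MyModule);
}
}
}

void prepare_weights() {
    // Reports the time spent in this scope under the "prepare_weights" counter
    // when OpenVINO is built with -DENABLE_PROFILING_ITT=ON.
    OV_ITT_SCOPED_TASK(my_module::itt::domains::MyModule, "prepare_weights");
    // ... workload to be measured ...
}
```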
+ +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) + \ No newline at end of file diff --git a/src/plugins/intel_cpu/docs/runtime_parameters_cache.md b/src/plugins/intel_cpu/docs/runtime_parameters_cache.md new file mode 100644 index 00000000000..85ccea276cb --- /dev/null +++ b/src/plugins/intel_cpu/docs/runtime_parameters_cache.md @@ -0,0 +1,54 @@ +# CPU plugin runtime parameters cache + +## Checklist for the runtime cache implementation +1. Determine what data will be cached. We usually use the Executor concept that represents a junction of the executable code, usually JIT generated kernel, with some precomputed algorithm parameters. +2. Provide a key that uniquelly identifies the cached value as a funtion of dynamically changing parameters, i.e. shapes, dynamic input that determines the algorithm parameters, etc. To be used in a hash table, the key must have the following static interface: + ```cpp + struct KeyType { + size_t hash() const; + bool operator== () const; + }; + ``` +3. Provide a builder, that is, a callable object of the following signature: + ```cpp + ValueType build(const KeyType& key); + ``` + The `ValueType` is a type to be cached (e.g. shared pointer to Executor object). Remember that in the current cache implementation, a default constructed `ValueType()` object is considered empty, so it is better to use `std::shared_ptr` as the `ValueType`. The builder instantiates a specific type of cached entity from the `key`, thus the `key` completely defines the cached data. The builder is used to creat the `ValueType` object in case of cache miss. +4. Refactor the specific implementation of the `prepareParams()` method to extract the cached object construction logic (e.g. the algorithm parameters recalculation and JIT kernel generation) into the builder. +5. Add the key generation code into the `prepareParams()` method to query the cache. +6. Implement cache usage as the following: + ```cpp + void preapareParams() override { + ... //code that prepares parameters for the key + + //key instantiation + KeyType key = {param1, param2, ...}; + // get a reference to the cache + auto cache = getRuntimeCache(); + //query cahce, buildExecutor is the builder descibed in 3 + auto result = cache->getOrCreate(key, buildExecutor); + // get the the cached value, in this example it is a pointer to an executor + execPtr = result.first; + } + ``` +7. To provide smoke testing of these changes, add repeated shapes to the "target shapes" part of the corresponding single layer test definition: + ```cpp + { //dynamic case description each pair per each input has {{dynamic shape}, {{static shape case1}, {static shape case2}, ...} + {{-1, -1, -1}, {{10, 10, 10}, {5, 5, 5}, {10, 10, 10}}}, // input 0 + {{-1, -1, 5}, {{10, 10, 5}, {5, 5, 5}, {10, 10, 5}}} // input 1 + }, + ``` + It worth to mention that placing two identical target shapes one after another does not trigger the cache, since another optimization based on the fact that the shapes have not been changed takes place. 
For example, the following test definition does not properly test the cache: + ```cpp + { // the shape infer and params preparation stages will be skipped for the second target shapes combination since the shapes are not changed + {{-1, -1, -1}, {{5, 5, 5}, {5, 5, 5}}}, // input 0 + {{-1, -1, 5}, {{5, 5, 5}, {5, 5, 5}}} // input 1 + }, + ``` + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) \ No newline at end of file diff --git a/src/plugins/intel_gpu/README.md b/src/plugins/intel_gpu/README.md index ead8e7811d9..1b33bf8a3ec 100644 --- a/src/plugins/intel_gpu/README.md +++ b/src/plugins/intel_gpu/README.md @@ -1,9 +1,41 @@ - -### Attached licenses +# OpenVINO Intel GPU Plugin + +GPU plugin in [OpenVINO toolkit](https://github.com/openvinotoolkit/openvino) supports inference on Intel® GPUs starting from Gen8 architecture. + +## Key Contacts + +Please contact a member of [openvino-ie-gpu-maintainers](https://github.com/orgs/openvinotoolkit/teams/openvino-ie-gpu-maintainers) group, for assistance regarding GPU. + +## Components + +GPU Plugin contains the following components: + +* [docs](./docs/) - developer documentation pages for the component. +* [include](./include/) - public API. +* [src](./src/) - sources of the component. +* [tests](./tests/) - tests for OpenVINO Plugin component. +* [thirdparty](./thirdparty/) - third-party modules. + +## Tutorials + +* [Source code structure](./docs/source_code_structure.md) + * [Basic data structures of gpu graph and overall flow](./docs/basic_data_structures.md) + * [Memory allocation in GPU plugin](./docs/memory_allocation_gpu_plugin.md) +* [Simplified workflow](./docs/simplified_workflow.md) + * [Graph Optimization Passes](./docs/graph_optimization_passes.md) + * [Execution of Inference](./docs/execution_of_inference.md) +* [Memory formats](./docs/gpu_memory_formats.md) +* [Kernels and kernel selectors](./docs/gpu_kernels.md) +* [GPU plugin operations enabling flow](./docs/gpu_plugin_ops_enabling.md) +* [Debug utils](./docs/gpu_debug_utils.md) +* [OpenCL Runtime issues troubleshooting](./docs/gpu_plugin_driver_troubleshooting.md) +* [GPU plugin unit test](./docs/gpu_plugin_unit_test.md) + +## Attached licenses GPU plugin uses 3rd-party components licensed under following licenses: -- *googletest* under [Google\* License](https://github.com/google/googletest/blob/master/googletest/LICENSE) -- *OpenCL™ ICD and C++ Wrapper* under [Khronos™ License](https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/LICENSE.txt) -- *RapidJSON* under [Tencent\* License](https://github.com/Tencent/rapidjson/blob/master/license.txt) +- *googletest* under [Google License](https://github.com/google/googletest/blob/master/googletest/LICENSE) +- *OpenCL™ ICD and C++ Wrapper under [Khronos™ License](https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/LICENSE.txt) +- *RapidJSON* under [Tencent License](https://github.com/Tencent/rapidjson/blob/master/license.txt) ## Support Please report issues and suggestions @@ -32,9 +64,9 @@ GPU plugin requires CPU with Intel® SSE/Intel® AVX support. 
--- The software dependencies are: -- [CMake\*](https://cmake.org/download/) 3.5 or later +- [CMake](https://cmake.org/download/) 3.5 or later - C++ compiler with C++11 standard support compatible with: - * GNU\* Compiler Collection 4.8 or later + * GNU Compiler Collection 4.8 or later * clang 3.5 or later * [Intel® C++ Compiler](https://software.intel.com/en-us/intel-parallel-studio-xe) 17.0 or later * Visual C++ 2015 (MSVC++ 19.0) or later @@ -43,7 +75,7 @@ The software dependencies are: - [python™](https://www.python.org/downloads/) 3.7 or later. -# Trademark Information +## Trademark Information Intel, the Intel logo, Intel Atom, Intel Core, Intel Xeon Phi, Iris, OpenVINO, the OpenVINO logo, Pentium, VTune, and Xeon are trademarks @@ -59,3 +91,10 @@ OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. Copyright © 2021, Intel Corporation + +## See also + + * [OpenVINO™ README](../../../README.md) + * [OpenVINO Core Components](../../README.md) + * [OpenVINO Plugins](../README.md) + * [Developer documentation](../../../docs/dev/index.md) diff --git a/src/plugins/intel_gpu/docs/basic_data_structures.md b/src/plugins/intel_gpu/docs/basic_data_structures.md new file mode 100644 index 00000000000..087ea86b4b0 --- /dev/null +++ b/src/plugins/intel_gpu/docs/basic_data_structures.md @@ -0,0 +1,245 @@ +# Basic data structures of GPU graph and overall flow + +## Overall graph data structure + + +```mermaid +classDiagram +direction LR +pooling --<| primitive_base +convolution --<| primitive_base +class primitive_base{<>} +primitive_base --<| primitive +primitive --o program_node +primitive --o topology +class typed_program_node {<>} +typed_program_node --<| typed_program_node_base +class typed_program_node_base{<>} +typed_program_node_base --<| program_node +program_node --o program +class primitive_type { ++create_node ++create_instance ++choose_mpl} +program --> topology +program ..<| primitive_type : create_node()\nchoose_impl() +convolution_impl --<| typed_primitive_impl_ocl +fully_connected_impl --<| typed_primitive_impl_ocl +convolution_onednn --<| typed_primitive__onednn_impl +pooling_onednn --<| typed_primitive__onednn_impl +class typed_primitive_impl_ocl {<>} +typed_primitive_impl_ocl --<| typed_primitive_impl +class typed_primitive__onednn_impl {<>} +typed_primitive__onednn_impl --<| typed_primitive_impl +class typed_primitive_impl {<>} +typed_primitive_impl --<| primitive_impl +primitive_impl --o primitive_inst +primitive_impl --o program_node +class typed_primitive_inst {<>} +class `typed_primitive-inst` {<>} +typed_primitive_inst --<| typed_primitive_inst_base +`typed_primitive-inst` --<| typed_primitive_inst_base +class typed_primitive_inst_base {<>} +typed_primitive_inst_base --<| primitive_inst +primitive_inst --> program_node +primitive_inst --o network +network --> program +network ..<| primitive_type : create_instance +class primitive_type_base {<>} +primitive_type_base --<| primitive_type +primitive_type_base ..<| typed_program_node +primitive_type_base --o primitive_base: 0.1 +class implementation_map {<> +get(typed_program_node): factory_type} +primitive_type_base ..<| implementation_map : get() +primitive_type_base ..<| typed_primitive_inst +a1 o-- a2 : Aggregation +b1 --> b2 : Association +c1 --<| c2 : Inheritance +d1 ..> d2 : Dependency +``` + +There are three levels of abstraction in the graph structures being used in the gpu plugin : *topology*, *program*, *network*.
+The above figure presents the overall data structures.
+
+First, the original model is represented as a corresponding *topology*, which consists of primitives and their connections. It can be regarded as a simple graph structure representing the original model.
+
+Then the topology is converted to a *program*, which consists of *program_nodes* corresponding to the original primitives and their connections.
+The majority of the transformations and optimizations are performed on the *program*.
+Also, a *primitive_impl* is created for each *program_node* at this stage. It holds the selected kernels for the *program_node* and the information required to run them, such as work group sizes and kernel arguments. The final source code of the kernels is decided and compiled at this stage, too.
+Note that a *program* is common for all streams, i.e., there is only one *program* created for all the streams.
+
+Once the *program* is finalized, a *network* is built from the *program* for each stream.
+A *network* consists of primitive instances (a.k.a. *primitive_inst*) that contain the memory allocations required by the kernels.
+Finally, the *network* can be run by calling network::execute().
+
+A more detailed description of each component is given in the sections below.
+
+
+## primitive
+```cpp
+struct primitive {
+...
+    const primitive_id id;
+    const primitive_type* type;
+    padding output_padding;
+    std::vector<primitive_id> input;
+...
+};
+```
+A *primitive* is the primary representation of an operation in the GPU plugin; primitives and their connections make up the graph structure, i.e., the *topology*. A *primitive* is created for each layer operation in the original model and holds the basic information about the operation, such as the required inputs, outputs and attributes, as well as its own id, a.k.a. the *primitive_id*. Here, the *primitive_id* is a unique string id assigned to each *primitive* throughout the processing.
+ +The APIs of the available primitives can be found [here](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_gpu/include/intel_gpu/primitives).
+ +An example creation of a arg_max_min primitive: +```cpp +cldnn::arg_max_min top_k_prim = cldnn::arg_max_min("top_k", { "input" }, arg_max_min::max, top_k, arg_max_min::y, arg_max_min::sort_by_values, false, "", padding(), data_types::f32); +``` + +In GPU plugin, the *primitives* are converted from ngraph operations, which can be found [here](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_gpu/src/plugin/ops). + +## topology +```cpp +struct topology{ +... + std::map> _primitives; +... +}; +``` + +A *topology* is a graph structure consisting of *primitives* and their connections. Here a connection is defined by input primitives assigned to a primitive. + +A simple example of creation of a topology, which is consisting of two poolings, one concatenation of the poolings, and a reorder primitive, is shown as follows: +```cpp +auto input0 = engine.allocate_memory({data_types::i8, format::bfyx, {1, 1, 8, 3}}); +auto input1 = engine.allocate_memory({data_types::i8, format::bfyx, {1, 1, 8, 3}}); +layout reorder_layout(data_types::i8, format::yxfb, {7, 2, 2, 1}); +topology topology(input_layout("input0", input0->get_layout()), + input_layout("input1", input1->get_layout()), + pooling("pool0 /*primitive_id of this pooling*/", "input0 /*primitive_id of input primitive for pool0*/", pooling_mode::max, {1, 1, 2, 2}, {1, 1, 1, 1}), + pooling("pool1", "input1", pooling_mode::max, {1, 1, 2, 2}, {1, 1, 1, 1}), + concatenation("concat", + {"pool0", "pool1"}, + concatenation::concatenation_axis::along_f, + data_types::i8, + "", + padding{{0, 0, 0, 0}, 0}), + reorder("reorder", "concat", reorder_layout)); +``` + +In the above example, "pool0" is the *primitive_id* of the first pooling, and "input0" is the *primitive_id* of the input primitive of it. The latter parameters such as pooling_mode::max, {1, 1, 2, 2}, {1, 1, 1, 1} are other properties for pooling primitive, pooling_mode, tensor size, stride, respectively. + +Note that topology is created from ngraph representation in the gpu plugin. Manual definition of a topology shown in the above snippet is usually for unittest purpose. + +## program_node (impl) + +```cpp +struct program_node { +... + program& myprog; + std::unique_ptr selected_impl; + layout output_layout; + std::vector dependencies; + std::list users; + std::set memory_dependencies; + std::vector fused_activations; + std::vector fused_prims; +... +}; +``` +A program is consisting of program_nodes which are created from primitives. ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L353)) A program_node is created by a factory for each primitive type, i.e., primitive_type, which is associated to each primitive as type ([link](https://github.com/openvinotoolkit/openvino/blob/173f328c53d39dd42ecdb9de9e04f9d2c266683f/src/plugins/intel_gpu/include/intel_gpu/primitives/primitive.hpp#L79)). (Note that this primitive_type is used to create primitive_inst or call choose_impl too.) + +Basically a program_node holds the following information which is to be decided throughout the transformation / optimization processes in a program: +* layout : output layout of a program_node. 
([impl](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/include/intel_gpu/runtime/layout.hpp)) +* dependencies : a list of program_nodes whose outputs are used by the current program_node as the inputs +* memory dependencies : a list of program_nodes, the live ranges of the outputs of them overlaps with that of the current program_node +* fused operations : fused operations to the current program_node +* selected impl : The primitive_impl object which holds the information for the selected kernel required to run it, such as the selected kernels, work group size, etc. Also this object has the methods to set kernel arguments for a primitive_inst and execute the kernel by enqueueing it to the command queue. + +## program (impl) + +```cpp +struct program { +... + uint32_t prog_id = 0; + engine& _engine; + std::unique_ptr _kernels_cache; + std::list inputs; + std::vector outputs; + nodes_ordering processing_order; + std::unique_ptr pm; + std::map> nodes_map; +... +}; +``` +The major tasks that are done while building a program are as follows: +([ref](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L433)) +* Init graph : Create an initial program consisting of program_nodes built from a given topology +* Optimization (Major optimizations will be dealt with from another section TBD) + * pre-optimization ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L474)): Optimizations done before graph_compilation. Notable passes are as follows: + * prepare_primitive_fusing : decision of fusing + * reorder_inputs : decision of preferred layout / impl (ocl vs onednn) and adding reorders w.r.t the decision + * post-optimization ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L437)) Optimizations done after graph_compilation
+    * post_optimize_weights : Add reorders that convert the weights to their preferred formats (added as generic nodes)
+    * propagate_constants : Transfer and reorder the original weight data to the generic_nodes created at post_optimize_weights. Note that constant propagation performs the weight reordering by running an actual network (with is_internal = true). To this end, a temporary program is created, built, and run within this pass.
+ +* Kernel selection and graph compilations ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L436)) : Select best kernel for the program_node and create the impl (i.e., primitive_impl), and collect the kernel source code strings to the kernels_cache. +* Kernel compilation ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/program.cpp#L451)): JIT compilation of the collected kernels. Currently 9 kernels are combined as a batch and compiled at a time. Also the batches are compiled in parallel. See [here](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/runtime/kernels_cache.cpp#L400). + +## primitive_inst (impl) + +```cpp +class primitive_inst { +... + program_node const& _node; + std::unique_ptr _impl; + std::vector> _deps; + std::vector> _exec_deps; + memory::ptr _output; + std::vector _intermediates_memory; + + event::ptr execute(const std::vector& events); + memory::ptr allocate_output(); +... +}; +``` +Once all processing at a program level is finished, a network is to be built from the program. +primitive_inst is the basic component comprising a network. +While each primitive_inst object is still associated to the corresponding program_node, it holds the required memory objects such as output memory objects and intermediate memory objects that are to be used by that node. A brief description for the two kinds of memory allocated for a primitive_inst is as follows: + +* output memory : An output memory of a primitive_inst is allocated at the creation of each primitive_inst ([impl](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L210)), unless its output is reusing the input memory or the node is a mutable data to be used as a 2nd output. The general output tensors are allocated by the memory pool, so that the memory could be reused by other nodes when it is not needed. (Note that constants data are not reusable and should retain the own memory, so that they could be shared by multiple streams. More descriptions about memory pool will be given by dedicated section (TBD)). +* intermediate memory ([impl](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L215)): Some kernels requires intermediate memories in addition to the input/output memories such as [detection_output](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/kernel_selector/core/actual_kernels/detection_output/detection_output_kernel_ref.cpp#L155). The allocation happens after all primitive_insts are finished ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/network.cpp#L592)), since it needs to be processed in a processing_order to use the predecessors' allocation information while the creation of primitive_inst is done in a order sorted by memory_size. + +## network (impl) +```cpp +struct network { +... 
+ program::ptr _program; + stream::ptr _stream; + std::unique_ptr _memory_pool; + std::map> _primitives; + std::vector> _inputs; + std::vector> _outputs; + std::list> _exec_order; + std::list> _data_outputs; + std::unordered_map _events; + output_chains_map _output_chains; +... + std::map execute(const std::vector& dependencies = {}); + void set_arguments(); + void allocate_primitives(); +}; +``` +When a network is built, the comprising primitives are allocated and dependencies among them are set ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/network.cpp#L259)). + +The major processes done while a network is executed are as follows ([impl]( https://github.com/openvinotoolkit/openvino/blob/3de428c7139fef69e37b406c3490c26b67b48026/src/plugins/intel_gpu/src/graph/network.cpp#L663)) : +* set arguments of the primitives (i.e., set the [kernel_args](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/include/intel_gpu/runtime/kernel_args.hpp) required for running the kernels such as input/output memory address) + +* [execute primitives](https://github.com/openvinotoolkit/openvino/blob/3de428c7139fef69e37b406c3490c26b67b48026/src/plugins/intel_gpu/src/graph/network.cpp#L849) : Execute each primitives, i.e., enqueue the kernels to the context queue. + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) diff --git a/src/plugins/intel_gpu/docs/execution_of_inference.md b/src/plugins/intel_gpu/docs/execution_of_inference.md new file mode 100644 index 00000000000..7608fedd994 --- /dev/null +++ b/src/plugins/intel_gpu/docs/execution_of_inference.md @@ -0,0 +1,33 @@ +# Execution of Inference + +Network execution happens when user calls `inferRequest->infer()` or `inferRequest->start_async()`. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/samples/cpp/benchmark_app/main.cpp#L929) + +In high level, all we need to do is enqueuing OCL kernels with buffers. For that purpose, we need to find the `cldnn::network` instance as it contains the required buffers for execution. [(link)](https://github.com/openvinotoolkit/openvino/wiki/Basic-data-structures-of-gpu-graph-and-overall-flow#network-impl) `CPUStreamExecutor` is holding streams and the stream corresponds to the `cldnn::network` structure. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/inference/src/threading/ie_cpu_streams_executor.cpp#L263) + +The main body of network execution is `cldnn::network::execute_impl`. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/graph/network.cpp#L663) In this function, `set_arguments()` is called to set OpenCL arguments and `execute_primitive` is called to enqueue kernels to OCL queue. +In case of synchronous API call(i.e. `inferRequest->infer()`), waiting for completion of kernels is also required. It is called from `cldnn::network_output::get_memory()` function. 
[(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/include/intel_gpu/graph/network.hpp#L31) + +## Optimized-out node +During graph compilation [(link)](https://github.com/openvinotoolkit/openvino/wiki/Graph-Optimization-Passes), some nodes may be optimized out. + +For example, concat operation may be executed _implicitly_, or in other words, concat may be _optimized out_. Implicit concat is possible when the input of concat can put the output tensor directly into the result tensor of concat. + +In such case, we don't remove the node in the graph for integrity of node connection. Concat layer is just marked as **optimized-out** and not executed during runtime. [(src)](https://github.com/openvinotoolkit/openvino/blob/dc6e5c51ee4bfb8a26a02ebd7a899aa6a8eeb239/src/plugins/intel_gpu/src/graph/impls/ocl/primitive_base.hpp#L155) + +## Dumping layer in/out buffer during execution +`cldnn::network::execute_impl` also contains some logic to dump layer in/out buffers for debugging purpose. As it is related to memory usage, it deserves some description, too. + +In order to dump buffers, we need to wait for the moment that the kernel is about to be called(for source buffer) or just called(for destination buffer). In other moments, we don't have the layer's buffer as the buffers are reused from memory pool. [(link)](https://github.com/openvinotoolkit/openvino/wiki/Memory-allocation-in-GPU-plugin#memory-dependency-and-memory-pool) + +`get_stream().finish()` is called firstly as we need to be synchronous with kernel execution. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/graph/network.cpp#L712) Then we can access the buffer. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/graph/network.cpp#L114) This access varies depending on the kind of buffer. If it is `usm_host` or `usm_shared`, it is just accessed directly. If it is `usm_device`, it is accessed after copying the data into host memory because host cannot access `usm_device` directly. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L312) If it is ocl memory, we map this into host memory. [(src)](https://github.com/openvinotoolkit/openvino/blob/f48b23362965fba7e86b0077319ea0d7193ec429/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L46) + +Typical network execution happens with `usm_host` for network input and output and `usm_device` for the buffers inside the network. + +For usage of this dumping feature, please see [link](https://github.com/openvinotoolkit/openvino/wiki/GPUPluginDebugUtils#layer-inout-buffer-dumps). + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) \ No newline at end of file diff --git a/src/plugins/intel_gpu/docs/gpu_debug_utils.md b/src/plugins/intel_gpu/docs/gpu_debug_utils.md new file mode 100644 index 00000000000..1acc5786778 --- /dev/null +++ b/src/plugins/intel_gpu/docs/gpu_debug_utils.md @@ -0,0 +1,252 @@ +# GPU plugin debug utils + +This document is a list of useful debug features / tricks that might be used to find root cause of performance / functional issues. 
Some of them +are available by default, but some others might require plugin recompilation. + +## Debug Config +`Debug_config` is an infra structure that contains number of easy-to-use debugging features. It has various control parameters. You can check list of parameters from the source code `cldnn::debug_configuration`. + +### How to use it +First, this feature should be enabled from cmake configuration `ENABLE_DEBUG_CAPS`. When openvino is released, it is turned off by default. +The parameters should be set from environment variable when calling inference engine API. + +``` +$ OV_GPU_Verbose=1 ./benchmark_app ... # Run benchmark_app with OV_GPU_Verbose option +$ OV_GPU_DumpLayersPath="cldnn/" ./benchmark_app ... # Run benchmark_app and store intermediate buffers into cldnn/ directory. +``` + +For Windows OS, please use below syntax. + +``` +Windows Power Shell: +> $env:OV_GPU_Verbose=1 +> .\benchmark_app.exe ... # Run benchmark_app with OV_GPU_Verbose option + +Windows cmd.exe: +> set "OV_GPU_Verbose=1" +> benchmark_app.exe ... # Run benchmark_app with OV_GPU_Verbose option +``` + +### Options syntax +Plugin is able to parse different naming styles for debug options: +1. `OV_GPU_SOME_OPTION` +2. `OV_GPU_SomeOption` + +Behavior when both versions are specified is not defined. + +Some options also allow multiple prefixes: `OV` and `OV_GPU`. `OV` prefix is intended to be used for options common for all OpenVINO components. In case if an option is set twice with different prefixes, then `OV_GPU` has higher priority. + +### List of parameters (There are actually more than this, please see OV_GPU_Help result) + +* `OV_GPU_Help`: Show help message of debug config. +* `OV_GPU_Verbose`: Verbose execution. Currently, Verbose=1 and 2 are supported. +* `OV_GPU_PrintMultiKernelPerf`: Print kernel latency for multi-kernel primitives. This is turned on by setting 1. Execution time is printed. +* `OV_GPU_DisableUsm`: Disable the usage of usm (unified shared memory). This is turned on by setting 1. +* `OV_GPU_DisableOnednn`: Disable onednn for discrete GPU (no effect for integrated GPU) +* `OV_GPU_DumpGraphs`: Dump optimized graph into the path that this variable points. This is turned on by setting the destination path into this variable. +* `OV_GPU_DumpSources`: Dump opencl sources +* `OV_GPU_DumpLayersPath`: Enable intermediate buffer dump and store the tensors. This is turned on by setting the destination path into this variable. You can check the exact layer name from `OV_GPU_Verbose=1`. +* `OV_GPU_DumpLayers`: Dump intermediate buffers only for the layers that this variable specifies. Multiple layers can be specified with space delimiter. Dump feature should be enabled through `OV_GPU_DumpLayersPath` +* `OV_GPU_DumpLayersResult`: Dump output buffers of result layers only +* `OV_GPU_DumpLayersDstOnly`: When dumping intermediate buffer, dump destination buffer only. This is turned on by setting 1. +* `OV_GPU_DumpLayersLimitBatch`: Limit the size of batch to dump +* `OV_GPU_DryRunPath`: Dry run and serialize execution graph into the specified path +* `OV_GPU_BaseBatchForMemEstimation`: Base batch size to be used in memory estimation +* `OV_GPU_AfterProc`: Run inference after the specified process PIDs are finished, separated by space. Supported on only on linux. +* `OV_GPU_SerialCompile`: Serialize creating primitives and compiling kernels +* `OV_GPU_ForceImplType`: Force implementation type of a target primitive or layer. 
[primitive or layout_name]:[impl_type] For primitives, fc:onednn, fc:ocl, do:cpu, do:ocl, reduce:ocl and reduce:onednn are supported +* `OV_GPU_MaxKernelsPerBatch`: Maximum number of kernels in a batch during compiling kernels + +## Dump execution graph +The execution graph (also known as runtime graph) is a device specific graph after all transformations applied by the plugin. It's a very useful +feature for performance analysis and it allows to find a source of performance regressions quickly. Execution graph can be retrieved from the plugin +using `GetExecGraphInfo()` method of `InferenceEngine::ExecutableNetwork` and then serialized as usual IR: +```cpp + ExecutableNetwork exeNetwork; + // Load some model into the plugin + CNNNetwork execGraphInfo = exeNetwork.GetExecGraphInfo(); + execGraphInfo.serialize("/path/to/serialized/exec/graph.xml"); +``` + +The capability to retrieve execution graph and store it on the disk is integrated into `benchmark_app`. The execution graph can be simply dumped +by setting additional parameter `-exec_graph_path exec_graph.xml` for `benchmark_app`. Output `xml` file has a format similar to usual IR, but contains +execution nodes with some runtime info such as: +- Execution time of each node +- Mapping between nodes in final device specific graph and original input graph operations +- Output layout +- Output precision +- Primitive type +- Inference precision + +Typical node in GPU execution graph looks as follows: +``` + + + + + 1 + 3 + 224 + 224 + + + + + 1 + 64 + 112 + 112 + + + +``` + +Most of the data here is very handy for the performance analysis. For example, for each node you can check that: +- Nodes fusion works as expected on given models (i.e. some node is missing in execution graph and it's name is a part of `originalLayersNames` list for some other node) +- Input and output layouts of a node are optimal in each case +- Input and output precisions are valid in each case +- The node used expected kernel for execution +- And the most important: actual execution time of each operation + +This graph can be visualized using Netron tool and all these properties can be analyzed there. + +Note: execution time collection for each primitive requires `CONFIG_KEY(PERF_COUNT)` to be enabled (`benchmark_app` does it automatically), thus the overall model execution time is usually much worse in such use cases. + +## Performance counters + +This feature is a simplified version of execution graph as it provides much less information, but it might be more suitable for quick analysis and some kind of +processing with scripts. + +Performance counters can be retrieved from each `InferenceEngine::InferRequest` object using `getPerformanceCounts()` method. This feature is also integrated +into `benchmark_app` and the counters can be printed to cout using `-pc` parameter. + +The format looks as follows: + +``` +${layer_name} ${exec_status} layerType: ${type} realTime: ${device_time} cpu: ${host_time} execType: ${kernel_name} +Total time: ${sum_of_device_times} microseconds +``` + +For example: + +``` +convolution EXECUTED layerType: Convolution realTime: 500 cpu: 3 execType: convolution_gpu_bfyx_os_iyx_osv16 +relu OPTIMIZED_OUT layerType: ReLU realTime: 0 cpu: 0 execType: undef +Total time: 53877 microseconds +``` + +So it allows to quickly check execution time of some operation on the device and make sure that correct primitive is used. Also, the output can be easily +converted into .csv format and then used to collect any kind of statistics (e.g. 
execution time distribution by layer types). + +## Graph dumps + +intel_gpu plugin allows to dump some info about intermediate stages in graph optimizer. + +* You can dump graphs with `OV_GPU_DumpGraphs` of debug config. For the usage of debug config, please see [link](#debug-config). + +* Alternative, you can also enable the dumps from the application source code: +clDNN plugin has the special internal config option `graph_dumps_dir` which can be set from the user app via plugin config: +```cpp +Core ie; +std::map device_config; +device_config[CLDNN_CONFIG_KEY(GRAPH_DUMPS_DIR)] = "/some/existing/path/"; +ie.SetConfig(device_config, "GPU"); +``` + +For each stage it dumps: +``` +- cldnn_program_${program_id}_${stage_id}_${stage_name}.graph - graph saved in dot format which can be visualized via graphviz tool +- cldnn_program_${program_id}_${stage_id}_${stage_name}.info - graph in text format +- cldnn_program_${program_id}_${stage_id}_${stage_name}.optimized - the list of nodes optimized out up to this stage +- cldnn_program_${program_id}_${stage_id}_${stage_name}.order - processing order in text format +- ${program_id}_${stage_id}_${stage_name}.xml - graph in a format of execution graph +``` + +Main graph usually has `program_id = 0`, graphs with other `program_id` values are usually created internally for constant propagation or some other purposes. + +## Sources dumps + +Since intel_gpu source tree contains only *templates* of the OpenCL™ kernels, it's quite important to get full kernels source code. + +* You can use `OV_GPU_DumpSources` of debug config. For the usage of debug config, please see [link](#debug-config). + +* You can also dump OpenCL source code by changing OpenVINO source code: +clDNN plugin has the special internal config option `sources_dumps_dir` which can be set from the user app via plugin config: +```cpp +Core ie; +std::map device_config; +device_config[CLDNN_CONFIG_KEY(SOURCES_DUMPS_DIR)] = "/some/existing/path/"; +ie.SetConfig(device_config, "GPU"); +``` + +When this key is enabled, the plugin dumps multiple files with the following names: +``` +clDNN_program_${program_id}_part_${bucket_id}.cl +``` + +Note: `program_id` here might differ from `program_id` for the graph dumps as it's just a static counter for enumerating incoming programs. + +Each file contains a bucket of kernels that are compiled together. In case of any compilation errors, intel_gpu plugin will append compiler output +in the end of corresponding source file. + +If you want to find some specific layer, then you'll need to use Debug/RelWithDebInfo build or modify base jitter method to append `LayerID` in release build: +```cpp +// inference-engine/thirdparty/clDNN/kernel_selector/core/kernel_base.cpp +JitConstants KernelBase::MakeBaseParamsJitConstants(const base_params& params) const { + // ... +#ifndef NDEBUG <--- should be removed + jit.AddConstant(MakeJitConstant("LayerID", params.layerID)); +#endif +} +``` + +When source is dumped, it actually contains huge amount of macros(`#define`). For readability, you can run c preprocessor to apply the macros. + +`$ cpp dumped_source.cl > clean_source.cl` + + +## Layer in/out buffer dumps + +In some cases you might want to get actual values in each layer execution to compare it with some reference blob. In order to do that we have +`OV_GPU_DumpLayersPath` option in debug config. For the usage of debug config, please see [link](#debug-config). + +As a prerequisite, enable ENABLE_DEBUG_CAPS from cmake configuration. 
+ +Then, check runtime layer name by executing benchmark_app with OV_GPU_Verbose=1. It is better to be checked with this than through IR because this may be slightly different. OV_GPU_Verbose=1 will show log of execution of each layer. + +``` +# As a prerequisite, enable ENABLE_DEBUG_CAPS from cmake configuration. +export OV_GPU_DumpLayersPath=path/to/dir +export OV_GPU_DumpLayers="layer_name_to_dump1 layer_name_to_dump2" +export OV_GPU_DumpLayersDstOnly=1 # Set as 1 when you want to dump dest buff only +``` + +Dump files have the following naming: +``` +${layer_name_with_underscores}_${src/dst}_${port_id}.txt +``` + +Each file contains single buffer in common planar format (`bfyx`, `bfzyx` or `bfwzyx`) where each value is stored on a separate line. The first line in the file constains buffer description, e.g: +``` +shape: [b:1, f:1280, x:1, y:1, z:1, w:1, g:1] (count: 1280, original format: b_fs_yx_fsv16) +``` + +For accuracy troubleshoot, you may want to compare the GPU plugin result against CPU plugin result. For CPU dump, see [Blob dumping](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/docs/blob_dumping.md) + + +## Run int8 model on gen9 HW + +As gen9 hw doesn't have hardware acceleration, low precision transformations are disabled by default, thus quantized networks are executed in full precision (fp16 or fp32) with explicit execution of quantize operations. +If you don't have gen12 HW, but want to debug network's accuracy or performance of simple operations (which doesn't require dp4a support), then you can enable low precision pipeline on gen9 using one of the following ways: +1. Add `{PluginConfigInternalParams::KEY_LP_TRANSFORMS_MODE, PluginConfigParams::YES}` option to the plugin config +2. Enforce `supports_imad = true` [here](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/src/gpu/device_info.cpp#L226) +3. Enforce `conf.enableInt8 = true` [here](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/src/cldnn_engine/cldnn_engine.cpp#L366) + +After that the plugin will run exactly the same scope of transformations as on gen12HW and generate similar kernels (small difference is possible due to different EUs count) + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) diff --git a/src/plugins/intel_gpu/docs/gpu_kernels.md b/src/plugins/intel_gpu/docs/gpu_kernels.md new file mode 100644 index 00000000000..176300fa04c --- /dev/null +++ b/src/plugins/intel_gpu/docs/gpu_kernels.md @@ -0,0 +1,139 @@ +# GPU kernels implementation overview + +As mentioned in [GPU plugin structure](./source_code_structure.md), kernels for GPU plugin are located in `src/plugins/intel_gpu/src/kernel_selector` folder. + +For each operation we usually have multiple kernels that can support different parameters and/or optimized for different scenarios. + +Each operation has 3 major entities in kernel selector: + - Operation specific `kernel_selector` instance + - Operation parameters descriptor + - Kernels itself with a set of heuristics inside for optimal selection + + ## Kernel selector instance +For each operation we create kernel_selector class derived from `kernel_selector_base`. Basically, this class is needed to specify available kernels +for given operation. Each kernel selector is used as singleton. 
For example: + + +```cpp +class mvn_kernel_selector : public kernel_selector_base { +public: + static mvn_kernel_selector& Instance() { + static mvn_kernel_selector instance_; + return instance_; + } + + mvn_kernel_selector(); + + KernelsData GetBestKernels(const Params& params, const optional_params& options) const override; +} + +// The list of available kernels is usually specified in kernel_selector c-tor using `Attach` method whith creates instance of each type +// and append it to implementations list. +// In this case we have 3 available kernels for MVN operation. Kernels might have different priorities and support only subset of operation parameters +// E.g. MVNKernel_b_fs_yx_fsv16_imad supports only `fsv16` blocked layouts and INT8/UINT8 input data types +mvn_kernel_selector::mvn_kernel_selector() { + Attach(); + Attach(); + Attach(); +} + +// This method is used to get the optimal kernel for given parameters +// There are 2 base methods to pick optimal kernels: `GetNaiveBestKernel` and `GetAutoTuneBestKernel` +// If kernel supports auto tuning, then it uses `GetAutoTuneBestKernel`, otherwise, it uses `GetNaiveBestKernel` +// parameterized with `KernelType` which specifies the operation type which is implemented by the specific kernel selector +KernelsData mvn_kernel_selector::GetBestKernels(const Params& params, const optional_params& options) const { + return GetNaiveBestKernel(params, options, KernelType::MVN); +} +``` + +The caller code looks as follows: + +```cpp +// Get static instance of the kernel_selector +auto& kernel_selector = kernel_selector::mvn_kernel_selector::Instance(); +// Run some heuristics to pick the best mvn kernel for given `mvn_params` +auto best_kernels = kernel_selector.GetBestKernels(mvn_params, mvn_optional_params); +``` + +## Operation parameters + +The parameters of operation for kernel_selector are defined in corresponding `${op_name}_params` class which is derived from `base_params`. For example: +```cpp +struct mvn_params : public base_params { + mvn_params() : base_params(KernelType::MVN) {} + + MVNMode mvnMode = MVNMode::WITHIN_CHANNELS; + bool mvnNormalizeVariance = true; + float epsilon = 1e-10f; + + virtual ParamsKey GetParamsKey() const { + ParamsKey k = base_params::GetParamsKey(); + + k.EnableMVNMode(mvnMode); + + if (mvnNormalizeVariance) + k.EnableMVNNormalizeVariance(); + + return k; + } +}; +``` + +The derived class should parameterize base class with specific `KernelType` and add operation-specific parameters. The only method that must be implemented +is `GetParamsKey()` which is used as a quick check for kernels applicability for current parameters, i.e. we take `ParamsKey` object calculated for input +operation parameters and `ParamsKey` object for each kernel, so we can compare them and discard the kernels that don't support current parameters. 
+`ParamsKey` is implemented as a set of bit masks, so the applicability check is quite simple: +```cpp +const ParamsKey implKey = some_implementation->GetSupportedKey(); +if (!implKey.Support(paramsKey)) + // Do something + +// Support() method do something like follows for each internal bit mask: +if (!((implKey.mask & paramsKey.mask) == paramsKey.mask)) + return false; +``` + +## Kernel implementation + +Each kernel must specify the following things: +- Input parameters checks + - `GetSupportedKey()` method implementation which returns `ParamsKey` object for current implementation + - `Validate()` method that do more complex checks (optional) +- Dispatch data (global/local workgroup sizes, scheduling algorithm, etc) +- Kernel name - must be passes to base class c-tor +- Kernel arguments specification - description of each argument in corresponding OpenCL™ kernel +- Additional JIT constants required for kernel - set of macro definitions that must be added to thi kernel template to make full specialization for given params +- Supported fused operations (if any) - a list of supported operations that can be fused into current kernel + +Let's have a look at the key methods of each kernel implementation: + +```cpp +class MVNKernelRef : public MVNKernelBase { +public: + MVNKernelRef() : MVNKernelBase("mvn_gpu_ref") {} // mvn_gpu_ref is the name of the file with kernel template in cl_kernels/ folder without .cl extension + // Returns the kernel specified for input parameters if the implementation can process it + KernelsData GetKernelsData(const Params& params, const optional_params& options) const override; + // Returns `ParamsKey` for current implementation for quick applicability check + ParamsKey GetSupportedKey() const override; + +protected: + // Specifies additional jit constants for kernel template specification + JitConstants GetJitConstants(const mvn_params& params, DispatchData dispatchData) const override; + // The list of supported fused operations + std::vector GetSupportedFusedOps() const override { + return { + FusedOpType::ACTIVATION, + FusedOpType::QUANTIZE, + FusedOpType::ELTWISE, + FusedOpType::SCALE + }; + } +}; +``` + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) \ No newline at end of file diff --git a/src/plugins/intel_gpu/docs/gpu_memory_formats.md b/src/plugins/intel_gpu/docs/gpu_memory_formats.md new file mode 100644 index 00000000000..891814ad59f --- /dev/null +++ b/src/plugins/intel_gpu/docs/gpu_memory_formats.md @@ -0,0 +1,113 @@ +# GPU memory formats + +The memory format descriptor in GPU plugin usually uses the following letters: + - `b` - batch + - `f` - features/channels + - `w`, `z`, `y`, `x` - spatial dimensions + - `i` - input channels (for weights layout only) + - `o` - output channels (for weights layout only) + - `g` - groups (for weights layout only) + +The combination of the characters above defines tensor format, i.e. the actual layout of tensor values in memory buffer. For example: +`bfyx` format means that the tensor has 4 dimensions in planar layout and `x` coordinate changes faster than `y`, `y` - faster than `f`, and so on. 
+It means that for tensor with size `[b: 2; f: 2; y: 2; x: 2]` we have a linear memory buffer with `size=16` where: +``` +i = 0 => [b=0; f=0; y=0; x=0]; +i = 1 => [b=0; f=0; y=0; x=1]; + +i = 2 => [b=0; f=0; y=1; x=0]; +i = 3 => [b=0; f=0; y=1; x=1]; + +i = 4 => [b=0; f=1; y=0; x=0]; +i = 5 => [b=0; f=1; y=0; x=1]; + +i = 6 => [b=0; f=1; y=1; x=0]; +i = 7 => [b=0; f=1; y=1; x=1]; + +i = 8 => [b=1; f=0; y=0; x=0]; +i = 9 => [b=1; f=0; y=0; x=1]; + +i = 10 => [b=1; f=0; y=1; x=0]; +i = 11 => [b=1; f=0; y=1; x=1]; + +i = 12 => [b=1; f=1; y=0; x=0]; +i = 13 => [b=1; f=1; y=0; x=1]; + +i = 14 => [b=1; f=1; y=1; x=0]; +i = 15 => [b=1; f=1; y=1; x=1]; +``` + +Usually, planar memory formats are not very efficient for DNN operations, so GPU plugin has plenty *blocked* format. Blocking means that we take some tensor dimension +and put blocks of adjacent elements closer in memory (in the format with single blocking they are stored linearly in the memory). Consider the most widely used +blocked format in GPU plugin: `b_fs_yx_fsv16`. First of all, let's understand what these additional letters mean. We have `b`, `f`, `y`, `x` dimensions here, so +this is 4D tensor. +`fs=CeilDiv(f, block_size)`; `fs` means `feature slice` - the blocked dimension. +The block size is specified in the format name: `fsv16` - `block_size = 16`, blocked dimension is `f`; `fsv` means `feature slice vector` +Just like with any other layout, the coordinate of the rightmost dimension (`fsv`) is changed first, then coordinate to the left (`x`), and so on. + +Note: if the original `f` dimension is not divisible by block size (16 in this case), then it's aligned up to the first divisible value. These pad values +are filled with zeroes. + +Let's look at the changes with the tensor above if we reorder it into `b_fs_yx_fsv16` format: +1. Actual buffer size becomes `[b: 2; f: 16; y: 2; x: 2]`, and total size = 128 +2. The order of elements in memory changes: +``` +// first batch +i = 0 => [b=0; f=0; y=0; x=0] == [b=0; fs=0; y=0; x=0; fsv=0]; +i = 1 => [b=0; f=1; y=0; x=0] == [b=0; fs=0; y=0; x=0; fsv=1]; +i = 2 => [b=0; f=2; y=0; x=0] == [b=0; fs=0; y=0; x=0; fsv=2]; +... +i = 15 => [b=0; f=15; y=0; x=0] == [b=0; fs=0; y=0; x=0; fsv=15]; + +i = 16 => [b=0; f=0; y=0; x=1] == [b=0; fs=0; y=0; x=1; fsv=0]; +i = 17 => [b=0; f=1; y=0; x=1] == [b=0; fs=0; y=0; x=1; fsv=1]; +i = 18 => [b=0; f=2; y=0; x=1] == [b=0; fs=0; y=0; x=1; fsv=2]; +... +i = 31 => [b=0; f=15; y=0; x=1] == [b=0; fs=0; y=0; x=1; fsv=15]; + +i = 32 => [b=0; f=0; y=1; x=0] == [b=0; fs=0; y=1; x=0; fsv=0]; +i = 33 => [b=0; f=1; y=1; x=0] == [b=0; fs=0; y=1; x=0; fsv=1]; +i = 34 => [b=0; f=2; y=1; x=0] == [b=0; fs=0; y=1; x=0; fsv=2]; +... +i = 47 => [b=0; f=15; y=1; x=0] == [b=0; fs=0; y=1; x=0; fsv=15]; + +i = 48 => [b=0; f=0; y=1; x=1] == [b=0; fs=0; y=1; x=1; fsv=0]; +i = 49 => [b=0; f=1; y=1; x=1] == [b=0; fs=0; y=1; x=1; fsv=1]; +i = 50 => [b=0; f=2; y=1; x=1] == [b=0; fs=0; y=1; x=1; fsv=2]; +... +i = 63 => [b=0; f=15; y=1; x=1] == [b=0; fs=0; y=1; x=1; fsv=15]; + +// second batch +i = 64 => [b=1; f=0; y=0; x=0] == [b=1; fs=0; y=0; x=0; fsv=0]; +i = 65 => [b=1; f=1; y=0; x=0] == [b=1; fs=0; y=0; x=0; fsv=1]; +i = 66 => [b=1; f=2; y=0; x=0] == [b=1; fs=0; y=0; x=0; fsv=2]; +... +i = 79 => [b=1; f=15; y=0; x=0] == [b=1; fs=0; y=0; x=0; fsv=15]; + +i = 80 => [b=1; f=0; y=0; x=1] == [b=1; fs=0; y=0; x=1; fsv=0]; +i = 81 => [b=1; f=1; y=0; x=1] == [b=1; fs=0; y=0; x=1; fsv=1]; +i = 82 => [b=1; f=2; y=0; x=1] == [b=1; fs=0; y=0; x=1; fsv=2]; +... 
+i = 95 => [b=1; f=15; y=0; x=1] == [b=1; fs=0; y=0; x=1; fsv=15]; + +i = 96 => [b=1; f=0; y=1; x=0] == [b=1; fs=0; y=1; x=0; fsv=0]; +i = 97 => [b=1; f=1; y=1; x=0] == [b=1; fs=0; y=1; x=0; fsv=1]; +i = 98 => [b=1; f=2; y=1; x=0] == [b=1; fs=0; y=1; x=0; fsv=2]; +... +i = 111 => [b=1; f=15; y=1; x=0] == [b=1; fs=0; y=1; x=0; fsv=15]; + +i = 112 => [b=1; f=0; y=1; x=1] == [b=1; fs=0; y=1; x=1; fsv=0]; +i = 113 => [b=1; f=1; y=1; x=1] == [b=1; fs=0; y=1; x=1; fsv=1]; +i = 114 => [b=1; f=2; y=1; x=1] == [b=1; fs=0; y=1; x=1; fsv=2]; +... +i = 127 => [b=1; f=15; y=1; x=1] == [b=1; fs=0; y=1; x=1; fsv=15]; +``` + +All formats used by GPU plugin are specified in `src/plugins/intel_gpu/include/intel_gpu/runtime/format.hpp` file. Most of the formats there follow the notation above. + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) \ No newline at end of file diff --git a/src/plugins/intel_gpu/docs/gpu_plugin_driver_troubleshooting.md b/src/plugins/intel_gpu/docs/gpu_plugin_driver_troubleshooting.md new file mode 100644 index 00000000000..5710c74a983 --- /dev/null +++ b/src/plugins/intel_gpu/docs/gpu_plugin_driver_troubleshooting.md @@ -0,0 +1,71 @@ +# Driver issues troubleshooting + +If you see errors like "[CLDNN ERROR]. clGetPlatformIDs error -1001" when running OpenVINO samples / demos, then most likely you have some issues with OpenCL runtime on your machine. This document contains several hints on what to check and how to troubleshoot such kind of issues. + +In order to make sure that OpenCL runtime is functional on your machine, you can use [clinfo](https://github.com/Oblomov/clinfo) tool. On many linux distributives it can be installed via package manager. If it's not available for your system, it can be easily built from sources. + +Example of clinfo output: +``` +Number of platforms 1 + Platform Name Intel(R) OpenCL HD Graphics + Platform Vendor Intel(R) Corporation + + ... + + Platform Name Intel(R) OpenCL HD Graphics +Number of devices 1 + Device Name Intel(R) Graphics [0x3e92] + Device Vendor Intel(R) Corporation + Device Vendor ID 0x8086 + Device Version OpenCL 3.0 NEO + Driver Version 20.49.0 + Device OpenCL C Version OpenCL C 3.0 + Device Type GPU +``` +## 1. Make sure that you have GPU on your system +Some Intel® CPUs might not have integrated GPU, so if you want to run OpenVINO on iGPU, go to [ark.intel website](https://ark.intel.com/) and make sure that your CPU has it. + +## 2. Make sure that OpenCL® Runtime is installed +On Windows OpenCL runtime is a part of the GPU driver, but on linux it should be installed separately. For the installation tips please refer to [OpenVINO docs](https://docs.openvino.ai/latest/openvino_docs_install_guides_installing_openvino_linux_header.html) and [OpenCL Compute Runtime docs](https://github.com/intel/compute-runtime/tree/master/opencl/doc). +To get support of Intel® Iris® Xe MAX Graphics with Linux please follow [driver installation guide](https://dgpu-docs.intel.com/devices/iris-xe-max-graphics/index.html) + + +## 3. Make sure that user has all required permissions to work with GPU device +Add the current Linux user to the `video` group: +``` +sudo usermod -a -G video "$(whoami)" +``` + +## 4. Make sure that iGPU is enabled +``` +$ cat /sys/devices/pci0000\:00/0000\:00\:02.0/enable +1 +``` + +## 5. 
Make sure that "/etc/OpenCL/vendors/intel.icd" contain proper paths to the OpenCL driver +``` +$ cat /etc/OpenCL/vendors/intel.icd +/usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so +``` +Note: path to the runtime lib may vary in different driver versions + +## 6. Use LD_DEBUG=libs to trace loaded libraries +For more details, see the [OpenCL on Linux](https://github.com/bashbaug/OpenCLPapers/blob/markdown/OpenCLOnLinux.md) + +## 7. If you are using dGPU with XMX, ensure that HW_MATMUL feature is recognized +Openvino contains hello_query_device sample application: [link](https://docs.openvino.ai/latest/openvino_inference_engine_ie_bridges_python_sample_hello_query_device_README.html) + +With this option, you can check whether Intel XMX(Xe Matrix Extension) feature is properly recognized or not. This is a hardware feature to accelerate matrix operations and available on some discrete GPUs. +``` +$ ./hello_query_device.py +... +[ INFO ] OPTIMIZATION_CAPABILITIES: FP32, BIN, FP16, INT8, GPU_HW_MATMUL +``` + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) + \ No newline at end of file diff --git a/src/plugins/intel_gpu/docs/gpu_plugin_ops_enabling.md b/src/plugins/intel_gpu/docs/gpu_plugin_ops_enabling.md new file mode 100644 index 00000000000..82a150a17b6 --- /dev/null +++ b/src/plugins/intel_gpu/docs/gpu_plugin_ops_enabling.md @@ -0,0 +1,138 @@ +# GPU plugin operations enabling flow + +## Terminology +* **NGraph operation**: Building block of neural networks, such as convolution or pooling. +* **(clDNN) Primitive**: Basic NN operation that was defined in clDNN. One primitive is usually mapped to one ngraph operation, but graph compilation may cause the mapping not to be 1-to-1. +* **Kernel**: Actual body of execution in GPU. It also refers to specific implementations of **Primitive** for GPU, such as `convolution_gpu_winograd_2x3_s1.cl`. Usually, single kernel fulfills the operation of single primitive, but several kernels may be used to support one primitive. +* **Unittest**: Single-layer test within cldnn. +* **Functional test**: Single-layer test in IE. + +
+ +## Adding new primitive +1. Understand the new operation. + * Review the [ngraph operation spec](https://github.com/openvinotoolkit/openvino/tree/master/docs/ops) + * IE operations(a.k.a primitive or NN-layer) are defined by ngraph. + * You can check ngraph reference implementation of the primitive as well + * e.g. [Scatter Elements Update in nGraph](https://github.com/openvinotoolkit/openvino/blob/master/src/core/reference/include/ngraph/runtime/reference/scatter_elements_update.hpp) + +1. Try to find existing primitive that fully or partially covers this operation. + * It is also possible to transform the network so that the missing primitive is covered from existing primitive. + * e.g. [Replace reduce with pooling](https://github.com/openvinotoolkit/openvino/blob/23808f46f7b5d464fd649ad278f253eec12721b3/inference-engine/src/cldnn_engine/cldnn_engine.cpp#L205) + +1. Add new / extend existing cldnn primitive according to the operation spec. + 1. This phase is to enable primitive within cldnn library, without exposing it to IE. + 1. Implement **reference parallel kernel** that supports all parameters of the operation and all input/output data types and layouts + + | File | Description | + |------|-------------| + | [scatter_elements_update_ref.cl](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/kernel_selector/cl_kernels/scatter_elements_update_ref.cl) | OpenCL Kernel body. For more detail, please see [How to write OCL kernel](#writing-ocl-kernel) section | + | [scatter_elements_update_kernel_ref.(cpp,h)](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/kernel_selector/kernels/scatter_update/scatter_elements_update_kernel_ref.cpp) | Counterpart of kernel body for host | + | [scatter_elements_update_kernel_selector.(cpp,h)](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/kernel_selector/kernels/scatter_update/scatter_elements_update_kernel_selector.cpp) | Kernel selector for a primitive | + | [register_gpu.(cpp,hpp)](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/src/gpu/register_gpu.cpp) | Primitive registration | + | [scatter_elements_update_gpu.cpp](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/src/gpu/scatter_elements_update_gpu.cpp) | Primitive registration, input spec | + | [scatter_elements_update_inst.h](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/graph/include/scatter_elements_update_inst.h) | Node type declaration for cldnn program | + | [clDNN/src/scatter_elements_update.cpp](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/graph/scatter_elements_update.cpp) | Code for scatter_elements_update_inst.h | + | [clDNN/api/cldnn/primitives/scatter_elements_update.hpp](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/include/intel_gpu/primitives/scatter_elements_update.hpp) | clDNN primitive definition | + | [common_types.h](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/kernel_selector/common_types.h) | Enum declaration for KernelType and arguments | + + 1. 
Add unit tests for the new operation + + | File | Description | + |------|-------------| + | [scatter_elements_update_gpu_test.cpp](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/scatter_elements_update_gpu_test.cpp) | Unittest for layer | + + * Need to add reference code or expected result for checking the result. + + * You can also specify the kernel with `force_implementations` in case the primitive contains multiple kernels. + ``` + ... + build_options options; + implementation_desc conv_impl = { format::fs_b_yx_fsv32, "" }; + options.set_option(build_option::force_implementations({ {"conv_fsv", conv_impl} })); + network network(engine, topology, options); + ... + ``` + + * This unit test is built into `clDNN_unit_tests`. It is a gtest application. + ``` + # Show list of test cases + openvino/bin/intel64/Debug$ ./clDNN_unit_tests64 --gtest_list_tests + # Run test + openvino/bin/intel64/Debug$ ./clDNN_unit_tests64 --gtest_filter=scatter_elements_update_gpu_fp16.* + ``` + + * Test scope needs to be comprehensive, but not wasteful. These tests run for every PRs in CI. Let's save the planet. + + 1. Support layer fusion, if applicable + * It is usually easy to fuse some layers, such as scale, activation, quantize and eltwise, into previous layer. This fusing rule can be added to `prepare_primitive_fusing::fuse_simple_primitives`. + * `fuse_simple_primitives` is called during [graph compilation phase](https://github.com/openvinotoolkit/openvino/blob/71c50c224964bf8c24378d16f015d74e2c1e1ce8/inference-engine/thirdparty/clDNN/src/program.cpp#L430) + * You can see general description of layer fusion [here](https://docs.openvinotoolkit.org/latest/openvino_docs_IE_DG_supported_plugins_CL_DNN.html#optimizations) + * Unit tests for layer fusion are placed in a single file: [fusings_gpu_test.cpp](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/tests/test_cases/fusings_gpu_test.cpp). It is also compiled into `clDNN_unit_tests`. + * Code for fused layers are generated with `jitter`. It is created as `FUSED_OPS..` macro in OCL code. This generation logic is in `KernelBase::MakeFusedOpsJitConstants`. + +1. Add / update factory for this operation in the GPU plugin to use new primitive in inference-engine + + | File | Description | + |------|-------------| + | [cldnn_engine/ops/scatter_elements_update.cpp](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/src/cldnn_engine/ops/scatter_elements_update.cpp) | Instantiation from cldnn plugin for IE | + | [cldnn_primitives_list.hpp](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/src/cldnn_engine/cldnn_primitives_list.hpp) | Registration for primitives | + +1. Add functional single layer tests for the operation and try to cover most of the difference use cases of this operation + + | File | Description | + |------|-------------| + | [single_layer_tests/scatter_elements_update.cpp](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/tests/functional/shared_tests_instances/single_layer_tests/scatter_elements_update.cpp) | Single layer test | + + * It is possible to use ngraph reference code for result validation. + * This is compiled into `gpuFuncTests`. It is also `gtest` application. + * Please also review the [general guideline of test infrastructure](https://github.com/openvinotoolkit/openvino/wiki/InferenceEngineTestsInfrastructure) + +1. 
[Optional] If there are existing IRs with this operation, try to run the full model(s) to be sure that it's correctly processed within the context + +1. [Optional] If there are existing IRs with this operation, try to run the full model(s) and estimate performance impact from this operation on total model execution time + +1. Create PR with your changes + * If you are `OpenVINO` group member in github, CI will be triggered. + * Please review the [OpenVINO contribution guide](https://github.com/openvinotoolkit/openvino/wiki/Contribute). + +
+
+## Adding new kernel for an existing primitive
+* The process is quite similar to the previous one. You can skip the steps that already exist.
+* The main work is adding the new kernel and registering it in the kernel selector (see the sketch below).
+* You may need to add a unit test for the new kernel. A specific kernel can be chosen with `build_option::force_implementations`.
+* It is not possible to specify the kernel from a functional test (IE).
+
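+A minimal sketch of the registration step mentioned above, assuming a hypothetical new MVN kernel class called `MVNKernel_new_opt` (the other class names follow the existing MVN kernel selector shown earlier in this document):
+```cpp
+// Hypothetical sketch: the new kernel is attached in the kernel selector constructor
+// so that GetBestKernels() can consider it. MVNKernel_new_opt is an illustrative name.
+mvn_kernel_selector::mvn_kernel_selector() {
+    Attach<MVNKernelRef>();                  // existing reference kernel
+    Attach<MVNKernel_b_fs_yx_fsv16_imad>();  // existing optimized kernel
+    Attach<MVNKernel_new_opt>();             // the newly added kernel is registered here
+}
+```
+Whether the new kernel is actually picked is then decided by its `GetSupportedKey()` / `GetKernelsData()` implementation and the kernel priorities, as described in [Kernels and kernel selectors](./gpu_kernels.md).
+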
+
+## Writing OCL kernel
+
+### Jitter
+In GPU OCL kernels, many conditional statements are processed with `#ifdef` so that they can be handled at compile time. The definitions are created by `jitter.cpp` during graph compilation. You can see the generated macros by following the steps in [source dumps](https://github.com/openvinotoolkit/openvino/wiki/GPUPluginDebugUtils#sources-dumps).
+Jitter also contains run-time parameters such as input and output sizes.
+Additional macros can be defined from the host code of the kernel itself. For example, the code snippet below passes `SUB_GROUP_SIZE` as a macro definition through jitter.
+```
+    // GetJitConstants method of the kernel
+    const size_t sub_group_size = 16;
+    JitConstants jit = MakeBaseParamsJitConstants(params);
+    jit.AddConstant(MakeJitConstant("SUB_GROUP_SIZE", sub_group_size));
+```
+
+### Accessing input and output tensor
+Jitter generates macros for index calculations. With these macros, you can program an OCL kernel in a layout-agnostic way. If you use the macro `${TENSOR_NAME}_GET_INDEX`, you can get a 1d-index from a tensor coordinate whether the format is planar (such as `bfyx` or `byxf`) or blocked (such as `b_fs_yx_fsv16`). You can check the [source code for the GET_INDEX macro](https://github.com/openvinotoolkit/openvino/blob/7f8d3aa63899a3e3362c95eb7d1b04a5899660bd/inference-engine/thirdparty/clDNN/kernel_selector/core/common/jitter.cpp#L313).
+
+### Layout support
+If a kernel is not performance-critical, it is enough to support only the default layouts `bfyx`, `bfzyx` and `bfwzyx`. As an optimized format, `b_fs_yx_fsv16`, `b_fs_yx_fsv4` or `byxf` can be used as well.
+[General description of layouts can be found here](https://github.com/openvinotoolkit/openvino/wiki/GPUPluginMemoryFormats) and [the header file is here](https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/thirdparty/clDNN/api/tensor.hpp).
+
+### Layer fusion
+When layers are fused, `jitter` creates macros to generate code for the fused layers. It is realized as `FUSED_OPS..` in the OCL kernel. You can understand the usage from other kernels.
+There is a [comment that describes layer fusion](https://github.com/openvinotoolkit/openvino/blob/7f8d3aa63899a3e3362c95eb7d1b04a5899660bd/inference-engine/thirdparty/clDNN/kernel_selector/core/kernel_selector_params.h#L521).
+
+## See also
+ * [OpenVINO™ README](../../../../README.md)
+ * [OpenVINO Core Components](../../../README.md)
+ * [OpenVINO Plugins](../../README.md)
+ * [OpenVINO GPU Plugin](../README.md)
+ * [Developer documentation](../../../../docs/dev/index.md)
\ No newline at end of file
diff --git a/src/plugins/intel_gpu/docs/gpu_plugin_unit_test.md b/src/plugins/intel_gpu/docs/gpu_plugin_unit_test.md
new file mode 100644
index 00000000000..87632f28d8e
--- /dev/null
+++ b/src/plugins/intel_gpu/docs/gpu_plugin_unit_test.md
@@ -0,0 +1,263 @@
+# GPU plugin unit test
+
+The GPU plugin has two types of tests: functional tests and unit tests.
+
+- Functional tests cover single layers, behavior, subgraphs, and low-precision transformations at the Inference Engine level, for various layouts and data types such as fp16 and fp32.
+- Unit tests cover cldnn primitives and core type modules at the GPU plugin level. Unlike functional tests, it is possible to test with an explicitly specified input format, such as `bfyx` or `b_fs_yx_fsv16` (see the short illustration below). This documentation is about this type of test.
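+For instance, a unit test can construct its input memory directly in a blocked layout (a minimal illustration; the engine and allocation helpers are described in the sections below):
+```cpp
+// Minimal illustration: a unit test can pin the exact input format,
+// which is not possible at the functional test (IE) level.
+auto& engine = get_test_engine();   // helper from test_utils
+auto input = engine.allocate_memory({data_types::f16, format::b_fs_yx_fsv16, {1, 32, 16, 16}});
+```
+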
+
+# Structure of unit test
+
+The Intel GPU unit test (aka clDNN unit test) is a set of unit tests covering the primitives, fusions, and fundamental core types of the GPU plugin.
+There are four subcategories of unit tests, as shown below.
+
+```bash
+openvino/src/plugins/intel_gpu/tests - root of Intel GPU unit test
+|── fusions
+|── module_tests
+|── test_cases
+└── test_utils
+```
+
+- ### fusions
+  - Fusion is an optimization that fuses several operations into one optimized operation. For example, the two nodes of `conv -> relu` may be fused into a single `conv` node.
+  - Fusion unit tests check whether the fusion is done as expected.
+  - fusion_test_common.cpp
+    - The base class for fusing tests, i.e., [BaseFusingTest](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/fusions/fusion_test_common.hpp#L19), is implemented here. It tests whether the fusing is successful by comparing the execution results of two networks: the fused network and the non-fused network built from the same topology.
+    - [BaseFusingTest](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/fusions/fusion_test_common.hpp#L19) has an important method called *`compare()`*.
+      - The *`compare()`* method performs the following three tasks:
+        - Execute the two networks (fused network and not fused network)
+        - Compare the actual number of executed primitives with the expected number of executed primitives in the test params
+        - Compare the results of the fused network and the non-fused network
+  - eltwise_fusing_test.cpp
+    - Checks whether eltwise is fused to other primitives as expected
+  - [primitive_name]_fusion_test.cpp
+    - Checks that nodes such as eltwise or activation are fused into the [primitive_name] as expected
+  - The details of how to add each instance are described [below](#fusions-1).
+
+- ### test_cases
+  - Mainly checks that cldnn primitives and topology creation work as designed
+  - Also checks configurations for OpenCL functionalities such as cl_cache, cl_mem allocation, and cl_command_queue modes
+
+- ### module_tests
+  - Unit tests for fundamental core modules such as ocl_user_events, format, layout, and usm memory:
+    - Check that ocl_user_event works as expected
+    - Check that all formats are converted to their string and traits
+    - Check that various layouts are created as expected
+    - Check usm_host and usm_device memory buffer creation and read/write functionality
+
+- ### test_utils
+  - Defines base functions of the unit tests, such as *`get_test_engine()`*, which returns `cldnn::engine`
+  - Utility functions such as Float16, random_gen, and uniform_quantized_real_distribution
+
+
+# How to run unit tests
+
+## Build unit test
+
+1. Turn on `ENABLE_TESTS` and `ENABLE_CLDNN_TESTS` in the cmake options:
+
+    ```bash
+    cmake -DCMAKE_BUILD_TYPE=Release \
+    -DENABLE_TESTS=ON \
+    -DENABLE_CLDNN_TESTS=ON \
+    -DENABLE_CLDNN=ON ..
+    ```
+
+2. Build:
+
+    ```bash
+    make clDNN_unit_tests
+    ```
+
+3. 
You can find the _`clDNN_unit_tests64`_ binary in the `bin` directory after the build.
+
+
+
+## Run unit test
+
+You can run _`clDNN_unit_tests64`_ from the `bin` directory, which is the output directory of the OpenVINO build.
+
+If you want to run a specific unit test, you can use the `gtest_filter` option as follows:
+
+```
+./clDNN_unit_tests64 --gtest_filter='*filter_name*'
+```
+
+Then, you will get a result like this:
+
+```bash
+openvino/bin/intel64/Release$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD
+openvino/bin/intel64/Release$ ./clDNN_unit_tests64 --gtest_filter='*fusings_gpu/conv_fp32_reorder_fsv16_to_bfyx.basic/0*'
+Running main() from /home/openvino/thirdparty/gtest/gtest/googletest/src/gtest_main.cc
+Note: Google Test filter = *fusings_gpu/conv_fp32_reorder_fsv16_to_bfyx.basic/0*
+[==========] Running 1 test from 1 test suite.
+[----------] Global test environment set-up.
+[----------] 1 test from fusings_gpu/conv_fp32_reorder_fsv16_to_bfyx
+[ RUN      ] fusings_gpu/conv_fp32_reorder_fsv16_to_bfyx.basic/0
+[       OK ] fusings_gpu/conv_fp32_reorder_fsv16_to_bfyx.basic/0 (84 ms)
+[----------] 1 test from fusings_gpu/conv_fp32_reorder_fsv16_to_bfyx (84 ms total)
+[----------] Global test environment tear-down
+[==========] 1 test from 1 test suite ran. (85 ms total)
+[  PASSED  ] 1 test.
+```
+
+
+# How to create new test case
+
+## TEST and TEST_P (GoogleTest macros)
+
+GPU unit tests use two types of test macros (**TEST** and **TEST_P**) from [GoogleTest (aka gtest)](https://google.github.io/googletest/).
+
+- ### **TEST**
+  - **TEST** is the simple test case macro.
+  - To create a test case using **TEST**, define an individual test named *`TestName`* in the test suite *`TestSuiteName`*:
+
+    ```
+    TEST(TestSuiteName, TestName) {
+      ... test body ...
+    }
+    ```
+  - The test body can be any code under test. To determine the outcome within the test body, use assertions such as *`EXPECT_EQ`* and *`ASSERT_NE`*.
+
+- ### **TEST_P**
+  - **TEST_P** is used to define a test case that runs with sets of test parameters.
+  - To create a test case using **TEST_P**, define an individual value-parameterized test named *`TestName`* that uses the test fixture class *`TestFixtureName`*, which is also the test suite name:
+
+    ```
+    TEST_P(TestFixtureName, TestName) {
+      ... statements ...
+    }
+    ```
+  - Then, instantiate the value-parameterized test suite *`TestSuiteName`*, which is defined with **TEST_P**:
+    ```c++
+    INSTANTIATE_TEST_SUITE_P(InstantiationName, TestSuiteName, param_generator)
+    ```
+
+
+## module_test and test_cases
+
+- module_test and test_cases test the GPU plugin using both **TEST_P** and **TEST**.
+- Please refer to [the fusion test](#fusions-1) for test cases based on **TEST_P**.
+- **TEST** checks the test result by comparing the execution results with expected values after running a network created from the target topology. 
+
+  - In **TEST**, it is important to generate the test input and the expected output.
+  - You can create input data and expected output data in the following three ways:
+    - Generate simple input data and calculate the expected output data from the input data manually, like [basic_deformable_convolution_def_group1_2](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/convolution_gpu_test.cpp#L254)
+    - Generate random input and get the expected output from a reference function implemented in the test code, like [mvn_test_across_channels_outside_sqrt_bfyx](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/mvn_gpu_test.cpp#L108)
+    - Generate random input and get the expected output from another reference kernel that already exists among the cldnn kernels, like [mvn_random_test_bsv32](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/test_cases/mvn_gpu_test.cpp#L793)
+
+- When you allocate input data, keep in mind that the dimension order passed to *`engine.allocate_memory`* is not *`bfyx`* but *`bfxy`*. For example, if the input shape is {1,1,4,5}, the allocation should be as below:
+
+  ```c++
+  auto input = engine.allocate_memory({ data_types::f32, format::bfyx, { 1, 1, 5, 4 } });
+  ```
+
+
+## fusions
+
+- Fusion tests are implemented with **TEST_P** because there are many cases where multiple layouts are tested on the same topology.
+- If a suitable fusing test class already exists, you can use it. Otherwise, create a new fusing test class that inherits from [BaseFusingTest](https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/tests/fusions/fusion_test_common.hpp#L19).
+  - The new fusing test class should provide an `execute()` method which creates the fused / non-fused networks and calls the *`compare`* method after setting the input.
+- Create the test case using **TEST_P**.
+  - You can build the desired network using `create_topologies`, as shown in the diagram below. 
+```mermaid +flowchart LR + nodeA1(bias) --> nodeA2(conv_prim) + nodeA3(input) --> nodeA2(conv_prim) + nodeA4(weights) --> nodeA2(conv_prim) + nodeA2(conv_prim) --> nodeA5(eltwise2_mul) + nodeA6(eltwise1_data) --> nodeA7(eltwise1_add) + nodeA2(conv_prim) --> nodeA7(eltwise1_add) + nodeA7(eltwise1_add) --> nodeA8(activation) + nodeA8(activation) --> nodeA5(eltwise2_mul) + nodeA9(eltwise2_data) --> nodeA10(eltwise3_div) + nodeA11(eltwise4_data) --> nodeA12(eltwise4_add) + nodeA5(eltwise2_mul) --> nodeA10(eltwise3_div) + nodeA10(eltwise3_div) --> nodeA12(eltwise4_add) + nodeA12(eltwise4_add) --> nodeA13(reorder_bfyx) +classDef no-bg-color fill:none,stroke-width:0px +classDef moss1 fill:#D7F3A2, stroke: #B1D272, color: #262626 +classDef steel1 fill:#B9D6E5, stroke: #86B3CA, color: #262626 +classDef daisy1 fill:#FFE17A, stroke: #FEC91B, color: #262626 +classDef coral1 fill:#FFB6B9, stroke: #FF848A, color: #262626 +classDef carbon1 fill:#E9E9E9, stroke: #AEAEAE, color: #262626 +class nodeA7,nodeA5,nodeA10,nodeA12 coral1 +class nodeA2,nodeA13 daisy1 +class nodeA3 moss1 +class nodeA8 steel1 +class nodeA4,nodeA1,nodeA6,nodeA9,nodeA11 carbon1 +``` + - For example, if you design the networks like the one above, you can make the test code as follow + + ```c++ + class conv_fp32_multi_eltwise_4_clamp : public ConvFusingTest {}; + TEST_P(conv_fp32_multi_eltwise_4_clamp, basic) { + if (engine.get_device_info().supports_immad) { + return; + } + auto p = GetParam(); + create_topologies( + input_layout("input", get_input_layout(p)), + data("eltwise1_data", get_mem(get_output_layout(p))), + data("eltwise2_data", get_mem(get_output_layout(p))), + data("eltwise4_data", get_mem(get_output_layout(p))), + data("bias", get_mem(get_bias_layout(p))), + data("weights", get_mem(get_weights_layout(p))), + convolution("conv_prim", "input", { "weights" }, { "bias" }, p.groups, p.stride, p.pad, p.dilation), + eltwise("eltwise1_add", "conv_prim", "eltwise1_data", eltwise_mode::sum), + activation("activation", "eltwise1_add", activation_func::clamp, { 0.5f, 2.5f }), + eltwise("eltwise2_mul", "activation", "conv_prim", eltwise_mode::prod), + eltwise("eltwise3_div", "eltwise2_mul", "eltwise2_data", eltwise_mode::prod), + eltwise("eltwise4_add", "eltwise3_div", "eltwise4_data", eltwise_mode::sum), + reorder("reorder_bfyx", "eltwise4_add", p.default_format, data_types::f32) + ); + implementation_desc conv_impl = { format::b_fs_yx_fsv16, "" }; + bo_fused.set_option(build_option::force_implementations({ { "conv_prim", conv_impl } })); + tolerance = 1e-5f; + execute(p); + } + + ``` + + - If you want to change some node's layout format to specific format, you can change it using *`build_option::force_implementations`*. + - In the sample codes, *`conv_prim`* is set to *`format::b_fs_yx_fsv16`* by *`build_option::force_implementations`* +- *`tolerance`* is used as to threshold to check whether or not output result are same between fused network and non fused network in *`compare`* function. +- After the test case is implemented, use `INSTANTIATE_TEST_SUITE_P` to set the test suite for each parameter case as follows. + - Check all variables in *`convolution_test_params`* to make `CASE_CONV_FP32_2`. + - In *`convolution_test_params`*, all tensor, format, and data_types are used in common in all convolution fusing tests. 
So you can define `CASE_CONV_FP32_2` with all variables except *`expected_fused_primitives`* and *`expected_not_fused_primitives`*.
+
+```c++
+struct convolution_test_params {
+    tensor in_shape;
+    tensor out_shape;
+    tensor kernel;
+    tensor stride;
+    tensor pad;
+    tensor dilation;
+    uint32_t groups;
+    data_types data_type;
+    format input_format;
+    data_types weights_type;
+    format weights_format;
+    data_types default_type;
+    format default_format;
+    size_t expected_fused_primitives;
+    size_t expected_not_fused_primitives;
+};
+
+
+// in_shape; out_shape; kernel; stride; pad; dilation; groups; data_type; input_format; weights_type; weights_format; default_type; default_format;
+#define CASE_CONV_FP32_2 { 1, 16, 4, 5 }, { 1, 32, 2, 3 }, { 1, 1, 3, 3 }, tensor{ 1 }, tensor{ 0 }, tensor{ 1 }, 1, data_types::f32, format::b_fs_yx_fsv16, data_types::f32, format::os_is_yx_isv16_osv16, data_types::f32, format::bfyx
+
+
+INSTANTIATE_TEST_SUITE_P(fusings_gpu, conv_fp32_scale, ::testing::ValuesIn(std::vector<convolution_test_params>{
+    convolution_test_params{ CASE_CONV_FP32_2, 2, 3 }, // CASE_CONV_FP32_2, # of expected fused primitives, # of expected not fused primitives
+    convolution_test_params{ CASE_CONV_FP32_3, 2, 3 },
+}));
+```
+
+## See also
+ * [OpenVINO™ README](../../../../README.md)
+ * [OpenVINO Core Components](../../../README.md)
+ * [OpenVINO Plugins](../../README.md)
+ * [OpenVINO GPU Plugin](../README.md)
+ * [Developer documentation](../../../../docs/dev/index.md)
\ No newline at end of file
diff --git a/src/plugins/intel_gpu/docs/graph_optimization_passes.md b/src/plugins/intel_gpu/docs/graph_optimization_passes.md
new file mode 100644
index 00000000000..5a96e74a80c
--- /dev/null
+++ b/src/plugins/intel_gpu/docs/graph_optimization_passes.md
@@ -0,0 +1,27 @@
+# Graph Optimization Passes
+
+Graph optimization is a collection of optimization passes that convert a general network description into a network description for GPU execution. It happens in the constructor of `cldnn::program`. In other words, the input of graph optimization is a `topology`[(link)](./basic_data_structures.md#topology) and the output is a `program`[(link)](./basic_data_structures.md#program-impl--).
+
+The transformation from the original graph into the final graph is quite complicated, so the steps are divided into smaller pieces (passes). The purpose of this documentation is not to explain every step in detail, but to explain the key steps.
+
+For debugging purposes, you can dump the optimized graph after each step. See this [link](./gpu_debug_utils.md#graph-dumps) for details.
+
+Note: The optimization passes run in sequence and the prefixed number indicates the sequence. However, these sequence numbers might change in the future.
+
+* **00_init**: First step of the optimization. If you want to see the first cldnn graph, you can check this. It collects network output node information and sets the node processing order.
+* **08_prepare_primitive_fusing**: Fuses post-operations into other primitives. For example, relu is fused into convolution. An element-wise add operation can usually be fused into its predecessor, too. The layout for the primitive is not chosen at this point yet, so we do not know which kernel will be chosen for the primitive. However, support for a post-operation depends on the chosen kernel. That is why this pass contains some logic to guess the layout.
+* **09_reorder_inputs**: Selects a layout format for each primitive. 
This is done by calling the `layout_optimizer::get_preferred_format` function, which returns the preferred format for a node (or "any", which means that the format must be propagated from adjacent nodes if possible). Then it propagates formats for nodes with the "any" preferred format to minimize local reorders. After propagating formats, it inserts the actual reorder nodes into the graph. As a result of this pass, we get a quite complicated graph with many _redundant_ reorders, which will be removed by `remove_redundant_reorders`.
+* **17_remove_redundant_reorders**: This pass is about removing reorders, and it has two conceptual purposes. The first one is removing _redundant_ reorders. For example, when the network contains a pattern like `reorder - reorder - reorder`, it can be shrunk into a single `reorder`. The second one is about supporting cross-layout operations of a primitive. For example, when a `convolution` needs to receive `bfyx` input and to generate `b_fs_yx_fsv16` output, the initial graph from `reorder_inputs` looks like this: `data(bfyx) --> reorder(b_fs_yx_fsv16) --> convolution(b_fs_yx_fsv16)`. This pass looks for such a pattern and removes the reorder to generate a cross-layout graph for the target convolution: `data(bfyx) --> convolution(b_fs_yx_fsv16)`
+* **19_prepare_buffer_fusing**: This pass is for implicit concat or implicit crop. Implicit concat is about removing the `concatenation` primitive when the two predecessors can put their results into the target buffer of concat directly. For example, if two convolution results are concatenated along the f-axis and the layout is bfyx format with b=1, we can just remove the concat primitive and manipulate the output addresses of the convolutions to point to the proper locations.
+* **20_add_required_reorders**: This pass tries to keep graph consistency and adds a reorder if the current format is not supported by a node. It checks whether the current input format is present in the `implementation_map` defined in the `_gpu.cpp` file. If it is not defined, this pass tries to change the layout to one of the most common formats [bfyx, yxfb, byxf] and picks the first supported one.
+* **21_add_onednn_optimization_attributes**: This pass generates onednn attributes for post-operations[(link)](https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html#post-ops-and-attributes). The OpenVINO GPU plugin (a.k.a. cldnn) has its own set of defined post-operations, and some transformation is required to map those into onednn post-operations.
+* **22_compile_graph**: This pass creates `primitive_impl` through the kernel selector. In this pass, the kernel for each node is chosen. For onednn primitives, OpenCL code is compiled at this stage. For cldnn primitives, OpenCL code will be compiled after all passes.
+* **26_propagate_constants**: This pass reorders weights for convolution, deconvolution, and FC to the required format. As the kernel is chosen in the `compile_graph` stage, it is now known which reordering is required for the weights. This is because the weights are stored in a simple planar format in the IR, but another format is usually required for an optimized convolution (or deconv, FC). In order to reorder weights, this pass creates a simple graph that receives the weights and generates the reordered weights. We get the reordered weights by executing that network, and the reordered weights are inserted back into the original graph.
+* **31_oooq_memory_dependencies**: In GPU, device memory is a limited resource and it is not necessary to keep all the intermediate results when running inference on a network. Therefore, a buffer is reused when its content is not needed anymore. 
However, it is necessary to take into consideration that the intel_gpu plugin uses an out-of-order queue. As we are not sure about the exact sequence of execution, there are additional limitations on buffer reuse. For example, in the case of a multi-branch structure like inception, there are no direct dependencies between the branches except for the common ancestor. However, in OOOQ execution mode, as we are not sure about the sequence of execution within the inception module, the buffer from one branch must not be reused by another branch. Such _implicit dependency_ information is processed in this pass.
+
+## See also
+ * [OpenVINO™ README](../../../../README.md)
+ * [OpenVINO Core Components](../../../README.md)
+ * [OpenVINO Plugins](../../README.md)
+ * [OpenVINO GPU Plugin](../README.md)
+ * [Developer documentation](../../../../docs/dev/index.md)
\ No newline at end of file
diff --git a/src/plugins/intel_gpu/docs/memory_allocation_gpu_plugin.md b/src/plugins/intel_gpu/docs/memory_allocation_gpu_plugin.md
new file mode 100644
index 00000000000..aa1d54d3733
--- /dev/null
+++ b/src/plugins/intel_gpu/docs/memory_allocation_gpu_plugin.md
@@ -0,0 +1,51 @@
+# Memory allocation in GPU plugin
+
+## Allocation types
+The GPU plugin supports four types of memory allocation, as listed below. Note that the prefix `usm_` indicates an allocation type that uses the Intel Unified Shared Memory (USM) extension for OpenCL. For more detailed information about the USM extension, refer to [this](https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_unified_shared_memory.html) page.
+* `cl_mem` : Standard OpenCL cl_mem allocation.
+* `usm_host` : Allocated in host memory and accessible by both host and device. Not migratable.
+* `usm_shared` : Allocated in host and device memory and accessible by both host and device. The memory is migrated automatically on demand.
+* `usm_device` : Allocated in device memory and accessible only by the device that owns the memory. Not migratable.
+
+Note that there are a few restrictions on memory allocation:
+
+* The allocation of a single memory object should not exceed the available device memory size, i.e., the value obtained by `CL_DEVICE_GLOBAL_MEM_SIZE`.
+* The sum of all memory objects required to execute a kernel (i.e., the sum of the inputs and outputs of a kernel) should not exceed the target available memory. For example, if you want to allocate a memory object in device memory, the above restrictions should be satisfied against the device memory. Otherwise, the memory object should be allocated in host memory.
+
+## Memory allocation API
+In the GPU plugin, the allocation for each allocation type can be done with [engine::allocate_memory](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/include/intel_gpu/runtime/engine.hpp#L51), which
+calls the corresponding memory object wrapper for each allocation type: [gpu_buffer](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L35), [gpu_usm](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp#L291).
+
+## Dump memory allocation history
+The memory allocation history is managed by the `engine` and can be dumped by setting the environment variable `OV_GPU_Verbose=1` if OpenVINO is built with the cmake configuration `ENABLE_DEBUG_CAPS=ON`.
+```cpp
+... 
+GPU_Debug: Allocate 58982400 bytes of usm_host allocation type (current=117969612; max=117969612)
+GPU_Debug: Allocate 44621568 bytes of usm_device allocation type (current=44626380; max=44626380)
+GPU_Debug: Allocate 44236800 bytes of usm_host allocation type (current=162206412; max=162206412)
+GPU_Debug: Allocate 14873856 bytes of usm_device allocation type (current=59500236; max=59500236)
+...
+```
+Here, `current` denotes the total amount of allocated memory at that moment, while `max` denotes the peak total memory allocation up to that moment.
+
+## Allocated memory objects
+The typical memory allocations performed in the GPU plugin can be categorized as follows:
+* `Constant memory allocation`: In the GPU plugin, constant data are held by the `data` primitives, and the required memory objects are [allocated](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/plugin/ops/constant.cpp#L181) and assigned at the creation of the data primitive. First, the memory is allocated on the host and the constant data are copied from the corresponding blob in ngraph. Once all the transformation and optimization processes in `cldnn::program` are finished and the user nodes of those data are known to be GPU operations using device memory, the memory is reallocated on the device and the constant data are copied there (i.e., [transferred](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/graph/program.cpp#L457)). Note that constant data are shared within batches and streams.
+* `Output memory allocation`: A memory object to store the output result of each primitive is created at the creation of each primitive_inst ([link](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/src/graph/primitive_inst.cpp#L263)), except for the cases when the output reuses the input memory. Note that primitive_insts are created in descending order of output memory size to achieve better memory reuse efficiency.
+
+* `Intermediate memory allocation`: Some primitives, such as _detection_output_ and _non_max_suppression_, consist of multiple kernels and require intermediate memories to exchange data between those kernels. The allocation of such intermediate memories happens after all allocations for primitive_insts are finished ([link](https://github.com/openvinotoolkit/openvino/blob/4c01d6c50c6d314373dffd2a8ddbc294011b2508/src/plugins/intel_gpu/src/graph/network.cpp#L592)), since they need to be processed in processing order so that the predecessors' allocation information can be used to decide whether to allocate them in device memory, by checking the memory allocation restrictions described above.
+
+## Memory dependency and memory pool
+In the GPU plugin, multiple memory objects can be allocated at the same address when there is no dependency between their users. For example, the memory region of a program_node _A_'s output can be allocated for another program_node _B_'s output, if the output of _A_ is no longer used by any other program_node when the result of _B_ is to be stored. This mechanism is realized by the following two parts:
+1. `Memory dependency` : The memory_dependencies of a program_node are set by the memory dependency passes. 
There are two kinds of memory dependency passes, as follows:
+    * `basic_memory_dependencies` : Assuming in-order-queue execution, this pass adds dependencies to a program_node, deduced by checking only its direct input and output nodes.
+    * `oooq_memory_dependencies` : Assuming out-of-order-queue execution, this pass adds dependencies to all pairs of program_nodes that can potentially be executed at the same time.
+2. `Memory pool` : The GPU plugin has a [memory pool](https://github.com/openvinotoolkit/openvino/blob/de47a3b4a4ba1f8464b85a665c4d58403e0d16b8/src/plugins/intel_gpu/include/intel_gpu/runtime/memory_pool.hpp), which is responsible for deciding whether to allocate new memory or reuse existing memory for an allocation request. This memory_pool utilizes the memory dependencies set by the above two passes when making the reuse decision. Note that each `cldnn::network` has its own `memory_pool`.
+
+## See also
+ * [OpenVINO™ README](../../../../README.md)
+ * [OpenVINO Core Components](../../../README.md)
+ * [OpenVINO Plugins](../../README.md)
+ * [OpenVINO GPU Plugin](../README.md)
+ * [Developer documentation](../../../../docs/dev/index.md)
+ 
\ No newline at end of file
diff --git a/src/plugins/intel_gpu/docs/simplified_workflow.md b/src/plugins/intel_gpu/docs/simplified_workflow.md
new file mode 100644
index 00000000000..7d72cc3b9bb
--- /dev/null
+++ b/src/plugins/intel_gpu/docs/simplified_workflow.md
@@ -0,0 +1,154 @@
+# GPU plugin workflow
+
+The simplified workflow of the GPU plugin is shown in the diagram below:
+
+```mermaid
+classDiagram
+class `intel_gpu::Plugin` {Inference Engine plugin
+implementation for GPU}
+class `intel_gpu::CompiledModel` {Device specific network
+representation that can be executed}
+class `intel_gpu::InferRequestAsync` {
+Inference request for specific executable network. 
+Wrapper for input and output memory} +class `intel_gpu::TransformationPipeline` {Set of ngraph-based transformations +configured by GPU plugin} +`Core::compile_model()` --> `intel_gpu::CompiledModel` +`CompiledModel::create_infer_request()` -->`intel_gpu::InferRequestAsync` +`InferRequest::start_async()` --> `intel_gpu::network` +`intel_gpu::Plugin` --|> `InferenceEngine::InferencePluginInternal` +`intel_gpu::Plugin` --> `intel_gpu::CompiledModel` : Create +`intel_gpu::CompiledModel` --|> `InferenceEngine::ExecutableNetworkThreadSafeDefault` +`intel_gpu::CompiledModel` --> `intel_gpu::InferRequestAsync` : Create +`intel_gpu::TransformationPipeline` --> `ov::Model` +`intel_gpu::TransformationPipeline` --> `intel_gpu::CompiledModel` +`InferenceEngine::InferRequestInternal` +class `intel_gpu::Graph` {Per stream copy of +compiled graph with +independent memory} +`intel_gpu::Graph` "1..N" --* `intel_gpu::CompiledModel` +class `intel_gpu::ProgramBuilder` {Object for operations +semantic translation and +graph compilation} +`intel_gpu::CompiledModel` --> `intel_gpu::ProgramBuilder` : Create +`intel_gpu::ProgramBuilder` "1" --o "N" `intel_gpu::Graph` +class `intel_gpu::convolution` {convolution operation descriptor} +class `intel_gpu::data` {Primitive representing +constant data in a topology} +class `intel_gpu::input_layout` {Represents dynamic input data} +class `intel_gpu::primitive_base` {<>} +`intel_gpu::convolution` ..<| `intel_gpu::primitive_base` +`intel_gpu::data` ..<| `intel_gpu::primitive_base` +`intel_gpu::input_layout` ..<| `intel_gpu::primitive_base` +`Any other primitive` ..<| `intel_gpu::primitive_base` +class `intel_gpu::topology` { +Set of primitives. Each primitive +knows operation parameters, +it's inputs and outputs} +class `intel_gpu::program` { +Class that contains compiled topology. 
+All kernels are selected, +memory dependencies are resolved, +the only missing thing - memory for intermediate buffers} +`intel_gpu::primitive_base` "0..N" --o `intel_gpu::topology` +`intel_gpu::program` --> `intel_gpu::topology` +`intel_gpu::ProgramBuilder` --> `intel_gpu::topology` : Create +`intel_gpu::ProgramBuilder` --> `intel_gpu::program` : Create +class `intel_gpu::program_node` {Base class for representation of a single graph node} +class `intel_gpu::primitive_impl` { +<> +Base class for representation of a single graph node} +class `intel_gpu::typed_primitive_onednn_impl` {Implementations that use oneDNN library} +class `oneDNN library` {statically linked into GPU plugin} +class `intel_gpu::typed_primitive_ocl_impl` {OCL implementations that use +kernels from kernel_selector} +class `intel_gpu::kernel_selector` { +module that stores OCL kernels +for primitives and has embed some +rules for optimal kernel selection} +`intel_gpu::program_node` --o `intel_gpu::program` +`intel_gpu::primitive_impl` --o `intel_gpu::program_node` +`intel_gpu::typed_primitive_onednn_impl` ..<| `intel_gpu::primitive_impl` +`intel_gpu::typed_primitive_ocl_impl` ..<| `intel_gpu::primitive_impl` +`intel_gpu::typed_primitive_ocl_impl` ..> `intel_gpu::kernel_selector` +`intel_gpu::typed_primitive_onednn_impl` --> `oneDNN bridge` : Use +`intel_gpu::typed_primitive_onednn_impl` ..> `oneDNN library` +class `intel_gpu::build_options` {Set of options for graph compilations} +class `intel_gpu::pass_manager` {Helper to run graph transformations} +class `intel_gpu::base_pass` { +<> +Base class for graph transformations} +`intel_gpu::program` --> `intel_gpu::build_options` +`intel_gpu::program` --> `intel_gpu::pass_manager` : Use +`intel_gpu::program` --> `intel_gpu::base_pass` : Use +`intel_gpu::pass_manager` --> `intel_gpu::base_pass` : Run +class `intel_gpu::prepare_primitive_fusing` { +Pass that fuses multiple operations into single node} +class `intel_gpu::prepare_quantization` { +Pass that prepares models for low precision execution} +class `intel_gpu::reorder_inputs` { +Pass that is responsible for layout/impl selection} +class `intel_gpu::compile_graph` { +Pass that selects and creates +best implementation for each primitive} +class `intel_gpu::remove_redundant_reorders` { +Pass that optimizes reorders in the graph} +`intel_gpu::prepare_primitive_fusing`--|> `intel_gpu::base_pass` +`intel_gpu::prepare_quantization`--|> `intel_gpu::base_pass` +`intel_gpu::reorder_inputs`--|> `intel_gpu::base_pass` +`intel_gpu::compile_graph`--|> `intel_gpu::base_pass` +`intel_gpu::layout_optimizer`--|> `intel_gpu::base_pass` +`intel_gpu::remove_redundant_reorders`--|> `intel_gpu::base_pass` +`intel_gpu::reorder_inputs`--> `intel_gpu::layout_optimizer` : Use +class `intel_gpu::network` { +A program with allocated memory. +Can be executed on the device} +`intel_gpu::InferRequestAsync` --> `intel_gpu::network` : Set input/output memory and run execution +`intel_gpu::network` --> `intel_gpu::InferRequestAsync` : Return inference result +class `intel_gpu::tensor` {Size of memory buffer} +class `intel_gpu::format` {Order of elements in memory} +class `intel_gpu::data_type` {elements precision} +class `intel_gpu::memory_pool` { +Object that tracks memory allocations +and tries to reuse memory buffers} +class `intel_gpu::layout` {Memory descriptor} +class `intel_gpu::memory` {GPU memory object} +class `intel_gpu::stream` { +Abstraction for queue. 
+Knows how to submit kernels and + provide some synchronization capabilities} +class `intel_gpu::event` {Synchronization primitive} +class `intel_gpu::kernel` {Holds kernel handle} +class `intel_gpu::engine` {Engine for specific device, +responsible for memory allocations} +class `intel_gpu::device` {Holds context/device handles for selected backend} +class `intel_gpu::device_info` {Storage for device capabilities and info} +class `intel_gpu::engine_configuration` {Options for engine} +class `intel_gpu::device_query` {Detects available devices for given backend} +`intel_gpu::tensor` --o `intel_gpu::layout` +`intel_gpu::format` --o `intel_gpu::layout` +`intel_gpu::data_type` --o `intel_gpu::layout` +`intel_gpu::layout` --o `intel_gpu::memory` +`intel_gpu::memory` --o "0..N" `intel_gpu::memory_pool` +`intel_gpu::memory` --o `intel_gpu::data` +`intel_gpu::memory_pool` --* `intel_gpu::network` +`intel_gpu::stream` --* `intel_gpu::network` +`intel_gpu::stream` --> `intel_gpu::event` +`intel_gpu::stream` --> `intel_gpu::kernel` +`intel_gpu::engine` --> `intel_gpu::stream` : Create +`intel_gpu::engine` --> `intel_gpu::memory` : Create +`intel_gpu::engine` --> `intel_gpu::engine_configuration` +`intel_gpu::engine` -- `oneDNN library` : Share context/device/queue handles +`intel_gpu::device` --o `intel_gpu::engine` +`intel_gpu::device_info` --o `intel_gpu::device` +`intel_gpu::device_query` --> `intel_gpu::device` +`OCL Implementation of Runtime`..<| `Runtime module API & common` +`SYCL/L0 Implementation of Runtime (POC)`..<| `Runtime module API & common` +``` + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) \ No newline at end of file diff --git a/src/plugins/intel_gpu/docs/source_code_structure.md b/src/plugins/intel_gpu/docs/source_code_structure.md new file mode 100644 index 00000000000..da7d6141ed0 --- /dev/null +++ b/src/plugins/intel_gpu/docs/source_code_structure.md @@ -0,0 +1,68 @@ +# GPU plugin structure + +Historically GPU plugin was built on top of standalone [clDNN library](https://github.com/intel/clDNN) for DNNs inference on Intel® GPUs, +but at some point clDNN became a part of OpenVINO, so now it's a part of overall GPU plugin code. + +OpenVINO GPU plugin is responsible for: + 1. [IE Plugin API](https://docs.openvino.ai/latest/openvino_docs_ie_plugin_dg_overview.html) implementation. + 2. Translation of model from common IE semantic (ov::Function) into plugin specific one (cldnn::topology) which is then compiled into + gpu graph representation (cldnn::network). + 3. Implementation of OpenVINO operation set for Intel® GPU. + 4. Device specific graph transformations. + 5. Memory allocation and management logic. + 6. Processing of incoming InferRequests using clDNN objects. + 7. Actual execution on GPU device. + +As Intel GPU Plugin source code structure is shown below: +
+src/plugins/intel_gpu                  - root GPU plugin folder
+             ├── include               
+             │   ├── intel_gpu         - library internal headers
+             │   │   ├── graph         - headers for internal graph representations
+             │   │   ├── plugin        - definition of classes required for OpenVINO plugin API implementation
+             │   │   ├── primitives    - primitive definitions for all supported operations
+             │   │   └── runtime       - abstraction for execution runtime entities (memory, device, engine, etc)
+             │   └── va
+             ├── src
+             │   ├── graph - all sources related to internal graph representation
+             │   │    ├── graph_optimizer - passes for graph transformations
+             │   │    ├── impls - definition of primitive implementations
+             │   │    └── include - headers with graph nodes
+             │   │ 
+             │   ├── kernel_selector - OpenCL™ kernels (host+device parts) + utils for optimal kernels selection
+             │   │   ├── common      - definition of some generic classes/structures used in kernel_selector
+             │   │   └── core        - kernels, kernel selectors, and kernel parameters definitions
+             │   │       ├── actual_kernels  - host side part of OpenCL™ kernels including applicability checks, performance heuristics and Local/Global work-groups description
+             │   │       ├── cache  - cache.json - tuning cache of the kernels which is redistributed with the plugin to improve kernels and kernel parameters selection for better performance
+             │   │       ├── cl_kernels - templates of GPU kernels (device part) written on OpenCL™
+             │   │       └── common - utils for code generation and kernels selection 
+             │   ├── plugin - implementation of OpenVINO plugin API
+             │   │    └── ops - factories for conversion of OpenVINO operations to internal primitives
+             │   └── runtime
+             │        └── ocl/ - implementation for OpenCL™ based runtime
+             ├── tests
+             │   ├── test_cases
+             │   └── test_utils
+             └── thirdparty
+                 ├── onednn_gpu - oneDNN submodule which may be used to accelerate some primitives
+                 └── rapidjson  - thirdparty RapidJSON lib for reading json files (cache.json)
+
+ +One last thing that is worth mentioning is functional tests which is located in the following location: +``` +src/tests/functional/plugin/gpu +``` +Most of the tests are reused across plugins, and each plugin only need to add test instances with some specific parameters. + +Shared tests are located here: +``` +src/tests/functional/plugin/shared <--- test definitions +src/tests/functional/plugin/gpu/shared_tests_instances <--- instances for GPU plugin +``` + +## See also + * [OpenVINO™ README](../../../../README.md) + * [OpenVINO Core Components](../../../README.md) + * [OpenVINO Plugins](../../README.md) + * [OpenVINO GPU Plugin](../README.md) + * [Developer documentation](../../../../docs/dev/index.md) \ No newline at end of file diff --git a/src/tests/README.md b/src/tests/README.md index ba43d1a2dfd..f751809b8e3 100644 --- a/src/tests/README.md +++ b/src/tests/README.md @@ -15,7 +15,7 @@ This is OpenVINO Inference Engine testing framework. OpenVINO Inference Engine t files. > **Example**: We have `ie_reshaper.cpp` within the `src/shape_infer` subfolder of the tested module. In this case - new `shape_infer` subfolder should be created within the root of the Unit Test folder for this module. And new + new `shape_infer` subfolder should be created within the the root of the Unit Test folder for this module. And new `ie_reshaper_test.cpp` file should be created within this newly created subfolder. This test file should cover all the classes and methods from the original file. @@ -47,7 +47,7 @@ This is OpenVINO Inference Engine testing framework. OpenVINO Inference Engine t creating of a CNNNetwork using it. If a required layer is not covered by Ngraph it's allowed to build IR file using `xml_net_builder` utility (please refer to the `ir_net.hpp` file). IR XML files hardcoded as strings within the test code should not be used. - * All the plugin test cases are parametrized with (at least) the device name and included to the common + * All the plugin test cases are parameterized with (at least) the device name and included to the common `funcSharedTests` static library. This library is linked to the Plugin Test binaries. And all the plugin developers just add required test instantiations based on the linked test definitions to own test binary. It should be done to make all the **shared** test cases always visible and available to instantiate by other plugins. @@ -67,3 +67,8 @@ This is OpenVINO Inference Engine testing framework. OpenVINO Inference Engine t separate utilities by domains. > **NOTE**: All the utilities libraries are added to the developer package and available for closed source development. + + ## See also + * [OpenVINO™ README](../../README.md) + * [OpenVINO Core Components](../README.md) + * [Developer documentation](../../docs/dev/index.md) diff --git a/src/tests/functional/plugin/conformance/test_runner/README.md b/src/tests/functional/plugin/conformance/test_runner/README.md index f59da1a19ec..4844f4e9b14 100644 --- a/src/tests/functional/plugin/conformance/test_runner/README.md +++ b/src/tests/functional/plugin/conformance/test_runner/README.md @@ -3,7 +3,7 @@ ## Description Conformance suites certify plugin functionality using a set of tests with plugin specificity independent parameters. There are two types of conformance validation. 
-### `API Conformance` +### API Conformance The suite checks the following OpenVINO API entities in a plugin implementation: * plugin * compiled model (executable network) @@ -18,7 +18,7 @@ A result of the `apiConformanceTests` run is two xml files: `report_api.xml` and -### `Opset Conformance` +### Opset Conformance The suite validates an OpenVINO operation plugin implementation, using simple single operation graphs (Conformance IR) taken from models. The plugin inference output is compared with the reference. The suite contains: @@ -165,5 +165,7 @@ python3 summarize.py --xml /opt/repo/infrastructure-master/thirdparty/gtest-para The report contains statistics based on conformance results and filter fields at the top of the page. - - +## See also + * [OpenVINO™ README](../../../../../../README.md) + * [OpenVINO Core Components](../../../../../README.md) + * [Developer documentation](../../../../../../docs/dev/index.md) \ No newline at end of file