[GPU] Set TBB affinity using CPUStreamsExecutor (#7738)

Paul Youngsoo Ahn 2021-11-03 17:50:55 +09:00 committed by GitHub
parent 65c3b4c357
commit ece45630f0
20 changed files with 264 additions and 208 deletions

View File

@ -105,7 +105,8 @@ When specifying key values as raw strings (that is, when using Python API), omit
| `KEY_CACHE_DIR` | `"<cache_dir>"` | `""` | Specifies a directory where compiled OCL binaries can be cached. The first model loading generates the cache, and all subsequent LoadNetwork calls use the precompiled kernels, which significantly improves load time. If empty, caching is disabled |
| `KEY_PERF_COUNT` | `YES` / `NO` | `NO` | Collect performance counters during inference |
| `KEY_CONFIG_FILE` | `"<file1> [<file2> ...]"` | `""` | Load custom layer configuration files |
| `KEY_GPU_PLUGIN_`<br>`PRIORITY` | `<0-3>` | `0` | OpenCL queue priority (before usage, make sure your OpenCL driver supports appropriate extension)<br> Higher value means higher priority for OpenCL queue. 0 disables the setting. |
| `KEY_GPU_MODEL_`<br>`PRIORITY` | `GPU_MODEL_PRIORITY_<HIGH\|LOW>` <br/> `GPU_QUEUE_PRIORITY_<LOW\|HIGH\|MED\|DEFAULT>` <br/> `GPU_HOST_TASK_PRIORITY_<HIGH\|LOW\|ANY>` | `GPU_QUEUE_PRIORITY_DEFAULT` <br/> `\|GPU_HOST_TASK_PRIORITY_ANY` | Specifies two types of priority: host task priority and OpenCL queue priority.<br/><br/>Host task priority is specified by `GPU_HOST_TASK_PRIORITY_[level]` with three levels: `HIGH`, `LOW`, and `ANY`. Note that `HIGH` and `LOW` take effect only when TBB is used to multithread the LoadNetwork workload and the host processor is a hybrid type. On hybrid processors, a task with `HIGH` priority is given preference in core-type selection, and vice versa. If the host processor is not hybrid or multithreading does not use TBB, the value is set to `ANY`, which is the default.<br/><br/>OpenCL queue priority is specified by `GPU_QUEUE_PRIORITY_[level]` with four levels: `HIGH`, `MED`, `LOW`, and `DEFAULT`, where the default value is `DEFAULT`. Before usage, make sure your OpenCL driver supports the appropriate extension.<br/><br/>`GPU_MODEL_PRIORITY` can be set as a combination of the two priority types, such as<br/>-`GPU_QUEUE_PRIORITY_HIGH\|GPU_HOST_TASK_PRIORITY_HIGH` or<br/>-`GPU_QUEUE_PRIORITY_LOW\|GPU_HOST_TASK_PRIORITY_HIGH`.<br/><br/>It can also be set at a more abstract level, `GPU_MODEL_PRIORITY_[level]`, which represents a combination of the two priorities as follows:<br/>-`GPU_MODEL_PRIORITY_HIGH` : `GPU_QUEUE_PRIORITY_HIGH\|GPU_HOST_TASK_PRIORITY_HIGH`<br/>-`GPU_MODEL_PRIORITY_LOW` : `GPU_QUEUE_PRIORITY_LOW\|GPU_HOST_TASK_PRIORITY_LOW`<br/><br/>The default of `KEY_GPU_MODEL_PRIORITY` is `GPU_QUEUE_PRIORITY_DEFAULT\|GPU_HOST_TASK_PRIORITY_ANY`.<br> |
| `KEY_GPU_PLUGIN_`<br>`PRIORITY` | `<0-3>` | `0` | OpenCL queue priority (before usage, make sure your OpenCL driver supports appropriate extension)<br> Higher value means higher priority for OpenCL queue. 0 disables the setting. **Deprecated**. Please use KEY_GPU_MODEL_PRIORITY |
| `KEY_GPU_PLUGIN_`<br>`THROTTLE` | `<0-3>` | `0` | OpenCL queue throttling (before usage, make sure your OpenCL driver supports appropriate extension)<br> Lower value means lower driver thread priority and longer sleep time for it. 0 disables the setting. |
| `KEY_CLDNN_ENABLE_`<br>`FP16_FOR_QUANTIZED_`<br>`MODELS` | `YES` / `NO` | `YES` | Allows using FP16+INT8 mixed precision mode, so non-quantized parts of a model will be executed in FP16 precision for FP16 IR. Does not affect quantized FP32 IRs |
| `KEY_GPU_NV12_`<br>`TWO_INPUTS` | `YES` / `NO` | `NO` | Controls preprocessing logic for NV12 input. If set to YES, the device graph will expect the user to set a biplanar NV12 blob as input, which will be passed directly to the device execution graph. Otherwise, preprocessing via G-API is used to convert NV12->BGR, so the GPU graph has to expect a single input |
@ -113,7 +114,7 @@ When specifying key values as raw strings (that is, when using Python API), omit
| `KEY_EXCLUSIVE_ASYNC_`<br>`REQUESTS` | `YES` / `NO` | `NO` | Forces async requests (also from different executable networks) to execute serially.|
| `KEY_GPU_MAX_NUM_`<br>`THREADS` | `integer value` | `maximum # of HW threads available in host environment` | Specifies the number of CPU threads that can be used by the GPU engine, e.g., for JIT compilation of GPU kernels or for CPU kernel processing within the GPU plugin. The default value is the maximum number of threads available in the host environment, to minimize LoadNetwork time, where GPU kernel build time occupies a large portion. Note that if the specified value is larger than the maximum number of available threads or less than zero, it is set to the maximum number of available threads. A value smaller than the number of available HW threads can be specified according to the usage scenario, e.g., when the user wants to keep more CPU threads free while the GPU plugin is running. Note that setting a lower value affects not only the network loading time but also the CPU layers of GPU networks that are optimized with multi-threading. |
| `KEY_GPU_ENABLE_`<br>`LOOP_UNROLLING` | `YES` / `NO` | `YES` | Enables unrolling of recurrent layers, such as TensorIterator or Loop, with a fixed iteration count. It is turned on by default. Turning this key on achieves better inference performance for loops with a small iteration count (less than 16, as a rule of thumb). Turning it off achieves better performance, in both graph loading time and inference time, for loops with a large iteration count (greater than 16). Note that turning this key on increases the graph loading time in proportion to the iteration count; thus, it should be turned off if graph loading time is the most important optimization target. |
| `KEY_CLDNN_PLUGIN_`<br>`PRIORITY` | `<0-3>` | `0` | OpenCL queue priority (before usage, make sure your OpenCL driver supports appropriate extension)<br> Higher value means higher priority for OpenCL queue. 0 disables the setting. **Deprecated**. Please use KEY_GPU_PLUGIN_PRIORITY |
| `KEY_CLDNN_PLUGIN_`<br>`PRIORITY` | `<0-3>` | `0` | OpenCL queue priority (before usage, make sure your OpenCL driver supports appropriate extension)<br> Higher value means higher priority for OpenCL queue. 0 disables the setting. **Deprecated**. Please use KEY_GPU_MODEL_PRIORITY |
| `KEY_CLDNN_PLUGIN_`<br>`THROTTLE` | `<0-3>` | `0` | OpenCL queue throttling (before usage, make sure your OpenCL driver supports appropriate extension)<br> Lower value means lower driver thread priority and longer sleep time for it. 0 disables the setting. **Deprecated**. Please use KEY_GPU_PLUGIN_THROTTLE |
| `KEY_CLDNN_GRAPH_`<br>`DUMPS_DIR` | `"<dump_dir>"` | `""` | clDNN graph optimizer stages dump output directory (in GraphViz format). **Deprecated**. Will be removed in the next release |
| `KEY_CLDNN_SOURCES_`<br>`DUMPS_DIR` | `"<dump_dir>"` | `""` | Final optimized clDNN OpenCL sources dump output directory. **Deprecated**. Will be removed in the next release |
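
For context (not part of this diff), the keys above are passed to the GPU plugin as plain string pairs at LoadNetwork time. Below is a minimal sketch assuming the 2021.x Inference Engine API; the model path, thread count, and header locations are illustrative assumptions.

```cpp
#include <map>
#include <string>
#include <inference_engine.hpp>
#include <gpu/gpu_config.hpp>   // assumed location of GPUConfigParams keys

int main() {
    InferenceEngine::Core core;
    // "model.xml" is a placeholder IR path.
    auto network = core.ReadNetwork("model.xml");

    // Model-level priority HIGH maps to GPU_QUEUE_PRIORITY_HIGH | GPU_HOST_TASK_PRIORITY_HIGH.
    std::map<std::string, std::string> config = {
        {InferenceEngine::GPUConfigParams::KEY_GPU_MODEL_PRIORITY,
         InferenceEngine::GPUConfigParams::GPU_MODEL_PRIORITY_HIGH},
        {InferenceEngine::GPUConfigParams::KEY_GPU_MAX_NUM_THREADS, "4"},
    };
    auto exec_network = core.LoadNetwork(network, "GPU", config);
    return 0;
}
```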

View File

@ -11,6 +11,7 @@
#include "ie_api.h"
#include "file_utils.h"
#include "cldnn_itt.h"
#include <ie_system_conf.h>
#include <thread>
#ifdef _WIN32
@ -40,6 +41,20 @@ static void createDirectory(std::string _path) {
}
}
static int getNumberOfCores(const IStreamsExecutor::Config::PreferredCoreType core_type) {
const auto total_num_cores = getNumberOfLogicalCPUCores();
const auto total_num_big_cores = getNumberOfLogicalCPUCores(true);
const auto total_num_little_cores = total_num_cores - total_num_big_cores;
int num_cores = total_num_cores;
if (core_type == IStreamsExecutor::Config::BIG) {
num_cores = total_num_big_cores;
} else if (core_type == IStreamsExecutor::Config::LITTLE) {
num_cores = total_num_little_cores;
}
return num_cores;
}
IE_SUPPRESS_DEPRECATED_START
void Config::UpdateFromMap(const std::map<std::string, std::string>& configMap) {
OV_ITT_SCOPED_TASK(itt::domains::CLDNNPlugin, "Config::UpdateFromMap");
@ -97,7 +112,63 @@ void Config::UpdateFromMap(const std::map<std::string, std::string>& configMap)
default:
IE_THROW(ParameterMismatch) << "Unsupported queue priority value: " << uVal;
}
} else if (key.compare(GPUConfigParams::KEY_GPU_MODEL_PRIORITY) == 0) {
bool found_matched_value = false;
if (val.find(GPUConfigParams::GPU_MODEL_PRIORITY_HIGH) != std::string::npos) {
queuePriority = cldnn::priority_mode_types::high;
task_exec_config._threadPreferredCoreType = IStreamsExecutor::Config::BIG;
found_matched_value = true;
} else if (val.find(GPUConfigParams::GPU_MODEL_PRIORITY_LOW) != std::string::npos) {
queuePriority = cldnn::priority_mode_types::low;
task_exec_config._threadPreferredCoreType = IStreamsExecutor::Config::LITTLE;
found_matched_value = true;
} else {
if (val.find(GPUConfigParams::GPU_QUEUE_PRIORITY_HIGH) != std::string::npos) {
queuePriority = cldnn::priority_mode_types::high;
found_matched_value = true;
} else if (val.find(GPUConfigParams::GPU_QUEUE_PRIORITY_MED) != std::string::npos) {
queuePriority = cldnn::priority_mode_types::med;
found_matched_value = true;
} else if (val.find(GPUConfigParams::GPU_QUEUE_PRIORITY_LOW) != std::string::npos) {
queuePriority = cldnn::priority_mode_types::low;
found_matched_value = true;
} else if (val.find(GPUConfigParams::GPU_QUEUE_PRIORITY_DEFAULT) != std::string::npos) {
queuePriority = cldnn::priority_mode_types::disabled;
found_matched_value = true;
} else { // default is disabled
queuePriority = cldnn::priority_mode_types::disabled;
}
if (val.find(GPUConfigParams::GPU_HOST_TASK_PRIORITY_HIGH) != std::string::npos) {
task_exec_config._threadPreferredCoreType = IStreamsExecutor::Config::BIG;
found_matched_value = true;
} else if (val.find(GPUConfigParams::GPU_HOST_TASK_PRIORITY_LOW) != std::string::npos) {
task_exec_config._threadPreferredCoreType = IStreamsExecutor::Config::LITTLE;
found_matched_value = true;
} else if (val.find(GPUConfigParams::GPU_HOST_TASK_PRIORITY_ANY) != std::string::npos) {
task_exec_config._threadPreferredCoreType = IStreamsExecutor::Config::ANY;
found_matched_value = true;
} else { // default is any
task_exec_config._threadPreferredCoreType = IStreamsExecutor::Config::ANY;
}
}
if (!found_matched_value) {
IE_THROW() << "Not found appropriate value for property key " << GPUConfigParams::KEY_GPU_PLUGIN_PRIORITY
<< ".\n Expected Plugin priority such as GPU_PLUGIN_PRIORITY_HIGH / GPU_PLUGIN_PRIORITY_LOW or\n"
<< " Combination of queue priority(HIGH, MED, LOW, and DISABLED) and host task priority(HIGH, LOW, and ANY)"
<< " such as GPU_QUEUE_PRIORITY_HIGH | GPU_HOST_TASK_PRIORITY_HIGH";
}
if (getAvailableCoresTypes().size() > 1) {
if (task_exec_config._threadPreferredCoreType == IStreamsExecutor::Config::BIG
|| task_exec_config._threadPreferredCoreType == IStreamsExecutor::Config::LITTLE) {
task_exec_config._streams = std::min(task_exec_config._streams,
getNumberOfCores(task_exec_config._threadPreferredCoreType));
}
} else {
task_exec_config._threadPreferredCoreType = IStreamsExecutor::Config::ANY;
task_exec_config._streams = std::min(task_exec_config._streams,
static_cast<int>(std::thread::hardware_concurrency()));
}
} else if (key.compare(GPUConfigParams::KEY_GPU_PLUGIN_THROTTLE) == 0 ||
key.compare(CLDNNConfigParams::KEY_CLDNN_PLUGIN_THROTTLE) == 0) {
std::stringstream ss(val);
@ -233,10 +304,9 @@ void Config::UpdateFromMap(const std::map<std::string, std::string>& configMap)
try {
int val_i = std::stoi(val);
if (val_i <= 0 || val_i > max_threads) {
n_threads = max_threads;
} else {
n_threads = val_i;
val_i = max_threads;
}
task_exec_config._streams = std::min(task_exec_config._streams, val_i);
} catch (const std::exception&) {
IE_THROW() << "Wrong value for property key " << GPUConfigParams::KEY_GPU_MAX_NUM_THREADS << ": " << val
<< "\nSpecify the number of threads use for build as an integer."
@ -298,6 +368,28 @@ void Config::adjustKeyMapValues() {
else
key_config_map[CLDNNConfigParams::KEY_CLDNN_ENABLE_FP16_FOR_QUANTIZED_MODELS] = PluginConfigParams::NO;
{
if (queuePriority == cldnn::priority_mode_types::high && task_exec_config._threadPreferredCoreType == IStreamsExecutor::Config::BIG) {
key_config_map[GPUConfigParams::KEY_GPU_MODEL_PRIORITY] = GPUConfigParams::GPU_MODEL_PRIORITY_HIGH;
} else if (queuePriority == cldnn::priority_mode_types::low && task_exec_config._threadPreferredCoreType == IStreamsExecutor::Config::LITTLE) {
key_config_map[GPUConfigParams::KEY_GPU_MODEL_PRIORITY] = GPUConfigParams::GPU_MODEL_PRIORITY_LOW;
} else {
std::string val_plugin_priority;
switch (queuePriority) {
case cldnn::priority_mode_types::low: val_plugin_priority = GPUConfigParams::GPU_QUEUE_PRIORITY_LOW; break;
case cldnn::priority_mode_types::med: val_plugin_priority = GPUConfigParams::GPU_QUEUE_PRIORITY_MED; break;
case cldnn::priority_mode_types::high: val_plugin_priority = GPUConfigParams::GPU_QUEUE_PRIORITY_HIGH; break;
default: val_plugin_priority = GPUConfigParams::GPU_QUEUE_PRIORITY_DEFAULT; break;
}
val_plugin_priority += "|";
switch (task_exec_config._threadPreferredCoreType) {
case IStreamsExecutor::Config::LITTLE: val_plugin_priority += GPUConfigParams::GPU_HOST_TASK_PRIORITY_LOW; break;
case IStreamsExecutor::Config::BIG: val_plugin_priority += GPUConfigParams::GPU_HOST_TASK_PRIORITY_HIGH; break;
case IStreamsExecutor::Config::ANY:default: val_plugin_priority += GPUConfigParams::GPU_HOST_TASK_PRIORITY_ANY; break;
}
key_config_map[GPUConfigParams::KEY_GPU_MODEL_PRIORITY] = val_plugin_priority;
}
}
{
std::string qp = "0";
switch (queuePriority) {
@ -340,7 +432,7 @@ void Config::adjustKeyMapValues() {
key_config_map[PluginConfigParams::KEY_GPU_THROUGHPUT_STREAMS] = std::to_string(throughput_streams);
key_config_map[PluginConfigParams::KEY_DEVICE_ID] = device_id;
key_config_map[PluginConfigParams::KEY_CONFIG_FILE] = "";
key_config_map[GPUConfigParams::KEY_GPU_MAX_NUM_THREADS] = std::to_string(n_threads);
key_config_map[GPUConfigParams::KEY_GPU_MAX_NUM_THREADS] = std::to_string(task_exec_config._streams);
if (enable_loop_unrolling)
key_config_map[GPUConfigParams::KEY_GPU_ENABLE_LOOP_UNROLLING] = PluginConfigParams::YES;

View File

@ -10,6 +10,7 @@
#include "cldnn_custom_layer.h"
#include <ie_performance_hints.hpp>
#include <cldnn/graph/network.hpp>
#include <threading/ie_cpu_streams_executor.hpp>
namespace CLDNNPlugin {
@ -32,7 +33,14 @@ struct Config {
graph_dumps_dir(""),
sources_dumps_dir(""),
kernels_cache_dir(""),
n_threads(std::max(static_cast<unsigned int>(1), std::thread::hardware_concurrency())),
task_exec_config({"GPU plugin internal task executor", // name
std::max(1, static_cast<int>(std::thread::hardware_concurrency())), // # of streams
1, // # of threads per streams
InferenceEngine::IStreamsExecutor::ThreadBindingType::HYBRID_AWARE, // thread binding type
1, // thread binding step
0, // thread binding offset
1, // # of threads
InferenceEngine::IStreamsExecutor::Config::ANY}), // preferred core type
enable_loop_unrolling(true) {
adjustKeyMapValues();
}
@ -58,7 +66,8 @@ struct Config {
std::string graph_dumps_dir;
std::string sources_dumps_dir;
std::string kernels_cache_dir;
size_t n_threads;
InferenceEngine::IStreamsExecutor::Config task_exec_config;
bool enable_loop_unrolling;
std::map<std::string, std::string> key_config_map;

View File

@ -240,7 +240,8 @@ IExecutableNetworkInternal::Ptr clDNNEngine::LoadExeNetworkImpl(const InferenceE
context_config.tuningConfig.cache_file_path == current_config.tuningConfig.cache_file_path &&
context_config.kernels_cache_dir == current_config.kernels_cache_dir &&
context_config.device_id == current_config.device_id &&
context_config.n_threads == current_config.n_threads &&
context_config.task_exec_config._streams == current_config.task_exec_config._streams &&
context_config.task_exec_config._threadPreferredCoreType == current_config.task_exec_config._threadPreferredCoreType &&
context_config.enable_loop_unrolling == current_config.enable_loop_unrolling;
};

View File

@ -251,16 +251,19 @@ CLDNNExecutionContextImpl::CLDNNExecutionContextImpl(const std::shared_ptr<IInfe
queue_type = cldnn::queue_types::out_of_order;
}
ITaskExecutor::Ptr task_executor = std::make_shared<CPUStreamsExecutor>(m_config.task_exec_config);
bool use_unified_shared_memory = true;
m_engine = cldnn::engine::create(engine_type, runtime_type, dev, cldnn::engine_configuration(enable_profiling,
queue_type,
m_config.sources_dumps_dir,
m_config.queuePriority,
m_config.queueThrottle,
m_config.memory_pool_on,
use_unified_shared_memory,
m_config.kernels_cache_dir,
m_config.n_threads));
m_engine = cldnn::engine::create(engine_type, runtime_type, dev,
cldnn::engine_configuration(enable_profiling,
queue_type,
m_config.sources_dumps_dir,
m_config.queuePriority,
m_config.queueThrottle,
m_config.memory_pool_on,
use_unified_shared_memory,
m_config.kernels_cache_dir,
m_config.throughput_streams), task_executor);
}
}

View File

@ -71,6 +71,48 @@ namespace GPUConfigParams {
#define DECLARE_GPU_CONFIG_KEY(name) DECLARE_CONFIG_KEY(GPU_##name)
#define DECLARE_GPU_CONFIG_VALUE(name) DECLARE_CONFIG_VALUE(GPU_##name)
/**
* @brief This key instructs the GPU plugin to use two priorities for GPU configuration:
* OpenCL queue priority hint, as defined in https://www.khronos.org/registry/OpenCL/specs/opencl-2.1-extensions.pdf.
* It has four levels: HIGH, MED, LOW, and DEFAULT; the default is DEFAULT.
* Host task priority, which sets the CPU core type of the TBB affinity used in LoadNetwork.
* It has three levels: HIGH, LOW, and ANY; the default is ANY.
* It only takes effect on hybrid CPUs; if the device is not a hybrid CPU, it is set to the default.
*
* There are two ways to set this key: a model-level setting and a queue/host-task-level setting.
* The model-level setting is a predefined combination of OpenCL queue priority and host task priority.
* It provides only two levels: HIGH and LOW.
* The queue/host-task-level setting is a combination of an OpenCL queue priority and a host task priority,
* such as GPU_QUEUE_PRIORITY_HIGH|GPU_HOST_TASK_PRIORITY_HIGH.
* With this form, each level of the OpenCL queue priority and the host task priority can be set directly.
*
* The default value of GPU_MODEL_PRIORITY is "GPU_QUEUE_PRIORITY_DEFAULT|GPU_HOST_TASK_PRIORITY_ANY".
* The detailed option values are as follows:
* Model priority
* GPUConfigParams::GPU_MODEL_PRIORITY_HIGH - GPU_QUEUE_PRIORITY_HIGH|GPU_HOST_TASK_PRIORITY_HIGH
* GPUConfigParams::GPU_MODEL_PRIORITY_LOW - GPU_QUEUE_PRIORITY_LOW|GPU_HOST_TASK_PRIORITY_LOW
* OpenCL queue priority
* GPUConfigParams::GPU_QUEUE_PRIORITY_HIGH - mapped to CL_QUEUE_PRIORITY_HIGH_KHR
* GPUConfigParams::GPU_QUEUE_PRIORITY_MED - mapped to CL_QUEUE_PRIORITY_MED_KHR
* GPUConfigParams::GPU_QUEUE_PRIORITY_LOW - mapped to CL_QUEUE_PRIORITY_LOW_KHR
* GPUConfigParams::GPU_QUEUE_PRIORITY_DEFAULT - queue priority property is not set in cl_queue_properties
* Host task priority
* GPUConfigParams::GPU_HOST_TASK_PRIORITY_HIGH - mapped to IStreamsExecutor::Config::BIG
* GPUConfigParams::GPU_HOST_TASK_PRIORITY_LOW - mapped to IStreamsExecutor::Config::LITTLE
* GPUConfigParams::GPU_HOST_TASK_PRIORITY_ANY - mapped to IStreamsExecutor::Config::ANY
*/
DECLARE_GPU_CONFIG_KEY(MODEL_PRIORITY);
DECLARE_GPU_CONFIG_VALUE(MODEL_PRIORITY_HIGH);
DECLARE_GPU_CONFIG_VALUE(MODEL_PRIORITY_LOW);
DECLARE_GPU_CONFIG_VALUE(QUEUE_PRIORITY_HIGH);
DECLARE_GPU_CONFIG_VALUE(QUEUE_PRIORITY_MED);
DECLARE_GPU_CONFIG_VALUE(QUEUE_PRIORITY_LOW);
DECLARE_GPU_CONFIG_VALUE(QUEUE_PRIORITY_DEFAULT);
DECLARE_GPU_CONFIG_VALUE(HOST_TASK_PRIORITY_HIGH);
DECLARE_GPU_CONFIG_VALUE(HOST_TASK_PRIORITY_LOW);
DECLARE_GPU_CONFIG_VALUE(HOST_TASK_PRIORITY_ANY);
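As a usage sketch (not part of this header), the fine-grained values declared above are combined into a single '|'-separated string, which Config::UpdateFromMap parses by substring matching; the same pattern appears in the test configurations later in this commit. The map variable name is illustrative.

```cpp
#include <map>
#include <string>
#include <gpu/gpu_config.hpp>   // assumed location of GPUConfigParams

// Combine a queue-level and a host-task-level value into one config entry.
std::map<std::string, std::string> gpu_config = {
    {InferenceEngine::GPUConfigParams::KEY_GPU_MODEL_PRIORITY,
     std::string(InferenceEngine::GPUConfigParams::GPU_QUEUE_PRIORITY_HIGH) + "|" +
     InferenceEngine::GPUConfigParams::GPU_HOST_TASK_PRIORITY_ANY}};
```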
/**
* @brief This key instructs the GPU plugin to use the OpenCL queue priority hint
* as defined in https://www.khronos.org/registry/OpenCL/specs/opencl-2.1-extensions.pdf

View File

@ -127,6 +127,10 @@ namespace {
{{InferenceEngine::GPUConfigParams::KEY_GPU_PLUGIN_THROTTLE, "1"}},
{{InferenceEngine::GPUConfigParams::KEY_GPU_PLUGIN_PRIORITY, "0"}},
{{InferenceEngine::GPUConfigParams::KEY_GPU_PLUGIN_PRIORITY, "1"}},
{{InferenceEngine::GPUConfigParams::KEY_GPU_MODEL_PRIORITY, InferenceEngine::GPUConfigParams::GPU_QUEUE_PRIORITY_HIGH
+ std::string("|") + InferenceEngine::GPUConfigParams::GPU_HOST_TASK_PRIORITY_ANY}},
{{InferenceEngine::GPUConfigParams::KEY_GPU_MODEL_PRIORITY, InferenceEngine::GPUConfigParams::GPU_QUEUE_PRIORITY_LOW
+ std::string("|") + InferenceEngine::GPUConfigParams::GPU_HOST_TASK_PRIORITY_ANY}},
{{InferenceEngine::GPUConfigParams::KEY_GPU_MAX_NUM_THREADS, "1"}},
{{InferenceEngine::GPUConfigParams::KEY_GPU_MAX_NUM_THREADS, "4"}},
{{InferenceEngine::GPUConfigParams::KEY_GPU_ENABLE_LOOP_UNROLLING, InferenceEngine::PluginConfigParams::YES}},

View File

@ -23,7 +23,6 @@ if (ENABLE_CLDNN)
else()
set(CLDNN__INCLUDE_TESTS OFF CACHE BOOL "" FORCE)
endif()
set(CLDNN_THREADING "${THREADING}" CACHE STRING "" FORCE)
add_subdirectory(clDNN)
endif()

View File

@ -12,15 +12,6 @@ project("${CLDNN__PROJ_NAME}")
# ====================================== HELPER CONSTANT VARIABLES =====================================
# ======================================================================================================
# ======================================================================================================
if(CLDNN_THREADING MATCHES "SEQ")
add_definitions(-DCLDNN_THREADING=CLDNN_THREADING_SEQ)
elseif(CLDNN_THREADING MATCHES "TBB")
add_definitions(-DCLDNN_THREADING=CLDNN_THREADING_TBB)
else()
add_definitions(-DCLDNN_THREADING=CLDNN_THREADING_THREADPOOL)
endif()
if (ENABLE_ONEDNN_FOR_GPU)
ExternalProject_Get_property(onednn_gpu_build SOURCE_DIR)
ExternalProject_Get_property(onednn_gpu_build BINARY_DIR)
@ -94,6 +85,7 @@ include_directories(
${CLDNN_UTILS__RAPIDJSON_INCDIRS}
"${CLDNN__CODEGEN_INCDIR}"
"${CLDNN__API_DIR}"
$<TARGET_PROPERTY:inference_engine_plugin_api,INTERFACE_INCLUDE_DIRECTORIES>
)
# =================================== Link targets and dependencies ====================================
@ -105,3 +97,5 @@ if(CLDNN__INCLUDE_TESTS)
endif()
add_subdirectory(kernel_selector)
target_link_libraries(${CLDNN_BUILD__PROJ__clDNN} PRIVATE inference_engine)

View File

@ -10,6 +10,7 @@
#include "memory_caps.hpp"
#include "memory_pool.hpp"
#include "layout.hpp"
#include <threading/ie_cpu_streams_executor.hpp>
#include <memory>
#include <set>
@ -17,10 +18,6 @@
#include <string>
#include <atomic>
#define CLDNN_THREADING_SEQ 0
#define CLDNN_THREADING_TBB 1
#define CLDNN_THREADING_THREADPOOL 2
#ifdef ENABLE_ONEDNN_FOR_GPU
#include <oneapi/dnnl/dnnl.hpp>
#endif
@ -135,29 +132,40 @@ public:
/// Returns onednn engine object which shares device and context with current engine
virtual dnnl::engine& get_onednn_engine() const = 0;
#endif
/// Return GPU plugin internal task executor
const InferenceEngine::ITaskExecutor::Ptr get_task_executor();
/// Factory method which creates engine object with impl configured by @p engine_type
/// @param engine_type requested engine type
/// @param task_executor GPU plugin internal task executor
/// @param runtime_type requested execution runtime for the engine. @note some runtime/engine types configurations might be unsupported
/// @param device specifies the device which the engine is created for
/// @param configuration options for the engine
static std::shared_ptr<cldnn::engine> create(engine_types engine_type,
runtime_types runtime_type,
const device::ptr device,
const engine_configuration& configuration = engine_configuration());
const engine_configuration& configuration = engine_configuration(),
const InferenceEngine::ITaskExecutor::Ptr task_executor =
std::make_shared<InferenceEngine::CPUStreamsExecutor>(
InferenceEngine::CPUStreamsExecutor::Config()));
/// Factory method which creates engine object with impl configured by @p engine_type
/// @param engine_type requested engine type
/// @param runtime_type requested execution runtime for the engine. @note some runtime/engine types configurations might be unsupported
/// @param task_executor GPU plugin internal task executor
/// @param configuration options for the engine
/// @note engine is created for the first device returned by devices query
static std::shared_ptr<cldnn::engine> create(engine_types engine_type,
runtime_types runtime_type,
const engine_configuration& configuration = engine_configuration());
const engine_configuration& configuration = engine_configuration(),
const InferenceEngine::ITaskExecutor::Ptr task_executor =
std::make_shared<InferenceEngine::CPUStreamsExecutor>(
InferenceEngine::CPUStreamsExecutor::Config()));
protected:
/// Create engine for given @p device and @p configuration
engine(const device::ptr device, const engine_configuration& configuration);
engine(const device::ptr device, const engine_configuration& configuration, const InferenceEngine::ITaskExecutor::Ptr task_executor);
const InferenceEngine::ITaskExecutor::Ptr _task_executor;
const device::ptr _device;
engine_configuration _configuration;
mutable std::mutex _mutex;

View File

@ -9,6 +9,7 @@
#include <string>
#include <stdexcept>
#include <thread>
#include <threading/ie_cpu_streams_executor.hpp>
namespace cldnn {
@ -66,8 +67,8 @@ struct engine_configuration {
///< (switched off for drivers older than NEO).
bool use_unified_shared_memory; ///< Enables USM usage
const std::string kernels_cache_path; ///< Path to compiled kernels cache
uint16_t n_threads; ///< Max number of host threads used in gpu plugin
uint16_t n_streams; ///< Number of queues executed in parallel
uint16_t throughput_streams; ///< Number of queues/streams executed in parallel by GPU plugin
const std::string tuning_cache_path; ///< Path to tuning kernel cache
/// @brief Constructs engine configuration with specified options.
@ -80,7 +81,7 @@ struct engine_configuration {
/// @param use_unified_shared_memory If this option it true and device supports USM, then engine will use USM for all memory allocations
/// @param kernels_cache_path Path to existing directory where plugin can cache compiled kernels
/// @param n_threads Max number of host threads used in gpu plugin
/// @param n_streams Number of queues executed in parallel
/// @param throughput_streams Number of queues/streams executed in parallel by GPU plugin
/// @param tuning_cache_path Path to tuning kernel cache
engine_configuration(
bool enable_profiling = false,
@ -91,8 +92,7 @@ struct engine_configuration {
bool use_memory_pool = true,
bool use_unified_shared_memory = true,
const std::string& kernels_cache_path = "",
uint16_t n_threads = std::max(static_cast<uint16_t>(std::thread::hardware_concurrency()), static_cast<uint16_t>(1)),
uint16_t n_streams = 1,
uint16_t throughput_streams = 1,
const std::string& tuning_cache_path = "cache.json")
: enable_profiling(enable_profiling)
, queue_type(queue_type)
@ -102,8 +102,7 @@ struct engine_configuration {
, use_memory_pool(use_memory_pool)
, use_unified_shared_memory(use_unified_shared_memory)
, kernels_cache_path(kernels_cache_path)
, n_threads(n_threads)
, n_streams(n_streams)
, throughput_streams(throughput_streams)
, tuning_cache_path(tuning_cache_path) { }
};

View File

@ -37,6 +37,7 @@ source_group("${__CLDNN_Label__main}" FILES ${__CLDNN_Sources__main}
include_directories(
"${CMAKE_CURRENT_SOURCE_DIR}/include"
"${__CLDNN_Directory__main}"
$<TARGET_PROPERTY:inference_engine_plugin_api,INTERFACE_INCLUDE_DIRECTORIES>
)
# =================================== Link targets and dependencies ====================================
@ -78,4 +79,5 @@ elseif((NOT ANDROID) AND (UNIX))
target_link_libraries("${CLDNN_BUILD__PROJ}" PRIVATE pthread)
endif()
target_link_libraries("${CLDNN_BUILD__PROJ}" PRIVATE inference_engine)
# ======================================================================================================

View File

@ -19,9 +19,10 @@
namespace cldnn {
engine::engine(const device::ptr device, const engine_configuration& configuration)
engine::engine(const device::ptr device, const engine_configuration& configuration, const InferenceEngine::ITaskExecutor::Ptr task_executor)
: _device(device)
, _configuration(configuration) {}
, _configuration(configuration)
, _task_executor(task_executor) {}
device_info engine::get_device_info() const {
return _device->get_info();
@ -183,23 +184,29 @@ void engine::subtract_memory_used(size_t bytes, allocation_type type) {
}
}
const InferenceEngine::ITaskExecutor::Ptr engine::get_task_executor() {
return _task_executor;
}
std::shared_ptr<cldnn::engine> engine::create(engine_types engine_type,
runtime_types runtime_type,
const device::ptr device,
const engine_configuration& configuration) {
const engine_configuration& configuration,
const InferenceEngine::ITaskExecutor::Ptr task_executor) {
switch (engine_type) {
case engine_types::ocl: return ocl::create_ocl_engine(device, runtime_type, configuration);
case engine_types::ocl: return ocl::create_ocl_engine(device, runtime_type, configuration, task_executor);
default: throw std::runtime_error("Invalid engine type");
}
}
std::shared_ptr<cldnn::engine> engine::create(engine_types engine_type,
runtime_types runtime_type,
const engine_configuration& configuration) {
const engine_configuration& configuration,
const InferenceEngine::ITaskExecutor::Ptr task_executor) {
device_query query(engine_type, runtime_type);
device::ptr default_device = query.get_available_devices().begin()->second;
return engine::create(engine_type, runtime_type, default_device, configuration);
return engine::create(engine_type, runtime_type, default_device, configuration, task_executor);
}
} // namespace cldnn

View File

@ -17,15 +17,6 @@
#include <utility>
#include "cldnn_itt.hpp"
#if (CLDNN_THREADING == CLDNN_THREADING_TBB)
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#elif(CLDNN_THREADING == CLDNN_THREADING_THREADPOOL)
#include <thread>
#include <future>
#include <queue>
#include <condition_variable>
#endif
#if defined(__unix__) && !defined(__ANDROID__)
#include <malloc.h>
#endif
@ -49,9 +40,6 @@
#include <Windows.h>
#endif
#if (CLDNN_THREADING != CLDNN_THREADING_SEQ)
#define DEFAULT_NUM_THREADS 2
#endif
namespace {
std::mutex cacheAccessMutex;
@ -418,51 +406,35 @@ void kernels_cache::build_all() {
std::unique_ptr<ocl::ocl_engine> _build_engine = nullptr;
if (_engine.type() == engine_types::ocl) {
_build_engine = std::unique_ptr<ocl::ocl_engine>(new ocl::ocl_engine(_engine.get_device(), runtime_types::ocl, _engine.configuration()));
_build_engine = std::unique_ptr<ocl::ocl_engine>(new ocl::ocl_engine(_engine.get_device(), runtime_types::ocl,
_engine.configuration(), _engine.get_task_executor()));
}
std::vector<batch_program> batches;
{
std::lock_guard<std::mutex> lock(_mutex);
get_program_source(_kernels_code, &batches);
#if (CLDNN_THREADING == CLDNN_THREADING_TBB)
int n_threads = _engine.configuration().n_threads;
arena = std::unique_ptr<tbb::task_arena>(new tbb::task_arena());
arena->initialize(n_threads);
#elif(CLDNN_THREADING == CLDNN_THREADING_THREADPOOL)
int n_threads = _engine.configuration().n_threads;
pool = std::unique_ptr<thread_pool>(new thread_pool(n_threads));
#endif
}
#if (CLDNN_THREADING == CLDNN_THREADING_TBB)
arena->execute([this, &_build_engine, &batches] {
tbb::parallel_for(tbb::blocked_range<size_t>(0, batches.size()), [this, &_build_engine, &batches](const tbb::blocked_range<size_t>& r) {
for (auto i = r.begin(); i != r.end(); ++i) {
build_batch(*_build_engine, batches[i]);
auto _task_executor = _engine.get_task_executor();
std::exception_ptr exception;
std::vector<InferenceEngine::Task> tasks;
for (int idx = 0; idx < batches.size(); idx++) {
auto& batch = batches[idx];
tasks.push_back([this, &_build_engine, batch, &exception] {
try {
build_batch(*_build_engine, batch);
} catch(...) {
exception = std::current_exception();
}
});
});
#elif(CLDNN_THREADING == CLDNN_THREADING_THREADPOOL)
std::vector<std::future<void>> builds;
for (size_t i = 0; i < batches.size(); ++i) {
builds.push_back(pool->enqueue([this, &_build_engine, &batches, i] () {
build_batch(*_build_engine, batches[i]);
}));
}
std::for_each(builds.begin(), builds.end(), [] (std::future<void>& f) { f.wait(); });
#else
// no parallel build
for (const auto& batch : batches) {
build_batch(*_build_engine, batch);
}
#endif
_task_executor->runAndWait(tasks);
tasks.clear();
{
std::lock_guard<std::mutex> lock(_mutex);
_kernels_code.clear();
_pending_compilation = false;
#if (CLDNN_THREADING == CLDNN_THREADING_TBB)
arena.reset();
#if defined(__unix__) && !defined(__ANDROID__)
// NOTE: On Linux, without malloc_trim, some of the memory used by compilation is not returned to the system even though it has been freed.
// (It is at least 500 MB when we perform parallel compilation)
@ -470,12 +442,6 @@ void kernels_cache::build_all() {
// Also, this is not happening in Windows.
// So, added malloc_trim for linux build until we figure out a better solution.
malloc_trim(0);
#endif
#elif(CLDNN_THREADING == CLDNN_THREADING_THREADPOOL)
pool.reset();
#if defined(__unix__) && !defined(__ANDROID__)
malloc_trim(0);
#endif
#endif
}
}

View File

@ -15,80 +15,9 @@
#include <string>
#include <set>
#if (CLDNN_THREADING == CLDNN_THREADING_TBB)
#include <tbb/task_arena.h>
#elif(CLDNN_THREADING == CLDNN_THREADING_THREADPOOL)
#include <queue>
#include <future>
#include <functional>
#include <condition_variable>
#endif
#include <threading/ie_cpu_streams_executor.hpp>
namespace cldnn {
#if (CLDNN_THREADING == CLDNN_THREADING_THREADPOOL)
class thread_pool {
public:
thread_pool(size_t num_threads) : _stop_pool(false) {
_workers.reserve(num_threads);
for (size_t i = 0; i < num_threads; ++i) {
_workers.emplace_back(std::thread(&thread_pool::worker_thread, this));
}
}
~thread_pool() {
{
std::lock_guard<std::mutex> lock(_q_m);
_stop_pool = true;
}
this->wait_all();
}
template <class F, class... Args>
std::future<typename std::result_of<F(Args...)>::type> enqueue(F&& f, Args&&... args) {
if (_stop_pool) {
throw std::runtime_error("Thread pool is stoped");
}
using return_type = typename std::result_of<F(Args...)>::type;
auto task = std::make_shared<std::packaged_task<return_type()>> (std::bind(std::forward<F>(f), std::forward<Args>(args)...));
std::future<return_type> result = task->get_future();
{
std::lock_guard<std::mutex> lock(_q_m);
_tasks.push([task]() {(*task)();});
}
_cv.notify_one();
return result;
}
void wait_all() {
_cv.notify_all();
for (auto& w : _workers) {
w.join();
}
}
private:
std::vector<std::thread> _workers;
std::queue<std::function<void()>> _tasks;
std::condition_variable _cv;
std::mutex _q_m;
bool _stop_pool;
void worker_thread() {
while (true) {
std::unique_lock<std::mutex> lock(this->_q_m);
_cv.wait(lock, [this]() { return (!this->_tasks.empty()) || (_stop_pool); });
if ((_stop_pool) && (this->_tasks.empty())) return;
auto task = std::move(_tasks.front());
this->_tasks.pop();
lock.unlock();
task();
}
}
};
#endif
class kernels_cache {
public:
using source_code = std::vector<std::string>;
@ -147,11 +76,6 @@ private:
kernels_code _kernels_code;
std::atomic<bool> _pending_compilation{false};
std::map<const std::string, kernel::ptr> _kernels;
#if (CLDNN_THREADING == CLDNN_THREADING_TBB)
std::unique_ptr<tbb::task_arena> arena;
#elif(CLDNN_THREADING == CLDNN_THREADING_THREADPOOL)
std::unique_ptr<thread_pool> pool;
#endif
std::vector<std::string> batch_header_str;
void get_program_source(const kernels_code& kernels_source_code, std::vector<batch_program>*) const;

View File

@ -39,8 +39,9 @@ namespace ocl {
ocl_error::ocl_error(cl::Error const& err)
: std::runtime_error(err.what() + std::string(", error code: ") + std::to_string(err.err())) {}
ocl_engine::ocl_engine(const device::ptr dev, runtime_types runtime_type, const engine_configuration& conf)
: engine(dev, conf) {
ocl_engine::ocl_engine(const device::ptr dev, runtime_types runtime_type,
const engine_configuration& conf, const InferenceEngine::ITaskExecutor::Ptr task_executor)
: engine(dev, conf, task_executor) {
if (runtime_type != runtime_types::ocl) {
throw std::runtime_error("Invalid runtime type specified for OCL engine. Only OCL runtime is supported");
}
@ -221,12 +222,14 @@ stream& ocl_engine::get_program_stream() const {
return *_program_stream;
}
std::shared_ptr<cldnn::engine> ocl_engine::create(const device::ptr device, runtime_types runtime_type, const engine_configuration& configuration) {
return std::make_shared<ocl::ocl_engine>(device, runtime_type, configuration);
std::shared_ptr<cldnn::engine> ocl_engine::create(const device::ptr device, runtime_types runtime_type,
const engine_configuration& configuration, const InferenceEngine::ITaskExecutor::Ptr task_executor) {
return std::make_shared<ocl::ocl_engine>(device, runtime_type, configuration, task_executor);
}
std::shared_ptr<cldnn::engine> create_ocl_engine(const device::ptr device, runtime_types runtime_type, const engine_configuration& configuration) {
return ocl_engine::create(device, runtime_type, configuration);
std::shared_ptr<cldnn::engine> create_ocl_engine(const device::ptr device, runtime_types runtime_type,
const engine_configuration& configuration, const InferenceEngine::ITaskExecutor::Ptr task_executor) {
return ocl_engine::create(device, runtime_type, configuration, task_executor);
}
} // namespace ocl

View File

@ -20,7 +20,7 @@ namespace ocl {
class ocl_engine : public engine {
public:
ocl_engine(const device::ptr dev, runtime_types runtime_type, const engine_configuration& conf);
ocl_engine(const device::ptr dev, runtime_types runtime_type, const engine_configuration& conf, const InferenceEngine::ITaskExecutor::Ptr task_executor);
engine_types type() const override { return engine_types::ocl; };
runtime_types runtime_type() const override { return runtime_types::ocl; };
@ -48,7 +48,8 @@ public:
dnnl::engine& get_onednn_engine() const override;
#endif
static std::shared_ptr<cldnn::engine> create(const device::ptr device, runtime_types runtime_type, const engine_configuration& configuration);
static std::shared_ptr<cldnn::engine> create(const device::ptr device, runtime_types runtime_type,
const engine_configuration& configuration, const InferenceEngine::ITaskExecutor::Ptr task_executor);
private:
std::string _extensions;

View File

@ -13,7 +13,8 @@ namespace ocl {
// Factory for ocl_engine creation. It's moved outside of ocl_engine class to avoid possible CL includes conflict
// between different engines in engine.cpp file
std::shared_ptr<cldnn::engine> create_ocl_engine(const device::ptr device, runtime_types runtime_type, const engine_configuration& configuration);
std::shared_ptr<cldnn::engine> create_ocl_engine(const device::ptr device, runtime_types runtime_type,
const engine_configuration& configuration, InferenceEngine::ITaskExecutor::Ptr task_executor);
} // namespace ocl
} // namespace cldnn

View File

@ -14,10 +14,7 @@
#include <cmath>
#include <iomanip>
#if (CLDNN_THREADING == CLDNN_THREADING_TBB)
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#endif
#include <threading/ie_cpu_streams_executor.hpp>
using namespace cldnn;
@ -31,28 +28,31 @@ void compile_graph::run(program& p) {
}
}
#if (CLDNN_THREADING == CLDNN_THREADING_TBB)
const auto n_threads = p.get_engine().get_device_info().supports_immad ? 1 : p.get_engine().configuration().n_threads;
auto arena = std::unique_ptr<tbb::task_arena>(new tbb::task_arena());
arena->initialize(n_threads);
arena->execute([this, &p] {
auto& proc_order = p.get_processing_order();
tbb::parallel_for(tbb::blocked_range<size_t>(0, proc_order.size()), [&proc_order, &p](const tbb::blocked_range<size_t>& r) {
for (auto i = r.begin(); i != r.end(); ++i) {
auto& node = *(std::next(proc_order.begin(), i));
node->set_unique_id(std::to_string(i));
if (!node->is_type<data>() && !(node->is_type<mutable_data>() && node->get_dependencies().empty())) {
node->selected_impl = node->type()->choose_impl(*node);
}
if (p.get_engine().get_device_info().supports_immad) {
for (auto& node : p.get_processing_order()) {
if (!node->is_type<data>() && !(node->is_type<mutable_data>() && node->get_dependencies().empty())) {
node->selected_impl = node->type()->choose_impl(*node);
}
});
});
arena.reset();
#else
for (auto& node : p.get_processing_order()) {
if (!node->is_type<data>() && !(node->is_type<mutable_data>() && node->get_dependencies().empty())) {
node->selected_impl = node->type()->choose_impl(*node);
}
} else {
auto task_executor = p.get_engine().get_task_executor();
auto& proc_order = p.get_processing_order();
std::vector<InferenceEngine::Task> tasks;
std::exception_ptr exception;
for (int idx = 0; idx < proc_order.size(); idx++) {
auto& node = *(std::next(proc_order.begin(), idx));
if (!node->is_type<data>() && !(node->is_type<mutable_data>() && node->get_dependencies().empty())) {
tasks.push_back([node, &exception] {
try {
node->selected_impl = node->type()->choose_impl(*node);
} catch(...) {
exception = std::current_exception();
}
});
}
}
task_executor->runAndWait(tasks);
tasks.clear();
}
#endif
}

View File

@ -1128,7 +1128,7 @@ impl_types layout_optimizer::get_preferred_impl_type(program_node& node, format
auto scoresTensor = convert_data_tensor(nms_node.input_scores().get_output_layout());
const size_t kBatchNum = scoresTensor.Batch().v;
const size_t kClassNum = scoresTensor.Feature().v;
const size_t kNStreams = static_cast<size_t>(node.get_program().get_engine().configuration().n_streams);
const size_t kNStreams = static_cast<size_t>(node.get_program().get_engine().configuration().throughput_streams);
const size_t kKeyValue = kBatchNum * std::min(kClassNum, static_cast<size_t>(8)) * kNStreams;
preferred_impl = (kKeyValue > 64) ? impl_types::ocl : impl_types::cpu;
} else if (node.is_type<reorder>()) {