Auto Batching impl (#7883)

* auto-batching POC squashed (all commits from auto-batch-2021.3 branch)

(cherry picked from commit d7742f2c747bc514a126cc9a4d5b99f0ff5cbbc7)

* applying/accommodating the API changes after rebase to the master

* replaying modified version of actual batch selection

* early experiments with model mem footprint

* changes from rebasing to the latest master

* experimenting with DG1 on the batch size selection, also collecting the mem footprint

* WIP: moving the auto-batching to the icore to let the MULTI/AUTO support that, ALLOW_AUTO_BATCHING as a conventional config key. still fails hot device swap

* quick-n-dirty batch footprint vs device total mem

* code style

* testing which models perform badly due to kernels and NOT (batched) footprint

* stub pipeline task to communicate the readiness rather than promise/future

* quick-n-dirty timeout impl

* explicit _completionTasks,reverting BA to use the timeout

* inputs outputs copies, works with AUTO and demo now

* accommodate the config per device-id, after rebase to the latest master

* allowing the auto-batching only with tput hint to let more conventional tests pass

* fix the premature timeout restarting via waiting for batch1 requests completion

* moved the batched request starting (along with input copies) to the dedicated thread

* [IE CLDNN] Disable bs_fs_yx_bsv16_fsv16 format for int8 convolution

* code style

* increasing the timeout to test the ssd_* models perf (timeout?) issues

* reducing number of output stuff in BA to avoid bloating the logs in experiments

* more aggressive batching for experiments, not limited to 32 and also 4 as a min

* more accurate timeout debugging info

* getting the reqs limitation from the plugin SetConfig as well

* refactor the reshape logic a bit to accommodate CPU for batching, also added remote context

* let the benchmark_app consume specific batch values for the auto-batching such as BATCH:GPU(4)

* auto-batching functional test (with results check vs ref) and GPU instance for that

* fixed arithmetic on blobs ptrs

* clang

* handling possible batched network failure

* BATCH as the constants device name in test

* ENABLE_BATCH

* func tests for CPU, also DetectionOutput hetero tests (CPU and GPU)

* DetectionOutput hetero test for the CPU

* reenabling the Auto-Batching in the AUTO

* auto-batching device enabled in the test

* fixed the DO test

* improve the loading loop logic

* brushed the config keys

* allow hetero code-path for explicit device name like BATCH:GPU(4), used in the hetero code-path tests

* fix the test after refactoring

* clang

* moving ThreadSafeQueue to the ie_parallel, as it is re-used in the AUTO/MULTI and BATCH now

* auto-batching hetero test (subgraph with DetectionOutput)

* fixed minor changes that were result of experiments with impl

* code-style

* brushing, disabling CPU's HETERO tests until planned activity for 22.2

* removing home-baked MAX_BATCH_SIZE and switching to the official impl by GPU team

* remote blobs tests for the auto-batching (old API)

* brushed names a bit

* CreateContext and LoadNetwork with context for the Auto-Batching plus remote-blobs tests

* fixed the ieUnitTests by adding a CreateContext stub to the MockICore

* clang

* improved remote-blobs tests

* revert the BA back from experiments with AB + device_use_mem

* conformance tests for BATCH, also batch size 1 is default for BATCH:DEVICE

* remote blobs 2.0 tests, issue with context having the orig device name

* debugging DG1 perf drop (presumably due to non-fitting the device-mem)

* disabling WA with batch/=2 for excessive mem footprint, leaving only streams 2

* remote blobs 2.0 tests for different tensor sharing types

* converting assert to throw to accommodate legacy API where the lock() was possible to be called

* revert the timeout back to avoid mixing the studies, fixed the footprint calc

* reverting to estimating the max batch by extrapolating from batch1 size

* more conservative footprint estimation (with batch1), graceful batch 1 handling without duplication

* even more graceful batch 1 handling without duplication

* WA for MAX_BATCH_SIZE failure, removing batch4 as a min for the auto-batching

* AutoBatchPlugin -> ov_auto_batch_plugin

* WA for gcc 4.8

* clang

* fix misprint

* fixed errors resulting from the recent OV Variant-to-Any transition

* skip auto-batching for already-batched networks

* AUTO_BATCH_TIMEOUT and tests

* GPU-specific L3

* switched to pure config, also improved ALLOW_AUTO_BATCHING config key handling logic

* debugging device info

* enabling the config tests for the GPU and fixing the Auto-batching tests to pass

* making the default cache size (when the driver is not recognized) more aggressive, to accommodate recent HW with old drivers

* skip auto-batching for RNNs and the like (e.g. single CHW input)

* fixed fallback to the batch1 and moved HETERO path under condition to avoid bloating

* brushing

* Auto plugin GetMetric support gpu auto-batch

Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>

* add test case

Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>

* add comments on test

Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>

* brushing the vars names, also adding the exception handling

* disabling the auto-batching for the networks with non-batched outputs and faster-rcnn and the like (CVS-74085) to minimize the # of failures

* add try catch

Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>

* brushing the code changed in the GPU plugin

* Auto-Batch requests tests

* brushed variables a bit (ref)

* cleaned debug output from the ie_core

* cleaned cmake for the Auto-Batch

* removed batchN estimation from batch1

* cleaned from debug printf

* comments, cleanup

* WA the mock test errors introduced with merging the https://github.com/myshevts/openvino/pull/13

* Adding back the removed batchN estimation from batch1 to debug degradations on DG1 (resulting from a too optimistic MAX_BATCH_SIZE?). This partially reverts commit e8f1738ac1.

* brushing ie_core.cpp

* fix 32bit compilation

* Code review: ENABLE_AUTO_BATCH

* consolidate the auto-batching logic in ie_core.cpp into a single ApplyAutoBatching

* renamed/brushed the OPTIMAL_BATCH (now with _SIZE) and mimics the MAX_BATCH_SIZE wrt MODEL_PTR

* default value for the OPTIMAL_BATCH_SIZE

* clang

* accommodate new func tests location

* fix shuffle of headers after clang + copyrights

* fixed misprint made during code refactoring

* moving the common thread-safe containers (like ThreadSafeQueue) to the dedicated dev_api header

* switch from the device name to the OPTIMAL_BATCH_SIZE metric presence as a condition to consider Auto-Batching

* switching from the unsafe size() and minimizing time under lock

* code style

* brushed the ApplyAutoBatching

* brushed the metric/config names and descriptions

* completed the core integration tests for the auto-batching

* ExecGraphInfo and check for incorrect cfg

* removed explicit dependencies from cmake file of the plugin

* disabling Auto-Batching thru the tput hint (to preserve the current product default), only explicit usage like BATCH:GPU is exercised in the tests

Co-authored-by: Roman Lyamin <roman.lyamin@intel.com>
Co-authored-by: Hu, Yuan2 <yuan2.hu@intel.com>
Maxim Shevtsov 2021-12-24 12:55:22 +03:00 committed by GitHub
parent bc5da8d522
commit 49b5e5728b
47 changed files with 1882 additions and 188 deletions

View File

@ -100,6 +100,8 @@ ie_option (ENABLE_GAPI_PREPROCESSING "Enables G-API preprocessing" ON)
ie_option (ENABLE_MULTI "Enables MULTI Device Plugin" ON)
ie_option (ENABLE_AUTO "Enables AUTO Device Plugin" ON)
ie_option (ENABLE_AUTO_BATCH "Enables Auto-Batching Plugin" ON)
ie_option (ENABLE_HETERO "Enables Hetero Device Plugin" ON)
ie_option (ENABLE_TEMPLATE "Enable template plugin" ON)

View File

@ -141,6 +141,9 @@ When specifying key values as raw strings (that is, when using Python API), omit
@snippet snippets/GPU_Metric1.cpp part1
* OPTIMAL_BATCH_SIZE : Returns the _optimal_ batch size for a given network on the given GPU device. The returned value is aligned to a power of 2. Also, MODEL_PTR is a required option for this metric, since the optimal batch size highly depends on the model. If MODEL_PTR is not given, the value of 1 is returned. The example code that sets the required and optional options for this metric is available in the following snippet:
@snippet snippets/GPU_Metric1.cpp part2
## GPU Context and Video Memory Sharing RemoteBlob API
See [RemoteBlob API of GPU Plugin](GPU_RemoteBlob_API.md)

View File

@ -14,4 +14,12 @@ options.insert(std::make_pair("AVAILABLE_DEVICE_MEM_SIZE", available_device_mem_
auto max_batch_size = core.GetMetric("GPU", GPU_METRIC_KEY(MAX_BATCH_SIZE), options).as<uint32_t>();
//! [part1]
//! [part2]
std::map<std::string, Parameter> opt = {{"MODEL_PTR", cnnNetwork.getFunction()}}; // Required. Same usage as for the MAX_BATCH_SIZE above. If not set, the OPTIMAL_BATCH_SIZE returns 1.
// This is not an entirely GPU-specific metric (so METRIC_KEY is used rather than GPU_METRIC_KEY below),
// but the GPU is the only device that supports it at the moment.
// For the GPU, the metric already accommodates the on-device memory limitation that the MAX_BATCH_SIZE poses,
// so OPTIMAL_BATCH_SIZE is always less than MAX_BATCH_SIZE. Unlike the latter, it is also aligned to a power of 2.
auto optimal_batch_size = core.GetMetric("GPU", METRIC_KEY(OPTIMAL_BATCH_SIZE), options).as<unsigned int>();
//! [part2]
}

View File

@ -6,6 +6,7 @@
#include <string>
#include <vector>
#include <tuple>
namespace cldnn {
/// @addtogroup cpp_api C++ API
@ -25,6 +26,10 @@ struct gfx_version {
uint16_t major;
uint8_t minor;
uint8_t revision;
friend bool operator < (const gfx_version& l, const gfx_version& r) {
return std::tie(l.major, l.minor, l.revision)
< std::tie(r.major, r.minor, r.revision); // same order
}
};
/// @brief Information about the device properties and capabilities.

View File

@ -124,6 +124,7 @@ std::map<std::string, std::vector<InferenceEngine::Blob::Ptr>> getRemoteInputBlo
}
auto blob = InferenceEngine::gpu::make_shared_blob(desc, context, clBuffer.back());
blob->allocate();
remoteBlobs[name].push_back(blob);
};

View File

@ -109,8 +109,10 @@ std::vector<float> splitFloat(const std::string& s, char delim) {
std::vector<std::string> parseDevices(const std::string& device_string) {
std::string comma_separated_devices = device_string;
if (comma_separated_devices.find(":") != std::string::npos) {
comma_separated_devices = comma_separated_devices.substr(comma_separated_devices.find(":") + 1);
auto colon = comma_separated_devices.find(":");
if (colon != std::string::npos) {
auto bracket = comma_separated_devices.find("("); // e.g. in BATCH:GPU(4)
comma_separated_devices = comma_separated_devices.substr(colon + 1, bracket - colon - 1);
}
if ((comma_separated_devices == "MULTI") || (comma_separated_devices == "HETERO"))
return std::vector<std::string>();

View File

@ -26,6 +26,10 @@ if(ENABLE_AUTO OR ENABLE_MULTI)
add_dependencies(${TARGET_NAME} ov_auto_plugin)
endif()
if(ENABLE_AUTO_BATCH)
add_dependencies(${TARGET_NAME} ov_auto_batch_plugin)
endif()
if(ENABLE_INTEL_CPU)
add_dependencies(${TARGET_NAME} ov_intel_cpu_plugin)
endif()

View File

@ -16,6 +16,7 @@
#include "cpp/ie_cnn_network.h"
#include "cpp_interfaces/interface/ie_iexecutable_network_internal.hpp"
#include "ie_parameter.hpp"
#include "ie_remote_context.hpp"
#include "threading/ie_itask_executor.hpp"
namespace InferenceEngine {
@ -60,6 +61,22 @@ public:
const std::string& deviceName,
const std::map<std::string, std::string>& config = {}) = 0;
/**
* @brief Creates an executable network from a network object.
*
* Users can create as many networks as they need and use
* them simultaneously (up to the limitation of the hardware resources)
*
* @param network CNNNetwork object acquired from Core::ReadNetwork
* @param remoteCtx "Remote" (non-CPU) accelerator device-specific execution context to use
* @param config Optional map of pairs: (config parameter name, config parameter value) relevant only for this load
* operation
* @return An executable network reference
*/
virtual SoExecutableNetworkInternal LoadNetwork(const CNNNetwork& network,
const RemoteContext::Ptr& remoteCtx,
const std::map<std::string, std::string>& config = {}) = 0;
/**
* @brief Creates an executable network from a model file.
*
@ -142,6 +159,16 @@ public:
*/
virtual bool DeviceSupportsImportExport(const std::string& deviceName) const = 0;
/**
* @brief Create a new shared context object on specified accelerator device
* using specified plugin-specific low level device API parameters (device handle, pointer, etc.)
* @param deviceName Name of a device to create new shared context on.
* @param params Map of device-specific shared context parameters.
* @return A shared pointer to a created remote context.
*/
virtual InferenceEngine::RemoteContext::Ptr CreateContext(const std::string& deviceName,
const InferenceEngine::ParamMap&) = 0;
virtual bool isNewAPI() const = 0;
/**
@ -165,6 +192,7 @@ public:
static std::vector<std::string> getHeteroDevices(std::string fallbackDevice);
static std::vector<std::string> getMultiDevices(std::string devicesList);
static std::string getBatchDevice(std::string devicesList);
};
} // namespace InferenceEngine
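
The two ICore additions above mirror the public InferenceEngine::Core API (Core::CreateContext and the LoadNetwork overload taking a RemoteContext). A minimal sketch of that public-side usage follows; the model path is a placeholder, and the empty parameter map stands in for the plugin-specific context parameters (e.g. an existing OpenCL context handle) a real application would pass:

#include <ie_core.hpp>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork("model.xml");  // placeholder model path

    // Create a device-specific shared context; real GPU usage would pass plugin-specific params here
    InferenceEngine::RemoteContext::Ptr context = core.CreateContext("GPU", {});

    // Load the network on that context (the optional config map is omitted)
    auto execNet = core.LoadNetwork(network, context);
    auto request = execNet.CreateInferRequest();
    return 0;
}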

View File

@ -23,14 +23,12 @@ struct MemBandwidthPressure {
static MemBandwidthPressure MemBandwidthPressureTolerance(
const std::shared_ptr<ngraph::Function> nGraphFunc,
const float L2_cache_size,
const float L3_cache_size,
const float cache_size,
const float memThresholdAssumeLimited = MemBandwidthPressure::LIMITED) {
int total_convs = 0, mem_limited_convs = 0, compute_convs = 0, total_gemms = 0, mem_limited_gemms = 0,
total_deconvs = 0, compute_deconvs = 0, mem_limited_deconvs = 0;
auto memLimitedFactor = [&](int size_data_moved, int datatype_size) -> float {
return (L2_cache_size * 1.0f /*util factor, tbd */
/ (size_data_moved * datatype_size));
auto memLimitedFactor = [&](int size_data_moved, int datatype_size = 4) -> float {
return (cache_size / (size_data_moved * datatype_size));
};
auto isLowPrecision = [&](ngraph::element::Type type) -> bool {
return (type == ngraph::element::i8) || (type == ngraph::element::u8);

View File

@ -0,0 +1,86 @@
// Copyright (C) 2018-2021 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
///////////////////////////////////////////////////////////////////////////////////////////////////
#pragma once
#include <cstddef>
#include <mutex>
#include <queue>
#include <type_traits>
#include "ie_parallel.hpp"
#if ((IE_THREAD == IE_THREAD_TBB) || (IE_THREAD == IE_THREAD_TBB_AUTO))
# include <tbb/concurrent_queue.h>
#endif
namespace InferenceEngine {
template <typename T>
class ThreadSafeQueueWithSize {
public:
void push(T value) {
std::lock_guard<std::mutex> lock(_mutex);
_queue.push(std::move(value));
}
bool try_pop(T& value) {
std::lock_guard<std::mutex> lock(_mutex);
if (!_queue.empty()) {
value = std::move(_queue.front());
_queue.pop();
return true;
} else {
return false;
}
}
size_t size() {
std::lock_guard<std::mutex> lock(_mutex);
return _queue.size();
}
protected:
std::queue<T> _queue;
std::mutex _mutex;
};
#if ((IE_THREAD == IE_THREAD_TBB) || (IE_THREAD == IE_THREAD_TBB_AUTO))
template <typename T>
using ThreadSafeQueue = tbb::concurrent_queue<T>;
template <typename T>
using ThreadSafeBoundedQueue = tbb::concurrent_bounded_queue<T>;
#else
template <typename T>
using ThreadSafeQueue = ThreadSafeQueueWithSize<T>;
template <typename T>
class ThreadSafeBoundedQueue {
public:
ThreadSafeBoundedQueue() = default;
bool try_push(T value) {
std::lock_guard<std::mutex> lock(_mutex);
if (_capacity) {
_queue.push(std::move(value));
}
return _capacity;
}
bool try_pop(T& value) {
std::lock_guard<std::mutex> lock(_mutex);
if (_capacity && !_queue.empty()) {
value = std::move(_queue.front());
_queue.pop();
return true;
} else {
return false;
}
}
void set_capacity(std::size_t newCapacity) {
std::lock_guard<std::mutex> lock(_mutex);
_capacity = newCapacity;
}
protected:
std::queue<T> _queue;
std::mutex _mutex;
bool _capacity = false;
};
#endif
} // namespace InferenceEngine
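
A brief usage sketch of the fallback containers above (illustrative only; on TBB builds the same aliases resolve to tbb::concurrent_queue / tbb::concurrent_bounded_queue):

#include "threading/ie_thread_safe_containers.hpp"

void thread_safe_queue_example() {
    InferenceEngine::ThreadSafeBoundedQueue<int> idleWorkers;
    idleWorkers.set_capacity(4);      // in the fallback shown above, pushing is rejected until a capacity is set

    for (int i = 0; i < 4; i++)
        idleWorkers.try_push(i);      // producer side

    int workerId = 0;
    while (idleWorkers.try_pop(workerId)) {
        // consumer side: non-blocking pop, e.g. schedule work on workerId
    }
}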

View File

@ -118,6 +118,18 @@ DECLARE_METRIC_VALUE(BATCHED_BLOB);
* String value for metric name is "RANGE_FOR_STREAMS".
*/
DECLARE_METRIC_KEY(RANGE_FOR_STREAMS, std::tuple<unsigned int, unsigned int>);
/**
* @brief Metric to query the optimal batch size for the given device and network
*
* The metric returns a value of unsigned int type:
* the optimal batch size for a given network on the given device. The returned value is aligned to a power of 2.
* Also, MODEL_PTR is a required option for this metric since the optimal batch size depends on the model,
* so if MODEL_PTR is not given, the result of the metric is always 1.
* For the GPU, the metric is queried automatically whenever the OpenVINO throughput performance hint is used,
* so that the result (>1) governs the automatic batching (transparently to the application).
* The automatic batching can be disabled with ALLOW_AUTO_BATCHING set to NO.
*/
DECLARE_METRIC_KEY(OPTIMAL_BATCH_SIZE, unsigned int);
/**
* @brief Metric to provide a hint for a range for number of async infer requests. If device supports streams,
@ -250,6 +262,15 @@ DECLARE_CONFIG_KEY(PERFORMANCE_HINT_NUM_REQUESTS);
DECLARE_CONFIG_VALUE(YES);
DECLARE_CONFIG_VALUE(NO);
/**
* @brief Auto-batching configuration, string for the device + batch size, e.g. "GPU(4)"
*/
DECLARE_CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG);
/**
* @brief Auto-batching configuration: string with timeout (in ms), e.g. "100"
*/
DECLARE_CONFIG_KEY(AUTO_BATCH_TIMEOUT);
/**
* @brief Limit `#threads` that are used by Inference Engine for inference on the CPU.
*/
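
A sketch of how the new metric and config keys above could be used together from application code (the batch size 4 and the 100 ms timeout are arbitrary example values):

#include <ie_core.hpp>
#include <ie_plugin_config.hpp>

void auto_batch_example(const InferenceEngine::CNNNetwork& network) {
    InferenceEngine::Core core;

    // Query the optimal batch size for this model on the GPU (without MODEL_PTR the metric returns 1)
    std::map<std::string, InferenceEngine::Parameter> options = {{"MODEL_PTR", network.getFunction()}};
    auto optimalBatch = core.GetMetric("GPU", METRIC_KEY(OPTIMAL_BATCH_SIZE), options).as<unsigned int>();

    // Explicitly load through the BATCH device with batch size 4 and a 100 ms collection timeout
    auto execNet = core.LoadNetwork(network, "BATCH:GPU(4)",
                                    {{CONFIG_KEY(AUTO_BATCH_TIMEOUT), "100"}});
    (void)optimalBatch;
}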

View File

@ -46,6 +46,7 @@
#endif
using namespace InferenceEngine::PluginConfigParams;
using namespace InferenceEngine;
using namespace std::placeholders;
namespace ov {
@ -94,6 +95,9 @@ Parsed<T> parseDeviceNameIntoConfig(const std::string& deviceName, const std::ma
config_[ie::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES] =
deviceName.substr(std::string("AUTO:").size());
}
} else if (deviceName_.find("BATCH:") == 0) {
deviceName_ = "BATCH";
config_[CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG)] = deviceName.substr(6);
} else {
ie::DeviceIDParser parser(deviceName_);
deviceName_ = parser.getDeviceName();
@ -480,14 +484,22 @@ public:
return newAPI;
}
ov::runtime::SoPtr<ie::IExecutableNetworkInternal> LoadNetwork(const ie::CNNNetwork& network,
ov::runtime::SoPtr<ie::IExecutableNetworkInternal> LoadNetwork(
const ie::CNNNetwork& network,
const std::shared_ptr<ie::RemoteContext>& context,
const std::map<std::string, std::string>& config) {
const std::map<std::string, std::string>& config) override {
OV_ITT_SCOPE(FIRST_INFERENCE, ie::itt::domains::IE_LT, "Core::LoadNetwork::RemoteContext");
if (context == nullptr) {
IE_THROW() << "Remote context is null";
}
// have to deduce the device name/config from the context first
auto parsed = parseDeviceNameIntoConfig(context->getDeviceName(), config);
std::string& deviceName = parsed._deviceName;
std::map<std::string, std::string>& config_with_batch = parsed._config;
// if auto-batching is applicable, the below function will patch the device name and config accordingly:
ApplyAutoBatching(network, deviceName, config_with_batch);
parsed = parseDeviceNameIntoConfig(deviceName, config_with_batch);
auto plugin = GetCPPPluginByName(parsed._deviceName);
ov::runtime::SoPtr<ie::IExecutableNetworkInternal> res;
auto cacheManager = coreConfig.getCacheConfig()._cacheManager;
@ -508,12 +520,59 @@ public:
return res;
}
void ApplyAutoBatching(const ie::CNNNetwork& network,
std::string& deviceName,
std::map<std::string, std::string>& config_with_batch) {
if (deviceName.find("BATCH") != std::string::npos) {
// explicitly enabled Auto-Batching e.g. in the tests
auto pos = deviceName.find_first_of(":");
if (pos != std::string::npos) {
auto deviceNameWithBatchSize = deviceName.substr(pos + 1);
auto deviceNameWithoutBatch = DeviceIDParser::getBatchDevice(deviceNameWithBatchSize);
auto function = network.getFunction();
// have to execute the DetectionOutput separately (without batching)
// as this layer mix-in the values from the different inputs (batch id)
bool bDetectionOutput = false;
const std::string detectionOutputOpName = ngraph::op::DetectionOutput::get_type_info_static().name;
const std::string resultOpName = ngraph::op::Result::get_type_info_static().name;
for (auto&& node : function->get_ops()) {
auto isDetectionOutputParent = [&detectionOutputOpName](decltype(node)& nd) {
for (size_t n = 0; n < nd->get_input_size(); n++) {
if (detectionOutputOpName == nd->get_input_node_ptr(n)->get_type_info().name)
return true;
}
return false;
};
if ((detectionOutputOpName == node->get_type_info().name) ||
((resultOpName == node->get_type_info().name) && isDetectionOutputParent(node))) {
node->get_rt_info()["affinity"] = deviceNameWithoutBatch;
bDetectionOutput = true;
} else {
node->get_rt_info()["affinity"] = "BATCH";
}
}
if (bDetectionOutput) {
deviceName = "HETERO:BATCH," + deviceNameWithoutBatch;
config_with_batch[CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG)] = deviceNameWithBatchSize;
} else {
deviceName = "BATCH:" + deviceNameWithBatchSize;
}
}
}
}
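// Example of the rewriting above: loading with "BATCH:GPU(4)" keeps the name as-is for a fully
// batchable graph; if the graph contains a DetectionOutput (or a Result fed by one), the affinities
// split it out, the device name becomes "HETERO:BATCH,GPU" and the original "GPU(4)" is carried via
// AUTO_BATCH_DEVICE_CONFIG, so only the batchable subgraph is executed with batching.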
ie::SoExecutableNetworkInternal LoadNetwork(const ie::CNNNetwork& network,
const std::string& deviceName,
const std::string& deviceNameOrig,
const std::map<std::string, std::string>& config) override {
OV_ITT_SCOPE(FIRST_INFERENCE, ie::itt::domains::IE_LT, "Core::LoadNetwork::CNN");
bool forceDisableCache = config.count(CONFIG_KEY_INTERNAL(FORCE_DISABLE_CACHE)) > 0;
auto parsed = parseDeviceNameIntoConfig(deviceName, config);
std::string deviceName = deviceNameOrig;
std::map<std::string, std::string> config_with_batch = config;
// if auto-batching is applicable, the below function will patch the device name and config accordingly:
ApplyAutoBatching(network, deviceName, config_with_batch);
bool forceDisableCache = config_with_batch.count(CONFIG_KEY_INTERNAL(FORCE_DISABLE_CACHE)) > 0;
auto parsed = parseDeviceNameIntoConfig(deviceName, config_with_batch);
if (forceDisableCache) {
// remove this config key from parsed as plugins can throw unsupported exception
parsed._config.erase(CONFIG_KEY_INTERNAL(FORCE_DISABLE_CACHE));
@ -732,6 +791,19 @@ public:
return devices;
}
/**
* @brief Create a new shared context object on specified accelerator device
* using specified plugin-specific low level device API parameters (device handle, pointer, etc.)
* @param deviceName Name of a device to create new shared context on.
* @param params Map of device-specific shared context parameters.
* @return A shared pointer to a created remote context.
*/
InferenceEngine::RemoteContext::Ptr CreateContext(const std::string& deviceName,
const InferenceEngine::ParamMap& params) override {
auto parsed = ov::runtime::parseDeviceNameIntoConfig(deviceName, params);
return GetCPPPluginByName(parsed._deviceName).create_context(parsed._config)._ptr;
}
/**
* @brief Returns reference to CPP plugin wrapper by a device name
* @param deviceName A name of device
@ -1030,6 +1102,12 @@ public:
deviceNames = ie::DeviceIDParser::getMultiDevices(deviceName.substr(pos + 1));
}
deviceNames.emplace_back("AUTO");
} else if (deviceName.find("BATCH") == 0) {
auto pos = deviceName.find_first_of(":");
if (pos != std::string::npos) {
deviceNames = {ie::DeviceIDParser::getBatchDevice(deviceName.substr(pos + 1))};
}
deviceNames.push_back("BATCH");
} else {
deviceNames.push_back(deviceName);
}
@ -1120,8 +1198,8 @@ std::vector<std::string> DeviceIDParser::getHeteroDevices(std::string fallbackDe
}
std::vector<std::string> DeviceIDParser::getMultiDevices(std::string devicesList) {
std::vector<std::string> deviceNames;
auto trim_request_info = [](std::string device_with_requests) {
std::set<std::string> deviceNames;
auto trim_request_info = [](const std::string& device_with_requests) {
auto opening_bracket = device_with_requests.find_first_of('(');
return device_with_requests.substr(0, opening_bracket);
};
@ -1132,14 +1210,36 @@ std::vector<std::string> DeviceIDParser::getMultiDevices(std::string devicesList
// we skip the #requests info here
while ((pos = devicesList.find(delimiter)) != std::string::npos) {
auto d = devicesList.substr(0, pos);
deviceNames.push_back(trim_request_info(d));
if (d.find("BATCH") == 0) {
deviceNames.insert("BATCH");
auto p = d.find_first_of(":");
if (p != std::string::npos)
deviceNames.insert(DeviceIDParser::getBatchDevice(d.substr(p + 1)));
} else {
deviceNames.insert(trim_request_info(d));
}
devicesList.erase(0, pos + 1);
}
if (!devicesList.empty())
deviceNames.push_back(trim_request_info(devicesList));
if (!devicesList.empty()) {
if (devicesList.find("BATCH") == 0) {
deviceNames.insert("BATCH");
auto p = devicesList.find_first_of(":");
if (p != std::string::npos)
deviceNames.insert(DeviceIDParser::getBatchDevice(devicesList.substr(p + 1)));
} else {
deviceNames.insert(trim_request_info(devicesList));
}
}
return std::vector<std::string>(deviceNames.begin(), deviceNames.end());
}
return deviceNames;
std::string DeviceIDParser::getBatchDevice(std::string device) {
auto trim_request_info = [](const std::string& device_with_requests) {
auto opening_bracket = device_with_requests.find_first_of('(');
return device_with_requests.substr(0, opening_bracket);
};
return trim_request_info(device);
}
class Core::Impl : public ov::runtime::CoreImpl {
@ -1207,18 +1307,7 @@ ExecutableNetwork Core::LoadNetwork(const std::string& modelPath, const std::map
}
RemoteContext::Ptr Core::CreateContext(const std::string& deviceName, const ParamMap& params) {
if (deviceName.find("HETERO") == 0) {
IE_THROW() << "HETERO device does not support remote context";
}
if (deviceName.find("MULTI") == 0) {
IE_THROW() << "MULTI device does not support remote context";
}
if (deviceName.find("AUTO") == 0) {
IE_THROW() << "AUTO device does not support remote context";
}
auto parsed = ov::runtime::parseDeviceNameIntoConfig(deviceName, params);
return _impl->GetCPPPluginByName(parsed._deviceName).create_context(parsed._config)._ptr;
return _impl->CreateContext(deviceName, params);
}
RemoteContext::Ptr Core::GetDefaultContext(const std::string& deviceName) {

View File

@ -21,3 +21,7 @@ endif()
if(ENABLE_AUTO OR ENABLE_MULTI)
add_subdirectory(auto)
endif()
if(ENABLE_AUTO_BATCH)
add_subdirectory(auto_batch)
endif()

View File

@ -156,7 +156,8 @@ MultiDeviceExecutableNetwork::MultiDeviceExecutableNetwork(const std::string&
, _needPerfCounters(needPerfCounters)
, _multiPlugin(plugin)
, _context(context)
, _workModeIsAUTO(true) {
, _workModeIsAUTO(true)
, _network(network) {
if (_multiPlugin->GetCore() == nullptr) {
IE_THROW() << "Please, work with " << _multiPlugin->GetName() << " device via InferencEngine::Core object";
}
@ -667,10 +668,30 @@ InferenceEngine::Parameter MultiDeviceExecutableNetwork::GetMetric(const std::st
real = _loadContext[ACTUALDEVICE].
executableNetwork->GetMetric(name).as<unsigned int>();
} else {
IE_ASSERT(_loadContext[CPU].isAlready == true);
real = _loadContext[CPU].
executableNetwork->GetMetric(name).as<unsigned int>();
std::unique_lock<std::mutex> lock(_confMutex);
auto deviceInfo = _loadContext[ACTUALDEVICE].deviceInfo;
lock.unlock();
if (deviceInfo.deviceName.find("GPU") != std::string::npos) {
const auto& mode = deviceInfo.config.find(CONFIG_KEY(PERFORMANCE_HINT));
if (mode != deviceInfo.config.end() && mode->second == CONFIG_VALUE(THROUGHPUT)) {
std::map<std::string, InferenceEngine::Parameter> options;
options["MODEL_PTR"] = _network.getFunction(); // CNNntework
try {
auto optimalBatchSize = _core->GetMetric(deviceInfo.deviceName,
METRIC_KEY(OPTIMAL_BATCH_SIZE), options).as<unsigned int>();
auto rangeOfStreams = _core->GetMetric(deviceInfo.deviceName,
METRIC_KEY(RANGE_FOR_STREAMS), options).as<std::tuple<unsigned int, unsigned int>>();
real = (std::max)(real, std::get<1>(rangeOfStreams) * optimalBatchSize);
} catch (const InferenceEngine::Exception &iie) {
LOG_WARNING("[AUTOPLUGIN]get optimal infer requset num for GPU auto-batch failed :%s", iie.what());
}
unsigned int res = std::max(8u, real);
}
}
}
unsigned int res = (std::max)(8u, real);
IE_SET_METRIC_RETURN(OPTIMAL_NUMBER_OF_INFER_REQUESTS, res);
}

View File

@ -7,22 +7,17 @@
#include <atomic>
#include <mutex>
#include <queue>
#include <unordered_map>
#include <map>
#include <vector>
#include <string>
#include <cpp_interfaces/impl/ie_executable_network_thread_safe_default.hpp>
#include <ie_parallel.hpp>
#include <threading/ie_itask_executor.hpp>
#include <threading/ie_executor_manager.hpp>
#include "cpp_interfaces/impl/ie_executable_network_thread_safe_default.hpp"
#include "threading/ie_thread_safe_containers.hpp"
#include "threading/ie_itask_executor.hpp"
#include "threading/ie_executor_manager.hpp"
#include "ie_icore.hpp"
#if (IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO)
# include <tbb/concurrent_queue.h>
#endif
#ifdef MULTIUNITTEST
#define MOCKTESTMACRO virtual
#define MultiDevicePlugin MockMultiDevicePlugin
@ -79,66 +74,6 @@ enum AutoLoadContextIndex {
template<typename T>
using DeviceMap = std::unordered_map<DeviceName, T>;
#if ((IE_THREAD == IE_THREAD_TBB) || (IE_THREAD == IE_THREAD_TBB_AUTO))
template <typename T>
using ThreadSafeQueue = tbb::concurrent_queue<T>;
template <typename T>
using ThreadSafeBoundedQueue = tbb::concurrent_bounded_queue<T>;
#else
template <typename T>
class ThreadSafeQueue {
public:
void push(T value) {
std::lock_guard<std::mutex> lock(_mutex);
_queue.push(std::move(value));
}
bool try_pop(T& value) {
std::lock_guard<std::mutex> lock(_mutex);
if (!_queue.empty()) {
value = std::move(_queue.front());
_queue.pop();
return true;
} else {
return false;
}
}
protected:
std::queue<T> _queue;
std::mutex _mutex;
};
template <typename T>
class ThreadSafeBoundedQueue {
public:
ThreadSafeBoundedQueue() = default;
bool try_push(T value) {
std::lock_guard<std::mutex> lock(_mutex);
if (_capacity) {
_queue.push(std::move(value));
}
return _capacity;
}
bool try_pop(T& value) {
std::lock_guard<std::mutex> lock(_mutex);
if (_capacity && !_queue.empty()) {
value = std::move(_queue.front());
_queue.pop();
return true;
} else {
return false;
}
}
void set_capacity(std::size_t newCapacity) {
std::lock_guard<std::mutex> lock(_mutex);
_capacity = newCapacity;
}
protected:
std::queue<T> _queue;
std::mutex _mutex;
bool _capacity = false;
};
#endif
class MultiDeviceExecutableNetwork : public InferenceEngine::ExecutableNetworkThreadSafeDefault,
public InferenceEngine::ITaskExecutor {
public:
@ -148,7 +83,7 @@ public:
InferenceEngine::Task _task;
std::exception_ptr _exceptionPtr = nullptr;
};
using NotBusyWorkerRequests = ThreadSafeBoundedQueue<WorkerInferRequest*>;
using NotBusyWorkerRequests = InferenceEngine::ThreadSafeBoundedQueue<WorkerInferRequest*>;
explicit MultiDeviceExecutableNetwork(const DeviceMap<InferenceEngine::SoExecutableNetworkInternal>& networksPerDevice,
const std::vector<DeviceInformation>& networkDevices,
@ -186,8 +121,8 @@ public:
std::vector<DeviceInformation> _devicePriorities;
const std::vector<DeviceInformation> _devicePrioritiesInitial;
DeviceMap<InferenceEngine::SoExecutableNetworkInternal> _networksPerDevice;
ThreadSafeQueue<InferenceEngine::Task> _inferPipelineTasks;
DeviceMap<std::unique_ptr<ThreadSafeQueue<InferenceEngine::Task>>> _inferPipelineTasksDeviceSpecific;
InferenceEngine::ThreadSafeQueue<InferenceEngine::Task> _inferPipelineTasks;
DeviceMap<std::unique_ptr<InferenceEngine::ThreadSafeQueue<InferenceEngine::Task>>> _inferPipelineTasksDeviceSpecific;
DeviceMap<NotBusyWorkerRequests> _idleWorkerRequests;
DeviceMap<std::vector<WorkerInferRequest>> _workerRequests;
std::unordered_map<std::string, InferenceEngine::Parameter> _config;
@ -217,6 +152,7 @@ private:
std::promise<void> _firstLoadPromise;
mutable AutoLoadContext _loadContext[CONTEXTNUM];
mutable std::mutex _confMutex;
const InferenceEngine::CNNNetwork _network;
};
} // namespace MultiDevicePlugin

View File

@ -0,0 +1,20 @@
# Copyright (C) 2018-2021 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#
set(TARGET_NAME "ov_auto_batch_plugin")
file(GLOB SOURCES ${CMAKE_CURRENT_SOURCE_DIR}/*.cpp)
file(GLOB HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/*.hpp)
ie_add_plugin(NAME ${TARGET_NAME}
DEVICE_NAME "BATCH"
SOURCES ${SOURCES} ${HEADERS}
VERSION_DEFINES_FOR auto_batch.cpp ADD_CLANG_FORMAT)
target_link_libraries(${TARGET_NAME} PRIVATE Threads::Threads)
ie_add_api_validator_post_build_step(TARGET ${TARGET_NAME})
set_target_properties(${TARGET_NAME} PROPERTIES INTERPROCEDURAL_OPTIMIZATION_RELEASE ${ENABLE_LTO})

View File

@ -0,0 +1,731 @@
// Copyright (C) 2018-2021 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
///////////////////////////////////////////////////////////////////////////////////////////////////
#include "auto_batch.hpp"
#include <cpp_interfaces/interface/ie_internal_plugin_config.hpp>
#include <ie_icore.hpp>
#include <ie_ngraph_utils.hpp>
#include <ie_performance_hints.hpp>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>
namespace AutoBatchPlugin {
using namespace InferenceEngine;
std::vector<std::string> supported_configKeys = {CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG), CONFIG_KEY(AUTO_BATCH_TIMEOUT)};
template <Precision::ePrecision precision>
Blob::Ptr create_shared_blob_on_top_of_batched_blob(Blob::Ptr batched_blob, size_t batch_id, size_t batch_num) {
typedef typename PrecisionTrait<precision>::value_type TYPE;
typedef typename std::add_pointer<TYPE>::type TYPEPTR;
auto ptr = batched_blob->buffer().as<TYPEPTR>();
auto sizePerBatch = batched_blob->size() / batch_num;
auto layout = batched_blob->getTensorDesc().getLayout();
SizeVector dims = batched_blob->getTensorDesc().getDims();
// the below code is a placeholder for the WIP (22.1) functionality
// that will check the reshaping by the batch is robust (CVS-51744)
if (layout == InferenceEngine::Layout::NC || layout == InferenceEngine::Layout::NCDHW ||
layout == InferenceEngine::Layout::NCHW || layout == InferenceEngine::Layout::NHWC ||
layout == InferenceEngine::Layout::NDHWC) {
dims[0] = 1;
assert(batched_blob->getTensorDesc().getPrecision() == precision);
return make_shared_blob<TYPE>({precision, dims, batched_blob->getTensorDesc().getLayout()},
ptr + sizePerBatch * batch_id,
sizePerBatch);
} else {
// same blob for all requests (e.g. constants)
return make_shared_blob<TYPE>({precision, dims, batched_blob->getTensorDesc().getLayout()}, ptr);
}
}
// ------------------------------AutoBatchInferRequest----------------------------
AutoBatchInferRequest::AutoBatchInferRequest(const InputsDataMap& networkInputs,
const OutputsDataMap& networkOutputs,
AutoBatchExecutableNetwork::WorkerInferRequest& workerRequestPtr,
int batch_id,
int num_batch,
bool needPerfCounters)
: IInferRequestInternal(networkInputs, networkOutputs),
_myBatchedRequestWrapper(workerRequestPtr),
_needPerfCounters(needPerfCounters),
_batchId(batch_id),
_batchSize(num_batch) {
// Allocate all input blobs
for (const auto& it : networkInputs) {
auto blob = _myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first);
Blob::Ptr res;
switch (it.second->getTensorDesc().getPrecision()) {
case InferenceEngine::Precision::FP32:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::FP32>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
case InferenceEngine::Precision::I32:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::I32>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
case InferenceEngine::Precision::I8:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::I8>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
case InferenceEngine::Precision::U16:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::U16>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
case InferenceEngine::Precision::I16:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::I16>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
case InferenceEngine::Precision::U8:
case InferenceEngine::Precision::BOOL:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::U8>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
default:
IE_THROW() << "Unsupported input precision " << it.second->getTensorDesc().getPrecision();
}
_inputs[it.first] = res;
}
// Allocate all output blobs
for (const auto& it : networkOutputs) {
auto blob = _myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first);
Blob::Ptr res;
switch (it.second->getTensorDesc().getPrecision()) {
case InferenceEngine::Precision::FP32:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::FP32>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
case InferenceEngine::Precision::I32:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::I32>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
case InferenceEngine::Precision::I8:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::I8>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
case InferenceEngine::Precision::U16:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::U16>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
case InferenceEngine::Precision::I16:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::I16>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
case InferenceEngine::Precision::U8:
case InferenceEngine::Precision::BOOL:
res = create_shared_blob_on_top_of_batched_blob<InferenceEngine::Precision::U8>(
_myBatchedRequestWrapper._inferRequestBatched->GetBlob(it.first),
batch_id,
num_batch);
break;
default:
IE_THROW(NotImplemented) << "Unsupported input precision " << it.second->getTensorDesc().getPrecision();
}
_outputs[it.first] = res;
}
}
void AutoBatchInferRequest::SetBlobsToAnotherRequest(SoIInferRequestInternal& req) {
for (const auto& it : _networkInputs) {
auto& name = it.first;
// this request is already in BUSY state, so using the internal functions safely
auto blob = GetBlob(name);
if (req->GetBlob(name) != blob)
req->SetBlob(name, blob);
}
for (const auto& it : _networkOutputs) {
auto& name = it.first;
// this request is already in BUSY state, so using the internal functions safely
auto blob = GetBlob(name);
if (req->GetBlob(name) != blob)
req->SetBlob(name, blob);
}
}
void AutoBatchInferRequest::CopyInputsIfNeeded() {
for (const auto& it : _networkInputs) {
auto& name = it.first;
// this request is already in BUSY state, so using the internal functions safely
CopyBlobIfNeeded(GetBlob(name), _myBatchedRequestWrapper._inferRequestBatched->GetBlob(name), true);
}
}
void AutoBatchInferRequest::CopyBlobIfNeeded(InferenceEngine::Blob::CPtr src,
InferenceEngine::Blob::Ptr dst,
bool bInput) {
auto bufferDst = dst->buffer();
auto ptrDst = bufferDst.as<char*>();
auto bufferSrc = src->cbuffer();
auto ptrSrc = bufferSrc.as<const char*>();
ptrdiff_t szDst = dst->byteSize();
ptrdiff_t szSrc = src->byteSize();
if (bInput) {
ptrdiff_t offset = szSrc != szDst ? _batchId * szDst / _batchSize : 0;
if ((ptrDst + offset) == ptrSrc)
return;
else
memcpy(ptrDst + offset, ptrSrc, szSrc);
} else {
ptrdiff_t offset = szSrc != szDst ? _batchId * szSrc / _batchSize : 0;
if ((ptrSrc + offset) == ptrDst)
return;
else
memcpy(ptrDst, ptrSrc + offset, szDst);
}
}
void AutoBatchInferRequest::CopyOutputsIfNeeded() {
for (const auto& it : _networkOutputs) {
auto& name = it.first;
// this request is already in BUSY state, so using the internal functions safely
CopyBlobIfNeeded(_myBatchedRequestWrapper._inferRequestBatched->GetBlob(name), GetBlob(name), false);
}
}
std::map<std::string, InferenceEngine::InferenceEngineProfileInfo> AutoBatchInferRequest::GetPerformanceCounts() const {
return _perfMap;
}
AutoBatchAsyncInferRequest::AutoBatchAsyncInferRequest(
const AutoBatchInferRequest::Ptr& inferRequest,
const bool needPerfCounters,
InferenceEngine::SoIInferRequestInternal& inferRequestWithoutBatch,
const ITaskExecutor::Ptr& callbackExecutor)
: AsyncInferRequestThreadSafeDefault(inferRequest, nullptr, callbackExecutor),
_inferRequestWithoutBatch(inferRequestWithoutBatch),
_inferRequest{inferRequest} {
// this executor starts the inference while the task (checking the result) is passed to the next stage
struct ThisRequestExecutor : public ITaskExecutor {
explicit ThisRequestExecutor(AutoBatchAsyncInferRequest* _this_) : _this{_this_} {}
void run(Task task) override {
auto& workerInferRequest = _this->_inferRequest->_myBatchedRequestWrapper;
std::pair<AutoBatchAsyncInferRequest*, InferenceEngine::Task> t;
t.first = _this;
t.second = std::move(task);
workerInferRequest._tasks.push(t);
// it is ok to call size() here as the queue only grows (and the bulk removal happens under the mutex)
const int sz = workerInferRequest._tasks.size();
if (sz == workerInferRequest._batchSize) {
workerInferRequest._cond.notify_one();
}
};
AutoBatchAsyncInferRequest* _this = nullptr;
};
_pipeline = {
{/*TaskExecutor*/ std::make_shared<ThisRequestExecutor>(this), /*task*/ [this, needPerfCounters] {
if (this->_inferRequest->_exceptionPtr) // if the exception happened in the batch1 fallback
std::rethrow_exception(this->_inferRequest->_exceptionPtr);
if (this->_inferRequest->_myBatchedRequestWrapper._exceptionPtr) // when the batchN execution failed
std::rethrow_exception(this->_inferRequest->_myBatchedRequestWrapper._exceptionPtr);
this->_inferRequest->CopyOutputsIfNeeded();
}}};
}
void AutoBatchAsyncInferRequest::Infer_ThreadUnsafe() {
InferUsingAsync();
}
AutoBatchAsyncInferRequest::~AutoBatchAsyncInferRequest() {
StopAndWait();
}
// ------------------------------AutoBatchExecutableNetwork----------------------------
AutoBatchExecutableNetwork::AutoBatchExecutableNetwork(
const InferenceEngine::SoExecutableNetworkInternal& networkWithBatch,
const InferenceEngine::SoExecutableNetworkInternal& networkWithoutBatch,
const DeviceInformation& networkDevice,
const std::unordered_map<std::string, InferenceEngine::Parameter>& config,
const bool needPerfCounters)
: InferenceEngine::ExecutableNetworkThreadSafeDefault(nullptr,
std::make_shared<InferenceEngine::ImmediateExecutor>()),
_network{networkWithBatch},
_networkWithoutBatch{networkWithoutBatch},
_config{config},
_needPerfCounters{needPerfCounters} {
// WA for gcc 4.8 ( fails compilation with member init-list)
_device = networkDevice;
auto time_out = config.find(CONFIG_KEY(AUTO_BATCH_TIMEOUT));
if (time_out != config.end())
_timeOut = ParseTimeoutValue(time_out->second.as<std::string>());
}
AutoBatchExecutableNetwork::~AutoBatchExecutableNetwork() {
_terminate = true;
for (auto w : _workerRequests) {
w->_thread.join();
}
_workerRequests.clear();
}
unsigned int AutoBatchExecutableNetwork::ParseTimeoutValue(const std::string& s) {
auto val = std::stoi(s);
if (val < 0)
IE_THROW(ParameterMismatch) << "Value for the " << CONFIG_KEY(AUTO_BATCH_TIMEOUT) << " should be unsigned int";
return val;
}
std::shared_ptr<InferenceEngine::RemoteContext> AutoBatchExecutableNetwork::GetContext() const {
return _network->GetContext();
}
InferenceEngine::IInferRequestInternal::Ptr AutoBatchExecutableNetwork::CreateInferRequestImpl(
InferenceEngine::InputsDataMap networkInputs,
InferenceEngine::OutputsDataMap networkOutputs) {
// todo : guard request creation from another thread/on-the-fly
auto num = _numRequestsCreated++;
auto batch_id = num % _device.batchForDevice;
if (!batch_id) { // need new request
_workerRequests.push_back(std::make_shared<WorkerInferRequest>());
auto workerRequestPtr = _workerRequests.back();
workerRequestPtr->_inferRequestBatched = {_network->CreateInferRequest(), _network._so};
workerRequestPtr->_batchSize = _device.batchForDevice;
workerRequestPtr->_completionTasks.resize(workerRequestPtr->_batchSize);
workerRequestPtr->_inferRequestBatched->SetCallback(
[workerRequestPtr, this](std::exception_ptr exceptionPtr) mutable {
if (exceptionPtr)
workerRequestPtr->_exceptionPtr = exceptionPtr;
IE_ASSERT(workerRequestPtr->_completionTasks.size() == (size_t)workerRequestPtr->_batchSize);
// notify the individual requests on the completion
for (int c = 0; c < workerRequestPtr->_batchSize; c++) {
workerRequestPtr->_completionTasks[c]();
}
// reset the timeout
workerRequestPtr->_cond.notify_one();
});
workerRequestPtr->_thread = std::thread([workerRequestPtr, this] {
while (1) {
std::cv_status status;
{
std::unique_lock<std::mutex> lock(workerRequestPtr->_mutex);
status = workerRequestPtr->_cond.wait_for(lock, std::chrono::milliseconds(_timeOut));
}
if (_terminate) {
break;
} else {
// as we pop the tasks from the queue only here
// it is ok to call size() (as the _tasks can only grow in parallel)
const int sz = workerRequestPtr->_tasks.size();
if (sz == workerRequestPtr->_batchSize) {
std::pair<AutoBatchAsyncInferRequest*, InferenceEngine::Task> t;
for (int n = 0; n < sz; n++) {
IE_ASSERT(workerRequestPtr->_tasks.try_pop(t));
workerRequestPtr->_completionTasks[n] = std::move(t.second);
t.first->_inferRequest->CopyInputsIfNeeded();
}
workerRequestPtr->_inferRequestBatched->StartAsync();
} else if ((status == std::cv_status::timeout) && sz) {
// timeout to collect the batch is over, have to execute the requests in the batch1 mode
std::pair<AutoBatchAsyncInferRequest*, InferenceEngine::Task> t;
// popping all tasks collected by the moment of the time-out and execute each with batch1
std::atomic<int> arrived = {0};
std::promise<void> all_completed;
auto all_completed_future = all_completed.get_future();
for (int n = 0; n < sz; n++) {
IE_ASSERT(workerRequestPtr->_tasks.try_pop(t));
t.first->_inferRequestWithoutBatch->SetCallback(
[t, sz, &arrived, &all_completed](std::exception_ptr p) {
if (p)
t.first->_inferRequest->_exceptionPtr = p;
t.second();
if (sz == ++arrived)
all_completed.set_value();
});
t.first->_inferRequest->SetBlobsToAnotherRequest(t.first->_inferRequestWithoutBatch);
t.first->_inferRequestWithoutBatch->StartAsync();
}
all_completed_future.get();
// now when all the tasks for this batch are completed, start waiting for the timeout again
}
}
}
});
}
return std::make_shared<AutoBatchInferRequest>(networkInputs,
networkOutputs,
*_workerRequests.back(),
batch_id,
_device.batchForDevice,
_needPerfCounters);
}
InferenceEngine::IInferRequestInternal::Ptr AutoBatchExecutableNetwork::CreateInferRequest() {
auto syncRequestImpl = CreateInferRequestImpl(_networkInputs, _networkOutputs);
syncRequestImpl->setPointerToExecutableNetworkInternal(shared_from_this());
InferenceEngine::SoIInferRequestInternal inferRequestWithoutBatch = {_networkWithoutBatch->CreateInferRequest(),
_networkWithoutBatch._so};
return std::make_shared<AutoBatchAsyncInferRequest>(
std::static_pointer_cast<AutoBatchInferRequest>(syncRequestImpl),
_needPerfCounters,
inferRequestWithoutBatch,
_callbackExecutor);
}
std::shared_ptr<ngraph::Function> AutoBatchExecutableNetwork::GetExecGraphInfo() {
return _network->GetExecGraphInfo() ? _network->GetExecGraphInfo() : _networkWithoutBatch->GetExecGraphInfo();
}
void AutoBatchExecutableNetwork::SetConfig(const std::map<std::string, InferenceEngine::Parameter>& config) {
auto timeout = config.find(CONFIG_KEY(AUTO_BATCH_TIMEOUT));
if (timeout == config.end() || config.size() > 1) {
IE_THROW() << "The only config that can be changed on the fly for the AutoBatching the is the "
<< CONFIG_KEY(AUTO_BATCH_TIMEOUT);
} else {
_timeOut = ParseTimeoutValue(timeout->second.as<std::string>());
}
}
InferenceEngine::Parameter AutoBatchExecutableNetwork::GetConfig(const std::string& name) const {
auto it = _config.find(name);
if (it != _config.end()) {
return it->second;
} else {
// find config key among networks config keys
auto param = _network->GetMetric(METRIC_KEY(SUPPORTED_CONFIG_KEYS));
for (auto&& configKey : param.as<std::vector<std::string>>()) {
if (configKey == name) {
return _network->GetConfig(configKey);
}
}
IE_THROW(NotFound) << name << " not found in the ExecutableNetwork config";
}
}
InferenceEngine::Parameter AutoBatchExecutableNetwork::GetMetric(const std::string& name) const {
if (name == METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS)) {
auto reqs = 0;
try {
auto hint = _network->GetConfig(CONFIG_KEY(PERFORMANCE_HINT_NUM_REQUESTS)).as<std::string>();
reqs = InferenceEngine::PerfHintsConfig::CheckPerformanceHintRequestValue(hint);
if (!reqs) // no limitations from user, let's deduce the full blown #requests
// (multiplied by the devices capabilities to run multiple <batched> requests for further perf)
reqs = _device.batchForDevice *
_network->GetMetric(METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS)).as<unsigned int>();
} catch (const InferenceEngine::Exception& iie) {
}
reqs = std::max(reqs, _device.batchForDevice); // round up to the possible user's value
IE_SET_METRIC_RETURN(OPTIMAL_NUMBER_OF_INFER_REQUESTS, reqs);
} else if (name == METRIC_KEY(NETWORK_NAME)) {
IE_SET_METRIC_RETURN(NETWORK_NAME, _network->GetMetric(METRIC_KEY(NETWORK_NAME)).as<std::string>());
} else if (name == METRIC_KEY(SUPPORTED_METRICS)) {
IE_SET_METRIC_RETURN(SUPPORTED_METRICS,
{METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS),
METRIC_KEY(SUPPORTED_METRICS),
METRIC_KEY(NETWORK_NAME),
METRIC_KEY(SUPPORTED_CONFIG_KEYS)});
} else if (name == METRIC_KEY(SUPPORTED_CONFIG_KEYS)) {
IE_SET_METRIC_RETURN(SUPPORTED_CONFIG_KEYS,
{CONFIG_KEY(AUTO_BATCH_TIMEOUT)}); // only timeout can be changed on the fly
} else {
IE_THROW() << "Unsupported Network metric: " << name;
}
}
// ------------------------------AutoBatchInferencePlugin----------------------------
namespace {
std::map<std::string, std::string> mergeConfigs(std::map<std::string, std::string> config,
const std::map<std::string, std::string>& local) {
for (auto&& kvp : local) {
config[kvp.first] = kvp.second;
}
return config;
}
} // namespace
std::map<std::string, std::string> AutoBatchInferencePlugin::GetSupportedConfig(
const std::map<std::string, std::string>& config,
const std::string& deviceName) const {
std::vector<std::string> supportedConfigKeys = GetCore()->GetMetric(deviceName, METRIC_KEY(SUPPORTED_CONFIG_KEYS));
std::map<std::string, std::string> supportedConfig;
for (auto&& key : supportedConfigKeys) {
auto itKey = config.find(key);
if (config.end() != itKey) {
supportedConfig[key] = itKey->second;
}
}
return supportedConfig;
}
DeviceInformation AutoBatchInferencePlugin::ParseBatchDevice(const std::string& deviceWithBatch) {
auto&& d = deviceWithBatch;
auto openingBracket = d.find_first_of('(');
auto closingBracket = d.find_first_of(')', openingBracket);
auto deviceName = d.substr(0, openingBracket);
int batch = 1;
if (closingBracket != std::string::npos && openingBracket < closingBracket) {
batch = std::stol(d.substr(openingBracket + 1, closingBracket - 1));
if (batch <= 0) {
IE_THROW() << "Batch value for '" << deviceName << "' must be > 0, while " << batch << "is passed";
}
}
return {deviceName, {{}}, batch};
}
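// Illustration: ParseBatchDevice("GPU(4)") yields {deviceName = "GPU", batch = 4};
// ParseBatchDevice("CPU") yields {deviceName = "CPU", batch = 1} (the default when no batch is given).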
DeviceInformation AutoBatchInferencePlugin::ParseMetaDevice(const std::string& devicesBatchCfg,
const std::map<std::string, std::string>& config) const {
auto getDeviceConfig = [&](const DeviceName& deviceWithID) {
DeviceIDParser deviceParser(deviceWithID);
std::string deviceName = deviceParser.getDeviceName();
std::map<std::string, std::string> tconfig = mergeConfigs(_config, config);
// set device ID if any
std::string deviceIDLocal = deviceParser.getDeviceID();
if (!deviceIDLocal.empty()) {
tconfig[PluginConfigParams::KEY_DEVICE_ID] = deviceIDLocal;
}
return GetSupportedConfig(tconfig, deviceName);
};
auto metaDevice = ParseBatchDevice(devicesBatchCfg);
metaDevice.config = getDeviceConfig(metaDevice.deviceName);
auto cfg = config;
// check that no irrelevant config-keys left
for (auto k : config) {
const auto& name = k.first;
auto found_in_supported_cfg = std::find(supported_configKeys.begin(), supported_configKeys.end(), k.first);
auto found_in_device_cfg = metaDevice.config.find(k.first);
if (found_in_device_cfg == metaDevice.config.end() && found_in_supported_cfg == supported_configKeys.end()) {
IE_THROW() << "Unsupported config key: " << name;
}
}
return metaDevice;
}
RemoteContext::Ptr AutoBatchInferencePlugin::CreateContext(const InferenceEngine::ParamMap& config) {
auto cfg = config;
auto it = cfg.find(CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG));
if (it == cfg.end())
IE_THROW() << "Value for KEY_AUTO_BATCH is not set";
auto val = it->second;
auto metaDevice = ParseMetaDevice(val, std::map<std::string, std::string>());
cfg.erase(it);
return GetCore()->CreateContext(metaDevice.deviceName, cfg);
}
Parameter AutoBatchInferencePlugin::GetConfig(const std::string& name,
const std::map<std::string, Parameter>& options) const {
if (supported_configKeys.end() != std::find(supported_configKeys.begin(), supported_configKeys.end(), name)) {
auto it = _config.find(name);
if (it == _config.end()) {
IE_THROW() << "Value for " << name << " is not set";
} else {
return {it->second};
}
} else {
IE_THROW() << "Unsupported config key: " << name;
}
}
void AutoBatchInferencePlugin::CheckConfig(const std::map<std::string, std::string>& config) {
for (auto&& kvp : config) {
const auto name = kvp.first;
const auto val = kvp.second;
if (supported_configKeys.end() == std::find(supported_configKeys.begin(), supported_configKeys.end(), name))
IE_THROW() << "Unsupported config key: " << name;
if (name == CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG)) {
ParseBatchDevice(val);
} else if (name == CONFIG_KEY(AUTO_BATCH_TIMEOUT)) {
try {
auto t = std::stoi(val);
if (t < 0)
IE_THROW(ParameterMismatch);
} catch (const std::exception& e) {
IE_THROW(ParameterMismatch)
<< " Expecting unsigned int value for " << CONFIG_KEY(AUTO_BATCH_TIMEOUT) << " got " << val;
}
}
}
}
void AutoBatchInferencePlugin::SetConfig(const std::map<std::string, std::string>& config) {
CheckConfig(config);
for (auto&& kvp : config) {
_config[kvp.first] = kvp.second;
}
}
static const Version version = {{2, 1}, CI_BUILD_NUMBER, "AutoBatchPlugin"};
IE_DEFINE_PLUGIN_CREATE_FUNCTION(AutoBatchInferencePlugin, version)
AutoBatchInferencePlugin::AutoBatchInferencePlugin() {
_pluginName = "BATCH";
}
InferenceEngine::Parameter AutoBatchInferencePlugin::GetMetric(
const std::string& name,
const std::map<std::string, InferenceEngine::Parameter>& options) const {
if (name == METRIC_KEY(SUPPORTED_METRICS)) {
std::vector<std::string> metrics;
metrics.push_back(METRIC_KEY(SUPPORTED_METRICS));
metrics.push_back(METRIC_KEY(FULL_DEVICE_NAME));
metrics.push_back(METRIC_KEY(SUPPORTED_CONFIG_KEYS));
IE_SET_METRIC_RETURN(SUPPORTED_METRICS, metrics);
} else if (name == METRIC_KEY(FULL_DEVICE_NAME)) {
IE_SET_METRIC_RETURN(FULL_DEVICE_NAME, _pluginName);
} else if (name == METRIC_KEY(SUPPORTED_CONFIG_KEYS)) {
IE_SET_METRIC_RETURN(SUPPORTED_CONFIG_KEYS, supported_configKeys);
} else {
IE_THROW(NotFound) << "Unsupported metric key " << name;
}
}
IExecutableNetworkInternal::Ptr AutoBatchInferencePlugin::LoadExeNetworkImpl(
const InferenceEngine::CNNNetwork& network,
const std::map<std::string, std::string>& config) {
return LoadNetworkImpl(network, nullptr, config);
}
InferenceEngine::IExecutableNetworkInternal::Ptr AutoBatchInferencePlugin::LoadNetworkImpl(
const InferenceEngine::CNNNetwork& network,
const std::shared_ptr<InferenceEngine::RemoteContext> ctx,
const std::map<std::string, std::string>& config) {
if (GetCore() == nullptr) {
IE_THROW() << "Please, work with MULTI device via InferencEngine::Core object";
}
auto fullConfig = mergeConfigs(_config, config);
auto device_batch = fullConfig.find(CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG));
if (device_batch == fullConfig.end()) {
IE_THROW() << "KEY_AUTO_BATCH key is not set for BATCH device";
}
auto metaDevice = ParseMetaDevice(device_batch->second, fullConfig);
const auto& deviceName = metaDevice.deviceName;
const auto& deviceConfig = metaDevice.config;
const auto perfConfig = fullConfig.find(PluginConfigParams::KEY_PERF_COUNT);
const bool enablePerfCounters = (fullConfig.end() != perfConfig) && (perfConfig->second == PluginConfigParams::YES);
auto report_footprint = [](std::shared_ptr<ICore> pCore, std::string device) -> size_t {
size_t footprint = 0;
// TODO: use the per-network metric (22.2) rather than plugin-level
auto stats = pCore->GetMetric(device, GPU_METRIC_KEY(MEMORY_STATISTICS)).as<std::map<std::string, uint64_t>>();
for (auto s : stats)
if (s.first.find("_current") != std::string::npos)
footprint += s.second;
return footprint;
};
size_t batch1_footprint = 0;
if (deviceName.find("GPU") != std::string::npos)
batch1_footprint = report_footprint(GetCore(), deviceName);
auto executableNetworkWithoutBatch = ctx ? GetCore()->LoadNetwork(network, ctx, deviceConfig)
: GetCore()->LoadNetwork(network, deviceName, deviceConfig);
if (deviceName.find("GPU") != std::string::npos) {
batch1_footprint = report_footprint(GetCore(), deviceName) - batch1_footprint;
if (batch1_footprint) {
const uint64_t total_mem = GetCore()->GetMetric(deviceName, GPU_METRIC_KEY(DEVICE_TOTAL_MEM_SIZE));
const int estimated_batch = (total_mem - batch1_footprint) / batch1_footprint;
int closest = pow(2, floor(log(estimated_batch) / log(2)));
closest = std::max(1, closest);
metaDevice.batchForDevice = std::min(metaDevice.batchForDevice, closest);
}
}
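// Illustrative arithmetic for the clamp above (numbers are made up): with total_mem of 8192 MB and a
// measured batch1_footprint of 600 MB, estimated_batch = (8192 - 600) / 600 = 12; the closest power
// of two not above it is 8, so a requested batch of 32 would be reduced to 8 for this device.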
// auto-batch settings
std::unordered_map<std::string, InferenceEngine::Parameter> networkConfig;
for (auto c : fullConfig) {
if (supported_configKeys.end() != std::find(supported_configKeys.begin(), supported_configKeys.end(), c.first))
networkConfig.insert(c);
}
InferenceEngine::SoExecutableNetworkInternal executableNetworkWithBatch;
if (metaDevice.batchForDevice > 1) {
try {
CNNNetwork clonedNetwork(InferenceEngine::details::cloneNetwork(network));
const InputsDataMap inputInfo = clonedNetwork.getInputsInfo();
ICNNNetwork::InputShapes shapes = clonedNetwork.getInputShapes();
for (const InputsDataMap::value_type& item : inputInfo) {
auto layout = item.second->getTensorDesc().getLayout();
// the code below is a placeholder for the WIP (22.1) functionality
// that will check that reshaping by the batch dimension is robust (CVS-51744)
if (layout == InferenceEngine::Layout::NC || layout == InferenceEngine::Layout::NCDHW ||
layout == InferenceEngine::Layout::NCHW || layout == InferenceEngine::Layout::NHWC ||
layout == InferenceEngine::Layout::NDHWC) {
assert(1 == shapes[item.first][0]); // do not reshape/re-batch originally batched networks
shapes[item.first][0] = metaDevice.batchForDevice;
}
}
clonedNetwork.reshape(shapes);
executableNetworkWithBatch =
ctx ? GetCore()->LoadNetwork(CNNNetwork{clonedNetwork}, ctx, deviceConfig)
: GetCore()->LoadNetwork(CNNNetwork{clonedNetwork}, deviceName, deviceConfig);
} catch (...) {
executableNetworkWithBatch = {nullptr, nullptr};
}
}
if (!executableNetworkWithBatch) {
executableNetworkWithBatch = executableNetworkWithoutBatch;
metaDevice.batchForDevice = 1;
}
return std::make_shared<AutoBatchExecutableNetwork>(executableNetworkWithBatch,
executableNetworkWithoutBatch,
metaDevice,
networkConfig,
enablePerfCounters);
}
InferenceEngine::IExecutableNetworkInternal::Ptr AutoBatchInferencePlugin::LoadExeNetworkImpl(
const InferenceEngine::CNNNetwork& network,
const std::shared_ptr<InferenceEngine::RemoteContext>& context,
const std::map<std::string, std::string>& config) {
return LoadNetworkImpl(network, context, config);
}
InferenceEngine::QueryNetworkResult AutoBatchInferencePlugin::QueryNetwork(
const InferenceEngine::CNNNetwork& network,
const std::map<std::string, std::string>& config) const {
auto cfg = config;
for (auto c : cfg) {
if (c.first == CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG)) {
auto val = c.second;
cfg.erase(c.first);
auto metaDevice = ParseMetaDevice(val, cfg);
return GetCore()->QueryNetwork(network, metaDevice.deviceName, cfg);
}
}
IE_THROW() << "Value for KEY_AUTO_BATCH is not set";
}
} // namespace AutoBatchPlugin
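Below is a minimal, hypothetical usage sketch of the plugin defined above (the model path, batch value and timeout are placeholders, and the sketch assumes CONFIG_KEY(AUTO_BATCH_TIMEOUT) expands to the literal "AUTO_BATCH_TIMEOUT"): the "BATCH:GPU(4)" device string requests auto-batching over GPU with an explicit batch of 4, while the timeout bounds how long a worker waits to collect requests into a batch.
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core ie;
    // hypothetical model path
    auto network = ie.ReadNetwork("model.xml");
    // BATCH:GPU(4): auto-batching over GPU with an explicit batch of 4
    auto execNetwork = ie.LoadNetwork(network, "BATCH:GPU(4)", {{"AUTO_BATCH_TIMEOUT", "100"}});
    auto request = execNetwork.CreateInferRequest();
    request.StartAsync();
    request.Wait(InferenceEngine::InferRequest::RESULT_READY);
    return 0;
}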

View File

@ -0,0 +1,159 @@
// Copyright (C) 2018-2021 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
///////////////////////////////////////////////////////////////////////////////////////////////////
#pragma once
#include <atomic>
#include <map>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
#include "cpp_interfaces/impl/ie_executable_network_thread_safe_default.hpp"
#include "cpp_interfaces/impl/ie_infer_async_request_thread_safe_default.hpp"
#include "cpp_interfaces/interface/ie_iplugin_internal.hpp"
#include "ie_metric_helpers.hpp"
#include "threading/ie_thread_safe_containers.hpp"
namespace AutoBatchPlugin {
using DeviceName = std::string;
struct DeviceInformation {
DeviceName deviceName;
std::map<std::string, std::string> config;
int batchForDevice;
};
class AutoBatchAsyncInferRequest;
class AutoBatchExecutableNetwork : public InferenceEngine::ExecutableNetworkThreadSafeDefault {
public:
using Ptr = std::shared_ptr<AutoBatchExecutableNetwork>;
struct WorkerInferRequest {
using Ptr = std::shared_ptr<WorkerInferRequest>;
InferenceEngine::SoIInferRequestInternal _inferRequestBatched;
int _batchSize;
InferenceEngine::ThreadSafeQueueWithSize<std::pair<AutoBatchAsyncInferRequest*, InferenceEngine::Task>> _tasks;
std::vector<InferenceEngine::Task> _completionTasks;
std::thread _thread;
std::condition_variable _cond;
std::mutex _mutex;
std::exception_ptr _exceptionPtr;
};
explicit AutoBatchExecutableNetwork(
const InferenceEngine::SoExecutableNetworkInternal& networkForDevice,
const InferenceEngine::SoExecutableNetworkInternal& networkForDeviceWithoutBatch,
const DeviceInformation& networkDevices,
const std::unordered_map<std::string, InferenceEngine::Parameter>& config,
const bool needPerfCounters = false);
void SetConfig(const std::map<std::string, InferenceEngine::Parameter>& config) override;
InferenceEngine::Parameter GetConfig(const std::string& name) const override;
InferenceEngine::Parameter GetMetric(const std::string& name) const override;
InferenceEngine::IInferRequestInternal::Ptr CreateInferRequest() override;
InferenceEngine::IInferRequestInternal::Ptr CreateInferRequestImpl(
InferenceEngine::InputsDataMap networkInputs,
InferenceEngine::OutputsDataMap networkOutputs) override;
std::shared_ptr<InferenceEngine::RemoteContext> GetContext() const override;
std::shared_ptr<ngraph::Function> GetExecGraphInfo() override;
virtual ~AutoBatchExecutableNetwork();
protected:
static unsigned int ParseTimeoutValue(const std::string&);
std::atomic_bool _terminate = {false};
DeviceInformation _device;
InferenceEngine::SoExecutableNetworkInternal _network;
InferenceEngine::SoExecutableNetworkInternal _networkWithoutBatch;
std::vector<WorkerInferRequest::Ptr> _workerRequests;
std::unordered_map<std::string, InferenceEngine::Parameter> _config;
bool _needPerfCounters = false;
std::atomic_size_t _numRequestsCreated = {0};
std::atomic_int _timeOut = {1000}; // in ms
};
class AutoBatchInferRequest : public InferenceEngine::IInferRequestInternal {
public:
using Ptr = std::shared_ptr<AutoBatchInferRequest>;
explicit AutoBatchInferRequest(const InferenceEngine::InputsDataMap& networkInputs,
const InferenceEngine::OutputsDataMap& networkOutputs,
AutoBatchExecutableNetwork::WorkerInferRequest& workerRequestPtr,
int batch_id,
int num_batch,
bool _needPerfCounters = false);
std::map<std::string, InferenceEngine::InferenceEngineProfileInfo> GetPerformanceCounts() const override;
// Batch-Device impl specific: sets this request's blobs (data) to another device request (e.g. the batched one)
void SetBlobsToAnotherRequest(InferenceEngine::SoIInferRequestInternal& req);
void CopyInputsIfNeeded();
void CopyOutputsIfNeeded();
AutoBatchExecutableNetwork::WorkerInferRequest& _myBatchedRequestWrapper;
std::exception_ptr _exceptionPtr;
protected:
std::map<std::string, InferenceEngine::InferenceEngineProfileInfo> _perfMap;
bool _needPerfCounters = false;
void CopyBlobIfNeeded(InferenceEngine::Blob::CPtr src, InferenceEngine::Blob::Ptr dst, bool bInput);
size_t _batchId;
size_t _batchSize;
};
class AutoBatchAsyncInferRequest : public InferenceEngine::AsyncInferRequestThreadSafeDefault {
public:
using Ptr = std::shared_ptr<AutoBatchAsyncInferRequest>;
explicit AutoBatchAsyncInferRequest(const AutoBatchInferRequest::Ptr& inferRequest,
const bool needPerfCounters,
InferenceEngine::SoIInferRequestInternal& inferRequestWithoutBatch,
const InferenceEngine::ITaskExecutor::Ptr& callbackExecutor);
void Infer_ThreadUnsafe() override;
virtual ~AutoBatchAsyncInferRequest();
InferenceEngine::SoIInferRequestInternal _inferRequestWithoutBatch;
AutoBatchInferRequest::Ptr _inferRequest;
};
class AutoBatchInferencePlugin : public InferenceEngine::IInferencePlugin {
public:
AutoBatchInferencePlugin();
virtual ~AutoBatchInferencePlugin() = default;
InferenceEngine::IExecutableNetworkInternal::Ptr LoadExeNetworkImpl(
const InferenceEngine::CNNNetwork& network,
const std::map<std::string, std::string>& config) override;
InferenceEngine::IExecutableNetworkInternal::Ptr LoadExeNetworkImpl(
const InferenceEngine::CNNNetwork& network,
const std::shared_ptr<InferenceEngine::RemoteContext>& context,
const std::map<std::string, std::string>& config) override;
void SetConfig(const std::map<std::string, std::string>& config) override;
void CheckConfig(const std::map<std::string, std::string>& config);
InferenceEngine::Parameter GetConfig(
const std::string& name,
const std::map<std::string, InferenceEngine::Parameter>& options) const override;
InferenceEngine::QueryNetworkResult QueryNetwork(const InferenceEngine::CNNNetwork& network,
const std::map<std::string, std::string>& config) const override;
InferenceEngine::Parameter GetMetric(
const std::string& name,
const std::map<std::string, InferenceEngine::Parameter>& options) const override;
InferenceEngine::RemoteContext::Ptr CreateContext(const InferenceEngine::ParamMap&) override;
protected:
DeviceInformation ParseMetaDevice(const std::string& devicesBatchCfg,
const std::map<std::string, std::string>& config) const;
std::map<std::string, std::string> GetSupportedConfig(const std::map<std::string, std::string>& config,
const DeviceName& deviceName) const;
static DeviceInformation ParseBatchDevice(const std::string& deviceWithBatch);
InferenceEngine::IExecutableNetworkInternal::Ptr LoadNetworkImpl(
const InferenceEngine::CNNNetwork& network,
const std::shared_ptr<InferenceEngine::RemoteContext> context,
const std::map<std::string, std::string>& config);
};
} // namespace AutoBatchPlugin
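A much-simplified, self-contained sketch of the queue/timeout pattern that the WorkerInferRequest above declares (illustrative only; the real implementation uses ThreadSafeQueueWithSize, SoIInferRequestInternal and a fallback to the batch-1 network): requests enqueue their input-copy tasks, and the worker fires one batched inference either when enough requests have arrived or when the timeout expires.
#include <algorithm>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Toy worker: not the plugin's actual code, just the batching idea.
struct ToyBatchWorker {
    size_t batch_size = 4;
    std::chrono::milliseconds timeout{100};
    std::queue<std::function<void()>> tasks;  // per-request "copy my inputs into slot i" jobs
    std::mutex mutex;
    std::condition_variable cond;

    void RunOnce(const std::function<void(size_t)>& infer_batched) {
        std::unique_lock<std::mutex> lock(mutex);
        // wake up when the batch is full or the timeout expires
        cond.wait_for(lock, timeout, [&] { return tasks.size() >= batch_size; });
        const size_t n = std::min(tasks.size(), batch_size);
        for (size_t i = 0; i < n; ++i) {
            tasks.front()();  // copy this request's inputs into the batched request
            tasks.pop();
        }
        lock.unlock();
        if (n)
            infer_batched(n);  // one batched inference covering n user requests
    }
};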

View File

@ -609,11 +609,9 @@ Engine::LoadExeNetworkImpl(const InferenceEngine::CNNNetwork &network, const std
// the more "capable" the CPU in general, the more streams we may want to keep to keep it utilized
const float memThresholdAssumeLimitedForISA = ov::MemBandwidthPressure::LIMITED/isaSpecificThreshold;
const float L2_cache_size = mkldnn::utils::get_cache_size(2 /*level*/, true /*per core */);
const float L3_cache_size = mkldnn::utils::get_cache_size(3, false);
ov::MemBandwidthPressure networkToleranceForLowCache = ov::MemBandwidthPressureTolerance(
clonedNetwork.getFunction(),
L2_cache_size, L3_cache_size,
memThresholdAssumeLimitedForISA);
L2_cache_size, memThresholdAssumeLimitedForISA);
// num of phys CPU cores (most aggressive value for #streams)
const auto num_cores = getNumberOfCPUCores();
// less aggressive

View File

@ -28,6 +28,7 @@
#include "intel_gpu/runtime/device_query.hpp"
#include "intel_gpu/runtime/debug_configuration.hpp"
#include <performance_heuristics.hpp>
#ifdef __linux__
# include <dlfcn.h>
#endif
@ -681,6 +682,7 @@ Parameter Plugin::GetMetric(const std::string& name, const std::map<std::string,
metrics.push_back(METRIC_KEY(RANGE_FOR_STREAMS));
metrics.push_back(METRIC_KEY(DEVICE_TYPE));
metrics.push_back(METRIC_KEY(DEVICE_GOPS));
metrics.push_back(METRIC_KEY(OPTIMAL_BATCH_SIZE));
metrics.push_back(GPU_METRIC_KEY(MAX_BATCH_SIZE));
metrics.push_back(GPU_METRIC_KEY(DEVICE_TOTAL_MEM_SIZE));
metrics.push_back(GPU_METRIC_KEY(UARCH_VERSION));
@ -716,6 +718,76 @@ Parameter Plugin::GetMetric(const std::string& name, const std::map<std::string,
<< static_cast<int>(device_info.gfx_ver.revision);
}
IE_SET_METRIC_RETURN(GPU_UARCH_VERSION, s.str());
} else if (name == METRIC_KEY(OPTIMAL_BATCH_SIZE)) {
auto next_pow_of_2 = [] (float x) {
return pow(2, ceil(log(x)/log(2)));
};
auto closest_pow_of_2 = [] (float x) {
return pow(2, floor(log(x)/log(2)));
};
auto model_param = options.find("MODEL_PTR");
if (model_param == options.end()) {
GPU_DEBUG_IF(debug_config->verbose >= 1) {
GPU_DEBUG_COUT << "[GPU_OPTIMAL_BATCH_SIZE] MODELS_PTR is not set: return 1" << std::endl;
}
IE_SET_METRIC_RETURN(OPTIMAL_BATCH_SIZE, static_cast<unsigned int>(1));
}
std::shared_ptr<ngraph::Function> model;
try {
model = model_param->second.as<std::shared_ptr<ngraph::Function>>();
} catch (...) {
IE_THROW() << "[GPU_OPTIMAL_BATCH_SIZE] MODEL_PTR should be std::shared_ptr<ngraph::Function> type";
}
GPU_DEBUG_IF(debug_config->verbose >= 1) {
GPU_DEBUG_COUT << "DEVICE_INFO:"
<< "gfx_version.major, " << device_info.gfx_ver.major
<< "gfx_version.minor " << std::to_string(device_info.gfx_ver.minor) << std::endl;
}
static std::map<cldnn::gfx_version, size_t> gen_kbytes_per_bank = {
{{12, 0, 0}, 480}, // TGL
{{12, 1, 0}, 2048}, // DG1
{{12, 5, 0}, 320},
{{12, 7, 0}, 512},
};
size_t L3_cache_size = device_info.gfx_ver.major && (device_info.gfx_ver.major <= 9)
? 768 * 1024 // Gen9
: 2 * 768 * 1024; // reasonable default when no architecture has been detected (e.g. due to an old driver version)
cldnn::gfx_version gen = {device_info.gfx_ver.major, device_info.gfx_ver.minor, 0 /*ignore the revision*/};
auto val = gen_kbytes_per_bank.find(gen);
if (gen_kbytes_per_bank.end() != val) {
auto kbytes_per_bank = val->second;
auto num_banks_per_slice = device_info.num_sub_slices_per_slice > 4
? next_pow_of_2(device_info.num_sub_slices_per_slice)
: 2 * device_info.num_sub_slices_per_slice;
L3_cache_size = kbytes_per_bank * 1024 * num_banks_per_slice * device_info.num_slices;
GPU_DEBUG_IF(debug_config->verbose >= 1) {
GPU_DEBUG_COUT << "DEVICE_INFO:"
<< "num_slices " << device_info.num_slices
<< ", num_sub_slices_per_slice " << device_info.num_sub_slices_per_slice
<< ", num_banks_per_slice " << num_banks_per_slice
<< ", gen_kbytes_per_bank : " << kbytes_per_bank
<< ", L3_cache_size is (MB): " << float(L3_cache_size) / 1024 / 1024 << std::endl;
}
}
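// Illustrative walk-through of the table above (device numbers are hypothetical): for gen {12, 1, 0}
// the table gives 2048 KB per bank; a device reporting 6 sub-slices per slice and 1 slice gets
// num_banks_per_slice = next_pow_of_2(6) = 8, i.e. L3_cache_size = 2048 * 1024 * 8 * 1 = 16 MB.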
Config config = _impl->m_configs.GetConfig(device_id);
auto networkCloned = CloneAndTransformNetwork(CNNNetwork(model), config);
ov::MemBandwidthPressure memPressure = ov::MemBandwidthPressureTolerance(networkCloned.getFunction(), L3_cache_size);
unsigned int batch = 1;
if (memPressure.max_mem_tolerance != ov::MemBandwidthPressure::UNKNOWN)
batch = std::max(1.0, 16 * closest_pow_of_2(memPressure.max_mem_tolerance));
std::map<std::string, InferenceEngine::Parameter> options_for_max_batch;
options_for_max_batch["MODEL_PTR"] = model;
options_for_max_batch["GPU_THROUGHPUT_STREAMS"] = CONFIG_VALUE(GPU_THROUGHPUT_AUTO);
auto max_batch_size = GetMetric(GPU_METRIC_KEY(MAX_BATCH_SIZE), options_for_max_batch).as<unsigned int>();
unsigned int closest = closest_pow_of_2(max_batch_size);
batch = std::min(closest, batch);
batch = std::min(256u, batch); // batch 256 is the max
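// Illustrative numbers (not measured): with max_mem_tolerance = 0.27, 16 * closest_pow_of_2(0.27) = 16 * 0.25 = 4;
// if MAX_BATCH_SIZE then reports 103, closest = closest_pow_of_2(103) = 64, so the returned
// OPTIMAL_BATCH_SIZE is min(4, 64, 256) = 4.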
GPU_DEBUG_IF(debug_config->verbose >= 1) {
GPU_DEBUG_COUT << memPressure.max_mem_tolerance << std::endl;
GPU_DEBUG_COUT << "MAX_BATCH: " << max_batch_size << std::endl;
GPU_DEBUG_COUT << "ACTUAL OPTIMAL BATCH: " << batch << std::endl;
}
IE_SET_METRIC_RETURN(OPTIMAL_BATCH_SIZE, batch);
} else if (name == METRIC_KEY(FULL_DEVICE_NAME)) {
auto deviceName = StringRightTrim(device_info.dev_name, "NEO", false);
deviceName += std::string(" (") + (device_info.dev_type == cldnn::device_type::discrete_gpu ? "dGPU" : "iGPU") + ")";

View File

@ -48,6 +48,10 @@ if(ENABLE_AUTO OR ENABLE_MULTI)
list(APPEND DEPENDENCIES ov_auto_plugin)
endif()
if(ENABLE_AUTO_BATCH)
list(APPEND DEPENDENCIES ov_auto_batch_plugin)
endif()
if (NOT ENABLE_OV_ONNX_FRONTEND)
list(APPEND EXCLUDED_SOURCE_PATHS "${CMAKE_CURRENT_SOURCE_DIR}/onnx_reader")
endif()

View File

@ -24,6 +24,7 @@ inline const std::string getPluginLibNameByDevice(const std::string& deviceName)
{ "GNA", "ov_intel_gna_plugin" },
{ "GPU", "ov_intel_gpu_plugin" },
{ "HETERO", "ov_hetero_plugin" },
{ "BATCH", "ov_auto_batch_plugin" },
{ "MULTI", "ov_multi_plugin" },
{ "MYRIAD", "myriadPlugin" },
{ "TEMPLATE", "ov_template_plugin" },
@ -42,6 +43,11 @@ inline const std::pair<std::string, std::string> generateDefaultHeteroConfig() {
return { "TARGET_FALLBACK" , ConformanceTests::targetDevice };
}
inline const std::pair<std::string, std::string> generateDefaultBatchConfig() {
// auto-batching with batch 1 (no real batching, but the full machinery is exercised)
return { CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , std::string(ConformanceTests::targetDevice)};
}
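// e.g. for targetDevice == "GPU" this yields { CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG), "GPU" }, i.e. BATCH over GPU
// with the default batch of 1, so the conformance suites exercise the auto-batching code path without
// changing the effective batch.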
inline const std::vector<std::map<std::string, std::string>> generateConfigs(const std::string& targetDevice,
const std::vector<std::map<std::string, std::string>>& config = {}) {
std::pair<std::string, std::string> defaultConfig;
@ -49,6 +55,8 @@ inline const std::vector<std::map<std::string, std::string>> generateConfigs(con
defaultConfig = generateDefaultMultiConfig();
} else if (targetDevice == std::string(CommonTestUtils::DEVICE_HETERO)) {
defaultConfig = generateDefaultHeteroConfig();
} else if (targetDevice == std::string(CommonTestUtils::DEVICE_BATCH)) {
defaultConfig = generateDefaultBatchConfig();
} else {
throw std::runtime_error("Incorrect target device: " + targetDevice);
}
@ -70,7 +78,8 @@ inline const std::string generateComplexDeviceName(const std::string& deviceName
inline const std::vector<std::string> returnAllPossibleDeviceCombination() {
std::vector<std::string> res{ConformanceTests::targetDevice};
std::vector<std::string> devices{CommonTestUtils::DEVICE_HETERO, CommonTestUtils::DEVICE_AUTO, CommonTestUtils::DEVICE_MULTI};
std::vector<std::string> devices{CommonTestUtils::DEVICE_HETERO, CommonTestUtils::DEVICE_AUTO,
CommonTestUtils::DEVICE_BATCH, CommonTestUtils::DEVICE_MULTI};
for (const auto& device : devices) {
res.emplace_back(generateComplexDeviceName(device));
}

View File

@ -33,4 +33,10 @@ INSTANTIATE_TEST_SUITE_P(smoke_Hetero_BehaviorTests, InferRequestCallbackTests,
::testing::Values(CommonTestUtils::DEVICE_HETERO),
::testing::ValuesIn(generateConfigs(CommonTestUtils::DEVICE_HETERO))),
InferRequestCallbackTests::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_Batch_BehaviorTests, InferRequestCallbackTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(generateConfigs(CommonTestUtils::DEVICE_BATCH))),
InferRequestCallbackTests::getTestCaseName);
} // namespace

View File

@ -36,4 +36,10 @@ INSTANTIATE_TEST_SUITE_P(smoke_Hetero_BehaviorTests, InferRequestIOBBlobTest,
::testing::Values(CommonTestUtils::DEVICE_HETERO),
::testing::ValuesIn(generateConfigs(CommonTestUtils::DEVICE_HETERO))),
InferRequestIOBBlobTest::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_Batch_BehaviorTests, InferRequestIOBBlobTest,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(generateConfigs(CommonTestUtils::DEVICE_BATCH))),
InferRequestIOBBlobTest::getTestCaseName);
} // namespace

View File

@ -38,4 +38,10 @@ INSTANTIATE_TEST_SUITE_P(smoke_Hetero_BehaviorTests, InferRequestMultithreadingT
::testing::ValuesIn(generateConfigs(CommonTestUtils::DEVICE_HETERO))),
InferRequestMultithreadingTests::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_Batch_BehaviorTests, InferRequestMultithreadingTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(generateConfigs(CommonTestUtils::DEVICE_BATCH))),
InferRequestMultithreadingTests::getTestCaseName);
} // namespace

View File

@ -46,4 +46,10 @@ INSTANTIATE_TEST_SUITE_P(smoke_Behavior_Hetero, InferRequestSetBlobByType,
::testing::Values(CommonTestUtils::DEVICE_HETERO),
::testing::ValuesIn(generateConfigs(CommonTestUtils::DEVICE_HETERO))),
InferRequestSetBlobByType::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_Behavior_Batch, InferRequestSetBlobByType,
::testing::Combine(::testing::ValuesIn(setBlobTypes),
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(generateConfigs(CommonTestUtils::DEVICE_BATCH))),
InferRequestSetBlobByType::getTestCaseName);
} // namespace

View File

@ -37,4 +37,9 @@ INSTANTIATE_TEST_SUITE_P(smoke_Hetero_BehaviorTests, InferRequestWaitTests,
::testing::ValuesIn(generateConfigs(CommonTestUtils::DEVICE_HETERO))),
InferRequestWaitTests::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_Batch_BehaviorTests, InferRequestWaitTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(generateConfigs(CommonTestUtils::DEVICE_BATCH))),
InferRequestWaitTests::getTestCaseName);
} // namespace

View File

@ -0,0 +1,31 @@
// Copyright (C) 2018-2021 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
#include <auto_batching/auto_batching_tests.hpp>
const std::vector<bool> get_vs_set{ true, false };
const std::vector<size_t> num_streams{ 1, 2 };
const std::vector<size_t> num_requests{ 1, 3, 8, 9, 16, 64 };
const std::vector<size_t> num_batch{ 1, 4, 8, 16, 32, 64, 128, 256 };
using namespace AutoBatchingTests;
namespace {
INSTANTIATE_TEST_SUITE_P(smoke_AutoBatching_CPU, AutoBatching_Test,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_CPU),
::testing::ValuesIn(get_vs_set),
::testing::ValuesIn(num_streams),
::testing::ValuesIn(num_requests),
::testing::ValuesIn(num_batch)),
AutoBatching_Test::getTestCaseName);
// TODO: for 22.2 (CVS-68949)
//INSTANTIATE_TEST_SUITE_P(smoke_AutoBatching_CPU, AutoBatching_Test_DetectionOutput,
// ::testing::Combine(
// ::testing::Values(CommonTestUtils::DEVICE_CPU),
// ::testing::ValuesIn(get_vs_set),
// ::testing::ValuesIn(num_streams),
// ::testing::ValuesIn(num_requests),
// ::testing::ValuesIn(num_batch)),
// AutoBatching_Test_DetectionOutput::getTestCaseName);
} // namespace

View File

@ -21,16 +21,27 @@ using namespace ::testing;
using namespace InferenceEngine;
using namespace InferenceEngine::gpu;
class RemoteBlob_Test : public CommonTestUtils::TestsCommon {
class RemoteBlob_Test : public CommonTestUtils::TestsCommon, public testing::WithParamInterface<bool> {
protected:
std::shared_ptr<ngraph::Function> fn_ptr;
std::string deviceName;
public:
void SetUp() override {
fn_ptr = ngraph::builder::subgraph::makeSplitMultiConvConcat();
deviceName = CommonTestUtils::DEVICE_GPU;
auto with_auto_batching = this->GetParam();
if (with_auto_batching) { // BATCH:GPU
deviceName = std::string(CommonTestUtils::DEVICE_BATCH) + ":" + deviceName;
}
}
static std::string getTestCaseName(const testing::TestParamInfo<bool>& obj) {
auto with_auto_batch = obj.param;
return std::string("RemoteBlob_Test") + (with_auto_batch ? "_WITH_AUTO_BATCHING": "");
}
};
TEST_F(RemoteBlob_Test, smoke_canInputUserBlob) {
TEST_P(RemoteBlob_Test, smoke_canInputUserBlob) {
#if defined(ANDROID)
GTEST_SKIP();
#endif
@ -41,7 +52,7 @@ TEST_F(RemoteBlob_Test, smoke_canInputUserBlob) {
// TODO: Issue: investigate issue with IECore
auto ie = InferenceEngine::Core();
auto exec_net = ie.LoadNetwork(net, CommonTestUtils::DEVICE_GPU);
auto exec_net = ie.LoadNetwork(net, deviceName);
// regular inference
auto inf_req_regular = exec_net.CreateInferRequest();
@ -70,6 +81,7 @@ TEST_F(RemoteBlob_Test, smoke_canInputUserBlob) {
Blob::Ptr shared_blob = make_shared_blob(net.getInputsInfo().begin()->second->getTensorDesc(), cldnn_context,
shared_buffer);
shared_blob->allocate();
inf_req_shared.SetBlob(net.getInputsInfo().begin()->first, shared_blob);
inf_req_shared.Infer();
@ -85,7 +97,7 @@ TEST_F(RemoteBlob_Test, smoke_canInputUserBlob) {
}
TEST_F(RemoteBlob_Test, smoke_canInputPluginRemoteBlob) {
TEST_P(RemoteBlob_Test, smoke_canInputPluginRemoteBlob) {
#if defined(ANDROID)
GTEST_SKIP();
#endif
@ -96,7 +108,7 @@ TEST_F(RemoteBlob_Test, smoke_canInputPluginRemoteBlob) {
// TODO: Issue: investigate issue with IECore
auto ie = InferenceEngine::Core();
auto exec_net = ie.LoadNetwork(net, CommonTestUtils::DEVICE_GPU);
auto exec_net = ie.LoadNetwork(net, deviceName);
// regular inference
auto inf_req_regular = exec_net.CreateInferRequest();
@ -139,7 +151,7 @@ TEST_F(RemoteBlob_Test, smoke_canInputPluginRemoteBlob) {
}
TEST_F(RemoteBlob_Test, smoke_canInferOnUserContext) {
TEST_P(RemoteBlob_Test, smoke_canInferOnUserContext) {
auto fn_ptr = ngraph::builder::subgraph::makeSplitMultiConvConcat();
CNNNetwork net(fn_ptr);
@ -149,7 +161,7 @@ TEST_F(RemoteBlob_Test, smoke_canInferOnUserContext) {
auto blob = FuncTestUtils::createAndFillBlob(net.getInputsInfo().begin()->second->getTensorDesc());
auto ie = PluginCache::get().ie();
auto exec_net_regular = ie->LoadNetwork(net, CommonTestUtils::DEVICE_GPU);
auto exec_net_regular = ie->LoadNetwork(net, deviceName);
// regular inference
auto inf_req_regular = exec_net_regular.CreateInferRequest();
@ -161,7 +173,7 @@ TEST_F(RemoteBlob_Test, smoke_canInferOnUserContext) {
// inference using remote blob
auto ocl_instance = std::make_shared<OpenCL>();
auto remote_context = make_shared_context(*ie, CommonTestUtils::DEVICE_GPU, ocl_instance->_context.get());
auto remote_context = make_shared_context(*ie, deviceName, ocl_instance->_context.get());
auto exec_net_shared = ie->LoadNetwork(net, remote_context);
auto inf_req_shared = exec_net_shared.CreateInferRequest();
inf_req_shared.SetBlob(net.getInputsInfo().begin()->first, fakeImageData);
@ -178,7 +190,7 @@ TEST_F(RemoteBlob_Test, smoke_canInferOnUserContext) {
}
}
TEST_F(RemoteBlob_Test, smoke_canInferOnUserQueue_out_of_order) {
TEST_P(RemoteBlob_Test, smoke_canInferOnUserQueue_out_of_order) {
#if defined _WIN32
GTEST_SKIP();
#endif
@ -191,7 +203,7 @@ TEST_F(RemoteBlob_Test, smoke_canInferOnUserQueue_out_of_order) {
auto blob = FuncTestUtils::createAndFillBlob(net.getInputsInfo().begin()->second->getTensorDesc());
auto ie = PluginCache::get().ie();
auto exec_net_regular = ie->LoadNetwork(net, CommonTestUtils::DEVICE_GPU);
auto exec_net_regular = ie->LoadNetwork(net, deviceName);
// regular inference
auto inf_req_regular = exec_net_regular.CreateInferRequest();
@ -214,7 +226,7 @@ TEST_F(RemoteBlob_Test, smoke_canInferOnUserQueue_out_of_order) {
// In this scenario we create shared OCL queue and run simple pre-process action and post-process action (buffer copies in both cases)
// without calling thread blocks
auto remote_context = make_shared_context(*ie, CommonTestUtils::DEVICE_GPU, ocl_instance->_queue.get());
auto remote_context = make_shared_context(*ie, deviceName, ocl_instance->_queue.get());
auto exec_net_shared = ie->LoadNetwork(net, remote_context);
auto inf_req_shared = exec_net_shared.CreateInferRequest();
@ -270,7 +282,7 @@ TEST_F(RemoteBlob_Test, smoke_canInferOnUserQueue_out_of_order) {
}
}
TEST_F(RemoteBlob_Test, smoke_canInferOnUserQueue_in_order) {
TEST_P(RemoteBlob_Test, smoke_canInferOnUserQueue_in_order) {
#if defined _WIN32
GTEST_SKIP();
#endif
@ -283,7 +295,7 @@ TEST_F(RemoteBlob_Test, smoke_canInferOnUserQueue_in_order) {
auto blob = FuncTestUtils::createAndFillBlob(net.getInputsInfo().begin()->second->getTensorDesc());
auto ie = PluginCache::get().ie();
auto exec_net_regular = ie->LoadNetwork(net, CommonTestUtils::DEVICE_GPU);
auto exec_net_regular = ie->LoadNetwork(net, deviceName);
// regular inference
auto inf_req_regular = exec_net_regular.CreateInferRequest();
@ -307,7 +319,7 @@ TEST_F(RemoteBlob_Test, smoke_canInferOnUserQueue_in_order) {
// In this scenario we create shared OCL queue and run simple pre-process action and post-process action (buffer copies in both cases)
// without calling thread blocks
auto remote_context = make_shared_context(*ie, CommonTestUtils::DEVICE_GPU, ocl_instance->_queue.get());
auto remote_context = make_shared_context(*ie, deviceName, ocl_instance->_queue.get());
auto exec_net_shared = ie->LoadNetwork(net, remote_context);
auto inf_req_shared = exec_net_shared.CreateInferRequest();
@ -358,6 +370,10 @@ TEST_F(RemoteBlob_Test, smoke_canInferOnUserQueue_in_order) {
}
}
std::vector<bool> with_auto_batching {true, false};
INSTANTIATE_TEST_SUITE_P(smoke_RemoteBlob, RemoteBlob_Test, ::testing::ValuesIn(with_auto_batching),
RemoteBlob_Test::getTestCaseName);
class BatchedBlob_Test : public CommonTestUtils::TestsCommon, public testing::WithParamInterface<size_t> {
void SetUp() override {
num_batch = this->GetParam();

View File

@ -30,6 +30,7 @@ protected:
}
};
std::vector<bool> ov_with_auto_batching {true, false};
enum class RemoteTensorSharingType {
USER_CL_TENSOR = 0,
PLUGIN_CL_TENSOR = 1,
@ -54,17 +55,34 @@ std::ostream& operator<<(std::ostream& stream, RemoteTensorSharingType sharing_t
return stream;
}
class OVRemoteTensorInputBlob_Test : public OVRemoteTensor_Test, public testing::WithParamInterface<RemoteTensorSharingType> {
using RemoteTensorSharingTestOptionsParams = std::tuple<RemoteTensorSharingType, bool /*auto-batching*/>;
class OVRemoteTensorInputBlob_Test : public OVRemoteTensor_Test,
public testing::WithParamInterface<RemoteTensorSharingTestOptionsParams> {
protected:
std::shared_ptr<ngraph::Function> fn_ptr;
std::string deviceName;
public:
void SetUp() override {
fn_ptr = ngraph::builder::subgraph::makeSplitMultiConvConcat();
deviceName = CommonTestUtils::DEVICE_GPU;
RemoteTensorSharingType sharing_type;
bool with_auto_batching;
std::tie(sharing_type, with_auto_batching) = this->GetParam();
if (with_auto_batching) // BATCH:GPU
deviceName = std::string(CommonTestUtils::DEVICE_BATCH) + ":" + deviceName;
}
static std::string getTestCaseName(testing::TestParamInfo<RemoteTensorSharingType> obj) {
RemoteTensorSharingType sharing_type = obj.param;
static std::string getTestCaseName(const testing::TestParamInfo<RemoteTensorSharingTestOptionsParams>& obj) {
RemoteTensorSharingType sharing_type;
bool with_auto_batching;
std::tie(sharing_type, with_auto_batching) = obj.param;
std::ostringstream result;
result << "OVRemoteTensorInputBlob_Test_";
result << sharing_type;
if (with_auto_batching)
result << "_WITH_AUTO_BATCHING";
return result.str();
}
};
@ -81,9 +99,17 @@ TEST_P(OVRemoteTensorInputBlob_Test, smoke_canInputRemoteTensor) {
p.input().preprocess().convert_element_type(ov::element::f32);
auto function = p.build();
auto exec_net = ie.compile_model(function, CommonTestUtils::DEVICE_GPU);
RemoteTensorSharingType sharing_type;
bool with_auto_batching;
std::tie(sharing_type, with_auto_batching) = GetParam();
RemoteTensorSharingType sharing_type = GetParam();
// auto-batching relies on availability of the lock() for the tensor (and the *USM_DEVICE is not lockable)
if (with_auto_batching
&& (RemoteTensorSharingType::USER_USM_DEVICE_TENSOR == sharing_type
|| RemoteTensorSharingType::PLUGIN_USM_DEVICE_TENSOR == sharing_type))
GTEST_SKIP();
auto exec_net = ie.compile_model(function, deviceName);
// regular inference
auto inf_req_regular = exec_net.create_infer_request();
@ -244,6 +270,7 @@ TEST_P(OVRemoteTensorInputBlob_Test, smoke_canInputRemoteTensor) {
INSTANTIATE_TEST_SUITE_P(
smoke_GPU,
OVRemoteTensorInputBlob_Test,
::testing::Combine(
::testing::ValuesIn(std::vector<RemoteTensorSharingType>{RemoteTensorSharingType::USER_CL_TENSOR,
RemoteTensorSharingType::PLUGIN_CL_TENSOR,
RemoteTensorSharingType::USER_USM_HOST_TENSOR,
@ -251,9 +278,29 @@ INSTANTIATE_TEST_SUITE_P(
RemoteTensorSharingType::PLUGIN_USM_HOST_TENSOR,
RemoteTensorSharingType::PLUGIN_USM_DEVICE_TENSOR,
RemoteTensorSharingType::PLUGIN_HOST_TENSOR}),
::testing::ValuesIn(ov_with_auto_batching)),
OVRemoteTensorInputBlob_Test::getTestCaseName);
TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserContext) {
class OVRemoteTensor_TestsWithContext : public OVRemoteTensor_Test, public testing::WithParamInterface<bool> {
protected:
std::shared_ptr<ngraph::Function> fn_ptr;
std::string deviceName;
public:
void SetUp() override {
fn_ptr = ngraph::builder::subgraph::makeSplitMultiConvConcat();
deviceName = CommonTestUtils::DEVICE_GPU;
auto with_auto_batching = this->GetParam();
if (with_auto_batching) { // BATCH:GPU
deviceName = std::string(CommonTestUtils::DEVICE_BATCH) + ":" + deviceName;
}
}
static std::string getTestCaseName(const testing::TestParamInfo<bool>& obj) {
auto with_auto_batch = obj.param;
return std::string("RemoteTensor_Test") + (with_auto_batch ? "_WITH_AUTO_BATCHING": "");
}
};
TEST_P(OVRemoteTensor_TestsWithContext, smoke_canInferOnUserContext) {
auto ie = ov::runtime::Core();
using namespace ov::preprocess;
@ -262,7 +309,7 @@ TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserContext) {
p.input().preprocess().convert_element_type(ov::element::f32);
auto function = p.build();
auto exec_net_regular = ie.compile_model(function, CommonTestUtils::DEVICE_GPU);
auto exec_net_regular = ie.compile_model(function, deviceName);
auto input = function->get_parameters().at(0);
auto output = function->get_results().at(0);
@ -296,7 +343,7 @@ TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserContext) {
}
}
TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserContextWithMultipleDevices) {
TEST_P(OVRemoteTensor_TestsWithContext, smoke_canInferOnUserContextWithMultipleDevices) {
auto ie = ov::runtime::Core();
using namespace ov::preprocess;
@ -305,7 +352,7 @@ TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserContextWithMultipleDevices) {
p.input().preprocess().convert_element_type(ov::element::f32);
auto function = p.build();
auto exec_net_regular = ie.compile_model(function, CommonTestUtils::DEVICE_GPU);
auto exec_net_regular = ie.compile_model(function, deviceName);
auto input = function->get_parameters().at(0);
auto output = function->get_results().at(0);
@ -344,7 +391,7 @@ TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserContextWithMultipleDevices) {
}
}
TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserQueue_out_of_order) {
TEST_P(OVRemoteTensor_TestsWithContext, smoke_canInferOnUserQueue_out_of_order) {
auto ie = ov::runtime::Core();
using namespace ov::preprocess;
@ -353,7 +400,7 @@ TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserQueue_out_of_order) {
p.input().preprocess().convert_element_type(ov::element::f32);
auto function = p.build();
auto exec_net_regular = ie.compile_model(function, CommonTestUtils::DEVICE_GPU);
auto exec_net_regular = ie.compile_model(function, deviceName);
auto input = function->get_parameters().at(0);
auto output = function->get_results().at(0);
@ -423,7 +470,7 @@ TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserQueue_out_of_order) {
}
}
TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserQueue_in_order) {
TEST_P(OVRemoteTensor_TestsWithContext, smoke_canInferOnUserQueue_in_order) {
auto ie = ov::runtime::Core();
using namespace ov::preprocess;
@ -432,7 +479,7 @@ TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserQueue_in_order) {
p.input().preprocess().convert_element_type(ov::element::f32);
auto function = p.build();
auto exec_net_regular = ie.compile_model(function, CommonTestUtils::DEVICE_GPU);
auto exec_net_regular = ie.compile_model(function, deviceName);
auto input = function->get_parameters().at(0);
auto output = function->get_results().at(0);
@ -498,6 +545,9 @@ TEST_F(OVRemoteTensor_Test, smoke_canInferOnUserQueue_in_order) {
}
}
INSTANTIATE_TEST_SUITE_P(smoke_RemoteTensor, OVRemoteTensor_TestsWithContext, ::testing::ValuesIn(ov_with_auto_batching),
OVRemoteTensor_TestsWithContext::getTestCaseName);
TEST_F(OVRemoteTensor_Test, NV12toBGR_image) {
#if defined(ANDROID)
GTEST_SKIP();

View File

@ -0,0 +1,31 @@
// Copyright (C) 2018-2021 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
#include <auto_batching/auto_batching_tests.hpp>
const std::vector<size_t> num_streams{ 2 };
const std::vector<bool> get_vs_set{ true, false };
const std::vector<size_t> num_requests{ 1, 8, 16, 64 };
const std::vector<size_t> num_batch{ 1, 8, 32, 256 };
using namespace AutoBatchingTests;
namespace AutoBatchingTests {
INSTANTIATE_TEST_SUITE_P(smoke_AutoBatching_GPU, AutoBatching_Test,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_GPU),
::testing::ValuesIn(get_vs_set),
::testing::ValuesIn(num_streams),
::testing::ValuesIn(num_requests),
::testing::ValuesIn(num_batch)),
AutoBatching_Test::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_AutoBatching_GPU, AutoBatching_Test_DetectionOutput,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_GPU),
::testing::ValuesIn(get_vs_set),
::testing::ValuesIn(num_streams),
::testing::ValuesIn(num_requests),
::testing::ValuesIn(num_batch)),
AutoBatching_Test_DetectionOutput::getTestCaseName);
} // namespace AutoBatchingTests

View File

@ -52,6 +52,10 @@ const std::vector<std::map<std::string, std::string>> autoConfig = {
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU}},
};
const std::vector<std::map<std::string, std::string>> autoBatchConfig = {
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU}},
};
INSTANTIATE_TEST_SUITE_P(smoke_BehaviorTests, ExecNetSetPrecision,
::testing::Combine(
::testing::ValuesIn(netPrecisions),
@ -72,4 +76,11 @@ INSTANTIATE_TEST_SUITE_P(smoke_Auto_BehaviorTests, ExecNetSetPrecision,
::testing::Values(CommonTestUtils::DEVICE_AUTO),
::testing::ValuesIn(autoConfig)),
ExecNetSetPrecision::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_AutoBatch_BehaviorTests, ExecNetSetPrecision,
::testing::Combine(
::testing::ValuesIn(netPrecisions),
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(autoBatchConfig)),
ExecNetSetPrecision::getTestCaseName);
} // namespace

View File

@ -22,27 +22,27 @@ namespace {
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassExecutableNetworkGetMetricTest, IEClassExecutableNetworkGetMetricTest_OPTIMAL_NUMBER_OF_INFER_REQUESTS,
::testing::Values("GPU", "MULTI:GPU", "HETERO:GPU", "AUTO:GPU,CPU")
::testing::Values("GPU", "MULTI:GPU", "HETERO:GPU", "AUTO:GPU,CPU", "BATCH:GPU")
);
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassExecutableNetworkGetMetricTest, IEClassExecutableNetworkGetMetricTest_SUPPORTED_CONFIG_KEYS,
::testing::Values("GPU", "MULTI:GPU", "HETERO:GPU", "AUTO:GPU,CPU")
::testing::Values("GPU", "MULTI:GPU", "HETERO:GPU", "AUTO:GPU,CPU", "BATCH:GPU")
);
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassExecutableNetworkGetMetricTest, IEClassExecutableNetworkGetMetricTest_SUPPORTED_METRICS,
::testing::Values("GPU", "MULTI:GPU", "HETERO:GPU", "AUTO:GPU,CPU")
::testing::Values("GPU", "MULTI:GPU", "HETERO:GPU", "AUTO:GPU,CPU", "BATCH:GPU")
);
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassExecutableNetworkGetMetricTest, IEClassExecutableNetworkGetMetricTest_NETWORK_NAME,
::testing::Values("GPU", "MULTI:GPU", "HETERO:GPU", "AUTO:GPU,CPU")
::testing::Values("GPU", "MULTI:GPU", "HETERO:GPU", "AUTO:GPU,CPU", "BATCH:GPU")
);
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassExecutableNetworkGetMetricTest, IEClassExecutableNetworkGetMetricTest_ThrowsUnsupported,
::testing::Values("GPU", "MULTI:GPU", "HETERO:GPU", "AUTO:GPU,CPU")
::testing::Values("GPU", "MULTI:GPU", "HETERO:GPU", "AUTO:GPU,CPU", "BATCH:GPU")
);
//

View File

@ -19,6 +19,10 @@ const std::vector<std::map<std::string, std::string>> autoConfigs = {
{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU + std::string(",") + CommonTestUtils::DEVICE_CPU}}
};
const std::vector<std::map<std::string, std::string>> autoBatchConfigs = {
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU}},
};
INSTANTIATE_TEST_SUITE_P(smoke_BehaviorTests, InferRequestCallbackTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_GPU),
@ -36,4 +40,10 @@ INSTANTIATE_TEST_SUITE_P(smoke_Auto_BehaviorTests, InferRequestCallbackTests,
::testing::Values(CommonTestUtils::DEVICE_AUTO),
::testing::ValuesIn(autoConfigs)),
InferRequestCallbackTests::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_AutoBatch_BehaviorTests, InferRequestCallbackTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(autoBatchConfigs)),
InferRequestCallbackTests::getTestCaseName);
} // namespace

View File

@ -18,6 +18,10 @@ const std::vector<std::map<std::string, std::string>> autoconfigs = {
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES, std::string(CommonTestUtils::DEVICE_CPU) + "," + CommonTestUtils::DEVICE_GPU}}
};
const std::vector<std::map<std::string, std::string>> auto_batch_configs = {
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU}},
};
INSTANTIATE_TEST_SUITE_P(smoke_BehaviorTests, InferRequestMultithreadingTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_GPU),
@ -36,4 +40,10 @@ INSTANTIATE_TEST_SUITE_P(smoke_Auto_BehaviorTests, InferRequestMultithreadingTes
::testing::ValuesIn(autoconfigs)),
InferRequestMultithreadingTests::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_AutoBatch_BehaviorTests, InferRequestMultithreadingTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(auto_batch_configs)),
InferRequestMultithreadingTests::getTestCaseName);
} // namespace

View File

@ -19,6 +19,11 @@ namespace {
CommonTestUtils::DEVICE_GPU + std::string(",") + CommonTestUtils::DEVICE_CPU}}
};
const std::vector<std::map<std::string, std::string>> autoBatchConfigs = {
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU}},
};
INSTANTIATE_TEST_SUITE_P(smoke_BehaviorTests, InferRequestWaitTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_GPU),
@ -37,4 +42,10 @@ namespace {
::testing::ValuesIn(autoConfigs)),
InferRequestWaitTests::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_AutoBatch_BehaviorTests, InferRequestWaitTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(autoBatchConfigs)),
InferRequestWaitTests::getTestCaseName);
} // namespace

View File

@ -30,11 +30,11 @@ INSTANTIATE_TEST_SUITE_P(nightly_OVClassNetworkTestP, OVClassNetworkTestP, ::tes
INSTANTIATE_TEST_SUITE_P(nightly_OVClassGetMetricTest,
OVClassGetMetricTest_SUPPORTED_CONFIG_KEYS,
::testing::Values("GPU", "MULTI", "HETERO", "AUTO"));
::testing::Values("GPU", "MULTI", "HETERO", "AUTO", "BATCH"));
INSTANTIATE_TEST_SUITE_P(nightly_OVClassGetMetricTest,
OVClassGetMetricTest_SUPPORTED_METRICS,
::testing::Values("GPU", "MULTI", "HETERO", "AUTO"));
::testing::Values("GPU", "MULTI", "HETERO", "AUTO", "BATCH"));
INSTANTIATE_TEST_SUITE_P(nightly_OVClassGetMetricTest,
OVClassGetMetricTest_AVAILABLE_DEVICES,
@ -42,7 +42,7 @@ INSTANTIATE_TEST_SUITE_P(nightly_OVClassGetMetricTest,
INSTANTIATE_TEST_SUITE_P(nightly_OVClassGetMetricTest,
OVClassGetMetricTest_FULL_DEVICE_NAME,
::testing::Values("GPU", "MULTI", "HETERO", "AUTO"));
::testing::Values("GPU", "MULTI", "HETERO", "AUTO", "BATCH"));
INSTANTIATE_TEST_SUITE_P(nightly_OVClassGetMetricTest,
OVClassGetMetricTest_OPTIMIZATION_CAPABILITIES,
@ -62,11 +62,11 @@ INSTANTIATE_TEST_SUITE_P(nightly_OVClassGetMetricTest,
INSTANTIATE_TEST_SUITE_P(nightly_OVClassGetMetricTest,
OVClassGetMetricTest_ThrowUnsupported,
::testing::Values("GPU", "MULTI", "HETERO", "AUTO"));
::testing::Values("GPU", "MULTI", "HETERO", "AUTO", "BATCH"));
INSTANTIATE_TEST_SUITE_P(nightly_OVClassGetConfigTest,
OVClassGetConfigTest_ThrowUnsupported,
::testing::Values("GPU", "MULTI", "HETERO", "AUTO"));
::testing::Values("GPU", "MULTI", "HETERO", "AUTO", "BATCH"));
INSTANTIATE_TEST_SUITE_P(nightly_OVClassGetAvailableDevices, OVClassGetAvailableDevices, ::testing::Values("GPU"));

View File

@ -104,6 +104,29 @@ namespace {
CommonTestUtils::DEVICE_GPU + std::string(",") + CommonTestUtils::DEVICE_CPU},
{InferenceEngine::MultiDeviceConfigParams::KEY_AUTO_NETWORK_PRIORITY, "should be int"}}
};
const std::vector<std::map<std::string, std::string>> auto_batch_inconfigs = {
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG), CommonTestUtils::DEVICE_GPU},
{CONFIG_KEY(AUTO_BATCH_TIMEOUT), "-1"}},
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG), CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, "DOESN'T EXIST"}},
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "-1"}},
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERF_COUNT, "ON"}},
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE, "unknown_file"}},
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_DUMP_KERNELS, "ON"}},
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_TUNING_MODE, "TUNING_UNKNOWN_MODE"}},
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_DEVICE_ID, "DEVICE_UNKNOWN"}},
};
IE_SUPPRESS_DEPRECATED_END
INSTANTIATE_TEST_SUITE_P(smoke_BehaviorTests, IncorrectConfigTests,
@ -125,6 +148,12 @@ namespace {
IncorrectConfigTests::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_AutoBatch_BehaviorTests, IncorrectConfigTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(auto_batch_inconfigs)),
IncorrectConfigTests::getTestCaseName);
const std::vector<std::map<std::string, std::string>> conf = {
{}
};
@ -167,17 +196,6 @@ namespace {
};
IE_SUPPRESS_DEPRECATED_END
const std::vector<std::map<std::string, std::string>> multiconf = {
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::THROUGHPUT}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "1"}}
};
const std::vector<std::map<std::string, std::string>> autoConfigs = {
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
@ -232,6 +250,12 @@ namespace {
{InferenceEngine::MultiDeviceConfigParams::KEY_AUTO_NETWORK_PRIORITY, "2"}}
};
const std::vector<std::map<std::string, std::string>> auto_batch_configs = {
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU}},
{{CONFIG_KEY(AUTO_BATCH_DEVICE_CONFIG) , CommonTestUtils::DEVICE_GPU},
{CONFIG_KEY(AUTO_BATCH_TIMEOUT) , "1"}},
};
INSTANTIATE_TEST_SUITE_P(smoke_BehaviorTests, DefaultValuesConfigTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_GPU),
@ -255,4 +279,15 @@ namespace {
::testing::Values(CommonTestUtils::DEVICE_AUTO),
::testing::ValuesIn(autoinconfigs)),
IncorrectConfigAPITests::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_AutoBatch_BehaviorTests, IncorrectConfigAPITests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(auto_batch_inconfigs)),
IncorrectConfigAPITests::getTestCaseName);
INSTANTIATE_TEST_SUITE_P(smoke_AutoBatch_BehaviorTests, CorrectConfigTests,
::testing::Combine(
::testing::Values(CommonTestUtils::DEVICE_BATCH),
::testing::ValuesIn(auto_batch_configs)),
CorrectConfigTests::getTestCaseName);
} // namespace

View File

@ -35,12 +35,12 @@ INSTANTIATE_TEST_SUITE_P(
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassGetMetricTest, IEClassGetMetricTest_SUPPORTED_CONFIG_KEYS,
::testing::Values("GPU", "MULTI", "HETERO", "AUTO")
::testing::Values("GPU", "MULTI", "HETERO", "AUTO", "BATCH")
);
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassGetMetricTest, IEClassGetMetricTest_SUPPORTED_METRICS,
::testing::Values("GPU", "MULTI", "HETERO", "AUTO")
::testing::Values("GPU", "MULTI", "HETERO", "AUTO", "BATCH")
);
INSTANTIATE_TEST_SUITE_P(
@ -50,7 +50,7 @@ INSTANTIATE_TEST_SUITE_P(
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassGetMetricTest, IEClassGetMetricTest_FULL_DEVICE_NAME,
::testing::Values("GPU", "MULTI", "HETERO", "AUTO")
::testing::Values("GPU", "MULTI", "HETERO", "AUTO", "BATCH")
);
INSTANTIATE_TEST_SUITE_P(
@ -80,12 +80,12 @@ INSTANTIATE_TEST_SUITE_P(
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassGetMetricTest, IEClassGetMetricTest_ThrowUnsupported,
::testing::Values("GPU", "MULTI", "HETERO", "AUTO")
::testing::Values("GPU", "MULTI", "HETERO", "AUTO", "BATCH")
);
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassGetConfigTest, IEClassGetConfigTest_ThrowUnsupported,
::testing::Values("GPU", "MULTI", "HETERO", "AUTO")
::testing::Values("GPU", "MULTI", "HETERO", "AUTO", "BATCH")
);
INSTANTIATE_TEST_SUITE_P(
@ -115,6 +115,26 @@ INSTANTIATE_TEST_SUITE_P(
::testing::Values("GPU")
);
using IEClassGetMetricTest_GPU_OPTIMAL_BATCH_SIZE = BehaviorTestsUtils::IEClassBaseTestP;
TEST_P(IEClassGetMetricTest_GPU_OPTIMAL_BATCH_SIZE, GetMetricAndPrintNoThrow) {
SKIP_IF_CURRENT_TEST_IS_DISABLED()
InferenceEngine::Core ie;
InferenceEngine::Parameter p;
std::map<std::string, InferenceEngine::Parameter> _options = {{"MODEL_PTR", simpleCnnNetwork.getFunction()}};
ASSERT_NO_THROW(p = ie.GetMetric(deviceName, METRIC_KEY(OPTIMAL_BATCH_SIZE), _options).as<unsigned int>());
unsigned int t = p;
std::cout << "GPU device optimal batch size: " << t << std::endl;
ASSERT_METRIC_SUPPORTED_IE(METRIC_KEY(OPTIMAL_BATCH_SIZE));
}
INSTANTIATE_TEST_SUITE_P(
nightly_IEClassExecutableNetworkGetMetricTest, IEClassGetMetricTest_GPU_OPTIMAL_BATCH_SIZE,
::testing::Values("GPU")
);
using IEClassGetMetricTest_GPU_MAX_BATCH_SIZE_DEFAULT = BehaviorTestsUtils::IEClassBaseTestP;
TEST_P(IEClassGetMetricTest_GPU_MAX_BATCH_SIZE_DEFAULT, GetMetricAndPrintNoThrow) {
SKIP_IF_CURRENT_TEST_IS_DISABLED()
@ -135,6 +155,7 @@ INSTANTIATE_TEST_SUITE_P(
::testing::Values("GPU")
);
using IEClassGetMetricTest_GPU_MAX_BATCH_SIZE_STREAM_DEVICE_MEM = BehaviorTestsUtils::IEClassBaseTestP;
TEST_P(IEClassGetMetricTest_GPU_MAX_BATCH_SIZE_STREAM_DEVICE_MEM, GetMetricAndPrintNoThrow) {
SKIP_IF_CURRENT_TEST_IS_DISABLED()

View File

@ -16,6 +16,11 @@ if(ENABLE_AUTO OR ENABLE_MULTI)
list(APPEND DEPENDENCIES ov_auto_plugin)
endif()
if(ENABLE_AUTO_BATCH)
list(APPEND DEPENDENCIES ov_auto_batch_plugin)
endif()
# remove once CVS-69781 is fixed
if(ENABLE_OV_IR_FRONTEND)
list(APPEND DEPENDENCIES ov_ir_frontend)

View File

@ -0,0 +1,161 @@
// Copyright (C) 2018-2021 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
#include <string>
#include <utility>
#include <vector>
#include <memory>
#include <gpu/gpu_config.hpp>
#include <common_test_utils/test_common.hpp>
#include <functional_test_utils/plugin_cache.hpp>
#include "ngraph_functions/subgraph_builders.hpp"
#include "functional_test_utils/blob_utils.hpp"
using namespace ::testing;
using namespace InferenceEngine;
namespace AutoBatchingTests {
using AutoBatchTwoNetsParams = std::tuple<
std::string, // device name
bool, // get or set blob
size_t, // number of streams
size_t, // number of requests
size_t>; // batch size
class AutoBatching_Test : public CommonTestUtils::TestsCommon,
public testing::WithParamInterface<AutoBatchTwoNetsParams> {
void SetUp() override {
std::tie(device_name, use_get_blob, num_streams, num_requests, num_batch) = this->GetParam();
fn_ptrs = {ngraph::builder::subgraph::makeSingleConv(),
ngraph::builder::subgraph::makeMultiSingleConv()};
};
public:
static std::string getTestCaseName(const testing::TestParamInfo<AutoBatchTwoNetsParams> &obj) {
size_t streams, requests, batch;
bool use_get_blob;
std::string device_name;
std::tie(device_name, use_get_blob, streams, requests, batch) = obj.param;
return device_name + std::string(use_get_blob ? "_get_blob" : "_set_blob") + "_batch_size_" +
std::to_string(batch) +
"_num_streams_" + std::to_string(streams) + "_num_req_" + std::to_string(requests);
}
protected:
std::string device_name;
bool use_get_blob;
size_t num_streams;
size_t num_requests;
size_t num_batch;
std::vector<std::shared_ptr<ngraph::Function>> fn_ptrs;
void TestAutoBatch() {
std::vector<InferenceEngine::CNNNetwork> nets;
for (auto &fn_ptr : fn_ptrs) {
nets.push_back(CNNNetwork(fn_ptr));
}
auto ie = InferenceEngine::Core();
std::vector<std::string> outputs;
std::vector<InferRequest> irs;
std::vector<std::vector<uint8_t>> ref;
std::vector<int> outElementsCount;
for (size_t i = 0; i < nets.size(); ++i) {
auto net = nets[i];
auto inputs = net.getInputsInfo();
for (auto n : inputs) {
n.second->setPrecision(Precision::FP32);
}
std::map<std::string, std::string> config;
if (device_name.find("GPU") != std::string::npos)
config[CONFIG_KEY(GPU_THROUGHPUT_STREAMS)] = std::to_string(num_streams);
if (device_name.find("CPU") != std::string::npos)
config[CONFIG_KEY(CPU_THROUGHPUT_STREAMS)] = std::to_string(num_streams);
// minimize timeout to reduce test time
config[CONFIG_KEY(AUTO_BATCH_TIMEOUT)] = std::to_string(1);
auto exec_net_ref = ie.LoadNetwork(net, std::string(CommonTestUtils::DEVICE_BATCH) + ":" +
device_name + "(" + std::to_string(num_batch) + ")",
config);
for (size_t j = 0; j < num_requests; j++) {
outputs.push_back(net.getOutputsInfo().begin()->first); //single output
outElementsCount.push_back(
std::accumulate(begin(fn_ptrs[i]->get_output_shape(0)), end(fn_ptrs[i]->get_output_shape(0)), 1,
std::multiplies<size_t>()));
auto inf_req = exec_net_ref.CreateInferRequest();
irs.push_back(inf_req);
std::vector<std::vector<uint8_t>> inData;
for (auto n : inputs) {
auto blob = FuncTestUtils::createAndFillBlob(n.second->getTensorDesc());
if (use_get_blob)
memcpy(reinterpret_cast<void *>(inf_req.GetBlob(n.first)->buffer().as<uint8_t*>()),
reinterpret_cast<const void *>(blob->cbuffer().as<uint8_t*>()), blob->byteSize());
else
inf_req.SetBlob(n.first, blob);
const auto inBlob = inf_req.GetBlob(n.first);
const auto blobSize = inBlob->byteSize();
const auto inBlobBuf = inBlob->cbuffer().as<uint8_t *>();
inData.push_back(std::vector<uint8_t>(inBlobBuf, inBlobBuf + blobSize));
}
auto refOutData = ngraph::helpers::interpreterFunction(fn_ptrs[i], {inData}).front().second;
ref.push_back(refOutData);
}
}
const int niter = 1;
for (int i = 0; i < niter; i++) {
for (auto ir : irs) {
ir.StartAsync();
}
for (auto ir : irs) {
ir.Wait(InferRequest::RESULT_READY);
}
}
auto thr = FuncTestUtils::GetComparisonThreshold(InferenceEngine::Precision::FP32);
for (size_t i = 0; i < irs.size(); ++i) {
const auto &refBuffer = ref[i].data();
ASSERT_EQ(outElementsCount[i], irs[i].GetBlob(outputs[i])->size());
FuncTestUtils::compareRawBuffers(irs[i].GetBlob(outputs[i])->buffer().as<float *>(),
reinterpret_cast<const float *>(refBuffer), outElementsCount[i],
outElementsCount[i],
thr);
}
}
};
class AutoBatching_Test_DetectionOutput : public AutoBatching_Test {
public:
void SetUp() override {
std::tie(device_name, use_get_blob, num_streams, num_requests, num_batch) = this->GetParam();
fn_ptrs = {ngraph::builder::subgraph::makeEltwisePlusDetectionOutput(),
ngraph::builder::subgraph::makeEltwisePlusDetectionOutput()};
};
static std::string getTestCaseName(const testing::TestParamInfo<AutoBatchTwoNetsParams> &obj) {
size_t streams, requests, batch;
bool use_get_blob;
std::string device_name;
std::tie(device_name, use_get_blob, streams, requests, batch) = obj.param;
return "DetectionOutput_HETERO_" + device_name + std::string(use_get_blob ? "_get_blob" : "_set_blob") +
"_batch_size_" + std::to_string(batch) +
"_num_streams_" + std::to_string(streams) + "_num_req_" + std::to_string(requests);
}
};
TEST_P(AutoBatching_Test, compareAutoBatchingToSingleBatch) {
TestAutoBatch();
}
TEST_P(AutoBatching_Test_DetectionOutput, compareAutoBatchingToSingleBatch) {
TestAutoBatch();
}
} // namespace AutoBatchingTests

View File

@ -10,6 +10,7 @@ const char DEVICE_AUTO[] = "AUTO";
const char DEVICE_CPU[] = "CPU";
const char DEVICE_GNA[] = "GNA";
const char DEVICE_GPU[] = "GPU";
const char DEVICE_BATCH[] = "BATCH";
const char DEVICE_HDDL[] = "HDDL";
const char DEVICE_MYRIAD[] = "MYRIAD";
const char DEVICE_KEEMBAY[] = "VPUX";

View File

@ -26,6 +26,9 @@ public:
MOCK_METHOD3(ImportNetwork, InferenceEngine::SoExecutableNetworkInternal(
std::istream&, const std::shared_ptr<InferenceEngine::RemoteContext>&, const std::map<std::string, std::string>&));
MOCK_METHOD2(CreateContext, InferenceEngine::RemoteContext::Ptr(const std::string& deviceName,
const InferenceEngine::ParamMap& params));
MOCK_CONST_METHOD3(QueryNetwork, InferenceEngine::QueryNetworkResult(
const InferenceEngine::CNNNetwork&, const std::string&, const std::map<std::string, std::string>&));

View File

@ -242,6 +242,44 @@ inline std::shared_ptr<ngraph::Function> makeSingleConv(std::vector<size_t> inpu
return fn_ptr;
}
inline std::shared_ptr<ngraph::Function> makeEltwisePlusDetectionOutput(std::vector<std::vector<size_t>> inShapes =
{{1, 60}, {1, 165}, {1, 1, 75}},
ngraph::element::Type_t type = ngraph::element::Type_t::f32) {
// adding Eltwise so that we can test Auto-Batching's HETERO code-path that splits the DetectionOutput from the rest of the network
auto params = ngraph::builder::makeParams(ngraph::element::f32, inShapes);
auto paramOuts = ngraph::helpers::convert2OutputVector(
ngraph::helpers::castOps2Nodes<ngraph::opset3::Parameter>(params));
ngraph::OutputVector outs;
for (size_t i = 0; i < inShapes.size(); i++) {
auto shape = inShapes[i];
auto p = std::make_shared<ngraph::opset3::Parameter>(ngraph::element::f32, ngraph::Shape{shape});
auto add = ngraph::builder::makeEltwise(paramOuts[i], p, ngraph::helpers::EltwiseTypes::ADD);
params.push_back(p);
outs.push_back(add->output(0));
}
ngraph::op::DetectionOutput::Attributes attr;
attr.num_classes = 11;
attr.background_label_id = 0;
attr.top_k = 75;
attr.variance_encoded_in_target = true;
attr.keep_top_k = {50};
attr.code_type = std::string{"caffe.PriorBoxParameter.CORNER"};
attr.share_location = true;
attr.nms_threshold = 0.5f;
attr.confidence_threshold = 0.5f;
attr.clip_after_nms = false;
attr.clip_before_nms = false;
attr.decrease_label_id = false;
attr.normalized = false;
attr.input_height = 1;
attr.input_width = 1;
attr.objectness_score = 0.4f;
auto detOut = ngraph::builder::makeDetectionOutput(outs, attr);
ngraph::ResultVector results{std::make_shared<ngraph::opset3::Result>(detOut)};
return std::make_shared<ngraph::Function>(results, params, "EltWiseWithDetectionOutput");
}
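The builder above is consumed by the AutoBatching_Test_DetectionOutput fixture; used standalone, it can be wrapped into a CNNNetwork in the usual way (a sketch; the device string in the comment is only an example):

auto fn = ngraph::builder::subgraph::makeEltwisePlusDetectionOutput();
InferenceEngine::CNNNetwork net(fn);
// Loading on an auto-batching device exercises the HETERO split around DetectionOutput, e.g.:
// core.LoadNetwork(net, "BATCH:GPU");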
inline std::shared_ptr<ngraph::Function> makeMultiSingleConv(std::vector<size_t> inputShape = {1, 3, 24, 24},
ngraph::element::Type type = ngraph::element::Type_t::f32) {
auto param0 = std::make_shared<ngraph::opset1::Parameter>(type, ngraph::Shape(inputShape));

View File

@@ -38,6 +38,7 @@ using Config = std::map<std::string, std::string>;
using namespace MockMultiDevice;
using ConfigParams = std::tuple<
bool, // if THROUGHPUT
unsigned int, // cpu OPTIMAL_NUMBER_OF_INFER_REQUESTS
int, // cpu infer request num the customer wants
bool, // if cpu sleep is set, the cpu device loads slowly
@@ -77,12 +78,18 @@ public:
unsigned int expectOptimalNum;
bool cpuSleep;
bool gpuSleep;
std::tie(cpuOptimalNum, cpuCustomerNum, cpuSleep,
bool isThroughput;
std::tie(isThroughput, cpuOptimalNum, cpuCustomerNum, cpuSleep,
gpuOptimalNum, gpuCustomerNum, gpuSleep, expectOptimalNum) = obj.param;
std::ostringstream result;
result << "cpuOptimalNum_" << cpuOptimalNum << "cpuCustomerNum_" << cpuCustomerNum;
result << "gpuOptimalNum_" << gpuOptimalNum << "gpuCustomerNum_" << gpuCustomerNum;
result << "expectOptimalNum_" << expectOptimalNum;
if (isThroughput) {
result << "_isThroughput" << "true";
} else {
result << "__isThroughput" << "false";
}
if (cpuSleep) {
result << "_cpuSleep_" << "true";
} else {
@@ -147,7 +154,7 @@ public:
IE_SET_METRIC(SUPPORTED_CONFIG_KEYS, supportConfigs, {});
ON_CALL(*core, GetMetric(_, StrEq(METRIC_KEY(SUPPORTED_CONFIG_KEYS)), _))
.WillByDefault(RETURN_MOCK_VALUE(supportConfigs));
EXPECT_CALL(*core, GetMetric(_, StrEq(METRIC_KEY(SUPPORTED_CONFIG_KEYS)), _)).Times(AnyNumber());
EXPECT_CALL(*core, GetMetric(_, _, _)).Times(AnyNumber());
// test auto plugin
config.insert({CONFIG_KEY_INTERNAL(MULTI_WORK_MODE_AS_AUTO), InferenceEngine::PluginConfigParams::YES});
@@ -168,11 +175,24 @@ TEST_P(ExecNetworkGetMetric, OPTIMAL_NUMBER_OF_INFER_REQUESTS) {
unsigned int expectOptimalNum;
bool cpuSleep;
bool gpuSleep;
std::tie(cpuOptimalNum, cpuCustomerNum, cpuSleep,
bool isThroughput;
std::tie(isThroughput, cpuOptimalNum, cpuCustomerNum, cpuSleep,
gpuOptimalNum, gpuCustomerNum, gpuSleep, expectOptimalNum) = this->GetParam();
if (isThroughput) {
metaDevices.push_back({CommonTestUtils::DEVICE_CPU, {{CONFIG_KEY(PERFORMANCE_HINT),
InferenceEngine::PluginConfigParams::THROUGHPUT}}, cpuCustomerNum, ""});
metaDevices.push_back({CommonTestUtils::DEVICE_GPU, {{CONFIG_KEY(PERFORMANCE_HINT),
InferenceEngine::PluginConfigParams::THROUGHPUT}}, gpuCustomerNum, ""});
IE_SET_METRIC(OPTIMAL_BATCH_SIZE, optimalBatchNum, 256);
IE_SET_METRIC(RANGE_FOR_STREAMS, rangeOfStreams, std::make_tuple<unsigned int, unsigned int>(1, 2));
ON_CALL(*core.get(), GetMetric(StrEq(CommonTestUtils::DEVICE_GPU), StrEq(METRIC_KEY(OPTIMAL_BATCH_SIZE)), _))
.WillByDefault(RETURN_MOCK_VALUE(optimalBatchNum));
ON_CALL(*core.get(), GetMetric(StrEq(CommonTestUtils::DEVICE_GPU), StrEq(METRIC_KEY(RANGE_FOR_STREAMS)), _))
.WillByDefault(RETURN_MOCK_VALUE(rangeOfStreams));
} else {
metaDevices.push_back({CommonTestUtils::DEVICE_CPU, {}, cpuCustomerNum, ""});
metaDevices.push_back({CommonTestUtils::DEVICE_GPU, {}, gpuCustomerNum, ""});
}
ON_CALL(*plugin, SelectDevice(_, _, _)).WillByDefault(Return(metaDevices[1]));
ON_CALL(*plugin, ParseMetaDevices(_, _)).WillByDefault(Return(metaDevices));
EXPECT_CALL(*plugin, ParseMetaDevices(_, _)).Times(1);
@@ -241,27 +261,28 @@ TEST_P(ExecNetworkGetMetric, OPTIMAL_NUMBER_OF_INFER_REQUESTS) {
}
// ConfigParams {unsigned int, int, bool,
// ConfigParams {bool, unsigned int, int, bool,
// unsigned int, int, bool, unsigned int}
//
// every element for ConfigParams
// {cpuOptimalNum, cpu infer request num the customer wants, if cpu sleeps during load,
// {is throughput mode, cpuOptimalNum, cpu infer request num the customer wants, if cpu sleeps during load,
// gpuOptimalNum, gpu infer request num the customer wants, if gpu sleeps during load,
// expectOptimalNum of Auto ExecNetwork}
//
const std::vector<ConfigParams> testConfigs = {
ConfigParams {1, -1, false, 2, -1, true, 8},
ConfigParams {1, -1, false, 10, -1, true, 8},
ConfigParams {12, -1, false, 2, -1, true, 12},
ConfigParams {12, -1, false, 10, -1, true, 12},
ConfigParams {1, -1, true, 2, -1, false, 8},
ConfigParams {1, -1, true, 10, -1, false, 10},
ConfigParams {6, -1, true, 2, -1, false, 8},
ConfigParams {6, -1, true, 10, -1, false, 10},
ConfigParams {6, 4, false, 2, 3, true, 8},
ConfigParams {6, 4, false, 10, 3, true, 8},
ConfigParams {1, 4, true, 2, 3, false, 8},
ConfigParams {1, 4, true, 10, 3, false, 10}
ConfigParams {false, 1, -1, false, 2, -1, true, 8},
ConfigParams {false, 1, -1, false, 10, -1, true, 8},
ConfigParams {false, 12, -1, false, 2, -1, true, 12},
ConfigParams {false, 12, -1, false, 10, -1, true, 12},
ConfigParams {false, 1, -1, true, 2, -1, false, 8},
ConfigParams {false, 1, -1, true, 10, -1, false, 10},
ConfigParams {false, 6, -1, true, 2, -1, false, 8},
ConfigParams {false, 6, -1, true, 10, -1, false, 10},
ConfigParams {false, 6, 4, false, 2, 3, true, 8},
ConfigParams {false, 6, 4, false, 10, 3, true, 8},
ConfigParams {false, 1, 4, true, 2, 3, false, 8},
ConfigParams {false, 1, 4, true, 10, 3, false, 10},
ConfigParams {true, 1, 4, false, 10, 3, true, 512}
};
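The single throughput entry above expects 512 requests, which lines up with the GPU metrics mocked earlier in this test; a quick sanity check of that relation (an informal reading of the mocked values, not necessarily the plugin's exact formula):

// OPTIMAL_BATCH_SIZE (mocked)               = 256
// RANGE_FOR_STREAMS upper bound (mocked)    = 2
// expected OPTIMAL_NUMBER_OF_INFER_REQUESTS = 256 * 2 = 512
static_assert(256 * 2 == 512, "throughput case: optimal batch size x streams");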
INSTANTIATE_TEST_SUITE_P(smoke_Auto_BehaviorTests, ExecNetworkGetMetric,

View File

@@ -14,6 +14,11 @@ if(ENABLE_AUTO OR ENABLE_MULTI)
add_dependencies(${TARGET_NAME} ov_auto_plugin)
endif()
if(ENABLE_AUTO_BATCH)
add_dependencies(${TARGET_NAME} ov_auto_batch_plugin)
endif()
target_include_directories(${TARGET_NAME} PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/plugin_tests")
target_link_libraries(${TARGET_NAME} PUBLIC

View File

@@ -25,6 +25,10 @@ if(ENABLE_AUTO OR ENABLE_MULTI)
add_dependencies(${TARGET_NAME} ov_auto_plugin)
endif()
if(ENABLE_AUTO_BATCH)
add_dependencies(${TARGET_NAME} ov_auto_batch_plugin)
endif()
set_ie_threading_interface_for(${TARGET_NAME})
ie_faster_build(${TARGET_NAME}