OV Performance Hints (CPU and GPU logic for selecting the actual configs, while AUTO/MULTI pass them through) (#6993)

* rebasing the perf-modes-2021.3 to the 2021.4

Caveats:
the (explicit) setting of #streams is not disabled (as it was before for the experiments with DLBenchmark), and the logic slightly differs (streamsSet)

(cherry picked from commit 1ae1edc0ed)

* overriding streams (to force the TPUT mode for the DLBenchmark)

(cherry picked from commit 7f506cda31)

* disabling reducing #streams to fully mimic baseline c4df94d42d of the 2021.3 (before experiments)

(cherry picked from commit 85073dd1dd)

* clang/indentation

(cherry picked from commit 050a4155a9)

* splitting the Transformation into general and CPU-specific parts.

Now, hopefully, this fully mimics the baseline c4df94d42d of the 2021.3 (before experiments), as the reduction of the number of streams (as well as the early exit on GRU/LSTM/TensorIterator) is disabled

(cherry picked from commit e98b2c1a67)

* disabling GRU/LSTM/TI and the reducing of streams; 5D is considered compute-limited only for int8

(cherry picked from commit 32b8d80dee)

* refactored to avoid compute_limited_ratio, reverted the reducing of #streams, removed LSTM from the limitations

(cherry picked from commit f2b972171b)

* isa-based threshold logic

(cherry picked from commit b218457e1a)

* mode->hint

(cherry picked from commit ec20aa8eca)

* optional PERFORMANCE_HINT_NUM_REQUESTS

(cherry picked from commit 5a3883e3f3)

* moving the perfHints to the common OV config class + initial tests (CPU only, as the actual AUTO/MULTI should be accommodated on the master)

(cherry picked from commit 45bafe7d527f466507dea0693aeed51be4ebf776, then fixed)

* AUTO support for PerfHints

* MULTI support for PerfHints

* Enabling Perf hints for the GPU plugin

* brushing settings output a bit

* disabling the "throughput" perf hint as the default (until OV 2.0)

* uncommenting the logic which was disabled to force the DLBenchmark to use the throughput mode by default

* removing dead and experimental code, and debug printfs

* clang/code-style

* code-review remarks

* Moved the output of the actual params that the hint produced to the right place

* aligning MULTI's GetConfig behavior to HETERO's, as captured in the presentation (CVS-59960) ratified with the ArchForum

* clang

* benchmark_app brushing

* Update inference-engine/samples/benchmark_app/README.md

* propagating the perf hints through one more scenario in the merged AUTO-MULTI

* fixed misprint

* Python benchmark_app update for perf hints

* addressing reviewers' comments on the Python benchmark_app

* simplifying/brushing logic a bit

* refactor the heuristic to the separate file (to be shared with iGPU soon)

* refactor conversion of modes to the specific GPU config per feedback from Vladimir
Maxim Shevtsov 2021-09-13 15:40:36 +03:00 committed by GitHub
parent 2793963e6f
commit 3bec32449f
25 changed files with 646 additions and 103 deletions

View File

@ -1,6 +1,7 @@
# Benchmark C++ Tool {#openvino_inference_engine_samples_benchmark_app_README}
This topic demonstrates how to use the Benchmark C++ Tool to estimate deep learning inference performance on supported devices. Performance can be measured for two inference modes: synchronous (latency-oriented) and asynchronous (throughput-oriented).
This topic demonstrates how to use the Benchmark C++ Tool to estimate deep learning inference performance on supported devices.
Performance can be measured for two inference modes: latency- and throughput-oriented.
> **NOTE:** This topic describes usage of C++ implementation of the Benchmark Tool. For the Python* implementation, refer to [Benchmark Python* Tool](../../../tools/benchmark_tool/README.md).
@ -12,12 +13,19 @@ This topic demonstrates how to use the Benchmark C++ Tool to estimate deep learn
## How It Works
Upon start-up, the application reads command-line parameters and loads a network and images/binary files to the Inference Engine plugin, which is chosen depending on a specified device. The number of infer requests and execution approach depend on the mode defined with the `-api` command-line parameter.
Upon start-up, the application reads command-line parameters and loads a network and inputs (images/binary files) to the specified device.
> **NOTE**: By default, Inference Engine samples, tools and demos expect input with BGR channels order. If you trained your model to work with RGB order, you need to manually rearrange the default channels order in the sample or demo application or reconvert your model using the Model Optimizer tool with `--reverse_input_channels` argument specified. For more information about the argument, refer to **When to Reverse Input Channels** section of [Converting a Model Using General Conversion Parameters](../../../docs/MO_DG/prepare_model/convert_model/Converting_Model_General.md).
**NOTE**: By default, Inference Engine samples, tools and demos expect input with BGR channels order.
If you trained your model to work with RGB order, you need to manually rearrange the default channels order in the sample or demo application
or reconvert your model using the Model Optimizer tool with `--reverse_input_channels` argument specified.
For more information about the argument, refer to **When to Reverse Input Channels** section of
[Converting a Model Using General Conversion Parameters](../../../docs/MO_DG/prepare_model/convert_model/Converting_Model_General.md).
If you run the application in the synchronous mode, it creates one infer request and executes the `Infer` method.
If you run the application in the asynchronous mode, it creates as many infer requests as specified in the `-nireq` command-line parameter and executes the `StartAsync` method for each of them. If `-nireq` is not set, the application will use the default value for specified device.
Device-specific execution parameters (number of streams, threads, and so on) can be either explicitly specified through the command line
or left default. In the latter case, the sample logic will select the values for the optimal throughput.
While experimenting with individual parameters allows you to find the performance sweet spot, the resulting parameters are usually not very performance-portable,
so the values from one machine or device are not necessarily optimal for another.
From this perspective, the most portable way is experimenting only with the performance hints. To learn more, refer to the section on the command-line parameters below.
A number of execution steps is defined by one of the following parameters:
* Number of iterations specified with the `-niter` command-line argument
@ -25,14 +33,9 @@ A number of execution steps is defined by one of the following parameters:
* Both of them (execution will continue until both conditions are met)
* Predefined duration if `-niter` and `-t` are not specified. Predefined duration value depends on a device.
During the execution, the application collects latency for each executed infer request.
Reported latency value is calculated as a median value of all collected latencies. Reported throughput value is reported
in frames per second (FPS) and calculated as a derivative from:
* Reported latency in the Sync mode
* The total execution time in the Async mode
Throughput value also depends on batch size.
During the execution, the application calculates latency (if applicable) and overall throughput:
* By default, the median latency value is reported
* Throughput is calculated as number_of_processed_requests/overall_inference_time (in frames per second). Note that the throughput value also depends on batch size.
The application also collects per-layer Performance Measurement (PM) counters for each executed infer request if you
enable statistics dumping by setting the `-report_type` parameter to one of the possible values:
@ -56,7 +59,7 @@ Note that the benchmark_app usually produces optimal performance for any device
./benchmark_app -m <model> -i <input> -d CPU
```
But it is still may be non-optimal for some cases, especially for very small networks. More details can read in [Introduction to Performance Topics](../../../docs/IE_DG/Intro_to_Performance.md).
But it still may be sub-optimal in some cases, especially for very small networks. More details can be found in [Introduction to Performance Topics](../../../docs/IE_DG/Intro_to_Performance.md).
As explained in the [Introduction to Performance Topics](../../../docs/IE_DG/Intro_to_Performance.md) section, for all devices, including new [MULTI device](../../../docs/IE_DG/supported_plugins/MULTI.md) it is preferable to use the FP16 IR for the model.
Also if latency of the CPU inference on the multi-socket machines is of concern, please refer to the same
@ -83,7 +86,12 @@ Options:
-l "<absolute_path>" Required for CPU custom layers. Absolute path to a shared library with the kernels implementations.
Or
-c "<absolute_path>" Required for GPU custom kernels. Absolute path to an .xml file with the kernels description.
-api "<sync/async>" Optional. Enable Sync/Async API. Default value is "async".
-hint "<throughput(or just 'tput')/latency">
Optional. Performance hint (optimize for latency or throughput).
The hint allows the OpenVINO device to select the right network-specific settings,
as opposed to just accepting specific values from the sample command line.
So you can specify only the hint without setting explicit 'nstreams' or other device-specific options.
-api "<sync/async>" Optional (deprecated). Enable Sync/Async API. Default value is "async".
-niter "<integer>" Optional. Number of iterations. If not specified, the number of iterations is calculated depending on a device.
-nireq "<integer>" Optional. Number of infer requests. Default value is determined automatically for a device.
-b "<integer>" Optional. Batch size value. If not specified, the batch size value is determined from Intermediate Representation.

View File

@ -22,8 +22,15 @@ static const char model_message[] =
"Required. Path to an .xml/.onnx file with a trained model or to a .blob files with "
"a trained compiled model.";
/// @brief message for performance hint
static const char hint_message[] =
"Optional. Performance hint (optimize for latency or throughput). "
"The hint allows the OpenVINO device to select the right network-specific settings, "
"as opposed to just accepting specific values from the sample command line. "
"So you can specify only the hint without setting explicit 'nstreams' or other device-specific options";
/// @brief message for execution mode
static const char api_message[] = "Optional. Enable Sync/Async API. Default value is \"async\".";
static const char api_message[] = "Optional (deprecated). Enable Sync/Async API. Default value is \"async\".";
/// @brief message for assigning cnn calculation to device
static const char target_device_message[] =
@ -193,6 +200,9 @@ DEFINE_string(i, "", input_message);
/// It is a required parameter
DEFINE_string(m, "", model_message);
/// @brief Define execution mode
DEFINE_string(hint, "", hint_message);
/// @brief Define execution mode
DEFINE_string(api, "async", api_message);

View File

@ -59,7 +59,10 @@ bool ParseAndCheckCommandLine(int argc, char* argv[]) {
if (FLAGS_api != "async" && FLAGS_api != "sync") {
throw std::logic_error("Incorrect API. Please set -api option to `sync` or `async` value.");
}
if (!FLAGS_hint.empty() && FLAGS_hint != "throughput" && FLAGS_hint != "tput" && FLAGS_hint != "latency") {
throw std::logic_error("Incorrect performance hint. Please set -hint option to "
"either `throughput` (tput) or `latency` value.");
}
if (!FLAGS_report_type.empty() && FLAGS_report_type != noCntReport && FLAGS_report_type != averageCntReport &&
FLAGS_report_type != detailedCntReport) {
std::string err = "only " + std::string(noCntReport) + "/" + std::string(averageCntReport) + "/" +
@ -208,6 +211,11 @@ int main(int argc, char* argv[]) {
// ----------------- 3. Setting device configuration
// -----------------------------------------------------------
next_step();
std::string ov_perf_hint;
if (FLAGS_hint == "throughput" || FLAGS_hint == "tput")
ov_perf_hint = CONFIG_VALUE(THROUGHPUT);
else if (FLAGS_hint == "latency")
ov_perf_hint = CONFIG_VALUE(LATENCY);
bool perf_counts = false;
// Update config per device according to command line parameters
@ -219,6 +227,13 @@ int main(int argc, char* argv[]) {
config[device] = {};
std::map<std::string, std::string>& device_config = config.at(device);
// high-level performance modes
if (!ov_perf_hint.empty()) {
device_config[CONFIG_KEY(PERFORMANCE_HINT)] = ov_perf_hint;
if (FLAGS_nireq != 0)
device_config[CONFIG_KEY(PERFORMANCE_HINT_NUM_REQUESTS)] = std::to_string(FLAGS_nireq);
}
// Set performance counter
if (isFlagSetInCommandLine("pc")) {
// set to user defined value
@ -241,6 +256,7 @@ int main(int argc, char* argv[]) {
}
perf_counts = (device_config.at(CONFIG_KEY(PERF_COUNT)) == CONFIG_VALUE(YES)) ? true : perf_counts;
// the rest are individual per-device settings (overriding the values set with perf modes)
auto setThroughputStreams = [&]() {
const std::string key = device + "_THROUGHPUT_STREAMS";
if (device_nstreams.count(device)) {
@ -255,7 +271,7 @@ int main(int argc, char* argv[]) {
" or via configuration file.");
}
device_config[key] = device_nstreams.at(device);
} else if (!device_config.count(key) && (FLAGS_api == "async")) {
} else if (ov_perf_hint.empty() && !device_config.count(key) && (FLAGS_api == "async")) {
slog::warn << "-nstreams default value is determined automatically for " << device
<< " device. "
"Although the automatic selection usually provides a "
@ -484,9 +500,24 @@ int main(int argc, char* argv[]) {
batchSize = 1;
}
}
// ----------------- 8. Setting optimal runtime parameters
// ----------------- 8. Querying optimal runtime parameters
// -----------------------------------------------------
next_step();
// output of the actual settings that the device selected based on the hint
if (!ov_perf_hint.empty()) {
for (const auto& device : devices) {
std::vector<std::string> supported_config_keys =
ie.GetMetric(device, METRIC_KEY(SUPPORTED_CONFIG_KEYS));
slog::info << "Device: " << device << slog::endl;
for (const auto& cfg : supported_config_keys) {
try {
slog::info << " {" << cfg << " , " << exeNetwork.GetConfig(cfg).as<std::string>();
} catch (...) {
};
slog::info << " }" << slog::endl;
}
}
}
// Update number of streams
for (auto&& ds : device_nstreams) {

View File

@ -46,8 +46,10 @@ void Config::UpdateFromMap(const std::map<std::string, std::string>& configMap)
for (auto& kvp : configMap) {
std::string key = kvp.first;
std::string val = kvp.second;
if (key.compare(PluginConfigParams::KEY_PERF_COUNT) == 0) {
const auto hints = perfHintsConfig.SupportedKeys();
if (hints.end() != std::find(hints.begin(), hints.end(), key)) {
perfHintsConfig.SetConfig(key, val);
} else if (key.compare(PluginConfigParams::KEY_PERF_COUNT) == 0) {
if (val.compare(PluginConfigParams::YES) == 0) {
useProfiling = true;
} else if (val.compare(PluginConfigParams::NO) == 0) {
@ -341,6 +343,9 @@ void Config::adjustKeyMapValues() {
key_config_map[GPUConfigParams::KEY_GPU_ENABLE_LOOP_UNROLLING] = PluginConfigParams::YES;
else
key_config_map[GPUConfigParams::KEY_GPU_ENABLE_LOOP_UNROLLING] = PluginConfigParams::NO;
key_config_map.insert({ PluginConfigParams::KEY_PERFORMANCE_HINT, perfHintsConfig.ovPerfHint });
key_config_map.insert({ PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS,
std::to_string(perfHintsConfig.ovPerfHintNumRequests) });
}
IE_SUPPRESS_DEPRECATED_END

View File

@ -8,7 +8,7 @@
#include <string>
#include "cldnn_custom_layer.h"
#include <ie_performance_hints.hpp>
#include <cldnn/graph/network.hpp>
namespace CLDNNPlugin {
@ -62,6 +62,7 @@ struct Config {
bool enable_loop_unrolling;
std::map<std::string, std::string> key_config_map;
InferenceEngine::PerfHintsConfig perfHintsConfig;
};
} // namespace CLDNNPlugin

View File

@ -553,14 +553,40 @@ void clDNNEngine::UpdateConfig(CLDNNPlugin::Config& conf, const InferenceEngine:
}
}
std::map<std::string, std::string> clDNNEngine::ConvertPerfHintsToConfig(
const std::map<std::string, std::string>& network_config,
const CLDNNPlugin::Config& plugin_config) const {
// deduces the actual settings from the performance hints and returns fully-defined config
auto config = network_config;
const auto &mode = config.find(PluginConfigParams::KEY_PERFORMANCE_HINT);
// the mode may have just arrived with this LoadNetwork call, or may have been set earlier via the plugin's SetConfig
if (mode != config.end() || !plugin_config.perfHintsConfig.ovPerfHint.empty()) {
const auto mode_name = (mode != config.end())
? PerfHintsConfig::CheckPerformanceHintValue(mode->second)
: plugin_config.perfHintsConfig.ovPerfHint;
// checking the streams setting (to avoid overriding what the user might have explicitly set in the incoming config or previously via SetConfig)
const auto streams = config.find(PluginConfigParams::KEY_GPU_THROUGHPUT_STREAMS);
if (streams == config.end() && !streamsSet) {
if (mode_name == CONFIG_VALUE(LATENCY)) {
config[PluginConfigParams::KEY_GPU_THROUGHPUT_STREAMS] = std::to_string(1);
} else if (mode_name == CONFIG_VALUE(THROUGHPUT)) {
config[PluginConfigParams::KEY_GPU_THROUGHPUT_STREAMS] = CONFIG_VALUE(GPU_THROUGHPUT_AUTO);
config[GPUConfigParams::KEY_GPU_PLUGIN_THROTTLE] = std::to_string(1);
}
}
}
return config;
}
IExecutableNetworkInternal::Ptr clDNNEngine::LoadExeNetworkImpl(const InferenceEngine::CNNNetwork &network,
const std::map<std::string, std::string> &config) {
const std::map<std::string, std::string> &orig_config) {
OV_ITT_SCOPED_TASK(itt::domains::CLDNNPlugin, "clDNNEngine::LoadExeNetworkImpl");
// verification of supported input
InferenceEngine::InputsDataMap _networkInputs = network.getInputsInfo();
check_inputs(_networkInputs);
CLDNNPlugin::Config conf = _impl->m_config;
auto config = ConvertPerfHintsToConfig(orig_config, conf);
UpdateConfig(conf, network, config);
CLDNNRemoteCLContext::Ptr context;
@ -606,7 +632,7 @@ IExecutableNetworkInternal::Ptr clDNNEngine::LoadExeNetworkImpl(const InferenceE
IExecutableNetworkInternal::Ptr clDNNEngine::LoadExeNetworkImpl(const InferenceEngine::CNNNetwork &network,
const IRemoteContext::Ptr &context,
const std::map<std::string, std::string> &config) {
const std::map<std::string, std::string> &orig_config) {
InferenceEngine::InputsDataMap _networkInputs = network.getInputsInfo();
check_inputs(_networkInputs);
@ -616,6 +642,7 @@ IExecutableNetworkInternal::Ptr clDNNEngine::LoadExeNetworkImpl(const InferenceE
}
CLDNNPlugin::Config conf = getContextImpl(casted)->GetConfig();
auto config = ConvertPerfHintsToConfig(orig_config, conf);
UpdateConfig(conf, network, config);
auto transformedNetwork = CloneAndTransformNetwork(network, conf);
@ -647,6 +674,7 @@ IRemoteContext::Ptr clDNNEngine::GetDefaultContext(const ParamMap& params) {
}
void clDNNEngine::SetConfig(const std::map<std::string, std::string> &config) {
streamsSet = (config.find(PluginConfigParams::KEY_GPU_THROUGHPUT_STREAMS) != config.end());
_impl->m_config.UpdateFromMap(config);
}

View File

@ -20,6 +20,7 @@ class clDNNEngine : public InferenceEngine::IInferencePlugin,
public InferenceEngine::gpu::details::param_map_obj_getter {
struct impl;
std::shared_ptr<impl> _impl;
bool streamsSet = false;
// key: device_id, value: cldnn device
std::map<std::string, cldnn::device::ptr> device_map;
@ -31,6 +32,9 @@ class clDNNEngine : public InferenceEngine::IInferencePlugin,
InferenceEngine::CNNNetwork CloneAndTransformNetwork(const InferenceEngine::CNNNetwork& network,
const CLDNNPlugin::Config& config) const;
std::map<std::string, std::string> ConvertPerfHintsToConfig(const std::map<std::string, std::string>& network_config,
const CLDNNPlugin::Config& plugin_config) const;
void RegisterPrimitives();
void UpdateConfig(Config& conf, const InferenceEngine::CNNNetwork &network, const std::map<std::string, std::string> &params) const;
public:

View File

@ -34,11 +34,12 @@ namespace CLDNNPlugin {
CLDNNExecNetwork::CLDNNExecNetwork(InferenceEngine::CNNNetwork &network, std::shared_ptr<IRemoteContext> context, Config config) :
InferenceEngine::ExecutableNetworkThreadSafeDefault{[&]()->InferenceEngine::ITaskExecutor::Ptr {
if (config.throughput_streams > 1) {
if (config.exclusiveAsyncRequests) {
//exclusiveAsyncRequests essentially disables the streams (and hence should be checked first) => aligned with the CPU behavior
return ExecutorManager::getInstance()->getExecutor("GPU");
} else if (config.throughput_streams > 1) {
return std::make_shared<InferenceEngine::CPUStreamsExecutor>(
IStreamsExecutor::Config{"CLDNNPlugin executor", config.throughput_streams});
} else if (config.exclusiveAsyncRequests) {
return ExecutorManager::getInstance()->getExecutor("GPU");
} else {
return std::make_shared<InferenceEngine::CPUStreamsExecutor>(
IStreamsExecutor::Config{"CLDNNPlugin executor", 1});

View File

@ -229,6 +229,21 @@ namespace PluginConfigParams {
#define CONFIG_VALUE(name) InferenceEngine::PluginConfigParams::name
#define DECLARE_CONFIG_VALUE(name) static constexpr auto name = #name
/**
* @brief High-level OpenVINO Performance Hints
* unlike low-level config keys that are individual (per-device), the hints are something that every device accepts
* and turns into device-specific settings
*/
DECLARE_CONFIG_KEY(PERFORMANCE_HINT);
DECLARE_CONFIG_VALUE(LATENCY);
DECLARE_CONFIG_VALUE(THROUGHPUT);
/**
* @brief (Optional) config key that backs the (above) Performance Hints
* by giving additional information on how many inference requests the application will be keeping in flight;
* usually this value comes from the actual use-case (e.g. the number of video cameras or other sources of inputs)
*/
DECLARE_CONFIG_KEY(PERFORMANCE_HINT_NUM_REQUESTS);
/**
* @brief generic boolean values
*/
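For illustration only: a minimal sketch of how an application would pass these new keys through the Inference Engine 2021.4 C++ API (the model path and the request count are hypothetical, not values from this commit):

```cpp
#include <map>
#include <string>
#include <ie_core.hpp>
#include <ie_plugin_config.hpp>

int main() {
    InferenceEngine::Core ie;
    auto network = ie.ReadNetwork("model.xml");  // hypothetical model path
    // The hint (plus, optionally, how many requests the app will keep in flight) is all the
    // application provides; the device deduces streams, threads, etc. on its own.
    const std::map<std::string, std::string> config = {
        {CONFIG_KEY(PERFORMANCE_HINT), CONFIG_VALUE(THROUGHPUT)},
        {CONFIG_KEY(PERFORMANCE_HINT_NUM_REQUESTS), "4"}};
    auto exeNetwork = ie.LoadNetwork(network, "CPU", config);
    // The app can then query how many requests the hint-configured device considers optimal:
    auto nireq = exeNetwork.GetMetric(METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS)).as<unsigned int>();
    (void)nireq;
    return 0;
}
```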

View File

@ -27,6 +27,19 @@ std::vector<std::string> IStreamsExecutor::Config::SupportedKeys() {
CONFIG_KEY_INTERNAL(CPU_THREADS_PER_STREAM),
};
}
int IStreamsExecutor::Config::GetDefaultNumStreams() {
const int sockets = static_cast<int>(getAvailableNUMANodes().size());
// bare minimum of streams (that evenly divides the available number of cores)
const int num_cores = sockets == 1 ? std::thread::hardware_concurrency() : getNumberOfCPUCores();
if (0 == num_cores % 4)
return std::max(4, num_cores / 4);
else if (0 == num_cores % 5)
return std::max(5, num_cores / 5);
else if (0 == num_cores % 3)
return std::max(3, num_cores / 3);
else // if the user disables some cores (say, in BIOS), we may get an unusual #cores that is not easy to divide
return 1;
}
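For illustration, a standalone restatement of the heuristic above with a few worked core counts (the function name is hypothetical; the arithmetic mirrors the code):

```cpp
#include <algorithm>

int default_num_streams_sketch(int num_cores) {
    if (num_cores % 4 == 0)
        return std::max(4, num_cores / 4);  // e.g. 8 cores -> 4 streams, 32 cores -> 8 streams
    if (num_cores % 5 == 0)
        return std::max(5, num_cores / 5);  // e.g. 10 cores -> 5 streams
    if (num_cores % 3 == 0)
        return std::max(3, num_cores / 3);  // e.g. 6 cores -> 3 streams
    return 1;  // e.g. 7 cores: no even division, fall back to a single stream
}
```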
void IStreamsExecutor::Config::SetConfig(const std::string& key, const std::string& value) {
if (key == CONFIG_KEY(CPU_BIND_THREAD)) {
@ -50,17 +63,8 @@ void IStreamsExecutor::Config::SetConfig(const std::string& key, const std::stri
if (value == CONFIG_VALUE(CPU_THROUGHPUT_NUMA)) {
_streams = static_cast<int>(getAvailableNUMANodes().size());
} else if (value == CONFIG_VALUE(CPU_THROUGHPUT_AUTO)) {
const int sockets = static_cast<int>(getAvailableNUMANodes().size());
// bare minimum of streams (that evenly divides available number of cores)
const int num_cores = sockets == 1 ? std::thread::hardware_concurrency() : getNumberOfCPUCores();
if (0 == num_cores % 4)
_streams = std::max(4, num_cores / 4);
else if (0 == num_cores % 5)
_streams = std::max(5, num_cores / 5);
else if (0 == num_cores % 3)
_streams = std::max(3, num_cores / 3);
else // if user disables some cores say in BIOS, so we got weird #cores which is not easy to divide
_streams = 1;
_streams = GetDefaultNumStreams();
} else {
int val_i;
try {

View File

@ -46,16 +46,17 @@ Config::Config() {
updateProperties();
}
void Config::readProperties(const std::map<std::string, std::string> &prop) {
auto streamExecutorConfigKeys = streamExecutorConfig.SupportedKeys();
for (auto& kvp : prop) {
auto& key = kvp.first;
auto& val = kvp.second;
const auto streamExecutorConfigKeys = streamExecutorConfig.SupportedKeys();
const auto hintsConfigKeys = perfHintsConfig.SupportedKeys();
for (const auto& kvp : prop) {
const auto& key = kvp.first;
const auto& val = kvp.second;
if (streamExecutorConfigKeys.end() !=
std::find(std::begin(streamExecutorConfigKeys), std::end(streamExecutorConfigKeys), key)) {
streamExecutorConfig.SetConfig(key, val);
} else if (hintsConfigKeys.end() != std::find(hintsConfigKeys.begin(), hintsConfigKeys.end(), key)) {
perfHintsConfig.SetConfig(key, val);
} else if (key == PluginConfigParams::KEY_DYN_BATCH_LIMIT) {
int val_i = -1;
try {
@ -163,6 +164,9 @@ void Config::updateProperties() {
_config.insert({ PluginConfigParams::KEY_ENFORCE_BF16, PluginConfigParams::YES });
else
_config.insert({ PluginConfigParams::KEY_ENFORCE_BF16, PluginConfigParams::NO });
_config.insert({ PluginConfigParams::KEY_PERFORMANCE_HINT, perfHintsConfig.ovPerfHint });
_config.insert({ PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS,
std::to_string(perfHintsConfig.ovPerfHintNumRequests) });
}
}

View File

@ -5,6 +5,7 @@
#pragma once
#include <threading/ie_istreams_executor.hpp>
#include <ie_performance_hints.hpp>
#include "utils/debug_capabilities.h"
#include <string>
@ -26,7 +27,7 @@ struct Config {
std::string dumpToDot = "";
int batchLimit = 0;
InferenceEngine::IStreamsExecutor::Config streamExecutorConfig;
InferenceEngine::PerfHintsConfig perfHintsConfig;
#if defined(__arm__) || defined(__aarch64__)
// Currently INT8 mode is not optimized on ARM, fallback to FP32 mode.
LPTransformsMode lpTransformsMode = LPTransformsMode::Off;

View File

@ -11,6 +11,7 @@
#include <threading/ie_executor_manager.hpp>
#include <memory>
#include <ie_plugin_config.hpp>
#include <cpp_interfaces/interface/ie_internal_plugin_config.hpp>
#include <vector>
#include <tuple>
#include <unordered_set>
@ -85,6 +86,7 @@
#include <low_precision/network_helper.hpp>
#include <ie_algorithm.hpp>
#include "performance_heuristics.hpp"
#include "nodes/mkldnn_mvn_node.h"
#include "nodes/mkldnn_fake_quantize_node.h"
@ -114,14 +116,12 @@ Engine::~Engine() {
ExecutorManager::getInstance()->clear("CPUCallbackExecutor");
}
static void Transformation(CNNNetwork& clonedNetwork, const Config& conf) {
auto nGraphFunc = clonedNetwork.getFunction();
static void TransformationUpToCPUSpecificOpSet(std::shared_ptr<ngraph::Function> nGraphFunc, const bool _enableLPT) {
ngraph::pass::Manager manager;
manager.register_pass<ngraph::pass::InitNodeInfo>();
const bool useLpt =
(conf.lpTransformsMode == Config::LPTransformsMode::On) &&
_enableLPT &&
ngraph::pass::low_precision::LowPrecision::isFunctionQuantized(nGraphFunc);
if (useLpt) {
manager.register_pass<ngraph::pass::DisableConvertConstantFoldingOnConstPath>(
@ -394,12 +394,16 @@ static void Transformation(CNNNetwork& clonedNetwork, const Config& conf) {
});
postLPTPassManager.run_passes(nGraphFunc);
}
static void Transformation(CNNNetwork& clonedNetwork, const bool _enableLPT) {
auto nGraphFunc = clonedNetwork.getFunction();
TransformationUpToCPUSpecificOpSet(nGraphFunc, _enableLPT);
ConvertToCPUSpecificOpset(nGraphFunc);
}
InferenceEngine::IExecutableNetworkInternal::Ptr
Engine::LoadExeNetworkImpl(const InferenceEngine::CNNNetwork &network, const std::map<std::string, std::string> &config) {
Engine::LoadExeNetworkImpl(const InferenceEngine::CNNNetwork &network, const std::map<std::string, std::string> &orig_config) {
OV_ITT_SCOPED_TASK(itt::domains::MKLDNNPlugin, "Engine::LoadExeNetworkImpl");
// verification of supported input
@ -421,25 +425,97 @@ Engine::LoadExeNetworkImpl(const InferenceEngine::CNNNetwork &network, const std
}
}
// TODO: handle input precision differently - per input and not one per network...
auto config = orig_config;
CNNNetwork clonedNetwork = InferenceEngine::details::cloneNetwork(network);
const auto& lptProp = config.find(InferenceEngine::PluginConfigInternalParams::KEY_LP_TRANSFORMS_MODE);
const bool enableLPT = (lptProp != config.end() && lptProp->second == PluginConfigParams::YES) /* enabled in the orig_config*/
|| Config::LPTransformsMode::On == engConfig.lpTransformsMode /* or already enabled for the plugin */;
auto nGraphFunc = clonedNetwork.getFunction();
TransformationUpToCPUSpecificOpSet(nGraphFunc, enableLPT);
// Here the OV perf modes are turned into specific settings (as we need the network for better params selection)
const auto& mode = config.find(PluginConfigParams::KEY_PERFORMANCE_HINT);
// the mode may have just arrived with this LoadNetwork call, or may have been set earlier via the plugin's SetConfig
if (mode != config.end() || !engConfig.perfHintsConfig.ovPerfHint.empty()) {
const auto mode_name = (mode != config.end())
? PerfHintsConfig::CheckPerformanceHintValue(mode->second) : engConfig.perfHintsConfig.ovPerfHint;
// checking the streams setting (to avoid overriding what the user might have explicitly set in the incoming config or previously via SetConfig)
const auto streams = config.find(PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS);
if (streams == config.end() && !streamsSet) {
if (mode_name == CONFIG_VALUE(LATENCY)) {
config[PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS] = CONFIG_VALUE(CPU_THROUGHPUT_NUMA);
} else if (mode_name == CONFIG_VALUE(THROUGHPUT)) {
const auto isa = dnnl::get_effective_cpu_isa();
float isaSpecificThreshold = 1.0f;
switch (isa) {
case dnnl::cpu_isa::sse41 :
isaSpecificThreshold = 0.5f;
break;
case dnnl::cpu_isa::avx2:
case dnnl::cpu_isa::avx512_core:
isaSpecificThreshold = 1.0f;
break;
case dnnl::cpu_isa::avx512_core_vnni:
case dnnl::cpu_isa::avx2_vnni:
isaSpecificThreshold = 2.0f;
break;
case dnnl::cpu_isa::avx512_core_amx:
isaSpecificThreshold = 4.0f;
break;
default:
isaSpecificThreshold = 1.0f;
}
// the more "capable" the CPU is in general, the more streams we may want in order to keep it utilized
const float memThresholdAssumeLimitedForISA = ov::MemBandwidthPressure::LIMITED/isaSpecificThreshold;
const float L2_cache_size = mkldnn::utils::get_cache_size(2 /*level*/, true /*per core */);
const float L3_cache_size = mkldnn::utils::get_cache_size(3, false);
ov::MemBandwidthPressure networkToleranceForLowCache = ov::MemBandwidthPressureTolerance(
clonedNetwork.getFunction(),
L2_cache_size, L3_cache_size,
memThresholdAssumeLimitedForISA);
// num of phys CPU cores (most aggressive value for #streams)
const auto num_cores = getNumberOfCPUCores();
// less aggressive
const auto num_streams_less_aggressive = num_cores / 2;
// default #streams value (most conservative)
const auto default_num_streams = IStreamsExecutor::Config::GetDefaultNumStreams();
int num_streams = default_num_streams;
if (networkToleranceForLowCache.max_mem_tolerance == ov::MemBandwidthPressure::UNKNOWN) {
if ((networkToleranceForLowCache.ratio_compute_convs == ov::MemBandwidthPressure::ALL)
|| (networkToleranceForLowCache.ratio_compute_deconvs == ov::MemBandwidthPressure::ALL)) {
// all relevant layers (convs, etc) are compute-limited, the most aggressive val for #streams
num_streams = num_cores;
} // otherwise (no recognized layers) falling back to the default value
} else if (networkToleranceForLowCache.max_mem_tolerance > memThresholdAssumeLimitedForISA) {
// the network's memory tolerance is above the ISA-specific threshold (compute-bound enough for this ISA)
num_streams = num_cores;
} else if (networkToleranceForLowCache.max_mem_tolerance > ov::MemBandwidthPressure::LIMITED) {
// the network's memory tolerance is only above the general threshold
num_streams = std::max(default_num_streams, num_streams_less_aggressive);
}
auto num_requests = config.find(PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS);
if (num_requests != config.end())
num_streams = std::min(num_streams, PerfHintsConfig::CheckPerformanceHintRequestValue(num_requests->second));
config[PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS] = std::to_string(num_streams);
}
}
}
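To make the selection logic above concrete, a hedged standalone sketch (hypothetical helper; the constants follow the new performance_heuristics.hpp header):

```cpp
#include <algorithm>
#include <cfloat>

// Sketch only: max_mem_tolerance and the compute-bound flag come from MemBandwidthPressureTolerance().
int pick_num_streams_sketch(float max_mem_tolerance, bool all_convs_compute_bound,
                            float isa_specific_threshold, int num_cores, int default_streams) {
    const float LIMITED = 0.5f;                                // ov::MemBandwidthPressure::LIMITED
    const float isa_limit = LIMITED / isa_specific_threshold;  // e.g. 1.0f for SSE4.1, 0.25f for VNNI
    if (max_mem_tolerance == FLT_MAX)                          // UNKNOWN: no convs/GEMMs recognized
        return all_convs_compute_bound ? num_cores : default_streams;
    if (max_mem_tolerance > isa_limit)                         // compute-bound enough for this ISA
        return num_cores;                                      // most aggressive: one stream per core
    if (max_mem_tolerance > LIMITED)                           // borderline memory pressure
        return std::max(default_streams, num_cores / 2);       // the "less aggressive" value
    return default_streams;                                    // memory-bound: stay conservative
}
```

In the actual code above, the chosen value is additionally capped by PERFORMANCE_HINT_NUM_REQUESTS when the application provides it.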
ConvertToCPUSpecificOpset(nGraphFunc);
// update the props after the perf mode has been translated into the config keys
// TODO: Clarify the behavior of SetConfig method. Skip eng_config or not?
Config conf = engConfig;
conf.readProperties(config);
if (conf.enableDynamicBatch) {
conf.batchLimit = static_cast<int>(network.getBatchSize());
}
CNNNetwork clonedNetwork = InferenceEngine::details::cloneNetwork(network);
Transformation(clonedNetwork, conf);
return std::make_shared<MKLDNNExecNetwork>(clonedNetwork, conf, extensionManager, weightsSharing);
}
void Engine::SetConfig(const std::map<std::string, std::string> &config) {
// accumulate config parameters on engine level
streamsSet = (config.find(PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS) != config.end());
engConfig.readProperties(config);
}
@ -554,7 +630,10 @@ QueryNetworkResult Engine::QueryNetwork(const CNNNetwork& network, const std::ma
auto clonedNetwork = InferenceEngine::details::cloneNetwork(network);
auto ops = clonedNetwork.getFunction()->get_ordered_ops();
Transformation(clonedNetwork, conf);
const auto& lptProp = config.find(InferenceEngine::PluginConfigInternalParams::KEY_LP_TRANSFORMS_MODE);
const bool enableLPT = (lptProp != config.end() && lptProp->second == PluginConfigParams::YES) /* enabled in the orig_config*/
|| Config::LPTransformsMode::On == engConfig.lpTransformsMode /* or already enabled */;
Transformation(clonedNetwork, enableLPT);
std::unordered_set<std::string> supported;
std::unordered_set<std::string> unsupported;
for (auto op : ops) {

View File

@ -13,6 +13,7 @@
#include <memory>
#include <functional>
#include <vector>
#include <cfloat>
namespace MKLDNNPlugin {
@ -40,6 +41,7 @@ private:
Config engConfig;
NumaNodesWeights weightsSharing;
MKLDNNExtensionManager::Ptr extensionManager = std::make_shared<MKLDNNExtensionManager>();
bool streamsSet = false;
};
} // namespace MKLDNNPlugin

View File

@ -237,6 +237,16 @@ InferenceEngine::Parameter MultiDeviceExecutableNetwork::GetConfig(const std::st
if (it != _config.end()) {
return it->second;
} else {
// find config key among networks config keys
for (const auto& desc : _networksPerDevice) {
const auto& execNetwork = desc.second;
auto param = execNetwork->GetMetric(METRIC_KEY(SUPPORTED_CONFIG_KEYS));
for (auto &&configKey : param.as<std::vector<std::string>>()) {
if (configKey == name) {
return execNetwork->GetConfig(configKey);
}
}
}
IE_THROW(NotFound) << name <<" not found in the ExecutableNetwork config";
}
}

View File

@ -16,6 +16,7 @@
#include "ngraph_ops/deconvolution_ie.hpp"
#include <ie_metric_helpers.hpp>
#include <ie_performance_hints.hpp>
#include <threading/ie_executor_manager.hpp>
#include "multi_device_plugin.hpp"
#include <ie_algorithm.hpp>
@ -56,10 +57,12 @@ namespace {
}
return config;
}
const std::vector<std::string> supported_configKeys = {
MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES,
CONFIG_KEY_INTERNAL(MULTI_WORK_MODE_AS_AUTO)
};
std::vector<std::string> supported_configKeys = []() -> decltype(PerfHintsConfig::SupportedKeys()) {
auto res = PerfHintsConfig::SupportedKeys();
res.push_back(MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES);
res.push_back(CONFIG_KEY_INTERNAL(MULTI_WORK_MODE_AS_AUTO));
return res;
}();
} // namespace
std::map<std::string, std::string> MultiDeviceInferencePlugin::GetSupportedConfig(
@ -142,12 +145,16 @@ InferenceEngine::Parameter MultiDeviceInferencePlugin::GetConfig(const std::stri
}
void MultiDeviceInferencePlugin::SetConfig(const std::map<std::string, std::string> & config) {
const auto perf_hints_configs = PerfHintsConfig::SupportedKeys();
for (auto && kvp : config) {
const auto& name = kvp.first;
if (supported_configKeys.end() != std::find(supported_configKeys.begin(), supported_configKeys.end(), name))
if (supported_configKeys.end() != std::find(supported_configKeys.begin(), supported_configKeys.end(), name)) {
if (std::find(perf_hints_configs.begin(), perf_hints_configs.end(), kvp.first) != perf_hints_configs.end())
PerfHintsConfig::CheckConfigAndValue(kvp);
_config[name] = kvp.second;
else
} else {
IE_THROW() << "Unsupported config key: " << name;
}
}
}
@ -235,8 +242,10 @@ IExecutableNetworkInternal::Ptr MultiDeviceInferencePlugin::LoadNetworkImpl(cons
}
// check if it is -d AUTO or -d AUTO:xPU use case
if (workModeAuto) {
auto targetDevice = SelectDevice(metaDevices, networkPrecision);
metaDevices = { targetDevice };
// select the device
auto device = SelectDevice(metaDevices, networkPrecision).deviceName;
// parse the config for the device
metaDevices = ParseMetaDevices(SelectDevice(metaDevices, networkPrecision).deviceName, fullConfig);
}
DeviceMap<SoExecutableNetworkInternal> executableNetworkPerDevice;

View File

@ -0,0 +1,102 @@
// Copyright (C) 2021 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
/**
* @brief A header file for config that holds the performance hints
* @file ie_performance_hints.hpp
*/
#pragma once
#include <ie_parameter.hpp>
#include <ie_plugin_config.hpp>
namespace InferenceEngine {
struct PerfHintsConfig {
std::string ovPerfHint = "";
int ovPerfHintNumRequests = 0;
/**
* @brief Parses configuration key/value pair
* @param key configuration key
* @param value configuration values
*/
void SetConfig(const std::string& key, const std::string& value) {
if (PluginConfigParams::KEY_PERFORMANCE_HINT == key) {
ovPerfHint = CheckPerformanceHintValue(value);
} else if (PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS == key) {
ovPerfHintNumRequests = CheckPerformanceHintRequestValue(value);
}
}
/**
* @brief Return configuration value
* @param key configuration key
* @return configuration value wrapped into Parameter
*/
Parameter GetConfig(const std::string& key) {
if (PluginConfigParams::KEY_PERFORMANCE_HINT == key) {
return ovPerfHint;
} else if (PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS == key) {
return ovPerfHintNumRequests;
} else {
IE_THROW() << "Unsupported Performance Hint config: " << key << std::endl;
}
}
/**
* @brief Supported Configuration keys
* @return vector of supported configuration keys
*/
static std::vector<std::string> SupportedKeys() {
return {PluginConfigParams::KEY_PERFORMANCE_HINT, PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS};
}
/**
* @brief Checks configuration key and value, otherwise throws
* @param configuration key + value
* @return void
*/
static void CheckConfigAndValue(std::pair<const std::string, const std::string&> kvp) {
if (kvp.first == PluginConfigParams::KEY_PERFORMANCE_HINT)
CheckPerformanceHintValue(kvp.second);
else if (kvp.first == PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS)
CheckPerformanceHintRequestValue(kvp.second);
else
IE_THROW() << "Unsupported Performance Hint config: " << kvp.first << std::endl;
}
/**
* @brief Returns configuration value if it is valid, otherwise throws
* @param configuration value
* @return configuration value
*/
static std::string CheckPerformanceHintValue(const std::string& val) {
if (val == PluginConfigParams::LATENCY || val == PluginConfigParams::THROUGHPUT)
return val;
else
IE_THROW() << "Wrong value for property key " << PluginConfigParams::KEY_PERFORMANCE_HINT
<< ". Expected only " << PluginConfigParams::LATENCY << "/" << PluginConfigParams::THROUGHPUT;
}
/**
* @brief Returns configuration value if it is valid, otherwise throws
* @param configuration value as string
* @return configuration value as number
*/
static int CheckPerformanceHintRequestValue(const std::string& val) {
int val_i = -1;
try {
val_i = std::stoi(val);
if (val_i > 0)
return val_i;
else
throw std::logic_error("wrong val");
} catch (const std::exception&) {
IE_THROW() << "Wrong value of " << val << " for property key "
<< PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS
<< ". Expected only positive integer numbers";
}
}
};
} // namespace InferenceEngine
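A brief sketch of how a plugin consumes this helper, mirroring the CPU/GPU config changes elsewhere in this commit (the function name is hypothetical):

```cpp
#include <algorithm>
#include <string>
#include <ie_performance_hints.hpp>

// Sketch of plugin-side usage: route the hint keys to PerfHintsConfig, leave the rest to the plugin.
void read_hint_sketch(InferenceEngine::PerfHintsConfig& hints,
                      const std::string& key, const std::string& val) {
    const auto hintKeys = InferenceEngine::PerfHintsConfig::SupportedKeys();
    if (std::find(hintKeys.begin(), hintKeys.end(), key) != hintKeys.end())
        hints.SetConfig(key, val);  // throws on an invalid hint value or request count
    // otherwise the key is handled by the rest of the plugin's config parsing
}
```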

View File

@ -0,0 +1,136 @@
// Copyright (C) 2018-2021 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
///////////////////////////////////////////////////////////////////////////////////////////////////
#pragma once
#include <cfloat>
#include "ngraph/ngraph.hpp"
namespace ov {
struct MemBandwidthPressure {
float max_mem_tolerance = UNKNOWN;
float ratio_compute_convs = 0;
float ratio_mem_limited_convs = 0;
float ratio_compute_deconvs = 0;
static constexpr float UNKNOWN = FLT_MAX;
static constexpr float ALL = 1.0f;
static constexpr float NONE = 0.0f;
static constexpr float LIMITED = 0.5f; // conservatively assume 1/2 utilization of the cache
};
MemBandwidthPressure MemBandwidthPressureTolerance(
const std::shared_ptr<ngraph::Function> nGraphFunc,
const float L2_cache_size,
const float L3_cache_size,
const float memThresholdAssumeLimited = MemBandwidthPressure::LIMITED) {
int total_convs = 0, mem_limited_convs = 0, compute_convs = 0, total_gemms = 0, mem_limited_gemms = 0,
total_deconvs = 0, compute_deconvs = 0, mem_limited_deconvs = 0;
auto memLimitedFactor = [&](int size_data_moved, int datatype_size = 4) -> float {
return (L2_cache_size * 1.0f /*util factor, tbd */
/ (size_data_moved * datatype_size));
};
auto isLowPrecision = [&](ngraph::element::Type type) -> bool {
return (type == ngraph::element::i8) || (type == ngraph::element::u8);
};
auto isHalfPrecision = [&](ngraph::element::Type type) -> bool {
return (type == ngraph::element::bf16) || (type == ngraph::element::f16);
};
float worst_case = MemBandwidthPressure::UNKNOWN;
// Traverse nGraph Function in topological order
for (auto& node : nGraphFunc->get_ordered_ops()) {
const auto node_name = node->get_type_info().name;
if (std::strcmp("MatMul", node_name) && std::strcmp("Convolution", node_name) &&
std::strcmp("ConvolutionBackpropData", node_name)) {
if (!std::strcmp("GRUSequence", node_name) || !std::strcmp("TensorIterator", node_name)) {
MemBandwidthPressure res;
res.max_mem_tolerance = MemBandwidthPressure::UNKNOWN;
return res;
}
continue;
}
auto type1 = node->input_value(1).get_element_type(); // weights
const bool isINT8 = isLowPrecision(type1);
const bool isBF16orFP16 = isHalfPrecision(type1);
const int data_type_size = isINT8 ? 1 : isBF16orFP16 ? 2 : 4;
int dataSizeInput = 0, dataSizeOutput = 0;
if (!std::strcmp("MatMul", node_name)) {
const auto input0 = node->input(0);
const auto input1 = node->input(1);
const auto output = node->output(0);
// Check that input and output shapes are fully defined (not dynamic)
if (input0.get_partial_shape().is_static() && input1.get_partial_shape().is_static() &&
output.get_partial_shape().is_static()) {
const auto& shapeInput0 = input0.get_shape();
const auto& shapeInput1 = input1.get_shape();
const auto non_const = !get_constant_from_source(node->input_value(1));
const auto& shapeOutput = output.get_shape();
const auto dataSizeInput0 =
std::accumulate(shapeInput0.begin(), shapeInput0.end(), 1, std::multiplies<int>());
const auto dataSizeInput1 =
std::accumulate(shapeInput1.begin(), shapeInput1.end(), 1, std::multiplies<int>());
dataSizeOutput = std::accumulate(shapeOutput.begin(), shapeOutput.end(), 1, std::multiplies<int>());
const auto total_data = dataSizeInput0 + non_const * dataSizeInput1 + dataSizeOutput;
total_gemms++;
const auto factor = memLimitedFactor(total_data, data_type_size);
mem_limited_gemms += factor < memThresholdAssumeLimited;
worst_case = std::min(factor, worst_case);
}
} else if (!std::strcmp("Convolution", node_name)) {
// Check that input and output shapes are fully defined (not dynamic)
const auto input = node->input(0);
const auto output = node->output(0);
const auto kernels = node->input(1);
const auto& shape = kernels.get_shape();
total_convs++;
if (shape.size() >= 4 /* conventional 2D/3D conv */ && shape[2] >= 3 && shape[3] >= 3) {
compute_convs++;
continue;
}
if (input.get_partial_shape().is_static() && output.get_partial_shape().is_static()) {
const auto& shapeInput = input.get_shape();
const auto& shapeOutput = output.get_shape();
if (shapeInput.size() > 4 /*5D*/ && isINT8) {
compute_convs++;
continue;
}
dataSizeInput = std::accumulate(shapeInput.begin(), shapeInput.end(), 1, std::multiplies<int>());
dataSizeOutput = std::accumulate(shapeOutput.begin(), shapeOutput.end(), 1, std::multiplies<int>());
const auto factor = memLimitedFactor(dataSizeInput + dataSizeOutput, data_type_size);
mem_limited_convs += factor < memThresholdAssumeLimited;
worst_case = std::min(factor, worst_case);
}
} else if (!std::strcmp("ConvolutionBackpropData", node_name)) {
const auto input = node->input(0);
const auto output = node->output(0);
total_deconvs++;
// Check that input and output shapes are fully defined (not dynamic)
if (input.get_partial_shape().is_static() && output.get_partial_shape().is_static()) {
const auto shapeInput = input.get_shape();
const auto shapeOutput = output.get_shape();
if (shapeInput.size() > 4 /*5D*/ && isINT8) {
compute_deconvs++;
continue;
}
dataSizeInput = std::accumulate(shapeInput.begin(), shapeInput.end(), 1, std::multiplies<int>());
dataSizeOutput = std::accumulate(shapeOutput.begin(), shapeOutput.end(), 1, std::multiplies<int>());
const auto factor = memLimitedFactor(dataSizeInput + dataSizeOutput, data_type_size);
mem_limited_deconvs += factor < memThresholdAssumeLimited;
worst_case = std::min(factor, worst_case);
}
}
}
MemBandwidthPressure res;
res.max_mem_tolerance = worst_case;
res.ratio_mem_limited_convs = total_convs ? static_cast<float>(mem_limited_convs) / total_convs : 0;
res.ratio_compute_convs = total_convs ? static_cast<float>(compute_convs) / total_convs : 0;
res.ratio_compute_deconvs = total_deconvs ? static_cast<float>(compute_deconvs) / total_deconvs : 0;
return res;
}
} // namespace ov
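To give a feel for the scale of `memLimitedFactor`, a small self-contained computation with illustrative numbers (the cache size and tensor shape are assumptions, not values from this commit):

```cpp
#include <cstdio>

int main() {
    // Assumed: 1 MB of L2 per core and a 1x1 convolution with 1x64x56x56 FP32 input and output
    // (3x3+ kernels are treated as compute-bound above and never reach this computation).
    const float L2_cache_size = 1024.0f * 1024.0f;
    const int elements_moved = 1 * 64 * 56 * 56 /*input*/ + 1 * 64 * 56 * 56 /*output*/;
    const int datatype_size = 4;  // FP32
    const float factor = L2_cache_size / (elements_moved * datatype_size);
    // ~0.65, which is above MemBandwidthPressure::LIMITED (0.5), so this layer would not be
    // counted as memory-limited; halving the spatial size would push the factor well above 1.0.
    std::printf("memLimitedFactor = %.2f\n", factor);
    return 0;
}
```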

View File

@ -82,6 +82,7 @@ public:
* @return configured values
*/
static Config MakeDefaultMultiThreaded(const Config& initial, const bool fp_intesive = true);
static int GetDefaultNumStreams(); // no network specifics considered (only CPU's caps);
std::string _name; //!< Used by `ITT` to name executor threads
int _streams = 1; //!< Number of streams.

View File

@ -18,6 +18,10 @@ namespace {
const std::vector<std::map<std::string, std::string>> Configs = {
{},
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::THROUGHPUT}},
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY}},
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "1"}},
{{InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS, InferenceEngine::PluginConfigParams::CPU_THROUGHPUT_AUTO}},
{{InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS, InferenceEngine::PluginConfigParams::CPU_THROUGHPUT_NUMA}},
{{InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS, "8"}},
@ -27,7 +31,13 @@ namespace {
};
const std::vector<std::map<std::string, std::string>> MultiConfigs = {
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU}}
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::THROUGHPUT}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "1"}}
};
INSTANTIATE_TEST_SUITE_P(smoke_BehaviorTests, CorrectConfigTests,
@ -52,12 +62,25 @@ namespace {
CorrectConfigTests::getTestCaseName);
const std::vector<std::map<std::string, std::string>> inconfigs = {
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, "DOESN'T EXIST"}},
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "-1"}},
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::THROUGHPUT},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "should be int"}},
{{InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS, "OFF"}},
{{InferenceEngine::PluginConfigParams::KEY_CPU_BIND_THREAD, "OFF"}},
{{InferenceEngine::PluginConfigParams::KEY_DYN_BATCH_LIMIT, "NAN"}}
};
const std::vector<std::map<std::string, std::string>> multiinconfigs = {
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, "DOESN'T EXIST"}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "-1"}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::THROUGHPUT},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "should be int"}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
{InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS, "OFF"}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
@ -67,6 +90,13 @@ namespace {
};
const std::vector<std::map<std::string, std::string>> multiconf = {
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::THROUGHPUT}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "1"}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_CPU}}
};

View File

@ -15,6 +15,11 @@ namespace {
IE_SUPPRESS_DEPRECATED_START
const std::vector<std::map<std::string, std::string>> inconfigs = {
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, "DOESN'T EXIST"}},
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "-1"}},
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::THROUGHPUT},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "should be int"}},
{{InferenceEngine::PluginConfigParams::KEY_GPU_THROUGHPUT_STREAMS, "OFF"}},
{{InferenceEngine::PluginConfigParams::KEY_PERF_COUNT, "ON"}},
{{InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE, "unknown_file"}},
@ -24,6 +29,11 @@ namespace {
};
const std::vector<std::map<std::string, std::string>> multiinconfigs = {
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, "DOESN'T EXIST"}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "-1"}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERF_COUNT, "ON"}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
@ -83,11 +93,22 @@ namespace {
{{InferenceEngine::GPUConfigParams::KEY_GPU_MAX_NUM_THREADS, "4"}},
{{InferenceEngine::GPUConfigParams::KEY_GPU_ENABLE_LOOP_UNROLLING, InferenceEngine::PluginConfigParams::YES}},
{{InferenceEngine::GPUConfigParams::KEY_GPU_ENABLE_LOOP_UNROLLING, InferenceEngine::PluginConfigParams::NO}},
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::THROUGHPUT}},
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY}},
{{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "1"}},
};
IE_SUPPRESS_DEPRECATED_END
const std::vector<std::map<std::string, std::string>> multiconf = {
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU}}
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::THROUGHPUT}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY}},
{{InferenceEngine::MultiDeviceConfigParams::KEY_MULTI_DEVICE_PRIORITIES , CommonTestUtils::DEVICE_GPU},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT, InferenceEngine::PluginConfigParams::LATENCY},
{InferenceEngine::PluginConfigParams::KEY_PERFORMANCE_HINT_NUM_REQUESTS, "1"}}
};
INSTANTIATE_TEST_SUITE_P(smoke_BehaviorTests, CorrectConfigAPITests,

View File

@ -1,6 +1,7 @@
# Benchmark Python* Tool {#openvino_inference_engine_tools_benchmark_tool_README}
This topic demonstrates how to run the Benchmark Python* Tool, which performs inference using convolutional networks. Performance can be measured for two inference modes: synchronous (latency-oriented) and asynchronous (throughput-oriented).
This topic demonstrates how to run the Benchmark Python* Tool, which performs inference using convolutional networks.
Performance can be measured for two inference modes: latency- and throughput-oriented.
> **NOTE:** This topic describes usage of Python implementation of the Benchmark Tool. For the C++ implementation, refer to [Benchmark C++ Tool](../../inference-engine/samples/benchmark_app/README.md).
@ -10,33 +11,45 @@ This topic demonstrates how to run the Benchmark Python* Tool, which performs in
> deployment on various Intel® platforms.
## How It Works
Upon start-up, the application reads command-line parameters and loads a network and images/binary files to the Inference Engine plugin, which is chosen depending on a specified device. The number of infer requests and execution approach depend on the mode defined with the `-api` command-line parameter.
Upon start-up, the application reads command-line parameters and loads a network and inputs (images/binary files) to the specified device.
Device-specific execution parameters (number of streams, threads, and so on) can be either explicitly specified through the command line
or left default. In the latter case, the sample logic will select the values for the optimal throughput.
While further experimenting with individual parameters (like the number of streams and requests, batch size, etc) allows you to find the performance sweet spot,
usually, the resulting values are not very performance-portable,
so the values from one machine or device are not necessarily optimal for another.
From this perspective, the most portable way is experimenting only with the performance hints. To learn more, refer to the section below.
> **NOTE**: By default, Inference Engine samples, tools and demos expect input with BGR channels order. If you trained your model to work with RGB order, you need to manually rearrange the default channels order in the sample or demo application or reconvert your model using the Model Optimizer tool with `--reverse_input_channels` argument specified. For more information about the argument, refer to **When to Reverse Input Channels** section of [Converting a Model Using General Conversion Parameters](../../docs/MO_DG/prepare_model/convert_model/Converting_Model_General.md).
### Synchronous API
### Latency and Throughput-focused Inference Modes
In many cases the primary performance metric is the time (in milliseconds) for an individual inference request.
For conventional devices, the best latency is usually achieved when the application operates a single inference request.
Similarly, for some devices the synchronous API (the `Infer` method) yields slightly better latency.
However, advanced devices like multi-socket CPUs, modern GPUs, and so on are capable of running multiple inference requests
while delivering the same latency as with a single request. Also, the asynchronous API is more general and flexible
(with respect to handling multiple inference requests).
Overall, the legacy way of measuring latency (triggered by `-api sync`) with a single request and the synchronous API is discouraged
in favor of the dedicated `-hint latency` that lets the _device_ apply the right settings to minimize the time per request, as in the example below.
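A minimal command line for the hint-based latency measurement (the model and input paths are placeholders):
```
python3 benchmark_app.py -m <model> -i <input> -d CPU -hint latency
```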
For synchronous mode, the primary metric is latency. The application creates one infer request and executes the `Infer` method. A number of executions is defined by one of the two values:
* Number of iterations defined with the `-niter` command-line argument
* Time duration specified with the `-t` command-line argument
Throughput-oriented scenarios, in contrast, are focused on fully saturating the machine with enough data to crunch,
as opposed to the time of the individual request. So, the primary performance metric is rather FPS (frames per second).
Yet, just like in the latency case, the optimal execution parameters may differ between machines and devices.
So, again, as explained in the previous section, the most portable way is to use the dedicated performance hint rather than tuning individual parameters; see the example below.
The hints allow the device to configure the actual settings for the specified mode. The sample then queries and executes the optimal number of inference requests.
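For example, a throughput-hinted run, optionally capped by the number of requests the application is prepared to handle (`-nireq`, which the tool maps to `PERFORMANCE_HINT_NUM_REQUESTS`); the value of 4 is illustrative only:
```
python3 benchmark_app.py -m <model> -i <input> -d GPU -hint throughput
python3 benchmark_app.py -m <model> -i <input> -d GPU -hint throughput -nireq 4
```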
During the execution, the application collects/reports two types of metrics:
* Wall-clock time (latency) of each infer request and resulting latency
* Duration of all inference executions and resulting throughput
By default, the reported latency value is calculated as the median (i.e. the 50th percentile) of all latencies collected from the individual requests.
Notice that you can change the desired percentile with a dedicated command-line flag.
The throughput value is derived from the overall inference execution time and the number of completed requests (respecting the batch size).
### Defining the Number of Inference Executions
The number of executions is defined by one of the following (see the example after this list):
* Explicitly, with the `-niter` command-line argument
* As _time_ duration specified with the `-t` command-line argument
* Both of them (execution will continue until both conditions are met)
* Predefined duration if neither `-niter` nor `-t` is specified. The predefined duration value depends on the device.
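For example (the iteration count and the duration below are illustrative values only):
```
python3 benchmark_app.py -m <model> -i <input> -d CPU -niter 1000
python3 benchmark_app.py -m <model> -i <input> -d CPU -t 30
```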
During the execution, the application collects two types of metrics:
* Latency for each infer request executed with `Infer` method
* Duration of all executions
Reported latency value is calculated as mean value of all collected latencies. Reported throughput value is a derivative from reported latency and additionally depends on batch size.
### Asynchronous API
For asynchronous mode, the primary metric is throughput in frames per second (FPS). The application creates a certain number of infer requests and executes the `StartAsync` method. A number of executions is defined by one of the two values:
* Number of iterations defined with the `-niter` command-line argument
* Time duration specified with the `-t` command-line argument
* Both of them (execution will continue until both conditions are met)
* Predefined duration if `-niter` and `-t` are not specified. Predefined duration value depends on device.
The infer requests are executed asynchronously. Callback is used to wait for previous execution to complete. The application measures all infer requests executions and reports the throughput metric based on batch size and total execution duration.
* Predefined duration if neither `-niter` nor `-t` is specified. The predefined duration value depends on the device.
## Run the Tool
@ -52,7 +65,7 @@ Notice that the benchmark_app usually produces optimal performance for any devic
python3 benchmark_app.py -m <model> -i <input> -d CPU
```
But it is still may be non-optimal for some cases, especially for very small networks. More details can read in [Introduction to Performance Topics](../../docs/IE_DG/Intro_to_Performance.md).
But it still may be sub-optimal for some cases, especially for very small networks. More details can be found in [Introduction to Performance Topics](../../docs/IE_DG/Intro_to_Performance.md).
Running the application with the `-h` or `--help` option yields the following usage message:
@ -60,11 +73,12 @@ Running the application with the `-h` or `--help`' option yields the following u
usage: benchmark_app.py [-h] [-i PATH_TO_INPUT] -m PATH_TO_MODEL
[-d TARGET_DEVICE]
[-l PATH_TO_EXTENSION] [-c PATH_TO_CLDNN_CONFIG]
[-hint {throughput, latency}]
[-api {sync,async}] [-niter NUMBER_ITERATIONS]
[-b BATCH_SIZE]
[-stream_output [STREAM_OUTPUT]] [-t TIME]
[-progress [PROGRESS]] [-nstreams NUMBER_STREAMS]
[-nthreads NUMBER_THREADS] [-pin {YES,NO}]
[-nthreads NUMBER_THREADS] [-pin {YES,NO,NUMA,HYBRID_AWARE}]
[--exec_graph_path EXEC_GRAPH_PATH]
[-pc [PERF_COUNTS]]
@ -90,6 +104,11 @@ Options:
-c PATH_TO_CLDNN_CONFIG, --path_to_cldnn_config PATH_TO_CLDNN_CONFIG
Optional. Required for GPU custom kernels. Absolute
path to an .xml file with the kernels description.
-hint {throughput, latency}, --perf_hint {throughput, latency}
Optional. Performance hint (optimize for latency or throughput).
The hint allows the OpenVINO device to select the right network-specific settings,
as opposed to defining specific values like 'nstreams' from the command line.
So you can specify just the hint without adding explicit device-specific options.
-api {sync,async}, --api_type {sync,async}
Optional. Enable using sync/async API. Default value
is async.
@ -115,7 +134,7 @@ Options:
"input1[NCHW],input2[NC]" or "[NCHW]" in case of one
input size.
-nstreams NUMBER_STREAMS, --number_streams NUMBER_STREAMS
Optional. Number of streams to use for inference on the CPU/GPU in throughput mode
Optional. Number of streams to use for inference on the CPU/GPU/MYX in throughput mode
(for HETERO and MULTI device cases use format <device1>:<nstreams1>,<device2>:<nstreams2> or just <nstreams>).
Default value is determined automatically for a device.
Please note that although the automatic selection usually provides a reasonable performance,
@ -123,9 +142,12 @@ Options:
-nthreads NUMBER_THREADS, --number_threads NUMBER_THREADS
Number of threads to use for inference on the CPU
(including HETERO and MULTI cases).
-pin {YES,NUMA,NO}, --infer_threads_pinning {YES,NUMA,NO}
Optional. Enable threads->cores ("YES", default), threads->(NUMA)nodes ("NUMA") or completely disable
("NO") CPU threads pinning for CPU-involved inference.
-pin {YES,NO,NUMA,HYBRID_AWARE}, --infer_threads_pinning {YES,NO,NUMA,HYBRID_AWARE}
Optional. Enable threads->cores ('YES' which is OpenVINO runtime's default for conventional CPUs),
threads->(NUMA)nodes ('NUMA'),
threads->appropriate core types ('HYBRID_AWARE', which is OpenVINO runtime's default for Hybrid CPUs)
or completely disable ('NO')
CPU threads pinning for CPU-involved inference.
--exec_graph_path EXEC_GRAPH_PATH
Optional. Path to a file where to store executable
graph information serialized.


@ -108,6 +108,12 @@ def run(args):
config[device]['PERF_COUNT'] = 'YES' if args.perf_counts else 'NO'
perf_counts = True if config[device]['PERF_COUNT'] == 'YES' else perf_counts
## high-level performance hints
if is_flag_set_in_command_line('hint'):
config[device]['PERFORMANCE_HINT'] = args.perf_hint.upper()
if is_flag_set_in_command_line('nireq'):
config[device]['PERFORMANCE_HINT_NUM_REQUESTS'] = str(args.number_infer_requests)
## the rest are individual per-device settings (overriding the values the device will deduce from perf hint)
def set_throughput_streams():
key = device + "_THROUGHPUT_STREAMS"
if device in device_number_streams.keys():
@ -117,7 +123,8 @@ def run(args):
raise Exception(f"Device {device} doesn't support config key '{key}'! " +
"Please specify -nstreams for correct devices in format <dev1>:<nstreams1>,<dev2>:<nstreams2>")
config[device][key] = device_number_streams[device]
elif key not in config[device].keys() and args.api_type == "async":
elif key not in config[device].keys() and args.api_type == "async" and not is_flag_set_in_command_line('hint'):
## set the _AUTO value for the #streams
logger.warning(f"-nstreams default value is determined automatically for {device} device. " +
"Although the automatic selection usually provides a reasonable performance,"
"but it still may be non-optimal for some cases, for more information look at README.")
@ -284,13 +291,20 @@ def run(args):
if batch_size == 0:
batch_size = 1
# --------------------- 8. Setting optimal runtime parameters --------------------------------------------------
# --------------------- 8. Querying optimal runtime parameters --------------------------------------------------
next_step()
if is_flag_set_in_command_line('hint'):
## actual device-deduced settings for the hint
for device in devices:
keys = benchmark.ie.get_metric(device, 'SUPPORTED_CONFIG_KEYS')
logger.info(f'DEVICE: {device}')
for k in keys:
logger.info(f' {k} , {exe_network.get_config(k)}')
# Update number of streams
for device in device_number_streams.keys():
key = device + '_THROUGHPUT_STREAMS'
device_number_streams[device] = benchmark.ie.get_config(device, key)
device_number_streams[device] = exe_network.get_config(key)
# Number of requests
infer_requests = exe_network.requests
@ -328,7 +342,7 @@ def run(args):
# ------------------------------------ 10. Measuring performance -----------------------------------------------
output_string = process_help_inference_string(benchmark)
output_string = process_help_inference_string(benchmark, exe_network)
next_step(additional_info=output_string)
progress_bar_total_count = 10000


@ -48,6 +48,11 @@ def parse_args():
args.add_argument('-c', '--path_to_cldnn_config', type=str, required=False,
help='Optional. Required for GPU custom kernels. Absolute path to an .xml file with the '
'kernels description.')
args.add_argument('-hint', '--perf_hint', type=str, required=False, default='', choices=['throughput', 'latency'],
help='Optional. Performance hint (optimize for latency or throughput). '
'The hint allows the OpenVINO device to select the right network-specific settings, '
'as opposed to accepting specific values like \'nstreams\' from the command line. '
'So you can specify just the hint without adding explicit device-specific options')
args.add_argument('-api', '--api_type', type=str, required=False, default='async', choices=['sync', 'async'],
help='Optional. Enable using sync/async API. Default value is async.')
args.add_argument('-niter', '--number_iterations', type=check_positive, required=False, default=None,


@ -30,7 +30,7 @@ def next_step(additional_info='', step_id=0):
5: "Resizing network to match image sizes and given batch",
6: "Configuring input of the model",
7: "Loading the model to the device",
8: "Setting optimal runtime parameters",
8: "Querying optimal runtime parameters",
9: "Creating infer requests and filling input blobs with images",
10: "Measuring performance",
11: "Dumping statistics report",
@ -194,18 +194,18 @@ def parse_nstreams_value_per_device(devices, values_string):
return result
def process_help_inference_string(benchmark_app):
def process_help_inference_string(benchmark_app, exe_network):
output_string = f'Start inference {benchmark_app.api_type}hronously'
if benchmark_app.api_type == 'async':
output_string += f', {benchmark_app.nireq} inference requests'
device_ss = ''
if CPU_DEVICE_NAME in benchmark_app.device:
device_ss += str(benchmark_app.ie.get_config(CPU_DEVICE_NAME, 'CPU_THROUGHPUT_STREAMS'))
device_ss += str(exe_network.get_config('CPU_THROUGHPUT_STREAMS'))
device_ss += f' streams for {CPU_DEVICE_NAME}'
if GPU_DEVICE_NAME in benchmark_app.device:
device_ss += ', ' if device_ss else ''
device_ss += str(benchmark_app.ie.get_config(GPU_DEVICE_NAME, 'GPU_THROUGHPUT_STREAMS'))
device_ss += str(exe_network.get_config('GPU_THROUGHPUT_STREAMS'))
device_ss += f' streams for {GPU_DEVICE_NAME}'
if device_ss: