OpenVINO hybrid awareness (#5261)
* change the deprecated method to the recent
* first ver of the hybrid cores aware CPU streams (+debug info)
* more debug and fixed sum threads
* disabled NUMA pinning to experiment with affinity via OS
* further brushing of stream to core type logic
* hybrid CPU-aware getNumberOfCPUCores
* adding check on the efficiency
* experimental TBB package (that cmake should pull from the internal server)
* iterating over core types in the reversed order (so the big cores are populated first in case the user specified less than all #threads)
* adding back the NUMA affinity code-path for the full validation (incl. 2-socket Windows Server)
* cpplint fix and tabbing the #if clauses for readability
* pre-production TBB from internal server
* wrapping over #cores/types
* wrapping over #cores/types, ver 2
* wrapping over #streams instead
* disabling warnings as errors for a while (to unlock testing)
* accommodating new TBB layout for dependencies.bat
* next tbb ver (with debug binaries that probably can unlock the commodity builds, without playing product_configs)
* minor brushing for experiments (so that pinning can be disabled)
* minor brushing from code review
* Updating the SHA hash which appeared when rebasing to the master
* WIP refactoring
* Completed refactoring of the "config" phase of the cpu stream executor and on-the-fly streams to core types mapping
* making the benchmark_app aware of the new pinning mode
* Brushing a bit (in preparation for the "soft" affinity)
* map to vector to simplify the things
* updated executors comparison
* more fine-grained pinning scheme for the HYBRID (required to allow all cores on 2+8, 1+4, and other LITTLE-skewed scenarios). TODO: separate little to big ratio for the fp32 and int8 (and pass the fp32Only flag to the MakeDefaultMultiThreaded)
* separating fp32 and int8 intensive cases for hybrid execution, also leveraging the HT if the #big_cores is small, refactored. Also switched to the 2021.2 oneTBB RC package
* code style
* stripped tbb archives from unused folders and files, also had to rename the LICENSE.txt to the LICENSE to match existing OV packaging tools
* assigning nodeId regardless of pinning mode
* test OpenCV builds with the same 2021.2 oneTBB, ubuntu 18/20
* cmake install paths for oneTBB, also an ie_parallel.cmake warning on older ver of TBB
* Updated latency case desc to cover multi-socket machines
* adding centos8 OCV with oneTBB build; updating TBB drops with hwloc shared libs added
* enabled internal OCV from THIRD_PARTY_SERVER to test thru CI; added Centos7 no-TBB OCV build (until g-api gets ready for oneTBB) to unlock the Centos7 CI build
* separate rpath logic to respect one-tbb specific paths
* fixed SEQ code-path
* fixed doc misprint
* allowing all cores in 2+8 for int8 as well
* cleaned from debug printfs
* HYBRID_AWARE pinning option for the Python benchmark_app
* OpenVINO Hybrid CPUs support
* Remove custom::task_arena abstraction layout
* Get back to the custom::task_arena interface
* Add windows.h inclusion
* Fix typo in macro name
* Separate TBB and TBBbind packages
* Fix compile-time conditions
* Fix preprocessors conditions
* Fix typo
* Fix linking
* make linking private
* Fix typo
* Fix target_compile_definitions syntax
* Implement CMake install logic, update sha hash for the tbbbind_2_4 package
* Add tbbbind_2_4 required paths to setup_vars
* Update CI paths
* Include ie_parallel.hpp to ie_system_conf.cpp
* Try to update dependencies scripts
* Try to fix dependencies.bat
* Modify dependencies script
* Use static tbbbind_2_4 library
* Remove redundant paths from CI
* Revert "cleaned from debug printfs". This reverts commit 82c9bd90c5. Conflicts: inference-engine/src/inference_engine/os/win/win_system_conf.cpp, inference-engine/src/inference_engine/threading/ie_cpu_streams_executor.cpp, inference-engine/src/mkldnn_plugin/config.cpp
* Update tbbbind package version
* fixed compilation
* removing the direct tbb::info calls from CPU plugin, to aggregate everything in the single module (that exposes the higher level APIs)
* Update tbbbind package version (cherry picked from commit f66b8f6aa6)
* compilation fix
* brushing the headers a bit
* Make custom::task_arena inherited from tbb::task_arena
* change to the latest TBB API, and more debug printfs
* code-style
* ARM compilation
* aligned "failed system config" between OV and TBB (by using '-1')
* macos compilation fix
* default arena creation (to make sure all code-paths have that fallback)
* Encapsulate all TBB version-related logic inside the custom namespace
* Move custom layer header to internal scope + minor improvements
* with all NUMA/Hybrid checks now consolidated in the custom_arena, cleaning the ugly ifdefs that we had
* Introduce new ThreadBindingType + fix compilation
* fixing OMP compilation
* OpenVINO Hybrid CPUs support
* Remove custom::task_arena abstraction layout
* Get back to the custom::task_arena interface
* Add windows.h inclusion
* Fix typo in macro name
* Separate TBB and TBBbind packages
* Fix compile-time conditions
* Fix preprocessors conditions
* Fix typo
* Fix linking
* make linking private
* Fix typo
* Fix target_compile_definitions syntax
* Implement CMake install logic, update sha hash for the tbbbind_2_4 package
* Add tbbbind_2_4 required paths to setup_vars
* Update CI paths
* Include ie_parallel.hpp to ie_system_conf.cpp
* Try to update dependencies scripts
* Try to fix dependencies.bat
* Modify dependencies script
* Use static tbbbind_2_4 library
* Remove redundant paths from CI
* Update tbbbind package version
* Make custom::task_arena inherited from tbb::task_arena
* Encapsulate all TBB version-related logic inside the custom namespace
* Move custom layer header to internal scope + minor improvements
* Introduce new ThreadBindingType + fix compilation
* Fix compilation
* Use public tbbbind_2_4 package
* fixed macos build, corrected comments/desc
* reverted to the default binding selection logic (to preserve the legacy behavior)
* OpenVINO Hybrid CPUs support
* Remove custom::task_arena abstraction layout
* Get back to the custom::task_arena interface
* Add windows.h inclusion
* Fix typo in macro name
* Separate TBB and TBBbind packages
* Fix compile-time conditions
* Fix preprocessors conditions
* Fix typo
* Fix linking
* make linking private
* Fix typo
* Fix target_compile_definitions syntax
* Implement CMake install logic, update sha hash for the tbbbind_2_4 package
* Add tbbbind_2_4 required paths to setup_vars
* Update CI paths
* Include ie_parallel.hpp to ie_system_conf.cpp
* Try to update dependencies scripts
* Try to fix dependencies.bat
* Modify dependencies script
* Use static tbbbind_2_4 library
* Remove redundant paths from CI
* Update tbbbind package version
* Make custom::task_arena inherited from tbb::task_arena
* Encapsulate all TBB version-related logic inside the custom namespace
* Move custom layer header to internal scope + minor improvements
* Introduce new ThreadBindingType + fix compilation
* Fix compilation
* Use public tbbbind_2_4 package
* Apply review comments
* Fix compilation without tbbbind_2_4
* Fix compilation with different TBB versions
* code review remarks
* fix for the NONE pinning code-path under HYBRID_AWARE
* whitespace and cleaning the debug printfs (per review)
* code-review comments
* fixed code-style

Co-authored-by: Kochin, Ivan <ivan.kochin@intel.com>
Co-authored-by: Kochin Ivan <kochin.ivan@intel.com>
parent 8413b85b8a
commit 80f5fe953b
@@ -205,11 +205,16 @@ DECLARE_CONFIG_KEY(CPU_THREADS_NUM);
* @brief The name for setting CPU affinity per thread option.
*
* It is passed to Core::SetConfig(), this option should be used with values:
* PluginConfigParams::YES (pinning threads to cores, best for static benchmarks),
* PluginConfigParams::NUMA (pinning threads to NUMA nodes, best for real-life, contented cases)
* this is TBB-specific knob, and the only pinning option (beyond 'NO', below) on the Windows*
* PluginConfigParams::NO (no pinning for CPU inference threads)
* All settings are ignored, if the OpenVINO compiled with OpenMP threading and any affinity-related OpenMP's
* PluginConfigParams::YES, which is default on the conventional CPUs (pinning threads to cores, best for static benchmarks),
*
* the following options are implemented only for the TBB as a threading option
* PluginConfigParams::NUMA (pinning threads to NUMA nodes, best for real-life, contented cases)
* on the Windows and MacOS* this option behaves as YES
* PluginConfigParams::HYBRID_AWARE (let the runtime to do pinning to the cores types, e.g. prefer the "big" cores for latency tasks)
* on the hybrid CPUs this option is default
*
* Also, the settings are ignored, if the OpenVINO compiled with OpenMP and any affinity-related OpenMP's
* environment variable is set (as affinity is configured explicitly)
*/
DECLARE_CONFIG_KEY(CPU_BIND_THREAD);
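For illustration, the new HYBRID_AWARE value can be passed through Core::SetConfig() like any other plugin config key; the sketch below is a minimal, hypothetical usage example (the model path is made up, and since the runtime already defaults to HYBRID_AWARE on hybrid CPUs, setting it explicitly is optional):

    #include <inference_engine.hpp>

    int main() {
        InferenceEngine::Core core;
        // bind inference threads with awareness of big/LITTLE core types
        core.SetConfig({{CONFIG_KEY(CPU_BIND_THREAD), CONFIG_VALUE(HYBRID_AWARE)}}, "CPU");
        auto network = core.ReadNetwork("model.xml");  // hypothetical model path
        auto executable = core.LoadNetwork(network, "CPU");
        return 0;
    }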
@@ -104,14 +104,16 @@ Options:
estimations the number of streams should be set to 1.
-nthreads "<integer>" Optional. Number of threads to use for inference on the CPU (including HETERO and MULTI cases).
-enforcebf16="<true/false>" Optional. By default floating point operations execution in bfloat16 precision are enforced if supported by platform.
'true' - enable bfloat16 regardless of platform support
'false' - disable bfloat16 regardless of platform support.
-pin "YES"/"NO"/"NUMA" Optional. Enable threads->cores ("YES", default), threads->(NUMA)nodes ("NUMA") or completely disable ("NO") CPU threads pinning for CPU-involved inference.
-pin "YES"/"HYBRID_AWARE"/"NUMA"/"NO"
Optional. Explicit inference threads binding options (leave empty to let the OpenVINO to make a choice):
enabling threads->cores pinning ("YES", which is already default for a conventional CPU),
letting the runtime to decide on the threads->different core types ("HYBRID_AWARE", which is default on the hybrid CPUs)
threads->(NUMA)nodes ("NUMA") or
completely disable ("NO") CPU inference threads pinning.
-ip "U8"/"FP16"/"FP32" Optional. Specifies precision for all input layers of the network.
-op "U8"/"FP16"/"FP32" Optional. Specifies precision for all output layers of the network.
-iop Optional. Specifies precision for input and output layers by name. Example: -iop "input:FP16, output:FP16". Notice that quotes are required. Overwrites precision from ip and op options for specified layers.

Statistics dumping options:
-report_type "<type>" Optional. Enable collecting statistics report. "no_counters" report contains configuration options specified, resulting FPS and latency. "average_counters" report extends "no_counters" report and additionally includes average PM counters values for each layer from the network. "detailed_counters" report extends "average_counters" report and additionally includes per-layer PM counters and latency for each executed infer request.
-report_folder Optional. Path to a folder where statistics report is stored.
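With the extended -pin choices above, a typical latency-oriented run on a hybrid desktop could look like ./benchmark_app -m model.xml -d CPU -pin HYBRID_AWARE -nstreams 1 (the model path is hypothetical); leaving -pin empty keeps the runtime default, which is HYBRID_AWARE on hybrid CPUs and YES on conventional ones.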
@@ -73,10 +73,12 @@ static const char batch_size_message[] = "Optional. Batch size value. If not spe
"Intermediate Representation.";

// @brief message for CPU threads pinning option
static const char infer_threads_pinning_message[] = "Optional. Enable threads->cores (\"YES\", default), threads->(NUMA)nodes (\"NUMA\") "
"or completely disable (\"NO\") "
"CPU threads pinning for CPU-involved inference.";

static const char infer_threads_pinning_message[] =
"Optional. Explicit inference threads binding options (leave empty to let the OpenVINO to make a choice):\n"
"\t\t\t\tenabling threads->cores pinning(\"YES\", which is already default for any conventional CPU), \n"
"\t\t\t\tletting the runtime to decide on the threads->different core types(\"HYBRID_AWARE\", which is default on the hybrid CPUs) \n"
"\t\t\t\tthreads->(NUMA)nodes(\"NUMA\") or \n"
"\t\t\t\tcompletely disable(\"NO\") CPU inference threads pinning";
// @brief message for stream_output option
static const char stream_output_message[] = "Optional. Print progress as a plain text. When specified, an interactive progress bar is "
"replaced with a "
@@ -187,7 +189,7 @@ DEFINE_bool(enforcebf16, false, enforce_bf16_message);
DEFINE_uint32(b, 0, batch_size_message);

// @brief Enable plugin messages
DEFINE_string(pin, "YES", infer_threads_pinning_message);
DEFINE_string(pin, "", infer_threads_pinning_message);

/// @brief Enables multiline text output instead of progress bar
DEFINE_bool(stream_output, false, stream_output_message);
@@ -264,7 +266,7 @@ static void showUsage() {
std::cout << " -nstreams \"<integer>\" " << infer_num_streams_message << std::endl;
std::cout << " -nthreads \"<integer>\" " << infer_num_threads_message << std::endl;
std::cout << " -enforcebf16=<true/false> " << enforce_bf16_message << std::endl;
std::cout << " -pin \"YES\"/\"NO\"/\"NUMA\" " << infer_threads_pinning_message << std::endl;
std::cout << " -pin \"YES\"/\"HYBRID_AWARE\"/\"NO\"/\"NUMA\" " << infer_threads_pinning_message << std::endl;
std::cout << std::endl << " Statistics dumping options:" << std::endl;
std::cout << " -report_type \"<type>\" " << report_type_message << std::endl;
std::cout << " -report_folder " << report_folder_message << std::endl;
@@ -267,9 +267,6 @@ int main(int argc, char* argv[]) {
if ((device_name.find("MULTI") != std::string::npos) && (device_name.find("GPU") != std::string::npos)) {
slog::warn << "Turn off threads pinning for " << device << " device since multi-scenario with GPU device is used." << slog::endl;
device_config[CONFIG_KEY(CPU_BIND_THREAD)] = CONFIG_VALUE(NO);
} else {
// set to default value
device_config[CONFIG_KEY(CPU_BIND_THREAD)] = FLAGS_pin;
}
}
@@ -90,9 +90,9 @@ bool checkOpenMpEnvVars(bool includeOMPNumThreads) {
#if defined(__APPLE__)
// for Linux and Windows the getNumberOfCPUCores (that accounts only for physical cores) implementation is OS-specific
// (see cpp files in corresponding folders), for __APPLE__ it is default :
int getNumberOfCPUCores() { return parallel_get_max_threads();}
int getNumberOfCPUCores(bool) { return parallel_get_max_threads();}
#if !((IE_THREAD == IE_THREAD_TBB) || (IE_THREAD == IE_THREAD_TBB_AUTO))
std::vector<int> getAvailableNUMANodes() { return {0}; }
std::vector<int> getAvailableNUMANodes() { return {-1}; }
#endif
#endif

@@ -100,6 +100,15 @@ std::vector<int> getAvailableNUMANodes() { return {0}; }
std::vector<int> getAvailableNUMANodes() {
return custom::info::numa_nodes();
}
// this is impl only with the TBB
std::vector<int> getAvailableCoresTypes() {
return custom::info::core_types();
}
#else
// as the core types support exists only with the TBB, the fallback is same for any other threading API
std::vector<int> getAvailableCoresTypes() {
return {-1};
}
#endif

std::exception_ptr& CurrentException() {
@@ -7,11 +7,12 @@
#include <string>
#include <vector>
#include <iostream>
#include <sched.h>
#include "ie_system_conf.h"
#include "ie_parallel.hpp"
#include "ie_common.h"
#include <numeric>
#include <sched.h>

#include "ie_common.h"
#include "ie_system_conf.h"
#include "threading/ie_parallel_custom_arena.hpp"

namespace InferenceEngine {
@@ -61,7 +62,7 @@ std::vector<int> getAvailableNUMANodes() {
return nodes;
}
#endif
int getNumberOfCPUCores() {
int getNumberOfCPUCores(bool bigCoresOnly) {
unsigned numberOfProcessors = cpu._processors;
unsigned totalNumberOfCpuCores = cpu._cores;
IE_ASSERT(totalNumberOfCpuCores != 0);
@@ -81,7 +82,16 @@ int getNumberOfCPUCores() {
}
}
}
return CPU_COUNT(&currentCoreSet);
int phys_cores = CPU_COUNT(&currentCoreSet);
#if (IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO)
auto core_types = custom::info::core_types();
if (bigCoresOnly && core_types.size() > 1) /*Hybrid CPU*/ {
phys_cores = custom::info::default_concurrency(custom::task_arena::constraints{}
.set_core_type(core_types.back())
.set_max_threads_per_core(1));
}
#endif
return phys_cores;
}

} // namespace InferenceEngine
@@ -10,10 +10,10 @@
#include <memory>
#include <vector>
#include "ie_system_conf.h"
#include "ie_parallel.hpp"
#include "threading/ie_parallel_custom_arena.hpp"

namespace InferenceEngine {
int getNumberOfCPUCores() {
int getNumberOfCPUCores(bool bigCoresOnly) {
const int fallback_val = parallel_get_max_threads();
DWORD sz = 0;
// querying the size of the resulting structure, passing the nullptr for the buffer
@@ -32,12 +32,21 @@ int getNumberOfCPUCores() {
offset += reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(ptr.get() + offset)->Size;
phys_cores++;
} while (offset < sz);

#if (IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO)
auto core_types = custom::info::core_types();
if (bigCoresOnly && core_types.size() > 1) /*Hybrid CPU*/ {
phys_cores = custom::info::default_concurrency(custom::task_arena::constraints{}
.set_core_type(core_types.back())
.set_max_threads_per_core(1));
}
#endif
return phys_cores;
}

#if !(IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO)
// OMP/SEQ threading on the Windows doesn't support NUMA
std::vector<int> getAvailableNUMANodes() { return std::vector<int>(1, 0); }
std::vector<int> getAvailableNUMANodes() { return {-1}; }
#endif

} // namespace InferenceEngine
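Both OS-specific implementations above delegate the "big cores only" count to the custom arena wrapper. Roughly the same query can be expressed directly against the public oneTBB API; the following standalone sketch assumes oneTBB 2021.2+ with the hybrid-CPU support libraries available (the preview macro may or may not be required, depending on the exact oneTBB version):

    #define TBB_PREVIEW_TASK_ARENA_CONSTRAINTS_EXTENSION 1  // needed on some early oneTBB 2021 releases
    #include <oneapi/tbb/info.h>
    #include <oneapi/tbb/task_arena.h>
    #include <iostream>

    int main() {
        // core_types() is sorted from the least to the most performant type,
        // so back() is the "big" core type on a hybrid CPU
        const auto core_types = oneapi::tbb::info::core_types();
        oneapi::tbb::task_arena::constraints c;
        c.set_core_type(core_types.back());
        c.set_max_threads_per_core(1);  // physical cores only, i.e. no hyper-threading
        std::cout << "big physical cores: " << oneapi::tbb::info::default_concurrency(c) << std::endl;
        return 0;
    }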
@@ -71,19 +71,30 @@ struct CPUStreamsExecutor::Impl {
((_impl->_config._streams + _impl->_usedNumaNodes.size() - 1)/_impl->_usedNumaNodes.size()))
: _impl->_usedNumaNodes.at(_streamId % _impl->_usedNumaNodes.size());
#if IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO
auto concurrency = (0 == _impl->_config._threadsPerStream) ? custom::task_arena::automatic : _impl->_config._threadsPerStream;
const auto concurrency = (0 == _impl->_config._threadsPerStream) ? custom::task_arena::automatic : _impl->_config._threadsPerStream;
if (ThreadBindingType::HYBRID_AWARE == _impl->_config._threadBindingType) {
_taskArena.reset(new custom::task_arena{
custom::task_arena::constraints{}
.set_core_type(custom::info::core_types().back())
.set_max_concurrency(concurrency)
});
if (Config::PreferredCoreType::ROUND_ROBIN != _impl->_config._threadPreferredCoreType) {
if (Config::PreferredCoreType::ANY == _impl->_config._threadPreferredCoreType) {
_taskArena.reset(new custom::task_arena{concurrency});
} else {
const auto selected_core_type = Config::PreferredCoreType::BIG == _impl->_config._threadPreferredCoreType
? custom::info::core_types().back() // running on Big cores only
: custom::info::core_types().front(); // running on Little cores only
_taskArena.reset(new custom::task_arena{
custom::task_arena::constraints{}.set_core_type(selected_core_type).set_max_concurrency(concurrency)});
}
} else {
// assigning the stream to the core type in the round-robin fashion
// wrapping around total_streams (i.e. how many streams all different core types can handle together)
const auto total_streams = _impl->total_streams_on_core_types.back().second;
const auto streamId_wrapped = _streamId % total_streams;
const auto& selected_core_type = std::find_if(_impl->total_streams_on_core_types.cbegin(), _impl->total_streams_on_core_types.cend(),
[streamId_wrapped](const decltype(_impl->total_streams_on_core_types)::value_type & p) { return p.second > streamId_wrapped; })->first;
_taskArena.reset(new custom::task_arena{
custom::task_arena::constraints{}.set_core_type(selected_core_type).set_max_concurrency(concurrency)});
}
} else if (ThreadBindingType::NUMA == _impl->_config._threadBindingType) {
_taskArena.reset(new custom::task_arena{
custom::task_arena::constraints{}
.set_numa_id(_numaNodeId)
.set_max_concurrency(concurrency)
});
_taskArena.reset(new custom::task_arena{custom::task_arena::constraints{_numaNodeId, concurrency}});
} else if ((0 != _impl->_config._threadsPerStream) || (ThreadBindingType::CORES == _impl->_config._threadBindingType)) {
_taskArena.reset(new custom::task_arena{concurrency});
if (ThreadBindingType::CORES == _impl->_config._threadBindingType) {
@@ -164,6 +175,25 @@ struct CPUStreamsExecutor::Impl {
} else {
_usedNumaNodes = numaNodes;
}
#if (IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO)
if (ThreadBindingType::HYBRID_AWARE == config._threadBindingType) {
const auto core_types = custom::info::core_types();
const int threadsPerStream = (0 == config._threadsPerStream) ? std::thread::hardware_concurrency() : config._threadsPerStream;
int sum = 0;
// reversed order, so BIG cores are first
for (auto iter = core_types.rbegin(); iter < core_types.rend(); iter++) {
const auto& type = *iter;
// calculating the #streams per core type
const int num_streams_for_core_type = std::max(1,
custom::info::default_concurrency(
custom::task_arena::constraints{}.set_core_type(type)) / threadsPerStream);
sum += num_streams_for_core_type;
// prefix sum, so the core type for a given stream id will be deduced just as a upper_bound
// (notice that the map keeps the elements in the descending order, so the big cores are populated first)
total_streams_on_core_types.push_back({type, sum});
}
}
#endif
for (auto streamId = 0; streamId < _config._streams; ++streamId) {
_threads.emplace_back([this, streamId] {
openvino::itt::threadName(_config._name + "_" + std::to_string(streamId));
@@ -232,6 +262,14 @@ struct CPUStreamsExecutor::Impl {
bool _isStopped = false;
std::vector<int> _usedNumaNodes;
ThreadLocal<std::shared_ptr<Stream>> _streams;
#if (IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO)
// stream id mapping to the core type
// stored in the reversed order (so the big cores, with the highest core_type_id value, are populated first)
// every entry is the core type and #streams that this AND ALL EARLIER entries can handle (prefix sum)
// (so mapping is actually just an upper_bound: core type is deduced from the entry for which the id < #streams)
using StreamIdToCoreTypes = std::vector<std::pair<custom::core_type_id, int>>;
StreamIdToCoreTypes total_streams_on_core_types;
#endif
};
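The round-robin branch above leans on the prefix-sum table documented in the comments. Here is a self-contained sketch of just that mapping, with made-up numbers for a hypothetical 2 big + 8 little CPU running one thread per stream (no TBB involved, only the lookup logic):

    #include <algorithm>
    #include <cstdio>
    #include <utility>
    #include <vector>

    int main() {
        // (core_type_id, prefix sum of streams), big cores first: 2 big + 8 little
        std::vector<std::pair<int, int>> total_streams_on_core_types{{1, 2}, {0, 10}};
        const int total_streams = total_streams_on_core_types.back().second;
        for (int streamId = 0; streamId < 12; ++streamId) {
            const int wrapped = streamId % total_streams;  // wrap around for large #streams
            const auto core_type = std::find_if(total_streams_on_core_types.cbegin(),
                                                total_streams_on_core_types.cend(),
                                                [wrapped](const std::pair<int, int>& p) {
                                                    return p.second > wrapped;
                                                })->first;
            std::printf("stream %d -> core type %d\n", streamId, core_type);
        }
        return 0;
    }

Streams 0-1 land on the big core type, 2-9 on the little one, and 10-11 wrap around to the big cores again.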
@@ -36,6 +36,8 @@ IStreamsExecutor::Ptr ExecutorManagerImpl::getIdleCPUStreamsExecutor(const IStre
executorConfig._threadBindingType == config._threadBindingType &&
executorConfig._threadBindingStep == config._threadBindingStep &&
executorConfig._threadBindingOffset == config._threadBindingOffset)
if (executorConfig._threadBindingType != IStreamsExecutor::ThreadBindingType::HYBRID_AWARE
|| executorConfig._threadPreferredCoreType == config._threadPreferredCoreType)
return executor;
}
auto newExec = std::make_shared<CPUStreamsExecutor>(config);
@@ -6,6 +6,7 @@
#include "ie_plugin_config.hpp"
#include "cpp_interfaces/interface/ie_internal_plugin_config.hpp"
#include "ie_parallel.hpp"
#include "ie_parallel_custom_arena.hpp"
#include "ie_system_conf.h"
#include "ie_parameter.hpp"
#include <string>
@@ -29,32 +30,27 @@ std::vector<std::string> IStreamsExecutor::Config::SupportedKeys() {
void IStreamsExecutor::Config::SetConfig(const std::string& key, const std::string& value) {
if (key == CONFIG_KEY(CPU_BIND_THREAD)) {
if (value == CONFIG_VALUE(YES) || value == CONFIG_VALUE(NUMA)) {
#if (IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO) && (TBB_INTERFACE_VERSION < 11100)
if (value == CONFIG_VALUE(NUMA))
IE_THROW() << CONFIG_KEY(CPU_BIND_THREAD) << " property value was set to NUMA. But IE was built with "
<< "TBB version without NUMA-aware API. Current TBB API version is " << TBB_INTERFACE_VERSION
<< ", required API version 11100 or greater.";
#endif

#if (defined(__APPLE__) || defined(_WIN32))
// on the Windows and Apple the CORES and NUMA pinning options are the same
#if (defined(__APPLE__) || defined(_WIN32))
_threadBindingType = IStreamsExecutor::ThreadBindingType::NUMA;
#else
#else
_threadBindingType = (value == CONFIG_VALUE(YES))
? IStreamsExecutor::ThreadBindingType::CORES : IStreamsExecutor::ThreadBindingType::NUMA;
#endif
#endif
} else if (value == CONFIG_VALUE(HYBRID_AWARE)) {
_threadBindingType = IStreamsExecutor::ThreadBindingType::HYBRID_AWARE;
} else if (value == CONFIG_VALUE(NO)) {
_threadBindingType = IStreamsExecutor::ThreadBindingType::NONE;
} else {
IE_THROW() << "Wrong value for property key " << CONFIG_KEY(CPU_BIND_THREAD)
<< ". Expected only YES(binds to cores) / NO(no binding) / NUMA(binds to NUMA nodes)";
<< ". Expected only YES(binds to cores) / NO(no binding) / NUMA(binds to NUMA nodes) / "
"HYBRID_AWARE (let the runtime recognize and use the hybrid cores)";
}
} else if (key == CONFIG_KEY(CPU_THROUGHPUT_STREAMS)) {
if (value == CONFIG_VALUE(CPU_THROUGHPUT_NUMA)) {
_streams = static_cast<int>(getAvailableNUMANodes().size());
} else if (value == CONFIG_VALUE(CPU_THROUGHPUT_AUTO)) {
const int sockets = static_cast<int>(getAvailableNUMANodes().size());
// bare minimum of streams (that evenly divides available number of core)
// bare minimum of streams (that evenly divides available number of cores)
const int num_cores = sockets == 1 ? std::thread::hardware_concurrency() : getNumberOfCPUCores();
if (0 == num_cores % 4)
_streams = std::max(4, num_cores / 4);
@@ -138,12 +134,52 @@ Parameter IStreamsExecutor::Config::GetConfig(const std::string& key) {
return {};
}

IStreamsExecutor::Config IStreamsExecutor::Config::MakeDefaultMultiThreaded(const IStreamsExecutor::Config& initial) {
IStreamsExecutor::Config IStreamsExecutor::Config::MakeDefaultMultiThreaded(const IStreamsExecutor::Config& initial, const bool fp_intesive) {
const auto envThreads = parallel_get_env_threads();
const auto& numaNodes = getAvailableNUMANodes();
const auto numaNodesNum = numaNodes.size();
const int numaNodesNum = numaNodes.size();
auto streamExecutorConfig = initial;
const auto hwCores = streamExecutorConfig._streams > 1 && numaNodesNum == 1 ? parallel_get_max_threads() : getNumberOfCPUCores();
const bool bLatencyCase = streamExecutorConfig._streams <= numaNodesNum;

// by default, do not use the hyper-threading (to minimize threads synch overheads)
int num_cores_default = getNumberOfCPUCores();
#if (IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO)
//additional latency-case logic for hybrid processors:
if (ThreadBindingType::HYBRID_AWARE == streamExecutorConfig._threadBindingType) {
const auto core_types = custom::info::core_types();
const auto num_little_cores = custom::info::default_concurrency(custom::task_arena::constraints{}.set_core_type(core_types.front()));
const auto num_big_cores_phys = getNumberOfCPUCores(true);
const int int8_threshold = 4; // ~relative efficiency of the VNNI-intensive code for Big vs Little cores;
const int fp32_threshold = 2; // ~relative efficiency of the AVX2 fp32 code for Big vs Little cores;
// by default the latency case uses (faster) Big cores only, depending on the compute ratio
const bool bLatencyCaseBigOnly = num_big_cores_phys > (num_little_cores / (fp_intesive ? fp32_threshold : int8_threshold));
// selecting the preferred core type
streamExecutorConfig._threadPreferredCoreType =
bLatencyCase
? (bLatencyCaseBigOnly
? IStreamsExecutor::Config::PreferredCoreType::BIG
: IStreamsExecutor::Config::PreferredCoreType::ANY)
: IStreamsExecutor::Config::PreferredCoreType::ROUND_ROBIN;
// additionally selecting the #cores to use in the "Big-only" case
if (bLatencyCaseBigOnly) {
const int hyper_threading_threshold = 2; // min #cores, for which the hyper-threading becomes useful for the latency case
const auto num_big_cores = custom::info::default_concurrency(custom::task_arena::constraints{}.set_core_type(core_types.back()));
num_cores_default = (num_big_cores_phys <= hyper_threading_threshold) ? num_big_cores : num_big_cores_phys;
}
}
#endif
const auto hwCores = !bLatencyCase && numaNodesNum == 1
// throughput case on a single-NUMA node machine uses all available cores
? parallel_get_max_threads()
// in the rest of cases:
// multi-node machine
// or
// latency case, single-node yet hybrid case that uses
// all core types
// or
// big-cores only, but the #cores is "enough" (pls see the logic above)
// it is usually beneficial not to use the hyper-threading (which is default)
: num_cores_default;
const auto threads = streamExecutorConfig._threads ? streamExecutorConfig._threads : (envThreads ? envThreads : hwCores);
streamExecutorConfig._threadsPerStream = streamExecutorConfig._streams
? std::max(1, threads/streamExecutorConfig._streams)
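To make the latency-case heuristic above easier to follow, here is the same decision rule as a small standalone function; the core counts in main() are invented (an 8 big + 8 little part with hyper-threading on the big cores), and the thresholds simply mirror the constants above:

    #include <cstdio>

    // big cores are preferred for latency only when there are "enough" of them relative to
    // the little cores; the ratio is stricter for int8 (VNNI) than for fp32 (AVX2) workloads
    static bool latency_uses_big_only(int big_phys, int little, bool fp_intensive) {
        const int int8_threshold = 4;
        const int fp32_threshold = 2;
        return big_phys > (little / (fp_intensive ? fp32_threshold : int8_threshold));
    }

    int main() {
        const int big_phys = 8, big_logical = 16, little = 8;  // made-up 8+8 hybrid CPU
        for (bool fp : {true, false}) {
            const bool big_only = latency_uses_big_only(big_phys, little, fp);
            // with very few big cores their hyper-threads are used too, otherwise physical big cores only
            const int threads = big_only ? (big_phys <= 2 ? big_logical : big_phys) : big_phys + little;
            std::printf("%s-intensive model: %s, %d threads\n",
                        fp ? "fp32" : "int8", big_only ? "big cores only" : "all cores", threads);
        }
        return 0;
    }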
@@ -13,7 +13,6 @@
#include "ie_parallel.hpp"
#include "ie_system_conf.h"

#include <cpp_interfaces/exception2status.hpp>
#include <cpp_interfaces/interface/ie_internal_plugin_config.hpp>

namespace MKLDNNPlugin {
@@ -21,16 +20,20 @@ namespace MKLDNNPlugin {
using namespace InferenceEngine;

Config::Config() {
#if (defined(__APPLE__) || defined(_WIN32))
#if (IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO) && (TBB_INTERFACE_VERSION >= 11100)
// If we sure that TBB has NUMA aware API part.
streamExecutorConfig._threadBindingType = InferenceEngine::IStreamsExecutor::NUMA;
#else
streamExecutorConfig._threadBindingType = InferenceEngine::IStreamsExecutor::NONE;
#endif
#else
// this is default mode
streamExecutorConfig._threadBindingType = InferenceEngine::IStreamsExecutor::CORES;
#endif

// for the TBB code-path, additional configuration depending on the OS and CPU types
#if (IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO)
#if defined(__APPLE__) || defined(_WIN32)
// 'CORES' is not implemented for Win/MacOS; so the 'NUMA' is default
streamExecutorConfig._threadBindingType = InferenceEngine::IStreamsExecutor::NUMA;
#endif

if (getAvailableCoresTypes().size() > 1 /*Hybrid CPU*/) {
streamExecutorConfig._threadBindingType = InferenceEngine::IStreamsExecutor::HYBRID_AWARE;
}
#endif

if (!with_cpu_x86_bfloat16())
enforceBF16 = false;
@@ -128,7 +131,7 @@ void Config::updateProperties() {
_config.insert({ PluginConfigParams::KEY_CPU_BIND_THREAD, PluginConfigParams::NUMA });
break;
case IStreamsExecutor::ThreadBindingType::HYBRID_AWARE:
_config.insert({ PluginConfigParams::KEY_CPU_BIND_THREAD, PluginConfigParams::HYBRID_AWARE});
_config.insert({ PluginConfigParams::KEY_CPU_BIND_THREAD, PluginConfigParams::HYBRID_AWARE });
break;
}
if (collectPerfCounters == true)
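In short, with a TBB build the CPU plugin now defaults to NUMA binding on Windows/macOS and CORES binding on Linux, and promotes itself to HYBRID_AWARE whenever getAvailableCoresTypes() reports more than one core type; non-TBB builds keep NONE on Windows/macOS and CORES elsewhere.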
@@ -49,11 +49,11 @@ MKLDNNExecNetwork::MKLDNNExecNetwork(const InferenceEngine::CNNNetwork &network,
// we are cloning network if we have statistics and we can transform network.
_clonedNetwork = cloneNetwork(network);

bool isFloatModel = true;
if (_cfg.lpTransformsMode == Config::LPTransformsMode::On) {
// Check if network is INT8 or Binary.
// BF16 transformations were disabled since CPU plug-in doesn't support mixed precision execution:
// BF16 + INT8 or BF16 + BIN.
bool isFloatModel = true;
CNNNetworkIterator iter(network);
while (iter != CNNNetworkIterator()) {
if (CaselessEq<std::string>()((*iter)->type, "FakeQuantize")) {
@@ -229,7 +229,7 @@ MKLDNNExecNetwork::MKLDNNExecNetwork(const InferenceEngine::CNNNetwork &network,
// special case when all InferRequests are muxed into a single queue
_taskExecutor = InferenceEngine::ExecutorManager::getInstance()->getExecutor("CPU");
} else {
auto streamsExecutorConfig = InferenceEngine::IStreamsExecutor::Config::MakeDefaultMultiThreaded(_cfg.streamExecutorConfig);
auto streamsExecutorConfig = InferenceEngine::IStreamsExecutor::Config::MakeDefaultMultiThreaded(_cfg.streamExecutorConfig, isFloatModel);
streamsExecutorConfig._name = "CPUStreamsExecutor";
_taskExecutor = InferenceEngine::ExecutorManager::getInstance()->getIdleCPUStreamsExecutor(streamsExecutorConfig);
}
@@ -37,12 +37,23 @@ INFERENCE_ENGINE_API_CPP(bool) checkOpenMpEnvVars(bool includeOMPNumThreads = tr
INFERENCE_ENGINE_API_CPP(std::vector<int>) getAvailableNUMANodes();

/**
* @brief Returns number of CPU physical cores on Linux/Windows (which is considered to be more performance friendly for servers)
* (on other OSes it simply relies on the original parallel API of choice, which usually uses the logical cores )
* @brief Returns available CPU cores types (on Linux, and Windows) and ONLY with TBB, single core type is assumed otherwise
* @ingroup ie_dev_api_system_conf
* @return Vector of core types
*/
INFERENCE_ENGINE_API_CPP(std::vector<int>) getAvailableCoresTypes();

/**
* @brief Returns number of CPU physical cores on Linux/Windows (which is considered to be more performance friendly for servers)
* (on other OSes it simply relies on the original parallel API of choice, which usually uses the logical cores).
* call function with 'false' to get #phys cores of all types
* call function with 'true' to get #phys 'Big' cores
* number of 'Little' = 'all' - 'Big'
* @ingroup ie_dev_api_system_conf
* @param[in] bigCoresOnly Additionally limits the number of reported cores to the 'Big' cores only.
* @return Number of physical CPU cores.
*/
INFERENCE_ENGINE_API_CPP(int) getNumberOfCPUCores();
INFERENCE_ENGINE_API_CPP(int) getNumberOfCPUCores(bool bigCoresOnly = false);

/**
* @brief Checks whether CPU supports SSE 4.2 capability
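A small, hypothetical usage sketch of the extended query from plugin/dev-API code (the printed numbers obviously depend on the machine):

    #include <cstdio>
    #include "ie_system_conf.h"

    int main() {
        const int all_phys = InferenceEngine::getNumberOfCPUCores();      // all physical cores
        const int big_phys = InferenceEngine::getNumberOfCPUCores(true);  // 'Big' physical cores only
        const int little   = all_phys - big_phys;                         // per the doc comment above
        std::printf("physical cores: %d (big: %d, little: %d)\n", all_phys, big_phys, little);
        return 0;
    }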
@@ -36,7 +36,7 @@ public:
using Ptr = std::shared_ptr<IStreamsExecutor>;

/**
* @brief Defines thread binding type
* @brief Defines inference thread binding type
*/
enum ThreadBindingType : std::uint8_t {
NONE, //!< Don't bind the inference threads
@@ -74,9 +74,11 @@ public:
* @brief Create appropriate multithreaded configuration
* filing unconfigured values from initial configuration using hardware properties
* @param initial Inital configuration
* @param fp_intesive additional hint for the the (Hybrid) core-types selection logic
* whether the executor should be configured for floating point intensive work (as opposite to int8 intensive)
* @return configured values
*/
static Config MakeDefaultMultiThreaded(const Config& initial);
static Config MakeDefaultMultiThreaded(const Config& initial, const bool fp_intesive = true);

std::string _name; //!< Used by `ITT` to name executor threads
int _streams = 1; //!< Number of streams.
@@ -85,6 +87,12 @@ public:
int _threadBindingStep = 1; //!< In case of @ref CORES binding offset type thread binded to cores with defined step
int _threadBindingOffset = 0; //!< In case of @ref CORES binding offset type thread binded to cores starting from offset
int _threads = 0; //!< Number of threads distributed between streams. Reserved. Should not be used.
enum PreferredCoreType {
ANY,
LITTLE,
BIG,
ROUND_ROBIN // used w/multiple streams to populate the Big cores first, then the Little, then wrap around (for large #streams)
} _threadPreferredCoreType = PreferredCoreType::ANY; //!< In case of @ref HYBRID_AWARE hints the TBB to affinitize

/**
* @brief A constructor with arguments
@@ -96,6 +104,7 @@ public:
* @param[in] threadBindingStep @copybrief Config::_threadBindingStep
* @param[in] threadBindingOffset @copybrief Config::_threadBindingOffset
* @param[in] threads @copybrief Config::_threads
* @param[in] threadPreferBigCores @copybrief Config::_threadPreferBigCores
*/
Config(
std::string name = "StreamsExecutor",
@@ -104,14 +113,15 @@ public:
ThreadBindingType threadBindingType = ThreadBindingType::NONE,
int threadBindingStep = 1,
int threadBindingOffset = 0,
int threads = 0) :
int threads = 0,
PreferredCoreType threadPreferredCoreType = PreferredCoreType::ANY) :
_name{name},
_streams{streams},
_threadsPerStream{threadsPerStream},
_threadBindingType{threadBindingType},
_threadBindingStep{threadBindingStep},
_threadBindingOffset{threadBindingOffset},
_threads{threads} {
_threads{threads}, _threadPreferredCoreType(threadPreferredCoreType){
}
};
@@ -142,9 +142,6 @@ def run(args):
logger.warning(f"Turn off threads pinning for {device} " +
"device since multi-scenario with GPU device is used.")
config[device]['CPU_BIND_THREAD'] = 'NO'
else:
## set to default value
config[device]['CPU_BIND_THREAD'] = args.infer_threads_pinning

## for CPU execution, more throughput-oriented execution via streams
set_throughput_streams()
@@ -91,8 +91,11 @@ def parse_args():
args.add_argument('-nthreads', '--number_threads', type=int, required=False, default=None,
help='Number of threads to use for inference on the CPU, GNA '
'(including HETERO and MULTI cases).')
args.add_argument('-pin', '--infer_threads_pinning', type=str, required=False, default='YES', choices=['YES', 'NO', 'NUMA'],
help='Optional. Enable threads->cores (\'YES\' is default value), threads->(NUMA)nodes (\'NUMA\') or completely disable (\'NO\')'
args.add_argument('-pin', '--infer_threads_pinning', type=str, required=False, choices=['YES', 'NO', 'NUMA', 'HYBRID_AWARE'],
help='Optional. Enable threads->cores (\'YES\' which is OpenVINO runtime\'s default for conventional CPUs), '
'threads->(NUMA)nodes (\'NUMA\'), '
'threads->appropriate core types (\'HYBRID_AWARE\', which is OpenVINO runtime\'s default for Hybrid CPUs)'
'or completely disable (\'NO\')'
'CPU threads pinning for CPU-involved inference.')
args.add_argument('-exec_graph_path', '--exec_graph_path', type=str, required=False,
help='Optional. Path to a file where to store executable graph information serialized.')