[CPU] CPU plugin docs refactoring (#10970)

* CPU device documentation refresh

* Bfloat16 inference page aligned with the new API

* Bfloat16 inference section moved to CPU main

* First review comments applied

* Second review step comments applied

* OneDNN reference changed to the GitHub page

* AvgPool added to the oneDNN ops list
Maksim Kutakov 2022-03-18 14:56:22 +03:00 committed by GitHub
parent a4d164eda4
commit dfdbdb4601
22 changed files with 271 additions and 397 deletions


@ -1,214 +0,0 @@
# Bfloat16 Inference {#openvino_docs_IE_DG_Bfloat16Inference}
## Bfloat16 Inference Usage (C++)
@sphinxdirective
.. raw:: html
<div id="switcher-cpp" class="switcher-anchor">C++</div>
@endsphinxdirective
### Disclaimer
Bfloat16 inference on CPU requires that the platform support the native *avx512_bf16* instruction and, therefore, the bfloat16 data format. It is possible to use bfloat16 inference in simulation mode on platforms with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) but without *avx512_bf16*, but this leads to significant performance degradation in comparison with FP32 or native *avx512_bf16* instruction usage.
### Introduction
Bfloat16 (referred to as BF16) is the 16-bit Brain Floating-Point format, a truncated version of the 32-bit IEEE 754 single-precision floating-point format (FP32). BF16 preserves the same 8 exponent bits as FP32 but reduces the mantissa precision from 24 bits to 8 bits.
![bf16_format]
Preserving the exponent bits keeps BF16 in the same range as FP32 (~1e-38 to ~3e38). This simplifies conversion between the two data types: you only need to truncate, or flush to zero, the 16 low-order bits. The truncated mantissa occasionally reduces precision, but according to [investigations](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus), neural networks are more sensitive to the size of the exponent than to the size of the mantissa. Also, in many models, precision is needed close to zero but not as much at the maximum of the range. Another useful feature of BF16 is the possibility to encode INT8 values in BF16 without loss of accuracy, because the INT8 range fits completely into the BF16 mantissa field. This reduces data flow when converting INT8 input image data directly to BF16 without an intermediate FP32 representation, or when combining [INT8 inference](Int8Inference.md) with BF16 layers.
See the [BFLOAT16 Hardware Numerics Definition white paper](https://software.intel.com/content/dam/develop/external/us/en/documents/bf16-hardware-numerics-definition-white-paper.pdf) for more bfloat16 format details.
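To make the FP32-to-BF16 relationship concrete, the sketch below (illustrative only, not part of the Inference Engine API) converts a value by truncating the 16 low-order bits and converts it back by zero-filling them; hardware and library implementations typically use round-to-nearest-even rather than plain truncation:
```cpp
#include <cstdint>
#include <cstring>

// Truncate an FP32 value to BF16 by keeping only the upper 16 bits.
uint16_t fp32_to_bf16(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Expand a BF16 value back to FP32 by zero-filling the 16 low-order bits.
float bf16_to_fp32(uint16_t value) {
    uint32_t bits = static_cast<uint32_t>(value) << 16;
    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}
```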
There are two ways to check whether the CPU device supports bfloat16 computations for models:
1. Query the instruction set using one of these system commands:
* `lscpu | grep avx512_bf16`
* `cat /proc/cpuinfo | grep avx512_bf16`
2. Query the `METRIC_KEY(OPTIMIZATION_CAPABILITIES)` metric (see [Configure devices](supported_plugins/config_properties.md)), which should return `BF16` in the list of CPU optimization options:
@snippet snippets/Bfloat16Inference0.cpp part0
The current Inference Engine solution for bfloat16 inference uses the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and supports inference of a significant number of layers in BF16 computation mode.
### Lowering Inference Precision
Lowering precision to increase performance is a [widely used](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html) inference optimization. The bfloat16 data type on CPU makes a default optimization approach possible for the first time: use the optimization capabilities of the current platform to achieve maximum performance while keeping the accuracy of calculations within an acceptable range.
Using Bfloat16 precision provides the following performance benefits:
1. Faster multiplication of two BF16 numbers because of the shorter mantissa of the bfloat16 data.
2. No need to support denormals or handle exceptions, which is itself a performance optimization.
3. Fast conversion of float32 to bfloat16 and vice versa.
4. Reduced size of data in memory; as a result, larger models fit in the same memory bounds.
5. Reduced amount of data that must be transferred; as a result, reduced data transfer time.
For default optimization on CPU, the source model is converted from FP32 or FP16 to BF16 and executed internally on platforms with native BF16 support. In this case, `KEY_ENFORCE_BF16` is set to `YES` in the `PluginConfigParams` for `GetConfig()`. The code below demonstrates how to check if the key is set:
@snippet snippets/Bfloat16Inference1.cpp part1
To disable BF16 internal transformations in the C++ API, set `KEY_ENFORCE_BF16` to `NO`. In this case, the model is inferred as is, without modifications, with the precisions that were set on each layer edge.
@snippet snippets/Bfloat16Inference2.cpp part2
To disable BF16 in C API:
```
ie_config_t config = { "ENFORCE_BF16", "NO", NULL};
ie_core_load_network(core, network, device_name, &config, &exe_network);
```
An exception with the message `Platform doesn't support BF16 format` is thrown if `KEY_ENFORCE_BF16` is set to `YES` on a CPU without native BF16 support or BF16 simulation mode.
Low-precision 8-bit integer models cannot be converted to BF16, even if bfloat16 optimization is set by default.
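As an illustration only (a minimal sketch based on the snippets above; the exact point at which the error surfaces may vary), the failure can be handled with a standard try/catch, since Inference Engine exceptions derive from `std::exception`:
```cpp
#include <ie_core.hpp>
#include <iostream>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork("sample.xml");
    try {
        // Request BF16 enforcement; this throws on CPUs without native BF16
        // support or BF16 simulation mode.
        core.SetConfig({ { CONFIG_KEY(ENFORCE_BF16), CONFIG_VALUE(YES) } }, "CPU");
        auto exeNetwork = core.LoadNetwork(network, "CPU");
    } catch (const std::exception& ex) {
        std::cerr << ex.what() << std::endl;
    }
    return 0;
}
```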
### Bfloat16 Simulation Mode
Bfloat16 simulation mode is available on CPUs with the Intel® AVX-512 instruction set that do not support the native `avx512_bf16` instruction; the CPU must still support the AVX-512 extensions. The simulator does not guarantee good performance.
To enable the simulation of Bfloat16:
* In the [Benchmark App](../../samples/cpp/benchmark_app/README.md), add the `-enforcebf16=true` option
* In C++ API, set `KEY_ENFORCE_BF16` to `YES`
* In C API:
```
ie_config_t config = { "ENFORCE_BF16", "YES", NULL};
ie_core_load_network(core, network, device_name, &config, &exe_network);
```
### Performance Counters
Information about layer precision is stored in the performance counters that are available from the Inference Engine API. The layers have the following marks:
* Suffix `BF16` for layers that had bfloat16 data type input and were computed in BF16 precision
* Suffix `FP32` for layers computed in 32-bit precision
For example, the performance counters table for the Inception model can look as follows:
```
pool5 EXECUTED layerType: Pooling realTime: 143 cpu: 143 execType: jit_avx512_BF16
fc6 EXECUTED layerType: FullyConnected realTime: 47723 cpu: 47723 execType: jit_gemm_BF16
relu6 NOT_RUN layerType: ReLU realTime: 0 cpu: 0 execType: undef
fc7 EXECUTED layerType: FullyConnected realTime: 7558 cpu: 7558 execType: jit_gemm_BF16
relu7 NOT_RUN layerType: ReLU realTime: 0 cpu: 0 execType: undef
fc8 EXECUTED layerType: FullyConnected realTime: 2193 cpu: 2193 execType: jit_gemm_BF16
prob EXECUTED layerType: SoftMax realTime: 68 cpu: 68 execType: jit_avx512_FP32
```
The **execType** column of the table includes inference primitives with specific suffixes.
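To retrieve these counters programmatically with the Inference Engine API, enable performance counting when loading the network and query the counters from an infer request after inference. A minimal sketch (model name illustrative):
```cpp
#include <ie_core.hpp>
#include <iostream>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork("sample.xml");
    // Enable collection of per-layer performance counters.
    auto exeNetwork = core.LoadNetwork(network, "CPU",
        { { CONFIG_KEY(PERF_COUNT), CONFIG_VALUE(YES) } });
    auto request = exeNetwork.CreateInferRequest();
    request.Infer();
    // Each entry maps a layer name to its status, timings and execType.
    for (const auto& counter : request.GetPerformanceCounts()) {
        std::cout << counter.first << " execType: " << counter.second.exec_type << std::endl;
    }
    return 0;
}
```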
## Bfloat16 Inference Usage (Python)
@sphinxdirective
.. raw:: html
<div id="switcher-python" class="switcher-anchor">Python</div>
@endsphinxdirective
### Disclaimer
Bfloat16 inference on CPU requires that the platform support the native *avx512_bf16* instruction and, therefore, the bfloat16 data format. It is possible to use bfloat16 inference in simulation mode on platforms with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) but without *avx512_bf16*, but this leads to significant performance degradation in comparison with FP32 or native *avx512_bf16* instruction usage.
### Introduction
Bfloat16 (referred to as BF16) is the 16-bit Brain Floating-Point format, a truncated version of the 32-bit IEEE 754 single-precision floating-point format (FP32). BF16 preserves the same 8 exponent bits as FP32 but reduces the mantissa precision from 24 bits to 8 bits.
![bf16_format]
Preserving the exponent bits keeps BF16 in the same range as FP32 (~1e-38 to ~3e38). This simplifies conversion between the two data types: you only need to truncate, or flush to zero, the 16 low-order bits. The truncated mantissa occasionally reduces precision, but according to investigations, neural networks are more sensitive to the size of the exponent than to the size of the mantissa. Also, in many models, precision is needed close to zero but not as much at the maximum of the range. Another useful feature of BF16 is the possibility to encode INT8 values in BF16 without loss of accuracy, because the INT8 range fits completely into the BF16 mantissa field. This reduces data flow when converting INT8 input image data directly to BF16 without an intermediate FP32 representation, or when combining [INT8 inference](Int8Inference.md) with BF16 layers.
See the [BFLOAT16 Hardware Numerics Definition white paper](https://software.intel.com/content/dam/develop/external/us/en/documents/bf16-hardware-numerics-definition-white-paper.pdf) for more bfloat16 format details.
There are two ways to check whether the CPU device supports bfloat16 computations for models:
1. Query the instruction set using one of these system commands:
* `lscpu | grep avx512_bf16`
* `cat /proc/cpuinfo | grep avx512_bf16`
2. Use the Query API with METRIC_KEY(OPTIMIZATION_CAPABILITIES), which should return BF16 in the list of CPU optimization options:
```python
from openvino.inference_engine import IECore
ie = IECore()
net = ie.read_network(path_to_xml_file)
cpu_caps = ie.get_metric(metric_name="OPTIMIZATION_CAPABILITIES", device_name="CPU")
```
The current Inference Engine solution for bfloat16 inference uses the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and supports inference of a significant number of layers in BF16 computation mode.
### Lowering Inference Precision
Lowering precision to increase performance is a widely used inference optimization. The bfloat16 data type on CPU makes a default optimization approach possible for the first time: use the optimization capabilities of the current platform to achieve maximum performance while keeping the accuracy of calculations within an acceptable range.
Using Bfloat16 precision provides the following performance benefits:
1. Faster multiplication of two BF16 numbers because of the shorter mantissa of the bfloat16 data.
2. No need to support denormals or handle exceptions, which is itself a performance optimization.
3. Fast conversion of float32 to bfloat16 and vice versa.
4. Reduced size of data in memory; as a result, larger models fit in the same memory bounds.
5. Reduced amount of data that must be transferred; as a result, reduced data transfer time.
For default optimization on CPU, the source model is converted from FP32 or FP16 to BF16 and executed internally on platforms with native BF16 support. In this case, ENFORCE_BF16 is set to YES. The code below demonstrates how to check if the key is set:
```python
from openvino.inference_engine import IECore
ie = IECore()
net = ie.read_network(path_to_xml_file)
exec_net = ie.load_network(network=net, device_name="CPU")
exec_net.get_config("ENFORCE_BF16")
```
To enable BF16 internal transformations, set the key "ENFORCE_BF16" to "YES" in the ExecutableNetwork configuration.
```python
bf16_config = {"ENFORCE_BF16" : "YES"}
exec_net = ie.load_network(network=net, device_name="CPU", config = bf16_config)
```
To disable BF16 internal transformations, set the key "ENFORCE_BF16" to "NO". In this case, the model is inferred as is, without modifications, with the precisions that were set on each layer edge.
An exception with the message `Platform doesn't support BF16 format` is thrown if "ENFORCE_BF16" is set to "YES" on a CPU without native BF16 support or BF16 simulation mode.
Low-Precision 8-bit integer models cannot be converted to BF16, even if bfloat16 optimization is set by default.
### Bfloat16 Simulation Mode
Bfloat16 simulation mode is available on CPUs with the Intel® AVX-512 instruction set that do not support the native avx512_bf16 instruction; the CPU must still support the AVX-512 extensions. The simulator does not guarantee good performance.
#### To enable the simulation of Bfloat16:
* In the Benchmark App, add the -enforcebf16=true option
* In Python, use the following code as an example:
```python
from openvino.inference_engine import IECore
ie = IECore()
net = ie.read_network(path_to_xml_file)
bf16_config = {"ENFORCE_BF16" : "YES"}
exec_net = ie.load_network(network=net, device_name="CPU", config=bf16_config)
```
### Performance Counters
Information about layer precision is stored in the performance counters that are available from the Inference Engine API. The layers have the following marks:
* Suffix *BF16* for layers that had bfloat16 data type input and were computed in BF16 precision
* Suffix *FP32* for layers computed in 32-bit precision
For example, the performance counters table for the Inception model can look as follows:
```
pool5 EXECUTED layerType: Pooling realTime: 143 cpu: 143 execType: jit_avx512_BF16
fc6 EXECUTED layerType: FullyConnected realTime: 47723 cpu: 47723 execType: jit_gemm_BF16
relu6 NOT_RUN layerType: ReLU realTime: 0 cpu: 0 execType: undef
fc7 EXECUTED layerType: FullyConnected realTime: 7558 cpu: 7558 execType: jit_gemm_BF16
relu7 NOT_RUN layerType: ReLU realTime: 0 cpu: 0 execType: undef
fc8 EXECUTED layerType: FullyConnected realTime: 2193 cpu: 2193 execType: jit_gemm_BF16
prob EXECUTED layerType: SoftMax realTime: 68 cpu: 68 execType: jit_avx512_FP32
```
The **execType** column of the table includes inference primitives with specific suffixes.
[bf16_format]: img/bf16_format.png


@ -9,7 +9,6 @@
openvino_docs_deployment_optimization_guide_dldt_optimization_guide
openvino_docs_IE_DG_Model_caching_overview
openvino_docs_IE_DG_Int8Inference
openvino_docs_IE_DG_Bfloat16Inference
openvino_docs_OV_UG_NoDynamicShapes
@endsphinxdirective


@ -1,139 +1,211 @@
# CPU device {#openvino_docs_OV_UG_supported_plugins_CPU}
The CPU plugin is developed to achieve high-performance inference of neural networks on Intel® x86-64 CPUs.
For an in-depth description of the CPU plugin, see
## Introducing the CPU Plugin
The CPU plugin was developed to achieve high-performance inference of neural networks on the CPU, using the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN).
- [CPU plugin developers documentation](https://github.com/openvinotoolkit/openvino/wiki/CPUPluginDevelopersDocs)
Currently, the CPU plugin uses Intel® Threading Building Blocks (Intel® TBB) in order to parallelize calculations. Please refer to the [Optimization Guide](../../optimization_guide/dldt_optimization_guide.md) for associated performance considerations.
The set of supported layers can be expanded with [the Extensibility mechanism](../../Extensibility_UG/Intro.md).
## Supported Platforms
OpenVINO™ toolkit, including the CPU plugin, is officially supported and validated on the following platforms:
| Host | OS (64-bit) |
| :--- | :--- |
| Development | Ubuntu* 18.04 or 20.04, CentOS* 7.6, MS Windows* 10, macOS* 10.15 |
| Target | Ubuntu* 18.04 or 20.04, CentOS* 7.6, MS Windows* 10, macOS* 10.15 |
The CPU plugin supports inference on Intel® Xeon® with Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and AVX512_BF16, Intel® Core™
Processors with Intel® AVX2, Intel Atom® Processors with Intel® Streaming SIMD Extensions (Intel® SSE).
You can use the `-pc` flag for samples to know which configuration is used by a layer.
This flag shows execution statistics that you can use to get information about layer name, layer type,
execution status, execution time, and the type of the execution primitive.
## Internal CPU Plugin Optimizations
The CPU plugin supports several graph optimization algorithms, such as fusing or removing layers.
Refer to the sections below for details.
> **NOTE**: For layer descriptions, see the [IR Notation Reference](../../ops/opset.md).
### Lowering Inference Precision
The CPU plugin follows a default optimization approach. This approach means that inference is made with lower precision if it is possible on a given platform to reach better performance with an acceptable range of accuracy.
> **NOTE**: For details, see the [Using Bfloat16 Inference](../Bfloat16Inference.md).
### Fusing Convolution and Simple Layers
Merge of a convolution layer and any of the simple layers listed below:
- Activation: ReLU, ELU, Sigmoid, Clamp
- Depthwise: ScaleShift, PReLU
- FakeQuantize
> **NOTE**: You can have any number and order of simple layers.
A combination of a convolution layer and simple layers results in a single fused layer called
*Convolution*:
![conv_simple_01]
- [OpenVINO Runtime CPU plugin source files](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_cpu/)
### Fusing Pooling and FakeQuantize Layers
The CPU plugin is a part of the Intel® Distribution of OpenVINO™ toolkit.
A combination of Pooling and FakeQuantize layers results in a single fused layer called *Pooling*:
## Device name
The CPU plugin uses the `"CPU"` device name, and even though there can be more than one socket on a platform, from the plugin's point of view there is only one `"CPU"` device.
On multi-socket platforms, load balancing and memory usage distribution between NUMA nodes are handled automatically.
To use the CPU for inference, the device name should be passed to the `ov::Core::compile_model()` method:
![pooling_fakequant_01]
@snippet snippets/cpu/compile_model.cpp compile_model_default
### Fusing FullyConnected and Activation Layers
A combination of FullyConnected and Activation layers results in a single fused layer called
*FullyConnected*:
![fullyconnected_activation_01]
### Fusing Convolution and Depthwise Convolution Layers Grouped with Simple Layers
> **NOTE**: This pattern is possible only on CPUs with support of Streaming SIMD Extensions 4.2
> (SSE 4.2) and Intel AVX2 Instruction Set Architecture (ISA).
A combination of a group of a Convolution (or Binary Convolution) layer and simple layers and a group of a Depthwise Convolution
layer and simple layers results in a single layer called *Convolution* (or *Binary Convolution*):
> **NOTE**: Depthwise convolution layers should have the same values for the `group`, input channels, and output channels parameters.
![conv_depth_01]
### Fusing Convolution and Sum Layers
A combination of convolution, simple, and Eltwise layers with the sum operation results in a single layer called *Convolution*:
![conv_sum_relu_01]
### Fusing a Group of Convolutions
If a topology contains the following pipeline, a CPU plugin merges split, convolution, and concatenation layers into a single convolution layer with the group parameter:
![group_convolutions_01]
> **NOTE**: Parameters of the convolution layers must coincide.
### Removing a Power Layer
CPU plugin removes a Power layer from a topology if it has the following parameters:
- <b>power</b> = 1
- <b>scale</b> = 1
- <b>offset</b> = 0
## Supported inference data types
CPU plugin supports the following data types as inference precision of internal primitives:
- Floating-point data types:
- f32
- bf16
- Integer data types:
- i32
- Quantized data types:
- u8
- i8
- u1
## Supported Configuration Parameters
[Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) can be used to print out supported data types for all detected devices.
The plugin supports the configuration parameters listed below.
All parameters must be set with the `InferenceEngine::Core::LoadNetwork()` method.
When specifying key values as raw strings (that is, when using Python API), omit the `KEY_` prefix.
Refer to the OpenVINO samples for usage examples: [Benchmark App](../../../samples/cpp/benchmark_app/README.md).
### Quantized data types specifics
These are general options, also supported by other plugins:
The selected precision of each primitive depends on the operation precision in IR, quantization primitives, and available hardware capabilities.
The u1/u8/i8 data types are used for quantized operations only, i.e., they are not selected automatically for non-quantized operations.
| Parameter name | Parameter values | Default | Description |
| :--- | :--- | :--- | :----------------------------------------------------------------------------------------------------------------------------|
| KEY_EXCLUSIVE_ASYNC_REQUESTS | YES/NO | NO | Forces async requests (also from different executable networks) to execute serially. This prevents potential oversubscription|
| KEY_PERF_COUNT | YES/NO | NO | Enables gathering performance counters |
See the [low-precision optimization guide](@ref pot_docs_LowPrecisionOptimizationGuide) for more details on how to get a quantized model.
CPU-specific settings:
> **NOTE**: Platforms that do not support Intel® AVX512-VNNI have a known "saturation issue" which in some cases leads to reduced computational accuracy for u8/i8 precision calculations.
> See [saturation (overflow) issue section](@ref pot_saturation_issue) to get more information on how to detect such issues and possible workarounds.
| Parameter name | Parameter values | Default | Description |
| :--- | :--- | :--- | :--- |
| KEY_CPU_THREADS_NUM | positive integer values| 0 | Specifies the number of threads that CPU plugin should use for inference. Zero (default) means using all (logical) cores|
| KEY_CPU_BIND_THREAD | YES/NUMA/NO | YES | Binds inference threads to CPU cores. 'YES' (default) binding option maps threads to cores - this works best for static/synthetic scenarios like benchmarks. The 'NUMA' binding is more relaxed, binding inference threads only to NUMA nodes, leaving further scheduling to specific cores to the OS. This option might perform better in the real-life/contended scenarios. Note that for the latency-oriented cases (number of the streams is less or equal to the number of NUMA nodes, see below) both YES and NUMA options limit number of inference threads to the number of hardware cores (ignoring hyper-threading) on the multi-socket machines. |
| KEY_CPU_THROUGHPUT_STREAMS | KEY_CPU_THROUGHPUT_NUMA, KEY_CPU_THROUGHPUT_AUTO, or positive integer values| 1 | Specifies number of CPU "execution" streams for the throughput mode. Upper bound for the number of inference requests that can be executed simultaneously. All available CPU cores are evenly distributed between the streams. The default value is 1, which implies latency-oriented behavior for single NUMA-node machine, with all available cores processing requests one by one. On the multi-socket (multiple NUMA nodes) machine, the best latency numbers usually achieved with a number of streams matching the number of NUMA-nodes. <br>KEY_CPU_THROUGHPUT_NUMA creates as many streams as needed to accommodate NUMA and avoid associated penalties.<br>KEY_CPU_THROUGHPUT_AUTO creates bare minimum of streams to improve the performance; this is the most portable option if you don't know how many cores your target machine has (and what would be the optimal number of streams). Note that your application should provide enough parallel slack (for example, run many inference requests) to leverage the throughput mode. <br> Non-negative integer value creates the requested number of streams. If a number of streams is 0, no internal streams are created and user threads are interpreted as stream master threads.|
| KEY_ENFORCE_BF16 | YES/NO| YES | The name for setting to execute in bfloat16 precision whenever it is possible. This option lets plugin know to downscale the precision where it sees performance benefits from bfloat16 execution. Such option does not guarantee accuracy of the network, you need to verify the accuracy in this mode separately, based on performance and accuracy results. It should be your decision whether to use this option or not. |
### Floating point data types specifics
> **NOTE**: To disable all internal threading, use the following set of configuration parameters: `KEY_CPU_THROUGHPUT_STREAMS=0`, `KEY_CPU_THREADS_NUM=1`, `KEY_CPU_BIND_THREAD=NO`.
The default floating-point precision of a CPU primitive is f32. To support f16 IRs, the plugin internally converts all f16 values to f32, and all calculations are performed with native f32 precision.
On platforms that natively support bfloat16 calculations (i.e., have the AVX512_BF16 extension), the bf16 type is automatically used instead of f32 to achieve better performance, so no special steps are required to run a model with bf16 precision.
See the [BFLOAT16 Hardware Numerics Definition white paper](https://software.intel.com/content/dam/develop/external/us/en/documents/bf16-hardware-numerics-definition-white-paper.pdf) for more details about bfloat16 format.
Using bf16 precision provides the following performance benefits:
- Faster multiplication of two bfloat16 numbers because of shorter mantissa of the bfloat16 data.
- Reduced memory consumption since bfloat16 data size is two times less than 32-bit float.
To check whether the CPU device supports the bfloat16 data type, use the [query device properties interface](./config_properties.md) to query the ov::device::capabilities property, which should contain `BF16` in the list of CPU capabilities:
@snippet snippets/cpu/Bfloat16Inference0.cpp part0
If the model has been converted to bf16, ov::hint::inference_precision is set to ov::element::bf16 and can be checked via the ov::CompiledModel::get_property call. The code below demonstrates how to get the element type:
@snippet snippets/cpu/Bfloat16Inference1.cpp part1
To infer the model in f32 precision instead of bf16 on targets with native bf16 support, set the ov::hint::inference_precision to ov::element::f32.
@snippet snippets/cpu/Bfloat16Inference2.cpp part2
Bfloat16 software simulation mode is available on CPUs with the Intel® AVX-512 instruction set that do not support the native `avx512_bf16` instruction. This mode is intended for development purposes and does not guarantee good performance.
To enable the simulation, explicitly set ov::hint::inference_precision to ov::element::bf16.
> **NOTE**: An exception is thrown if ov::hint::inference_precision is set to ov::element::bf16 on a CPU without native bfloat16 support or bfloat16 simulation mode.
> **NOTE**: Due to the reduced mantissa size of the bfloat16 data type, the resulting bf16 inference accuracy may differ from the f32 inference, especially for models that were not trained using the bfloat16 data type. If the bf16 inference accuracy is not acceptable, it is recommended to switch to the f32 precision.
## Supported features
### Multi-device execution
If a machine has OpenVINO-supported devices other than the CPU (for example, an integrated GPU), then any supported model can be executed on the CPU and all the other devices simultaneously.
For simultaneous usage of the CPU and GPU, this can be achieved by specifying `"MULTI:CPU,GPU.0"` as the target device.
@snippet snippets/cpu/compile_model.cpp compile_model_multi
See [Multi-device execution page](../multi_device.md) for more details.
### Multi-stream execution
If either the `ov::num_streams(n_streams)` property with `n_streams > 1` or the `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` property is set for the CPU plugin,
multiple streams are created for the model. In the case of the CPU plugin, each stream has its own host thread, which means that incoming infer requests can be processed simultaneously.
Each stream is pinned to its own group of physical cores, with respect to NUMA-node physical memory usage, to minimize the overhead of data transfer between NUMA nodes.
See [optimization guide](@ref openvino_docs_deployment_optimization_guide_dldt_optimization_guide) for more details.
> **NOTE**: When it comes to latency, keep in mind that running only one stream on a multi-socket platform may introduce additional overhead of data transfer between NUMA nodes.
> In that case it is better to run inference on one socket (see the [deployment optimization guide (additional configurations)](@ref openvino_docs_deployment_optimization_guide_dldt_optimization_guide_additional) for details).
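The sketch below (the stream count is an example value, not a recommendation) shows both ways of requesting multiple streams when compiling a model:
```cpp
#include <openvino/runtime/core.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    // Explicitly request four execution streams.
    auto compiled_model_streams = core.compile_model(model, "CPU", ov::num_streams(4));
    // Or let the plugin choose the number of streams via the throughput hint.
    auto compiled_model_tput = core.compile_model(model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    return 0;
}
```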
### Dynamic shapes
The CPU plugin provides full functional support for models with dynamic shapes in terms of opset coverage.
> **NOTE**: The CPU plugin does not support tensors with dynamically changing rank. An attempt to infer a model with such tensors will cause an exception to be thrown.
Dynamic shape support introduces additional overhead on memory management and may limit internal runtime optimizations.
The more degrees of freedom there are, the more difficult it is to achieve the best performance.
The most flexible configuration, a fully undefined shape with no constraints on the dimensions, is also the most convenient one,
but reducing the level of uncertainty brings performance gains.
For example, explicitly setting dynamic shapes with defined upper bounds reduces memory consumption through memory reuse, which results in better cache locality and, in turn, better inference performance.
@snippet snippets/cpu/dynamic_shape.cpp defined_upper_bound
Some runtime optimizations work better if the model shapes are known in advance.
Therefore, if the input data shape does not change between inference calls, it is recommended to use a model with static shapes or to reshape the existing model to a static input shape to get the best performance.
@snippet snippets/cpu/dynamic_shape.cpp static_shape
See [dynamic shapes guide](../ov_dynamic_shapes.md) for more details.
### Preprocessing acceleration
The CPU plugin supports the full set of preprocessing operations and provides high-performance implementations for them.
See [preprocessing API guide](../preprocessing_overview.md) for more details.
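As an example, the sketch below (the element types and the single-input assumption are illustrative) uses the preprocessing API to declare a u8 input tensor and convert it to f32 inside the compiled model, so that the conversion is executed by the CPU plugin:
```cpp
#include <openvino/core/preprocess/pre_post_process.hpp>
#include <openvino/runtime/core.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    ov::preprocess::PrePostProcessor ppp(model);
    // The application will provide u8 data for the (single) model input.
    ppp.input().tensor().set_element_type(ov::element::u8);
    // Convert it to f32 as a preprocessing step inside the model.
    ppp.input().preprocess().convert_element_type(ov::element::f32);
    model = ppp.build();
    auto compiled_model = core.compile_model(model, "CPU");
    return 0;
}
```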
@sphinxdirective
.. dropdown:: The CPU plugin support for handling tensor precision conversion is limited to the following ov::element types:
* bf16
* f16
* f32
* f64
* i8
* i16
* i32
* i64
* u8
* u16
* u32
* u64
* boolean
@endsphinxdirective
### Models caching
The CPU plugin supports the Import/Export network capability. If model caching is enabled via the common OpenVINO™ `ov::cache_dir` property, the plugin automatically creates a cached blob inside the specified directory during model compilation.
This cached blob contains the intermediate representation of the network obtained after common runtime optimizations and low-precision transformations.
The next time the model is compiled, the cached representation is loaded into the plugin instead of the initial IR, so the aforementioned transformation steps are skipped.
These transformations take a significant amount of time during model compilation, so caching their result reduces the time spent on subsequent compilations of the model,
thereby reducing first inference latency (FIL).
See [model caching overview](@ref openvino_docs_IE_DG_Model_caching_overview) for more details.
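A minimal caching sketch (the cache directory path is an example): the first compilation creates the cached blob, and subsequent compilations of the same model reuse it:
```cpp
#include <openvino/runtime/core.hpp>

int main() {
    ov::Core core;
    // Any writable directory can serve as the cache location.
    core.set_property(ov::cache_dir("/tmp/ov_cache"));
    auto model = core.read_model("model.xml");
    // Creates the cached blob on the first run, loads it on subsequent runs.
    auto compiled_model = core.compile_model(model, "CPU");
    return 0;
}
```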
### Extensibility
The CPU plugin supports falling back on the `ov::Op` reference implementation if the plugin does not have its own implementation of an operation.
This means that the [OpenVINO™ Extensibility Mechanism](@ref openvino_docs_Extensibility_UG_Intro) can be used to extend the plugin as well.
To enable fallback on a custom operation implementation, override the `ov::Op::evaluate` method in the derived operation class (see [custom OpenVINO™ operations](@ref openvino_docs_Extensibility_UG_add_openvino_ops) for details).
> **NOTE**: At the moment, custom operations with internal dynamism (when the output tensor shape can only be determined as a result of performing the operation) are not supported by the plugin.
### Stateful models
CPU plugin supports stateful models without any limitations.
See [stateful models guide](@ref openvino_docs_IE_DG_network_state_intro) for details.
## Supported properties
The plugin supports the properties listed below.
### Read-write properties
All parameters must be set before calling `ov::Core::compile_model()` in order to take effect, or passed as an additional argument to `ov::Core::compile_model()` (see the example after the property lists below).
- ov::enable_profiling
- ov::hint::inference_precision
- ov::hint::performance_mode
- ov::hint::num_requests
- ov::num_streams
- ov::affinity
- ov::inference_num_threads
### Read-only properties
- ov::cache_dir
- ov::supported_properties
- ov::available_devices
- ov::range_for_async_infer_requests
- ov::range_for_streams
- ov::device::full_name
- ov::device::capabilities
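The sketch below (property values are illustrative) sets several read-write properties as additional `compile_model()` arguments and then reads one of the read-only properties back from the device:
```cpp
#include <openvino/runtime/core.hpp>
#include <iostream>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    // Read-write properties passed as additional compile_model() arguments.
    auto compiled_model = core.compile_model(model, "CPU",
        ov::enable_profiling(true),
        ov::hint::inference_precision(ov::element::f32),
        ov::inference_num_threads(8));
    // Read-only property queried directly from the device.
    std::cout << core.get_property("CPU", ov::device::full_name) << std::endl;
    return 0;
}
```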
## External dependencies
For some performance-critical DL operations, the CPU plugin uses optimized implementations from the oneAPI Deep Neural Network Library ([oneDNN](https://github.com/oneapi-src/oneDNN)).
@sphinxdirective
.. dropdown:: The following operations are implemented using primitives from the oneDNN library:
* AvgPool
* Concat
* Convolution
* ConvolutionBackpropData
* GroupConvolution
* GroupConvolutionBackpropData
* GRUCell
* GRUSequence
* LRN
* LSTMCell
* LSTMSequence
* MatMul
* MaxPool
* RNNCell
* RNNSequence
* SoftMax
@endsphinxdirective
## See Also
* [Supported Devices](Supported_Devices.md)
* [Optimization guide](@ref openvino_docs_optimization_guide_dldt_optimization_guide)
* [CPU plugin developers documentation](https://github.com/openvinotoolkit/openvino/wiki/CPUPluginDevelopersDocs)
[mkldnn_group_conv]: ../img/mkldnn_group_conv.png
[mkldnn_conv_sum]: ../img/mkldnn_conv_sum.png
[mkldnn_conv_sum_result]: ../img/mkldnn_conv_sum_result.png
[conv_simple_01]: ../img/conv_simple_01.png
[pooling_fakequant_01]: ../img/pooling_fakequant_01.png
[fullyconnected_activation_01]: ../img/fullyconnected_activation_01.png
[conv_depth_01]: ../img/conv_depth_01.png
[group_convolutions_01]: ../img/group_convolutions_01.png
[conv_sum_relu_01]: ../img/conv_sum_relu_01.png


@ -18,7 +18,7 @@ The OpenVINO Runtime provides capabilities to infer deep learning models on the
| Plugin | Device types |
|--------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
|[CPU](CPU.md) |Intel&reg; Xeon&reg; with Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and AVX512_BF16, Intel&reg; Core&trade; Processors with Intel&reg; AVX2, Intel&reg; Atom&reg; Processors with Intel® Streaming SIMD Extensions (Intel® SSE) |
|[CPU](CPU.md) |Intel® Xeon®, Intel® Core™ and Intel® Atom® processors with Intel® Streaming SIMD Extensions (Intel® SSE4.2), Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel® Vector Neural Network Instructions (Intel® AVX512-VNNI) and bfloat16 extension for AVX-512 (Intel® AVX-512_BF16 Extension)|
|[GPU](GPU.md) |Intel® Graphics, including Intel® HD Graphics, Intel® UHD Graphics, Intel® Iris® Graphics, Intel® Xe Graphics, Intel® Xe MAX Graphics |
|[VPUs](VPU.md) |Intel® Neural Compute Stick 2 powered by the Intel® Movidius™ Myriad™ X, Intel® Vision Accelerator Design with Intel® Movidius™ VPUs |
|[GNA](GNA.md) |[Intel® Speech Enabling Developer Kit](https://www.intel.com/content/www/us/en/support/articles/000026156/boards-and-kits/smart-home.html); [Amazon Alexa\* Premium Far-Field Developer Kit](https://developer.amazon.com/en-US/alexa/alexa-voice-service/dev-kits/amazon-premium-voice); [Intel® Pentium® Silver Processors N5xxx, J5xxx and Intel® Celeron® Processors N4xxx, J4xxx (formerly codenamed Gemini Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/83915/gemini-lake.html): [Intel® Pentium® Silver J5005 Processor](https://ark.intel.com/content/www/us/en/ark/products/128984/intel-pentium-silver-j5005-processor-4m-cache-up-to-2-80-ghz.html), [Intel® Pentium® Silver N5000 Processor](https://ark.intel.com/content/www/us/en/ark/products/128990/intel-pentium-silver-n5000-processor-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® J4005 Processor](https://ark.intel.com/content/www/us/en/ark/products/128992/intel-celeron-j4005-processor-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® J4105 Processor](https://ark.intel.com/content/www/us/en/ark/products/128989/intel-celeron-j4105-processor-4m-cache-up-to-2-50-ghz.html), [Intel® Celeron® J4125 Processor](https://ark.intel.com/content/www/us/en/ark/products/197305/intel-celeron-processor-j4125-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® Processor N4100](https://ark.intel.com/content/www/us/en/ark/products/128983/intel-celeron-processor-n4100-4m-cache-up-to-2-40-ghz.html), [Intel® Celeron® Processor N4000](https://ark.intel.com/content/www/us/en/ark/products/128988/intel-celeron-processor-n4000-4m-cache-up-to-2-60-ghz.html); [Intel® Pentium® Processors N6xxx, J6xxx, Intel® Celeron® Processors N6xxx, J6xxx and Intel Atom® x6xxxxx (formerly codenamed Elkhart Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/128825/products-formerly-elkhart-lake.html); [Intel® Core™ Processors (formerly codenamed Cannon Lake)](https://ark.intel.com/content/www/us/en/ark/products/136863/intel-core-i3-8121u-processor-4m-cache-up-to-3-20-ghz.html); [10th Generation Intel® Core™ Processors (formerly codenamed Ice Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/74979/ice-lake.html): [Intel® Core™ i7-1065G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i71065g7-processor-8m-cache-up-to-3-90-ghz.html), [Intel® Core™ i7-1060G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/197120/intel-core-i71060g7-processor-8m-cache-up-to-3-80-ghz.html), [Intel® Core™ i5-1035G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/196591/intel-core-i51035g4-processor-6m-cache-up-to-3-70-ghz.html), [Intel® Core™ i5-1035G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/196592/intel-core-i51035g7-processor-6m-cache-up-to-3-70-ghz.html), [Intel® Core™ i5-1035G1 Processor](https://ark.intel.com/content/www/us/en/ark/products/196603/intel-core-i51035g1-processor-6m-cache-up-to-3-60-ghz.html), [Intel® Core™ i5-1030G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/197119/intel-core-i51030g7-processor-6m-cache-up-to-3-50-ghz.html), [Intel® Core™ i5-1030G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/197121/intel-core-i51030g4-processor-6m-cache-up-to-3-50-ghz.html), [Intel® Core™ i3-1005G1 Processor](https://ark.intel.com/content/www/us/en/ark/products/196588/intel-core-i31005g1-processor-4m-cache-up-to-3-40-ghz.html), [Intel® Core™ i3-1000G1 
Processor](https://ark.intel.com/content/www/us/en/ark/products/197122/intel-core-i31000g1-processor-4m-cache-up-to-3-20-ghz.html), [Intel® Core™ i3-1000G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/197123/intel-core-i31000g4-processor-4m-cache-up-to-3-20-ghz.html); [11th Generation Intel® Core™ Processors (formerly codenamed Tiger Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/88759/tiger-lake.html); [12th Generation Intel® Core™ Processors (formerly codenamed Alder Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/147470/products-formerly-alder-lake.html)|


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:80edd1da1c5673d18afa44bc2c0503ba9ecdcc37c2acb94960303b61c602ceee
size 12649


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d3e8856aa175d6fcf940af57a53f962ff6c58acf0a3838bfccc6a093bff1756d
size 9015


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d53ce33f180cf4d170bbeb69635ee7c49a67d3f6ee8b1c01ec12568fe1cca38
size 17157


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:88745fd132531e943d59afe59ed6af8eaae6b62ba1fda2493dfef76080d31a25
size 7788


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9709bc83f903943b4d737d379babf80a391a72ad8eab98e71abcc0de5424fbfc
size 12361


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:af2641e8e685b027123681ab542162932b008eff257ef5b7105950bfe8b4ade8
size 10373


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:02efdda675c16def7c2705e978964ce8bf65d1ec6cedfdb0a5afc837fb57abf0
size 5660


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e69242d80da7676311e20e5db67c01bd6562008ecf3a53df8fdedaefabb91b70
size 7226


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:37c7908d2379cc2ba1909965c58de7bc55d131a330c47e173321c718846d6745
size 7809


@ -113,14 +113,14 @@ In the Inference Engine, there is no notion of requests priorities. It is left t
Inference precision directly affects the performance.
Model Optimizer can produce an IR with different precisions. For example, an FP16 IR initially targets VPU and GPU devices, while for the CPU an FP16 IR is typically up-scaled to regular FP32 automatically upon loading. Note that further device-specific inference precision settings are available,
for example, [8-bit integer](../OV_Runtime_UG/Int8Inference.md) or [bfloat16](../OV_Runtime_UG/Bfloat16Inference.md), which is specific to the CPU inference, below.
for example, [8-bit integer](../OV_Runtime_UG/Int8Inference.md) or [bfloat16](../OV_Runtime_UG/supported_plugins/CPU.md), which is specific to the CPU inference, below.
Note that for the [Multi-Device execution](../OV_Runtime_UG/multi_device.md) that supports automatic inference on multiple devices in parallel, you can use an FP16 IR (no need for FP32).
You can find more information, including preferred data types for specific devices, in the
[Supported Devices](../OV_Runtime_UG/supported_plugins/Supported_Devices.md) document.
By default, plugins enable the optimizations that allow lower precision if the acceptable range of accuracy is preserved.
For example, for the CPU that supports the AVX512_BF16 instructions, an FP16/FP32 model is converted to a [bfloat16](../OV_Runtime_UG/Bfloat16Inference.md) IR to accelerate inference.
For example, for the CPU that supports the AVX512_BF16 instructions, an FP16/FP32 model is converted to a [bfloat16](../OV_Runtime_UG/supported_plugins/CPU.md) IR to accelerate inference.
To compare the associated speedup, run the example command below to disable this feature on the CPU device with the AVX512_BF16 support and get regular FP32 execution:


@ -1,10 +0,0 @@
#include <ie_core.hpp>
int main() {
using namespace InferenceEngine;
//! [part0]
InferenceEngine::Core core;
auto cpuOptimizationCapabilities = core.GetMetric("CPU", METRIC_KEY(OPTIMIZATION_CAPABILITIES)).as<std::vector<std::string>>();
//! [part0]
return 0;
}


@ -1,13 +0,0 @@
#include <ie_core.hpp>
int main() {
using namespace InferenceEngine;
//! [part1]
InferenceEngine::Core core;
auto network = core.ReadNetwork("sample.xml");
auto exeNetwork = core.LoadNetwork(network, "CPU");
auto enforceBF16 = exeNetwork.GetConfig(PluginConfigParams::KEY_ENFORCE_BF16).as<std::string>();
//! [part1]
return 0;
}


@ -1,11 +0,0 @@
#include <ie_core.hpp>
int main() {
using namespace InferenceEngine;
//! [part2]
InferenceEngine::Core core;
core.SetConfig({ { CONFIG_KEY(ENFORCE_BF16), CONFIG_VALUE(NO) } }, "CPU");
//! [part2]
return 0;
}


@ -0,0 +1,9 @@
#include <openvino/runtime/core.hpp>
int main() {
//! [part0]
ov::Core core;
auto cpuOptimizationCapabilities = core.get_property("CPU", ov::device::capabilities);
//! [part0]
return 0;
}


@ -0,0 +1,13 @@
#include <openvino/runtime/core.hpp>
int main() {
//! [part1]
ov::Core core;
auto network = core.read_model("sample.xml");
auto exec_network = core.compile_model(network, "CPU");
auto inference_precision = exec_network.get_property(ov::hint::inference_precision);
//! [part1]
return 0;
}


@ -0,0 +1,11 @@
#include <openvino/runtime/core.hpp>
int main() {
//! [part2]
ov::Core core;
core.set_property("CPU", ov::hint::inference_precision(ov::element::f32));
//! [part2]
return 0;
}


@ -0,0 +1,20 @@
#include <openvino/runtime/core.hpp>
int main() {
{
//! [compile_model_default]
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "CPU");
//! [compile_model_default]
}
{
//! [compile_model_multi]
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "MULTI:CPU,GPU.0");
//! [compile_model_multi]
}
}


@ -0,0 +1,25 @@
#include <openvino/runtime/core.hpp>
int main() {
{
//! [defined_upper_bound]
ov::Core core;
auto model = core.read_model("model.xml");
model->reshape({{ov::Dimension(1, 10), ov::Dimension(1, 20), ov::Dimension(1, 30), ov::Dimension(1, 40)}});
//! [defined_upper_bound]
}
{
//! [static_shape]
ov::Core core;
auto model = core.read_model("model.xml");
ov::Shape static_shape = {10, 20, 30, 40};
model->reshape(static_shape);
//! [static_shape]
}
return 0;
}