* Changed folder for documentation * Fixed links * Merged nGraph DG to OpenVINO Runtime UG * Fixed errors * Fixed some issues * Fixed tree * Fixed typo * Update docs/documentation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update README.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update README.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Fixed name * FIxed snippets * Small fixes * Update docs/HOWTO/Custom_Layers_Guide.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Fixed comments * Try to fix doc * Try to fix doc issue * Update docs/OV_Runtime_UG/Integrate_with_customer_application_new_API.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com>
9.2 KiB
Low-Precision 8-bit Integer Inference
Disclaimer
Low-precision 8-bit inference is optimized for:
- Intel® architecture processors with the following instruction set architecture extensions:
- Intel® Advanced Vector Extensions 512 Vector Neural Network Instructions (Intel® AVX-512 VNNI)
- Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
- Intel® Advanced Vector Extensions 2.0 (Intel® AVX2)
- Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2)
- Intel® processor graphics:
- Intel® Iris® Xe Graphics
- Intel® Iris® Xe MAX Graphics
Introduction
For 8-bit integer computation, a model must be quantized. You can use a quantized model from [OpenVINO™ Toolkit Intel's Pre-Trained Models](@ref omz_models_group_intel) or quantize a model yourself. For quantization, you can use the following:
- [Post-Training Optimization Tool](@ref pot_docs_LowPrecisionOptimizationGuide) delivered with the Intel® Distribution of OpenVINO™ toolkit release package
- Neural Network Compression Framework available on GitHub: https://github.com/openvinotoolkit/nncf
The quantization process adds FakeQuantize layers on activations and weights for most layers. Read more about mathematical computations in the Uniform Quantization with Fine-Tuning.
When you pass the quantized IR to the OpenVINO™ plugin, the plugin automatically recognizes it as a quantized model and performs 8-bit inference. Note that if you pass a quantized model to another plugin that does not support 8-bit inference but supports all operations from the model, the model is inferred in precision that this plugin supports.
At runtime, the quantized model is loaded to the plugin. The plugin uses the Low Precision Transformation component to update the model to infer it in low precision:
- Update
FakeQuantizelayers to have quantized output tensors in low-precision range and add dequantization layers to compensate for the update. Dequantization layers are pushed through as many layers as possible to have more layers in low precision. After that, most layers have quantized input tensors in low-precision range and can be inferred in low precision. Ideally, dequantization layers should be fused in the nextFakeQuantizelayer. - Weights are quantized and stored in
Constantlayers.
Prerequisites
Let's explore quantized TensorFlow* implementation of the ResNet-50 model. Use [Model Downloader](@ref omz_tools_downloader) to download the FP16 model from OpenVINO™ Toolkit - Open Model Zoo repository:
Note
: If you installed OpenVINO with pip, use
omz_downloaderandomz_quantizerinstead ofdownload.pyandquantize.py. See Open Model Zoo documentation. Replace./benchmark_appwithbenchmark_app.
<omz_dir>/tools/downloader/downloader.py --name resnet-50-tf --precisions FP16-INT8
After that you should quantize the model with the [Model Quantizer](@ref omz_tools_downloader) tool.
<omz_dir>/tools/downloader/quantizer.py --model_dir public/resnet-50-tf --dataset_dir <DATASET_DIR> --precisions=FP16-INT8
The simplest way to infer the model and collect performance counters is the Benchmark Application:
./benchmark_app -m resnet-50-tf.xml -d CPU -niter 1 -api sync -report_type average_counters -report_folder pc_report_dir
If you infer the model with the OpenVINO™ CPU plugin and collect performance counters, all operations (except the last non-quantized SoftMax) are executed in INT8 precision.
Low-Precision 8-bit Integer Inference Workflow
For 8-bit integer computations, a model must be quantized. Quantized models can be downloaded from [Overview of OpenVINO™ Toolkit Intel's Pre-Trained Models](@ref omz_models_group_intel). If the model is not quantized, you can use the [Post-Training Optimization Tool](@ref pot_README) to quantize the model. The quantization process adds FakeQuantize layers on activations and weights for most layers. Read more about mathematical computations in the Uniform Quantization with Fine-Tuning.
8-bit inference pipeline includes two stages (also refer to the figure below):
-
Offline stage, or model quantization. During this stage, FakeQuantize layers are added before most layers to have quantized tensors before layers in a way that low-precision accuracy drop for 8-bit integer inference satisfies the specified threshold. The output of this stage is a quantized model. Quantized model precision is not changed, quantized tensors are in original precision range (
fp32).FakeQuantizelayer haslevelsattribute which defines quants count. Quants count defines precision which is used during inference. Forint8rangelevelsattribute value has to be 255 or 256. To quantize the model, you can use the [Post-Training Optimization Tool](@ref pot_README) delivered with the Intel® Distribution of OpenVINO™ toolkit release package.When you pass the quantized IR to the OpenVINO™ plugin, the plugin automatically recognizes it as a quantized model and performs 8-bit inference. Note, if you pass a quantized model to another plugin that does not support 8-bit inference but supports all operations from the model, the model is inferred in precision that this plugin supports.
-
Runtime stage. This stage is an internal procedure of the OpenVINO™ plugin. During this stage, the quantized model is loaded to the plugin. The plugin uses
Low Precision Transformationcomponent to update the model to infer it in low precision:- Update
FakeQuantizelayers to have quantized output tensors in low precision range and add dequantization layers to compensate the update. Dequantization layers are pushed through as many layers as possible to have more layers in low precision. After that, most layers have quantized input tensors in low precision range and can be inferred in low precision. Ideally, dequantization layers should be fused in the nextFakeQuantizelayer. - Weights are quantized and stored in
Constantlayers.
- Update
![int8_flow]
Performance Counters
Information about layer precision is stored in the performance counters that are available from the Inference Engine API. For example, the part of performance counters table for quantized TensorFlow* implementation of ResNet-50 model inference on CPU Plugin looks as follows:
| layerName | execStatus | layerType | execType | realTime (ms) | cpuTime (ms) |
|---|---|---|---|---|---|
| resnet_model/batch_normalization_15/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_1x1_I8 | 0.377 | 0.377 |
| resnet_model/conv2d_16/Conv2D/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/batch_normalization_16/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_I8 | 0.499 | 0.499 |
| resnet_model/conv2d_17/Conv2D/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/batch_normalization_17/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_1x1_I8 | 0.399 | 0.399 |
| resnet_model/add_4/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/add_4 | NOT_RUN | Eltwise | undef | 0 | 0 |
| resnet_model/add_5/fq_input_1 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
The exeStatus column of the table includes possible values:
EXECUTED- layer was executed by standalone primitive,NOT_RUN- layer was not executed by standalone primitive or was fused with another operation and executed in another layer primitive.
The execType column of the table includes inference primitives with specific suffixes. The layers have the following marks:
- Suffix
I8for layers that had 8-bit data type input and were computed in 8-bit precision - Suffix
FP32for layers computed in 32-bit precision
All Convolution layers are executed in int8 precision. Rest layers are fused into Convolutions using post operations optimization technique, which is described in Internal CPU Plugin Optimizations.