* Extensibility guide with FE extensions and remove OV_FRAMEWORK_MAP from docs * Rework of Extensibility Intro, adopted examples to missing OPENVINO_FRAMEWORK_MAP * Removed OPENVINO_FRAMEWORK_MAP reference * Frontend extension detailed documentation * Fixed distributed snippets * Fixed snippet inclusion in FE extension document and chapter headers * Fixed wrong name in a snippet reference * Fixed test for template extension due to changed number of loaded extensions * Update docs/Extensibility_UG/frontend_extensions.md Co-authored-by: Ivan Tikhonov <ivan.tikhonov@intel.com> * Minor fixes in extension snippets * Small grammar fix Co-authored-by: Ivan Tikhonov <ivan.tikhonov@intel.com> Co-authored-by: Ivan Tikhonov <ivan.tikhonov@intel.com> * DOCS: transition banner (#10973) * transition banner * minor fix * update transition banner * updates * update custom.js * updates * updates * Documentation fixes (#11044) * Benchmark app usage * Fixed link to the devices * More fixes * Update docs/OV_Runtime_UG/multi_device.md Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com> * Removed several hardcoded links Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com> * Updated documentation for compile_tool (#11049) * Added deployment guide (#11060) * Added deployment guide * Added local distribution * Updates * Fixed more indentations * Removed obsolete code snippets (#11061) * Removed obsolete code snippets * NCC style * Fixed NCC for BA * Add a troubleshooting issue for PRC installation (#11074) * updates * adding gna to linux * add missing reference * update * Update docs/install_guides/installing-model-dev-tools.md Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com> * Update docs/install_guides/installing-model-dev-tools.md Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com> * Update docs/install_guides/installing-model-dev-tools.md Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com> * Update docs/install_guides/installing-model-dev-tools.md Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com> * Update docs/install_guides/installing-model-dev-tools.md Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com> * update * minor updates * add gna item to yum and apt * add gna to get started page * update reference formatting * merge commit * add a troubleshooting issue * update * update * fix CVS-71846 Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com> * DOCS: fixed hardcoded links (#11100) * Fixes * Use links * applying reviewers comments to the Opt Guide (#11093) * applying reviewrs comments * fixed refs, more structuring (bold, bullets, etc) * refactoring tput/latency sections * next iteration (mostly latency), also brushed the auto-batching and other sections * updates sync/async images * common opts brushed * WIP tput redesigned * minor brushing of common and auto-batching * Tput fully refactored * fixed doc name in the link * moved int8 perf counters to the right section * fixed links * fixed broken quotes * fixed more links * add ref to the internals to the TOC * Added a note on the batch size Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> * [80085] New images for docs (#11114) * change doc structure * fix manager tools * fix manager tools 3 step * fix manager tools 3 step * new img * new img for OV Runtime * fix steps * steps * fix intendents * change list * fix space * fix space * code snippets fix * change display * Benchmarks 2022 1 (#11130) * Minor fixes * Updates for 2022.1 * Edits according to the review * Edits according to review comments * Edits according to review comments * Edits according to review comments * Fixed table * Edits according to review comments * Removed config for Intel® Core™ i7-11850HE * Removed forward-tacotron-duration-prediction-241 graph * Added resnet-18-pytorch * Add info about Docker images in Deployment guide (#11136) * Renamed user guides (#11137) * fix screenshot (#11140) * More conservative recommendations on dynamic shapes usage in docs (#11161) * More conservative recommendations about using dynamic shapes * Duplicated statement from C++ part to Python part of reshape doc (no semantical changes) * Update ShapeInference.md (#11168) * Benchmarks 2022 1 updates (#11180) * Updated graphs * Quick fix for TODO in Dynamic Shapes article * Anchor link fixes * Fixed DM config (#11199) * DOCS: doxy sphinxtabs (#11027) * initial implementation of doxy sphinxtabs * fixes * fixes * fixes * fixes * fixes * WA for ignored visibility attribute * Fixes Co-authored-by: Sergey Lyalin <sergey.lyalin@intel.com> Co-authored-by: Ivan Tikhonov <ivan.tikhonov@intel.com> Co-authored-by: Nikolay Tyukaev <nikolay.tyukaev@intel.com> Co-authored-by: Sergey Lyubimtsev <sergey.lyubimtsev@intel.com> Co-authored-by: Yuan Xu <yuan1.xu@intel.com> Co-authored-by: Maxim Shevtsov <maxim.y.shevtsov@intel.com> Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com> Co-authored-by: Tatiana Savina <tatiana.savina@intel.com> Co-authored-by: Ilya Naumov <ilya.naumov@intel.com> Co-authored-by: Evgenya Stepyreva <evgenya.stepyreva@intel.com>
6.3 KiB
Low-Precision 8-bit Integer Inference
Disclaimer
Low-precision 8-bit inference is optimized for:
- Intel® architecture processors with the following instruction set architecture extensions:
- Intel® Advanced Vector Extensions 512 Vector Neural Network Instructions (Intel® AVX-512 VNNI)
- Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
- Intel® Advanced Vector Extensions 2.0 (Intel® AVX2)
- Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2)
- Intel® processor graphics:
- Intel® Iris® Xe Graphics
- Intel® Iris® Xe MAX Graphics
Introduction
For 8-bit integer computation, a model must be quantized. You can use a quantized model from [OpenVINO™ Toolkit Intel's Pre-Trained Models](@ref omz_models_group_intel) or quantize a model yourself. For quantization, you can use the following:
- [Post-Training Optimization Tool](@ref pot_docs_LowPrecisionOptimizationGuide) delivered with the Intel® Distribution of OpenVINO™ toolkit release package
- Neural Network Compression Framework available on GitHub: https://github.com/openvinotoolkit/nncf
The quantization process adds FakeQuantize layers on activations and weights for most layers. Read more about mathematical computations in the Uniform Quantization with Fine-Tuning.
When you pass the quantized IR to the OpenVINO™ plugin, the plugin automatically recognizes it as a quantized model and performs 8-bit inference. Note that if you pass a quantized model to another plugin that does not support 8-bit inference but supports all operations from the model, the model is inferred in precision that this plugin supports.
At runtime, the quantized model is loaded to the plugin. The plugin uses the Low Precision Transformation component to update the model to infer it in low precision:
- Update
FakeQuantizelayers to have quantized output tensors in low-precision range and add dequantization layers to compensate for the update. Dequantization layers are pushed through as many layers as possible to have more layers in low precision. After that, most layers have quantized input tensors in low-precision range and can be inferred in low precision. Ideally, dequantization layers should be fused in the nextFakeQuantizelayer. - Weights are quantized and stored in
Constantlayers.
Prerequisites
Let's explore quantized TensorFlow* implementation of the ResNet-50 model. Use [Model Downloader](@ref omz_tools_downloader) to download the FP16 model from OpenVINO™ Toolkit - Open Model Zoo repository:
omz_downloader --name resnet-50-tf --precisions FP16-INT8
After that you should quantize the model with the [Model Quantizer](@ref omz_tools_downloader) tool.
omz_quantizer --model_dir public/resnet-50-tf --dataset_dir <DATASET_DIR> --precisions=FP16-INT8
The simplest way to infer the model and collect performance counters is the Benchmark Application:
./benchmark_app -m resnet-50-tf.xml -d CPU -niter 1 -api sync -report_type average_counters -report_folder pc_report_dir
If you infer the model with the OpenVINO™ CPU plugin and collect performance counters, all operations (except the last non-quantized SoftMax) are executed in INT8 precision.
Low-Precision 8-bit Integer Inference Workflow
For 8-bit integer computations, a model must be quantized. Quantized models can be downloaded from [Overview of OpenVINO™ Toolkit Intel's Pre-Trained Models](@ref omz_models_group_intel). If the model is not quantized, you can use the [Post-Training Optimization Tool](@ref pot_README) to quantize the model. The quantization process adds FakeQuantize layers on activations and weights for most layers. Read more about mathematical computations in the Uniform Quantization with Fine-Tuning.
8-bit inference pipeline includes two stages (also refer to the figure below):
-
Offline stage, or model quantization. During this stage, FakeQuantize layers are added before most layers to have quantized tensors before layers in a way that low-precision accuracy drop for 8-bit integer inference satisfies the specified threshold. The output of this stage is a quantized model. Quantized model precision is not changed, quantized tensors are in original precision range (
fp32).FakeQuantizelayer haslevelsattribute which defines quants count. Quants count defines precision which is used during inference. Forint8rangelevelsattribute value has to be 255 or 256. To quantize the model, you can use the [Post-Training Optimization Tool](@ref pot_README) delivered with the Intel® Distribution of OpenVINO™ toolkit release package.When you pass the quantized IR to the OpenVINO™ plugin, the plugin automatically recognizes it as a quantized model and performs 8-bit inference. Note, if you pass a quantized model to another plugin that does not support 8-bit inference but supports all operations from the model, the model is inferred in precision that this plugin supports.
-
Runtime stage. This stage is an internal procedure of the OpenVINO™ plugin. During this stage, the quantized model is loaded to the plugin. The plugin uses
Low Precision Transformationcomponent to update the model to infer it in low precision:- Update
FakeQuantizelayers to have quantized output tensors in low precision range and add dequantization layers to compensate the update. Dequantization layers are pushed through as many layers as possible to have more layers in low precision. After that, most layers have quantized input tensors in low precision range and can be inferred in low precision. Ideally, dequantization layers should be fused in the nextFakeQuantizelayer. - Weights are quantized and stored in
Constantlayers.
- Update
![int8_flow]