# Low-Precision 8-bit Integer Inference

## Supported devices
Low-precision 8-bit inference is optimized for:
- Intel® architecture processors with the following instruction set architecture extensions:
  - Intel® Advanced Vector Extensions 512 Vector Neural Network Instructions (Intel® AVX-512 VNNI)
  - Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  - Intel® Advanced Vector Extensions 2.0 (Intel® AVX2)
  - Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2)
- Intel® processor graphics:
  - Intel® Iris® Xe Graphics
  - Intel® Iris® Xe MAX Graphics
- A model must be quantized. You can use a quantized model from [OpenVINO™ Toolkit Intel's Pre-Trained Models](@ref omz_models_group_intel) or quantize a model yourself. For quantization, you can use one of the following:
  - [Post-Training Optimization Tool](@ref pot_docs_LowPrecisionOptimizationGuide) delivered with the Intel® Distribution of OpenVINO™ toolkit release package
  - [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf) available on GitHub
## Low-Precision 8-bit Integer Inference Workflow
8-bit computations (referred to as `int8`) offer better performance than inference in higher precision (for example, `fp32`) because they allow more data to be loaded into a single processor instruction. The cost of this performance boost is usually reduced accuracy. However, the accuracy drop is often negligible and depends on task requirements, so the application engineer can define the maximum accuracy drop that is acceptable.
For 8-bit integer computations, a model must be quantized. Quantized models can be downloaded from the [Overview of OpenVINO™ Toolkit Intel's Pre-Trained Models](@ref omz_models_group_intel). If the model is not quantized, you can use the [Post-Training Optimization Tool](@ref pot_README) to quantize it. The quantization process adds `FakeQuantize` layers on activations and weights for most layers. Read more about the mathematical computations in Uniform Quantization with Fine-Tuning.
When you pass the quantized IR to the OpenVINO™ plugin, the plugin automatically recognizes it as a quantized model and performs 8-bit inference. Note that if you pass a quantized model to another plugin that does not support 8-bit inference but supports all operations from the model, the model is inferred in the precision that this plugin supports.
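For illustration, below is a minimal C++ sketch of this flow, assuming the 2021.x Inference Engine API and a quantized IR file named `resnet-50-tf.xml`. Loading a quantized model requires no special API calls:

```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core ie;

    // Read the quantized IR. The FakeQuantize operations it contains are enough
    // for the plugin to recognize it as a quantized model.
    InferenceEngine::CNNNetwork network = ie.ReadNetwork("resnet-50-tf.xml");

    // The CPU plugin applies the Low Precision Transformations at load time and
    // executes supported operations in INT8.
    InferenceEngine::ExecutableNetwork exec_network = ie.LoadNetwork(network, "CPU");

    InferenceEngine::InferRequest infer_request = exec_network.CreateInferRequest();
    // ... fill the input blobs here ...
    infer_request.Infer();
    return 0;
}
```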
At the runtime stage, the quantized model is loaded to the plugin. The plugin uses the Low Precision Transformation component to update the model to infer it in low precision:
- Update `FakeQuantize` layers to have quantized output tensors in a low-precision range and add dequantization layers to compensate for the update. Dequantization layers are pushed through as many layers as possible to have more layers in low precision. After that, most layers have quantized input tensors in the low-precision range and can be inferred in low precision. Ideally, dequantization layers should be fused into the next `FakeQuantize` layer.
- Quantize weights and store them in `Constant` layers.
## Prerequisites
Let's explore the quantized TensorFlow* implementation of the ResNet-50 model. Use the [Model Downloader](@ref omz_tools_downloader) tool to download the `fp16` model from the OpenVINO™ Toolkit - Open Model Zoo repository:
```sh
cd $INTEL_OPENVINO_DIR/deployment_tools/tools/model_downloader
./downloader.py --name resnet-50-tf --precisions FP16-INT8 --output_dir <your_model_directory>
```
After that, quantize the model with the [Model Quantizer](@ref omz_tools_downloader) tool. For the dataset, you can use the ImageNet dataset.
```sh
./quantizer.py --name resnet-50-tf --model_dir <your_model_directory> --dataset_dir <DATASET_DIR> --precisions=FP16-INT8
```
## Inference
The simplest way to infer the model and collect performance counters is to use the C++ Benchmark Application:
```sh
./benchmark_app -m resnet-50-tf.xml -d CPU -niter 1 -api sync -report_type average_counters -report_folder pc_report_dir
```
If you infer the model with the Inference Engine CPU plugin and collect performance counters, all operations (except the last non-quantized SoftMax) are executed in INT8 precision.
## Results analysis
Information about layer precision is stored in the performance counters that are available from the Inference Engine API. For example, part of the performance counters table for the quantized TensorFlow* implementation of the ResNet-50 model inferred on the CPU plugin looks as follows:
| layerName | execStatus | layerType | execType | realTime (ms) | cpuTime (ms) |
|---|---|---|---|---|---|
| resnet_model/batch_normalization_15/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_1x1_I8 | 0.377 | 0.377 |
| resnet_model/conv2d_16/Conv2D/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/batch_normalization_16/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_I8 | 0.499 | 0.499 |
| resnet_model/conv2d_17/Conv2D/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/batch_normalization_17/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_1x1_I8 | 0.399 | 0.399 |
| resnet_model/add_4/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/add_4 | NOT_RUN | Eltwise | undef | 0 | 0 |
| resnet_model/add_5/fq_input_1 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
The `execStatus` column of the table includes the following possible values:
- `EXECUTED` - the layer was executed by a standalone primitive.
- `NOT_RUN` - the layer was not executed by a standalone primitive or was fused with another operation and executed in another layer primitive.

The `execType` column of the table includes inference primitives with specific suffixes. The layers have the following marks:
- Suffix `I8` for layers that had 8-bit data type input and were computed in 8-bit precision
- Suffix `FP32` for layers computed in 32-bit precision
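If you prefer to inspect these counters programmatically rather than through the Benchmark Application report, the following minimal C++ sketch, assuming the 2021.x Inference Engine API and the same `resnet-50-tf.xml` model, retrieves the per-layer counters and prints the execution primitive type of executed layers:

```cpp
#include <inference_engine.hpp>
#include <iostream>
#include <map>
#include <string>

int main() {
    InferenceEngine::Core ie;
    InferenceEngine::CNNNetwork network = ie.ReadNetwork("resnet-50-tf.xml");

    // Enable per-layer performance counters before loading the network.
    InferenceEngine::ExecutableNetwork exec_network = ie.LoadNetwork(
        network, "CPU",
        {{InferenceEngine::PluginConfigParams::KEY_PERF_COUNT,
          InferenceEngine::PluginConfigParams::YES}});

    InferenceEngine::InferRequest infer_request = exec_network.CreateInferRequest();
    infer_request.Infer();

    // Each entry maps a layer name to its status and execution primitive type.
    const std::map<std::string, InferenceEngine::InferenceEngineProfileInfo> counters =
        infer_request.GetPerformanceCounts();
    for (const auto& item : counters) {
        const auto& info = item.second;
        if (info.status == InferenceEngine::InferenceEngineProfileInfo::EXECUTED) {
            // exec_type carries the primitive name with a precision suffix,
            // for example "jit_avx512_1x1_I8" for an INT8 convolution kernel.
            std::cout << item.first << " -> " << info.exec_type << std::endl;
        }
    }
    return 0;
}
```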
All `Convolution` layers are executed in int8 precision. The remaining layers are fused into Convolutions using the post-operations optimization technique, which is described in Internal CPU Plugin Optimizations.