Low Precision IR documentation (#5791)

* Low Precision IR documentation

* Apply suggestions from code review

Co-authored-by: Anastasiya Ageeva <anastasiya.ageeva@intel.com>

Evgenya Stepyreva 2021-05-25 17:43:42 +03:00 committed by GitHub
parent e848859e23
commit 74c8bd272b
3 changed files with 20 additions and 20 deletions

## Introduction

Inference Engine CPU and GPU plugins can infer models in low precision.
For details, refer to [Low Precision Inference on the CPU](../../../IE_DG/Int8Inference.md).

Intermediate Representation (IR) should be specifically formed to be suitable for low precision inference.
Such an IR is called a Low Precision IR and you can generate it in two ways:
- [Quantize a regular IR with the Post-Training Optimization tool](@ref pot_README)
- Use the Model Optimizer for a model pretrained for low precision inference: TensorFlow\* pre-TFLite models (`.pb` model file with `FakeQuantize*` operations) and ONNX\* quantized models.

Both TensorFlow and ONNX quantized models can be prepared with the [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf/blob/develop/README.md).
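
Whichever path you take, the resulting IR is a regular `.xml`/`.bin` pair whose `.xml` lists `FakeQuantize` layers explicitly, so you can quickly verify the result. The snippet below is a minimal sketch using only the Python standard library; `model.xml` is a placeholder file name, not a file shipped with the toolkit.

```python
# Minimal sketch: list FakeQuantize layers found in an IR .xml file.
# "model.xml" is a placeholder path; point it at your generated IR.
import xml.etree.ElementTree as ET

root = ET.parse("model.xml").getroot()
fq_layers = [layer.get("name")
             for layer in root.iter("layer")
             if layer.get("type") == "FakeQuantize"]

print(f"FakeQuantize layers: {len(fq_layers)}")
for name in fq_layers:
    print(" ", name)
```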
For an operation to be executed in INT8, it must have `FakeQuantize` operations as inputs.
See the [specification of `FakeQuantize` operation](../../../ops/quantization/FakeQuantize_1.md) for details.
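
As a rough illustration of what `FakeQuantize` does, the NumPy sketch below follows the formula from the specification linked above: values are quantized to `levels` evenly spaced points between the output bounds. Treat it as an approximation for intuition, not as a reference implementation.

```python
# Rough NumPy sketch of FakeQuantize semantics (per the linked specification):
# values outside [input_low, input_high] saturate, values inside are mapped
# onto `levels` evenly spaced points in [output_low, output_high].
import numpy as np

def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    x = np.asarray(x, dtype=np.float32)
    q = np.round((x - input_low) / (input_high - input_low) * (levels - 1))
    q = np.clip(q, 0, levels - 1)
    return q / (levels - 1) * (output_high - output_low) + output_low

data = np.array([-1.5, -0.3, 0.0, 0.7, 2.0])
print(fake_quantize(data, input_low=-1.0, input_high=1.0,
                    output_low=-1.0, output_high=1.0))
```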
To execute the `Convolution` operation in INT8 on CPU, both data and weight inputs should have `FakeQuantize` as an input operation:

![](../../img/expanded_int8_Convolution_weights.png)
A Low Precision IR is also suitable for FP32 and FP16 inference if a chosen plugin supports all operations of the IR, because the only difference between a Low Precision IR and an FP16 or FP32 IR is the presence of `FakeQuantize` in the Low Precision IR.
Plugins with low precision inference support recognize these sub-graphs and quantize them at inference time.
Plugins without low precision support execute all operations, including `FakeQuantize`, as is, in the FP32 or FP16 precision.

Accordingly, the presence of `FakeQuantize` operations in the IR is a recommendation for a plugin on how to quantize particular operations in the model.
If capable, a plugin accepts the recommendation and performs low precision inference; otherwise, it ignores the recommendation and executes the model in floating-point precision.
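
In practice this means the same Low Precision IR is loaded through the same API on every device, and the plugin alone decides how to execute the `FakeQuantize` sub-graphs. The snippet below is a minimal sketch using the Inference Engine Python API; `model.xml` and `model.bin` are placeholder file names.

```python
# Minimal sketch: the same Low Precision IR is read and loaded the same way
# regardless of the target device; the plugin decides whether the FakeQuantize
# sub-graphs run in INT8 or stay in floating point.
# "model.xml" / "model.bin" are placeholder paths.
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")

# The CPU plugin supports low precision inference and may quantize the sub-graphs.
exec_net = ie.load_network(network=net, device_name="CPU")

# A plugin without low precision support would execute FakeQuantize as is
# in FP32 or FP16, e.g.:
# exec_net = ie.load_network(network=net, device_name="<OTHER_DEVICE>")
```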
## Compressed Low Precision Weights

Weighted operations, like `Convolution`, `MatMul`, and others, store weights as a floating-point `Constant` in the graph, followed by the `FakeQuantize` operation.
A `Constant` followed by the `FakeQuantize` operation can be optimized memory-wise due to the `FakeQuantize` operation semantics.
The resulting weights sub-graph stores weights in a low precision `Constant`, which gets unpacked back to floating point with the `Convert` operation.
Weights compression replaces `FakeQuantize` with optional `Subtract` and `Multiply` operations, leaving the output arithmetically the same, while the stored weights take four times less memory.

See the visualization of `Convolution` with the compressed weights:

![](../../img/compressed_int8_Convolution_weights.png)
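
The NumPy sketch below illustrates the arithmetic on made-up values: the original FP32 weights plus `FakeQuantize` are replaced by an INT8 `Constant` that a `Convert`, `Subtract`, and `Multiply` chain restores to the same values. The quantization range and zero point are assumptions chosen for the example, not parameters of any real model.

```python
# Illustrative NumPy sketch of compressed weights (not the actual transformation code):
# an FP32 Constant + FakeQuantize pair is replaced by an INT8 Constant that is
# Convert-ed back to FP32 and rescaled with Subtract and Multiply.
import numpy as np

levels = 256
in_low, in_high = -1.0, 1.0     # made-up FakeQuantize input range
out_low, out_high = -1.0, 1.0   # made-up FakeQuantize output range

weights_fp32 = np.random.uniform(in_low, in_high, size=8).astype(np.float32)

# What FakeQuantize would produce on the original FP32 weights.
q_idx = np.clip(np.round((weights_fp32 - in_low) / (in_high - in_low) * (levels - 1)),
                0, levels - 1)
fq_output = q_idx / (levels - 1) * (out_high - out_low) + out_low

# Compressed representation: INT8 Constant -> Convert -> Subtract -> Multiply.
zero_point = 128.0                             # shift so indices fit into signed INT8
scale = (out_high - out_low) / (levels - 1)    # Multiply constant
shift = -zero_point - out_low / scale          # Subtract constant

weights_int8 = (q_idx - zero_point).astype(np.int8)                # stored INT8 Constant
decompressed = (weights_int8.astype(np.float32) - shift) * scale   # Convert, Subtract, Multiply

print(np.allclose(decompressed, fq_output))    # True: arithmetically the same
```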
Both Model Optimizer and the Post-Training Optimization tool generate a compressed IR by default.