Low Precision IR documentation (#5791)

* Low Precision IR documentation

* Apply suggestions from code review

Co-authored-by: Anastasiya Ageeva <anastasiya.ageeva@intel.com>

Evgenya Stepyreva, 2021-05-25 17:43:42 +03:00, committed by GitHub
parent e848859e23, commit 74c8bd272b
3 changed files with 20 additions and 20 deletions

(The first two changed files are images tracked with Git LFS; only their LFS pointers, i.e. object IDs and file sizes, changed. The third changed file is the Low Precision IR documentation page, shown below in its updated form.)

## Introduction
Inference Engine CPU and GPU plugins can infer models in low precision.
For details, refer to [Low Precision Inference on the CPU](../../../IE_DG/Int8Inference.md).
Intermediate Representation (IR) should be specifically formed to be suitable for low precision inference.
Such an IR is called a Low Precision IR and you can generate it in two ways:
- [Quantize a regular IR with the Post-Training Optimization tool](@ref pot_README)
- Use the Model Optimizer for a model pretrained for low precision inference: TensorFlow\* pre-TFLite models (a `.pb` model file with `FakeQuantize*` operations) and ONNX\* quantized models.

Both TensorFlow and ONNX quantized models can be prepared with the [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf/blob/develop/README.md).
For an operation to be executed in INT8, it must have `FakeQuantize` operations as inputs.
See the [specification of the `FakeQuantize` operation](../../../ops/quantization/FakeQuantize_1.md) for details.
To execute the `Convolution` operation in INT8 on CPU, both data and weight inputs should have `FakeQuantize` as an input operation:
![](../../img/expanded_int8_Convolution_weights.png)
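
The following is a simplified, per-tensor NumPy sketch of the `FakeQuantize` semantics described in that specification; the ranges and input values are arbitrary examples, not values from a real model:

```python
import numpy as np

def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    """Clamp to the input range, snap to one of `levels` values, rescale to the output range."""
    x = np.clip(x, input_low, input_high)
    steps = levels - 1
    q = np.round((x - input_low) / (input_high - input_low) * steps)
    return q / steps * (output_high - output_low) + output_low

data = np.array([-1.5, -0.2, 0.3, 1.7], dtype=np.float32)
print(fake_quantize(data, input_low=-1.0, input_high=1.0,
                    output_low=-1.0, output_high=1.0))
```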
A Low Precision IR is also suitable for FP32 and FP16 inference if a chosen plugin supports all operations of the IR, because the only difference between a Low Precision IR and an FP16 or FP32 IR is the presence of `FakeQuantize` operations in the Low Precision IR.
Plugins with low precision inference support recognize these sub-graphs and quantize them at inference time.
Plugins without low precision support execute all operations, including `FakeQuantize`, as is, in FP32 or FP16 precision.
Accordingly, the presence of `FakeQuantize` operations in the IR is a recommendation for a plugin on how to quantize particular operations in the model.
If capable, a plugin accepts the recommendation and performs low precision inference; otherwise, the plugin ignores the recommendation and executes the model in floating-point precision.
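
As an illustration, with the Inference Engine Python API the same IR is loaded in the same way on any device, and the chosen plugin decides how to execute the `FakeQuantize` sub-graphs. A minimal sketch, assuming `model.xml` and `model.bin` form a Low Precision IR (the file names are placeholders):

```python
from openvino.inference_engine import IECore

ie = IECore()
# The IR is read the same way regardless of whether it contains FakeQuantize operations.
net = ie.read_network(model="model.xml", weights="model.bin")

# A plugin with low precision support (e.g. CPU) quantizes the FakeQuantize sub-graphs;
# a plugin without it executes FakeQuantize as is, in FP32/FP16.
exec_net = ie.load_network(network=net, device_name="CPU")
```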
## Compressed Low Precision Weights
Weighted operations, like `Convolution`, `MatMul`, and others, store weights as a floating-point `Constant` in the graph, followed by the `FakeQuantize` operation.
A `Constant` followed by the `FakeQuantize` operation can be optimized memory-wise due to the `FakeQuantize` operation semantics.
The resulting weights sub-graph stores the weights in a low precision `Constant`, which gets unpacked back to floating point with the `Convert` operation.
Weights compression replaces `FakeQuantize` with optional `Subtract` and `Multiply` operations, which leave the output arithmetically the same while the stored weights take four times less memory.
See the visualization of `Convolution` with the compressed weights:
![](../../img/compressed_int8_Convolution_weights.png)
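
The effect of the compression can be sketched in NumPy: the low precision `Constant` takes four times less memory, and `Convert`, `Subtract`, and `Multiply` reproduce the original `FakeQuantize` output. This is an illustration with an arbitrary per-tensor range, not the exact constants the tools emit:

```python
import numpy as np

# Example FP32 weights and FakeQuantize parameters (arbitrary per-tensor range).
weights_fp32 = np.random.uniform(-1.0, 1.0, size=(16, 8)).astype(np.float32)
levels, in_low, in_high = 256, -1.0, 1.0
scale = (in_high - in_low) / (levels - 1)
zero_point = -in_low / scale

# Expanded IR: FakeQuantize applied to the floating-point Constant.
fq_out = np.round((np.clip(weights_fp32, in_low, in_high) - in_low) / scale) * scale + in_low

# Compressed IR: the same information stored as a low precision Constant (4x smaller) ...
weights_u8 = np.round((np.clip(weights_fp32, in_low, in_high) - in_low) / scale).astype(np.uint8)
# ... and unpacked at runtime as Convert -> Subtract -> Multiply.
unpacked = (weights_u8.astype(np.float32) - zero_point) * scale

print(np.allclose(fq_out, unpacked))             # True: arithmetically the same
print(weights_fp32.nbytes // weights_u8.nbytes)  # 4: four times less memory
```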
Both the Model Optimizer and the Post-Training Optimization tool generate a compressed IR by default.