Low Precision IR documentation (#5791)

* Low Precision IR documentation

* Apply suggestions from code review

Co-authored-by: Anastasiya Ageeva <anastasiya.ageeva@intel.com>

Evgenya Stepyreva, 2021-05-25 17:43:42 +03:00, committed by GitHub
parent e848859e23, commit 74c8bd272b
3 changed files with 20 additions and 20 deletions

(The first two changed files are images tracked with Git LFS; only their LFS pointers, i.e. object IDs and file sizes, changed. The third changed file is the Low Precision IR documentation page, shown below in its updated form.)

## Introduction
Inference Engine CPU and GPU plugins can infer models in low precision.
For details, refer to [Low Precision Inference on the CPU](../../../IE_DG/Int8Inference.md).
Intermediate Representation (IR) should be specifically formed to be suitable for low precision inference.
Such an IR is called a Low Precision IR and you can generate it in two ways:
- [Quantize a regular IR with the Post-Training Optimization tool](@ref pot_README)
- Use the Model Optimizer for a model pretrained for low precision inference: TensorFlow\* pre-TFLite models (a `.pb` model file with `FakeQuantize*` operations) and ONNX\* quantized models.

Both TensorFlow and ONNX quantized models can be prepared with the [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf/blob/develop/README.md).
For an operation to be executed in INT8, it must have `FakeQuantize` operations as inputs.
See the [specification of the `FakeQuantize` operation](../../../ops/quantization/FakeQuantize_1.md) for details.
To execute the `Convolution` operation in INT8 on CPU, both data and weight inputs should have `FakeQuantize` as an input operation:
![](../../img/expanded_int8_Convolution_weights.png)
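
The following is a simplified, per-tensor NumPy sketch of the `FakeQuantize` semantics described in that specification; the ranges and input values are arbitrary examples, not values from a real model:

```python
import numpy as np

def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    """Clamp to the input range, snap to one of `levels` values, rescale to the output range."""
    x = np.clip(x, input_low, input_high)
    steps = levels - 1
    q = np.round((x - input_low) / (input_high - input_low) * steps)
    return q / steps * (output_high - output_low) + output_low

data = np.array([-1.5, -0.2, 0.3, 1.7], dtype=np.float32)
print(fake_quantize(data, input_low=-1.0, input_high=1.0,
                    output_low=-1.0, output_high=1.0))
```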
A Low Precision IR is also suitable for FP32 and FP16 inference if a chosen plugin supports all operations of the IR, because the only difference between a Low Precision IR and an FP16 or FP32 IR is the presence of `FakeQuantize` operations in the Low Precision IR.
Plugins with low precision inference support recognize these sub-graphs and quantize them at inference time.
Plugins without low precision support execute all operations, including `FakeQuantize`, as is, in FP32 or FP16 precision.
Accordingly, the presence of `FakeQuantize` operations in the IR is a recommendation for a plugin on how to quantize particular operations in the model.
If capable, a plugin accepts the recommendation and performs low precision inference; otherwise, the plugin ignores the recommendation and executes the model in floating-point precision.
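
As an illustration, with the Inference Engine Python API the same IR is loaded in the same way on any device, and the chosen plugin decides how to execute the `FakeQuantize` sub-graphs. A minimal sketch, assuming `model.xml` and `model.bin` form a Low Precision IR (the file names are placeholders):

```python
from openvino.inference_engine import IECore

ie = IECore()
# The IR is read the same way regardless of whether it contains FakeQuantize operations.
net = ie.read_network(model="model.xml", weights="model.bin")

# A plugin with low precision support (e.g. CPU) quantizes the FakeQuantize sub-graphs;
# a plugin without it executes FakeQuantize as is, in FP32/FP16.
exec_net = ie.load_network(network=net, device_name="CPU")
```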
## Compressed Low Precision Weights
Weighted operations, like `Convolution`, `MatMul`, and others, store weights as a floating-point `Constant` in the graph, followed by the `FakeQuantize` operation.
A `Constant` followed by the `FakeQuantize` operation can be optimized memory-wise due to the `FakeQuantize` operation semantics.
The resulting weights sub-graph stores the weights in a low precision `Constant`, which gets unpacked back to floating point with the `Convert` operation.
Weights compression replaces `FakeQuantize` with optional `Subtract` and `Multiply` operations, which leave the output arithmetically the same while the stored weights take four times less memory.
See the visualization of `Convolution` with the compressed weights:
![](../../img/compressed_int8_Convolution_weights.png)
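
The effect of the compression can be sketched in NumPy: the low precision `Constant` takes four times less memory, and `Convert`, `Subtract`, and `Multiply` reproduce the original `FakeQuantize` output. This is an illustration with an arbitrary per-tensor range, not the exact constants the tools emit:

```python
import numpy as np

# Example FP32 weights and FakeQuantize parameters (arbitrary per-tensor range).
weights_fp32 = np.random.uniform(-1.0, 1.0, size=(16, 8)).astype(np.float32)
levels, in_low, in_high = 256, -1.0, 1.0
scale = (in_high - in_low) / (levels - 1)
zero_point = -in_low / scale

# Expanded IR: FakeQuantize applied to the floating-point Constant.
fq_out = np.round((np.clip(weights_fp32, in_low, in_high) - in_low) / scale) * scale + in_low

# Compressed IR: the same information stored as a low precision Constant (4x smaller) ...
weights_u8 = np.round((np.clip(weights_fp32, in_low, in_high) - in_low) / scale).astype(np.uint8)
# ... and unpacked at runtime as Convert -> Subtract -> Multiply.
unpacked = (weights_u8.astype(np.float32) - zero_point) * scale

print(np.allclose(fq_out, unpacked))             # True: arithmetically the same
print(weights_fp32.nbytes // weights_u8.nbytes)  # 4: four times less memory
```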
Both the Model Optimizer and the Post-Training Optimization tool generate a compressed IR by default.