Low Precision IR documentation (#5791)
* Low Precision IR documentation
* Apply suggestions from code review

Co-authored-by: Anastasiya Ageeva <anastasiya.ageeva@intel.com>
parent e848859e23
commit 74c8bd272b
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6c9ddc759bc419268f4c23089b91a9e3373114a1d36b01d6fe62a5e87b5c0ad4
-size 59827
+oid sha256:4b14b03ebb6a00b5f52a8404282f83d4ad214c8d04aea74738027a775c4ef545
+size 100581
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:59890c0c4a6d1c721dfaca22f0c1d0b305401f75dcd30418f858382830be2d31
-size 49598
+oid sha256:cbfadd457b4d943ffb46906a7daf03516e971fe49d2806cd32c84c5015178f03
+size 92819
## Introduction

Inference Engine CPU and GPU plugins can infer models in low precision.
For details, refer to [Low Precision Inference on the CPU](../../../IE_DG/Int8Inference.md).

Intermediate Representation (IR) should be specifically formed to be suitable for low precision inference.
Such an IR is called a Low Precision IR and you can generate it in two ways:
- [Quantize a regular IR with the Post-Training Optimization tool](@ref pot_README)
- Use the Model Optimizer for a model pretrained for low precision inference: TensorFlow\* pre-TFLite models (a `.pb` model file with `FakeQuantize*` operations) or ONNX\* quantized models.

Both TensorFlow and ONNX quantized models can be prepared with the [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf/blob/develop/README.md).
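
For example, assuming a recent OpenVINO release that ships the Model Optimizer Python API (`openvino.tools.mo.convert_model`), a minimal conversion sketch might look as follows; `quantized_model.onnx` and the output file names are placeholders:

```python
# Illustrative sketch: convert an ONNX model that already contains quantization
# operations into a Low Precision IR. Assumes a recent OpenVINO release that
# provides the Model Optimizer Python API; file names are placeholders.
from openvino.tools.mo import convert_model
from openvino.runtime import serialize

ov_model = convert_model("quantized_model.onnx")  # FakeQuantize operations are kept in the graph
serialize(ov_model, "quantized_model.xml", "quantized_model.bin")  # write the IR to disk
```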

For an operation to be executed in INT8, it must have `FakeQuantize` operations as inputs.
See the [specification of the `FakeQuantize` operation](../../../ops/quantization/FakeQuantize_1.md) for details.
To see the list of supported INT8 layers, refer to [INT8 inference on the CPU](../../../IE_DG/Int8Inference.md).

To execute the `Convolution` operation in INT8 on CPU, both data and weight inputs should have `FakeQuantize` as an input operation:

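
As a rough illustration, assuming the OpenVINO Python API (`openvino.runtime`), the following sketch reads an IR and reports which operations receive their inputs from `FakeQuantize`; `model.xml` is a placeholder path:

```python
# Illustrative sketch: list operations whose inputs are produced by FakeQuantize,
# i.e. candidates for low precision execution. "model.xml" is a placeholder path.
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")

for op in model.get_ops():
    producers = [op.input_value(i).get_node().get_type_name()
                 for i in range(len(op.inputs()))]
    if "FakeQuantize" in producers:
        print(f"{op.get_friendly_name()} ({op.get_type_name()}) is fed by: {producers}")
```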

A Low Precision IR is also suitable for FP32 and FP16 inference if a chosen plugin supports all operations of the IR, because the only difference between a Low Precision IR and an FP16 or FP32 IR is the presence of `FakeQuantize` in the Low Precision IR.
Plugins with Low Precision Inference support recognize these sub-graphs and quantize them at inference time.
Plugins without Low Precision support execute all operations, including `FakeQuantize`, as is, in the FP32 or FP16 precision.

Accordingly, the presence of `FakeQuantize` operations in the IR is a recommendation for a plugin on how to quantize particular operations in the model.
If capable, a plugin accepts the recommendation and performs Low Precision Inference; otherwise, the plugin ignores the recommendation and executes the model in floating-point precision.
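
A minimal sketch of this behavior, assuming the OpenVINO Python API: the same Low Precision IR is compiled for two devices, and each plugin decides on its own how to handle the `FakeQuantize` sub-graphs (`model.xml` and the device names are placeholders):

```python
# Illustrative sketch: the same Low Precision IR can be compiled for different devices.
# A plugin with low precision support may execute the quantized sub-graphs in INT8;
# a plugin without it executes FakeQuantize as is, in FP32 or FP16.
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")             # placeholder Low Precision IR

compiled_cpu = core.compile_model(model, "CPU")  # CPU plugin supports low precision
compiled_gpu = core.compile_model(model, "GPU")  # GPU plugin supports low precision as well
```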

## Compressed Low Precision Weights

Weighted operations, like `Convolution`, `MatMul`, and others, store weights as a floating-point `Constant` in the graph, followed by the `FakeQuantize` operation.
A `Constant` followed by the `FakeQuantize` operation can be optimized memory-wise due to the `FakeQuantize` operation semantics.
The resulting weights sub-graph stores weights in a low precision `Constant`, which gets unpacked back to floating point with the `Convert` operation.
Weights compression replaces `FakeQuantize` with optional `Subtract` and `Multiply` operations, leaving the output arithmetically the same, while weight storage takes four times less memory.

See the visualization of `Convolution` with the compressed weights:

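
The decompression chain `Convert` -> `Subtract` -> `Multiply` can be reproduced in NumPy to see why the output stays arithmetically the same while storage shrinks four times; the values and names below are illustrative only:

```python
# Illustrative arithmetic for the compressed-weights sub-graph:
# INT8 Constant -> Convert (to float) -> Subtract (zero point) -> Multiply (scale).
# w_int8, zero_point, and scale are made-up example values.
import numpy as np

w_int8 = np.array([-128, -1, 0, 127], dtype=np.int8)  # compressed Constant, 1 byte per value
zero_point = np.float32(0.0)                          # optional Subtract input
scale = np.float32(0.0157)                            # Multiply input

w_float = (w_int8.astype(np.float32) - zero_point) * scale  # what the weighted operation consumes
print(w_float)                                        # same values the FakeQuantize output would carry
print(w_int8.nbytes, "vs", w_float.nbytes, "bytes")   # 4 vs 16: four times less storage
```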

Both the Model Optimizer and the Post-Training Optimization tool generate a compressed IR by default.
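
As a quick, assumption-based sanity check with the OpenVINO Python API, the generated IR can be inspected for low precision `Constant` nodes to confirm that the weights were stored compressed; `model.xml` is a placeholder path:

```python
# Illustrative check: count Constant nodes stored in INT8/UINT8 in a generated IR.
# "model.xml" is a placeholder path to a compressed Low Precision IR.
from openvino.runtime import Core, Type

core = Core()
model = core.read_model("model.xml")

compressed = [op.get_friendly_name()
              for op in model.get_ops()
              if op.get_type_name() == "Constant"
              and op.get_output_element_type(0) in (Type.i8, Type.u8)]
print(f"{len(compressed)} compressed weight constants found")
```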