Quantized networks compute and restrictions
One of the features of Inference Engine is support for quantized networks with different precisions: INT8, INT4, etc. However, it is up to each plugin to define which precisions the particular HW supports. All quantized networks that can be expressed in IR have a unified representation by means of the FakeQuantize operation. For more details about low-precision model representation, please refer to this [document](@ref openvino_docs_ie_plugin_dg_lp_representation).
Interpreting FakeQuantize at runtime
During the model load each plugin can interpret quantization rules expressed in FakeQuantize operations:
- Independently based on the definition of FakeQuantize operation.
- Using a special library of low-precision transformations (LPT), which applies common rules for generic operations, such as Convolution, Fully-Connected, Eltwise, etc., and translates "fake-quantized" models into models with low-precision operations. For more information about the low-precision flow, please refer to the following document.
Here we provide only a high-level overview of the interpretation rules of FakeQuantize. At runtime each FakeQuantize can be split into two independent operations: Quantize and Dequantize. The former transforms the input data into the target precision, while the latter transforms the resulting values back to the original range and precision. In practice, Dequantize operations can be propagated forward through linear operations, such as Convolution or Fully-Connected, and in some cases fused with the Quantize operation of the next layer into a so-called Requantize operation (see Fig. 1).
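The propagation of Dequantize through a linear operation can be illustrated with a minimal sketch. It assumes symmetric quantization (zero-points equal to zero) and uses a plain dot product as a stand-in for Convolution or Fully-Connected. The names `int_dot` and `requantize` and all scale values are hypothetical and chosen only for illustration; this is not a description of how any particular plugin implements the fusion.

```python
def int_dot(qa, qw):
    """Integer dot product of quantized activations and quantized weights."""
    return sum(a * w for a, w in zip(qa, qw))


def requantize(acc, scale_a, scale_w, scale_next):
    """Fuse Dequantize (scale_a * scale_w) with the next Quantize (1 / scale_next).

    Assumes symmetric quantization, so no zero-point terms appear.
    """
    return round(acc * (scale_a * scale_w / scale_next))


# Hypothetical quantized activations, weights, and scales.
qa = [100, -30, 70]
qw = [20, 50, -10]
scale_a, scale_w, scale_next = 0.05, 0.02, 0.01

# Reference path: dequantize everything, compute in floating point,
# then quantize the result for the next layer.
float_result = sum((scale_a * a) * (scale_w * w) for a, w in zip(qa, qw))
q_reference = round(float_result / scale_next)

# Fused path: compute in integers, then apply a single Requantize step.
acc = int_dot(qa, qw)
q_fused = requantize(acc, scale_a, scale_w, scale_next)

print(acc, q_reference, q_fused)
```

Because the scales are constant factors, they can be moved out of the sum, which is why the integer accumulation followed by one rescaling gives the same result as the floating-point path in this example.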
In terms of computation, the FakeQuantize formula splits into the same two parts. The full formula is:
output = round((x - input_low) / (input_high - input_low) * (levels-1)) / (levels-1) * (output_high - output_low) + output_low
The first part of this formula represents Quantize operation:
q = round((x - input_low) / (input_high - input_low) * (levels-1))
The second is responsible for the dequantization:
r = q / (levels-1) * (output_high - output_low) + output_low
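The split can be sketched in a few lines of Python. This is a minimal scalar illustration of the two formulas above, not a plugin implementation. It assumes the input already lies inside [input_low, input_high] (no clamping is applied), and it relies on Python's built-in `round`, which rounds ties to the nearest even integer; the formulas above do not specify tie-breaking.

```python
def quantize(x, input_low, input_high, levels):
    """First part of FakeQuantize: map x onto the integer grid 0..levels-1."""
    return round((x - input_low) / (input_high - input_low) * (levels - 1))


def dequantize(q, output_low, output_high, levels):
    """Second part of FakeQuantize: map the grid index back to the output range."""
    return q / (levels - 1) * (output_high - output_low) + output_low


def fake_quantize(x, input_low, input_high, output_low, output_high, levels):
    """Full FakeQuantize as the composition of Quantize and Dequantize."""
    q = quantize(x, input_low, input_high, levels)
    return dequantize(q, output_low, output_high, levels)


# Example values (illustrative only): 256 levels, identical input/output range.
levels = 256
q = quantize(0.3, -1.0, 1.0, levels)
r = dequantize(q, -1.0, 1.0, levels)
print(q, r)
```

With identical input and output ranges, the reconstructed value `r` differs from the original input by at most half of one quantization step.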
From the scale/zero-point notation standpoint the latter formula can be written as follows:
r = (output_high - output_low) / (levels-1) * (q + output_low / (output_high - output_low) * (levels-1))
Thus we can define:
- Scale as (output_high - output_low) / (levels-1)
- Zero-point as -output_low / (output_high - output_low) * (levels-1)
Note: During the quantization process the values input_low, input_high, output_low, output_high are selected so as to map a floating-point zero exactly to an integer value (zero-point) and vice versa.
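As a quick check of the scale/zero-point form, the following sketch compares r = scale * (q - zero_point) with the dequantization formula for every grid value. The range [-1.0, 1.55] with 256 levels is an assumed example, picked so that the zero-point comes out as a whole number; the helper names are hypothetical.

```python
def scale_and_zero_point(output_low, output_high, levels):
    """Scale and zero-point as defined from output_low, output_high, and levels."""
    scale = (output_high - output_low) / (levels - 1)
    zero_point = -output_low / (output_high - output_low) * (levels - 1)
    return scale, zero_point


def dequantize_affine(q, scale, zero_point):
    """Dequantization in scale/zero-point notation."""
    return scale * (q - zero_point)


def dequantize_reference(q, output_low, output_high, levels):
    """Dequantization exactly as in the FakeQuantize formula."""
    return q / (levels - 1) * (output_high - output_low) + output_low


# Assumed example range; chosen so that floating-point zero lands on the grid.
output_low, output_high, levels = -1.0, 1.55, 256
scale, zero_point = scale_and_zero_point(output_low, output_high, levels)
zp_int = round(zero_point)

# The two notations agree (up to floating-point error) for every grid value.
max_diff = max(
    abs(dequantize_affine(q, scale, zero_point)
        - dequantize_reference(q, output_low, output_high, levels))
    for q in range(levels)
)

# The integer zero-point dequantizes back to (approximately) floating-point zero.
zero_back = dequantize_reference(zp_int, output_low, output_high, levels)
print(scale, zero_point, max_diff, zero_back)
```

If the range were chosen so that the zero-point is not a whole number, floating-point zero would fall between two grid values, which is what the note above is meant to rule out.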
Quantization specifics and restrictions
In general, OpenVINO can represent and execute quantized models from different sources. However, the Post-training Optimization Tool (POT) is considered the default way to get optimized models. Because the POT supports HW-aware quantization, rules specific to a particular HW can be implemented in it. At the same time, it is reasonable to stay compatible with general-purpose HW such as CPU and GPU and to support their quantization schemes. These common rules are listed below:
- Support of mixed-precision models where some layers can be kept in the floating-point precision.
- Per-channel quantization of weights of Convolutional and Fully-Connected layers.
- Per-channel quantization of activations for channel-wise and element-wise operations, e.g. Depthwise Convolution, Eltwise Add/Mul, ScaleShift.
- Symmetric and asymmetric quantization of weights and activations with the support of per-channel scales and zero-points.
- Non-unified quantization parameters for Eltwise and Concat operations.
- Non-quantized network output, i.e. there are no quantization parameters for it.
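To illustrate why per-channel quantization of weights is on this list, here is a small sketch that applies the FakeQuantize formula to a toy weight matrix, once with a single range for the whole tensor and once with a separate range per output channel. It assumes symmetric ranges [-m, m] with 255 levels (so that zero lies exactly on the grid) and channels that are not all zero; the weight values and helper names are hypothetical and not taken from POT or any plugin.

```python
def fake_quantize(x, low, high, levels):
    """FakeQuantize with identical input and output ranges."""
    q = round((x - low) / (high - low) * (levels - 1))
    return q / (levels - 1) * (high - low) + low


def quantize_per_tensor(weights, levels=255):
    """One symmetric range [-m, m] shared by all channels."""
    m = max(abs(w) for channel in weights for w in channel)
    return [[fake_quantize(w, -m, m, levels) for w in channel] for channel in weights]


def quantize_per_channel(weights, levels=255):
    """A separate symmetric range [-m, m] for each output channel."""
    result = []
    for channel in weights:
        m = max(abs(w) for w in channel)  # assumes the channel is not all zeros
        result.append([fake_quantize(w, -m, m, levels) for w in channel])
    return result


def max_abs_error(original, quantized):
    """Largest absolute difference between two lists of values."""
    return max(abs(a - b) for a, b in zip(original, quantized))


# Toy weights: two output channels with very different magnitudes.
weights = [[0.9, -1.0, 0.4], [0.010, -0.004, 0.007]]

per_tensor = quantize_per_tensor(weights)
per_channel = quantize_per_channel(weights)

# Compare the error on the small-magnitude channel only.
err_tensor = max_abs_error(weights[1], per_tensor[1])
err_channel = max_abs_error(weights[1], per_channel[1])
print(err_tensor, err_channel)
```

In this toy case the shared range is dominated by the first channel, so the second channel is represented by only a handful of grid values; giving each channel its own range reduces its error considerably.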
