
Sebastian Golebiewski 0429efb8b8 DOCS: NNCF documentation - port to master (#13183 )

* Updating NNCF documentation

* nncf-doc-update-ms

* Merge branch 'nncf-documentation-for-22.2' of https://github.com/sgolebiewski-intel/openvino into nncf-documentation-for-22.2

* Adding python files

* Changing ID of Range Supervision

* Minor fixes

Fixing formatting and renaming ID

* Proofreading

Minor corrections and removal of Neural Network Compression Framework article

Co-authored-by: msmykx <101244365+msmykx-intel@users.noreply.github.com>
Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com>

2022-09-28 15:34:57 +04:00


# Optimizing Models Post-training

@sphinxdirective

.. toctree::
   :maxdepth: 1
   :hidden:

   Quantizing Model <pot_default_quantization_usage>
   Quantizing Model with Accuracy Control <pot_accuracyaware_usage>
   Quantization Best Practices <pot_docs_BestPractices>
   API Reference <pot_compression_api_README>
   Command-line Interface <pot_compression_cli_README>
   Examples <pot_examples_description>
   pot_docs_FrequentlyAskedQuestions

@endsphinxdirective

Post-training model optimization is the process of applying optimization methods, for example, post-training 8-bit quantization, to a model without retraining or fine-tuning it. Since no retraining is involved, this process requires neither a training dataset nor a training pipeline in the source DL framework. To apply post-training methods in OpenVINO™, you need:

* A floating-point precision model, FP32 or FP16, converted into the OpenVINO™ Intermediate Representation (IR) format that can be run on CPU.
* A representative calibration dataset that reflects the use case scenario, for example, 300 samples.
* If there are accuracy constraints, a validation dataset and accuracy metrics.
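
As a rough illustration of the calibration requirement, the pure-Python sketch below draws a fixed-size, reproducible subset (300 samples by default, matching the example above) from a larger dataset and returns `(sample, annotation)` pairs, where the annotation is `None` for an unannotated set. The class name `CalibrationSubset` and its item format are hypothetical and are not POT's data loader API; check the API Reference for the interface your release expects.

```python
import random
from typing import List, Optional, Sequence, Tuple


class CalibrationSubset:
    """Hypothetical loader exposing a fixed random subset of a dataset.

    Each item is a (sample, annotation) pair; the annotation is None when
    only an unannotated calibration set is available.
    """

    def __init__(self, samples: Sequence, annotations: Optional[Sequence] = None,
                 subset_size: int = 300, seed: int = 0) -> None:
        if annotations is not None and len(annotations) != len(samples):
            raise ValueError("samples and annotations must have the same length")
        rng = random.Random(seed)  # fixed seed -> reproducible calibration set
        k = min(subset_size, len(samples))  # small datasets are used whole
        self._indices: List[int] = sorted(rng.sample(range(len(samples)), k))
        self._samples = samples
        self._annotations = annotations

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, i: int) -> Tuple[object, object]:
        idx = self._indices[i]  # raises IndexError past the end of the subset
        ann = None if self._annotations is None else self._annotations[idx]
        return self._samples[idx], ann
```

A fixed seed keeps the calibration statistics identical between runs, which makes quantization results easier to compare.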

For post-training optimization, OpenVINO™ provides the Post-training Optimization Tool (POT), which supports the uniform integer quantization method. This method moves weights and activations from floating-point precision to integer precision (for example, 8-bit) at inference time. It reduces model size, memory footprint, and latency, and improves computational efficiency by using integer arithmetic. During quantization, the model is transformed: additional operations that contain quantization information are inserted into it. The actual transition to integer arithmetic happens at model inference.
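
To make the inserted quantization operations more concrete, the sketch below simulates uniform quantization in pure Python. `fake_quantize` follows the element-wise formula of OpenVINO's FakeQuantize operation as given in the operation specification (clamp to the input range, snap to `levels` evenly spaced values, rescale to the output range), with Python's `round` (round half to even) standing in for the specification's rounding. `quantize_tensor` is a hypothetical helper that uses a tensor's own min/max as its range; POT estimates ranges from calibration statistics, so its actual values will differ.

```python
from typing import List


def fake_quantize(x: float, input_low: float, input_high: float,
                  output_low: float, output_high: float, levels: int) -> float:
    """Quantize-then-dequantize a single value.

    Values outside [input_low, input_high] are clamped; values inside are
    snapped to one of `levels` evenly spaced points in the output range.
    """
    if x <= min(input_low, input_high):
        return output_low
    if x > max(input_low, input_high):
        return output_high
    step_index = round((x - input_low) / (input_high - input_low) * (levels - 1))
    return step_index / (levels - 1) * (output_high - output_low) + output_low


def quantize_tensor(values: List[float], bits: int = 8) -> List[float]:
    """Simulate uniform quantization of a tensor over its own min/max range."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits  # 256 levels for 8-bit
    return [fake_quantize(v, lo, hi, lo, hi, levels) for v in values]
```

Each output differs from its input by at most half a quantization step, which is `(hi - lo) / (levels - 1) / 2`.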

The figure below shows the optimization workflow with POT:

POT is distributed as part of the OpenVINO™ [Development Tools](@ref openvino_docs_install_guides_install_dev_tools) package and is also available on GitHub.

## Quantizing models with POT

Depending on your needs and requirements, POT provides two main quantization methods:

* [Default Quantization](@ref pot_default_quantization_usage) -- a recommended method that provides fast and accurate results in most cases. It requires only an unannotated dataset for quantization. For more details, see the [Default Quantization algorithm](@ref pot_compression_algorithms_quantization_default_README) documentation.
* [Accuracy-aware Quantization](@ref pot_accuracyaware_usage) -- an advanced method that keeps accuracy within a predefined range when Default Quantization cannot guarantee it, at the cost of a smaller performance improvement. This method requires an annotated representative dataset and may take more time to quantize the model. For more details, see the [Accuracy-aware Quantization algorithm](@ref accuracy_aware_README) documentation.
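
The trade-off between accuracy and performance can be illustrated with a toy sketch, assuming a greedy strategy: start with every layer quantized, and while the accuracy drop exceeds a limit, return the layer whose reversion recovers the most accuracy to floating point. The function `accuracy_aware_revert` and the `evaluate` callback are hypothetical; the real algorithm ranks layers on the annotated dataset and is described in the Accuracy-aware Quantization algorithm documentation.

```python
from typing import Callable, List, Set


def accuracy_aware_revert(
    layers: List[str],
    evaluate: Callable[[Set[str]], float],
    fp_accuracy: float,
    maximal_drop: float = 0.01,
) -> Set[str]:
    """Greedily revert quantized layers to floating point until the
    accuracy drop is at most `maximal_drop`.

    `evaluate(quantized)` returns model accuracy when exactly the layers in
    `quantized` run in integer precision. Returns the layers left quantized.
    """
    quantized = set(layers)
    while quantized and fp_accuracy - evaluate(quantized) > maximal_drop:
        # Pick the layer whose reversion to floating point recovers the most
        # accuracy; sorting makes tie-breaking deterministic.
        worst = max(sorted(quantized), key=lambda name: evaluate(quantized - {name}))
        quantized.remove(worst)
    return quantized
```

Every reverted layer runs in floating point, which is why a stricter accuracy limit leaves fewer integer layers and a smaller speedup.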

Different hardware platforms support different integer precisions and quantization parameters. For example, CPU, GPU, and VPU use 8-bit precision, while GNA uses 16-bit. POT abstracts this complexity by introducing the concept of a "target device", which selects the quantization settings specific to that device.
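
The practical difference between 8-bit and 16-bit targets is the number of representable integer values. The sketch below encodes only the per-device precisions stated above in a hypothetical lookup table; POT's real per-device settings (for example, symmetric or asymmetric ranges) are defined internally by the tool and are not reproduced here.

```python
from typing import Dict, Tuple

# Hypothetical lookup holding only the precisions stated in the text.
DEVICE_BITS: Dict[str, int] = {"CPU": 8, "GPU": 8, "VPU": 8, "GNA": 16}


def int_range(bits: int, signed: bool = True) -> Tuple[int, int]:
    """Return the (min, max) integers representable with `bits` bits."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def levels_for_device(device: str) -> int:
    """Number of distinct quantization levels available on `device`."""
    return 2 ** DEVICE_BITS[device]
```

A 16-bit target offers 256 times as many levels as an 8-bit one, so its quantization error is correspondingly finer.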

> **NOTE**: There is a special `target_device: "ANY"` that produces portable quantized models compatible with CPU, GPU, and VPU devices. GNA-quantized models are compatible only with CPU.
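
The two methods and the `"ANY"` target device can be written as algorithm settings in the list-of-dictionaries form shown in the POT usage guides. The snippet below is plain Python data and does not call POT. The values (300 calibration samples, a 0.01 drop limit) are examples, and the parameter names should be checked against the API Reference for your OpenVINO release.

```python
# Default Quantization: only an unannotated calibration set is needed.
default_quantization = [
    {
        "name": "DefaultQuantization",
        "params": {
            "target_device": "ANY",   # portable across CPU, GPU, and VPU
            "stat_subset_size": 300,  # number of calibration samples
        },
    }
]

# Accuracy-aware Quantization: needs an annotated dataset and a metric.
accuracy_aware_quantization = [
    {
        "name": "AccuracyAwareQuantization",
        "params": {
            "target_device": "ANY",
            "stat_subset_size": 300,
            "maximal_drop": 0.01,  # maximum acceptable accuracy drop (1%)
        },
    }
]
```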

For benchmarking results collected for models optimized with POT, refer to [INT8 vs FP32 Comparison on Select Networks and Platforms](@ref openvino_docs_performance_int8_vs_fp32).
