[DOC]: Added INT4 weight compression description (#20812)
* Added INT4 information into weight compression doc
* Added GPTQ info. Fixed comments
* Fixed list
* Fixed issues. Updated Gen.AI doc
* Applied comments
* Added additional info about GPTQ support
* Fixed typos
* Update docs/articles_en/openvino_workflow/gen_ai.md (Co-authored-by: Nico Galoppo <nico.galoppo@intel.com>)
* Update docs/optimization_guide/nncf/code/weight_compression_openvino.py (Co-authored-by: Nico Galoppo <nico.galoppo@intel.com>)
* Applied changes
* Update docs/articles_en/openvino_workflow/gen_ai.md (Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>)
* Update docs/articles_en/openvino_workflow/model_optimization_guide/weight_compression.md (Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>)
* Added table with results
* One more comment

Co-authored-by: Nico Galoppo <nico.galoppo@intel.com>
Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
@@ -115,6 +115,28 @@ Optimum-Intel API also provides out-of-the-box model optimization through weight

Weight compression is applied by default to models larger than one billion parameters and is also available for the CLI interface as the ``--int8`` option.

.. note::

   8-bit weight compression is enabled by default for models larger than 1 billion parameters.

`NNCF <https://github.com/openvinotoolkit/nncf>`__ also provides 4-bit weight compression, which is supported by OpenVINO. It can be applied to Optimum objects as follows:

.. code-block:: python

   from nncf import compress_weights, CompressWeightsMode
   from optimum.intel import OVModelForCausalLM

   model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)
   model.model = compress_weights(model.model, mode=CompressWeightsMode.INT4_SYM, group_size=128, ratio=0.8)
The optimized model can be saved as usual with a call to ``save_pretrained()``. For more details on compression options, refer to the :doc:`weight compression guide <weight_compression>`.

.. note::

   OpenVINO also supports 4-bit models from the Hugging Face `Transformers <https://github.com/huggingface/transformers>`__ library optimized
   with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there is no need for an additional model optimization step because model conversion automatically preserves the INT4 optimization results, allowing model inference to benefit from them.

Below are some examples of using Optimum-Intel for model conversion and inference:

* `Stable Diffusion v2.1 using Optimum-Intel OpenVINO <https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/236-stable-diffusion-v2/236-stable-diffusion-v2-optimum-demo.ipynb>`__

@@ -10,12 +10,14 @@ Weight compression aims to reduce the memory footprint of a model. It can also l

- enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;
- improving the inference performance of the models by reducing the latency of memory access when computing operations with weights, for example, in Linear layers.

Currently, `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf>`__ provides weight quantization to 8-bit and 4-bit integer data types as a compression method primarily designed to optimize LLMs. The main difference between weight compression and full model quantization is that activations remain floating-point in the case of weight compression, resulting in better accuracy. Weight compression for LLMs provides a solid inference performance improvement on par with that of full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.

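The point that activations remain floating-point can be illustrated with a minimal pure-Python sketch (a hedged toy, not NNCF's implementation; the helper name and the per-channel scale value are invented for illustration):

```python
def linear_int8_weights(activations, weight_codes, scale, bias=0.0):
    """Dot product with INT8 weight codes: weights are dequantized to float
    on the fly, while activations are never quantized."""
    return sum(a * (c * scale) for a, c in zip(activations, weight_codes)) + bias

acts = [0.3, -1.2, 0.7]          # float activations, untouched by compression
codes = [64, -127, 32]           # stored INT8 weight codes
scale = 0.0078740157480315       # per-channel dequantization scale (~1/127)
approx = linear_int8_weights(acts, codes, scale)

# Compare with the original float weights the codes approximate:
weights = [0.5, -1.0, 0.25]
exact = sum(a * w for a, w in zip(acts, weights))
print(abs(approx - exact) < 0.01)   # only weight rounding contributes error
```

Because the only error source is the rounding of the weights, accuracy degrades far less than when activations are quantized as well.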
Compress Model Weights
######################

- **8-bit weight quantization** - this method is aimed at accurate optimization of the model, which usually leads to significant performance improvements for Transformer-based models. Models with 8-bit compressed weights are performant on the vast majority of supported CPU and GPU platforms.

The code snippet below shows how to apply 8-bit quantization to the weights of a model represented in OpenVINO IR using NNCF:

.. tab-set::

@@ -28,6 +30,103 @@ The code snippet below shows how to compress the weights of the model represente

Now, the model is ready for compilation and inference. It can also be saved into a compressed format, resulting in a smaller binary file.

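For intuition, the transformation applied to each weight channel by 8-bit quantization can be sketched in pure Python (an illustration only, not NNCF's actual algorithm; `quantize_channel` and the per-channel symmetric scheme are assumptions made for the example):

```python
def quantize_channel(weights, num_bits=8):
    """Symmetric quantization of one output channel: one shared scale,
    integer codes in [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to floats; this happens at inference time."""
    return [c * scale for c in codes]

channel = [0.4, -1.0, 0.25, 0.75]
codes, scale = quantize_channel(channel)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(channel, restored))
print(codes)                           # integer codes stored in the model
print(max_err <= scale / 2 + 1e-12)    # rounding error bounded by half a step
```

Storing one byte per weight plus one scale per channel instead of four bytes per weight is where the roughly 4x size reduction comes from.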

- **4-bit weight quantization** - this method stands for INT4-INT8 mixed-precision weight quantization, where INT4 is considered the primary precision and INT8 the backup one. It usually results in a smaller model size and lower inference latency, although the accuracy degradation could be higher, depending on the model. The method has several parameters that can provide different performance-accuracy trade-offs after optimization:

  * ``mode`` - there are two modes to choose from: ``INT4_SYM`` - INT4 symmetric weight quantization, which results in faster inference and a smaller model size, and ``INT4_ASYM`` - INT4 asymmetric weight quantization with a variable zero-point, for more accurate results.

  * ``group_size`` - controls the size of the group of weights that share the same quantization parameters. A smaller group size results in a more accurate optimized model, but also in a larger footprint and slower inference. The following group sizes are recommended: ``128``, ``64``, ``32`` (``128`` is the default value).

  * ``ratio`` - controls the ratio between INT4 and INT8 compressed layers in the model. For example, a ratio of 0.8 means that 80% of the layers will be compressed to INT4, while the rest will be compressed to INT8 precision.
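The effect of ``group_size`` can be sketched with a small pure-Python toy (hedged assumptions: this is not NNCF's implementation, and the INT4 code range of [-7, 7] and the helper name are chosen for illustration): each group of weights shares one scale, so smaller groups track local weight magnitudes better and reduce the average rounding error.

```python
def quantize_groupwise(weights, group_size, num_bits=4):
    """Quantize-dequantize with one shared scale per group of weights."""
    qmax = 2 ** (num_bits - 1) - 1          # 7 for INT4
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        restored += [round(w / scale) * scale for w in group]
    return restored

weights = [0.01 * i for i in range(-64, 64)]   # toy weight vector
for g in (128, 32, 8):
    restored = quantize_groupwise(weights, g)
    mean_err = sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)
    print(g, round(mean_err, 4))   # smaller groups -> smaller mean error
```

The flip side is that every group stores its own scale, so smaller groups also mean more metadata, a larger footprint, and slower inference; ``ratio`` then decides how many layers get this INT4 treatment versus plain INT8.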

The example below shows 4-bit weight quantization applied on top of OpenVINO IR:

.. tab-set::

   .. tab-item:: OpenVINO
      :sync: openvino

      .. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py
         :language: python
         :fragment: [compression_4bit]

.. note::

   OpenVINO also supports 4-bit models from the Hugging Face `Transformers <https://github.com/huggingface/transformers>`__ library optimized
   with `GPTQ <https://github.com/PanQiWei/AutoGPTQ>`__. In this case, there is no need for an additional model optimization step because model conversion automatically preserves the INT4 optimization results, allowing model inference to benefit from them.

The table below shows examples of Text Generation models with different optimization settings:

.. list-table::
   :widths: 40 55 25 25
   :header-rows: 1

   * - Model
     - Optimization
     - Perplexity
     - Model Size (GB)
   * - databricks/dolly-v2-3b
     - FP32
     - 5.01
     - 10.3
   * - databricks/dolly-v2-3b
     - INT8
     - 5.07
     - 2.6
   * - databricks/dolly-v2-3b
     - INT4_ASYM, group_size=32, ratio=0.5
     - 5.28
     - 2.2
   * - facebook/opt-6.7b
     - FP32
     - 4.25
     - 24.8
   * - facebook/opt-6.7b
     - INT8
     - 4.27
     - 6.2
   * - facebook/opt-6.7b
     - INT4_ASYM, group_size=64, ratio=0.8
     - 4.32
     - 4.1
   * - meta-llama/Llama-2-7b-chat-hf
     - FP32
     - 3.28
     - 25.1
   * - meta-llama/Llama-2-7b-chat-hf
     - INT8
     - 3.29
     - 6.3
   * - meta-llama/Llama-2-7b-chat-hf
     - INT4_ASYM, group_size=128, ratio=0.8
     - 3.41
     - 4.0
   * - togethercomputer/RedPajama-INCITE-7B-Instruct
     - FP32
     - 4.15
     - 25.6
   * - togethercomputer/RedPajama-INCITE-7B-Instruct
     - INT8
     - 4.17
     - 6.4
   * - togethercomputer/RedPajama-INCITE-7B-Instruct
     - INT4_ASYM, group_size=128, ratio=1.0
     - 4.17
     - 3.6
   * - meta-llama/Llama-2-13b-chat-hf
     - FP32
     - 2.92
     - 48.5
   * - meta-llama/Llama-2-13b-chat-hf
     - INT8
     - 2.91
     - 12.1
   * - meta-llama/Llama-2-13b-chat-hf
     - INT4_SYM, group_size=64, ratio=0.8
     - 2.98
     - 8.0

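As a rough cross-check of the table, weight-only storage can be estimated from the parameter count and the average bits per weight (a back-of-the-envelope sketch; it ignores per-group scales, zero-points, and non-weight tensors, and the function name is invented for the example):

```python
def estimate_size_gb(num_params, ratio=0.0):
    """Approximate weight-only storage in GB: INT8 baseline, with `ratio`
    of the weights stored as INT4 instead (metadata overhead ignored)."""
    bits_per_weight = ratio * 4 + (1.0 - ratio) * 8
    return num_params * bits_per_weight / 8 / 1e9

# Infer the parameter count of dolly-v2-3b from the table's FP32 size
# (10.3 GB at 4 bytes per parameter).
params = 10.3 * 1e9 / 4

print(round(estimate_size_gb(params), 1))        # INT8: 2.6, matching the table
print(round(estimate_size_gb(params, 0.5), 1))   # ratio=0.5: 1.9; the table's 2.2
                                                 # also carries per-group scales
```

The small gap for the INT4 rows is expected: group-wise quantization stores one scale (and, for ASYM, a zero-point) per group, which the simple formula leaves out.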
Additional Resources
####################

@@ -4,3 +4,10 @@ from nncf import compress_weights

...
model = compress_weights(model) # model is openvino.Model object
#! [compression_8bit]

#! [compression_4bit]
from nncf import compress_weights, CompressWeightsMode

...
model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM, group_size=128, ratio=0.8) # model is openvino.Model object
#! [compression_4bit]