DOCS Update optimization docs with ov.save_model() instead of ov.serialize() (#20711)

* introduced ov.save_model(...) to the ptq code examples
* replied to comments
* fixed rendering

---------

Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com>
2023-11-16 16:34:33 +04:00
parent d1e1555f2f
commit 2a33f35c43
6 changed files with 23 additions and 7 deletions
--- a/docs/articles_en/openvino_workflow/model_optimization_guide/ptq_introduction/quantization_w_accuracy_control.md
+++ b/docs/articles_en/openvino_workflow/model_optimization_guide/ptq_introduction/quantization_w_accuracy_control.md
@@ -73,6 +73,17 @@ After that the model can be compiled and run with OpenVINO:
         :language: python
         :fragment: [inference]

+To save the model in the OpenVINO Intermediate Representation (IR), use ``ov.save_model()``. If the original model is in FP32 precision, it is advisable to preserve FP32 precision in the most impactful operations, which were reverted from INT8 to FP32. To do this, pass ``compress_to_fp16=False`` when saving. By default, ``ov.save_model()`` compresses FP32 weights to FP16, which may affect accuracy.
+
+.. tab-set::
+
+   .. tab-item:: OpenVINO
+      :sync: openvino
+
+      .. doxygensnippet:: docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py
+         :language: python
+         :fragment: [save]
+
 ``nncf.quantize_with_accuracy_control()`` API supports all the parameters from :doc:`Basic 8-bit quantization <basic_quantization_flow>` API, to quantize a model with accuracy control and a custom configuration.

 If the accuracy or performance of the quantized model is not satisfactory, you can try :doc:`Training-time Optimization <tmo_introduction>` as the next step.
--- a/docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py
+++ b/docs/optimization_guide/nncf/ptq/code/ptq_aa_openvino.py
@@ -57,7 +57,12 @@ model_int8 = ov.compile_model(quantized_model)

 input_fp32 = ... # FP32 model input
 res = model_int8(input_fp32)
-
-# save the model
-ov.serialize(quantized_model, "quantized_model.xml")
 #! [inference]
+
+#! [save]
+# save the model with compress_to_fp16=False to avoid an accuracy drop from compression
+# of unquantized weights to FP16. This is necessary because
+# nncf.quantize_with_accuracy_control(...) keeps the most impactful operations within
+# the model in the original precision to achieve the specified model accuracy
+ov.save_model(quantized_model, "quantized_model.xml", compress_to_fp16=False)
+#! [save]
--- a/docs/optimization_guide/nncf/ptq/code/ptq_onnx.py
+++ b/docs/optimization_guide/nncf/ptq/code/ptq_onnx.py
@@ -36,5 +36,5 @@ input_fp32 = ... # FP32 model input
 res = model_int8(input_fp32)

 # save the model
-ov.serialize(ov_quantized_model, "quantized_model.xml")
+ov.save_model(ov_quantized_model, "quantized_model.xml")
 #! [inference]
--- a/docs/optimization_guide/nncf/ptq/code/ptq_openvino.py
+++ b/docs/optimization_guide/nncf/ptq/code/ptq_openvino.py
@@ -29,5 +29,5 @@ input_fp32 = ... # FP32 model input
 res = model_int8(input_fp32)

 # save the model
-ov.serialize(quantized_model, "quantized_model.xml")
+ov.save_model(quantized_model, "quantized_model.xml")
 #! [inference]
--- a/docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py
+++ b/docs/optimization_guide/nncf/ptq/code/ptq_tensorflow.py
@@ -35,5 +35,5 @@ input_fp32 = ... # FP32 model input
 res = model_int8(input_fp32)

 # save the model
-ov.serialize(ov_quantized_model, "quantized_model.xml")
+ov.save_model(ov_quantized_model, "quantized_model.xml")
 #! [inference]
--- a/docs/optimization_guide/nncf/ptq/code/ptq_torch.py
+++ b/docs/optimization_guide/nncf/ptq/code/ptq_torch.py
@@ -40,5 +40,5 @@ model_int8 = ov.compile_model(ov_quantized_model)
 res = model_int8(input_fp32)

 # save the model
-ov.serialize(ov_quantized_model, "quantized_model.xml")
+ov.save_model(ov_quantized_model, "quantized_model.xml")
 #! [inference]
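
Note on the ``compress_to_fp16=False`` recommendation above: the following is a minimal sketch, not part of this commit, of why rounding FP32 values to FP16 can cost accuracy. It uses only the Python standard library (``struct`` with the half-precision ``e`` format) and does not call OpenVINO or NNCF. The weight values and helper names (``to_fp16``, ``to_fp32``, ``max_abs_error``) are hypothetical and chosen only for illustration.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a value to IEEE 754 half precision (FP16) and back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_fp32(value: float) -> float:
    """Round a value to IEEE 754 single precision (FP32) and back to a Python float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def max_abs_error(values, cast):
    """Largest absolute difference between each value and its rounded form."""
    return max(abs(v - cast(v)) for v in values)


# Hypothetical weight values, chosen only for illustration.
weights = [0.1, -0.3337, 0.00271828, 1.2345678]

fp32_error = max_abs_error(weights, to_fp32)
fp16_error = max_abs_error(weights, to_fp16)

print(f"max FP32 rounding error: {fp32_error:.3e}")
print(f"max FP16 rounding error: {fp16_error:.3e}")
```

The FP16 rounding error comes out several orders of magnitude larger than the FP32 one. Whether that difference matters for a given model depends on the model and the metric, which is why the docs phrase ``compress_to_fp16=False`` as a recommendation for the accuracy-control flow rather than a general rule.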