[DOCS] model caching update to GPU (#16909)

Update GPU.md
Update Model_caching_overview.md

Co-authored-by: Eddy Kim <eddy.kim@intel.com>
This commit is contained in:
Karol Blaszczak
2023-04-13 11:09:16 +02:00
committed by GitHub
parent 5d80bca16e
commit 7782d85b26
2 changed files with 91 additions and 64 deletions


@@ -1,91 +1,110 @@
# Model Caching Overview {#openvino_docs_OV_UG_Model_caching_overview}
@sphinxdirective

As described in the :doc:`Integrate OpenVINO™ with Your Application <openvino_docs_OV_UG_Integrate_OV_with_your_application>`, a common application flow consists of the following steps:

1. **Create a Core object**: First step to manage available devices and read model objects
2. **Read the Intermediate Representation**: Read an Intermediate Representation file into an object of the `ov::Model <classov_1_1Model.html#doxid-classov-1-1-model>`__
3. **Prepare inputs and outputs**: If needed, manipulate precision, memory layout, size or color format
4. **Set configuration**: Pass device-specific loading configurations to the device
5. **Compile and Load Network to device**: Use the `ov::Core::compile_model() <classov_1_1Core.html#doxid-classov-1-1-core-1a46555f0803e8c29524626be08e7f5c5a>`__ method with a specific device
6. **Set input data**: Specify input tensor
7. **Execute**: Carry out inference and process results

Step 5 can potentially perform several time-consuming device-specific optimizations and network compilations.
To reduce the resulting delays at application startup, you can use Model Caching. It exports the compiled model
automatically and reuses it to significantly reduce the model compilation time.

.. important::

   The :doc:`Compile Tool <openvino_inference_engine_tools_compile_tool_README>` may serve the same purpose
   for C++ applications, but is considered a legacy solution and you should use Model Caching instead.

   Not all devices support the network import/export feature. They will perform normally but will not
   enable the compilation stage speed-up.
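
For reference, a minimal sketch of the flow above, assuming the OpenVINO C++ API 2.0 (the model path and device name below are placeholders):

.. code-block:: cpp

   #include <openvino/openvino.hpp>

   int main() {
       ov::Core core;                                        // 1. Create a Core object
       auto model = core.read_model("model.xml");            // 2. Read the model into an ov::Model
       // 3./4. Optionally adjust inputs/outputs and pass device-specific configuration here
       auto compiled = core.compile_model(model, "CPU");     // 5. Compile and load to a device
       auto request = compiled.create_infer_request();
       // 6. Set input data, e.g. request.set_input_tensor(input_tensor);
       request.infer();                                      // 7. Execute and process results
       auto output = request.get_output_tensor();
       return 0;
   }
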
Set "cache_dir" config option to enable model caching
+++++++++++++++++++++++++++++++++++++++++++++++++++++

To enable model caching, the application must specify a folder to store the cached blobs:

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part0]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part0]

With this code, if the device specified by ``device_name`` supports import/export model capability,
a cached blob is automatically created inside the ``/path/to/cache/dir`` folder.
If the device does not support the import/export capability, cache is not created and no error is thrown.

Note that the first ``compile_model`` operation takes slightly longer, as the cache needs to be created -
the compiled blob is saved into a cache file:

.. image:: _static/images/caching_enabled.svg
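
In practice, enabling the cache amounts to setting the ``ov::cache_dir`` property before compilation. A minimal C++ sketch, with a placeholder cache path, model path, and device name:

.. code-block:: cpp

   ov::Core core;
   // Compiled blobs will be written to and reused from this folder
   core.set_property(ov::cache_dir("/path/to/cache/dir"));
   auto model = core.read_model("model.xml");
   // The first call compiles and exports the blob; later runs import it instead
   auto compiled = core.compile_model(model, "GPU");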

Make it even faster: use compile_model(modelPath)
+++++++++++++++++++++++++++++++++++++++++++++++++++

In some cases, applications do not need to customize inputs and outputs every time. Such applications always
call ``model = core.read_model(...)``, then ``core.compile_model(model, ..)``, which can be further optimized.
For these cases, there is a more convenient API to compile the model in a single call, skipping the read step:

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part1]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part1]

With model caching enabled, the total load time is even shorter, if ``read_model`` is optimized as well.

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part2]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part2]
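
As a rough illustration of the combined pattern, assuming the same placeholder paths as above:

.. code-block:: cpp

   ov::Core core;
   core.set_property(ov::cache_dir("/path/to/cache/dir"));
   // Single call: the model is read and compiled internally; once the cache exists,
   // subsequent runs can skip both reading and recompiling the model
   auto compiled = core.compile_model("model.xml", "GPU");
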
@@ -94,25 +113,30 @@ With model caching enabled, total load time is even smaller, if ``read_model`` i
Advanced Examples
++++++++++++++++++++

Not every device supports the network import/export capability. For those that don't, enabling caching has no effect.
To check in advance if a particular device supports model caching, your application can use the following code:

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part3]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part3]
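
One way to perform such a check, assuming the C++ API and a placeholder device name, is to query the ``ov::device::capabilities`` property and look for export/import support:

.. code-block:: cpp

   ov::Core core;
   // Caching only takes effect if the device reports the EXPORT_IMPORT capability
   auto capabilities = core.get_property("GPU", ov::device::capabilities);
   bool supports_caching = std::find(capabilities.begin(), capabilities.end(),
                                     ov::device::capability::EXPORT_IMPORT) != capabilities.end();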

.. note::

   For GPU, model caching is currently supported fully for static models only. For dynamic models,
   kernel caching is used and multiple .cl_cache files are generated along with the .blob file.
   See the :doc:`GPU plugin documentation <openvino_docs_OV_UG_supported_plugins_GPU>`.

@endsphinxdirective


@@ -292,24 +292,27 @@ For more details, see the :doc:`preprocessing API<openvino_docs_OV_UG_Preprocess
Model Caching
+++++++++++++++++++++++++++++++++++++++

Model Caching helps reduce application startup delays by exporting and reusing
the compiled model automatically. The cache for the GPU plugin may be enabled
via the common OpenVINO ``ov::cache_dir`` property.

.. note::

   Currently, GPU plugin implementation fully supports static models only. For dynamic models,
   kernel caching is used instead and multiple .cl_cache files are generated along with the .blob file.
   This means that all plugin-specific model transformations are executed on each ``ov::Core::compile_model()``
   call, regardless of the ``ov::cache_dir`` option. Still, since kernel compilation is a bottleneck in the model
   loading process, a significant load time reduction can be achieved.

For more details, see the :doc:`Model caching overview <openvino_docs_OV_UG_Model_caching_overview>`.
Extensibility
+++++++++++++++++++++++++++++++++++++++

For information on this subject, see the :doc:`GPU Extensibility <openvino_docs_Extensibility_UG_GPU>`.

GPU Context and Memory Sharing via RemoteTensor API
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For information on this subject, see the :doc:`RemoteTensor API of GPU Plugin <openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API>`.

Supported Properties
#######################################
@@ -373,18 +376,18 @@ GPU Performance Checklist: Summary
Since OpenVINO relies on the OpenCL kernels for the GPU implementation, many general OpenCL tips apply:
- Prefer ``FP16`` inference precision over ``FP32``, as Model Optimizer can generate both variants, and the ``FP32`` is the default. To learn about optimization options, see :doc:`Optimization Guide<openvino_docs_model_optimization_guide>`.
- Try to group individual infer jobs by using :doc:`automatic batching <openvino_docs_OV_UG_Automatic_Batching>`.
- Consider :doc:`caching <openvino_docs_OV_UG_Model_caching_overview>` to minimize model load time.
- If your application performs inference on the CPU alongside the GPU, or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. :doc:`CPU configuration options <openvino_docs_OV_UG_supported_plugins_CPU>` can be used to limit the number of inference threads for the CPU plugin (see the sketch after this list).
- Even in the GPU-only scenario, a GPU driver might occupy a CPU core with spin-loop polling for completion. If CPU load is a concern, consider the dedicated ``queue_throttle`` property mentioned previously. Note that this option may increase inference latency, so consider combining it with multiple GPU streams or :doc:`throughput performance hints <openvino_docs_OV_UG_Performance_Hints>`.
- When operating media inputs, consider :doc:`remote tensors API of the GPU Plugin <openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API>`.
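
As a rough sketch of how a few of these options can be combined, assuming the C++ API (the model path, device names, and property values are illustrative only):

.. code-block:: cpp

   ov::Core core;
   auto model = core.read_model("model.xml");

   // Prefer FP16 execution precision and use the throughput hint,
   // which lets the GPU plugin group individual requests
   auto gpu_compiled = core.compile_model(model, "GPU",
       ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
       ov::hint::inference_precision(ov::element::f16));

   // If the CPU also runs inference, cap its thread count so the OpenCL driver threads are not starved
   auto cpu_compiled = core.compile_model(model, "CPU",
       ov::inference_num_threads(4));
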
Additional Resources
#######################################

* :doc:`Supported Devices <openvino_docs_OV_UG_supported_plugins_Supported_Devices>`.
* :doc:`Optimization guide <openvino_docs_deployment_optimization_guide_dldt_optimization_guide>`.
* `GPU plugin developers documentation <https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/README.md>`__