[DOCS] model caching update to GPU (#16909)

Update GPU.md
Update Model_caching_overview.md

Co-authored-by: Eddy Kim <eddy.kim@intel.com>
This commit is contained in:
Karol Blaszczak
2023-04-13 11:09:16 +02:00
committed by GitHub
parent 5d80bca16e
commit 7782d85b26
2 changed files with 91 additions and 64 deletions


@@ -1,91 +1,110 @@
# Model Caching Overview {#openvino_docs_OV_UG_Model_caching_overview}
@sphinxdirective

As described in the :doc:`Integrate OpenVINO™ with Your Application <openvino_docs_OV_UG_Integrate_OV_with_your_application>`, a common application flow consists of the following steps:

1. **Create a Core object**: First step to manage available devices and read model objects
2. **Read the Intermediate Representation**: Read an Intermediate Representation file into an object of the `ov::Model <classov_1_1Model.html#doxid-classov-1-1-model>`__
3. **Prepare inputs and outputs**: If needed, manipulate precision, memory layout, size or color format
4. **Set configuration**: Pass device-specific loading configurations to the device
5. **Compile and Load Network to device**: Use the `ov::Core::compile_model() <classov_1_1Core.html#doxid-classov-1-1-core-1a46555f0803e8c29524626be08e7f5c5a>`__ method with a specific device
6. **Set input data**: Specify input tensor
7. **Execute**: Carry out inference and process results

Step 5 can potentially perform several time-consuming device-specific optimizations and network compilations.
To reduce the resulting delays at application startup, you can use Model Caching. It exports the compiled model
automatically and reuses it to significantly reduce the model compilation time.

.. important::

   The :doc:`Compile Tool <openvino_inference_engine_tools_compile_tool_README>` may serve the same purpose
   for C++ applications, but is considered a legacy solution and you should use Model Caching instead.

   Not all devices support the network import/export feature. They will perform normally but will not
   enable the compilation stage speed-up.
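
For reference, a minimal sketch of the flow above, assuming the OpenVINO C++ API 2.0 (the model path and device name below are placeholders):

.. code-block:: cpp

   #include <openvino/openvino.hpp>

   int main() {
       ov::Core core;                                        // 1. Create a Core object
       auto model = core.read_model("model.xml");            // 2. Read the model into an ov::Model
       // 3./4. Optionally adjust inputs/outputs and pass device-specific configuration here
       auto compiled = core.compile_model(model, "CPU");     // 5. Compile and load to a device
       auto request = compiled.create_infer_request();
       // 6. Set input data, e.g. request.set_input_tensor(input_tensor);
       request.infer();                                      // 7. Execute and process results
       auto output = request.get_output_tensor();
       return 0;
   }
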
Set "cache_dir" config option to enable model caching
+++++++++++++++++++++++++++++++++++++++++++++++++++++

To enable model caching, the application must specify a folder to store the cached blobs:

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part0]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part0]

With this code, if the device specified by ``device_name`` supports import/export model capability,
a cached blob is automatically created inside the ``/path/to/cache/dir`` folder.
If the device does not support the import/export capability, cache is not created and no error is thrown.

Note that the first ``compile_model`` operation takes slightly longer, as the cache needs to be created -
the compiled blob is saved into a cache file:

.. image:: _static/images/caching_enabled.svg
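
In practice, enabling the cache amounts to setting the ``ov::cache_dir`` property before compilation. A minimal C++ sketch, with a placeholder cache path, model path, and device name:

.. code-block:: cpp

   ov::Core core;
   // Compiled blobs will be written to and reused from this folder
   core.set_property(ov::cache_dir("/path/to/cache/dir"));
   auto model = core.read_model("model.xml");
   // The first call compiles and exports the blob; later runs import it instead
   auto compiled = core.compile_model(model, "GPU");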

Make it even faster: use compile_model(modelPath)
+++++++++++++++++++++++++++++++++++++++++++++++++++

In some cases, applications do not need to customize inputs and outputs every time. Such applications always
call ``model = core.read_model(...)``, then ``core.compile_model(model, ..)``, which can be further optimized.
For these cases, there is a more convenient API to compile the model in a single call, skipping the read step:

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part1]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part1]

With model caching enabled, the total load time is even shorter, if ``read_model`` is optimized as well.

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part2]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part2]
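
As a rough illustration of the combined pattern, assuming the same placeholder paths as above:

.. code-block:: cpp

   ov::Core core;
   core.set_property(ov::cache_dir("/path/to/cache/dir"));
   // Single call: the model is read and compiled internally; once the cache exists,
   // subsequent runs can skip both reading and recompiling the model
   auto compiled = core.compile_model("model.xml", "GPU");
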
@@ -94,25 +113,30 @@ With model caching enabled, total load time is even smaller, if ``read_model`` i
Advanced Examples
++++++++++++++++++++

Not every device supports the network import/export capability. For those that don't, enabling caching has no effect.
To check in advance if a particular device supports model caching, your application can use the following code:

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part3]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part3]
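
One way to perform such a check, assuming the C++ API and a placeholder device name, is to query the ``ov::device::capabilities`` property and look for export/import support:

.. code-block:: cpp

   ov::Core core;
   // Caching only takes effect if the device reports the EXPORT_IMPORT capability
   auto capabilities = core.get_property("GPU", ov::device::capabilities);
   bool supports_caching = std::find(capabilities.begin(), capabilities.end(),
                                     ov::device::capability::EXPORT_IMPORT) != capabilities.end();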

.. note::

   For GPU, model caching is currently supported fully for static models only. For dynamic models,
   kernel caching is used and multiple .cl_cache files are generated along with the .blob file.
   See the :doc:`GPU plugin documentation <openvino_docs_OV_UG_supported_plugins_GPU>`.

@endsphinxdirective


@@ -292,24 +292,27 @@ For more details, see the :doc:`preprocessing API<openvino_docs_OV_UG_Preprocess
Model Caching
+++++++++++++++++++++++++++++++++++++++

Model Caching helps reduce application startup delays by exporting and reusing
the compiled model automatically. The cache for the GPU plugin may be enabled
via the common OpenVINO ``ov::cache_dir`` property.

.. note::

   Currently, GPU plugin implementation fully supports static models only. For dynamic models,
   kernel caching is used instead and multiple .cl_cache files are generated along with the .blob file.
   This means that all plugin-specific model transformations are executed on each ``ov::Core::compile_model()``
   call, regardless of the ``ov::cache_dir`` option. Still, since kernel compilation is a bottleneck in the model
   loading process, a significant load time reduction can be achieved.

For more details, see the :doc:`Model caching overview <openvino_docs_OV_UG_Model_caching_overview>`.
Extensibility
+++++++++++++++++++++++++++++++++++++++

For information on this subject, see the :doc:`GPU Extensibility <openvino_docs_Extensibility_UG_GPU>`.

GPU Context and Memory Sharing via RemoteTensor API
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For information on this subject, see the :doc:`RemoteTensor API of GPU Plugin <openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API>`.

Supported Properties
#######################################
@@ -373,18 +376,18 @@ GPU Performance Checklist: Summary
Since OpenVINO relies on the OpenCL kernels for the GPU implementation, many general OpenCL tips apply:
- Prefer ``FP16`` inference precision over ``FP32``, as Model Optimizer can generate both variants, and the ``FP32`` is the default. To learn about optimization options, see :doc:`Optimization Guide<openvino_docs_model_optimization_guide>`.
- Try to group individual infer jobs by using :doc:`automatic batching <openvino_docs_OV_UG_Automatic_Batching>`.
- Consider :doc:`caching <openvino_docs_OV_UG_Model_caching_overview>` to minimize model load time.
- If your application performs inference on the CPU alongside the GPU, or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. :doc:`CPU configuration options <openvino_docs_OV_UG_supported_plugins_CPU>` can be used to limit the number of inference threads for the CPU plugin (see the sketch after this list).
- Even in the GPU-only scenario, a GPU driver might occupy a CPU core with spin-loop polling for completion. If CPU load is a concern, consider the dedicated ``queue_throttle`` property mentioned previously. Note that this option may increase inference latency, so consider combining it with multiple GPU streams or :doc:`throughput performance hints <openvino_docs_OV_UG_Performance_Hints>`.
- When operating media inputs, consider :doc:`remote tensors API of the GPU Plugin <openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API>`.
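
As a rough sketch of how a few of these options can be combined, assuming the C++ API (the model path, device names, and property values are illustrative only):

.. code-block:: cpp

   ov::Core core;
   auto model = core.read_model("model.xml");

   // Prefer FP16 execution precision and use the throughput hint,
   // which lets the GPU plugin group individual requests
   auto gpu_compiled = core.compile_model(model, "GPU",
       ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
       ov::hint::inference_precision(ov::element::f16));

   // If the CPU also runs inference, cap its thread count so the OpenCL driver threads are not starved
   auto cpu_compiled = core.compile_model(model, "CPU",
       ov::inference_num_threads(4));
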
Additional Resources
#######################################

* :doc:`Supported Devices <openvino_docs_OV_UG_supported_plugins_Supported_Devices>`.
* :doc:`Optimization guide <openvino_docs_deployment_optimization_guide_dldt_optimization_guide>`.
* `GPU plugin developers documentation <https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/README.md>`__