[DOCS] model caching update to GPU (#16909)
Update GPU.md
Update Model_caching_overview.md

Co-authored-by: Eddy Kim <eddy.kim@intel.com>
@@ -1,91 +1,110 @@
# Model Caching Overview {#openvino_docs_OV_UG_Model_caching_overview}

@sphinxdirective

As described in :doc:`Integrate OpenVINO™ with Your Application <openvino_docs_OV_UG_Integrate_OV_with_your_application>`, a common application flow consists of the following steps:

1. | **Create a Core object**:
   | First step to manage available devices and read model objects.

2. | **Read the Intermediate Representation**:
   | Read an Intermediate Representation file into an object of the `ov::Model <classov_1_1Model.html#doxid-classov-1-1-model>`__ class.

3. | **Prepare inputs and outputs**:
   | If needed, manipulate precision, memory layout, size, or color format.

4. | **Set configuration**:
   | Pass device-specific loading configurations to the device.

5. | **Compile and Load Network to device**:
   | Use the `ov::Core::compile_model() <classov_1_1Core.html#doxid-classov-1-1-core-1a46555f0803e8c29524626be08e7f5c5a>`__ method with a specific device.

6. | **Set input data**:
   | Specify the input tensor.

7. | **Execute**:
   | Carry out inference and process the results.
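
For illustration, the steps above map to only a few API calls. Below is a minimal C++ sketch of the flow; the ``model.xml`` path and the ``CPU`` device are placeholder assumptions, not part of the original article:

.. code-block:: cpp

   #include <openvino/openvino.hpp>

   int main() {
       // 1. Create a Core object.
       ov::Core core;
       // 2. Read the Intermediate Representation ("model.xml" is a placeholder).
       auto model = core.read_model("model.xml");
       // 3./4. (optional) Adjust inputs/outputs and pass device-specific configuration here.
       // 5. Compile and load the network to the device.
       ov::CompiledModel compiled = core.compile_model(model, "CPU");
       // 6. Set input data: obtain the input tensor and fill it.
       ov::InferRequest request = compiled.create_infer_request();
       ov::Tensor input = request.get_input_tensor();
       // ... fill the input tensor ...
       // 7. Execute and process the results.
       request.infer();
       ov::Tensor output = request.get_output_tensor();
       return 0;
   }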

Step 5 can potentially perform several time-consuming device-specific optimizations and network compilations.
To reduce the resulting delays at application startup, you can use Model Caching. It exports the compiled model
automatically and reuses it to significantly reduce the model compilation time.

.. important::

   The :doc:`Compile Tool <openvino_inference_engine_tools_compile_tool_README>` may serve the same purpose
   for C++ applications, but it is considered a legacy solution; use Model Caching instead.

   Not all devices support the network import/export feature. Devices that do not will still work normally,
   but will not benefit from the compilation stage speed-up.

Set "cache_dir" config option to enable model caching
+++++++++++++++++++++++++++++++++++++++++++++++++++++

To enable model caching, the application must specify a folder to store the cached blobs:

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part0]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part0]
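
The referenced snippets revolve around the ``ov::cache_dir`` property. As a minimal sketch (the cache folder, model path, and ``GPU`` device are placeholders):

.. code-block:: cpp

   #include <openvino/openvino.hpp>

   int main() {
       ov::Core core;
       // Enable model caching: compiled blobs are stored in this folder.
       core.set_property(ov::cache_dir("/path/to/cache/dir"));
       auto model = core.read_model("model.xml");
       // The first call exports the compiled blob; later runs import it
       // instead of recompiling, provided the device supports import/export.
       auto compiled = core.compile_model(model, "GPU");
       return 0;
   }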

With this code, if the device specified by ``device_name`` supports the import/export model capability,
a cached blob is automatically created inside the ``/path/to/cache/dir`` folder.
If the device does not support the import/export capability, the cache is not created and no error is thrown.

Depending on your device, the total time for compiling a model at application startup can be significantly reduced.
Note that the first ``compile_model`` operation takes slightly longer, as the cache needs to be created and
the compiled blob saved into a cache file:
.. image:: _static/images/caching_enabled.svg
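
One way to observe this effect is to time the first and second compilation against the same cache folder. A rough sketch, reusing the placeholder paths from above:

.. code-block:: cpp

   #include <chrono>
   #include <iostream>
   #include <openvino/openvino.hpp>

   using Clock = std::chrono::steady_clock;

   // Returns the wall-clock time of one compile_model() call, in seconds.
   static double compile_seconds(ov::Core& core) {
       auto start = Clock::now();
       auto compiled = core.compile_model("model.xml", "GPU");
       return std::chrono::duration<double>(Clock::now() - start).count();
   }

   int main() {
       ov::Core core;
       core.set_property(ov::cache_dir("/path/to/cache/dir"));
       std::cout << "first compile (cache created): " << compile_seconds(core) << " s\n";
       std::cout << "second compile (cache reused): " << compile_seconds(core) << " s\n";
       return 0;
   }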

Make it even faster: use compile_model(modelPath)
+++++++++++++++++++++++++++++++++++++++++++++++++++

In some cases, applications do not need to customize inputs and outputs every time. Such applications always
call ``model = core.read_model(...)``, then ``core.compile_model(model, ..)``, which can be further optimized.
For these cases, there is a more convenient API that compiles the model in a single call, skipping the read step:

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part1]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part1]
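
A minimal sketch of this single-call pattern (placeholder paths again); with caching enabled, a later run can load the cached blob without reading the model file at all:

.. code-block:: cpp

   #include <openvino/openvino.hpp>

   int main() {
       ov::Core core;
       core.set_property(ov::cache_dir("/path/to/cache/dir"));
       // Read and compile in one call. Once a cached blob exists,
       // subsequent runs import it directly and skip the read step.
       auto compiled = core.compile_model("model.xml", "GPU");
       return 0;
   }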

With model caching enabled, the total load time is even shorter if ``read_model`` is optimized as well.

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part2]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part2]

@@ -94,25 +113,30 @@ With model caching enabled, total load time is even smaller, if ``read_model`` i
Advanced Examples
++++++++++++++++++++

Not every device supports the network import/export capability. For those that do not, enabling caching has no effect.
To check in advance whether a particular device supports model caching, your application can use the following code:

.. tab-set::

   .. tab-item:: C++
      :sync: cpp

      .. doxygensnippet:: docs/snippets/ov_caching.cpp
         :language: cpp
         :fragment: [ov:caching:part3]

   .. tab-item:: Python
      :sync: py

      .. doxygensnippet:: docs/snippets/ov_caching.py
         :language: py
         :fragment: [ov:caching:part3]
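
The check referenced above boils down to querying the device's ``ov::device::capabilities`` property for ``EXPORT_IMPORT`` support. A minimal sketch, assuming the ``GPU`` device:

.. code-block:: cpp

   #include <algorithm>
   #include <iostream>
   #include <openvino/openvino.hpp>

   int main() {
       ov::Core core;
       // Query device capabilities and look for EXPORT_IMPORT support.
       auto caps = core.get_property("GPU", ov::device::capabilities);
       bool caching_supported =
           std::find(caps.begin(), caps.end(), ov::device::capability::EXPORT_IMPORT) != caps.end();
       std::cout << "Model caching supported: " << std::boolalpha << caching_supported << "\n";
       return 0;
   }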

.. note::

   For GPU, model caching is currently fully supported for static models only. For dynamic models,
   kernel caching is used instead: multiple ``.cl_cache`` files are generated along with the ``.blob`` file.
   For details, see the :doc:`GPU plugin documentation <openvino_docs_OV_UG_supported_plugins_GPU>`.

@endsphinxdirective
@@ -292,24 +292,27 @@ For more details, see the :doc:`preprocessing API<openvino_docs_OV_UG_Preprocess
Model Caching
+++++++++++++++++++++++++++++++++++++++

Model Caching helps reduce application startup delays by exporting and reusing
the compiled model automatically. The cache for the GPU plugin may be enabled
via the common OpenVINO ``ov::cache_dir`` property.

Currently, the GPU plugin implementation fully supports static models only. For dynamic models,
kernel caching is used instead and multiple ``.cl_cache`` files are generated along with the ``.blob`` file.

.. note::

   This means that all plugin-specific model transformations are executed on each ``ov::Core::compile_model()``
   call, regardless of the ``ov::cache_dir`` option. Still, since kernel compilation is a bottleneck in the model
   loading process, a significant load time reduction can be achieved.

For more details, see the :doc:`Model caching overview <openvino_docs_OV_UG_Model_caching_overview>`.
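
As a brief illustration, ``ov::cache_dir`` can also be passed as a compile-time property; the folder path and model file below are placeholders:

.. code-block:: cpp

   #include <openvino/openvino.hpp>

   int main() {
       ov::Core core;
       // Enable the GPU cache for this compilation only.
       auto compiled = core.compile_model("model.xml", "GPU",
                                          ov::cache_dir("/path/to/cache/dir"));
       return 0;
   }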

Extensibility
+++++++++++++++++++++++++++++++++++++++

For information on this subject, see the :doc:`GPU Extensibility <openvino_docs_Extensibility_UG_GPU>`.

GPU Context and Memory Sharing via RemoteTensor API
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For information on this subject, see the :doc:`RemoteTensor API of GPU Plugin <openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API>`.

Supported Properties
#######################################

@@ -373,18 +376,18 @@ GPU Performance Checklist: Summary
Since OpenVINO relies on the OpenCL kernels for the GPU implementation, many general OpenCL tips apply:

- Prefer ``FP16`` inference precision over ``FP32``, as Model Optimizer can generate both variants, and ``FP32`` is the default. To learn about optimization options, see the :doc:`Optimization Guide <openvino_docs_model_optimization_guide>` and the sketch after this list.
- Try to group individual infer jobs by using :doc:`automatic batching <openvino_docs_OV_UG_Automatic_Batching>`.
- Consider :doc:`caching <openvino_docs_OV_UG_Model_caching_overview>` to minimize model load time.
- If your application performs inference on the CPU alongside the GPU, or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. :doc:`CPU configuration options <openvino_docs_OV_UG_supported_plugins_CPU>` can be used to limit the number of inference threads for the CPU plugin.
- Even in the GPU-only scenario, a GPU driver might occupy a CPU core with spin-loop polling for completion. If CPU load is a concern, consider the dedicated ``queue_throttle`` property mentioned previously. Note that this option may increase inference latency, so consider combining it with multiple GPU streams or :doc:`throughput performance hints <openvino_docs_OV_UG_Performance_Hints>`.
- When operating media inputs, consider the :doc:`remote tensors API of the GPU Plugin <openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API>`.
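
The following sketch illustrates the precision and performance-hint tips above; the ``model.xml`` file is a placeholder and the properties are the standard ``ov::hint`` ones:

.. code-block:: cpp

   #include <openvino/openvino.hpp>

   int main() {
       ov::Core core;
       // Request FP16 inference precision and a throughput-oriented
       // configuration (the plugin picks streams and may apply batching).
       auto compiled = core.compile_model("model.xml", "GPU",
           ov::hint::inference_precision(ov::element::f16),
           ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
       return 0;
   }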

Additional Resources
#######################################

* :doc:`Supported Devices <openvino_docs_OV_UG_supported_plugins_Supported_Devices>`.
* :doc:`Optimization guide <openvino_docs_deployment_optimization_guide_dldt_optimization_guide>`.
* `GPU plugin developers documentation <https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/README.md>`__