Ilya Lavrenov e3098ece7e DOCS: port changes from releases/2022/1 (#11040 )

2022-03-18 17:48:45 +03:00

GPU device

@sphinxdirective

.. toctree::
    :maxdepth: 1
    :hidden:

    openvino_docs_OV_UG_supported_plugins_GPU_RemoteTensor_API

@endsphinxdirective

The GPU plugin is an OpenCL-based plugin for inference of deep neural networks on Intel GPUs, both integrated and discrete. For an in-depth description of the GPU plugin, see

The GPU plugin is a part of the Intel® Distribution of OpenVINO™ toolkit.

See the [GPU configuration page](@ref openvino_docs_install_guides_configurations_for_intel_gpu) for more details on how to configure a machine to use the GPU plugin.

Device Naming Convention

- Devices are enumerated as "GPU.X", where X={0, 1, 2,...}. Only Intel® GPU devices are considered.
- If the system has an integrated GPU, it always has id=0 ("GPU.0").
- Other GPUs have an undefined order that depends on the GPU driver.
- "GPU" is an alias for "GPU.0".
- If the system doesn't have an integrated GPU, devices are enumerated starting from 0.
- For GPUs with a multi-tile architecture (multiple sub-devices in OpenCL terms), a specific tile may be addressed as "GPU.X.Y", where X,Y={0, 1, 2,...}, X is the id of the GPU device, and Y is the id of the tile within device X.
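
The naming rules above can be illustrated with a small standalone parser. This is only a sketch: `GpuDeviceName`, `parse_index` and `parse_gpu_device_name` are hypothetical helpers written for this example and are not part of the OpenVINO API. It assumes C++17 and treats the bare name "GPU" as device 0, as the alias rule states.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper type (not an OpenVINO class): indices encoded in a GPU device name.
struct GpuDeviceName {
    int device_id = 0;           // X in "GPU.X"; the bare alias "GPU" maps to 0
    std::optional<int> tile_id;  // Y in "GPU.X.Y"; empty when no tile is addressed
};

// Parses a non-negative decimal integer that occupies all of `text`.
inline std::optional<int> parse_index(const std::string& text) {
    if (text.empty() || text.size() > 9) return std::nullopt;
    int value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return std::nullopt;
        value = value * 10 + (c - '0');
    }
    return value;
}

// Accepts "GPU", "GPU.X" or "GPU.X.Y"; returns std::nullopt for anything else.
inline std::optional<GpuDeviceName> parse_gpu_device_name(const std::string& name) {
    const std::string prefix = "GPU";
    if (name.compare(0, prefix.size(), prefix) != 0) return std::nullopt;

    GpuDeviceName result;
    if (name.size() == prefix.size()) return result;  // "GPU" is an alias for "GPU.0"
    if (name[prefix.size()] != '.') return std::nullopt;

    const std::string rest = name.substr(prefix.size() + 1);
    const std::size_t dot = rest.find('.');
    const auto device = parse_index(rest.substr(0, dot));
    if (!device) return std::nullopt;
    result.device_id = *device;

    if (dot != std::string::npos) {
        const auto tile = parse_index(rest.substr(dot + 1));
        if (!tile) return std::nullopt;
        result.tile_id = tile;
    }
    return result;
}
```

With this sketch, `parse_gpu_device_name("GPU.1.0")` yields device 1 and tile 0, while a name such as "CPU" is rejected.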

For demonstration purposes, see the Hello Query Device C++ Sample that can print out the list of available devices with associated indices. Below is an example output (truncated to the device names only):

./hello_query_device
Available devices:
    Device: CPU
...
    Device: GPU.0
...
    Device: GPU.1
...
    Device: HDDL

The device name can then be passed to the ov::Core::compile_model() method:

@sphinxdirective

.. tab:: Running on default device

    .. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
        :language: cpp
        :fragment: [compile_model_default_gpu]

.. tab:: Running on specific GPU

    .. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
        :language: cpp
        :fragment: [compile_model_gpu_with_id]

.. tab:: Running on specific tile

    .. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
        :language: cpp
        :fragment: [compile_model_gpu_with_id_and_tile]

@endsphinxdirective

Supported inference data types

The GPU plugin supports the following data types as inference precision of internal primitives:

- Floating-point data types:
  - f32
  - f16
- Quantized data types:
  - u8
  - i8
  - u1

The selected precision of each primitive depends on the operation precision in the IR, quantization primitives, and available hardware capabilities. The u1/u8/i8 data types are used for quantized operations only, i.e. they are not selected automatically for non-quantized operations. See the [low-precision optimization guide](@ref pot_docs_LowPrecisionOptimizationGuide) for more details on how to get a quantized model.

The floating-point precision of a GPU primitive is selected based on the operation precision in the IR, except for the compressed f16 IR form, which is executed in f16 precision.

Note: Hardware acceleration for i8/u8 precision may be unavailable on some platforms. In that case, the model is executed in the floating-point precision taken from the IR. Hardware support for u8/i8 acceleration can be queried via the ov::device::capabilities property.

Hello Query Device C++ Sample can be used to print out supported data types for all detected devices.

Supported features

Multi-device execution

If a machine has multiple GPUs (for example, an integrated GPU and a discrete Intel GPU), any supported model can be executed on all GPUs simultaneously. This can be achieved by specifying "MULTI:GPU.1,GPU.0" as the target device.

@snippet snippets/gpu/compile_model.cpp compile_model_multi

See the Multi-device execution page for more details.

Automatic batching

The GPU plugin can report the ov::max_batch_size and ov::optimal_batch_size metrics for the current hardware platform and model. Automatic batching is therefore enabled implicitly when ov::optimal_batch_size is greater than 1 and ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT) is set. Alternatively, it can be enabled explicitly via the device notion, e.g. "BATCH:GPU".
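
The enabling rule described above can be written down as a small predicate. This is a standalone sketch, not plugin code: `PerformanceMode` here is a hypothetical stand-in for ov::hint::PerformanceMode so that the example builds without OpenVINO, and `batching_applies` is an illustrative name. It only restates the two conditions from the text; the plugin may take further factors into account.

```cpp
#include <string>

// Hypothetical stand-in for ov::hint::PerformanceMode (assumption for this sketch).
enum class PerformanceMode { LATENCY, THROUGHPUT };

// Returns true when batching would be applied according to the rule in the text:
// either the device string explicitly starts with "BATCH:", or the throughput
// hint is set and the reported optimal batch size is greater than 1.
inline bool batching_applies(const std::string& device,
                             PerformanceMode mode,
                             unsigned optimal_batch_size) {
    const bool explicit_batch = device.rfind("BATCH:", 0) == 0;
    const bool implicit_batch =
        mode == PerformanceMode::THROUGHPUT && optimal_batch_size > 1;
    return explicit_batch || implicit_batch;
}
```

For instance, "GPU" with the throughput hint and an optimal batch size of 1 does not trigger batching in this sketch, whereas "BATCH:GPU" always does.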

@sphinxdirective

.. tab:: Batching via BATCH plugin

    .. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
        :language: cpp
        :fragment: [compile_model_batch_plugin]

.. tab:: Batching via throughput hint

    .. doxygensnippet:: docs/snippets/gpu/compile_model.cpp
        :language: cpp
        :fragment: [compile_model_auto_batch]

@endsphinxdirective

See the Automatic batching page for more details.

Multi-stream execution

If either ov::num_streams(n_streams) with n_streams > 1 or the ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT) property is set for the GPU plugin, multiple streams are created for the model. In the GPU plugin, each stream has its own host thread and an associated OpenCL queue, which means that incoming infer requests can be processed simultaneously.

Note: Simultaneous scheduling of kernels to different queues doesn't mean that the kernels are actually executed in parallel on the GPU device. The actual behavior depends on the hardware architecture, and in some cases the execution may be serialized inside the GPU driver.

When multiple inferences of the same model need to be executed in parallel, the multi-stream feature is preferable to multiple instances of the model or application. The implementation of streams in the GPU plugin shares weights memory across streams, so memory consumption may be lower than with the other approaches.

See the [optimization guide](@ref openvino_docs_deployment_optimization_guide_dldt_optimization_guide) for more details.

Dynamic shapes

The GPU plugin supports dynamic shapes for the batch dimension only (specified as 'N' in layout terms), and only with a fixed upper bound. Any other dynamic dimensions are unsupported. Internally, the GPU plugin creates log2(N) low-level execution graphs (where N is the upper bound of the batch dimension) for batch sizes equal to powers of 2 to emulate dynamic behavior, so that an incoming infer request with a specific batch size is executed via a minimal combination of internal networks. For example, batch size 33 may be executed via 2 internal networks with batch sizes 32 and 1.

Note: This approach requires much more memory, and the overall model compilation time is significantly longer than in the static batch scenario.
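
The power-of-two decomposition mentioned above can be sketched in a few lines. This is an illustration only: `split_batch` is a hypothetical function written for this example, and it assumes the minimal combination is simply the set of powers of two in the binary representation of the batch size. The actual selection logic inside the plugin is not shown in this document.

```cpp
#include <cstddef>
#include <vector>

// Splits `batch` into distinct powers of two, largest first.
// The result corresponds to the set bits of `batch` in binary form.
inline std::vector<std::size_t> split_batch(std::size_t batch) {
    std::vector<std::size_t> parts;

    // Find the largest power of two that does not exceed `batch`.
    std::size_t power = 1;
    while (power <= batch / 2) power *= 2;

    // Greedily take each power of two that still fits into the remainder.
    for (; power > 0; power /= 2) {
        if (batch >= power) {
            parts.push_back(power);
            batch -= power;
        }
    }
    return parts;
}
```

Under this assumption, a request with batch size 33 maps to two internal batch sizes, 32 and 1, which matches the example in the text.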

The code snippet below demonstrates how to use dynamic batch in simple scenarios:

@snippet snippets/gpu/dynamic_batch.cpp dynamic_batch

See the dynamic shapes guide for more details.

Preprocessing acceleration

The GPU plugin has the following additional preprocessing options:

- The ov::intel_gpu::memory_type::surface and ov::intel_gpu::memory_type::buffer values for the ov::preprocess::InputTensorInfo::set_memory_type() preprocessing method. These values give the plugin a hint about the type of input tensors that will be set at runtime, so that it can generate the proper kernels.

@snippet snippets/gpu/preprocessing.cpp init_preproc

With such preprocessing, the GPU plugin expects ov::intel_gpu::ocl::ClImage2DTensor (or a derived type) to be passed for each NV12 plane via the ov::InferRequest::set_tensor() or ov::InferRequest::set_tensors() methods.

Refer to the RemoteTensor API for usage examples.

See the preprocessing API guide for more details.

Models caching

The cache for the GPU plugin may be enabled via the common OpenVINO ov::cache_dir property. The GPU plugin implementation supports only caching of compiled kernels, so all plugin-specific model transformations are executed on each ov::Core::compile_model() call regardless of the cache_dir option. Still, since kernel compilation is a bottleneck in the model loading process, enabling the ov::cache_dir property can significantly reduce load time.

See the Model caching overview page for more details.

Extensibility

See [GPU Extensibility](@ref openvino_docs_Extensibility_UG_GPU) page.

GPU context and memory sharing via RemoteTensor API

See the RemoteTensor API of the GPU plugin.

Supported properties

The plugin supports the properties listed below.

Read-write properties

To take effect, all parameters must either be set before calling ov::Core::compile_model() or be passed as additional arguments to ov::Core::compile_model().

ov::cache_dir
ov::enable_profiling
ov::hint::model_priority
ov::hint::performance_mode
ov::hint::num_requests
ov::num_streams
ov::compilation_num_threads
ov::device::id
ov::intel_gpu::hint::host_task_priority
ov::intel_gpu::hint::queue_priority
ov::intel_gpu::hint::queue_throttle
ov::intel_gpu::enable_loop_unrolling

Read-only properties

ov::supported_properties
ov::available_devices
ov::range_for_async_infer_requests
ov::range_for_streams
ov::optimal_batch_size
ov::max_batch_size
ov::device::full_name
ov::device::type
ov::device::gops
ov::device::capabilities
ov::intel_gpu::device_total_mem_size
ov::intel_gpu::uarch_version
ov::intel_gpu::execution_units_count
ov::intel_gpu::memory_statistics

Limitations

In some cases, the GPU plugin may implicitly execute several primitives on the CPU using internal implementations, which may increase CPU utilization. Below is the list of such operations:

Proposal
NonMaxSuppression
DetectionOutput

The behavior depends on specific parameters of the operations and hardware configuration.

GPU Performance Checklist: Summary

Since OpenVINO relies on OpenCL™ kernels for the GPU implementation, many general OpenCL tips apply:

- Prefer FP16 inference precision over FP32. The Model Optimizer can generate both variants, and FP32 is the default. Also consider int8 inference.
- Try to group individual infer jobs by using automatic batching.
- Consider caching to minimize model load time.
- If your application simultaneously runs inference on the CPU or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. You can use CPU configuration options to limit the number of inference threads for the CPU plugin.
- Even in the GPU-only scenario, a GPU driver might occupy a CPU core with spin-looped polling for completion. If CPU utilization is a concern, consider the dedicated option referenced in this document. Note that this option might increase inference latency, so consider combining it with multiple GPU streams or throughput performance hints.
- When working with media inputs, consider the RemoteTensor API of the GPU plugin.
