auto-batching- bare min of the info (#10190)

* auto-batching- bare min of the info

* renaming BATCH.MD to the automatic_batching.md, also aligned the link to the new naming convention

* more info and brushed

* added openvino_docs_OV_UG_Automatic_Batching to the main TOC

* Apply suggestions from code review

Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>

* close on the comments, added the code examples

* Apply suggestions from code review

Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>

* Update example

* Update format

* Update docs format

* added couple of more perf considerations

* more code examples

* Apply suggestions from code review

* Apply the rest from code review

* Update header

Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
Maxim Shevtsov 2022-03-02 17:48:01 +03:00 committed by GitHub
parent 42d3893833
commit 180f15e84c
5 changed files with 192 additions and 1 deletions


@@ -0,0 +1,107 @@
# Automatic Batching {#openvino_docs_OV_UG_Automatic_Batching}
## (Automatic) Batching Execution
Automatic Batching is a preview of new functionality in the OpenVINO™ toolkit. It performs on-the-fly automatic batching (i.e., grouping inference requests together) to improve device utilization, with no programming effort from the user.
Gathering the inputs and scattering the outputs of the individual inference requests required for the batch happen transparently, without affecting the application code.
The feature primarily targets existing code written to run many inference requests (each with a batch size of 1). To obtain the corresponding performance improvements, the application must be *running many inference requests simultaneously*.
As explained below, the auto-batching functionality can also be used via a special *virtual* device.
Batching is a straightforward way of leveraging the GPU compute power and saving on communication overheads. Automatic batching is _implicitly_ triggered on the GPU when `ov::hint::PerformanceMode::THROUGHPUT` is specified for the `ov::hint::performance_mode` property in the `compile_model` or `set_property` calls.
@sphinxdirective

.. tab:: C++

    .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
       :language: cpp
       :fragment: [compile_model]

.. tab:: Python

    .. doxygensnippet:: docs/snippets/ov_auto_batching.py
       :language: python
       :fragment: [compile_model]

@endsphinxdirective
> **NOTE**: You can prevent Auto-Batching (for example, for the GPU device) from being triggered by `ov::hint::PerformanceMode::THROUGHPUT`. To do that, pass `ov::hint::allow_auto_batching` set to **false**, in addition to the `ov::hint::performance_mode`:
@sphinxdirective

.. tab:: C++

    .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
       :language: cpp
       :fragment: [compile_model_no_auto_batching]

.. tab:: Python

    .. doxygensnippet:: docs/snippets/ov_auto_batching.py
       :language: python
       :fragment: [compile_model_no_auto_batching]

@endsphinxdirective
Alternatively, to enable Auto-Batching in legacy applications that do not use performance hints, you may use the **explicit** device notion, such as 'BATCH:GPU'. In both cases (the *throughput* hint or the explicit BATCH device), the optimal batch size is selected automatically. The actual value depends on the model and device specifics, for example, the on-device memory for dGPUs.
This _automatic batch size selection_ assumes that the application queries `ov::optimal_number_of_infer_requests` to create and run the returned number of requests simultaneously:
@sphinxdirective

.. tab:: C++

    .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
       :language: cpp
       :fragment: [query_optimal_num_requests]

.. tab:: Python

    .. doxygensnippet:: docs/snippets/ov_auto_batching.py
       :language: python
       :fragment: [query_optimal_num_requests]

@endsphinxdirective
If not enough inputs are collected within the `timeout` value, the transparent execution falls back to the execution of the individual requests. Configuration-wise, this is the `AUTO_BATCH_TIMEOUT` property.
The timeout, which adds to the execution time of the requests, heavily penalizes performance. To avoid this, in cases when your parallel slack is bounded, give OpenVINO an additional hint.
For example, if the application processes only 4 video streams, there is no need to use a batch larger than 4. The most future-proof way to communicate this limitation on the parallelism is to equip the performance hint with the optional `ov::hint::num_requests` configuration key set to 4. For the GPU this limits the batch size, for the CPU the number of inference streams, so each device uses `ov::hint::num_requests` while converting the hint to the actual device configuration options:
@sphinxdirective

.. tab:: C++

    .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
       :language: cpp
       :fragment: [hint_num_requests]

.. tab:: Python

    .. doxygensnippet:: docs/snippets/ov_auto_batching.py
       :language: python
       :fragment: [hint_num_requests]

@endsphinxdirective
For the *explicit* usage, you can limit the batch size using "BATCH:GPU(4)", where 4 is the number of requests running in parallel.
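For illustration, below is a minimal C++ sketch of the explicit approach (not part of the shipped snippets); it assumes the same `ov::Core` and model as in the snippets above and uses the "BATCH:GPU(4)" notation described in this section:

```cpp
#include <openvino/runtime/core.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("sample.xml");
    // Explicitly use the BATCH virtual device on top of the GPU:
    // "BATCH:GPU" triggers the automatic batch size selection,
    // while "BATCH:GPU(4)" fixes the batch size to 4.
    auto compiled_model = core.compile_model(model, "BATCH:GPU(4)");
    // The application should still create and run enough requests in parallel,
    // e.g. the number reported by ov::optimal_number_of_infer_requests.
    auto num_requests = compiled_model.get_property(ov::optimal_number_of_infer_requests);
    return 0;
}
```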
### Other Performance Considerations
To achieve the best performance with the Automatic Batching, the application should:
- Operate a number of inference requests that is a multiple of the batch size. In the above example, for a batch size of 4, the application should operate 4, 8, 12, 16, etc. requests.
- Use the requests grouped by the batch size together, as shown in the sketch below. For example, the first 4 requests are inferred while the second group of requests is being populated.
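For illustration only, here is a minimal C++ sketch of this pattern (assuming the *throughput*-hint configuration from the snippets above; per-request input filling is omitted):

```cpp
#include <openvino/runtime/core.hpp>
#include <vector>

int main() {
    ov::Core core;
    auto model = core.read_model("sample.xml");
    auto compiled_model = core.compile_model(model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    // Create as many requests as the device reports optimal,
    // which is a multiple of the automatically selected batch size.
    auto num_requests = compiled_model.get_property(ov::optimal_number_of_infer_requests);
    std::vector<ov::InferRequest> requests;
    for (uint32_t i = 0; i < num_requests; ++i)
        requests.push_back(compiled_model.create_infer_request());
    // Populate the inputs of each request here (omitted), then
    // start the whole group together so the requests can be batched...
    for (auto& request : requests)
        request.start_async();
    // ...and wait for the whole group to complete.
    for (auto& request : requests)
        request.wait();
    return 0;
}
```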
The following are limitations of the current implementation:
- Although less critical for throughput-oriented scenarios, the load time with auto-batching increases by almost 2x.
- If certain networks are not reshape-able by the "batching" dimension (specified as 'N' in layout terms), or if that dimension is not the zero-th one, auto-batching is not triggered.
- Performance improvements come at the cost of memory footprint growth; still, auto-batching queries the available memory (especially for dGPUs) and limits the selected batch size accordingly.
### Configuring the Automatic Batching
Following the OpenVINO convention for device names, the *batching* device is named *BATCH*. The configuration options are as follows:
| Parameter name | Parameter description | Default | Examples |
| :--- | :--- | :--- |:-----------------------------------------------------------------------------|
| "AUTO_BATCH_DEVICE" | Device name to apply the automatic batching and optional batch size in brackets | N/A | BATCH:GPU which triggers the automatic batch size selection or explicit batch size BATCH:GPU(4) |
| "AUTO_BATCH_TIMEOUT" | timeout value, in ms | 1000 | you can reduce the timeout value (to avoid performance penalty when the data arrives too non-evenly) e.g. pass the "100", or in contrast make it large enough e.g. to accommodate inputs preparation (e.g. when it is serial process) |
### See Also
[Supported Devices](supported_plugins/Supported_Devices.md)


@@ -18,6 +18,7 @@
openvino_docs_IE_DG_supported_plugins_AUTO
openvino_docs_OV_UG_Running_on_multiple_devices
openvino_docs_OV_UG_Hetero_execution
openvino_docs_OV_UG_Automatic_Batching
openvino_docs_IE_DG_network_state_intro
openvino_2_0_transition_guide
openvino_docs_OV_Should_be_in_performance


@@ -29,7 +29,8 @@ OpenVINO runtime also has several execution capabilities which work on top of ot
|------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
|[Multi-Device execution](../multi_device.md) |Multi-Device enables simultaneous inference of the same model on several devices in parallel |
|[Auto-Device selection](../auto_device_selection.md) |Auto-Device selection enables selecting Intel&reg; device for inference automatically |
|[Heterogeneous execution](../hetero_execution.md) |Heterogeneous execution enables automatic inference splitting between several devices (for example if a device doesn't [support certain operation](#supported-layers))|
|[Automatic Batching](../automatic_batching.md) | Auto-Batching plugin enables the batching (on top of the specified device) that is completely transparent to the application |

Devices similar to the ones we have used for benchmarking can be accessed using [Intel® DevCloud for the Edge](https://devcloud.intel.com/edge/), a remote development environment with access to Intel® hardware and the latest versions of the Intel® Distribution of the OpenVINO™ Toolkit. [Learn more](https://devcloud.intel.com/edge/get_started/devcloud/) or [Register here](https://inteliot.force.com/DevcloudForEdge/s/).


@@ -0,0 +1,41 @@
#include <openvino/runtime/core.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("sample.xml");

    //! [compile_model]
    {
        auto compiled_model = core.compile_model(model, "GPU",
            ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    }
    //! [compile_model]

    //! [compile_model_no_auto_batching]
    {
        // disabling the automatic batching,
        // leaving intact the other configuration options that the device selects for the 'throughput' hint
        auto compiled_model = core.compile_model(model, "GPU",
            {ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
             ov::hint::allow_auto_batching(false)});
    }
    //! [compile_model_no_auto_batching]

    //! [query_optimal_num_requests]
    {
        // when the batch size is automatically selected by the implementation,
        // it is important to query/create and run a sufficient number of requests
        auto compiled_model = core.compile_model(model, "GPU",
            ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
        auto num_requests = compiled_model.get_property(ov::optimal_number_of_infer_requests);
    }
    //! [query_optimal_num_requests]

    //! [hint_num_requests]
    {
        // limiting the available parallel slack for the 'throughput' hint via ov::hint::num_requests,
        // so that certain parameters (like the selected batch size) are automatically accommodated accordingly
        auto compiled_model = core.compile_model(model, "GPU",
            {ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
             ov::hint::num_requests(4)});
    }
    //! [hint_num_requests]
    return 0;
}
