auto-batching- bare min of the info (#10190)

* auto-batching- bare min of the info

* renaming BATCH.MD to the automatic_batching.md, also aligned the link to the new naming convention

* more info and brushed

* added openvino_docs_OV_UG_Automatic_Batching to the main TOC

* Apply suggestions from code review

Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>

* close on the comments, added the code examples

* Apply suggestions from code review

Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>

* Update example

* Update format

* Update docs format

* added couple of more perf considerations

* more code examples

* Apply suggestions from code review

* Apply the rest from code review

* Update header

Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>
Maxim Shevtsov 2022-03-02 17:48:01 +03:00 committed by GitHub
parent 42d3893833
commit 180f15e84c
5 changed files with 192 additions and 1 deletions


@@ -0,0 +1,107 @@
# Automatic Batching {#openvino_docs_OV_UG_Automatic_Batching}
## (Automatic) Batching Execution
Automatic Batching is a preview of new functionality in the OpenVINO™ toolkit. It performs on-the-fly automatic batching (i.e., grouping inference requests together) to improve device utilization, with no programming effort from the user.
Gathering the inputs and scattering the outputs of the individual inference requests required for the batch happen transparently, without affecting the application code.
The feature primarily targets existing code written to run many inference requests (each with a batch size of 1). To obtain the corresponding performance improvements, the application must be *running many inference requests simultaneously*.
As explained below, the auto-batching functionality can also be used via a special *virtual* device.
Batching is a straightforward way of leveraging the GPU compute power and saving on communication overheads. Automatic batching is _implicitly_ triggered on the GPU when `ov::hint::PerformanceMode::THROUGHPUT` is specified for the `ov::hint::performance_mode` property in the `compile_model` or `set_property` calls.
@sphinxdirective

.. tab:: C++

    .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
       :language: cpp
       :fragment: [compile_model]

.. tab:: Python

    .. doxygensnippet:: docs/snippets/ov_auto_batching.py
       :language: python
       :fragment: [compile_model]

@endsphinxdirective
> **NOTE**: You can prevent Auto-Batching (for example, for the GPU device) from being triggered by `ov::hint::PerformanceMode::THROUGHPUT`. To do that, pass `ov::hint::allow_auto_batching` set to **false**, in addition to the `ov::hint::performance_mode`:
@sphinxdirective

.. tab:: C++

    .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
       :language: cpp
       :fragment: [compile_model_no_auto_batching]

.. tab:: Python

    .. doxygensnippet:: docs/snippets/ov_auto_batching.py
       :language: python
       :fragment: [compile_model_no_auto_batching]

@endsphinxdirective
Alternatively, to enable Auto-Batching in legacy applications that do not use performance hints, you may use the **explicit** device notion, such as 'BATCH:GPU'. In both cases (the *throughput* hint or the explicit BATCH device), the optimal batch size is selected automatically. The actual value depends on the model and device specifics, for example, the on-device memory for dGPUs.
This _automatic batch size selection_ assumes that the application queries `ov::optimal_number_of_infer_requests` to create and run the returned number of requests simultaneously:
@sphinxdirective

.. tab:: C++

    .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
       :language: cpp
       :fragment: [query_optimal_num_requests]

.. tab:: Python

    .. doxygensnippet:: docs/snippets/ov_auto_batching.py
       :language: python
       :fragment: [query_optimal_num_requests]

@endsphinxdirective
If not enough inputs are collected within the `timeout` value, the transparent execution falls back to the execution of the individual requests. Configuration-wise, this is the `AUTO_BATCH_TIMEOUT` property.
The timeout, which adds to the execution time of the requests, heavily penalizes performance. To avoid this, in cases when your parallel slack is bounded, give OpenVINO an additional hint.
For example, if the application processes only 4 video streams, there is no need to use a batch larger than 4. The most future-proof way to communicate this limitation on the parallelism is to equip the performance hint with the optional `ov::hint::num_requests` configuration key set to 4. For the GPU this limits the batch size, for the CPU the number of inference streams, so each device uses `ov::hint::num_requests` while converting the hint to the actual device configuration options:
@sphinxdirective

.. tab:: C++

    .. doxygensnippet:: docs/snippets/ov_auto_batching.cpp
       :language: cpp
       :fragment: [hint_num_requests]

.. tab:: Python

    .. doxygensnippet:: docs/snippets/ov_auto_batching.py
       :language: python
       :fragment: [hint_num_requests]

@endsphinxdirective
For the *explicit* usage, you can limit the batch size using "BATCH:GPU(4)", where 4 is the number of requests running in parallel.
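For illustration, below is a minimal C++ sketch of the explicit approach (not part of the shipped snippets); it assumes the same `ov::Core` and model as in the snippets above and uses the "BATCH:GPU(4)" notation described in this section:

```cpp
#include <openvino/runtime/core.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("sample.xml");
    // Explicitly use the BATCH virtual device on top of the GPU:
    // "BATCH:GPU" triggers the automatic batch size selection,
    // while "BATCH:GPU(4)" fixes the batch size to 4.
    auto compiled_model = core.compile_model(model, "BATCH:GPU(4)");
    // The application should still create and run enough requests in parallel,
    // e.g. the number reported by ov::optimal_number_of_infer_requests.
    auto num_requests = compiled_model.get_property(ov::optimal_number_of_infer_requests);
    return 0;
}
```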
### Other Performance Considerations
To achieve the best performance with the Automatic Batching, the application should:
- Operate a number of inference requests that is a multiple of the batch size. In the above example, for a batch size of 4, the application should operate 4, 8, 12, 16, etc. requests.
- Use the requests grouped by the batch size together, as shown in the sketch below. For example, the first 4 requests are inferred while the second group of requests is being populated.
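For illustration only, here is a minimal C++ sketch of this pattern (assuming the *throughput*-hint configuration from the snippets above; per-request input filling is omitted):

```cpp
#include <openvino/runtime/core.hpp>
#include <vector>

int main() {
    ov::Core core;
    auto model = core.read_model("sample.xml");
    auto compiled_model = core.compile_model(model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    // Create as many requests as the device reports optimal,
    // which is a multiple of the automatically selected batch size.
    auto num_requests = compiled_model.get_property(ov::optimal_number_of_infer_requests);
    std::vector<ov::InferRequest> requests;
    for (uint32_t i = 0; i < num_requests; ++i)
        requests.push_back(compiled_model.create_infer_request());
    // Populate the inputs of each request here (omitted), then
    // start the whole group together so the requests can be batched...
    for (auto& request : requests)
        request.start_async();
    // ...and wait for the whole group to complete.
    for (auto& request : requests)
        request.wait();
    return 0;
}
```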
The following are limitations of the current implementation:
- Although less critical for throughput-oriented scenarios, the load time with auto-batching increases by almost 2x.
- If certain networks are not reshape-able by the "batching" dimension (specified as 'N' in layout terms), or if that dimension is not the zero-th one, auto-batching is not triggered.
- Performance improvements come at the cost of memory footprint growth; still, auto-batching queries the available memory (especially for dGPUs) and limits the selected batch size accordingly.
### Configuring the Automatic Batching
Following the OpenVINO convention for device names, the *batching* device is named *BATCH*. The configuration options are as follows:
| Parameter name | Parameter description | Default | Examples |
| :--- | :--- | :--- |:-----------------------------------------------------------------------------|
| "AUTO_BATCH_DEVICE" | Device name to apply the automatic batching and optional batch size in brackets | N/A | BATCH:GPU which triggers the automatic batch size selection or explicit batch size BATCH:GPU(4) |
| "AUTO_BATCH_TIMEOUT" | timeout value, in ms | 1000 | you can reduce the timeout value (to avoid performance penalty when the data arrives too non-evenly) e.g. pass the "100", or in contrast make it large enough e.g. to accommodate inputs preparation (e.g. when it is serial process) |
### See Also
[Supported Devices](supported_plugins/Supported_Devices.md)


@@ -18,6 +18,7 @@
openvino_docs_IE_DG_supported_plugins_AUTO
openvino_docs_OV_UG_Running_on_multiple_devices
openvino_docs_OV_UG_Hetero_execution
openvino_docs_OV_UG_Automatic_Batching
openvino_docs_IE_DG_network_state_intro
openvino_2_0_transition_guide
openvino_docs_OV_Should_be_in_performance


@@ -29,7 +29,8 @@ OpenVINO runtime also has several execution capabilities which work on top of ot
|------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
|[Multi-Device execution](../multi_device.md) |Multi-Device enables simultaneous inference of the same model on several devices in parallel |
|[Auto-Device selection](../auto_device_selection.md) |Auto-Device selection enables selecting Intel&reg; device for inference automatically |
|[Heterogeneous execution](../hetero_execution.md) |Heterogeneous execution enables automatic inference splitting between several devices (for example if a device doesn't [support certain operation](#supported-layers))|
|[Automatic Batching](../automatic_batching.md) | Auto-Batching plugin enables the batching (on top of the specified device) that is completely transparent to the application |

Devices similar to the ones we have used for benchmarking can be accessed using [Intel® DevCloud for the Edge](https://devcloud.intel.com/edge/), a remote development environment with access to Intel® hardware and the latest versions of the Intel® Distribution of the OpenVINO™ Toolkit. [Learn more](https://devcloud.intel.com/edge/get_started/devcloud/) or [Register here](https://inteliot.force.com/DevcloudForEdge/s/).


@@ -0,0 +1,41 @@
#include <openvino/runtime/core.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("sample.xml");

    //! [compile_model]
    {
        auto compiled_model = core.compile_model(model, "GPU",
            ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    }
    //! [compile_model]

    //! [compile_model_no_auto_batching]
    {
        // disabling the automatic batching,
        // leaving intact the other configuration options that the device selects for the 'throughput' hint
        auto compiled_model = core.compile_model(model, "GPU",
            {ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
             ov::hint::allow_auto_batching(false)});
    }
    //! [compile_model_no_auto_batching]

    //! [query_optimal_num_requests]
    {
        // when the batch size is automatically selected by the implementation,
        // it is important to query/create and run a sufficient number of requests
        auto compiled_model = core.compile_model(model, "GPU",
            ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
        auto num_requests = compiled_model.get_property(ov::optimal_number_of_infer_requests);
    }
    //! [query_optimal_num_requests]

    //! [hint_num_requests]
    {
        // limiting the available parallel slack for the 'throughput' hint via ov::hint::num_requests,
        // so that certain parameters (like the selected batch size) are automatically accommodated accordingly
        auto compiled_model = core.compile_model(model, "GPU",
            {ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
             ov::hint::num_requests(4)});
    }
    //! [hint_num_requests]
    return 0;
}
