# OpenVINO™ Runtime Python API Advanced Inference {#openvino_docs_OV_UG_Python_API_inference}
@sphinxdirective

.. meta::
   :description: OpenVINO™ Runtime Python API enables you to share memory on inputs, hide
                 latency with asynchronous calls and implement "postponed return".


.. warning::

   All of the methods mentioned here are highly dependent on the specific hardware and
   software setup. Consider conducting your own experiments with various models and
   different input/output sizes. These methods are not universal; they may or may not
   apply to your pipeline. Weigh all tradeoffs and avoid premature optimization.

Direct Inference with ``CompiledModel``
#######################################

The ``CompiledModel`` class provides the ``__call__`` method that runs a single synchronous
inference with the given model. Besides keeping the code compact, it reduces the overhead of
repeated calls, as the object reuses the already created ``InferRequest``.

.. doxygensnippet:: docs/snippets/ov_python_inference.py
   :language: python
   :fragment: [direct_inference]

Shared Memory on Inputs
#######################

While using ``CompiledModel``, ``InferRequest`` and ``AsyncInferQueue``,
OpenVINO™ Runtime Python API provides an additional mode: "Shared Memory".
Specify the ``shared_memory`` flag to enable or disable this feature.
The "Shared Memory" mode may be beneficial when inputs are large and copying
data is an expensive operation. It creates shared ``Tensor``
instances with a "zero-copy" approach, reducing the overhead of setting inputs
to a minimum. Example usage:

.. doxygensnippet:: docs/snippets/ov_python_inference.py
   :language: python
   :fragment: [shared_memory_inference]

.. note::

   "Shared Memory" is enabled by default in ``CompiledModel.__call__``.
   For other methods, such as ``InferRequest.infer`` or ``InferRequest.start_async``,
   the flag must be set to ``True`` manually.

.. warning::

   When data is shared, any modification of the source array may affect the inputs
   of the inference! Use this feature with caution, especially in multi-threaded or
   parallel code, where data can be modified outside of the function's control flow.

Hiding Latency with Asynchronous Calls
######################################

Asynchronous calls make it possible to hide latency and optimize the overall runtime of
a codebase. For example, ``InferRequest.start_async`` releases the GIL and provides a
non-blocking call. Processing other work while waiting for a compute-intensive inference
to finish can be beneficial. Example usage:

.. doxygensnippet:: docs/snippets/ov_python_inference.py
   :language: python
   :fragment: [hiding_latency]

.. note::

   It is up to the user/developer to optimize the flow in a codebase to benefit from
   potential parallelization.

"Postponed Return" with Asynchronous Calls
##########################################

"Postponed Return" is a practice that omits the overhead of ``OVDict``, which is always
returned from synchronous calls. "Postponed Return" can be applied when:

* only a part of the output data is required. For example, only one specific output is
  significant in a given pipeline step, and all outputs are large and thus expensive to copy.
* data is not required immediately. For example, it can be extracted later inside the
  pipeline as part of latency hiding.
* no data return is required at all. For example, models are being chained with the pure
  ``Tensor`` interface.

.. doxygensnippet:: docs/snippets/ov_python_inference.py
   :language: python
   :fragment: [no_return_inference]
@endsphinxdirective