Files
openvino/docs/notebooks/109-latency-tricks-with-output.rst

618 lines
21 KiB
ReStructuredText
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Performance tricks in OpenVINO for latency mode
===============================================
The goal of this notebook is to provide a step-by-step tutorial for
improving performance for inferencing in a latency mode. Low latency is
especially desired in real-time applications when the results are needed
as soon as possible after the data appears. This notebook assumes
computer vision workflow and uses
`YOLOv5n <https://github.com/ultralytics/yolov5>`__ model. We will
simulate a camera application that provides frames one by one.
The performance tips applied in this notebook could be summarized in the
following figure. Some of the steps below can be applied to any device
at any stage, e.g., “shared_memory”; some can be used only to specific
devices, e.g., “inference num threads” to CPU. As the number of
potential configurations is vast, we recommend looking at the steps
below and then apply a trial-and-error approach. You can incorporate
many hints simultaneously, like more inference threads + shared memory.
It should give even better performance, but we recommend testing it
anyway.
**NOTE**: We especially recommend trying
``OpenVINO IR model + CPU + shared memory in latency mode`` or
``OpenVINO IR model + CPU + shared memory + more inference threads``.
The quantization and pre-post-processing API are not included here as
they change the precision (quantization) or processing graph
(prepostprocessor). You can find examples of how to apply them to
optimize performance on OpenVINO IR files in
`111-detection-quantization <../111-detection-quantization>`__ and
`118-optimize-preprocessing <../118-optimize-preprocessing>`__.
|image0|
**NOTE**: Many of the steps presented below will give you better
performance. However, some of them may not change anything if they
are strongly dependent on either the hardware or the model. Please
run this notebook on your computer with your model to learn which of
them makes sense in your case.
Prerequisites
-------------
.. |image0| image:: https://user-images.githubusercontent.com/4547501/229120774-01f4f972-424d-4280-8395-220dd432985a.png
.. code:: ipython3
!pip install -q seaborn ultralytics
.. code:: ipython3
import os
import sys
import time
from pathlib import Path
from typing import Any, List, Tuple
sys.path.append("../utils")
import notebook_utils as utils
Data
----
We will use the same image of the dog sitting on a bicycle for all
experiments below. The image is resized and preprocessed to fulfill the
requirements of this particular object detection model.
.. code:: ipython3
import numpy as np
import cv2
IMAGE_WIDTH = 640
IMAGE_HEIGHT = 480
# load image
image = utils.load_image("../data/image/coco_bike.jpg")
image = cv2.resize(image, dsize=(IMAGE_WIDTH, IMAGE_HEIGHT), interpolation=cv2.INTER_AREA)
# preprocess it for YOLOv5
input_image = image / 255.0
input_image = np.transpose(input_image, axes=(2, 0, 1))
input_image = np.expand_dims(input_image, axis=0)
# show the image
utils.show_array(image)
.. image:: 109-latency-tricks-with-output_files/109-latency-tricks-with-output_4_0.jpg
.. parsed-literal::
<DisplayHandle display_id=6f2409e74123f40cdda4989b2a2a5922>
Model
-----
We decided to go with
`YOLOv5n <https://github.com/ultralytics/yolov5>`__, one of the
state-of-the-art object detection models, easily available through the
PyTorch Hub and small enough to see the difference in performance.
.. code:: ipython3
import torch
from IPython.utils import io
# directory for all models
base_model_dir = Path("model")
model_name = "yolov5n"
model_path = base_model_dir / model_name
# load YOLOv5n from PyTorch Hub
pytorch_model = torch.hub.load("ultralytics/yolov5", "custom", path=model_path, device="cpu", skip_validation=True)
# don't print full model architecture
with io.capture_output():
pytorch_model.eval()
.. parsed-literal::
Using cache found in /opt/home/k8sworker/.cache/torch/hub/ultralytics_yolov5_master
YOLOv5 🚀 2023-4-21 Python-3.8.10 torch-1.13.1+cpu CPU
.. parsed-literal::
requirements: /opt/home/k8sworker/.cache/torch/hub/requirements.txt not found, check failed.
.. parsed-literal::
Downloading https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5n.pt to model/yolov5n.pt...
.. parsed-literal::
0%| | 0.00/3.87M [00:00<?, ?B/s]
.. parsed-literal::
Fusing layers...
YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients
Adding AutoShape...
Hardware
--------
The code below lists the available hardware we will use in the
benchmarking process.
**NOTE**: The hardware you have is probably completely different from
ours. It means you can see completely different results.
.. code:: ipython3
import openvino.runtime as ov
# initialize OpenVINO
core = ov.Core()
# print available devices
for device in core.available_devices:
device_name = core.get_property(device, "FULL_DEVICE_NAME")
print(f"{device}: {device_name}")
.. parsed-literal::
CPU: Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz
Helper functions
----------------
Were defining a benchmark model function to use for all optimized
models below. It runs inference 1000 times, averages the latency time,
and prints two measures: seconds per image and frames per second (FPS).
.. code:: ipython3
INFER_NUMBER = 1000
def benchmark_model(model: Any, input_data: np.ndarray, benchmark_name: str, device_name: str = "CPU") -> float:
"""
Helper function for benchmarking the model. It measures the time and prints results.
"""
# measure the first inference separately - it may be slower as it contains also initialization
start = time.perf_counter()
model(input_data)
end = time.perf_counter()
first_infer_time = end - start
print(f"{benchmark_name} on {device_name}. First inference time: {first_infer_time :.4f} seconds")
# benchmarking
start = time.perf_counter()
for _ in range(INFER_NUMBER):
model(input_data)
end = time.perf_counter()
# elapsed time
infer_time = end - start
# print second per image and FPS
mean_infer_time = infer_time / INFER_NUMBER
mean_fps = INFER_NUMBER / infer_time
print(f"{benchmark_name} on {device_name}: {mean_infer_time :.4f} seconds per image ({mean_fps :.2f} FPS)")
return mean_infer_time
The following functions aim to post-process results and draw boxes on
the image.
.. code:: ipython3
# https://gist.github.com/AruniRC/7b3dadd004da04c80198557db5da4bda
classes = [
"person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light", "fire hydrant",
"stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra",
"giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite",
"baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork",
"knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
"cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard",
"cell phone", "microwave", "oven", "oaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
"hair drier", "toothbrush"
]
# Colors for the classes above (Rainbow Color Map).
colors = cv2.applyColorMap(
src=np.arange(0, 255, 255 / len(classes), dtype=np.float32).astype(np.uint8),
colormap=cv2.COLORMAP_RAINBOW,
).squeeze()
def postprocess(detections: np.ndarray) -> List[Tuple]:
"""
Postprocess the raw results from the model.
"""
# candidates - probability > 0.25
detections = detections[detections[..., 4] > 0.25]
boxes = []
labels = []
scores = []
for obj in detections:
xmin, ymin, ww, hh = obj[:4]
score = obj[4]
label = np.argmax(obj[5:])
# Create a box with pixels coordinates from the box with normalized coordinates [0,1].
boxes.append(
tuple(map(int, (xmin - ww // 2, ymin - hh // 2, ww, hh)))
)
labels.append(int(label))
scores.append(float(score))
# Apply non-maximum suppression to get rid of many overlapping entities.
# See https://paperswithcode.com/method/non-maximum-suppression
# This algorithm returns indices of objects to keep.
indices = cv2.dnn.NMSBoxes(
bboxes=boxes, scores=scores, score_threshold=0.25, nms_threshold=0.5
)
# If there are no boxes.
if len(indices) == 0:
return []
# Filter detected objects.
return [(labels[idx], scores[idx], boxes[idx]) for idx in indices.flatten()]
def draw_boxes(img: np.ndarray, boxes):
"""
Draw detected boxes on the image.
"""
for label, score, box in boxes:
# Choose color for the label.
color = tuple(map(int, colors[label]))
# Draw a box.
x2 = box[0] + box[2]
y2 = box[1] + box[3]
cv2.rectangle(img=img, pt1=box[:2], pt2=(x2, y2), color=color, thickness=2)
# Draw a label name inside the box.
cv2.putText(
img=img,
text=f"{classes[label]} {score:.2f}",
org=(box[0] + 10, box[1] + 20),
fontFace=cv2.FONT_HERSHEY_COMPLEX,
fontScale=img.shape[1] / 1200,
color=color,
thickness=1,
lineType=cv2.LINE_AA,
)
def show_result(results: np.ndarray):
"""
Postprocess the raw results, draw boxes and show the image.
"""
output_img = image.copy()
detections = postprocess(results)
draw_boxes(output_img, detections)
utils.show_array(output_img)
Optimizations
-------------
Below, we present the performance tricks for faster inference in the
latency mode. We release resources after every benchmarking to be sure
the same amount of resource is available for every experiment.
PyTorch model
~~~~~~~~~~~~~
First, were benchmarking the original PyTorch model without any
optimizations applied. We will treat it as our baseline.
.. code:: ipython3
import torch
with torch.no_grad():
result = pytorch_model(torch.as_tensor(input_image)).detach().numpy()[0]
show_result(result)
pytorch_infer_time = benchmark_model(pytorch_model, input_data=torch.as_tensor(input_image).float(), benchmark_name="PyTorch model")
.. image:: 109-latency-tricks-with-output_files/109-latency-tricks-with-output_14_0.jpg
.. parsed-literal::
PyTorch model on CPU. First inference time: 0.0288 seconds
PyTorch model on CPU: 0.0205 seconds per image (48.90 FPS)
ONNX model
~~~~~~~~~~
The first optimization is exporting the PyTorch model to ONNX and
running it in OpenVINO. Its possible, thanks to the ONNX frontend. It
means we dont necessarily have to convert the model to Intermediate
Representation (IR) to leverage the OpenVINO Runtime.
.. code:: ipython3
onnx_path = base_model_dir / Path(f"{model_name}_{IMAGE_WIDTH}_{IMAGE_HEIGHT}").with_suffix(".onnx")
# export PyTorch model to ONNX if it doesn't already exist
if not onnx_path.exists():
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(pytorch_model, dummy_input, onnx_path)
# load and compile in OpenVINO
onnx_model = core.read_model(onnx_path)
onnx_model = core.compile_model(onnx_model, device_name="CPU")
.. parsed-literal::
/opt/home/k8sworker/.cache/torch/hub/ultralytics_yolov5_master/models/common.py:514: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
y = self.model(im, augment=augment, visualize=visualize) if augment or visualize else self.model(im)
/opt/home/k8sworker/.cache/torch/hub/ultralytics_yolov5_master/models/yolo.py:64: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
.. code:: ipython3
result = onnx_model(input_image)[onnx_model.output(0)][0]
show_result(result)
onnx_infer_time = benchmark_model(model=onnx_model, input_data=input_image, benchmark_name="ONNX model")
del onnx_model # release resources
.. image:: 109-latency-tricks-with-output_files/109-latency-tricks-with-output_17_0.jpg
.. parsed-literal::
ONNX model on CPU. First inference time: 0.0165 seconds
ONNX model on CPU: 0.0091 seconds per image (109.62 FPS)
OpenVINO IR model
~~~~~~~~~~~~~~~~~
Lets convert the ONNX model to OpenVINO Intermediate Representation
(IR) FP16 and run it. Reducing the precision is one of the well-known
methods for faster inference provided the hardware that supports lower
precision, such as FP16 or even INT8. If the hardware doesnt support
lower precisions, the model will be inferred in FP32 automatically. We
could also use quantization (INT8), but we should experience a little
accuracy drop. Thats why we skip that step in this notebook.
.. code:: ipython3
from openvino.tools import mo
ov_model = mo.convert_model(onnx_path, compress_to_fp16=True)
# save the model on disk
ov.serialize(ov_model, xml_path=str(onnx_path.with_suffix(".xml")))
ov_cpu_model = core.compile_model(ov_model, device_name="CPU")
result = ov_cpu_model(input_image)[ov_cpu_model.output(0)][0]
show_result(result)
ov_cpu_infer_time = benchmark_model(model=ov_cpu_model, input_data=input_image, benchmark_name="OpenVINO model")
del ov_cpu_model # release resources
.. image:: 109-latency-tricks-with-output_files/109-latency-tricks-with-output_19_0.jpg
.. parsed-literal::
OpenVINO model on CPU. First inference time: 0.0152 seconds
OpenVINO model on CPU: 0.0092 seconds per image (109.13 FPS)
OpenVINO IR model on GPU
~~~~~~~~~~~~~~~~~~~~~~~~
Usually, a GPU device is faster than a CPU, so lets run the above model
on the GPU. Please note you need to have an Intel GPU and `install
drivers <https://github.com/openvinotoolkit/openvino_notebooks/wiki/Ubuntu#1-install-python-git-and-gpu-drivers-optional>`__
to be able to run this step. In addition, offloading to the GPU helps
reduce CPU load and memory consumption, allowing it to be left for
routine processes. If you cannot observe a faster inference on GPU, it
may be because the model is too light to benefit from massive parallel
execution.
.. code:: ipython3
ov_gpu_infer_time = 0.0
if "GPU" in core.available_devices:
ov_gpu_model = core.compile_model(ov_model, device_name="GPU")
result = ov_gpu_model(input_image)[ov_gpu_model.output(0)][0]
show_result(result)
ov_gpu_infer_time = benchmark_model(model=ov_gpu_model, input_data=input_image, benchmark_name="OpenVINO model", device_name="GPU")
del ov_gpu_model # release resources
OpenVINO IR model + more inference threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There is a possibility to add a config for any device (CPU in this
case). We will increase the number of threads to an equal number of our
cores. It should help us a lot. There are `more
options <https://docs.openvino.ai/latest/groupov_runtime_cpp_prop_api.html>`__
to be changed, so its worth playing with them to see what works best in
our case.
.. code:: ipython3
num_cores = os.cpu_count()
ov_cpu_config_model = core.compile_model(ov_model, device_name="CPU", config={"INFERENCE_NUM_THREADS": num_cores})
result = ov_cpu_config_model(input_image)[ov_cpu_config_model.output(0)][0]
show_result(result)
ov_cpu_config_infer_time = benchmark_model(model=ov_cpu_config_model, input_data=input_image, benchmark_name="OpenVINO model + more threads")
del ov_cpu_config_model # release resources
.. image:: 109-latency-tricks-with-output_files/109-latency-tricks-with-output_23_0.jpg
.. parsed-literal::
OpenVINO model + more threads on CPU. First inference time: 0.0143 seconds
OpenVINO model + more threads on CPU: 0.0094 seconds per image (106.13 FPS)
OpenVINO IR model in latency mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenVINO offers a virtual device called
`AUTO <https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_AUTO.html>`__,
which can select the best device for us based on a performance hint.
There are three different hints: ``LATENCY``, ``THROUGHPUT``, and
``CUMULATIVE_THROUGHPUT``. As this notebook is focused on the latency
mode, we will use ``LATENCY``. The above hints can be used with other
devices as well.
.. code:: ipython3
ov_auto_model = core.compile_model(ov_model, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})
result = ov_auto_model(input_image)[ov_auto_model.output(0)][0]
show_result(result)
ov_auto_infer_time = benchmark_model(model=ov_auto_model, input_data=input_image, benchmark_name="OpenVINO model", device_name="AUTO")
.. image:: 109-latency-tricks-with-output_files/109-latency-tricks-with-output_25_0.jpg
.. parsed-literal::
OpenVINO model on AUTO. First inference time: 0.0134 seconds
OpenVINO model on AUTO: 0.0095 seconds per image (105.46 FPS)
OpenVINO IR model in latency mode + shared memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenVINO is a C++ toolkit with Python wrappers (API). The default
behavior in the Python API is copying the input to the additional buffer
and then running processing in C++, which prevents many
multiprocessing-related issues. However, it also increases time cost. We
can create a tensor with enabled shared memory (keeping in mind we
cannot overwrite our input), save time for copying and improve the
performance!
.. code:: ipython3
# it must be assigned to a variable, not to be garbage collected
c_input_image = np.ascontiguousarray(input_image, dtype=np.float32)
input_tensor = ov.Tensor(c_input_image, shared_memory=True)
result = ov_auto_model(input_tensor)[ov_auto_model.output(0)][0]
show_result(result)
ov_auto_shared_infer_time = benchmark_model(model=ov_auto_model, input_data=input_tensor, benchmark_name="OpenVINO model + shared memory", device_name="AUTO")
del ov_auto_model # release resources
.. image:: 109-latency-tricks-with-output_files/109-latency-tricks-with-output_27_0.jpg
.. parsed-literal::
OpenVINO model + shared memory on AUTO. First inference time: 0.0140 seconds
OpenVINO model + shared memory on AUTO: 0.0055 seconds per image (181.99 FPS)
Other tricks
~~~~~~~~~~~~
There are other tricks for performance improvement, especially
quantization and prepostprocessing. To get even more from your model,
please visit
`111-detection-quantization <../111-detection-quantization>`__ and
`118-optimize-preprocessing <../118-optimize-preprocessing>`__.
Performance comparison
----------------------
The following graphical comparison is valid for the selected model and
hardware simultaneously. If you cannot see any improvement between some
steps, just skip them.
.. code:: ipython3
%matplotlib inline
.. code:: ipython3
from matplotlib import pyplot as plt
labels = ["PyTorch model", "ONNX model", "OpenVINO IR model", "OpenVINO IR model on GPU", "OpenVINO IR model + more inference threads",
"OpenVINO IR model in latency mode", "OpenVINO IR model in latency mode + shared memory"]
# make them milliseconds
times = list(map(lambda x: 1000 * x, [pytorch_infer_time, onnx_infer_time, ov_cpu_infer_time, ov_gpu_infer_time, ov_cpu_config_infer_time,
ov_auto_infer_time, ov_auto_shared_infer_time]))
bar_colors = colors[::10] / 255.0
fig, ax = plt.subplots(figsize=(16, 8))
ax.bar(labels, times, color=bar_colors)
ax.set_ylabel("Inference time [ms]")
ax.set_title("Performance difference")
plt.xticks(rotation='vertical')
plt.show()
.. image:: 109-latency-tricks-with-output_files/109-latency-tricks-with-output_30_0.png
Conclusions
-----------
We already showed the steps needed to improve the performance of an
object detection model. Even if you experience much better performance
after running this notebook, please note this may not be valid for every
hardware or every model. For the most accurate results, please use
``benchmark_app`` `command-line
tool <https://docs.openvino.ai/latest/openvino_inference_engine_samples_benchmark_app_README.html>`__.
Note that ``benchmark_app`` cannot measure the impact of some tricks
above, e.g., shared memory.