Proofreading-OV-Runtime (#11658)
* Update docs/OV_Runtime_UG/protecting_model_guide.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/protecting_model_guide.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/protecting_model_guide.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/protecting_model_guide.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/protecting_model_guide.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/protecting_model_guide.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/ARM_CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/ARM_CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/ARM_CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/ARM_CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/ARM_CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/ARM_CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/optimization_guide/dldt_deployment_optimization_common.md Co-authored-by: Sebastian Golebiewski <sebastianx.golebiewski@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk 
<maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/Device_Plugins.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update 
docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GPU_RemoteTensor_API.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/HDDL.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/HDDL.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/HDDL.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/MYRIAD.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/MYRIAD.md Co-authored-by: Maciej Smyk 
<maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/MYRIAD.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/ov_dynamic_shapes.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/config_properties.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/config_properties.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/preprocessing_details.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/preprocessing_details.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/preprocessing_details.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/preprocessing_details.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/performance_hints.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/deployment/deployment-manager-tool.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Apply suggestions from code review Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/preprocessing_details.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/performance_hints.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/preprocessing_details.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/performance_hints.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Update docs/OV_Runtime_UG/deployment/deployment-manager-tool.md Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Apply suggestions from code review Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Apply suggestions from code review Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review * Update ref links * Update Getting_performance_numbers.md * Update deployment_intro.md * Update preprocessing_details.md * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update tools/pot/openvino/tools/pot/algorithms/quantization/default/README.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/automatic_batching.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/deployment/deployment-manager-tool.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update tools/pot/openvino/tools/pot/algorithms/quantization/default/README.md Co-authored-by: Karol Blaszczak 
<karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update automatic_batching.md * Update docs/OV_Runtime_UG/automatic_batching.md * Update docs/OV_Runtime_UG/ShapeInference.md * Update deployment-manager-tool.md * Update deployment-manager-tool.md * Update docs/OV_Runtime_UG/deployment/deployment-manager-tool.md * Update automatic_batching.md * Update automatic_batching.md * Update docs/OV_Runtime_UG/ShapeInference.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/integrate_with_your_application.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/integrate_with_your_application.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/integrate_with_your_application.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/integrate_with_your_application.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/integrate_with_your_application.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update integrate_with_your_application.md * Update docs/OV_Runtime_UG/integrate_with_your_application.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/integrate_with_your_application.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/integrate_with_your_application.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/integrate_with_your_application.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update integrate_with_your_application.md * Update docs/OV_Runtime_UG/integrate_with_your_application.md * Update docs/OV_Runtime_UG/layout_overview.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/layout_overview.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/layout_overview.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update model_representation.md * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update integrate_with_your_application.md * Update docs/OV_Runtime_UG/layout_overview.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update Additional_Optimizations.md Removing redundant information. 
* Update docs/OV_Runtime_UG/layout_overview.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/layout_overview.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/layout_overview.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/layout_overview.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update Additional_Optimizations.md * Apply suggestions from code review Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update Additional_Optimizations.md * Update docs/OV_Runtime_UG/model_representation.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/layout_overview.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/OV_Runtime_UG/layout_overview.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update model_representation.md * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update tools/pot/docs/SaturationIssue.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update tools/pot/openvino/tools/pot/algorithms/quantization/accuracy_aware/README.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update tools/pot/docs/SaturationIssue.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update tools/pot/docs/SaturationIssue.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/GNA.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update docs/OV_Runtime_UG/supported_plugins/CPU.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update tools/pot/docs/SaturationIssue.md * Update tools/pot/docs/SaturationIssue.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update README.md * Update README.md * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review 
Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update tools/pot/docs/Introduction.md * Update tools/pot/docs/AccuracyAwareQuantizationUsage.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Removing one-liners Removing introductory sentences from 'Supported Features' sections. * Update docs/OV_Runtime_UG/openvino_intro.md Co-authored-by: Yuan Xu <yuan1.xu@intel.com> * Update docs/benchmarks/performance_benchmarks_ovms.md Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update tools/pot/docs/Introduction.md * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> * Update tools/pot/docs/DefaultQuantizationUsage.md * Update tools/pot/docs/BestPractices.md * Update tools/pot/docs/BestPractices.md * Update tools/pot/docs/AccuracyAwareQuantizationUsage.md * Update docs/optimization_guide/model_optimization_guide.md * Update docs/optimization_guide/dldt_deployment_optimization_guide.md * Update docs/OV_Runtime_UG/supported_plugins/config_properties.md * Update docs/OV_Runtime_UG/supported_plugins/GNA.md * Update docs/OV_Runtime_UG/supported_plugins/CPU.md * Update docs/OV_Runtime_UG/preprocessing_usecase_save.md * Apply suggestions from code review Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> Co-authored-by: Maciej Smyk <maciejx.smyk@intel.com> Co-authored-by: Yuan Xu <yuan1.xu@intel.com> Co-authored-by: Karol Blaszczak <karol.blaszczak@intel.com> Co-authored-by: msmykx <101244365+msmykx-intel@users.noreply.github.com> Co-authored-by: Piotr Milewski <piotr.milewski@intel.com>
This commit is contained in:
parent 6283ab2fde
commit b1dcb276da
@@ -2,11 +2,11 @@

Input data for inference can be different from the training dataset and requires additional preprocessing before inference.
To accelerate the whole pipeline including preprocessing and inference, Model Optimizer provides special parameters such as `--mean_values`,
-`--scale_values`, `--reverse_input_channels`, and `--layout`. Based on these parameters, Model Optimizer generates IR with additionally
+`--scale_values`, `--reverse_input_channels`, and `--layout`. Based on these parameters, Model Optimizer generates OpenVINO IR with additionally
inserted sub-graphs to perform the defined preprocessing. This preprocessing block can perform mean-scale normalization of input data,
reverting data along channel dimension, and changing the data layout.
+See the following sections for details on the parameters, or the [Overview of Preprocessing API](../../OV_Runtime_UG/preprocessing_overview.md) for the same functionality in OpenVINO Runtime.
-for more information.

## Specifying Layout
@@ -58,10 +58,12 @@ for example, `[0, 1]` or `[-1, 1]`. Sometimes, the mean values (mean images) are

There are two cases of how the input data preprocessing is implemented.
* The input preprocessing operations are a part of a model.
-In this case, the application does not perform a separate preprocessing step: everything is embedded into the model itself. Model Optimizer will generate the IR with required preprocessing operations, and no `mean` and `scale` parameters are required.
+In this case, the application does not perform a separate preprocessing step: everything is embedded into the model itself. Model Optimizer will generate the OpenVINO IR format with required preprocessing operations, and no `mean` and `scale` parameters are required.
* The input preprocessing operations are not a part of a model and the preprocessing is performed within the application which feeds the model with input data.
-In this case, information about mean/scale values should be provided to the Model Optimizer to embed it to the generated IR.
+In this case, information about mean/scale values should be provided to Model Optimizer to embed it to the generated OpenVINO IR format.

Model Optimizer provides command-line parameters to specify the values: `--mean_values`, `--scale_values`, `--scale`.
Using these parameters, Model Optimizer embeds the corresponding preprocessing block for mean-value normalization of the input data
and optimizes this block so that the preprocessing takes negligible time for inference.
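The Preprocessing API referenced above offers the same mean/scale embedding at runtime instead of at conversion time. A minimal sketch (outside the diff itself); the model path, device name, and values are placeholders mirroring the `mo` example below:

```python
# Sketch: embedding mean/scale normalization into the model with the Preprocessing API.
# "model.xml" and "CPU" are placeholders; the values echo the mo example further down.
from openvino.runtime import Core
from openvino.preprocess import PrePostProcessor

core = Core()
model = core.read_model("model.xml")

ppp = PrePostProcessor(model)
# Subtract per-channel means, then divide by a scale factor, inside the model graph.
ppp.input().preprocess().mean([123.0, 117.0, 104.0]).scale(255.0)
model = ppp.build()

compiled_model = core.compile_model(model, "CPU")
```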
@@ -75,7 +77,8 @@ mo --input_model unet.pdmodel --mean_values [123,117,104] --scale 255

## Reversing Input Channels <a name="when_to_reverse_input_channels"></a>
Sometimes, input images for your application can be of the RGB (or BGR) format and the model is trained on images of the BGR (or RGB) format,
which is in the opposite order of color channels. In this case, it is important to preprocess the input images by reverting the color channels before inference.
-To embed this preprocessing step into IR, Model Optimizer provides the `--reverse_input_channels` command-line parameter to shuffle the color channels.
+To embed this preprocessing step into OpenVINO IR, Model Optimizer provides the `--reverse_input_channels` command-line parameter to shuffle the color channels.

The `--reverse_input_channels` parameter can be used to preprocess the model input in the following cases:
* Only one dimension in the input shape has a size equal to 3.

@@ -84,7 +87,7 @@ The `--reverse_input_channels` parameter can be used to preprocess the model inp

Using the `--reverse_input_channels` parameter, Model Optimizer embeds the corresponding preprocessing block for reverting
the input data along channel dimension and optimizes this block so that the preprocessing takes only negligible time for inference.

-For example, the following command launches the Model Optimizer for the TensorFlow AlexNet model and embeds the `reverse_input_channel` preprocessing block into IR:
+For example, the following command launches Model Optimizer for the TensorFlow AlexNet model and embeds the `reverse_input_channel` preprocessing block into OpenVINO IR:

```sh
mo --input_model alexnet.pb --reverse_input_channels
```
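If the channel order is handled inside the application rather than embedded with `--reverse_input_channels`, the reversal is a one-line *numpy* operation. A small sketch, assuming an NCHW batch with three channels (the array and shape are illustrative only):

```python
# Sketch: swap RGB<->BGR in the application instead of inside the converted model.
import numpy as np

batch = np.random.rand(1, 3, 227, 227).astype(np.float32)  # placeholder NCHW input
batch_reversed = batch[:, ::-1, :, :]                       # reverse the channel (C) axis
```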
@@ -1,16 +1,16 @@

-# OpenVINO™ Python API exclusives {#openvino_docs_OV_UG_Python_API_exclusives}
+# OpenVINO™ Python API Exclusives {#openvino_docs_OV_UG_Python_API_exclusives}

-OpenVINO™ Runtime Python API is exposing additional features and helpers to elevate user experience. Main goal of Python API is to provide user-friendly and simple, still powerful, tool for Python users.
+OpenVINO™ Runtime Python API offers additional features and helpers to enhance user experience. The main goal of Python API is to provide user-friendly and simple yet powerful tool for Python users.

-## Easier model compilation
+## Easier Model Compilation

-`CompiledModel` can be easily created with the helper method. It hides `Core` creation and applies `AUTO` device by default.
+`CompiledModel` can be easily created with the helper method. It hides the creation of `Core` and applies `AUTO` inference mode by default.

@snippet docs/snippets/ov_python_exclusives.py auto_compilation
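For reference, a minimal sketch of the one-line compilation helper described above (the `"model.xml"` path is a placeholder; per the text, `Core` creation and the `AUTO` selection happen internally):

```python
# Sketch: compile a model in one call without creating a Core explicitly.
from openvino.runtime import compile_model

compiled_model = compile_model("model.xml")  # placeholder path; AUTO device by default
```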
-## Model/CompiledModel inputs and outputs
+## Model/CompiledModel Inputs and Outputs

-Besides functions aligned to C++ API, some of them have their Pythonic counterparts or extensions. For example, `Model` and `CompiledModel` inputs/outputs can be accessed via properties.
+Besides functions aligned to C++ API, some of them have their Python counterparts or extensions. For example, `Model` and `CompiledModel` inputs/outputs can be accessed via properties.

@snippet docs/snippets/ov_python_exclusives.py properties_example
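A short sketch of the property-based access mentioned above, assuming the `compiled_model` object from the previous example:

```python
# Inputs and outputs are exposed as Python properties rather than getter calls.
print(compiled_model.inputs)             # list of model inputs
print(compiled_model.outputs)            # list of model outputs
print(compiled_model.input(0).any_name)  # name of the first input
```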
@@ -18,21 +18,21 @@ Refer to Python API documentation on which helper functions or properties are av

## Working with Tensor

-Python API allows passing data as tensors. `Tensor` object holds a copy of the data from the given array. `dtype` of numpy arrays is converted to OpenVINO™ types automatically.
+Python API allows passing data as tensors. The `Tensor` object holds a copy of the data from the given array. The `dtype` of *numpy* arrays is converted to OpenVINO™ types automatically.

@snippet docs/snippets/ov_python_exclusives.py tensor_basics

-### Shared memory mode
+### Shared Memory Mode

-`Tensor` objects can share the memory with numpy arrays. By specifing `shared_memory` argument, a `Tensor` object does not perform copy of data and has access to the memory of the numpy array.
+`Tensor` objects can share the memory with *numpy* arrays. By specifying the `shared_memory` argument, the `Tensor` object does not copy data. Instead, it has access to the memory of the *numpy* array.

@snippet docs/snippets/ov_python_exclusives.py tensor_shared_mode
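A brief sketch of both modes, a copying `Tensor` and a shared-memory one (the array shape is arbitrary):

```python
import numpy as np
from openvino.runtime import Tensor

data = np.ones((1, 3, 224, 224), dtype=np.float32)

copied_tensor = Tensor(data)                      # holds its own copy of the data
shared_tensor = Tensor(data, shared_memory=True)  # wraps the numpy buffer, no copy

data[0, 0, 0, 0] = 42.0  # visible through shared_tensor.data, not through copied_tensor.data
```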
-## Running inference
+## Running Inference

Python API supports extra calling methods to synchronous and asynchronous modes for inference.

-All infer methods allow users to pass data as popular numpy arrays, gathered in either Python dicts or lists.
+All infer methods allow users to pass data as popular *numpy* arrays, gathered in either Python dicts or lists.

@snippet docs/snippets/ov_python_exclusives.py passing_numpy_array

@@ -40,54 +40,54 @@ Results from inference can be obtained in various ways:

@snippet docs/snippets/ov_python_exclusives.py getting_results

-### Synchronous mode - extended
+### Synchronous Mode - Extended

-Python API provides different synchronous calls to infer model, which block the application execution. Additionally these calls return results of inference:
+Python API provides different synchronous calls to infer model, which block the application execution. Additionally, these calls return results of inference:

@snippet docs/snippets/ov_python_exclusives.py sync_infer
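A short sketch of two synchronous calls, assuming `compiled_model` from the earlier examples and an input of a matching shape (the shape here is a placeholder):

```python
import numpy as np

data = np.zeros((1, 3, 224, 224), dtype=np.float32)  # placeholder input

# One-shot helper: creates a request internally and returns the results.
results = compiled_model.infer_new_request({0: data})

# Reusable request: create once, call infer() as many times as needed.
request = compiled_model.create_infer_request()
results = request.infer({0: data})
```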
### AsyncInferQueue

-Asynchronous mode pipelines can be supported with wrapper class called `AsyncInferQueue`. This class automatically spawns pool of `InferRequest` objects (also called "jobs") and provides synchronization mechanisms to control flow of the pipeline.
+Asynchronous mode pipelines can be supported with a wrapper class called `AsyncInferQueue`. This class automatically spawns the pool of `InferRequest` objects (also called "jobs") and provides synchronization mechanisms to control the flow of the pipeline.

-Each job is distinguishable by unique `id`, which is in the range from 0 up to number of jobs specified in `AsyncInferQueue` constructor.
+Each job is distinguishable by a unique `id`, which is in the range from 0 up to the number of jobs specified in the `AsyncInferQueue` constructor.

-Function call `start_async` is not required to be synchronized, it waits for any available job if queue is busy/overloaded. Every `AsyncInferQueue` code block should end with `wait_all` function. It provides "global" synchronization of all jobs in the pool and ensure that access to them is safe.
+The `start_async` function call is not required to be synchronized - it waits for any available job if the queue is busy/overloaded. Every `AsyncInferQueue` code block should end with the `wait_all` function which provides the "global" synchronization of all jobs in the pool and ensure that access to them is safe.

@snippet docs/snippets/ov_python_exclusives.py asyncinferqueue

-#### Acquire results from requests
+#### Acquiring Results from Requests

-After the call to `wait_all`, jobs and their data can be safely accessed. Acquring of a specific job with `[id]` returns `InferRequest` object, which results in seamless retrieval of the output data.
+After the call to `wait_all`, jobs and their data can be safely accessed. Acquiring a specific job with `[id]` will return the `InferRequest` object, which will result in seamless retrieval of the output data.

@snippet docs/snippets/ov_python_exclusives.py asyncinferqueue_access

-#### Setting callbacks
+#### Setting Callbacks

-Another feature of `AsyncInferQueue` is ability of setting callbacks. When callback is set, any job that ends inference, calls upon Python function. Callback function must have two arguments. First is the request that calls the callback, it provides `InferRequest` API. Second one being called "userdata", provides possibility of passing runtime values, which can be of any Python type and later used inside callback function.
+Another feature of `AsyncInferQueue` is the ability to set callbacks. When callback is set, any job that ends inference calls upon the Python function. The callback function must have two arguments: one is the request that calls the callback, which provides the `InferRequest` API; the other is called "userdata", which provides the possibility of passing runtime values. Those values can be of any Python type and later used within the callback function.

The callback of `AsyncInferQueue` is uniform for every job. When executed, GIL is acquired to ensure safety of data manipulation inside the function.

@snippet docs/snippets/ov_python_exclusives.py asyncinferqueue_set_callback
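Putting the pieces above together, a hedged sketch of a queue with a callback; `compiled_model` and `data` come from the earlier examples, and the pool size of 4 and the loop of 8 requests are arbitrary choices:

```python
from openvino.runtime import AsyncInferQueue

infer_queue = AsyncInferQueue(compiled_model, 4)  # pool of 4 jobs

def callback(request, userdata):
    # `request` exposes the InferRequest API; `userdata` carries a per-job value.
    print(f"job {userdata} finished, output shape: {request.get_output_tensor().data.shape}")

infer_queue.set_callback(callback)

for i in range(8):
    infer_queue.start_async({0: data}, userdata=i)  # waits for a free job if the queue is busy

infer_queue.wait_all()  # "global" synchronization before touching the jobs again
```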
-### Working with u1, u4 and i4 element types
+### Working with u1, u4 and i4 Element Types

-Since openvino supports low precision element types there are few ways how to handle them in python.
-To create an input tensor with such element types you may need to pack your data in new numpy array which byte size matches original input size:
+Since OpenVINO™ supports low precision element types, there are a few ways to handle them in Python.
+To create an input tensor with such element types, you may need to pack your data in the new *numpy* array, with which the byte size matches the original input size:

@snippet docs/snippets/ov_python_exclusives.py packing_data

-To extract low precision values from tensor into numpy array you can use next helper:
+To extract low precision values from a tensor into the *numpy* array, you can use the following helper:

@snippet docs/snippets/ov_python_exclusives.py unpacking
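As a rough illustration of the packing step (the shape is arbitrary, and the exact `Tensor` constructor overload for packed buffers may differ between releases):

```python
import numpy as np
from openvino.runtime import Tensor, Type

unpacked = np.random.randint(0, 2, size=(1, 3, 32, 32), dtype=np.uint8)  # 0/1 values
packed = np.packbits(unpacked)  # 8 binary values per byte, matching the u1 byte size

# Wrap the packed buffer as a u1 tensor of the original (unpacked) shape.
tensor = Tensor(packed, list(unpacked.shape), Type.u1)
```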
-### Releasing the GIL
+### Release of GIL

-Some functions in Python API release the Global Lock Interpreter (GIL) while running work-intensive code. It can help you to achieve more parallelism in your application using Python threads. For more information about GIL please refer to the Python documentation.
+Some functions in Python API release the Global Lock Interpreter (GIL) while running work-intensive code. This can help you achieve more parallelism in your application, using Python threads. For more information about GIL, refer to the Python documentation.

@snippet docs/snippets/ov_python_exclusives.py releasing_gil

-> **NOTE**: While GIL is released functions can still modify and/or operate on Python objects in C++, thus there is no reference counting. User is responsible for thread safety if sharing of these objects with other thread occurs. It can affects your code only if multiple threads are spawned in Python.:
+> **NOTE**: While GIL is released, functions can still modify and/or operate on Python objects in C++. Hence, there is no reference counting. You should pay attention to thread safety in case sharing of these objects with another thread occurs. It might affect code only if multiple threads are spawned in Python.

-#### List of functions that release the GIL
+#### List of Functions that Release the GIL

- openvino.runtime.AsyncInferQueue.start_async
- openvino.runtime.AsyncInferQueue.is_ready
- openvino.runtime.AsyncInferQueue.wait_all
@@ -1,6 +1,6 @@

-# Changing input shapes {#openvino_docs_OV_UG_ShapeInference}
+# Changing Input Shapes {#openvino_docs_OV_UG_ShapeInference}

## Introduction (C++)

@sphinxdirective
.. raw:: html

@@ -9,28 +9,28 @@

@endsphinxdirective

OpenVINO™ provides capabilities to change model input shape during the runtime.
-It may be useful in case you would like to feed model an input that has different size than model input shape.
-In case you need to do this only once [prepare a model with updated shapes via Model Optimizer](@ref when_to_specify_input_shapes) for all the other cases follow instructions further.
+It may be useful when you want to feed model an input that has different size than model input shape.
+If you need to do this only once, prepare a model with updated shapes via Model Optimizer. See [Specifying --input_shape Command-line Parameter](@ref when_to_specify_input_shapes) for more information. For all the other cases, follow the instructions below.

-### Set a new input shape with reshape method
+### Setting a New Input Shape with Reshape Method

The `ov::Model::reshape` method updates input shapes and propagates them down to the outputs of the model through all intermediate layers.
-Example: Changing the batch size and spatial dimensions of input of a model with an image input:
+For example, changing the batch size and spatial dimensions of input of a model with an image input:



-Please see the code to achieve that:
+Consider the code below to achieve that:

@snippet snippets/ShapeInference.cpp picture_snippet

-### Set a new batch size with set_batch method
+### Setting a New Batch Size with set_batch Method

-Meaning of the model batch may vary depending on the model design.
+The meaning of the model batch may vary depending on the model design.
In order to change the batch dimension of the model, [set the ov::Layout](@ref declare_model_s_layout) and call the `ov::set_batch` method.

@snippet snippets/ShapeInference.cpp set_batch

-`ov::set_batch` method is a high level API of `ov::Model::reshape` functionality, so all information about `ov::Model::reshape` method implications are applicable for `ov::set_batch` too, including the troubleshooting section.
+The `ov::set_batch` method is a high level API of the `ov::Model::reshape` functionality, so all information about the `ov::Model::reshape` method implications are applicable for `ov::set_batch` too, including the troubleshooting section.

Once the input shape of `ov::Model` is set, call the `ov::Core::compile_model` method to get an `ov::CompiledModel` object for inference with updated shapes.
@@ -39,17 +39,17 @@ There are other approaches to change model input shapes during the stage of [IR

### Dynamic Shape Notice

Shape-changing functionality could be used to turn dynamic model input into a static one and vice versa.
-It is recommended to always set static shapes in case if the shape of data is not going to change from one inference to another.
-Setting static shapes avoids possible functional limitations, memory and run time overheads for dynamic shapes that vary depending on hardware plugin and model used.
-To learn more about dynamic shapes in OpenVINO please see a [dedicated article](../OV_Runtime_UG/ov_dynamic_shapes.md).
+It is recommended to always set static shapes when the shape of data is not going to change from one inference to another.
+Setting static shapes can avoid possible functional limitations, memory, and runtime overheads for dynamic shapes which may vary depending on hardware plugin and model used.
+To learn more about dynamic shapes in OpenVINO, see the [Dynamic Shapes](../OV_Runtime_UG/ov_dynamic_shapes.md) page.

-### Usage of Reshape Method <a name="usage_of_reshape_method"></a>
+### Usage of the Reshape Method <a name="usage_of_reshape_method"></a>

The primary method of the feature is `ov::Model::reshape`. It is overloaded to better serve two main use cases:

-1) To change input shape of model with single input you may pass new shape into the method. Please see the example of adjusting spatial dimensions to the input image:
+1) To change the input shape of the model with a single input, you may pass a new shape to the method. See the example of adjusting spatial dimensions to the input image below:

@snippet snippets/ShapeInference.cpp spatial_reshape

To do the opposite - resize input image to the input shapes of the model, use the [pre-processing API](../OV_Runtime_UG/preprocessing_overview.md).

@@ -80,9 +80,9 @@ To do the opposite - resize input image to the input shapes of the model, use th

@endsphinxdirective

-Please find usage scenarios of `reshape` feature in our [samples](Samples_Overview.md) starting with [Hello Reshape Sample](../../samples/cpp/hello_reshape_ssd/README.md).
+The usage scenarios of the `reshape` feature can be found in [OpenVINO Samples](Samples_Overview.md), starting with the [Hello Reshape Sample](../../samples/cpp/hello_reshape_ssd/README.md).

-Practically, some models are not ready to be reshaped. In this case, a new input shape cannot be set with the Model Optimizer or the `ov::Model::reshape` method.
+In practice, some models are not ready to be reshaped. In such cases, a new input shape cannot be set with Model Optimizer or the `ov::Model::reshape` method.

@anchor troubleshooting_reshape_errors
### Troubleshooting Reshape Errors
@@ -92,8 +92,8 @@ Shape collision during shape propagation may be a sign that a new shape does not

Changing the model input shape may result in intermediate operations shape collision.

Examples of such operations:
-* [Reshape](../ops/shape/Reshape_1.md) operation with a hard-coded output shape value
-* [MatMul](../ops/matrix/MatMul_1.md) operation with the `Const` second input cannot be resized by spatial dimensions due to operation semantics
+* The [Reshape](../ops/shape/Reshape_1.md) operation with a hard-coded output shape value.
+* The [MatMul](../ops/matrix/MatMul_1.md) operation with the `Const` second input and this input cannot be resized by spatial dimensions due to operation semantics.

Model structure and logic should not change significantly after model reshaping.
- The Global Pooling operation is commonly used to reduce output feature map of classification models output.

@@ -101,7 +101,7 @@ Having the input of the shape [N, C, H, W], Global Pooling returns the output of

Model architects usually express Global Pooling with the help of the `Pooling` operation with the fixed kernel size [H, W].
During spatial reshape, having the input of the shape [N, C, H1, W1], Pooling with the fixed kernel size [H, W] returns the output of the shape [N, C, H2, W2], where H2 and W2 are commonly not equal to `1`.
It breaks the classification model structure.
-For example, [publicly available Inception family models from TensorFlow*](https://github.com/tensorflow/models/tree/master/research/slim#pre-trained-models) have this issue.
+For example, the publicly available [Inception family models from TensorFlow](https://github.com/tensorflow/models/tree/master/research/slim#pre-trained-models) have this issue.

- Changing the model input shape may significantly affect its accuracy.
For example, Object Detection models from TensorFlow have resizing restrictions by design.
@@ -112,22 +112,22 @@ For details, refer to the [Tensorflow Object Detection API models resizing techn

### How To Fix Non-Reshape-able Model

Some operators which prevent normal shape propagation can be fixed. To do so you can:
-* see if the issue can be fixed via changing the values of some operators input.
-E.g. most common problem of non-reshape-able models is a `Reshape` operator with hardcoded output shape.
+* see if the issue can be fixed via changing the values of some operators' input.
+For example, the most common problem of non-reshape-able models is a `Reshape` operator with hard-coded output shape.
You can cut-off hard-coded 2nd input of `Reshape` and fill it in with relaxed values.
-For the following example on the picture Model Optimizer CLI should be:
+For the following example on the picture, the Model Optimizer CLI should be:
```sh
mo --input_model path/to/model --input data[8,3,224,224],1:reshaped[2]->[0 -1]`
```
-With `1:reshaped[2]` we request to cut 2nd input (counting from zero, so `1:` means 2nd inputs) of operation named `reshaped` and replace it with a `Parameter` with shape `[2]`.
-With `->[0 -1]` we replace this new `Parameter` by a `Constant` operator which has value `[0, -1]`.
-Since `Reshape` operator has `0` and `-1` as a specific values (see the meaning in [the specification](../ops/shape/Reshape_1.md)) it allows to propagate shapes freely without losing the intended meaning of `Reshape`.
+With `1:reshaped[2]`, it's requested to cut the 2nd input (counting from zero, so `1:` means the 2nd input) of the operation named `reshaped` and replace it with a `Parameter` with shape `[2]`.
+With `->[0 -1]`, this new `Parameter` is replaced by a `Constant` operator which has the `[0, -1]` value.
+Since the `Reshape` operator has `0` and `-1` as specific values (see the meaning in [this specification](../ops/shape/Reshape_1.md)), it allows propagating shapes freely without losing the intended meaning of `Reshape`.



-* transform model during Model Optimizer conversion on the back phase. See [Model Optimizer extension article](../MO_DG/prepare_model/customize_model_optimizer/Customize_Model_Optimizer.md)
-* transform OpenVINO Model during the runtime. See [OpenVINO Runtime Transformations article](../Extensibility_UG/ov_transformations.md)
-* modify the original model with the help of original framework
+* transform the model during Model Optimizer conversion on the back phase. For more information, see the [Model Optimizer extension](../MO_DG/prepare_model/customize_model_optimizer/Customize_Model_Optimizer.md).
+* transform OpenVINO Model during the runtime. For more information, see [OpenVINO Runtime Transformations](../Extensibility_UG/ov_transformations.md).
+* modify the original model with the help of the original framework.

### Extensibility
OpenVINO provides a special mechanism that allows adding support of shape inference for custom operations. This mechanism is described in the [Extensibility documentation](../Extensibility_UG/Intro.md)
@@ -141,17 +141,17 @@ OpenVINO provides a special mechanism that allows adding support of shape infere

@endsphinxdirective

OpenVINO™ provides capabilities to change model input shape during the runtime.
-It may be useful in case you would like to feed model an input that has different size than model input shape.
-In case you need to do this only once [prepare a model with updated shapes via Model Optimizer](@ref when_to_specify_input_shapes) for all the other cases follow instructions further.
+It may be useful when you want to feed model an input that has different size than model input shape.
+If you need to do this only once, prepare a model with updated shapes via Model Optimizer. See [specifying input shapes](@ref when_to_specify_input_shapes) for more information. For all the other cases, follow the instructions below.

-### Set a new input shape with reshape method
+### Setting a New Input Shape with Reshape Method

The [Model.reshape](api/ie_python_api/_autosummary/openvino.runtime.Model.html#openvino.runtime.Model.reshape) method updates input shapes and propagates them down to the outputs of the model through all intermediate layers.
Example: Changing the batch size and spatial dimensions of input of a model with an image input:



-Please see the code to achieve that:
+Consider the code below to achieve that:

@sphinxdirective

@@ -161,9 +161,9 @@ Please see the code to achieve that:

@endsphinxdirective

-### Set a new batch size with set_batch method
+### Setting a New Batch Size with the set_batch Method

-Meaning of the model batch may vary depending on the model design.
+The meaning of the model batch may vary depending on the model design.
In order to change the batch dimension of the model, [set the layout](@ref declare_model_s_layout) for inputs and call the [set_batch](api/ie_python_api/_autosummary/openvino.runtime.set_batch.html) method.

@sphinxdirective
@@ -183,15 +183,15 @@ There are other approaches to change model input shapes during the stage of [IR

### Dynamic Shape Notice

Shape-changing functionality could be used to turn dynamic model input into a static one and vice versa.
-It is recommended to always set static shapes in case if the shape of data is not going to change from one inference to another.
-Setting static shapes avoids possible functional limitations, memory and run time overheads for dynamic shapes that vary depending on hardware plugin and model used.
-To learn more about dynamic shapes in OpenVINO please see a [dedicated article](../OV_Runtime_UG/ov_dynamic_shapes.md).
+It is recommended to always set static shapes when the shape of data is not going to change from one inference to another.
+Setting static shapes can avoid possible functional limitations, memory, and runtime overheads for dynamic shapes which may vary depending on hardware plugin and used model.
+To learn more about dynamic shapes in OpenVINO, see the [Dynamic Shapes](../OV_Runtime_UG/ov_dynamic_shapes.md) article.

-### Usage of Reshape Method <a name="usage_of_reshape_method"></a>
+### Usage of the Reshape Method <a name="usage_of_reshape_method"></a>

The primary method of the feature is [Model.reshape](api/ie_python_api/_autosummary/openvino.runtime.Model.html#openvino.runtime.Model.reshape). It is overloaded to better serve two main use cases:

-1) To change input shape of model with single input you may pass new shape into the method. Please see the example of adjusting spatial dimensions to the input image:
+1) To change the input shape of a model with a single input, you may pass a new shape to the method. See the example of adjusting spatial dimensions to the input image:

@sphinxdirective

@@ -204,12 +204,12 @@ The primary method of the feature is [Model.reshape](api/ie_python_api/_autosumm

To do the opposite - resize input image to the input shapes of the model, use the [pre-processing API](../OV_Runtime_UG/preprocessing_overview.md).

2) Otherwise, you can express reshape plan via dictionary mapping input and its new shape:
-Dictionary keys could be
-* `str` specifies input by its name
-* `int` specifies input by its index
-* `openvino.runtime.Output` specifies input by passing actual input object
+Dictionary keys could be:
+* The `str` key specifies input by its name.
+* The `int` key specifies input by its index.
+* The `openvino.runtime.Output` key specifies input by passing the actual input object.

-Dictionary values (representing new shapes) could be
+Dictionary values (representing new shapes) could be:
* `list`
* `tuple`
* `PartialShape`
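To make the options above concrete, a short sketch combining a dictionary-based reshape with `set_batch`; the model path, shapes, and the NCHW layout are assumptions about the model at hand:

```python
from openvino.runtime import Core, Layout, set_batch

core = Core()
model = core.read_model("model.xml")  # placeholder path

# Keys: name (str), index (int), or an Output object; values: list, tuple, or PartialShape.
model.reshape({0: [2, 3, 448, 448]})

# set_batch changes only the batch dimension; the layout identifies which dimension that is.
model.get_parameters()[0].set_layout(Layout("NCHW"))
set_batch(model, 4)
```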
@@ -236,9 +236,9 @@ Dictionary values (representing new shapes) could be

@endsphinxdirective

-Please find usage scenarios of `reshape` feature in our [samples](Samples_Overview.md), starting with [Hello Reshape Sample](../../samples/python/hello_reshape_ssd/README.md).
+The usage scenarios of the `reshape` feature can be found in [OpenVINO Samples](Samples_Overview.md), starting with the [Hello Reshape Sample](../../samples/python/hello_reshape_ssd/README.md).

-Practically, some models are not ready to be reshaped. In this case, a new input shape cannot be set with the Model Optimizer or the `Model.reshape` method.
+In practice, some models are not ready to be reshaped. In such cases, a new input shape cannot be set with Model Optimizer or the `Model.reshape` method.

### Troubleshooting Reshape Errors

@@ -256,7 +256,7 @@ Having the input of the shape [N, C, H, W], Global Pooling returns the output of

Model architects usually express Global Pooling with the help of the `Pooling` operation with the fixed kernel size [H, W].
During spatial reshape, having the input of the shape [N, C, H1, W1], Pooling with the fixed kernel size [H, W] returns the output of the shape [N, C, H2, W2], where H2 and W2 are commonly not equal to `1`.
It breaks the classification model structure.
-For example, [publicly available Inception family models from TensorFlow*](https://github.com/tensorflow/models/tree/master/research/slim#pre-trained-models) have this issue.
+For example, the publicly available [Inception family models from TensorFlow](https://github.com/tensorflow/models/tree/master/research/slim#pre-trained-models) have this issue.

- Changing the model input shape may significantly affect its accuracy.
For example, Object Detection models from TensorFlow have resizing restrictions by design.
@ -267,21 +267,21 @@ For details, refer to the [Tensorflow Object Detection API models resizing techn
|
||||
|
||||
Some operators which prevent normal shape propagation can be fixed. To do so you can:
|
||||
* see if the issue can be fixed via changing the values of some operators input.
|
||||
E.g. most common problem of non-reshape-able models is a `Reshape` operator with hardcoded output shape.
|
||||
For example, the most common problem of non-reshape-able models is a `Reshape` operator with hard-coded output shape.
|
||||
You can cut-off hard-coded 2nd input of `Reshape` and fill it in with relaxed values.
|
||||
For the following example on the picture Model Optimizer CLI should be:
|
||||
```sh
|
||||
mo --input_model path/to/model --input data[8,3,224,224],1:reshaped[2]->[0 -1]`
|
||||
```
|
||||
With `1:reshaped[2]` we request to cut 2nd input (counting from zero, so `1:` means 2nd inputs) of operation named `reshaped` and replace it with a `Parameter` with shape `[2]`.
|
||||
With `->[0 -1]` we replace this new `Parameter` by a `Constant` operator which has value `[0, -1]`.
|
||||
Since `Reshape` operator has `0` and `-1` as a specific values (see the meaning in [the specification](../ops/shape/Reshape_1.md)) it allows to propagate shapes freely without losing the intended meaning of `Reshape`.
|
||||
With `1:reshaped[2]`, it's requested to cut the 2nd input (counting from zero, so `1:` means the 2nd input) of the operation named `reshaped` and replace it with a `Parameter` with shape `[2]`.
|
||||
With `->[0 -1]`, this new `Parameter` is replaced by a `Constant` operator which has value `[0, -1]`.
|
||||
Since the `Reshape` operator has `0` and `-1` as specific values (see the meaning in [this specification](../ops/shape/Reshape_1.md)), it allows propagating shapes freely without losing the intended meaning of `Reshape`.
|
||||
|
||||

|
||||
|
||||
* transform model during Model Optimizer conversion on the back phase. See [Model Optimizer extension article](../MO_DG/prepare_model/customize_model_optimizer/Customize_Model_Optimizer.md)
|
||||
* transform OpenVINO Model during the runtime. See [OpenVINO Runtime Transformations article](../Extensibility_UG/ov_transformations.md)
|
||||
* modify the original model with the help of original framework
|
||||
* transform the model during the back phase of Model Optimizer conversion. See [Model Optimizer extension](../MO_DG/prepare_model/customize_model_optimizer/Customize_Model_Optimizer.md).
|
||||
* transform the OpenVINO model at runtime. See [OpenVINO Runtime Transformations](../Extensibility_UG/ov_transformations.md).
|
||||
* modify the original model with the help of the original framework.
|
||||
|
||||
### Extensibility
|
||||
OpenVINO provides a special mechanism that allows adding support for shape inference of custom operations. This mechanism is described in the [Extensibility documentation](../Extensibility_UG/Intro.md).
|
||||
|
@ -1,14 +1,16 @@
|
||||
# Automatic Batching {#openvino_docs_OV_UG_Automatic_Batching}
|
||||
|
||||
## (Automatic) Batching Execution
|
||||
The Automatic Batching Execution mode (or Auto-batching for short) performs automatic batching on-the-fly to improve device utilization by grouping inference requests together, with no programming effort from the user.
|
||||
With Automatic Batching, gathering the input and scattering the output from the individual inference requests required for the batch happen transparently, without affecting the application code.
|
||||
|
||||
The Automatic-Batching is a preview of the new functionality in the OpenVINO™ toolkit. It performs on-the-fly automatic batching (i.e. grouping inference requests together) to improve device utilization, with no programming effort from the user.
|
||||
Inputs gathering and outputs scattering from the individual inference requests required for the batch happen transparently, without affecting the application code.
|
||||
This article provides a preview of the new Automatic Batching function, including how it works, its configurations, and testing performance.
|
||||
|
||||
The feature primarily targets existing code written for inferencing many requests (each instance with the batch size 1). To obtain corresponding performance improvements, the application must be *running many inference requests simultaneously*.
|
||||
As explained below, the auto-batching functionality can be also used via a special *virtual* device.
|
||||
## Enabling/Disabling Automatic Batching
|
||||
|
||||
Batching is a straightforward way of leveraging the GPU compute power and saving on communication overheads. The automatic batching is _implicitly_ triggered on the GPU when the `ov::hint::PerformanceMode::THROUGHPUT` is specified for the `ov::hint::performance_mode` property for the compile_model or set_property calls.
|
||||
Auto-batching primarily targets the existing code written for inferencing many requests, each instance with the batch size 1. To obtain corresponding performance improvements, the application **must be running many inference requests simultaneously**.
|
||||
Auto-batching can also be used via a particular *virtual* device.
|
||||
|
||||
Batching is a straightforward way of leveraging the compute power of GPU and saving on communication overheads. Automatic Batching is "implicitly" triggered on the GPU when `ov::hint::PerformanceMode::THROUGHPUT` is specified for the `ov::hint::performance_mode` property for the `compile_model` or `set_property` calls.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -27,8 +29,11 @@ Batching is a straightforward way of leveraging the GPU compute power and saving
|
||||
@endsphinxtabset
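For reference, a minimal C++ sketch of the hint-based triggering described above (the model path is a placeholder):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");  // placeholder path

    // The THROUGHPUT hint lets the GPU plugin trigger Automatic Batching implicitly.
    ov::CompiledModel compiled = core.compile_model(
        model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    return 0;
}
```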
|
||||
|
||||
|
||||
> **NOTE**: You can disable the Auto-Batching (for example, for the GPU device) from being triggered by the `ov::hint::PerformanceMode::THROUGHPUT`. To do that, pass the `ov::hint::allow_auto_batching` set to **false** in addition to the `ov::hint::performance_mode`:
|
||||
To enable Auto-batching in legacy apps that are not written with the notion of performance hints in mind, use the **explicit** device notion, such as `BATCH:GPU`, as shown in the sketch below.
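A minimal C++ sketch of the explicit device notion, with no performance hints involved (the model path is a placeholder):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");  // placeholder path

    // The "virtual" BATCH device wraps the GPU and batches incoming requests automatically.
    ov::CompiledModel compiled = core.compile_model(model, "BATCH:GPU");
    return 0;
}
```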
|
||||
|
||||
### Disabling Automatic Batching
|
||||
|
||||
Auto-Batching can be disabled (for example, for the GPU device) to prevent being triggered by `ov::hint::PerformanceMode::THROUGHPUT`. To do that, set `ov::hint::allow_auto_batching` to **false** in addition to the `ov::hint::performance_mode`, as shown below:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -47,10 +52,20 @@ Batching is a straightforward way of leveraging the GPU compute power and saving
|
||||
@endsphinxtabset
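For reference, a minimal C++ sketch of this combination of properties (the device and model path are placeholders):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");  // placeholder path

    // Keep the THROUGHPUT hint, but opt out of Automatic Batching.
    ov::CompiledModel compiled = core.compile_model(
        model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
        ov::hint::allow_auto_batching(false));
    return 0;
}
```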
|
||||
|
||||
|
||||
Alternatively, to enable the Auto-Batching in the legacy apps not akin to the notion of the performance hints, you may need to use the **explicit** device notion, such as 'BATCH:GPU'. In both cases (the *throughput* hint or explicit BATCH device), the optimal batch size selection happens automatically (the implementation queries the `ov::optimal_batch_size` property from the device, passing the model's graph as the parameter). The actual value depends on the model and device specifics, for example, on-device memory for the dGPUs.
|
||||
Auto-Batching support is not limited to the GPUs, but if a device does not support the `ov::optimal_batch_size` yet, it can work with the auto-batching only when specifying an explicit batch size, for example, "BATCH:<device>(16)".
|
||||
## Configuring Automatic Batching
|
||||
Following the OpenVINO naming convention, the *batching* device is assigned the label of *BATCH*. The configuration options are as follows:
|
||||
|
||||
This _automatic batch size selection_ assumes that the application queries the `ov::optimal_number_of_infer_requests` to create and run the returned number of requests simultaneously:
|
||||
| Parameter name | Parameter description | Examples |
|
||||
| :--- | :--- |:-----------------------------------------------------------------------------|
|
||||
| `AUTO_BATCH_DEVICE` | The name of the device to apply Automatic batching, with the optional batch size value in brackets. | `BATCH:GPU` triggers the automatic batch size selection. `BATCH:GPU(4)` directly specifies the batch size. |
|
||||
| `AUTO_BATCH_TIMEOUT` | The timeout value, in ms. (1000 by default) | You can reduce the timeout value to avoid a performance penalty when the data arrives too unevenly. For example, set it to "100". On the contrary, make it large enough to accommodate input preparation (e.g. when it is a serial process). |
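The sketch below shows how these options might look from C++. It assumes that the `ov::auto_batch_timeout` property corresponds to `AUTO_BATCH_TIMEOUT` and that the part after the colon in the device name plays the role of `AUTO_BATCH_DEVICE`:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");  // placeholder path

    // "GPU(4)" after the colon specifies the target device and the batch size,
    // while ov::auto_batch_timeout sets the collection timeout in milliseconds.
    ov::CompiledModel compiled = core.compile_model(
        model, "BATCH:GPU(4)", ov::auto_batch_timeout(100));
    return 0;
}
```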
|
||||
|
||||
## Automatic Batch Size Selection
|
||||
|
||||
In both the THROUGHPUT hint and the explicit BATCH device cases, the optimal batch size is selected automatically, as the implementation queries the `ov::optimal_batch_size` property from the device and passes the model graph as the parameter. The actual value depends on the model and device specifics, for example, the on-device memory for dGPUs.
|
||||
The support for Auto-batching is not limited to GPU. However, if a device does not support `ov::optimal_batch_size` yet, to work with Auto-batching, an explicit batch size must be specified, e.g., `BATCH:<device>(16)`.
|
||||
|
||||
This "automatic batch size selection" works on the presumption that the application queries `ov::optimal_number_of_infer_requests` to create the requests of the returned number and run them simultaneously:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -68,10 +83,13 @@ This _automatic batch size selection_ assumes that the application queries the `
|
||||
|
||||
@endsphinxtabset
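A minimal C++ sketch of that pattern, assuming the inputs of each request are filled elsewhere:

```cpp
#include <openvino/openvino.hpp>
#include <cstdint>
#include <vector>

int main() {
    ov::Core core;
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");  // placeholder path
    ov::CompiledModel compiled = core.compile_model(
        model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    // Create as many requests as the device considers optimal and run them together.
    uint32_t nireq = compiled.get_property(ov::optimal_number_of_infer_requests);
    std::vector<ov::InferRequest> requests;
    for (uint32_t i = 0; i < nireq; ++i)
        requests.push_back(compiled.create_infer_request());

    for (auto& request : requests)
        request.start_async();   // inputs are assumed to be filled beforehand
    for (auto& request : requests)
        request.wait();
    return 0;
}
```

Running fewer requests than the returned number leaves the batch underfilled and triggers the timeout fallback described below.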
|
||||
|
||||
If not enough inputs were collected, the `timeout` value makes the transparent execution fall back to the execution of individual requests. Configuration-wise, this is the AUTO_BATCH_TIMEOUT property.
|
||||
The timeout, which adds itself to the execution time of the requests, heavily penalizes the performance. To avoid this, in cases when your parallel slack is bounded, give the OpenVINO an additional hint.
|
||||
|
||||
For example, the application processes only 4 video streams, so there is no need to use a batch larger than 4. The most future-proof way to communicate the limitations on the parallelism is to equip the performance hint with the optional `ov::hint::num_requests` configuration key set to 4. For the GPU this will limit the batch size, for the CPU - the number of inference streams, so each device uses the `ov::hint::num_requests` while converting the hint to the actual device configuration options:
|
||||
### Optimizing Performance by Limiting Batch Size
|
||||
|
||||
If not enough inputs were collected, the `timeout` value makes the transparent execution fall back to the execution of individual requests. This value can be configured via the `AUTO_BATCH_TIMEOUT` property.
|
||||
The timeout, which adds itself to the execution time of the requests, heavily penalizes the performance. To avoid this, when your parallel slack is bounded, provide OpenVINO with an additional hint.
|
||||
|
||||
For example, when the application processes only 4 video streams, there is no need to use a batch larger than 4. The most future-proof way to communicate the limitations on the parallelism is to equip the performance hint with the optional `ov::hint::num_requests` configuration key set to 4. This will limit the batch size for the GPU and the number of inference streams for the CPU, hence each device uses `ov::hint::num_requests` while converting the hint to the actual device configuration options:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -90,44 +108,47 @@ For example, the application processes only 4 video streams, so there is no need
|
||||
@endsphinxtabset
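For reference, a minimal C++ sketch of the hint described above; the value 4 matches the four video streams in the example:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");  // placeholder path

    // Tell the THROUGHPUT hint that at most 4 requests will run in parallel,
    // which caps the batch size on the GPU and the number of streams on the CPU.
    ov::CompiledModel compiled = core.compile_model(
        model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
        ov::hint::num_requests(4));
    return 0;
}
```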
|
||||
|
||||
|
||||
For the *explicit* usage, you can limit the batch size using "BATCH:GPU(4)", where 4 is the number of requests running in parallel.
|
||||
For the *explicit* usage, you can limit the batch size by using `BATCH:GPU(4)`, where 4 is the number of requests running in parallel.
|
||||
|
||||
### Other Performance Considerations
|
||||
## Other Performance Considerations
|
||||
|
||||
To achieve the best performance with the Automatic Batching, the application should:
|
||||
- Operate the number of inference requests that represents the multiple of the batch size. In the above example, for batch size 4, the application should operate 4, 8, 12, 16, etc. requests.
|
||||
- Use the requests, grouped by the batch size, together. For example, the first 4 requests are inferred, while the second group of the requests is being populated. Essentially, the Automatic Batching shifts the asynchronousity from the individual requests to the groups of requests that constitute the batches.
|
||||
- Balance the 'timeout' value vs the batch size. For example, in many cases having a smaller timeout value/batch size may yield better performance than large batch size, but with the timeout value that is not large enough to accommodate the full number of the required requests.
|
||||
- When the Automatic Batching is enabled, the 'timeout' property of the `ov::CompiledModel` can be changed any time, even after model loading/compilation. For example, setting the value to 0 effectively disables the auto-batching, as requests' collection would be omitted.
|
||||
- Carefully apply the auto-batching to the pipelines. For example for the conventional video-sources->detection->classification flow, it is the most benefical to do auto-batching over the inputs to the detection stage. Whereas the resulting number of detections is usually fluent, which makes the auto-batching less applicable for the classification stage.
|
||||
To achieve the best performance with Automatic Batching, the application should:
|
||||
- Operate a number of inference requests that is a multiple of the batch size. In the example above, for batch size 4, the application should operate 4, 8, 12, 16, etc. requests.
|
||||
- Use the requests that are grouped by the batch size together. For example, the first 4 requests are inferred, while the second group of the requests is being populated. Essentially, Automatic Batching shifts the asynchronicity from the individual requests to the groups of requests that constitute the batches.
|
||||
- Balance the `timeout` value vs. the batch size. For example, in many cases, having a smaller `timeout` value/batch size may yield better performance than having a larger batch size with a `timeout` value that is not large enough to accommodate the full number of the required requests.
|
||||
- When Automatic Batching is enabled, the `timeout` property of `ov::CompiledModel` can be changed anytime, even after the loading/compilation of the model. For example, setting the value to 0 disables Auto-batching effectively, as the collection of requests would be omitted (see the sketch after this list).
|
||||
- Carefully apply Auto-batching to the pipelines. For example, in the conventional "video-sources -> detection -> classification" flow, it is most beneficial to do Auto-batching over the inputs to the detection stage. The resulting number of detections usually fluctuates, which makes Auto-batching less applicable for the classification stage.
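As a sketch of the runtime adjustment mentioned in the list above, assuming the `ov::auto_batch_timeout` property maps to the `timeout`:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");  // placeholder path
    ov::CompiledModel compiled = core.compile_model(model, "BATCH:GPU(4)");

    // Shrink the timeout when inputs arrive unevenly...
    compiled.set_property({ov::auto_batch_timeout(100)});
    // ...or set it to 0 to effectively disable the collection of requests.
    compiled.set_property({ov::auto_batch_timeout(0)});
    return 0;
}
```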
|
||||
|
||||
The following are the limitations of the current implementation:
|
||||
- Although less critical for the throughput-oriented scenarios, the load-time with auto-batching increases by almost 2x.
|
||||
- Certain networks are not safely reshape-able by the "batching" dimension (specified as 'N' in the layouts terms). Also, if the batching dimension is not zero-th, the auto-batching is not triggered _implicitly_ by the throughput hint.
|
||||
- The _explicit_ notion, for example, "BATCH:GPU", uses the relaxed dimensions tracking, often making the auto-batching possible. For example, this trick unlocks most **detection networks**.
|
||||
- - When *forcing* the auto-batching via the explicit device notion, make sure to validate the results for correctness.
|
||||
- Performance improvements happen at the cost of the memory footprint growth, yet the auto-batching queries the available memory (especially for the dGPUs) and limits the selected batch size accordingly.
|
||||
- Although it is less critical for throughput-oriented scenarios, the load time with Auto-batching almost doubles.
|
||||
- Certain networks are not safely reshapable by the "batching" dimension (specified as `N` in the layout terms). Besides, if the batching dimension is not zeroth, Auto-batching will not be triggered "implicitly" by the throughput hint.
|
||||
- The "explicit" notion, for example, `BATCH:GPU`, using the relaxed dimensions tracking, often makes Auto-batching possible. For example, this method unlocks most **detection networks**.
|
||||
- When *forcing* Auto-batching via the "explicit" device notion, make sure that you validate the results for correctness.
|
||||
- Performance improvements happen at the cost of the growth of memory footprint. However, Auto-batching queries the available memory (especially for dGPU) and limits the selected batch size accordingly.
|
||||
|
||||
|
||||
### Configuring the Automatic Batching
|
||||
Following the OpenVINO convention for devices names, the *batching* device is named *BATCH*. The configuration options are as follows:
|
||||
|
||||
| Parameter name | Parameter description | Default | Examples |
|
||||
| :--- | :--- | :--- |:-----------------------------------------------------------------------------|
|
||||
| "AUTO_BATCH_DEVICE" | Device name to apply the automatic batching and optional batch size in brackets | N/A | "BATCH:GPU" which triggers the automatic batch size selection. Another example is the device name (to apply the batching) with directly specified batch size "BATCH:GPU(4)" |
|
||||
| "AUTO_BATCH_TIMEOUT" | timeout value, in ms | 1000 | you can reduce the timeout value (to avoid performance penalty when the data arrives too non-evenly) e.g. pass the "100", or in contrast make it large enough e.g. to accommodate inputs preparation (e.g. when it is serial process) |
|
||||
|
||||
### Testing Automatic Batching Performance with the Benchmark_App
|
||||
The `benchmark_app`, that exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md) versions, is the best way to evaluate the performance of the Automatic Batching:
|
||||
- The most straighforward way is performance hints:
|
||||
## Testing Performance with Benchmark_app
|
||||
The `benchmark_app` sample, which has both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md) versions, is the best way to evaluate the performance of Automatic Batching:
|
||||
- The most straightforward way is using the performance hints:
|
||||
- benchmark_app **-hint tput** -d GPU -m 'path to your favorite model'
|
||||
- Overriding the strict rules of implicit reshaping by the batch dimension via the explicit device notion:
|
||||
- You can also use the "explicit" device notion to override the strict rules of the implicit reshaping by the batch dimension:
|
||||
- benchmark_app **-hint none -d BATCH:GPU** -m 'path to your favorite model'
|
||||
- Finally, overriding the automatically-deduced batch size as well:
|
||||
- or override the automatically deduced batch size as well:
|
||||
- benchmark_app -hint none -d **BATCH:GPU(16)** -m 'path to your favorite model'
|
||||
- notice that some shell versions (e.g. `bash`) may require adding quotes around complex device names, i.e. -d "BATCH:GPU(16)"
|
||||
- This example also applies to CPU or any other device that generally supports batch execution.
|
||||
- Keep in mind that some shell versions (e.g. `bash`) may require adding quotes around complex device names, i.e. `-d "BATCH:GPU(16)"` in this example.
|
||||
|
||||
The last example is also applicable to the CPU or any other device that generally supports the batched execution.
|
||||
Note that Benchmark_app performs a warm-up run of a *single* request. As Auto-Batching requires significantly more requests to execute in batch, this warm-up run hits the default timeout value (1000 ms), as reported in the following example:
|
||||
|
||||
### See Also
|
||||
```
|
||||
[ INFO ] First inference took 1000.18ms
|
||||
```
|
||||
This value is also exposed in the final execution statistics on the `benchmark_app` exit:
|
||||
```
|
||||
[ INFO ] Latency:
|
||||
[ INFO ] Max: 1000.18 ms
|
||||
```
|
||||
This is NOT the actual latency of the batched execution, so it is recommended to refer to other metrics in the same log, for example, the "Median" or "Average" execution time.
|
||||
|
||||
### Additional Resources
|
||||
[Supported Devices](supported_plugins/Supported_Devices.md)
|
||||
|
@ -1,31 +1,29 @@
|
||||
# Deployment Manager {#openvino_docs_install_guides_deployment_manager_tool}
|
||||
# Deploying Your Application with Deployment Manager {#openvino_docs_install_guides_deployment_manager_tool}
|
||||
|
||||
The Deployment Manager is a Python* command-line tool that creates a deployment package by assembling the model, IR files, your application, and associated dependencies into a runtime package for your target device. This tool is delivered within the Intel® Distribution of OpenVINO™ toolkit for Linux*, Windows* and macOS* release packages and is available after installation in the `<INSTALL_DIR>/tools/deployment_manager` directory.
|
||||
The OpenVINO™ Deployment Manager is a Python command-line tool that creates a deployment package by assembling the model, OpenVINO IR files, your application, and associated dependencies into a runtime package for your target device. This tool is delivered within the Intel® Distribution of OpenVINO™ toolkit for Linux, Windows and macOS release packages. It is available in the `<INSTALL_DIR>/tools/deployment_manager` directory after installation.
|
||||
|
||||
This article provides instructions on how to create a package with Deployment Manager and then deploy the package to your target systems.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
* Intel® Distribution of OpenVINO™ toolkit
|
||||
To use the Deployment Manager tool, the following requirements need to be met:
|
||||
* Intel® Distribution of OpenVINO™ toolkit is installed. See the [Installation Guide](../../install_guides/installing-openvino-overview.md) for instructions on different operating systems.
|
||||
* To run inference on a target device other than CPU, device drivers must be pre-installed:
|
||||
* **For Linux**, see the following sections in the [installation instructions for Linux](../../install_guides/installing-openvino-linux.md):
|
||||
* Steps for [Intel® Processor Graphics (GPU)](../../install_guides/configurations-for-intel-gpu.md) section
|
||||
* Steps for [Intel® Neural Compute Stick 2 section](../../install_guides/configurations-for-ncs2.md)
|
||||
* Steps for [Intel® Vision Accelerator Design with Intel® Movidius™ VPUs](../../install_guides/configurations-for-ivad-vpu.md)
|
||||
* Steps for [Intel® Gaussian & Neural Accelerator (GNA)](../../install_guides/configurations-for-intel-gna.md)
|
||||
* **For Windows**, see the following sections in the [installation instructions for Windows](../../install_guides/installing-openvino-windows.md):
|
||||
* Steps for [Intel® Processor Graphics (GPU)](../../install_guides/configurations-for-intel-gpu.md)
|
||||
* Steps for the [Intel® Vision Accelerator Design with Intel® Movidius™ VPUs](../../install_guides/configurations-for-ivad-vpu.md)
|
||||
* **For macOS**, see the following section in the [installation instructions for macOS](../../install_guides/installing-openvino-macos.md):
|
||||
* Steps for [Intel® Neural Compute Stick 2 section](../../install_guides/configurations-for-ncs2.md)
|
||||
* **For GPU**, see [Configurations for Intel® Processor Graphics (GPU)](../../install_guides/configurations-for-intel-gpu.md).
|
||||
* **For NCS2**, see [Configurations for Intel® Neural Compute Stick 2](../../install_guides/configurations-for-ncs2.md).
|
||||
* **For VPU**, see [Configurations for Intel® Vision Accelerator Design with Intel® Movidius™ VPUs](../../install_guides/configurations-for-ivad-vpu.md).
|
||||
* **For GNA**, see [Configurations for Intel® Gaussian & Neural Accelerator (GNA)](../../install_guides/configurations-for-intel-gna.md).
|
||||
|
||||
> **IMPORTANT**: The operating system on the target system must be the same as the development system on which you are creating the package. For example, if the target system is Ubuntu 18.04, the deployment package must be created from the OpenVINO™ toolkit installed on Ubuntu 18.04.
|
||||
|
||||
> **IMPORTANT**: The target operating system must be the same as the development system on which you are creating the package. For example, if the target system is Ubuntu 18.04, the deployment package must be created from the OpenVINO™ toolkit installed on Ubuntu 18.04.
|
||||
|
||||
> **TIP**: If your application requires additional dependencies, including the Microsoft Visual C++ Redistributable, use the ['--user_data' option](https://docs.openvino.ai/latest/openvino_docs_install_guides_deployment_manager_tool.html#run-standard-cli-mode) to add them to the deployment archive. Install these dependencies on the target host before running inference.
|
||||
|
||||
## Create Deployment Package Using Deployment Manager
|
||||
## Creating Deployment Package Using Deployment Manager
|
||||
|
||||
There are two ways to create a deployment package that includes inference-related components of the OpenVINO™ toolkit: you can run the Deployment Manager tool in either interactive or standard CLI mode.
|
||||
To create a deployment package that includes inference-related components of the OpenVINO™ toolkit, you can run the Deployment Manager tool in either interactive or standard CLI mode.
|
||||
|
||||
### Run Interactive Mode
|
||||
### Running Deployment Manager in Interactive Mode
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -35,9 +33,9 @@ There are two ways to create a deployment package that includes inference-relate
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
Interactive mode provides a user-friendly command-line interface that will guide you through the process with text prompts.
|
||||
The interactive mode provides a user-friendly command-line interface that guides you through the process with text prompts.
|
||||
|
||||
To launch the Deployment Manager in interactive mode, open a new terminal window, go to the Deployment Manager tool directory and run the tool script without parameters:
|
||||
To launch the Deployment Manager in interactive mode, open a new terminal window, go to the Deployment Manager tool directory, and run the tool script without parameters:
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -69,23 +67,23 @@ The target device selection dialog is displayed:
|
||||
|
||||

|
||||
|
||||
Use the options provided on the screen to complete selection of the target devices and press **Enter** to proceed to the package generation dialog. if you want to interrupt the generation process and exit the program, type **q** and press **Enter**.
|
||||
Use the options provided on the screen to complete the selection of the target devices, and press **Enter** to proceed to the package generation dialog. To interrupt the generation process and exit the program, type **q** and press **Enter**.
|
||||
|
||||
Once you accept the selection, the package generation dialog is displayed:
|
||||
Once the selection is accepted, the package generation dialog will appear:
|
||||
|
||||

|
||||
|
||||
The target devices you have selected at the previous step appear on the screen. To go back and change the selection, type **b** and press **Enter**. Use the options provided to configure the generation process, or use the default settings.
|
||||
The target devices selected in the previous step appear on the screen. To go back and change the selection, type **b** and press **Enter**. Use the default settings, or use the following options to configure the generation process:
|
||||
|
||||
* `o. Change output directory` (optional): Path to the output directory. By default, it's set to your home directory.
|
||||
* `o. Change output directory` (optional): the path to the output directory. By default, it is set to your home directory.
|
||||
|
||||
* `u. Provide (or change) path to folder with user data` (optional): Path to a directory with user data (IRs, models, datasets, etc.) files and subdirectories required for inference, which will be added to the deployment archive. By default, it's set to `None`, which means you will separately copy the user data to the target system.
|
||||
* `u. Provide (or change) path to folder with user data` (optional): the path to a directory with user data (OpenVINO IR, model, dataset, etc.) files and subdirectories required for inference, which will be added to the deployment archive. By default, it is set to `None`, which means that copying the user data to the target system needs to be done separately.
|
||||
|
||||
* `t. Change archive name` (optional): Deployment archive name without extension. By default, it is set to `openvino_deployment_package`.
|
||||
* `t. Change archive name` (optional): the deployment archive name without extension. By default, it is set to `openvino_deployment_package`.
|
||||
|
||||
Once all the parameters are set, type **g** and press **Enter** to generate the package for the selected target devices. To interrupt the generation process and exit the program, type **q** and press **Enter**.
|
||||
After all the parameters are set, type **g** and press **Enter** to generate the package for the selected target devices. To interrupt the generation process and exit the program, type **q** and press **Enter**.
|
||||
|
||||
The script successfully completes and the deployment package is generated in the specified output directory.
|
||||
Once the script has successfully completed, the deployment package is generated in the specified output directory.
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -95,7 +93,7 @@ The script successfully completes and the deployment package is generated in the
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
### Run Standard CLI Mode
|
||||
### Running Deployment Manager in Standard CLI Mode
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -105,9 +103,9 @@ The script successfully completes and the deployment package is generated in the
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
Alternatively, you can run the Deployment Manager tool in the standard CLI mode. In this mode, you specify the target devices and other parameters as command-line arguments of the Deployment Manager Python script. This mode facilitates integrating the tool in an automation pipeline.
|
||||
You can also run the Deployment Manager tool in the standard CLI mode. In this mode, specify the target devices and other parameters as command-line arguments of the Deployment Manager Python script. This mode facilitates integrating the tool in an automation pipeline.
|
||||
|
||||
To launch the Deployment Manager tool in the standard mode, open a new terminal window, go to the Deployment Manager tool directory and run the tool command with the following syntax:
|
||||
To launch the Deployment Manager tool in the standard mode, open a new terminal window, go to the Deployment Manager tool directory, and run the tool command with the following syntax:
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -136,15 +134,16 @@ To launch the Deployment Manager tool in the standard mode, open a new terminal
|
||||
|
||||
The following options are available:
|
||||
|
||||
* `<--targets>` (required): List of target devices to run inference. To specify more than one target, separate them with spaces. For example: `--targets cpu gpu vpu`. You can get a list of currently available targets by running the program with the `-h` option.
|
||||
* `<--targets>` (required): the list of target devices to run inference. To specify more than one target, separate them with spaces, for example, `--targets cpu gpu vpu`.
|
||||
To get a list of currently available targets, run the program with the `-h` option.
|
||||
|
||||
* `[--output_dir]` (optional): Path to the output directory. By default, it is set to your home directory.
|
||||
* `[--output_dir]` (optional): the path to the output directory. By default, it is set to your home directory.
|
||||
|
||||
* `[--archive_name]` (optional): Deployment archive name without extension. By default, it is set to `openvino_deployment_package`.
|
||||
* `[--archive_name]` (optional): a deployment archive name without extension. By default, it is set to `openvino_deployment_package`.
|
||||
|
||||
* `[--user_data]` (optional): Path to a directory with user data (IRs, models, datasets, etc.) files and subdirectories required for inference, which will be added to the deployment archive. By default, it's set to `None`, which means you will separately copy the user data to the target system.
|
||||
* `[--user_data]` (optional): the path to a directory with user data (OpenVINO IR, model, dataset, etc.) files and subdirectories required for inference, which will be added to the deployment archive. By default, it is set to `None`, which means that copying the user data to the target system needs to be performed separately.
|
||||
|
||||
The script successfully completes, and the deployment package is generated in the output directory specified.
|
||||
Once the script has successfully completed, the deployment package is generated in the output directory specified.
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -154,15 +153,15 @@ The script successfully completes, and the deployment package is generated in th
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
## Deploy Package on Target Systems
|
||||
## Deploying Package on Target Systems
|
||||
|
||||
After the Deployment Manager has successfully completed, you can find the generated `.tar.gz` (for Linux or macOS) or `.zip` (for Windows) package in the output directory you specified.
|
||||
Once the Deployment Manager has successfully completed, the `.tar.gz` (on Linux or macOS) or `.zip` (on Windows) package is generated in the specified output directory.
|
||||
|
||||
To deploy the OpenVINO Runtime components from the development machine to the target system, perform the following steps:
|
||||
|
||||
1. Copy the generated archive to the target system using your preferred method.
|
||||
1. Copy the generated archive to the target system by using your preferred method.
|
||||
|
||||
2. Unpack the archive into the destination directory on the target system (if your archive name is different from the default shown below, replace the `openvino_deployment_package` with the name you use).
|
||||
2. Extract the archive to the destination directory on the target system. If the name of your archive is different from the default one shown below, replace `openvino_deployment_package` with your specified name.
|
||||
@sphinxdirective
|
||||
|
||||
.. tab:: Linux
|
||||
@ -185,21 +184,22 @@ To deploy the OpenVINO Runtime components from the development machine to the ta
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
The package is unpacked to the destination directory and the following files and subdirectories are created:
|
||||
|
||||
* `setupvars.sh` — Copy of `setupvars.sh`
|
||||
* `runtime` — Contains the OpenVINO runtime binary files.
|
||||
* `install_dependencies` — Snapshot of the `install_dependencies` directory from the OpenVINO installation directory.
|
||||
* `<user_data>` — The directory with the user data (IRs, datasets, etc.) you specified while configuring the package.
|
||||
Now, the package is extracted to the destination directory. The following files and subdirectories are created:
|
||||
|
||||
For Linux, to run inference on a target Intel® GPU, Intel® Movidius™ VPU, or Intel® Vision Accelerator Design with Intel® Movidius™ VPUs, you need to install additional dependencies by running the `install_openvino_dependencies.sh` script on the target machine:
|
||||
* `setupvars.sh` — a copy of `setupvars.sh`.
|
||||
* `runtime` — contains the OpenVINO runtime binary files.
|
||||
* `install_dependencies` — a snapshot of the `install_dependencies` directory from the OpenVINO installation directory.
|
||||
* `<user_data>` — the directory with the user data (OpenVINO IR, model, dataset, etc.) specified while configuring the package.
|
||||
|
||||
3. On a target Linux system, to run inference on a target Intel® GPU, Intel® Movidius™ VPU, or Intel® Vision Accelerator Design with Intel® Movidius™ VPUs, install additional dependencies by running the `install_openvino_dependencies.sh` script:
|
||||
|
||||
```sh
|
||||
cd <destination_dir>/openvino/install_dependencies
|
||||
sudo -E ./install_openvino_dependencies.sh
|
||||
```
|
||||
|
||||
Set up the environment variables:
|
||||
4. Set up the environment variables:
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -226,4 +226,4 @@ Set up the environment variables:
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
You have now finished the deployment of the OpenVINO Runtime components to the target system.
|
||||
Now, you have finished the deployment of the OpenVINO Runtime components to the target system.
|
||||
|
@ -1,4 +1,4 @@
|
||||
# Deploy with OpenVINO {#openvino_deployment_guide}
|
||||
# Deploying Your Applications with OpenVINO™ {#openvino_deployment_guide}
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -11,58 +11,44 @@
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
Once the [OpenVINO application development](../integrate_with_your_application.md) has been finished, usually application developers need to deploy their applications to end users. There are several ways how to achieve that:
|
||||
Once the [OpenVINO™ application development](../integrate_with_your_application.md) has been finished, application developers usually need to deploy their applications to end users. There are several ways to achieve that:
|
||||
|
||||
- Set a dependency on existing prebuilt packages (so called _centralized distribution_):
|
||||
- Using Debian / RPM packages, a recommended way for a family of Linux operation systems
|
||||
- Using pip package manager on PyPi, default approach for Python-based applications
|
||||
- Using Docker images. If the application should be deployed as a Docker image, developer can use a pre-built runtime OpenVINO Docker image as a base image in the Dockerfile for the application container image. You can find more info about available OpenVINO Docker images in the Install Guides for [Linux](../../install_guides/installing-openvino-docker-linux.md) and [Windows](../../install_guides/installing-openvino-docker-windows.md).
|
||||
Also, if you need to customize OpenVINO Docker image, you can use [Docker CI Framework](https://github.com/openvinotoolkit/docker_ci) to generate a Dockerfile and built it.
|
||||
- Grab a necessary functionality of OpenVINO together with your application (so-called _local distribution_):
|
||||
- Using [OpenVINO Deployment manager](deployment-manager-tool.md) providing a convinient way create a distribution package
|
||||
- Using advanced [Local distribution](local-distribution.md) approach
|
||||
- Using [static version of OpenVINO Runtime linked into the final app](https://github.com/openvinotoolkit/openvino/wiki/StaticLibraries)
|
||||
- Set a dependency on the existing prebuilt packages, also called "centralized distribution":
|
||||
- using Debian / RPM packages - a recommended way for Linux operating systems;
|
||||
- using PIP package manager on PyPI - the default approach for Python-based applications;
|
||||
- using Docker images - if the application should be deployed as a Docker image, use a pre-built OpenVINO™ Runtime Docker image as a base image in the Dockerfile for the application container image. For more information about OpenVINO Docker images, refer to [Installing OpenVINO on Linux from Docker](../../install_guides/installing-openvino-docker-linux.md) and [Installing OpenVINO on Windows from Docker](../../install_guides/installing-openvino-docker-windows.md).
|
||||
Furthermore, to customize your OpenVINO Docker image, use the [Docker CI Framework](https://github.com/openvinotoolkit/docker_ci) to generate a Dockerfile and build the image.
|
||||
- Grab a necessary functionality of OpenVINO together with your application, also called "local distribution":
|
||||
- using [OpenVINO Deployment Manager](deployment-manager-tool.md) - providing a convenient way for creating a distribution package;
|
||||
- using the advanced [local distribution](local-distribution.md) approach;
|
||||
- using [a static version of OpenVINO Runtime linked to the final app](https://github.com/openvinotoolkit/openvino/wiki/StaticLibraries).
|
||||
|
||||
The table below shows which distribution type can be used depending on target operation system:
|
||||
The table below shows which distribution type can be used for what target operating system:
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
.. raw:: html
|
||||
|
||||
<div class="collapsible-section" data-title="Click to expand/collapse">
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
| Distribution type | Operation systems |
|
||||
| Distribution type | Operating systems |
|
||||
| ----------------- | ----------------- |
|
||||
| Debian packages | Ubuntu 18.04 long-term support (LTS), 64-bit; Ubuntu 20.04 long-term support (LTS), 64-bit |
|
||||
| RPM packages | Red Hat Enterprise Linux 8, 64-bit |
|
||||
| Docker images | Ubuntu 18.04 long-term support (LTS), 64-bit; Ubuntu 20.04 long-term support (LTS), 64-bit; Red Hat Enterprise Linux 8, 64-bit; Windows Server Core base LTSC 2019, 64-bit; Windows 10, version 20H2, 64-bit |
|
||||
| PyPi (pip package manager) | See [https://pypi.org/project/openvino/](https://pypi.org/project/openvino/) |
|
||||
| [OpenVINO Deployment Manager](deployment-manager-tool.md) | All operation systems |
|
||||
| [Local distribution](local-distribution.md) | All operation systems |
|
||||
| [Build OpenVINO statically and link into the final app](https://github.com/openvinotoolkit/openvino/wiki/StaticLibraries) | All operation systems |
|
||||
| PyPI (PIP package manager) | See [https://pypi.org/project/openvino/](https://pypi.org/project/openvino/) |
|
||||
| [OpenVINO Deployment Manager](deployment-manager-tool.md) | All operating systems |
|
||||
| [Local distribution](local-distribution.md) | All operating systems |
|
||||
| [Build OpenVINO statically and link to the final app](https://github.com/openvinotoolkit/openvino/wiki/StaticLibraries) | All operating systems |
|
||||
|
||||
@sphinxdirective
|
||||
## Granularity of Major Distribution Types
|
||||
|
||||
.. raw:: html
|
||||
|
||||
</div>
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
Depending on the distribution type, the granularity of OpenVINO packages may vary: PyPi distribution [OpenVINO has a single package 'openvino'](https://pypi.org/project/openvino/) containing all the runtime libraries and plugins, while more configurable ways like [Local distribution](local-distribution.md) provide higher granularity, so it is important to now some details about the set of libraries which are part of OpenVINO Runtime package:
|
||||
The granularity of OpenVINO packages may vary for different distribution types. For example, the PyPI distribution of OpenVINO has a [single 'openvino' package](https://pypi.org/project/openvino/) that contains all the runtime libraries and plugins, while a [local distribution](local-distribution.md) is a more configurable type providing higher granularity. Below are important details of the set of libraries included in the OpenVINO Runtime package:
|
||||
|
||||
![deployment_simplified]
|
||||
|
||||
- The main library `openvino` is used by C++ user's applications to link against with. The library provides all OpenVINO Runtime public API for both OpenVINO API 2.0 and Inference Engine, nGraph APIs. For C language applications `openvino_c` is additionally required for distribution.
|
||||
- The _optional_ plugin libraries like `openvino_intel_cpu_plugin` (matching `openvino_.+_plugin` pattern) are used to provide inference capabilities on specific devices or additional capabitilies like [Hetero execution](../hetero_execution.md) or [Multi-Device execution](../multi_device.md).
|
||||
- The _optional_ plugin libraries like `openvino_ir_frontnend` (matching `openvino_.+_frontend`) are used to provide capabilities to read models of different file formats like OpenVINO IR, ONNX or Paddle.
|
||||
- The main library `openvino` is the one that users' C++ applications link against. The library provides all OpenVINO Runtime public APIs, including both API 2.0 and the previous Inference Engine and nGraph APIs. For C language applications, `openvino_c` is additionally required for distribution.
|
||||
- The "optional" plugin libraries like `openvino_intel_cpu_plugin` (matching the `openvino_.+_plugin` pattern) are used to provide inference capabilities on specific devices or additional capabilities like [Hetero Execution](../hetero_execution.md) and [Multi-Device Execution](../multi_device.md).
|
||||
- The "optional" plugin libraries like `openvino_ir_frontend` (matching `openvino_.+_frontend`) are used to provide capabilities to read models of different file formats such as OpenVINO IR, ONNX, and PaddlePaddle.
|
||||
|
||||
The _optional_ means that if the application does not use the capability enabled by the plugin, the plugin's library or package with the plugin is not needed in the final distribution.
|
||||
Here the term "optional" means that if the application does not use the capability enabled by the plugin, the plugin library or a package with the plugin is not needed in the final distribution.
|
||||
|
||||
The information above covers granularity aspects of majority distribution types, more detailed information is only needed and provided in [Local Distribution](local-distribution.md).
|
||||
Building a local distribution will require more detailed information, and you will find it in the dedicated [Libraries for Local Distribution](local-distribution.md) article.
|
||||
|
||||
> **NOTE**: Depending on target OpenVINO devices, you also have to use [Configurations for GPU](../../install_guides/configurations-for-intel-gpu.md), [Configurations for GNA](../../install_guides/configurations-for-intel-gna.md), [Configurations for NCS2](../../install_guides/configurations-for-ncs2.md) or [Configurations for VPU](../../install_guides/configurations-for-ivad-vpu.md) for proper configuration of deployed machines.
|
||||
> **NOTE**: Depending on your target OpenVINO devices, the following configurations might be needed for deployed machines: [Configurations for GPU](../../install_guides/configurations-for-intel-gpu.md), [Configurations for GNA](../../install_guides/configurations-for-intel-gna.md), [Configurations for NCS2](../../install_guides/configurations-for-ncs2.md), [Configurations for VPU](../../install_guides/configurations-for-ivad-vpu.md).
|
||||
|
||||
[deployment_simplified]: ../../img/deployment_simplified.png
|
||||
|
@ -1,37 +1,38 @@
|
||||
# Local distribution {#openvino_docs_deploy_local_distribution}
|
||||
# Libraries for Local Distribution {#openvino_docs_deploy_local_distribution}
|
||||
|
||||
The local distribution implies that each C or C++ application / installer will have its own copies of OpenVINO Runtime binaries. However, OpenVINO has a scalable plugin-based architecture which implies that some components can be loaded in runtime only if they are really needed. So, it is important to understand which minimal set of libraries is really needed to deploy the application and this guide helps to achieve this goal.
|
||||
With a local distribution, each C or C++ application/installer will have its own copies of OpenVINO Runtime binaries. However, OpenVINO has a scalable plugin-based architecture, which means that some components can be loaded in runtime only when they are really needed. Therefore, it is important to understand which minimal set of libraries is really needed to deploy the application. This guide helps you to achieve that goal.
|
||||
|
||||
> **NOTE**: The steps below are operation system independent and refer to a library file name without any prefixes (like `lib` on Unix systems) or suffixes (like `.dll` on Windows OS). Do not put `.lib` files on Windows OS to the distribution, because such files are needed only on a linker stage.
|
||||
|
||||
Local distribution is also suitable for OpenVINO binaries built from source using the [Build instructions](https://github.com/openvinotoolkit/openvino/wiki#how-to-build), but the guide below assumes that OpenVINO Runtime is built dynamically. In the case of [Static OpenVINO Runtime](https://github.com/openvinotoolkit/openvino/wiki/StaticLibraries), select the required OpenVINO capabilities at the CMake configuration stage using [CMake Options for Custom Compilation](https://github.com/openvinotoolkit/openvino/wiki/CMakeOptionsForCustomCompilation), then build and link the OpenVINO components into the final application.
|
||||
|
||||
### C++ or C language
|
||||
> **NOTE**: The steps below are operating system independent and refer to a library file name without any prefixes (like `lib` on Unix systems) or suffixes (like `.dll` on Windows OS). Do not put `.lib` files on Windows OS to the distribution, because such files are needed only on a linker stage.
|
||||
|
||||
Independently on language used to write the application, `openvino` must always be put to the final distribution since is a core library which orshectrates with all the inference and frontend plugins.
|
||||
If your application is written with C language, then you need to put `openvino_c` additionally.
|
||||
## Library Requirements for C++ and C Languages
|
||||
|
||||
The `plugins.xml` file with information about inference devices must also be taken as support file for `openvino`.
|
||||
Independent on the language used to write the application, the `openvino` library must always be put to the final distribution, since it's a core library which orchestrates with all the inference and frontend plugins. In Intel® Distribution of OpenVINO™ toolkit, `openvino` depends on the TBB libraries which are used by OpenVINO Runtime to optimally saturate the devices with computations, so it must be put to the distribution package.
|
||||
|
||||
> **NOTE**: in Intel Distribution of OpenVINO, `openvino` depends on TBB libraries which are used by OpenVINO Runtime to optimally saturate the devices with computations, so it must be put to the distribution package
|
||||
If your application is written with C language, you need to put the `openvino_c` library additionally.
|
||||
|
||||
### Pluggable components
|
||||
The `plugins.xml` file with information about inference devices must also be taken as a support file for `openvino`.
|
||||
|
||||
The picture below demonstrates dependnecies between the OpenVINO Runtime core and pluggable libraries:
|
||||
|
||||
## Libraries for Pluggable Components
|
||||
|
||||
The picture below presents dependencies between the OpenVINO Runtime core and pluggable libraries:
|
||||
|
||||
![deployment_full]
|
||||
|
||||
#### Compute devices
|
||||
### Libraries for Compute Devices
|
||||
|
||||
For each inference device, OpenVINO Runtime has its own plugin library:
|
||||
- `openvino_intel_cpu_plugin` for [Intel CPU devices](../supported_plugins/CPU.md)
|
||||
- `openvino_intel_gpu_plugin` for [Intel GPU devices](../supported_plugins/GPU.md)
|
||||
- `openvino_intel_gna_plugin` for [Intel GNA devices](../supported_plugins/GNA.md)
|
||||
- `openvino_intel_myriad_plugin` for [Intel MYRIAD devices](../supported_plugins/MYRIAD.md)
|
||||
- `openvino_intel_hddl_plugin` for [Intel HDDL device](../supported_plugins/HDDL.md)
|
||||
- `openvino_arm_cpu_plugin` for [ARM CPU devices](../supported_plugins/ARM_CPU.md)
|
||||
- `openvino_intel_cpu_plugin` for [Intel® CPU devices](../supported_plugins/CPU.md).
|
||||
- `openvino_intel_gpu_plugin` for [Intel® GPU devices](../supported_plugins/GPU.md).
|
||||
- `openvino_intel_gna_plugin` for [Intel® GNA devices](../supported_plugins/GNA.md).
|
||||
- `openvino_intel_myriad_plugin` for [Intel® MYRIAD devices](../supported_plugins/MYRIAD.md).
|
||||
- `openvino_intel_hddl_plugin` for [Intel® HDDL device](../supported_plugins/HDDL.md).
|
||||
- `openvino_arm_cpu_plugin` for [ARM CPU devices](../supported_plugins/ARM_CPU.md).
|
||||
|
||||
Depending on what devices is used in the app, put the appropriate libraries to the distribution package.
|
||||
Depending on what devices are used in the app, the appropriate libraries need to be put to the distribution package.
|
||||
|
||||
As shown in the picture above, some plugin libraries may have OS-specific dependencies, which are either backend libraries or additional support files with firmware, etc. Refer to the table below for details:
|
||||
|
||||
@ -105,58 +106,59 @@ As it is shown on the picture above, some plugin libraries may have OS-specific
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
#### Execution capabilities
|
||||
### Libraries for Execution Modes
|
||||
|
||||
`HETERO`, `MULTI`, `BATCH`, `AUTO` execution capabilities can also be used explicitly or implicitly by the application. Use the following recommendation scheme to decide whether to put the appropriate libraries to the distribution package:
|
||||
- If [AUTO](../auto_device_selection.md) is used explicitly in the application or `ov::Core::compile_model` is used without specifying a device, put the `openvino_auto_plugin` to the distribution
|
||||
> **NOTE**: Auto device selection relies on [inference device plugins](../supported_plugins/Device_Plugins.md), so if are not sure what inference devices are available on target machine, put all inference plugin libraries to the distribution. If the `ov::device::priorities` is used for `AUTO` to specify a limited device list, grab the corresponding device plugins only.
|
||||
The `HETERO`, `MULTI`, `BATCH` and `AUTO` execution modes can also be used explicitly or implicitly by the application. Use the following recommendation scheme to decide whether to put the appropriate libraries to the distribution package:
|
||||
- If [AUTO](../auto_device_selection.md) is used explicitly in the application or `ov::Core::compile_model` is used without specifying a device, put `openvino_auto_plugin` to the distribution.
|
||||
|
||||
- If [MULTI](../multi_device.md) is used explicitly, put the `openvino_auto_plugin` to the distribution
|
||||
- If [HETERO](../hetero_execution.md) is either used explicitly or `ov::hint::performance_mode` is used with GPU, put the `openvino_hetero_plugin` to the distribution
|
||||
- If [BATCH](../automatic_batching.md) is either used explicitly or `ov::hint::performance_mode` is used with GPU, put the `openvino_batch_plugin` to the distribution
|
||||
> **NOTE**: Automatic Device Selection relies on [inference device plugins](../supported_plugins/Device_Plugins.md). If you are not sure about what inference devices are available on target system, put all the inference plugin libraries to the distribution. If `ov::device::priorities` is used for `AUTO` to specify a limited device list, grab the corresponding device plugins only.
|
||||
|
||||
#### Reading models
|
||||
- If [MULTI](../multi_device.md) is used explicitly, put `openvino_auto_plugin` to the distribution.
|
||||
- If [HETERO](../hetero_execution.md) is either used explicitly or `ov::hint::performance_mode` is used with GPU, put `openvino_hetero_plugin` to the distribution.
|
||||
- If [BATCH](../automatic_batching.md) is either used explicitly or `ov::hint::performance_mode` is used with GPU, put `openvino_batch_plugin` to the distribution.
|
||||
|
||||
### Frontend Libraries for Reading Models
|
||||
|
||||
OpenVINO Runtime uses frontend libraries dynamically to read models in different formats:
|
||||
- To read OpenVINO IR `openvino_ir_frontend` is used
|
||||
- To read ONNX file format `openvino_onnx_frontend` is used
|
||||
- To read Paddle file format `openvino_paddle_frontend` is used
|
||||
- `openvino_ir_frontend` is used to read OpenVINO IR.
|
||||
- `openvino_onnx_frontend` is used to read ONNX file format.
|
||||
- `openvino_paddle_frontend` is used to read Paddle file format.
|
||||
|
||||
Depending on what types of model file format are used in the application in `ov::Core::read_model`, peek up the appropriate libraries.
|
||||
Depending on the model format types that are used in the application in `ov::Core::read_model`, pick up the appropriate libraries.
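For instance, the following hedged C++ sketch shows how the file format passed to `ov::Core::read_model` determines which frontend library must be present in the distribution (file names are placeholders):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;

    // Requires openvino_ir_frontend in the distribution.
    std::shared_ptr<ov::Model> ir_model = core.read_model("model.xml");

    // Requires openvino_onnx_frontend in the distribution.
    std::shared_ptr<ov::Model> onnx_model = core.read_model("model.onnx");
    return 0;
}
```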
|
||||
|
||||
> **NOTE**: The recommended way to optimize the size of final distribution package is to [convert models using Model Optimizer](../../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md) to OpenVINO IR, in this case you don't have to keep ONNX, Paddle and other frontend libraries in the distribution package.
|
||||
> **NOTE**: To optimize the size of final distribution package, you are recommended to convert models to OpenVINO IR by using [Model Optimizer](../../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md). This way you don't have to keep ONNX, PaddlePaddle, and other frontend libraries in the distribution package.
|
||||
|
||||
#### (Legacy) Preprocessing via G-API
|
||||
### (Legacy) Preprocessing via G-API
|
||||
|
||||
> **NOTE**: [G-API](../../gapi/gapi_intro.md) preprocessing is a legacy functionality, use [preprocessing capabilities from OpenVINO 2.0](../preprocessing_overview.md) which do not require any additional libraries.
|
||||
|
||||
If the application uses the `InferenceEngine::PreProcessInfo::setColorFormat` or `InferenceEngine::PreProcessInfo::setResizeAlgorithm` methods, OpenVINO Runtime dynamically loads the `openvino_gapi_preproc` plugin to perform preprocessing via G-API.
|
||||
|
||||
## Examples

**CPU + OpenVINO IR in C application**

In this example, the application is written in C, performs inference on CPU, and reads models stored in the OpenVINO IR format. The following libraries are used:
- The `openvino_c` library is the main dependency of the application, which links against it.
- The `openvino` library is used as a private dependency of `openvino_c` and is also used in the deployment.
- `openvino_intel_cpu_plugin` is used for inference.
- `openvino_ir_frontend` is used to read source models.

**MULTI execution on GPU and MYRIAD in `tput` mode**

In this example, the application is written in C++, performs inference [simultaneously on GPU and MYRIAD devices](../multi_device.md) with the `ov::hint::PerformanceMode::THROUGHPUT` property set, and reads models stored in the ONNX format. The following libraries are used (a minimal configuration sketch follows the list):
- The `openvino` library is the main dependency of the application, which links against it.
- `openvino_intel_gpu_plugin` and `openvino_intel_myriad_plugin` are used for inference.
- `openvino_auto_plugin` is used for Multi-Device Execution.
- `openvino_auto_batch_plugin` can also be put into the distribution to improve the saturation of the [Intel® GPU](../supported_plugins/GPU.md) device. If this plugin is absent, [Automatic Batching](../automatic_batching.md) is turned off.
- `openvino_onnx_frontend` is used to read source models.

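A minimal, hedged C++ sketch of how such an application could request this configuration (the model path and device availability are assumptions, not part of the original example):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // openvino_onnx_frontend is loaded to read the ONNX file.
    auto model = core.read_model("model.onnx");

    // MULTI is provided by openvino_auto_plugin; the GPU and MYRIAD plugins do the actual inference.
    auto compiled = core.compile_model(model, "MULTI:GPU,MYRIAD",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    auto request = compiled.create_infer_request();
    return 0;
}
```
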
**Automatic Device Selection between HDDL and CPU**

In this example, the application is written in C++, performs inference in the [Automatic Device Selection](../auto_device_selection.md) mode with the device list limited to HDDL and CPU, and uses a model [created with C++ code](../model_representation.md). The following libraries are used (see the sketch after the list):
- The `openvino` library is the main dependency of the application, which links against it.
- `openvino_auto_plugin` is used to enable Automatic Device Selection.
- `openvino_intel_hddl_plugin` and `openvino_intel_cpu_plugin` are used for inference. `AUTO` selects between the CPU and HDDL devices according to their physical presence on the deployed machine.
- No frontend library is needed because `ov::Model` is created in code.

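A hedged sketch of the corresponding configuration; the `build_model()` helper is hypothetical and stands for the model-building code described in the model representation guide:

```cpp
#include <openvino/openvino.hpp>

// Hypothetical helper that builds an ov::Model directly in code (no frontend library involved).
std::shared_ptr<ov::Model> build_model();

int main() {
    ov::Core core;
    auto model = build_model();

    // AUTO is provided by openvino_auto_plugin; the CPU and HDDL plugins perform the inference.
    auto compiled = core.compile_model(model, "AUTO",
        ov::device::priorities("HDDL,CPU"));
    return 0;
}
```
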
[deployment_full]: ../../img/deployment_full.png
@ -1,24 +1,23 @@

# Layout API Overview {#openvino_docs_OV_UG_Layout_Overview}

The concept of layout helps you (and your application) understand what each particular dimension of an input/output tensor means. For example, if your input has the `{1, 3, 720, 1280}` shape and the `NCHW` layout, it is clear that `N(batch) = 1`, `C(channels) = 3`, `H(height) = 720`, and `W(width) = 1280`. Without the layout information, the `{1, 3, 720, 1280}` tuple does not tell your application what these numbers mean or how to resize the input image to fit the expectations of the model.

With the `NCHW` layout, it is easier to understand what the `{8, 3, 224, 224}` model shape means. Without the layout, it is just a 4-dimensional tensor.

Below is a list of cases where the input/output layout is important:
- Performing model modification:
  - Applying the [preprocessing](./preprocessing_overview.md) steps, such as subtracting means, dividing by scales, resizing an image, and converting `RGB`<->`BGR`.
  - Setting/getting a batch for a model.
- Doing the same operations as used during the model conversion phase. For more information, refer to the [Model Optimizer Embedding Preprocessing Computation](../MO_DG/prepare_model/Additional_Optimizations.md) guide.
- Improving the readability of a model's input and output.

## Syntax of Layout

### Short Syntax
The easiest way is to fully specify each dimension with a single letter of the alphabet.

@sphinxtabset

@ -36,10 +35,10 @@ The easiest way is to fully specify each dimension with one alphabetical letter

@endsphinxtabset

This assigns `N` to the first dimension, `C` to the second, `H` to the third, and `W` to the fourth.

### Advanced Syntax
The advanced syntax allows assigning a word to a dimension. To do this, wrap the layout in square brackets `[]` and specify each name separated by a comma `,`.

@sphinxtabset

@ -58,8 +57,8 @@ Advanced syntax allows assigning a word to a dimension. To do this, wrap layout

@endsphinxtabset

### Partially Defined Layout
If a certain dimension is not important, its name can be set to `?`.

@sphinxtabset

@ -78,8 +77,8 @@ If some dimension is not important, it's name can be set to `?`

@endsphinxtabset

### Dynamic Layout
If the number of dimensions is not important, an ellipsis `...` can be used to specify a variadic number of dimensions.

@sphinxtabset

@ -97,16 +96,16 @@ If number of dimensions is not important, ellipsis `...` can be used to specify

@endsphinxtabset

### Predefined Names

A layout has some predefined dimension names that are widely used in computer vision:
- `N`/`Batch` - batch size
- `C`/`Channels` - channels
- `D`/`Depth` - depth
- `H`/`Height` - height
- `W`/`Width` - width

These names are used in the [PreProcessing API](./preprocessing_overview.md). There is a set of helper functions to get the index of the appropriate dimension from a layout.

@sphinxtabset

@ -126,11 +125,11 @@ These names are used in [PreProcessing API](./preprocessing_overview.md) and the

### Equality

Layout names are case-insensitive, which means that `Layout("NCHW") == Layout("nChW") == Layout("[N,c,H,w]")`.

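As a quick, hedged illustration of this equality (not one of the official documentation snippets):

```cpp
#include <cassert>
#include <openvino/openvino.hpp>

int main() {
    // Different case and syntax variants describe the same layout.
    assert(ov::Layout("NCHW") == ov::Layout("nChW"));
    assert(ov::Layout("NCHW") == ov::Layout("[N,c,H,w]"));
    return 0;
}
```
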
### Dump Layout

A layout can be converted to a string in the advanced syntax format, which can be useful for debugging and serialization purposes.

@sphinxtabset

@ -150,4 +149,4 @@ Layout can be converted to string in advanced syntax format. Can be useful for d

## See also

* API Reference: <code>ov::Layout</code> C++ class

@ -1,46 +1,50 @@
# Model Representation in OpenVINO™ Runtime {#openvino_docs_OV_UG_Model_Representation}

In OpenVINO™ Runtime, a model is represented by the `ov::Model` class.

The `ov::Model` object stores shared pointers to the `ov::op::v0::Parameter`, `ov::op::v0::Result`, and `ov::op::Sink` operations, which are the inputs, outputs, and sinks of the graph.
Sinks of the graph have no consumers and are not included in the results vector. All other operations hold each other via shared pointers: a child operation holds its parent via a hard link. If an operation has no consumers and is neither a `Result` nor a `Sink` operation
(that is, its shared pointer counter drops to zero), it is destructed and is no longer accessible.

Each operation in `ov::Model` has the `std::shared_ptr<ov::Node>` type.

## How OpenVINO Runtime Works with Models

OpenVINO™ Runtime enables you to use different approaches to work with model inputs/outputs:
- The `ov::Model::inputs()` / `ov::Model::outputs()` methods are used to get vectors of all input/output ports.
- For a model that has only one input or output, you can use the `ov::Model::input()` or `ov::Model::output()` methods without any arguments to get the input or output port, respectively.
- The `ov::Model::input()` and `ov::Model::output()` methods can be used with the index of an input or output from the framework model to get a specific port by index.
- You can use the tensor name of an input or output from the original framework model together with the `ov::Model::input()` or `ov::Model::output()` methods to get a specific port. This means that you no longer need an additional mapping of names from the framework to OpenVINO. OpenVINO™ Runtime allows the usage of native framework tensor names, for example:

@sphinxtabset

@sphinxtab{C++}

@snippet docs/snippets/ov_model_snippets.cpp all_inputs_ouputs

@endsphinxtab

@sphinxtab{Python}

@snippet docs/snippets/ov_model_snippets.py all_inputs_ouputs

@endsphinxtab

@endsphinxtabset

For details on how to build a model in OpenVINO™ Runtime, see the [Build a Model in OpenVINO™ Runtime](@ref ov_ug_build_model) section.

OpenVINO™ Runtime model representation uses special classes to work with model data types and shapes. The `ov::element::Type` class is used for data types. See the section below for the representation of shapes.

## Representation of Shapes

OpenVINO™ Runtime provides two types for shape representation:

* `ov::Shape` - Represents static (fully defined) shapes.

* `ov::PartialShape` - Represents dynamic shapes. This means that the rank or some of the dimensions are dynamic (a dimension defines an interval or is undefined).

`ov::PartialShape` can be converted to `ov::Shape` by using the `get_shape()` method if all dimensions are static; otherwise, the conversion throws an exception. For example:

@sphinxtabset

@ -58,21 +62,22 @@ OpenVINO™ Runtime provides two types for shape representation:

@endsphinxtabset

However, in most cases, before getting a static shape with the `get_shape()` method, you need to check whether the shape is static.

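A small, hedged sketch of that check (not one of the official snippets), assuming `model` is an already created `ov::Model`:

```cpp
#include <iostream>
#include <openvino/openvino.hpp>

void print_output_shape(const std::shared_ptr<ov::Model>& model) {
    const ov::PartialShape pshape = model->output().get_partial_shape();
    if (pshape.is_static()) {
        // Safe: all dimensions are defined, so get_shape() cannot throw.
        ov::Shape shape = pshape.get_shape();
        std::cout << "static shape: " << shape << std::endl;
    } else {
        // Keep working with ov::PartialShape for dynamic models.
        std::cout << "dynamic shape: " << pshape << std::endl;
    }
}
```
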
## Representation of Operations

The `ov::Op` class represents any abstract operation in the model representation. Use this class to create [custom operations](../Extensibility_UG/add_openvino_ops.md).

## Representation of Operation Sets

An operation set (opset) is a collection of operations that can be used to construct a model. The `ov::OpSet` class provides the functionality to work with operation sets.
For each operation set, OpenVINO™ Runtime provides a separate namespace, for example `opset8`.
Each OpenVINO™ release introduces new operations and adds them to new operation sets. New operation sets make it possible to introduce new versions of operations that change the behavior of previous ones. Using operation sets allows you to avoid changing your application when new operations are introduced.
For a complete list of operation sets supported in the OpenVINO™ toolkit, see [Available Operation Sets](../ops/opset.md).
To add support for custom operations, see [OpenVINO Extensibility Mechanism](../Extensibility_UG/Intro.md).

## Building a Model in OpenVINO™ Runtime {#ov_ug_build_model}

You can create a model from source. This section illustrates how to construct a model composed of operations from an available operation set.

@ -132,37 +137,45 @@ The following code creates a model with several outputs:

@endsphinxtabset
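
For orientation, a minimal, hedged sketch of building a small model in code (it is not the snippet used above; the names and topology are illustrative only):

```cpp
#include <memory>
#include <openvino/openvino.hpp>
#include <openvino/opsets/opset8.hpp>

std::shared_ptr<ov::Model> create_simple_model() {
    // Parameter -> ReLU -> Result, with all operations taken from opset8.
    auto param = std::make_shared<ov::opset8::Parameter>(ov::element::f32, ov::Shape{1, 3, 224, 224});
    auto relu = std::make_shared<ov::opset8::Relu>(param);
    auto result = std::make_shared<ov::opset8::Result>(relu);
    return std::make_shared<ov::Model>(ov::ResultVector{result}, ov::ParameterVector{param}, "simple_model");
}
```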

## Model Debugging Capabilities

OpenVINO™ provides several debug capabilities:
- To receive additional messages about applied model modifications, rebuild the OpenVINO™ Runtime library with the `-DENABLE_OPENVINO_DEBUG=ON` option.
- A model can be visualized as an image in the xDot format:

@sphinxtabset

@sphinxtab{C++}

@snippet docs/snippets/ov_model_snippets.cpp ov:visualize

@endsphinxtab

@sphinxtab{Python}

@snippet docs/snippets/ov_model_snippets.py ov:visualize

@endsphinxtab

@endsphinxtabset

`ov::pass::VisualizeTree` can be parametrized via environment variables:

    OV_VISUALIZE_TREE_OUTPUT_SHAPES=1    - visualize shapes
    OV_VISUALIZE_TREE_OUTPUT_TYPES=1     - visualize types
    OV_VISUALIZE_TREE_MIN_MAX_DENORMAL=1 - pretty denormal values
    OV_VISUALIZE_TREE_RUNTIME_INFO=1     - print runtime information
    OV_VISUALIZE_TREE_IO=1               - print I/O ports
    OV_VISUALIZE_TREE_MEMBERS_NAME=1     - print member names

- Also, a model can be serialized to IR:

@sphinxtabset

@ -1,44 +1,41 @@

# Stateful models {#openvino_docs_OV_UG_network_state_intro}

This article describes how to work with stateful networks in the OpenVINO™ toolkit. More specifically, it illustrates how stateful networks are represented in IR and nGraph
and how operations with a state can be done. The article additionally provides some examples of stateful networks and the code to infer them.

## What is a Stateful Network?

Several use cases require processing of data sequences. When the length of a sequence is known and small enough,
it can be processed with RNN-like networks that contain a cycle inside. However, in some cases, like online speech recognition or time series
forecasting, the length of the data sequence is unknown. Then, data can be divided into small portions and processed step-by-step. The dependency
between data portions should be addressed. For that, networks save some data between inferences - a state. When one dependent sequence is over,
the state should be reset to its initial value and a new sequence can be started.

Several frameworks have special APIs for states in networks. For example, Keras has the `stateful` option for RNNs that turns on saving a state
between inferences. Kaldi contains the special `Offset` specifier to define a time offset in a network.

OpenVINO also contains a special API to simplify work with networks with states. A state is automatically saved between inferences,
and there is a way to reset a state when needed. A state can also be read or set to a new value between inferences.

## OpenVINO State Representation

OpenVINO contains the `Variable`, a special abstraction to represent a state in a network. There are two operations to work with a state:
* `Assign` - to save a value in a state.
* `ReadValue` - to read a value saved on the previous iteration.

For more details on these operations, refer to the [ReadValue specification](../ops/infrastructure/ReadValue_3.md) and
[Assign specification](../ops/infrastructure/Assign_3.md) articles.

## Examples of Networks with States

To get a model with states ready for inference, convert a model from another framework to IR with Model Optimizer or create an nGraph function (for more information,
refer to the [Build OpenVINO Model section](../OV_Runtime_UG/model_representation.md)). Below is the graph in both forms:

![state_network_example]

### Example of IR with State

The `bin` file for this graph should contain `float 0` in binary form. The content of the `xml` file is as follows.

```xml
<?xml version="1.0" ?>
@ -175,80 +172,81 @@ The `bin` file for this graph should contain float 0 in binary form. Content of
auto f = make_shared<Function>(ResultVector({res}), ParameterVector({arg}), SinkVector({assign}));
```

In this example, `SinkVector` is used to create the `ngraph::Function`. For a network with states, in addition to inputs and outputs, the `Assign` nodes should also point to the `Function` to avoid their deletion during graph transformations. This can be done either with the constructor, as shown in the example, or with the special `add_sinks(const SinkVector& sinks)` method. A sink can also be removed from the `ngraph::Function` with the `delete_sink()` method, after the node has been deleted from the graph.

## OpenVINO State API

Inference Engine has the `InferRequest::QueryState` method to get the list of states from a network and the `IVariableState` interface to operate with states. Below is a brief description of the methods and an example of how to use this interface.

* `std::string GetName() const` -
  returns the name (variable_id) of the corresponding Variable.
* `void Reset()` -
  resets a state to its default value.
* `void SetState(Blob::Ptr newState)` -
  sets a new value for a state.
* `Blob::CPtr GetState() const` -
  returns the current value of a state.

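As a brief, hedged illustration of these methods (the helper function and variable names are hypothetical):

```cpp
#include <iostream>
#include <inference_engine.hpp>

// Resets all states of the given infer request to their default values.
void reset_all_states(InferenceEngine::InferRequest& infer_request) {
    for (auto&& state : infer_request.QueryState()) {
        std::cout << "state: " << state.GetName() << std::endl;
        state.Reset();
    }
}
```
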
## Example of Stateful Network Inference

Based on the IR from the previous section, the example below demonstrates inference of two independent sequences of data. A state should be reset between these sequences.

One infer request and one thread will be used in this example. Using several threads is possible if there are several independent sequences. Then, each sequence can be processed in its own infer request. Inference of one sequence in several infer requests is not recommended. In one infer request, a state will be saved automatically between inferences, but if the first step is done in one infer request and the second in another, a state should be set in the new infer request manually (using the `IVariableState::SetState` method).

@snippet openvino/docs/snippets/InferenceEngine_network_with_state_infer.cpp part1

More elaborate examples demonstrating how to work with networks with states can be found in the speech sample and demo.
For their descriptions, refer to the [Samples Overview](./Samples_Overview.md).

[state_network_example]: ./img/state_network_example.png
## LowLatency Transformations

If the original framework does not have a special API for working with states, the OpenVINO representation will not contain `Assign`/`ReadValue` layers after the model is imported. For example, if the original ONNX model contains RNN operations, the IR will contain `TensorIterator` operations and the values will be obtained only after execution of the whole `TensorIterator` primitive. Intermediate values from each iteration will not be available. Working with these intermediate values of each iteration is enabled by the special LowLatency and LowLatency2 transformations, which also make it possible to receive these values with low latency after each infer request.

### How to Get TensorIterator/Loop Operations from Different Frameworks via Model Optimizer

**ONNX and frameworks supported via the ONNX format:** the `LSTM`, `RNN`, and `GRU` original layers are converted to the `TensorIterator` operation. The `TensorIterator` body contains the `LSTM`/`RNN`/`GRU` Cell. The `Peepholes` and `InputForget` modifications are not supported, while the `sequence_lengths` optional input is.
The `ONNX Loop` layer is converted to the OpenVINO `Loop` operation.

**Apache MXNet:** the `LSTM`, `RNN`, and `GRU` original layers are converted to the `TensorIterator` operation. The `TensorIterator` body contains the `LSTM`/`RNN`/`GRU` Cell operations.

**TensorFlow:** the `BlockLSTM` layer is converted to the `TensorIterator` operation. The `TensorIterator` body contains the `LSTM` Cell operation; the `Peepholes` and `InputForget` modifications are not supported.
The `While` layer is converted to `TensorIterator`. The `TensorIterator` body can contain any supported operations. However, dynamic cases, where the count of iterations cannot be calculated at shape inference (Model Optimizer conversion) time, are not supported.

**TensorFlow2:** the `While` layer is converted to the `Loop` operation. The `Loop` body can contain any supported operations.

**Kaldi:** Kaldi models already contain `Assign`/`ReadValue` (Memory) operations after model conversion. The `TensorIterator`/`Loop` operations are not generated.

## The LowLatency2 Transformation

The LowLatency2 transformation changes the structure of a network containing [TensorIterator](../ops/infrastructure/TensorIterator_1.md) and [Loop](../ops/infrastructure/Loop_5.md) operations by adding the ability to work with a state, inserting the `Assign`/`ReadValue` layers, as shown in the picture below.

### The Differences between LowLatency and LowLatency2

* Unrolling of `TensorIterator`/`Loop` operations became a part of LowLatency2, not a separate transformation. After invoking the transformation, the network can be serialized and inferred without re-invoking the transformation.
* `TensorIterator` and `Loop` operations with multiple iterations inside are supported. The `TensorIterator`/`Loop` operations are not unrolled in this case.
* The "Parameters connected directly to ReadValues" limitation is resolved. To apply the previous version of the transformation in this case, additional manual manipulations were required. Now, the case is processed automatically.

#### Example of Applying the LowLatency2 Transformation

<a name="example-of-applying-lowlatency2-transformation"></a>



After applying the transformation, the `ReadValue` operations can receive other operations as an input, as shown in the picture above. These inputs should set the initial value for initialization of the `ReadValue` operations. However, such initialization is not supported in the current State API implementation. Input values are ignored, and the initial values for the `ReadValue` operations are set to 0 unless otherwise specified by the user via the [State API](#openvino-state-api).

### Steps to Apply the LowLatency2 Transformation

1. Get CNNNetwork. Either way is acceptable:

   * [from IR or ONNX model](./integrate_with_your_application.md)
   * [from ov::Model](../OV_Runtime_UG/model_representation.md)

2. Change the number of iterations inside `TensorIterator`/`Loop` nodes in the network, using the [Reshape](ShapeInference.md) feature.

   For example, when the `sequence_lengths` dimension of the network input is greater than 1, the `TensorIterator` layer has `number_iterations` > 1. You can reshape the network inputs to set `sequence_dimension` to exactly 1.

```cpp

@ -259,9 +257,9 @@ cnnNetwork.reshape({"X" : {1, 1, 16});
// Network after reshape: Parameter (name: X, shape: [1 (sequence_lengths), 1, 16]) -> TensorIterator (num_iteration = 1, axis = 0) -> ...

```

   **Unrolling**: If the LowLatency2 transformation is applied to a network containing `TensorIterator`/`Loop` nodes with exactly one iteration inside, these nodes are unrolled. Otherwise, the nodes remain as they are. For more details, see [the picture](#example-of-applying-lowlatency2-transformation) above.

3. Apply the LowLatency2 transformation.
```cpp
#include "ie_transformations.hpp"

@ -271,7 +269,7 @@ InferenceEngine::lowLatency2(cnnNetwork); // 2nd argument 'use_const_initializer
```

**The `use_const_initializer` argument**

By default, the LowLatency2 transformation inserts a constant subgraph of the same shape as the previous input node, with zero values as the initializing values for the `ReadValue` nodes (see the picture below). Insertion of this subgraph can be disabled by passing `false` as the `use_const_initializer` argument.

```cpp
InferenceEngine::lowLatency2(cnnNetwork, false);
@ -279,7 +277,8 @@ InferenceEngine::lowLatency2(cnnNetwork, false);



**State naming rule:** A name of a state is a concatenation of names: the original `TensorIterator` operation, the parameter of the body, and the additional suffix `variable_` + `id` (0-based indexing, with new indexing for each `TensorIterator`). You can use these rules to predict the name of the inserted state after the transformation is applied. For example:

```cpp
// Precondition in ngraph::function.
// Created TensorIterator and Parameter in body of TensorIterator with names
@ -305,17 +304,17 @@ InferenceEngine::lowLatency2(cnnNetwork, false);
}
```

4. Use the state API. See the [OpenVINO state API](#openvino-state-api) and [Example of stateful network inference](#example-of-stateful-network-inference) sections.

### Known Limitations
1. The [Reshape](ShapeInference.md) feature cannot be executed to change the number of iterations of `TensorIterator`/`Loop` layers, so the transformation cannot be applied correctly.

   The only way to change the number of iterations of a `TensorIterator`/`Loop` layer is to use the Reshape feature. However, networks can be non-reshapable. The most common reason is that the value of shapes is hardcoded in a constant somewhere in the network.



   **Current solution:** Trim non-reshapable layers via the [Model Optimizer CLI](../MO_DG/prepare_model/convert_model/Converting_Model.md) parameters `--input` and `--output`. For example, the parameter and the problematic constant in the picture above can be trimmed with the `--input Reshape_layer_name` command-line option.
   The problematic constant can also be replaced using nGraph, as shown in the example below.

```cpp
// nGraph example. How to replace a Constant with hardcoded values of shapes in the network with another one with the new values.
@ -335,25 +334,27 @@ InferenceEngine::lowLatency2(cnnNetwork, false);
}
}
```

## [DEPRECATED] The LowLatency Transformation

The LowLatency transformation changes the structure of a network containing [TensorIterator](../ops/infrastructure/TensorIterator_1.md) and [Loop](../ops/infrastructure/Loop_5.md) operations by adding the ability to work with a state, inserting the `Assign`/`ReadValue` layers, as shown in the picture below.



After applying the transformation, `ReadValue` operations can receive other operations as an input, as shown in the picture above. These inputs should set the initial value for initialization of the `ReadValue` operations. However, such initialization is not supported in the current State API implementation. Input values are ignored, and the initial values for the `ReadValue` operations are set to 0 unless otherwise specified by the user via the [State API](#openvino-state-api).

### Steps to Apply the LowLatency Transformation

1. Get CNNNetwork. Either way is acceptable:

   * [from IR or ONNX model](./integrate_with_your_application.md)
   * [from ov::Model](../OV_Runtime_UG/model_representation.md)

2. [Reshape](ShapeInference.md) the CNNNetwork if necessary. An example of such a **necessary case** is when the `sequence_lengths` dimension of the input > 1, which means that the `TensorIterator` layer will have `number_iterations` > 1. The inputs of the network should be reshaped to set `sequence_dimension` to exactly 1.

   Usually, the following exception, which occurs after applying a transform when trying to infer the network in a plugin, indicates the need to apply the reshape feature:
   `C++ exception with description "Function is incorrect. The Assign and ReadValue operations must be used in pairs in the network."`
   This means that there are several pairs of `Assign`/`ReadValue` operations with the same `variable_id` in the network, and operations were inserted into each iteration of the `TensorIterator`.

```cpp

@ -365,7 +366,7 @@ cnnNetwork.reshape({"X" : {1, 1, 16});

```

3. Apply the LowLatency transformation.
```cpp
#include "ie_transformations.hpp"

@ -373,7 +374,8 @@ cnnNetwork.reshape({"X" : {1, 1, 16});

InferenceEngine::LowLatency(cnnNetwork);
```
**State naming rule:** A name of a state is a concatenation of names: the original `TensorIterator` operation, the parameter of the body, and the additional suffix `variable_` + `id` (0-based indexing, with new indexing for each `TensorIterator`). You can use these rules to predict the name of the inserted state after the transformation is applied. For example:

```cpp
// Precondition in ngraph::function.
// Created TensorIterator and Parameter in body of TensorIterator with names
@ -398,19 +400,19 @@ InferenceEngine::LowLatency(cnnNetwork);
}
}
```

4. Use the state API. See the [OpenVINO state API](#openvino-state-api) and [Example of stateful network inference](#example-of-stateful-network-inference) sections.

### Known Limitations for LowLatency [DEPRECATED]

1. Parameters connected directly to `ReadValues` (states) after the transformation is applied are not allowed.

   Unnecessary parameters may remain in the graph after applying the transformation. The automatic handling of this case inside the transformation is currently not possible. Such parameters should be removed manually from `ngraph::Function` or replaced with a constant.



   **Current solutions:**
   * Replace the parameter with a constant (freeze) with the `[0, 0, 0 … 0]` value via the [Model Optimizer CLI](../MO_DG/prepare_model/convert_model/Converting_Model.md) parameters `--input` or `--freeze_placeholder_with_value`.
   * Use the nGraph API to replace the parameter with a constant, as shown in the example below:

```cpp
// nGraph example. How to replace Parameter with Constant.
@ -428,30 +430,31 @@ InferenceEngine::LowLatency(cnnNetwork);
}
```

2. The reshape precondition cannot be executed, so the transformation cannot be applied correctly.

   Networks can be non-reshapable. The most common reason is that the value of shapes is hardcoded in a constant somewhere in the network.



   **Current solutions:**
   * Trim non-reshapable layers via the [Model Optimizer CLI](../MO_DG/prepare_model/convert_model/Converting_Model.md) parameters `--input` and `--output`. For example, the parameter and the problematic constant in the picture above can be trimmed with the `--input Reshape_layer_name` command-line option.
   * Use the nGraph API to replace the problematic constant, as shown in the example below:

```cpp
// nGraph example. How to replace a Constant with hardcoded values of shapes in the network with another one with the new values.
// Assume we know which Constant (const_with_hardcoded_shape) prevents the reshape from being applied.
// Then we can find this Constant by name on the network and replace it with a new one with the correct shape.
auto func = cnnNetwork.getFunction();
// Creating the new Constant with a correct shape.
// For the example shown in the picture above, the new values of the Constant should be 1, 1, 10 instead of 1, 49, 10
auto new_const = std::make_shared<ngraph::opset6::Constant>( /*type, shape, value_with_correct_shape*/ );
for (const auto& node : func->get_ops()) {
    // Trying to find the problematic Constant by name.
    if (node->get_friendly_name() == "name_of_non_reshapable_const") {
        auto const_with_hardcoded_shape = std::dynamic_pointer_cast<ngraph::opset6::Constant>(node);
        // Replacing the problematic Constant with a new one. Do this for all the problematic Constants in the network, then
        // you can apply the reshape feature.
        ngraph::replace_node(const_with_hardcoded_shape, new_const);
    }
}
```

@ -1,4 +1,4 @@

# Performing Inference with OpenVINO Runtime {#openvino_docs_OV_UG_OV_Runtime_User_Guide}

@sphinxdirective

@ -23,10 +23,9 @@

@endsphinxdirective

## Introduction
OpenVINO Runtime is a set of C++ libraries with C and Python bindings providing a common API to deliver inference solutions on the platform of your choice. Use the OpenVINO Runtime API to read an Intermediate Representation (IR), ONNX, or PaddlePaddle model and execute it on preferred devices.

OpenVINO Runtime uses a plugin architecture. Its plugins are software components that contain a complete implementation for inference on a particular Intel® hardware device: CPU, GPU, VPU, etc. Each plugin implements the unified API and provides additional hardware-specific APIs for configuring devices or API interoperability between OpenVINO Runtime and the underlying plugin backend.

The scheme below illustrates the typical workflow for deploying a trained deep learning model:
@ -10,47 +10,47 @@
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
As it was demonstrated in the [Changing Input Shapes](ShapeInference.md) article, there are models that support changing of input shapes before model compilation in `Core::compile_model`.
|
||||
Reshaping models provides an ability to customize the model input shape for exactly that size that is required in the end application.
|
||||
As it was demonstrated in the [Changing Input Shapes](ShapeInference.md) article, there are models that support changing input shapes before model compilation in `Core::compile_model`.
|
||||
Reshaping models provides an ability to customize the model input shape for the exact size required in the end application.
|
||||
This article explains how a model's reshaping ability can be further leveraged in more dynamic scenarios.
|
||||
|
||||
## When to Apply Dynamic Shapes
|
||||
## Applying Dynamic Shapes
|
||||
|
||||
Conventional "static" model reshaping works well when it can be done once per many model inference calls with the same shape.
|
||||
However, this approach doesn't perform efficiently if the input tensor shape is changed on every inference call: calling `reshape()` and `compile_model()` each time when a new size comes is extremely time-consuming.
|
||||
A popular example would be an inference of natural language processing models (like BERT) with arbitrarily-sized input sequences that come from the user.
|
||||
In this case, the sequence length cannot be predicted and may change every time you need to call inference.
|
||||
Below, such dimensions that can be frequently changed are called *dynamic dimensions*.
|
||||
When real shape of input is not known at `compile_model` time, that's the case when dynamic shapes should be considered.
|
||||
However, this approach does not perform efficiently if the input tensor shape is changed on every inference call. Calling the `reshape()` and `compile_model()` methods each time a new size comes is extremely time-consuming.
|
||||
A popular example would be inference of natural language processing models (like BERT) with arbitrarily-sized user input sequences.
|
||||
In this case, the sequence length cannot be predicted and may change every time inference is called.
|
||||
Dimensions that can be frequently changed are called *dynamic dimensions*.
|
||||
Dynamic shapes should be considered when the real input shape is not known at the time of the `compile_model()` method call.
|
||||
|
||||
Here are several examples of dimensions that can be naturally dynamic:
|
||||
Below are several examples of dimensions that can be naturally dynamic:
|
||||
- Sequence length dimension for various sequence processing models, like BERT
|
||||
- Spatial dimensions in segmentation and style transfer models
|
||||
- Batch dimension
|
||||
- Arbitrary number of detections in object detection models output
|
||||
|
||||
There are various tricks to address input dynamic dimensions through combining multiple pre-reshaped models and input data padding.
|
||||
The tricks are sensitive to model internals, do not always give optimal performance and cumbersome.
|
||||
Short overview of the methods you can find [here](ov_without_dynamic_shapes.md).
|
||||
Apply those methods only if native dynamic shape API described in the following sections doesn't work for you or doesn't give desired performance.
|
||||
There are various methods to address input dynamic dimensions through combining multiple pre-reshaped models and input data padding.
|
||||
The methods are sensitive to model internals, do not always give optimal performance and are cumbersome.
|
||||
For a short overview of the methods, refer to the [When Dynamic Shapes API is Not Applicable](ov_without_dynamic_shapes.md) page.
|
||||
Apply those methods only if native dynamic shape API described in the following sections does not work or does not perform as expected.
|
||||
|
||||
The decision about using dynamic shapes should be based on proper benchmarking of real application with real data.
|
||||
That's because unlike statically shaped models, inference of dynamically shaped ones takes different inference time depending on input data shape or input tensor content.
|
||||
Also using the dynamic shapes can bring more overheads in memory and running time per each inference call depending on hardware plugin and model used.
|
||||
The decision about using dynamic shapes should be based on proper benchmarking of a real application with real data.
|
||||
Unlike statically shaped models, dynamically shaped ones require different inference time, depending on input data shape or input tensor content.
|
||||
Furthermore, using dynamic shapes can add memory and runtime overhead to each inference call, depending on the hardware plugin and model used.
|
||||
|
||||
## Dynamic Shapes without Tricks
|
||||
## Handling Dynamic Shapes Natively
|
||||
|
||||
This section describes how to handle dynamically shaped models natively with OpenVINO Runtime API version 2022.1 and higher.
|
||||
There are three main parts in the flow that differ from static shapes:
|
||||
- configure the model
|
||||
- prepare data for inference
|
||||
- read resulting data after inference
|
||||
- Configure the model.
|
||||
- Prepare data for inference.
|
||||
- Read resulting data after inference.
|
||||
|
||||
### Configure the Model
|
||||
### Configuring the Model
|
||||
|
||||
To avoid the tricks mentioned in the previous section there is a way to directly specify one or multiple dimensions in the model inputs to be dynamic.
|
||||
To avoid the methods mentioned in the previous section, there is a way to specify one or multiple dimensions to be dynamic, directly in the model inputs.
|
||||
This is achieved with the same `reshape` method that is used for altering the static shapes of inputs.
|
||||
Dynamic dimensions are specified as `-1` or `ov::Dimension()` instead of a positive number used for static dimensions:
|
||||
Dynamic dimensions are specified as `-1` or the `ov::Dimension()` instead of a positive number used for static dimensions:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
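For a rough C++ illustration of this step (complementing the snippets in the tabs above), a minimal sketch could look as follows; the model path, input rank, and the choice of which dimensions become dynamic are placeholders:

```cpp
ov::Core core;
auto model = core.read_model("model.xml");  // hypothetical model path

// Mark the first two dimensions as dynamic and keep the last one static
// (a 3-D input is assumed here purely for illustration).
model->reshape(ov::PartialShape{ov::Dimension(), ov::Dimension(), 768});

// Compile once; tensors of different compatible shapes can be fed later.
auto compiled_model = core.compile_model(model, "CPU");
```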
@ -72,23 +72,23 @@ However, there are no limitations on the number of inputs and outputs to apply d
|
||||
|
||||
### Undefined Dimensions "Out Of the Box"
|
||||
|
||||
Dynamic dimensions may appear in the input model without calling reshape.
|
||||
Dynamic dimensions may appear in the input model without calling the `reshape` method.
|
||||
Many DL frameworks support undefined dimensions.
|
||||
If such a model is converted with Model Optimizer or read directly by Core::read_model, undefined dimensions are preserved.
|
||||
Such dimensions automatically treated as dynamic ones.
|
||||
So you don't need to call reshape if undefined dimensions are already configured in the original model or in the IR file.
|
||||
If such a model is converted with Model Optimizer or read directly by the `Core::read_model`, undefined dimensions are preserved.
|
||||
Such dimensions are automatically treated as dynamic ones.
|
||||
Therefore, there is no need to call the `reshape` method, if undefined dimensions are already configured in the original or the IR model.
|
||||
|
||||
If the input model has undefined dimensions that you are not going to change during the inference, it is recommended to set them to static values, using the same `reshape` method of the model.
|
||||
From the API perspective any combination of dynamic and static dimensions can be configured.
|
||||
If the input model has undefined dimensions that will not change during inference, it is recommended to set them to static values, using the same `reshape` method of the model.
|
||||
From the API perspective, any combination of dynamic and static dimensions can be configured.
|
||||
|
||||
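As a minimal sketch of turning such preserved undefined dimensions into static ones (the model path and the 4-D input shape are assumptions):

```cpp
ov::Core core;
auto model = core.read_model("model_with_undefined_dims.xml");  // hypothetical path

// The undefined dimensions were preserved by Model Optimizer / read_model.
// If they will not change during inference, fix them to static values.
model->reshape(ov::PartialShape{1, 3, 224, 224});
```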
Model Optimizer provides identical capability to reshape the model during the conversion, including specifying dynamic dimensions.
|
||||
Use this capability to save time on calling `reshape` method in the end application.
|
||||
To get information about setting input shapes using Model Optimizer, refer to [Setting Input Shapes](../MO_DG/prepare_model/convert_model/Converting_Model.md)
|
||||
To get information about setting input shapes using Model Optimizer, refer to [Setting Input Shapes](../MO_DG/prepare_model/convert_model/Converting_Model.md).
|
||||
|
||||
### Dimension Bounds
|
||||
|
||||
Besides marking a dimension just dynamic, you can also specify lower and/or upper bounds that define a range of allowed values for the dimension.
|
||||
Bounds are coded as arguments for `ov::Dimension`:
|
||||
Apart from a dynamic dimension, the lower and/or upper bounds can also be specified. They define a range of allowed values for the dimension.
|
||||
The bounds are coded as arguments for the `ov::Dimension`:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
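For illustration, a bounded dynamic dimension might be configured like this (the model path, the fixed batch size, and the `[1, 512]` range are assumptions):

```cpp
ov::Core core;
auto model = core.read_model("model.xml");  // hypothetical path

// Batch is fixed to 1; the second dimension is dynamic but bounded to [1, 512].
model->reshape(ov::PartialShape{1, ov::Dimension(1, 512)});
```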
@ -104,23 +104,23 @@ Bounds are coded as arguments for `ov::Dimension`:
|
||||
@endsphinxtab
|
||||
@endsphinxtabset
|
||||
|
||||
Information about bounds gives opportunity for the inference plugin to apply additional optimizations.
|
||||
Using dynamic shapes assumes the plugins apply more loose optimization technique during model compilation
|
||||
Information about bounds gives an opportunity for the inference plugin to apply additional optimizations.
|
||||
Using dynamic shapes assumes the plugins apply a more flexible optimization approach during model compilation.
|
||||
It may require more time/memory for model compilation and inference.
|
||||
So providing any additional information like bounds can be beneficial.
|
||||
For the same reason it is not recommended to leave dimensions as undefined without the real need.
|
||||
Therefore, providing any additional information, like bounds, can be beneficial.
|
||||
For the same reason, it is not recommended to leave dimensions as undefined, without the real need.
|
||||
|
||||
When specifying bounds, the lower bound is not so important as upper bound, because knowing of upper bound allows inference devices to more precisely allocate memory for intermediate tensors for inference and use lesser number of tuned kernels for different sizes.
|
||||
Precisely speaking benefits of specifying lower or upper bound is device dependent.
|
||||
Depending on the plugin specifying upper bounds can be required. For information about dynamic shapes support on different devices, see the [Features Support Matrix](@ref features_support_matrix).
|
||||
When specifying bounds, the lower bound is not as important as the upper one. The upper bound allows inference devices to allocate memory for intermediate tensors more precisely. It also allows using fewer tuned kernels for different sizes.
|
||||
More precisely, the benefits of specifying the lower or upper bound are device-dependent.
|
||||
Depending on the plugin, specifying the upper bounds can be required. For information about dynamic shapes support on different devices, refer to the [Features Support Matrix](@ref features_support_matrix).
|
||||
|
||||
If users known lower and upper bounds for dimension it is recommended to specify them even when plugin can execute model without the bounds.
|
||||
If the lower and upper bounds for a dimension are known, it is recommended to specify them, even if a plugin can execute a model without the bounds.
|
||||
|
||||
### Setting Input Tensors
|
||||
|
||||
Preparing model with the reshape method was the first step.
|
||||
Preparing a model with the `reshape` method is the first step.
|
||||
The second step is passing a tensor with an appropriate shape to the infer request.
|
||||
This is similar to [regular steps](integrate_with_your_application.md), but now we can pass tensors with different shapes for the same executable model and even for the same inference request:
|
||||
This is similar to the [regular steps](integrate_with_your_application.md). However, tensors can now be passed with different shapes for the same executable model and even for the same inference request:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
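As a rough sketch of passing differently shaped tensors to the same request (the element type and the sequence lengths 32 and 128 are placeholders):

```cpp
auto infer_request = compiled_model.create_infer_request();

// First inference: sequence length 32.
ov::Tensor t1(ov::element::f32, ov::Shape{1, 32});
// ... fill t1.data<float>() ...
infer_request.set_input_tensor(t1);
infer_request.infer();

// Second inference with the same request, but a different shape.
ov::Tensor t2(ov::element::f32, ov::Shape{1, 128});
// ... fill t2.data<float>() ...
infer_request.set_input_tensor(t2);
infer_request.infer();
```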
@ -136,14 +136,14 @@ This is similar to [regular steps](integrate_with_your_application.md), but now
|
||||
@endsphinxtab
|
||||
@endsphinxtabset
|
||||
|
||||
In the example above `set_input_tensor` is used to specify input tensors.
|
||||
The real dimensions of the tensor is always static, because it is a concrete tensor and it doesn't have any dimension variations in contrast to model inputs.
|
||||
In the example above, the `set_input_tensor` is used to specify input tensors.
|
||||
The real dimension of the tensor is always static, because it is a particular tensor and it does not have any dimension variations in contrast to model inputs.
|
||||
|
||||
Similar to static shapes, `get_input_tensor` can be used instead of `set_input_tensor`.
|
||||
In contrast to static input shapes, when using `get_input_tensor` for dynamic inputs, `set_shape` method for the returned tensor should be called to define the shape and allocate memory.
|
||||
Without doing that, the tensor returned by `get_input_tensor` is an empty tensor, it's shape is not initialized and memory is not allocated, because infer request doesn't have information about real shape you are going to feed.
|
||||
Setting shape for input tensor is required when the corresponding input has at least one dynamic dimension regardless of bounds information.
|
||||
The following example makes the same sequence of two infer request as the previous example but using `get_input_tensor` instead of `set_input_tensor`:
|
||||
In contrast to static input shapes, when using `get_input_tensor` for dynamic inputs, the `set_shape` method for the returned tensor should be called to define the shape and allocate memory.
|
||||
Without doing so, the tensor returned by `get_input_tensor` is an empty tensor. The shape of the tensor is not initialized and memory is not allocated, because infer request does not have information about the real shape that will be provided.
|
||||
Setting shape for an input tensor is required when the corresponding input has at least one dynamic dimension, regardless of the bounds.
|
||||
Contrary to the previous example, the following one shows the same sequence of two infer requests, using `get_input_tensor` instead of `set_input_tensor`:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
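A minimal sketch of this pattern, with the same assumed shapes as before, might look like the following:

```cpp
auto infer_request = compiled_model.create_infer_request();

// The returned tensor is empty for a dynamic input, so define its shape first.
ov::Tensor input = infer_request.get_input_tensor();
input.set_shape({1, 32});          // memory is allocated here
// ... fill input.data<float>() ...
infer_request.infer();

// Reuse the same request with another shape.
input = infer_request.get_input_tensor();
input.set_shape({1, 128});
// ... fill input.data<float>() ...
infer_request.infer();
```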
@ -162,12 +162,12 @@ The following example makes the same sequence of two infer request as the previo
|
||||
|
||||
### Dynamic Shapes in Outputs
|
||||
|
||||
Examples above handle correctly case when dynamic dimensions in output may be implied by propagating of dynamic dimension from the inputs.
|
||||
For example, batch dimension in input shape is usually propagated through the whole model and appears in the output shape.
|
||||
The same is true for other dimensions, like sequence length for NLP models or spatial dimensions for segmentation models, that are propagated through the entire network.
|
||||
Examples above are valid approaches when dynamic dimensions in output may be implied by propagation of dynamic dimension from the inputs.
|
||||
For example, batch dimension in an input shape is usually propagated through the whole model and appears in the output shape.
|
||||
It also applies to other dimensions, like sequence length for NLP models or spatial dimensions for segmentation models, that are propagated through the entire network.
|
||||
|
||||
Whether or not output has dynamic dimensions can be examined by querying output partial shape after model read or reshape.
|
||||
The same is applicable for inputs. For example:
|
||||
Whether the output has dynamic dimensions or not can be verified by querying the output partial shape after the model is read or reshaped.
|
||||
The same applies to inputs. For example:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -184,9 +184,9 @@ The same is applicable for inputs. For example:
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
Appearing `?` or ranges like `1..10` means there are dynamic dimensions in corresponding inputs or outputs.
|
||||
When there are dynamic dimensions in corresponding inputs or outputs, the `?` or ranges like `1..10` appear.
|
||||
|
||||
Or more programmatically:
|
||||
It can also be verified in a more programmatic way:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
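For example, a simple loop over all inputs and outputs could report which of them contain dynamic dimensions (this is only a sketch; it assumes the model path and that `<iostream>` is available):

```cpp
auto model = core.read_model("model.xml");  // hypothetical path

for (const auto& input : model->inputs()) {
    const ov::PartialShape& shape = input.get_partial_shape();
    if (shape.is_dynamic()) {
        std::cout << "Dynamic input: " << shape << std::endl;
    }
}
for (const auto& output : model->outputs()) {
    if (output.get_partial_shape().is_dynamic()) {
        std::cout << "Dynamic output: " << output.get_partial_shape() << std::endl;
    }
}
```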
@ -204,7 +204,7 @@ Or more programmatically:
|
||||
@endsphinxtabset
|
||||
|
||||
|
||||
If at least one dynamic dimension exists in output of the model, shape of the corresponding output tensor will be set as the result of inference call.
|
||||
Before the first inference, memory for such a tensor is not allocated and has shape `[0]`.
|
||||
If user call `set_output_tensor` with pre-allocated tensor, the inference will call `set_shape` internally, and the initial shape is replaced by the really calculated shape.
|
||||
So setting shape for output tensors in this case is useful only if you want to pre-allocate enough memory for output tensor, because `Tensor`'s `set_shape` method will re-allocate memory only if new shape requires more storage.
|
||||
If at least one dynamic dimension exists in an output of a model, the shape of the corresponding output tensor will be set as the result of the inference call.
|
||||
Before the first inference, memory for such a tensor is not allocated and its shape is `[0]`.
|
||||
If the `set_output_tensor` method is called with a pre-allocated tensor, the inference will call the `set_shape` internally, and the initial shape is replaced by the calculated shape.
|
||||
Therefore, setting a shape for output tensors in this case is useful only when pre-allocating enough memory for output tensor. Normally, the `set_shape` method of a `Tensor` re-allocates memory only if a new shape requires more storage.
|
||||
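A short sketch of reading a dynamically shaped output after inference (the element type of the output is an assumption):

```cpp
infer_request.infer();

// After inference, the output tensor has a concrete (static) shape
// calculated for this particular input.
ov::Tensor output = infer_request.get_output_tensor();
ov::Shape actual_shape = output.get_shape();
const float* result = output.data<float>();
```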
|
@ -1,12 +1,12 @@
|
||||
# OpenVINO™ Inference Request {#openvino_docs_OV_UG_Infer_request}
|
||||
|
||||
OpenVINO™ Runtime uses Infer Request mechanism which allows to run models on different devices in asynchronous or synchronous manners.
|
||||
`ov::InferRequest` class is used for this purpose inside the OpenVINO™ Runtime.
|
||||
This class allows to set and get data for model inputs, outputs and run inference for the model.
|
||||
OpenVINO™ Runtime uses the Infer Request mechanism, which allows running models on different devices in an asynchronous or synchronous manner.
|
||||
The `ov::InferRequest` class is used for this purpose inside the OpenVINO™ Runtime.
|
||||
This class allows you to set and get data for model inputs, outputs and run inference for the model.
|
||||
|
||||
## Creating Infer Request
|
||||
|
||||
`ov::InferRequest` can be created from the `ov::CompiledModel`:
|
||||
The `ov::InferRequest` can be created from the `ov::CompiledModel`:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
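For illustration only (the model path and device name are placeholders), several independent requests can be created from one compiled model, for example to run them in parallel:

```cpp
ov::Core core;
auto compiled_model = core.compile_model("model.xml", "CPU");  // hypothetical path/device

ov::InferRequest request_1 = compiled_model.create_infer_request();
ov::InferRequest request_2 = compiled_model.create_infer_request();
```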
@ -24,13 +24,13 @@ This class allows to set and get data for model inputs, outputs and run inferenc
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
## Run inference
|
||||
## Run Inference
|
||||
|
||||
`ov::InferRequest` supports synchronous and asynchronous modes for inference.
|
||||
The `ov::InferRequest` supports synchronous and asynchronous modes for inference.
|
||||
|
||||
### Synchronous mode
|
||||
### Synchronous Mode
|
||||
|
||||
You can use `ov::InferRequest::infer`, which blocks the application execution, to infer model in the synchronous mode:
|
||||
You can use `ov::InferRequest::infer`, which blocks the application execution, to infer a model in the synchronous mode:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -48,9 +48,9 @@ You can use `ov::InferRequest::infer`, which blocks the application execution, t
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
### Asynchronous mode
|
||||
### Asynchronous Mode
|
||||
|
||||
Asynchronous mode can improve application's overall frame-rate, because rather than wait for inference to complete, the app can keep working on the host, while the accelerator is busy. You can use `ov::InferRequest::start_async` to infer model in the asynchronous mode:
|
||||
The asynchronous mode can improve application's overall frame-rate, by making it work on the host while the accelerator is busy, instead of waiting for inference to complete. To infer a model in the asynchronous mode, use `ov::InferRequest::start_async`:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
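A condensed sketch contrasting the two modes (assuming `infer_request` was created as shown earlier):

```cpp
// Synchronous: blocks the calling thread until the results are ready.
infer_request.infer();

// Asynchronous: start the request and keep the host thread free,
// then wait when the results are actually needed.
infer_request.start_async();
// ... do other useful work on the host ...
infer_request.wait();
```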
@ -105,7 +105,7 @@ Asynchronous mode supports two ways the application waits for inference results:
|
||||
|
||||
Both methods are thread-safe.
|
||||
|
||||
When you are running several inference requests in parallel, a device can process them simultaneously, with no garauntees on the completion order. This may complicate a possible logic based on the `ov::InferRequest::wait` (unless your code needs to wait for the _all_ requests). For multi-request scenarios, consider using the `ov::InferRequest::set_callback` method to set a callback which is called upon completion of the request:
|
||||
When you are running several inference requests in parallel, a device can process them simultaneously, with no guarantees on the completion order. This may complicate any logic based on the `ov::InferRequest::wait` (unless your code needs to wait for _all_ the requests). For multi-request scenarios, consider using the `ov::InferRequest::set_callback` method to set a callback which is called upon completion of the request:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
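A minimal callback sketch (what is done inside the callback body is up to the application and only hinted at in the comments):

```cpp
// The callback receives an exception pointer that is null on success.
infer_request.set_callback([](std::exception_ptr ex) {
    if (ex) {
        std::rethrow_exception(ex);  // or log the error
    }
    // The request has finished; read the results, schedule the next input, etc.
});
infer_request.start_async();
```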
@ -125,7 +125,7 @@ When you are running several inference requests in parallel, a device can proces
|
||||
|
||||
|
||||
> **NOTE**: Use a weak reference to the infer_request (`ov::InferRequest*`, `ov::InferRequest&`, `std::weak_ptr<ov::InferRequest>`, etc.) in the callback. It is necessary to avoid cyclic references.
|
||||
For more details, check [Classification Sample Async](../../samples/cpp/classification_sample_async/README.md).
|
||||
For more details, see the [Classification Async Sample](../../samples/cpp/classification_sample_async/README.md).
|
||||
|
||||
You can use the `ov::InferRequest::cancel` method if you want to abort execution of the current inference request:
|
||||
|
||||
@ -145,12 +145,12 @@ You can use the `ov::InferRequest::cancel` method if you want to abort execution
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
|
||||
@anchor in_out_tensors
|
||||
## Working with Input and Output Tensors
|
||||
|
||||
`ov::InferRequest` allows to get input/output tensors by tensor name, index, port and without any arguments in case if model has only one input or output.
|
||||
`ov::InferRequest` allows you to get input/output tensors by tensor name, index, port, and without any arguments, if a model has only one input or output.
|
||||
|
||||
* `ov::InferRequest::get_input_tensor`, `ov::InferRequest::set_input_tensor`, `ov::InferRequest::get_output_tensor`, `ov::InferRequest::set_output_tensor` methods without arguments can be used to get or set input/output tensor for model with only one input/output:
|
||||
* `ov::InferRequest::get_input_tensor`, `ov::InferRequest::set_input_tensor`, `ov::InferRequest::get_output_tensor`, `ov::InferRequest::set_output_tensor` methods without arguments can be used to get or set input/output tensor for a model with only one input/output:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -222,12 +222,14 @@ You can use the `ov::InferRequest::cancel` method if you want to abort execution
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
## Examples of InferRequest usages
|
||||
## Examples of Infer Request Usages
|
||||
|
||||
### Cascade of models
|
||||
Presented below are examples of what the Infer Request can be used for.
|
||||
|
||||
`ov::InferRequest` can be used to organize cascade of models. You need to have infer requests for each model.
|
||||
In this case you can get output tensor from the first request using `ov::InferRequest::get_tensor` and set it as input for the second request using `ov::InferRequest::set_tensor`. But be careful, shared tensors across compiled models can be rewritten by the first model if the first infer request is run once again, while the second model has not started yet.
|
||||
### Cascade of Models
|
||||
|
||||
`ov::InferRequest` can be used to organize a cascade of models. Infer Requests are required for each model.
|
||||
In this case, you can get the output tensor from the first request, using `ov::InferRequest::get_tensor` and set it as input for the second request, using `ov::InferRequest::set_tensor`. Keep in mind that tensors shared across compiled models can be rewritten by the first model if the first infer request is run once again, while the second model has not started yet.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
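As a rough sketch of such a cascade (the tensor names `"output_name"` and `"input_name"`, as well as the two compiled models, are assumptions):

```cpp
auto request_1 = compiled_model_1.create_infer_request();
auto request_2 = compiled_model_2.create_infer_request();

request_1.infer();
// Feed the output of the first model into the second one.
ov::Tensor intermediate = request_1.get_tensor("output_name");
request_2.set_tensor("input_name", intermediate);
request_2.infer();
```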
@ -245,9 +247,9 @@ In this case you can get output tensor from the first request using `ov::InferRe
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
### Using of ROI tensors
|
||||
### Using ROI Tensors
|
||||
|
||||
It is possible to re-use shared input by several models. You do not need to allocate separate input tensor for a model if it processes a ROI object located inside of already allocated input of a previous model. For instance, when the first model detects objects in a video frame (stored as input tensor) and the second model accepts detected bounding boxes (ROI inside of the frame) as input. In this case, it is allowed to re-use pre-allocated input tensor (used by the first model) by the second model and just crop ROI without allocation of new memory using `ov::Tensor` with passing of `ov::Tensor` and `ov::Coordinate` as parameters.
|
||||
It is possible to re-use shared input in several models. You do not need to allocate a separate input tensor for a model if it processes a ROI object located inside of an already allocated input of a previous model. For instance, when the first model detects objects in a video frame (stored as an input tensor) and the second model accepts detected bounding boxes (ROI inside of the frame) as input. In this case, the second model is allowed to re-use the pre-allocated input tensor of the first model and just crop the ROI without allocating new memory, by creating an `ov::Tensor` from the existing tensor and `ov::Coordinate` parameters.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
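A small sketch of wrapping a region of an existing tensor without copying data; the frame shape, the NCHW coordinate order, and the bounding box values are assumptions:

```cpp
// Full frame used as input of the detection model.
ov::Tensor frame(ov::element::u8, ov::Shape{1, 3, 480, 640});

// Wrap a region of the same memory as a separate tensor: no data is copied.
// The coordinates are {N, C, H, W} begin/end points of the assumed box.
ov::Tensor roi(frame, ov::Coordinate{0, 0, 100, 200}, ov::Coordinate{1, 3, 228, 328});

second_request.set_input_tensor(roi);
second_request.infer();
```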
@ -265,9 +267,9 @@ It is possible to re-use shared input by several models. You do not need to allo
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
### Using of remote tensors
|
||||
### Using Remote Tensors
|
||||
|
||||
You can create a remote tensor to work with remote device memory. `ov::RemoteContext` allows to create remote tensor.
|
||||
By using `ov::RemoteContext` you can create a remote tensor to work with remote device memory.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
|
@ -1,44 +1,39 @@
|
||||
# When Dynamic Shapes API is Not Applicable {#openvino_docs_OV_UG_NoDynamicShapes}
|
||||
|
||||
Several approaches to emulate dynamic shapes are considered in this chapter
|
||||
Apply these methods only if [native dynamic shape API](ov_dynamic_shapes.md) doesn't work for you or doesn't give desired performance.
|
||||
Several approaches to emulate dynamic shapes are considered in this article.
|
||||
Apply the following methods only if the [native dynamic shape API](ov_dynamic_shapes.md) does not work or does not perform as expected.
|
||||
|
||||
## Padding
|
||||
|
||||
The model can be designed in a way that supports partially filled tensors.
|
||||
For the BERT model you can use a special input to the model to mask unused elements out.
|
||||
So, the model can be reshaped for some predefined big sequence length once and compiled once, and then the input tensors are used only partially with mask specifying valid tokens.
|
||||
For the BERT model, use a special input to the model to mask out unused elements.
|
||||
Therefore, the model can be reshaped for some predefined big sequence length once and compiled once. Then, the input tensors are used only partially with a mask specifying valid tokens.
|
||||
This approach is called *padding*.
|
||||
|
||||
However, padding is not applicable to every model and every use case.
|
||||
You should be aware of model internals to apply padding. Otherwise, if the model is not designed to handle dummy element gracefully in padding area,
|
||||
then the results of inference may be totally scrambled,
|
||||
or accuracy is significantly affected.
|
||||
Model can even crash during inference.
|
||||
Be aware of the internals of the model before you apply padding. Otherwise, if the model is not designed to handle dummy elements gracefully in a padding area, the results of inference may be entirely scrambled, or accuracy significantly affected.
|
||||
The model can even crash during inference.
|
||||
|
||||
Besides the bad developer experience,
|
||||
the main disadvantage of padding is a bad performance due to spending time for processing dummy elements in the padding area,
|
||||
even if the model is properly designed to be used with padding.
|
||||
It turns out that usually such models are designed in a way where calculations in the padded area still happen not affecting the end result.
|
||||
The main disadvantage of padding, apart from impacting developer experience, is poor performance. Even if the model is properly designed for padding, it is often designed in such a way that the time-consuming processing of dummy elements in the padded area still occurs, not affecting the end result but decreasing inference speed.
|
||||
|
||||
## Multiple Precompiled Models
|
||||
## Multiple Pre-compiled Models
|
||||
|
||||
Another approach to handling arbitrarily sized inputs is to pre-compile several models reshaped for different input shapes.
|
||||
This method works well if the number of different shapes is small enough to afford increased time for multiple reshapes and compilations
|
||||
as well as increased amount of consumed memory.
|
||||
As this method cannot be scaled well it is used in combination with the padding:
|
||||
model with the most suitable input shape among pre-reshaped models is chosen.
|
||||
It gives smaller pad area in comparison to a single model.
|
||||
As this method cannot be scaled well, it is used in combination with padding.
|
||||
Hence, the model with the most suitable input shape among pre-reshaped models is chosen.
|
||||
It gives a smaller padding area in comparison to a single model.
|
||||
|
||||
## Dimension Partitioning
|
||||
|
||||
Another practical but still a complicated approach is when the input tensor can be divided into multiple chunks along the dynamic dimension.
|
||||
For example, if we have a batch of independent inputs as a single tensor.
|
||||
If arbitrary division along batch dimension is possible - and for batch dimension it should be possible by the dimension purpose -
|
||||
you can run multiple inferences using the approach with several pre-compiled models choosing sized to have the minimal number of inferences
|
||||
Another practical but still complicated approach is to divide the input tensor into multiple chunks along the dynamic dimension.
|
||||
For example, if there is a batch of independent inputs as a single tensor.
|
||||
If arbitrary division along batch dimension is possible, and it should be possible by the dimension purpose,
|
||||
run multiple inferences. Use the approach with several pre-compiled models, choosing input sizes that result in the minimum number of inferences for a given batch size in the input tensor.
|
||||
|
||||
For example, if there are models pre-compiled for batch sizes 1, 2, 4 and 8,
|
||||
the input tensor with batch 5 can be processed with two inference calls with batch size 1 and 4.
|
||||
(Here it's assumed the batch processing is required for performance reasons, otherwise you can just loop over images in a batch,
|
||||
For example, if there are models pre-compiled for batch sizes `1`, `2`, `4` and `8`,
|
||||
the input tensor with batch `5` can be processed with two inference calls with batch size `1` and `4`.
|
||||
(At this point, it is assumed that the batch processing is required for performance reasons. In other cases, just loop over images in a batch
|
||||
and process image by image with a single compiled model.)
|
||||
|
@ -1,32 +1,31 @@
|
||||
# High-level Performance Hints {#openvino_docs_OV_UG_Performance_Hints}
|
||||
|
||||
Each of OpenVINO's [supported devices](supported_plugins/Device_Plugins.md) offers low-level performance settings. Tweaking this detailed configuration requires deep architecture understanding.
|
||||
Also, while the performance may be optimal for the specific combination of the device and the inferred model, the resulting configuration is not necessarily optimal for another device or model.
|
||||
The OpenVINO performance hints are the new way to configure the performance with _portability_ in mind. As the hints are supported by every OpenVINO device, this is a future-proof solution that is fully compatible with the [automatic device selection](./auto_device_selection.md).
|
||||
Even though all [supported devices](supported_plugins/Device_Plugins.md) in OpenVINO™ offer low-level performance settings, utilizing them is not recommended outside of very few cases.
|
||||
The preferred way to configure performance in OpenVINO Runtime is using performance hints. This is a future-proof solution fully compatible with the [automatic device selection inference mode](./auto_device_selection.md) and designed with *portability* in mind.
|
||||
|
||||
The hints also "reverse" the direction of the configuration in the right fashion: rather than map the application needs to the low-level performance settings, and keep an associated application logic to configure each possible device separately, the idea is to express a target scenario with a single config key and let the *device* to configure itself in response.
|
||||
The hints also set the direction of the configuration in the right order. Instead of mapping the application needs to the low-level performance settings, and keeping an associated application logic to configure each possible device separately, the hints express a target scenario with a single config key and let the *device* configure itself in response.
|
||||
|
||||
Previously, a certain level of automatic configuration was coming from the _default_ values of the parameters. For example, the number of CPU streams was deduced from the number of CPU cores, when `ov::streams::AUTO` (`CPU_THROUGHPUT_AUTO` in the pre-OpenVINO 2.0 parlance) is set. However, the resulting number of streams didn't account for actual compute requirements of the model to be inferred.
|
||||
Previously, a certain level of automatic configuration was the result of the *default* values of the parameters. For example, the number of CPU streams was deduced from the number of CPU cores, when `ov::streams::AUTO` (`CPU_THROUGHPUT_AUTO` in the pre-API 2.0 terminology) was set. However, the resulting number of streams did not account for actual compute requirements of the model to be inferred.
|
||||
The hints, in contrast, respect the actual model, so the parameters for optimal throughput are calculated for each model individually (based on its compute versus memory bandwidth requirements and capabilities of the device).
|
||||
|
||||
## Performance Hints: Latency and Throughput
|
||||
As discussed in the [Optimization Guide](../optimization_guide/dldt_optimization_guide.md) there are a few different metrics associated with inference speed.
|
||||
Throughput and latency are some of the most widely used metrics that measure the overall performance of an application.
|
||||
|
||||
This is why, to ease the configuration of the device, OpenVINO offers two dedicated hints, namely `ov::hint::PerformanceMode::THROUGHPUT` and `ov::hint::PerformanceMode::LATENCY`.
|
||||
A special `ov::hint::PerformanceMode::UNDEFINED` acts the same as specifying no hint.
|
||||
Therefore, in order to ease the configuration of the device, OpenVINO offers two dedicated hints, namely `ov::hint::PerformanceMode::THROUGHPUT` and `ov::hint::PerformanceMode::LATENCY`.
|
||||
A special `ov::hint::PerformanceMode::UNDEFINED` hint acts the same as specifying no hint.
|
||||
|
||||
Please also see the last section in this document on conducting performance measurements with the benchmark_app`.
|
||||
For more information on conducting performance measurements with the `benchmark_app`, refer to the last section in this document.
|
||||
|
||||
Note that a typical model may take significantly more time to load with `ov::hint::PerformanceMode::THROUGHPUT` and consume much more memory, compared with `ov::hint::PerformanceMode::LATENCY`.
|
||||
Keep in mind that a typical model may take significantly more time to load with the `ov::hint::PerformanceMode::THROUGHPUT` and consume much more memory, compared to the `ov::hint::PerformanceMode::LATENCY`.
|
||||
|
||||
## Performance Hints: How It Works?
|
||||
## Performance Hints: How It Works
|
||||
Internally, every device "translates" the value of the hint to the actual performance settings.
|
||||
For example the `ov::hint::PerformanceMode::THROUGHPUT` selects number of CPU or GPU streams.
|
||||
For the GPU, additionally the optimal batch size is selected and the [automatic batching](../OV_Runtime_UG/automatic_batching.md) is applied whenever possible (and also if the device supports that [refer to the devices/features support matrix](./supported_plugins/Device_Plugins.md)).
|
||||
For example, the `ov::hint::PerformanceMode::THROUGHPUT` selects the number of CPU or GPU streams.
|
||||
Additionally, the optimal batch size is selected for the GPU and the [automatic batching](../OV_Runtime_UG/automatic_batching.md) is applied whenever possible. To check whether the device supports it, refer to the [devices/features support matrix](./supported_plugins/Device_Plugins.md) article.
|
||||
|
||||
The resulting (device-specific) settings can be queried back from the instance of the `ov::CompiledModel`.
|
||||
Notice that the `benchmark_app`, outputs the actual settings for the THROUGHPUT hint, please the bottom of the output example:
|
||||
Be aware that the `benchmark_app` outputs the actual settings for the `THROUGHPUT` hint. See the example of the output below:
|
||||
|
||||
```
|
||||
$benchmark_app -hint tput -d CPU -m 'path to your favorite model'
|
||||
@ -41,7 +40,7 @@ Notice that the `benchmark_app`, outputs the actual settings for the THROUGHPUT
|
||||
```
|
||||
|
||||
## Using the Performance Hints: Basic API
|
||||
In the example code-snippet below the `ov::hint::PerformanceMode::THROUGHPUT` is specified for the `ov::hint::performance_mode` property for the compile_model:
|
||||
In the example code snippet below, `ov::hint::PerformanceMode::THROUGHPUT` is specified for the `ov::hint::performance_mode` property for `compile_model`:
|
||||
@sphinxdirective
|
||||
|
||||
.. tab:: C++
|
||||
@ -59,8 +58,8 @@ In the example code-snippet below the `ov::hint::PerformanceMode::THROUGHPUT` i
|
||||
@endsphinxdirective
|
||||
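For reference, a minimal sketch of this call (the model path and device name are placeholders):

```cpp
ov::Core core;
auto model = core.read_model("model.xml");  // hypothetical path

// Ask the device to configure itself for throughput; no low-level knobs needed.
auto compiled_model = core.compile_model(model, "CPU",
    ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
```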
|
||||
## Additional (Optional) Hints from the App
|
||||
Let's take an example of an application that processes 4 video streams. The most future-proof way to communicate the limitation of the parallel slack is to equip the performance hint with the optional `ov::hint::num_requests` configuration key set to 4.
|
||||
As discussed previosly, for the GPU this will limit the batch size, for the CPU - the number of inference streams, so each device uses the `ov::hint::num_requests` while converting the hint to the actual device configuration options:
|
||||
For an application that processes 4 video streams, the most future-proof way to communicate the limitation of the parallel slack is to equip the performance hint with the optional `ov::hint::num_requests` configuration key set to 4.
|
||||
As mentioned earlier, this will limit the batch size for the GPU and the number of inference streams for the CPU. Thus, each device uses the `ov::hint::num_requests` while converting the hint to the actual device configuration options:
|
||||
@sphinxdirective
|
||||
|
||||
.. tab:: C++
|
||||
@ -78,7 +77,7 @@ As discussed previosly, for the GPU this will limit the batch size, for the CPU
|
||||
@endsphinxdirective
|
||||
|
||||
## Optimal Number of Inference Requests
|
||||
Using the hints assumes that the application queries the `ov::optimal_number_of_infer_requests` to create and run the returned number of requests simultaneously:
|
||||
The hints are used on the presumption that the application queries `ov::optimal_number_of_infer_requests` to create and run the returned number of requests simultaneously:
|
||||
@sphinxdirective
|
||||
|
||||
.. tab:: C++
|
||||
@ -95,18 +94,18 @@ Using the hints assumes that the application queries the `ov::optimal_number_of_
|
||||
|
||||
@endsphinxdirective
|
||||
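A short sketch of querying the recommended number of requests and creating exactly that many (the hint and device are carried over from the previous examples as assumptions):

```cpp
auto compiled_model = core.compile_model(model, "CPU",
    ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

// Query how many requests the device recommends and create that many.
uint32_t nireq = compiled_model.get_property(ov::optimal_number_of_infer_requests);
std::vector<ov::InferRequest> requests;
for (uint32_t i = 0; i < nireq; ++i) {
    requests.push_back(compiled_model.create_infer_request());
}
```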
|
||||
While an application is free to create more requests if needed (for example to support asynchronous inputs population) **it is very important to at least run the `ov::optimal_number_of_infer_requests` of the inference requests in parallel**, for efficiency (device utilization) reasons.
|
||||
While an application is free to create more requests if needed (for example, to support asynchronous inputs population), **it is very important to at least run the `ov::optimal_number_of_infer_requests` of the inference requests in parallel**. It is recommended for efficiency, or device utilization, reasons.
|
||||
|
||||
Also, notice that `ov::hint::PerformanceMode::LATENCY` does not necessarily imply using single inference request. For example, multi-socket CPUs can deliver as high number of requests (at the same minimal latency) as there are NUMA nodes the machine features.
|
||||
To make your application fully scalable, prefer to query the `ov::optimal_number_of_infer_requests` directly.
|
||||
Keep in mind that `ov::hint::PerformanceMode::LATENCY` does not necessarily imply using single inference request. For example, multi-socket CPUs can deliver as many requests at the same minimal latency as the number of NUMA nodes in the system.
|
||||
To make your application fully scalable, make sure to query the `ov::optimal_number_of_infer_requests` directly.
|
||||
|
||||
## Prefer Async API
|
||||
The API of the inference requests offers Sync and Async execution. While the `ov::InferRequest::infer()` is inherently synchronous and simple to operate (as it serializes the execution flow in the current application thread), the Async "splits" the `infer()` into `ov::InferRequest::start_async()` and use of the `ov::InferRequest::wait()` (or callbacks). Please consider the [API examples](../OV_Runtime_UG/ov_infer_request.md).
|
||||
Although the Synchronous API can be somewhat easier to start with, in the production code always prefer to use the Asynchronous (callbacks-based) API, as it is the most general and scalable way to implement the flow control for any possible number of requests (and hence both latency and throughput scenarios).
|
||||
The API of the inference requests offers Sync and Async execution. The `ov::InferRequest::infer()` is inherently synchronous and simple to operate (as it serializes the execution flow in the current application thread). The Async "splits" the `infer()` into `ov::InferRequest::start_async()` and `ov::InferRequest::wait()` (or callbacks). For more information, refer to the [API examples](../OV_Runtime_UG/ov_infer_request.md).
|
||||
Although the Synchronous API can be somewhat easier to start with, it is recommended to use the Asynchronous (callbacks-based) API in the production code. It is the most general and scalable way to implement the flow control for any possible number of requests (and thus both latency and throughput scenarios).
|
||||
|
||||
## Combining the Hints and Individual Low-Level Settings
|
||||
While sacrificing the portability at a some extent, it is possible to combine the hints with individual device-specific settings.
|
||||
For example, you can let the device prepare a configuration `ov::hint::PerformanceMode::THROUGHPUT` while overriding any specific value:
|
||||
While sacrificing the portability to some extent, it is possible to combine the hints with individual device-specific settings.
|
||||
For example, use `ov::hint::PerformanceMode::THROUGHPUT` to prepare a general configuration and override any of its specific values:
|
||||
@sphinxdirective
|
||||
|
||||
.. tab:: C++
|
||||
@ -123,8 +122,9 @@ For example, you can let the device prepare a configuration `ov::hint::Performan
|
||||
|
||||
|
||||
@endsphinxdirective
|
||||
## Testing the Performance of The Hints with the Benchmark_App
|
||||
The `benchmark_app`, that exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md) versions, is the best way to evaluate the performance of the performance hints for a particular device:
|
||||
|
||||
## Testing Performance of the Hints with the Benchmark_App
|
||||
The `benchmark_app`, that exists in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md) versions, is the best way to evaluate the functionality of the performance hints for a particular device:
|
||||
- benchmark_app **-hint tput** -d 'device' -m 'path to your model'
|
||||
- benchmark_app **-hint latency** -d 'device' -m 'path to your model'
|
||||
- Disabling the hints to emulate the pre-hints era (highly recommended before trying the individual low-level settings, such as the number of streams as below, threads, etc):
|
||||
|
@ -1,10 +1,14 @@
|
||||
# Preprocessing API - details {#openvino_docs_OV_UG_Preprocessing_Details}
|
||||
|
||||
## Preprocessing capabilities
|
||||
The purpose of this article is to present details on preprocessing API, such as its capabilities and post-processing.
|
||||
|
||||
### Addressing particular input/output
|
||||
## Pre-processing Capabilities
|
||||
|
||||
If your model has only one input, then simple <code>ov::preprocess::PrePostProcessor::input()</code> will get a reference to preprocessing builder for this input (tensor, steps, model):
|
||||
Below is a full list of pre-processing API capabilities:
|
||||
|
||||
### Addressing Particular Input/Output
|
||||
|
||||
If the model has only one input, then simple `ov::preprocess::PrePostProcessor::input()` will get a reference to pre-processing builder for this input (a tensor, the steps, a model):
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -23,7 +27,7 @@ If your model has only one input, then simple <code>ov::preprocess::PrePostProce
|
||||
@endsphinxtabset
|
||||
|
||||
|
||||
In general, when model has multiple inputs/outputs, each one can be addressed by tensor name
|
||||
In general, when a model has multiple inputs/outputs, each one can be addressed by a tensor name.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -42,7 +46,7 @@ In general, when model has multiple inputs/outputs, each one can be addressed by
|
||||
@endsphinxtabset
|
||||
|
||||
|
||||
Or by it's index
|
||||
Or by its index.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -61,17 +65,17 @@ Or by it's index
|
||||
@endsphinxtabset
|
||||
|
||||
C++ references:
|
||||
* <code>ov::preprocess::InputTensorInfo</code>
|
||||
* <code>ov::preprocess::OutputTensorInfo</code>
|
||||
* <code>ov::preprocess::PrePostProcessor</code>
|
||||
* `ov::preprocess::InputTensorInfo`
|
||||
* `ov::preprocess::OutputTensorInfo`
|
||||
* `ov::preprocess::PrePostProcessor`
|
||||
|
||||
|
||||
### Supported preprocessing operations
|
||||
### Supported Pre-processing Operations
|
||||
|
||||
C++ references:
|
||||
* <code>ov::preprocess::PreProcessSteps</code>
|
||||
* `ov::preprocess::PreProcessSteps`
|
||||
|
||||
#### Mean/Scale normalization
|
||||
#### Mean/Scale Normalization
|
||||
|
||||
Typical data normalization includes 2 operations for each data item: subtract the mean value and divide by the standard deviation. This can be done with the following code:
|
||||
|
||||
@ -110,15 +114,15 @@ In Computer Vision area normalization is usually done separately for R, G, B val
|
||||
@endsphinxtabset
|
||||
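As a rough sketch of per-channel normalization (the NCHW model layout and the ImageNet-style mean/scale constants are assumptions used only for illustration):

```cpp
ov::preprocess::PrePostProcessor ppp(model);

// The model input layout is assumed to be NCHW so that the per-channel
// values below can be mapped to the 'C' dimension.
ppp.input().model().set_layout("NCHW");

ppp.input().preprocess()
    .mean({123.675f, 116.28f, 103.53f})
    .scale({58.395f, 57.12f, 57.375f});

model = ppp.build();
```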
|
||||
C++ references:
|
||||
* <code>ov::preprocess::PreProcessSteps::mean()</code>
|
||||
* <code>ov::preprocess::PreProcessSteps::scale()</code>
|
||||
* `ov::preprocess::PreProcessSteps::mean()`
|
||||
* `ov::preprocess::PreProcessSteps::scale()`
|
||||
|
||||
|
||||
#### Convert precision
|
||||
#### Converting Precision
|
||||
|
||||
In Computer Vision, image is represented by array of unsigned 8-but integer values (for each color), but model accepts floating point tensors
|
||||
In Computer Vision, the image is represented by an array of unsigned 8-bit integer values (for each color), but the model accepts floating point tensors.
|
||||
|
||||
To integrate precision conversion into execution graph as a preprocessing step, just do:
|
||||
To integrate precision conversion into an execution graph as a pre-processing step:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -137,15 +141,15 @@ To integrate precision conversion into execution graph as a preprocessing step,
|
||||
@endsphinxtabset
|
||||
|
||||
C++ references:
|
||||
* <code>ov::preprocess::InputTensorInfo::set_element_type()</code>
|
||||
* <code>ov::preprocess::PreProcessSteps::convert_element_type()</code>
|
||||
* `ov::preprocess::InputTensorInfo::set_element_type()`
|
||||
* `ov::preprocess::PreProcessSteps::convert_element_type()`
|
||||
|
||||
|
||||
#### Convert layout (transpose)
|
||||
#### Converting Layout (Transposing)
|
||||
|
||||
Transposing of matrices/tensors is a typical operation in Deep Learning - you may have a BMP image 640x480 which is an array of `{480, 640, 3}` elements, but Deep Learning model can require input with shape `{1, 3, 480, 640}`
|
||||
Transposing of matrices/tensors is a typical operation in Deep Learning - you may have a BMP image 640x480, which is an array of `{480, 640, 3}` elements, but Deep Learning model can require input with shape `{1, 3, 480, 640}`.
|
||||
|
||||
Using [layout](./layout_overview.md) of user's tensor and layout of original model conversion can be done implicitly
|
||||
Conversion can be done implicitly, using the [layout](./layout_overview.md) of a user's tensor and the layout of an original model.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -164,7 +168,7 @@ Using [layout](./layout_overview.md) of user's tensor and layout of original mod
|
||||
@endsphinxtabset
|
||||
|
||||
|
||||
Or if you prefer manual transpose of axes without usage of [layout](./layout_overview.md) in your code, just do:
|
||||
For a manual transpose of axes without the use of a [layout](./layout_overview.md) in the code:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -182,19 +186,19 @@ Or if you prefer manual transpose of axes without usage of [layout](./layout_ove
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
It performs the same transpose, but we believe that approach using source and destination layout can be easier to read and understand
|
||||
It performs the same transpose. However, the approach where source and destination layout are used can be easier to read and understand.
|
||||
|
||||
C++ references:
|
||||
* <code>ov::preprocess::PreProcessSteps::convert_layout()</code>
|
||||
* <code>ov::preprocess::InputTensorInfo::set_layout()</code>
|
||||
* <code>ov::preprocess::InputModelInfo::set_layout()</code>
|
||||
* <code>ov::Layout</code>
|
||||
* `ov::preprocess::PreProcessSteps::convert_layout()`
|
||||
* `ov::preprocess::InputTensorInfo::set_layout()`
|
||||
* `ov::preprocess::InputModelInfo::set_layout()`
|
||||
* `ov::Layout`
|
||||
|
||||
#### Resize image
|
||||
#### Resizing Image
|
||||
|
||||
Resizing of image is a typical preprocessing step for computer vision tasks. With preprocessing API this step can also be integrated into execution graph and performed on target device.
|
||||
Resizing an image is a typical pre-processing step for computer vision tasks. With pre-processing API, this step can also be integrated into an execution graph and performed on a target device.
|
||||
|
||||
To resize the input image, it is needed to define `H` and `W` dimensions of [layout](./layout_overview.md)
|
||||
To resize the input image, the `H` and `W` dimensions of the [layout](./layout_overview.md) need to be defined:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -212,7 +216,7 @@ To resize the input image, it is needed to define `H` and `W` dimensions of [lay
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
Or in case if original model has known spatial dimensions (widht+height), target width/height can be omitted
|
||||
When the original model has known spatial dimensions (`width`+`height`), the target `width`/`height` can be omitted.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -231,13 +235,13 @@ Or in case if original model has known spatial dimensions (widht+height), target
|
||||
@endsphinxtabset
|
||||
|
||||
C++ references:
|
||||
* <code>ov::preprocess::PreProcessSteps::resize()</code>
|
||||
* <code>ov::preprocess::ResizeAlgorithm</code>
|
||||
* `ov::preprocess::PreProcessSteps::resize()`
|
||||
* `ov::preprocess::ResizeAlgorithm`
|
||||
|
||||
|
||||
#### Color conversion
|
||||
#### Color Conversion
|
||||
|
||||
Typical use case is to reverse color channels from RGB to BGR and wise versa. To do this, specify source color format in `tensor` section and perform `convert_color` preprocessing operation. In example below, user has `BGR` image and needs to convert it to `RGB` as required for model's input
|
||||
Typical use case is to reverse color channels from `RGB` to `BGR` and vice versa. To do this, specify source color format in `tensor` section and perform `convert_color` pre-processing operation. In the example below, a `BGR` image needs to be converted to `RGB` as required for the model input.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -255,9 +259,9 @@ Typical use case is to reverse color channels from RGB to BGR and wise versa. To
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
#### Color conversion - NV12/I420
|
||||
Preprocessing also support YUV-family source color formats, i.e. NV12 and I420.
|
||||
In advanced cases such YUV images can be splitted into separate planes, e.g. for NV12 images Y-component may come from one source and UV-component comes from another source. Concatenating such components in user's application manually is not a perfect solution from performance and device utilization perspectives, so there is a way to use Preprocessing API. For such cases there is `NV12_TWO_PLANES` and `I420_THREE_PLANES` source color formats, which will split original `input` to 2 or 3 inputs
|
||||
#### Color Conversion - NV12/I420
|
||||
Pre-processing also supports YUV-family source color formats, i.e. NV12 and I420.
|
||||
In advanced cases, such YUV images can be split into separate planes, e.g., for NV12 images Y-component may come from one source and UV-component from another one. Concatenating such components in user's application manually is not a perfect solution from performance and device utilization perspectives. However, there is a way to use Pre-processing API. For such cases there are `NV12_TWO_PLANES` and `I420_THREE_PLANES` source color formats, which will split the original `input` into 2 or 3 inputs.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -275,20 +279,20 @@ In advanced cases such YUV images can be splitted into separate planes, e.g. for
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
In this example, original `input` is being split to `input/y` and `input/uv` inputs. You can fill `input/y` from one source, and `input/uv` from another source. Color conversion to `RGB` will be performed using these sources, it is more optimal as there will be no additional copies of NV12 buffers.
|
||||
In this example, the original `input` is split to `input/y` and `input/uv` inputs. You can fill `input/y` from one source, and `input/uv` from another source. Color conversion to `RGB` will be performed, using these sources. It is more efficient as there will be no additional copies of NV12 buffers.
|
||||
|
||||
C++ references:
|
||||
* <code>ov::preprocess::ColorFormat</code>
|
||||
* <code>ov::preprocess::PreProcessSteps::convert_color</code>
|
||||
* `ov::preprocess::ColorFormat`
|
||||
* `ov::preprocess::PreProcessSteps::convert_color`
|
||||
|
||||
|
||||
### Custom operations
|
||||
### Custom Operations
|
||||
|
||||
Preprocessing API also allows adding custom preprocessing steps into execution graph. Custom step is a function which accepts current 'input' node and returns new node after adding preprocessing step
|
||||
Pre-processing API also allows adding `custom` preprocessing steps into an execution graph. The `custom` function accepts the current `input` node, applies the defined preprocessing operations, and returns a new node.
|
||||
|
||||
> **Note:** Custom preprocessing function shall only insert node(s) after input, it will be done during model compilation. This function will NOT be called during execution phase. This may look not trivial and require some knowledge of [OpenVINO™ operations](../ops/opset.md)
|
||||
> **Note:** Custom pre-processing function should only insert node(s) after the input. It is done during model compilation. This function will NOT be called during the execution phase. This may appear to be complicated and require knowledge of [OpenVINO™ operations](../ops/opset.md).
|
||||
|
||||
If there is a need to insert some additional operations to execution graph right after input, like some specific crops and/or resizes - Preprocessing API can be a good choice to implement this
|
||||
If there is a need to insert additional operations into the execution graph right after the input, such as specific crops and/or resizes, the Pre-processing API can be a good choice to implement this.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -307,23 +311,23 @@ If there is a need to insert some additional operations to execution graph right
|
||||
@endsphinxtabset
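For a compact illustration, here is a hedged sketch of such a custom step; it inserts a hypothetical `Abs` operation right after the input (any operation from the supported opsets could be used instead):

```cpp
#include <openvino/openvino.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>
#include <openvino/opsets/opset8.hpp>

void add_custom_step(std::shared_ptr<ov::Model>& model) {
    ov::preprocess::PrePostProcessor ppp(model);
    ppp.input().preprocess().custom(
        [](const ov::Output<ov::Node>& node) {
            // Called once at model compilation time, not during inference:
            // insert an Abs node right after the input and return it
            return std::make_shared<ov::opset8::Abs>(node);
        });
    model = ppp.build();
}
```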
|
||||
|
||||
C++ references:
|
||||
* <code>ov::preprocess::PreProcessSteps::custom()</code>
|
||||
* `ov::preprocess::PreProcessSteps::custom()`
|
||||
* [Available Operations Sets](../ops/opset.md)
|
||||
|
||||
## Postprocessing
|
||||
## Post-processing
|
||||
|
||||
Postprocessing steps can be added to model outputs. As for preprocessing, these steps will be also integrated into graph and executed on selected device.
|
||||
Post-processing steps can be added to model outputs. As for pre-processing, these steps will be also integrated into a graph and executed on a selected device.
|
||||
|
||||
Preprocessing uses flow **User tensor** -> **Steps** -> **Model input**
|
||||
Pre-processing uses the following flow: **User tensor** -> **Steps** -> **Model input**.
|
||||
|
||||
Postprocessing is wise versa: **Model output** -> **Steps** -> **User tensor**
|
||||
Post-processing uses the reverse: **Model output** -> **Steps** -> **User tensor**.
|
||||
|
||||
Comparing to preprocessing, there is not so much operations needed to do in post-processing stage, so right now only following postprocessing operations are supported:
|
||||
- Convert [layout](./layout_overview.md)
|
||||
- Convert element type
|
||||
- Custom operations
|
||||
Compared to pre-processing, there are not as many operations needed for the post-processing stage. Currently, only the following post-processing operations are supported:
|
||||
- Convert a [layout](./layout_overview.md).
|
||||
- Convert an element type.
|
||||
- Custom operations.
|
||||
|
||||
Usage of these operations is similar to Preprocessing. Some example is shown below:
|
||||
Usage of these operations is similar to pre-processing. See the following example:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -342,6 +346,6 @@ Usage of these operations is similar to Preprocessing. Some example is shown bel
|
||||
@endsphinxtabset
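A hedged sketch of a possible post-processing configuration is shown below; the target layout and element type are illustrative assumptions, not values taken from the snippet above.

```cpp
#include <openvino/openvino.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>

void add_postprocessing(std::shared_ptr<ov::Model>& model) {
    ov::preprocess::PrePostProcessor ppp(model);
    ppp.output().model().set_layout("NCHW");        // layout produced by the model (assumed)
    ppp.output().postprocess()
        .convert_layout("NHWC")                     // transpose to the application layout
        .convert_element_type(ov::element::u8);     // and convert to u8 for the application
    model = ppp.build();
}
```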
|
||||
|
||||
C++ references:
|
||||
* <code>ov::preprocess::PostProcessSteps</code>
|
||||
* <code>ov::preprocess::OutputModelInfo</code>
|
||||
* <code>ov::preprocess::OutputTensorInfo</code>
|
||||
* `ov::preprocess::PostProcessSteps`
|
||||
* `ov::preprocess::OutputModelInfo`
|
||||
* `ov::preprocess::OutputTensorInfo`
|
||||
|
@ -14,36 +14,36 @@
|
||||
|
||||
## Introduction
|
||||
|
||||
When your input data don't perfectly fit to Neural Network model input tensor - this means that additional operations/steps are needed to transform your data to format expected by model. These operations are known as "preprocessing".
|
||||
When input data does not fit the model input tensor perfectly, additional operations/steps are needed to transform the data to the format expected by the model. These operations are known as "preprocessing".
|
||||
|
||||
### Example
|
||||
Consider the following standard example: deep learning model expects input with shape `{1, 3, 224, 224}`, `FP32` precision, `RGB` color channels order, and requires data normalization (subtract mean and divide by scale factor). But you have just a `640x480` `BGR` image (data is `{480, 640, 3}`). This means that we need some operations which will:
|
||||
- Convert U8 buffer to FP32
|
||||
- Transform to `planar` format: from `{1, 480, 640, 3}` to `{1, 3, 480, 640}`
|
||||
- Resize image from 640x480 to 224x224
|
||||
- Make `BGR->RGB` conversion as model expects `RGB`
|
||||
- For each pixel, subtract mean values and divide by scale factor
|
||||
Consider the following standard example: a deep learning model expects an input with the `{1, 3, 224, 224}` shape, `FP32` precision, the `RGB` color channel order, and it requires data normalization (subtract mean and divide by scale factor). However, there is just a `640x480` `BGR` image (data is `{480, 640, 3}`). This means that the following operations must be performed:
|
||||
- Convert `U8` buffer to `FP32`.
|
||||
- Transform to `planar` format: from `{1, 480, 640, 3}` to `{1, 3, 480, 640}`.
|
||||
- Resize image from 640x480 to 224x224.
|
||||
- Make the `BGR->RGB` conversion, as the model expects `RGB`.
|
||||
- For each pixel, subtract mean values and divide by scale factor.
|
||||
|
||||
|
||||

|
||||
|
||||
|
||||
Even though all these steps can be relatively easy implemented manually in application's code before actual inference, it is possible to do it with Preprocessing API. Reasons to use this API are:
|
||||
- Preprocessing API is easy to use
|
||||
Even though all these steps can be implemented manually in the application code before the actual inference, it is also possible to perform them with the Preprocessing API. The advantages of using the API are:
|
||||
- Preprocessing API is easy to use.
|
||||
- Preprocessing steps will be integrated into the execution graph and performed on the selected device (CPU/GPU/VPU, etc.) rather than always on the CPU, which improves utilization of the selected device.
|
||||
|
||||
## Preprocessing API
|
||||
|
||||
Intuitively, Preprocessing API consists of the following parts:
|
||||
1. **Tensor:** Declare user's data format, like shape, [layout](./layout_overview.md), precision, color format of actual user's data
|
||||
2. **Steps:** Describe sequence of preprocessing steps which need to be applied to user's data
|
||||
3. **Model:** Specify Model's data format. Usually, precision and shape are already known for model, only additional information, like [layout](./layout_overview.md) can be specified
|
||||
Intuitively, preprocessing API consists of the following parts:
|
||||
1. **Tensor** - declares the user data format, such as shape, [layout](./layout_overview.md), precision, and color format of the actual user data.
|
||||
2. **Steps** - describes the sequence of preprocessing steps to be applied to the user data.
|
||||
3. **Model** - specifies the model data format. Usually, precision and shape are already known for the model; only additional information, like [layout](./layout_overview.md), can be specified.
|
||||
|
||||
> **Note:** All model's graph modification shall be performed after model is read from disk and **before** it is being loaded on actual device.
|
||||
> **NOTE**: Graph modifications of a model shall be performed after the model is read from a drive and **before** it is loaded on the actual device.
|
||||
|
||||
### PrePostProcessor object
|
||||
### PrePostProcessor Object
|
||||
|
||||
`ov::preprocess::PrePostProcessor` class allows specifying preprocessing and postprocessing steps for model read from disk.
|
||||
The `ov::preprocess::PrePostProcessor` class allows specifying preprocessing and postprocessing steps for a model read from disk.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -61,9 +61,9 @@ Intuitively, Preprocessing API consists of the following parts:
|
||||
|
||||
@endsphinxtabset
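A minimal sketch of creating the object (the file name is a placeholder):

```cpp
#include <openvino/openvino.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>

int main() {
    ov::Core core;
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");  // placeholder path
    ov::preprocess::PrePostProcessor ppp(model);
    // ... declare tensor, steps, and model information here ...
    model = ppp.build();  // apply the declared pre/post-processing to the model
    return 0;
}
```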
|
||||
|
||||
### Declare user's data format
|
||||
### Declare User's Data Format
|
||||
|
||||
To address particular input of model/preprocessor, use `ov::preprocess::PrePostProcessor::input(input_name)` method
|
||||
To address particular input of a model/preprocessor, use the `ov::preprocess::PrePostProcessor::input(input_name)` method.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -81,16 +81,16 @@ To address particular input of model/preprocessor, use `ov::preprocess::PrePostP
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
Here we've specified all information about user's input:
|
||||
- Precision is U8 (unsigned 8-bit integer)
|
||||
- Data represents tensor with {1,480,640,3} shape
|
||||
- [Layout](./layout_overview.md) is "NHWC". It means that 'height=480, width=640, channels=3'
|
||||
- Color format is `BGR`
|
||||
The code above specifies the following information about the user input (see the sketch after this list):
|
||||
- Precision is `U8` (unsigned 8-bit integer).
|
||||
- Data represents tensor with the `{1,480,640,3}` shape.
|
||||
- [Layout](./layout_overview.md) is "NHWC". It means: `height=480`, `width=640`, `channels=3`.
|
||||
- Color format is `BGR`.
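A hedged sketch of this declaration, assuming an input named `input_name` (a placeholder):

```cpp
#include <openvino/openvino.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>

void declare_user_format(std::shared_ptr<ov::Model>& model) {
    ov::preprocess::PrePostProcessor ppp(model);
    ppp.input("input_name").tensor()                          // placeholder input name
        .set_element_type(ov::element::u8)                    // U8 buffer
        .set_shape({1, 480, 640, 3})                          // actual data shape
        .set_layout("NHWC")                                   // height=480, width=640, channels=3
        .set_color_format(ov::preprocess::ColorFormat::BGR);  // BGR channel order
    // preprocessing steps and model layout would be declared here before ppp.build()
}
```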
|
||||
|
||||
@anchor declare_model_s_layout
|
||||
### Declare model's layout
|
||||
### Declaring Model Layout
|
||||
|
||||
Model's input already has information about precision and shape. Preprocessing API is not intended to modify this. The only thing that may be specified is input's data [layout](./layout_overview.md)
|
||||
The model input already has information about precision and shape. The Preprocessing API is not intended to modify this. The only thing that may be specified is the input data [layout](./layout_overview.md).
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -109,11 +109,11 @@ Model's input already has information about precision and shape. Preprocessing A
|
||||
@endsphinxtabset
|
||||
|
||||
|
||||
Now, if model's input has `{1,3,224,224}` shape, preprocessing will be able to identify that model's `height=224`, `width=224`, `channels=3`. Height/width information is necessary for 'resize', and `channels` is needed for mean/scale normalization
|
||||
Now, if the model input has `{1,3,224,224}` shape, preprocessing will be able to identify the `height=224`, `width=224`, and `channels=3` of that model. The `height`/`width` information is necessary for `resize`, and `channels` is needed for mean/scale normalization.
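A minimal sketch of this declaration, continuing the hypothetical `ppp` object and placeholder input name from the previous sketch:

```cpp
void declare_model_layout(ov::preprocess::PrePostProcessor& ppp) {
    // The model consumes planar data: batch, channels, height, width
    ppp.input("input_name").model().set_layout("NCHW");
}
```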
|
||||
|
||||
### Preprocessing steps
|
||||
### Preprocessing Steps
|
||||
|
||||
Now we can define sequence of preprocessing steps:
|
||||
Now, the sequence of preprocessing steps can be defined:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -131,17 +131,18 @@ Now we can define sequence of preprocessing steps:
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
Here:
|
||||
- Convert U8 to FP32 precision
|
||||
- Convert current color format (BGR) to RGB
|
||||
- Resize to model's height/width. **Note** that if model accepts dynamic size, e.g. {?, 3, ?, ?}, `resize` will not know how to resize the picture, so in this case you should specify target height/width on this step. See also <code>ov::preprocess::PreProcessSteps::resize()</code>
|
||||
- Subtract mean from each channel. On this step, color format is RGB already, so `100.5` will be subtracted from each Red component, and `101.5` will be subtracted from `Blue` one.
|
||||
- Divide each pixel data to appropriate scale value. In this example each `Red` component will be divided by 50, `Green` by 51, `Blue` by 52 respectively
|
||||
- **Note:** last `convert_layout` step is commented out as it is not necessary to specify last layout conversion. PrePostProcessor will do such conversion automatically
|
||||
The code above performs the following (see the sketch after this list):
|
||||
|
||||
### Integrate steps into model
|
||||
1. Convert `U8` to `FP32` precision.
|
||||
2. Convert current color format from `BGR` to `RGB`.
|
||||
3. Resize to the `height`/`width` of the model. Be aware that if a model accepts a dynamic size, e.g. `{?, 3, ?, ?}`, `resize` will not know how to resize the picture. Therefore, in this case, the target `height`/`width` should be specified at this step. For more details, see `ov::preprocess::PreProcessSteps::resize()`.
|
||||
4. Subtract mean from each channel. In this step, color format is already `RGB`, so `100.5` will be subtracted from each `Red` component, and `101.5` will be subtracted from each `Blue` one.
|
||||
5. Divide each pixel by the appropriate scale value. In this example, each `Red` component will be divided by 50, `Green` by 51, and `Blue` by 52.
|
||||
6. Keep in mind that the last `convert_layout` step is commented out as it is not necessary to specify the last layout conversion. The `PrePostProcessor` will do such conversion automatically.
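A hedged sketch of the whole chain described above; the input name and the `Green` mean value are placeholders, while the remaining values come from the description:

```cpp
#include <openvino/openvino.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>

void declare_steps(std::shared_ptr<ov::Model>& model) {
    ov::preprocess::PrePostProcessor ppp(model);
    ppp.input("input_name").tensor()                               // placeholder input name
        .set_element_type(ov::element::u8)
        .set_shape({1, 480, 640, 3})
        .set_layout("NHWC")
        .set_color_format(ov::preprocess::ColorFormat::BGR);
    ppp.input("input_name").model().set_layout("NCHW");
    ppp.input("input_name").preprocess()
        .convert_element_type(ov::element::f32)                    // 1. U8 -> FP32
        .convert_color(ov::preprocess::ColorFormat::RGB)           // 2. BGR -> RGB
        .resize(ov::preprocess::ResizeAlgorithm::RESIZE_LINEAR)    // 3. to model height/width
        .mean({100.5f, 101.0f, 101.5f})                            // 4. Green value is a placeholder
        .scale({50.f, 51.f, 52.f});                                // 5. per-channel scale
    // 6. No explicit convert_layout("NCHW") is needed: build() inserts it automatically
    model = ppp.build();
}
```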
|
||||
|
||||
We've finished with preprocessing steps declaration, now it is time to build it. For debugging purposes it is possible to print `PrePostProcessor` configuration on screen:
|
||||
### Integrating Steps into a Model
|
||||
|
||||
Once the preprocessing steps have been declared, the model can be built. For debugging purposes, it is possible to print the `PrePostProcessor` configuration:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -160,11 +161,11 @@ We've finished with preprocessing steps declaration, now it is time to build it.
|
||||
@endsphinxtabset
|
||||
|
||||
|
||||
After this, `model` will accept U8 input with `{1, 480, 640, 3}` shape, with `BGR` channels order. All conversion steps will be integrated into execution graph. Now you can load model on device and pass your image to model as is, without any data manipulation on application's side
|
||||
The `model` will accept `U8` input with the `{1, 480, 640, 3}` shape and the `BGR` channel order. All conversion steps will be integrated into the execution graph. Now, the model can be loaded on the device and the image can be passed to it without any data manipulation in the application.
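A hedged sketch of dumping the configuration, building the model, and compiling it (the device name is a placeholder):

```cpp
#include <iostream>
#include <openvino/openvino.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>

void finalize(ov::Core& core, std::shared_ptr<ov::Model>& model,
              ov::preprocess::PrePostProcessor& ppp) {
    std::cout << "Preprocessor configuration: " << ppp << std::endl;  // debug dump
    model = ppp.build();                                 // integrate steps into the graph
    auto compiled = core.compile_model(model, "CPU");    // placeholder device name
    // The compiled model now accepts {1, 480, 640, 3} U8 BGR tensors directly
}
```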
|
||||
|
||||
|
||||
## See Also
|
||||
## Additional Resources
|
||||
|
||||
* [Preprocessing Details](./preprocessing_details.md)
|
||||
* [Layout API overview](./layout_overview.md)
|
||||
* [Preprocessing Details](@ref openvino_docs_OV_UG_Preprocessing_Details)
|
||||
* [Layout API overview](@ref openvino_docs_OV_UG_Layout_Overview)
|
||||
* <code>ov::preprocess::PrePostProcessor</code> C++ class documentation
|
||||
|
@ -1,21 +1,20 @@
|
||||
# Use Case - Integrate and Save Preprocessing Steps Into IR {#openvino_docs_OV_UG_Preprocess_Usecase_save}
|
||||
|
||||
## Introduction
|
||||
|
||||
In previous sections we've covered how to add [preprocessing steps](./preprocessing_details.md) and got the overview of [Layout](./layout_overview.md) API.
|
||||
Previous sections covered the topic of the [preprocessing steps](@ref openvino_docs_OV_UG_Preprocessing_Details) and the overview of [Layout](@ref openvino_docs_OV_UG_Layout_Overview) API.
|
||||
|
||||
For many applications it is also important to minimize model's read/load time, so performing integration of preprocessing steps every time on application startup after `ov::runtime::Core::read_model` may look not convenient. In such cases, after adding of Pre- and Post-processing steps it can be useful to store new execution model to Intermediate Representation (IR, .xml format).
|
||||
For many applications, it is also important to minimize the read/load time of a model. Therefore, integrating the preprocessing steps on every application startup, after `ov::runtime::Core::read_model`, may be inconvenient. In such cases, once the pre- and post-processing steps have been added, it can be useful to store the new execution model in the OpenVINO Intermediate Representation (OpenVINO IR, `.xml` format).
|
||||
|
||||
Most part of existing preprocessing steps can also be performed via command line options using Model Optimizer tool. Refer to [Model Optimizer - Optimize Preprocessing Computation](../MO_DG/prepare_model/Additional_Optimizations.md) for details os such command line options.
|
||||
Most available preprocessing steps can also be performed via command-line options, using Model Optimizer. For details on such command-line options, refer to the [Optimizing Preprocessing Computation](../MO_DG/prepare_model/Additional_Optimizations.md).
|
||||
|
||||
## Code example - saving model with preprocessing to IR
|
||||
## Code example - Saving Model with Preprocessing to OpenVINO IR
|
||||
|
||||
In case if you have some preprocessing steps which can't be integrated into execution graph using Model Optimizer command line options (e.g. `YUV->RGB` color space conversion, Resize, etc.) it is possible to write simple code which:
|
||||
- Reads original model (IR, ONNX, Paddle)
|
||||
- Adds preprocessing/postprocessing steps
|
||||
- Saves resulting model as IR (.xml/.bin)
|
||||
When some preprocessing steps cannot be integrated into the execution graph using Model Optimizer command-line options (for example, `YUV`->`RGB` color space conversion, `Resize`, etc.), it is possible to write simple code that:
|
||||
- Reads the original model (OpenVINO IR, ONNX, PaddlePaddle).
|
||||
- Adds the preprocessing/postprocessing steps.
|
||||
- Saves resulting model as IR (`.xml` and `.bin`).
|
||||
|
||||
Let's consider the example, there is an original `ONNX` model which takes one `float32` input with shape `{1, 3, 224, 224}` with `RGB` channels order, with mean/scale values applied. User's application can provide `BGR` image buffer with not fixed size. Additionally, we'll also imagine that our application provides input images as batches, each batch contains 2 images. Here is how model conversion code may look like in your model preparation script
|
||||
Consider an example where an original ONNX model takes one `float32` input with the `{1, 3, 224, 224}` shape, the `RGB` channel order, and mean/scale values applied. In contrast, the application provides a `BGR` image buffer with a non-fixed size and supplies input images as batches of two. Below is the model conversion code that can be applied in the model preparation script for such a case.
|
||||
|
||||
- Includes / Imports
|
||||
|
||||
@ -35,7 +34,7 @@ Let's consider the example, there is an original `ONNX` model which takes one `f
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
- Preprocessing & Saving to IR code
|
||||
- Preprocessing & Saving to OpenVINO IR code
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -54,9 +53,9 @@ Let's consider the example, there is an original `ONNX` model which takes one `f
|
||||
@endsphinxtabset
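A hedged sketch of such a preparation script for the scenario described above; file names are placeholders, and the exact calls may differ from the official snippet:

```cpp
#include <openvino/openvino.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>
#include <openvino/pass/serialize.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.onnx");                  // original ONNX model

    ov::preprocess::PrePostProcessor ppp(model);
    ppp.input().tensor()
        .set_element_type(ov::element::u8)
        .set_layout("NHWC")
        .set_color_format(ov::preprocess::ColorFormat::BGR)
        .set_spatial_dynamic_shape();                            // image size is not fixed
    ppp.input().model().set_layout("NCHW");
    ppp.input().preprocess()
        .convert_element_type(ov::element::f32)
        .convert_color(ov::preprocess::ColorFormat::RGB)
        .resize(ov::preprocess::ResizeAlgorithm::RESIZE_LINEAR);
    model = ppp.build();

    ov::set_batch(model, 2);                                     // application sends batches of 2

    ov::pass::Serialize("model_with_preproc.xml", "model_with_preproc.bin")
        .run_on_model(model);                                    // save the new model as IR
    return 0;
}
```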
|
||||
|
||||
|
||||
## Application code - load model to target device
|
||||
## Application Code - Load Model to Target Device
|
||||
|
||||
After this, your application's code can load saved file and don't perform preprocessing anymore. In this example we'll also enable [model caching](./Model_caching_overview.md) to minimize load time when cached model is available
|
||||
After this, the application code can load the saved file and no longer needs to perform preprocessing itself. In this example, [model caching](./Model_caching_overview.md) is also enabled to minimize load time when the cached model is available.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -75,11 +74,11 @@ After this, your application's code can load saved file and don't perform prepro
|
||||
@endsphinxtabset
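A hedged sketch of this application code, assuming the model saved in the previous step (paths and device name are placeholders):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    core.set_property(ov::cache_dir("model_cache"));   // enable model caching
    // No preprocessing code is needed here: it is already inside the saved model
    auto compiled = core.compile_model("model_with_preproc.xml", "CPU");
    ov::InferRequest request = compiled.create_infer_request();
    // Fill the input tensor with raw BGR frames and run inference as usual
    return 0;
}
```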
|
||||
|
||||
|
||||
## See Also
|
||||
* [Preprocessing Details](./preprocessing_details.md)
|
||||
* [Layout API overview](./layout_overview.md)
|
||||
## Additional Resources
|
||||
* [Preprocessing Details](@ref openvino_docs_OV_UG_Preprocessing_Details)
|
||||
* [Layout API overview](@ref openvino_docs_OV_UG_Layout_Overview)
|
||||
* [Model Optimizer - Optimize Preprocessing Computation](../MO_DG/prepare_model/Additional_Optimizations.md)
|
||||
* [Model Caching Overview](./Model_caching_overview.md)
|
||||
* <code>ov::preprocess::PrePostProcessor</code> C++ class documentation
|
||||
* <code>ov::pass::Serialize</code> - pass to serialize model to XML/BIN
|
||||
* <code>ov::set_batch</code> - update batch dimension for a given model
|
||||
* The `ov::preprocess::PrePostProcessor` C++ class documentation
|
||||
* The `ov::pass::Serialize` pass, used to serialize a model to XML/BIN.
|
||||
* The `ov::set_batch` function, used to update the batch dimension of a given model.
|
@ -1,27 +1,27 @@
|
||||
# Using Encrypted Models with OpenVINO™ {#openvino_docs_OV_UG_protecting_model_guide}
|
||||
# Using Encrypted Models with OpenVINO {#openvino_docs_OV_UG_protecting_model_guide}
|
||||
|
||||
Deploying deep-learning capabilities to edge devices can present security
|
||||
challenges, for example, ensuring inference integrity or providing copyright
|
||||
challenges, such as ensuring inference integrity or providing copyright
|
||||
protection of your deep-learning models.
|
||||
|
||||
One possible solution is to use cryptography to protect models as they are
|
||||
deployed and stored on edge devices. Model encryption, decryption and
|
||||
authentication are not provided by OpenVINO™ but can be implemented with
|
||||
third-party tools, like OpenSSL\*. While implementing encryption, ensure that
|
||||
you use the latest versions of tools and follow cryptography best practices.
|
||||
authentication are not provided by OpenVINO but can be implemented with
|
||||
third-party tools (e.g., OpenSSL). While implementing encryption, ensure that
|
||||
the latest versions of tools are used and cryptography best practices are followed.
|
||||
|
||||
This guide demonstrates how to use OpenVINO securely with protected models.
|
||||
This guide presents how to use OpenVINO securely with protected models.
|
||||
|
||||
## Secure Model Deployment
|
||||
|
||||
After a model is optimized by the OpenVINO Model Optimizer, it's deployed
|
||||
to target devices in the Intermediate Representation (IR) format. An optimized
|
||||
model is stored on an edge device and executed by the OpenVINO Runtime.
|
||||
(ONNX, PDPD models can also be read natively by the OpenVINO Runtime.)
|
||||
to target devices in the OpenVINO Intermediate Representation (OpenVINO IR) format. An optimized
|
||||
model is stored on an edge device and is executed by the OpenVINO Runtime.
|
||||
ONNX and PDPD models can be read natively by OpenVINO Runtime as well.
|
||||
|
||||
To protect deep-learning models, you can encrypt an optimized model before
|
||||
deploying it to the edge device. The edge device should keep the stored model
|
||||
protected at all times and have the model decrypted **in runtime only** for use
|
||||
Encrypting an optimized model before deploying it to the edge device can be
|
||||
used to protect deep-learning models. The edge device should keep the stored model
|
||||
protected all the time and have the model decrypted **in runtime only** for use
|
||||
by the OpenVINO Runtime.
|
||||
|
||||

|
||||
@ -35,12 +35,12 @@ For more information, see the `ov::Core` Class Reference Documentation.
|
||||
|
||||
@snippet snippets/protecting_model_guide.cpp part0
|
||||
|
||||
Hardware-based protection such as Intel® Software Guard Extensions
|
||||
(Intel® SGX) can be utilized to protect decryption operation secrets and
|
||||
bind them to a device. For more information, go to [Intel® Software Guard
|
||||
Hardware-based protection such as Intel Software Guard Extensions
|
||||
(Intel SGX) can be used to protect decryption operation secrets and
|
||||
bind them to a device. For more information, see the [Intel Software Guard
|
||||
Extensions](https://software.intel.com/en-us/sgx).
|
||||
|
||||
Use `ov::Core::read_model` to set model representations and
|
||||
Use the `ov::Core::read_model` method to set the model representation and
|
||||
weights respectively.
|
||||
|
||||
Currently there is no way to read external weights from memory for ONNX models.
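A hedged sketch of reading a decrypted model from memory buffers; `decrypt_file` is a hypothetical user-provided helper, not an OpenVINO API:

```cpp
#include <string>
#include <vector>
#include <openvino/openvino.hpp>

// Hypothetical helper: decrypts a file into a byte buffer (implementation is up to the user)
std::vector<uint8_t> decrypt_file(const std::string& path);

int main() {
    ov::Core core;
    std::vector<uint8_t> xml_buf = decrypt_file("model.xml.enc");  // placeholder paths
    std::vector<uint8_t> bin_buf = decrypt_file("model.bin.enc");

    std::string model_str(xml_buf.begin(), xml_buf.end());
    ov::Tensor weights(ov::element::u8, {bin_buf.size()}, bin_buf.data());

    auto model = core.read_model(model_str, weights);  // decrypted data never touches the disk
    return 0;
}
```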
|
||||
@ -51,9 +51,9 @@ should be called with `weights` passed as an empty `ov::Tensor`.
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- Intel® Distribution of OpenVINO™ toolkit home page: [https://software.intel.com/en-us/openvino-toolkit](https://software.intel.com/en-us/openvino-toolkit)
|
||||
- Model Optimizer Developer Guide: [Model Optimizer Developer Guide](../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md)
|
||||
- [OpenVINO™ runTime User Guide](openvino_intro.md)
|
||||
- Intel® Distribution of OpenVINO™ toolkit [home page](https://software.intel.com/en-us/openvino-toolkit).
|
||||
- Model Optimizer [Developer Guide](../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md).
|
||||
- [OpenVINO™ Runtime User Guide](openvino_intro.md).
|
||||
- For more information on Sample Applications, see the [OpenVINO Samples Overview](Samples_Overview.md).
|
||||
- For information on a set of pre-trained models, see the [Overview of OpenVINO™ Toolkit Pre-Trained Models](@ref omz_models_group_intel)
|
||||
- For IoT Libraries and Code Samples see the [Intel® IoT Developer Kit](https://github.com/intel-iot-devkit).
|
||||
- For information on a set of pre-trained models, see the [Overview of OpenVINO™ Toolkit Pre-Trained Models](@ref omz_models_group_intel).
|
||||
- For IoT Libraries and Code Samples, see the [Intel® IoT Developer Kit](https://github.com/intel-iot-devkit).
|
||||
|
@ -1,61 +1,58 @@
|
||||
# Arm® CPU device {#openvino_docs_OV_UG_supported_plugins_ARM_CPU}
|
||||
# Arm® CPU Device {#openvino_docs_OV_UG_supported_plugins_ARM_CPU}
|
||||
|
||||
|
||||
## Introducing the Arm® CPU Plugin
|
||||
The Arm® CPU plugin is developed to enable deep neural network inference on Arm® CPUs, using [Compute Library](https://github.com/ARM-software/ComputeLibrary) as a backend.
|
||||
|
||||
> **NOTE**: Note that this is a community-level add-on to OpenVINO™. Intel® welcomes community participation in the OpenVINO™ ecosystem and technical questions on community forums as well as code contributions are welcome. However, this component has not undergone full release validation or qualification from Intel®, and no official support is offered.
|
||||
> **NOTE**: This is a community-level add-on to OpenVINO™. Intel® welcomes community participation in the OpenVINO™ ecosystem; technical questions on community forums and code contributions are welcome. However, this component has not undergone full release validation or qualification from Intel®, hence no official support is offered.
|
||||
|
||||
The Arm® CPU plugin is not a part of the Intel® Distribution of OpenVINO™ toolkit and is not distributed in pre-built form. To use the plugin, it should be built from source code. Plugin build procedure is described on page [How to build Arm® CPU plugin](https://github.com/openvinotoolkit/openvino_contrib/wiki/How-to-build-ARM-CPU-plugin).
|
||||
The Arm® CPU plugin is not a part of the Intel® Distribution of OpenVINO™ toolkit and is not distributed in pre-built form. To use the plugin, build it from the source code. The build procedure is described in the [How to build Arm® CPU plugin](https://github.com/openvinotoolkit/openvino_contrib/wiki/How-to-build-ARM-CPU-plugin) guide.
|
||||
|
||||
The set of supported layers is defined on [Operation set specification](https://github.com/openvinotoolkit/openvino_contrib/wiki/ARM-plugin-operation-set-specification).
|
||||
The set of supported layers is defined on the [Op-set specification page](https://github.com/openvinotoolkit/openvino_contrib/wiki/ARM-plugin-operation-set-specification).
|
||||
|
||||
|
||||
## Supported inference data types
|
||||
## Supported Inference Data Types
|
||||
The Arm® CPU plugin supports the following data types as inference precision of internal primitives:
|
||||
|
||||
- Floating-point data types:
|
||||
- f32
|
||||
- f16
|
||||
- Quantized data types:
|
||||
- i8
|
||||
|
||||
|
||||
> **NOTE**: i8 support is experimental.
|
||||
- i8 (support is experimental)
|
||||
|
||||
[Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) can be used to print out supported data types for all detected devices.
|
||||
|
||||
## Supported features
|
||||
## Supported Features
|
||||
|
||||
### Preprocessing acceleration
|
||||
**Preprocessing Acceleration**
|
||||
The Arm® CPU plugin supports the following accelerated preprocessing operations:
|
||||
- Precision conversion:
|
||||
- u8 -> u16, s16, s32
|
||||
- u16 -> u8, u32
|
||||
- s16 -> u8, s32
|
||||
- f16 -> f32
|
||||
- Transposion of tensors with dims < 5
|
||||
- Transposition of tensors with dims < 5
|
||||
- Interpolation of 4D tensors with no padding (`pads_begin` and `pads_end` equal 0).
|
||||
|
||||
The Arm® CPU plugin supports the following preprocessing operations; however, they are not accelerated:
|
||||
- Precision conversion that are not mentioned above
|
||||
- Precision conversions not mentioned above
|
||||
- Color conversion:
|
||||
- NV12 to RGB
|
||||
- NV12 to BGR
|
||||
  - I420 to RGB
|
||||
  - I420 to BGR
|
||||
|
||||
See [preprocessing API guide](../preprocessing_overview.md) for more details.
|
||||
For more details, see the [preprocessing API guide](../preprocessing_overview.md).
|
||||
|
||||
## Supported properties
|
||||
## Supported Properties
|
||||
The plugin supports the properties listed below.
|
||||
|
||||
### Read-write properties
|
||||
All parameters must be set before calling `ov::Core::compile_model()` in order to take effect or passed as additional argument to `ov::Core::compile_model()`
|
||||
**Read-write Properties**
|
||||
In order to take effect, all parameters must be set before calling `ov::Core::compile_model()`, or passed as an additional argument to `ov::Core::compile_model()`:
|
||||
|
||||
- ov::enable_profiling
|
||||
|
||||
### Read-only properties
|
||||
**Read-only Properties**
|
||||
- ov::supported_properties
|
||||
- ov::available_devices
|
||||
- ov::range_for_async_infer_requests
|
||||
@ -65,27 +62,27 @@ All parameters must be set before calling `ov::Core::compile_model()` in order t
|
||||
|
||||
|
||||
## Known Layers Limitation
|
||||
* `AvgPool` layer is supported via arm_compute library for 4D input tensor and via reference implementation for another cases.
|
||||
* `BatchToSpace` layer is supported 4D tensors only and constant nodes: `block_shape` with `N` = 1 and `C`= 1, `crops_begin` with zero values and `crops_end` with zero values.
|
||||
* `ConvertLike` layer is supported configuration like `Convert`.
|
||||
* `DepthToSpace` layer is supported 4D tensors only and for `BLOCKS_FIRST` of `mode` attribute.
|
||||
* `AvgPool` layer is supported via the arm_compute library for 4D input tensors and via reference implementation for other cases.
|
||||
* `BatchToSpace` layer is supported only for 4D tensors with constant nodes: `block_shape` with `N` = 1 and `C` = 1, `crops_begin` with zero values, and `crops_end` with zero values.
|
||||
* `ConvertLike` layer is supported for the same configurations as `Convert`.
|
||||
* `DepthToSpace` layer is supported only for 4D tensors and only for the `BLOCKS_FIRST` value of the `mode` attribute.
|
||||
* `Equal` does not support `broadcast` for inputs.
|
||||
* `Gather` layer is supported constant scalar or 1D indices axes only. Layer is supported as via arm_compute library for non negative indices and via reference implementation otherwise.
|
||||
* `Gather` layer is supported for constant scalar or 1D indices axes only. The layer is supported via the arm_compute library for non-negative indices and via reference implementation otherwise.
|
||||
* `Less` does not support `broadcast` for inputs.
|
||||
* `LessEqual` does not support `broadcast` for inputs.
|
||||
* `LRN` layer is supported `axes = {1}` or `axes = {2, 3}` only.
|
||||
* `MaxPool-1` layer is supported via arm_compute library for 4D input tensor and via reference implementation for another cases.
|
||||
* `LRN` layer is supported for `axes = {1}` or `axes = {2, 3}` only.
|
||||
* `MaxPool-1` layer is supported via the arm_compute library for 4D input tensors and via reference implementation for other cases.
|
||||
* `Mod` layer is supported for f32 only.
|
||||
* `MVN` layer is supported via arm_compute library for 2D inputs and `false` value of `normalize_variance` and `false` value of `across_channels`, for another cases layer is implemented via runtime reference.
|
||||
* `Normalize` layer is supported via arm_compute library with `MAX` value of `eps_mode` and `axes = {2 | 3}`, and for `ADD` value of `eps_mode` layer uses `DecomposeNormalizeL2Add`, for another cases layer is implemented via runtime reference.
|
||||
* `MVN` layer is supported via the arm_compute library for 2D inputs with the `false` value of both `normalize_variance` and `across_channels`; for other cases, the layer is implemented via runtime reference.
|
||||
* `Normalize` layer is supported via the arm_compute library with the `MAX` value of `eps_mode` and `axes = {2 | 3}`; for the `ADD` value of `eps_mode`, the layer uses `DecomposeNormalizeL2Add`. For other cases, the layer is implemented via runtime reference.
|
||||
* `NotEqual` does not support `broadcast` for inputs.
|
||||
* `Pad` layer works with `pad_mode = {REFLECT | CONSTANT | SYMMETRIC}` parameters only.
|
||||
* `Round` layer is supported via arm_compute library with `RoundMode::HALF_AWAY_FROM_ZERO` value of `mode`, for another cases layer is implemented via runtime reference.
|
||||
* `SpaceToBatch` layer is supported 4D tensors only and constant nodes: `shapes`, `pads_begin` or `pads_end` with zero paddings for batch or channels and one values `shapes` for batch and channels.
|
||||
* `SpaceToDepth` layer is supported 4D tensors only and for `BLOCKS_FIRST` of `mode` attribute.
|
||||
* `StridedSlice` layer is supported via arm_compute library for tensors with dims < 5 and zero values of `ellipsis_mask` or zero values of `new_axis_mask` and `shrink_axis_mask`, for another cases layer is implemented via runtime reference.
|
||||
* `FakeQuantize` layer is supported via arm_compute library in Low Precision evaluation mode for suitable models and via runtime reference otherwise.
|
||||
* `Round` layer is supported via the arm_compute library with the `RoundMode::HALF_AWAY_FROM_ZERO` value of `mode`; for other cases, the layer is implemented via runtime reference.
|
||||
* `SpaceToBatch` layer is supported only for 4D tensors with constant nodes: `shapes`, `pads_begin`, or `pads_end` with zero paddings for batch or channels, and `shapes` values of one for batch and channels.
|
||||
* `SpaceToDepth` layer is supported only for 4D tensors and only for the `BLOCKS_FIRST` value of the `mode` attribute.
|
||||
* `StridedSlice` layer is supported via the arm_compute library for tensors with dims < 5 and zero values of `ellipsis_mask`, or zero values of `new_axis_mask` and `shrink_axis_mask`. For other cases, the layer is implemented via runtime reference.
|
||||
* `FakeQuantize` layer is supported via arm_compute library, in Low Precision evaluation mode for suitable models, and via runtime reference otherwise.
|
||||
|
||||
## See Also
|
||||
* [How to run YOLOv4 model inference using OpenVINO™ and OpenCV on Arm®](https://opencv.org/how-to-run-yolov4-using-openvino-and-opencv-on-arm/)
|
||||
* [Face recognition on Android™ using OpenVINO™ toolkit with Arm® plugin](https://opencv.org/face-recognition-on-android-using-openvino-toolkit-with-arm-plugin/)
|
||||
## Additional Resources
|
||||
* [How to run YOLOv4 model inference using OpenVINO™ and OpenCV on Arm®](https://opencv.org/how-to-run-yolov4-using-openvino-and-opencv-on-arm/).
|
||||
* [Face recognition on Android™ using OpenVINO™ toolkit with Arm® plugin](https://opencv.org/face-recognition-on-android-using-openvino-toolkit-with-arm-plugin/).
|
||||
|
@ -1,17 +1,18 @@
|
||||
# CPU device {#openvino_docs_OV_UG_supported_plugins_CPU}
|
||||
# CPU Device {#openvino_docs_OV_UG_supported_plugins_CPU}
|
||||
|
||||
The CPU plugin is a part of the Intel® Distribution of OpenVINO™ toolkit and is developed to achieve high performance inference of neural networks on Intel® x86-64 CPUs.
|
||||
For an in-depth description of the plugin, see:
|
||||
The CPU plugin is a part of the Intel® Distribution of OpenVINO™ toolkit. It is developed to achieve high-performance inference of neural networks on Intel® x86-64 CPUs.
|
||||
For an in-depth description of the CPU plugin, see:
|
||||
|
||||
- [CPU plugin developers documentation](https://github.com/openvinotoolkit/openvino/wiki/CPUPluginDevelopersDocs)
|
||||
- [CPU plugin developers documentation](https://github.com/openvinotoolkit/openvino/wiki/CPUPluginDevelopersDocs).
|
||||
|
||||
- [OpenVINO Runtime CPU plugin source files](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_cpu/)
|
||||
- [OpenVINO Runtime CPU plugin source files](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_cpu/).
|
||||
|
||||
|
||||
## Device name
|
||||
The CPU device plugin uses the label of `"CPU"` and is the only device of this kind, even if multiple sockets are present on the platform.
|
||||
|
||||
## Device Name
|
||||
The `CPU` device name is used for the CPU plugin. Even though there can be more than one physical socket on a platform, only one device of this kind is listed by OpenVINO.
|
||||
On multi-socket platforms, load balancing and memory usage distribution between NUMA nodes are handled automatically.
|
||||
In order to use CPU for inference the device name should be passed to the `ov::Core::compile_model()` method:
|
||||
In order to use CPU for inference, the device name should be passed to the `ov::Core::compile_model()` method:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -25,8 +26,8 @@ In order to use CPU for inference the device name should be passed to the `ov::C
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
## Supported inference data types
|
||||
The CPU device plugin supports the following data types as inference precision of internal primitives:
|
||||
## Supported Inference Data Types
|
||||
The CPU plugin supports the following data types as inference precision of internal primitives:
|
||||
|
||||
- Floating-point data types:
|
||||
- f32
|
||||
@ -38,30 +39,30 @@ The CPU device plugin supports the following data types as inference precision o
|
||||
- i8
|
||||
- u1
|
||||
|
||||
[Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) can be used to print out the supported data types for all detected devices.
|
||||
[Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) can be used to print out supported data types for all detected devices.
|
||||
|
||||
### Quantized data type specifics
|
||||
### Quantized Data Types Specifics
|
||||
|
||||
Selected precision of each primitive depends on the operation precision in IR, quantization primitives, and available hardware capabilities.
|
||||
u1/u8/i8 data types are used for quantized operations only, i.e. those are not selected automatically for non-quantized operations.
|
||||
The `u1/u8/i8` data types are used for quantized operations only, i.e., they are not selected automatically for non-quantized operations.
|
||||
|
||||
See the [low-precision optimization guide](@ref openvino_docs_model_optimization_guide) for more details on how to get a quantized model.
|
||||
|
||||
> **NOTE**: Platforms that do not support Intel® AVX512-VNNI have a known "saturation issue" which in some cases leads to reduced computational accuracy for u8/i8 precision calculations.
|
||||
> See the [saturation (overflow) issue section](@ref pot_saturation_issue) to get more information on how to detect such issues and find possible workarounds.
|
||||
> **NOTE**: Platforms that do not support Intel® AVX512-VNNI have a known "saturation issue" that may lead to reduced computational accuracy for `u8/i8` precision calculations.
|
||||
> See the [saturation (overflow) issue section](@ref pot_saturation_issue) to get more information on how to detect such issues and possible workarounds.
|
||||
|
||||
### Floating point data type specifics
|
||||
### Floating Point Data Types Specifics
|
||||
|
||||
The default floating-point precision of a CPU primitive is f32. To support f16 IRs, the plugin internally converts all the f16 values to f32 and all the calculations are performed using the native f32 precision.
|
||||
On platforms that natively support bfloat16 calculations (have AVX512_BF16 extension), the bf16 type is automatically used instead of f32 to achieve better performance, thus no special steps are required to run a model with bf16 precision.
|
||||
See the [BFLOAT16 – Hardware Numerics Definition white paper](https://software.intel.com/content/dam/develop/external/us/en/documents/bf16-hardware-numerics-definition-white-paper.pdf) for more details about bfloat16.
|
||||
The default floating-point precision of a CPU primitive is `f32`. To support the `f16` OpenVINO IR, the plugin internally converts all the `f16` values to `f32`, and all the calculations are performed using the native `f32` precision.
|
||||
On platforms that natively support `bfloat16` calculations (have the `AVX512_BF16` extension), the `bf16` type is automatically used instead of `f32` to achieve better performance. Thus, no special steps are required to run a `bf16` model.
|
||||
For more details about the `bfloat16` format, see the [BFLOAT16 – Hardware Numerics Definition white paper](https://software.intel.com/content/dam/develop/external/us/en/documents/bf16-hardware-numerics-definition-white-paper.pdf).
|
||||
|
||||
Using bf16 provides the following performance benefits:
|
||||
Using the `bf16` precision provides the following performance benefits:
|
||||
|
||||
- Faster multiplication of two bfloat16 numbers because of shorter mantissa of bfloat16 data.
|
||||
- Reduced memory consumption since bfloat16 data is half the size of 32-bit float.
|
||||
- Faster multiplication of two `bfloat16` numbers because of shorter mantissa of the `bfloat16` data.
|
||||
- Reduced memory consumption, since `bfloat16` data is half the size of 32-bit float.
|
||||
|
||||
To check if the CPU device can support the bfloat16 data type use the [query device properties interface](./config_properties.md) to query ov::device::capabilities property, which should contain `BF16` in the list of CPU capabilities:
|
||||
To check if the CPU device can support the `bfloat16` data type, use the [query device properties interface](./config_properties.md) to query the `ov::device::capabilities` property, which should contain `BF16` in the list of CPU capabilities:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -75,11 +76,11 @@ To check if the CPU device can support the bfloat16 data type use the [query dev
|
||||
|
||||
@endsphinxtabset
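A hedged sketch of such a check (the output formatting is illustrative):

```cpp
#include <algorithm>
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto capabilities = core.get_property("CPU", ov::device::capabilities);
    bool bf16_supported =
        std::find(capabilities.begin(), capabilities.end(), "BF16") != capabilities.end();
    std::cout << "Native BF16 support: " << std::boolalpha << bf16_supported << std::endl;
    return 0;
}
```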
|
||||
|
||||
If the model has been converted to bf16, ov::hint::inference_precision is set to ov::element::bf16 and can be checked via ov::CompiledModel::get_property call. The code below demonstrates how to get the element type:
|
||||
If the model has been converted to `bf16`, the `ov::hint::inference_precision` is set to `ov::element::bf16` and can be checked via the `ov::CompiledModel::get_property` call. The code below demonstrates how to get the element type:
|
||||
|
||||
@snippet snippets/cpu/Bfloat16Inference1.cpp part1
|
||||
|
||||
To infer the model in f32 instead of bf16 on targets with native bf16 support, set the ov::hint::inference_precision to ov::element::f32.
|
||||
To infer the model in `f32` precision instead of `bf16` on targets with native `bf16` support, set the `ov::hint::inference_precision` to `ov::element::f32`.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -93,18 +94,18 @@ To infer the model in f32 instead of bf16 on targets with native bf16 support, s
|
||||
|
||||
@endsphinxtabset
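For example, a minimal sketch of forcing `f32` execution (the helper name is a placeholder):

```cpp
#include <openvino/openvino.hpp>

ov::CompiledModel compile_in_f32(ov::Core& core, const std::shared_ptr<ov::Model>& model) {
    // Disable bf16 execution even on platforms with native support
    return core.compile_model(model, "CPU",
                              ov::hint::inference_precision(ov::element::f32));
}
```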
|
||||
|
||||
Bfloat16 software simulation mode is available on CPUs with Intel® AVX-512 instruction set which does not support the native `avx512_bf16` instruction. This mode is used for development purposes and it does not guarantee good performance.
|
||||
To enable the simulation, you have to explicitly set ov::hint::inference_precision to ov::element::bf16.
|
||||
The `bfloat16` software simulation mode is available on CPUs with the Intel® AVX-512 instruction set that do not support the native `avx512_bf16` instruction. This mode is used for development purposes and does not guarantee good performance.
|
||||
To enable the simulation, the `ov::hint::inference_precision` has to be explicitly set to `ov::element::bf16`.
|
||||
|
||||
> **NOTE**: An exception is thrown if ov::hint::inference_precision is set to ov::element::bf16 on a CPU without native bfloat16 support or bfloat16 simulation mode.
|
||||
> **NOTE**: If ov::hint::inference_precision is set to ov::element::bf16 on a CPU without native bfloat16 support or bfloat16 simulation mode, an exception is thrown.
|
||||
|
||||
> **NOTE**: Due to the reduced mantissa size of the bfloat16 data type, the resulting bf16 inference accuracy may differ from the f32 inference, especially for models that were not trained using the bfloat16 data type. If the bf16 inference accuracy is not acceptable, it is recommended to switch to the f32 precision.
|
||||
> **NOTE**: Due to the reduced mantissa size of the `bfloat16` data type, the resulting `bf16` inference accuracy may differ from the `f32` inference, especially for models that were not trained using the `bfloat16` data type. If the `bf16` inference accuracy is not acceptable, it is recommended to switch to the `f32` precision.
|
||||
|
||||
## Supported features
|
||||
## Supported Features
|
||||
|
||||
### Multi-device execution
|
||||
If a machine has OpenVINO-supported devices other than the CPU (for example an integrated GPU), then any supported model can be executed on CPU and all the other devices simultaneously.
|
||||
This can be achieved by specifying `"MULTI:CPU,GPU.0"` as a target device in case of simultaneous usage of CPU and GPU.
|
||||
### Multi-device Execution
|
||||
If a system includes OpenVINO-supported devices other than the CPU (e.g. an integrated GPU), then any supported model can be executed on all the devices simultaneously.
|
||||
This can be achieved by specifying `MULTI:CPU,GPU.0` as a target device in case of simultaneous usage of CPU and GPU.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -118,27 +119,28 @@ This can be achieved by specifying `"MULTI:CPU,GPU.0"` as a target device in cas
|
||||
|
||||
@endsphinxtabset
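A minimal sketch, assuming both a CPU and an integrated GPU are present (the helper name is a placeholder):

```cpp
#include <openvino/openvino.hpp>

ov::CompiledModel compile_on_cpu_and_gpu(ov::Core& core,
                                         const std::shared_ptr<ov::Model>& model) {
    // Run the model on the CPU and the first GPU simultaneously
    return core.compile_model(model, "MULTI:CPU,GPU.0");
}
```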
|
||||
|
||||
See [Multi-device execution page](../multi_device.md) for more details.
|
||||
For more details, see the [Multi-device execution](../multi_device.md) article.
|
||||
|
||||
### Multi-stream execution
|
||||
If either `ov::num_streams(n_streams)` with `n_streams > 1` or the `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` property is set for the CPU plugin, multiple streams are created for the model. In the case of the CPU plugin, each stream has its own host thread, which means that incoming infer requests can be processed simultaneously.
|
||||
### Multi-stream Execution
|
||||
If either the `ov::num_streams(n_streams)` property with `n_streams > 1` or the `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` property is set for the CPU plugin,
|
||||
then multiple streams are created for the model. In the case of the CPU plugin, each stream has its own host thread, which means that incoming infer requests can be processed simultaneously.
|
||||
Each stream is pinned to its own group of physical cores, with NUMA-aware physical memory usage, to minimize the overhead of data transfer between NUMA nodes.
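A hedged sketch of enabling multi-stream execution via the throughput hint; an explicit stream count is shown in the comment as an alternative:

```cpp
#include <openvino/openvino.hpp>

ov::CompiledModel compile_for_throughput(ov::Core& core,
                                         const std::shared_ptr<ov::Model>& model) {
    // Let the plugin pick the number of streams for maximum throughput
    return core.compile_model(model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    // Alternatively: core.compile_model(model, "CPU", ov::num_streams(4));
}
```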
|
||||
|
||||
See [optimization guide](@ref openvino_docs_deployment_optimization_guide_dldt_optimization_guide) for more details.
|
||||
For more details, see the [optimization guide](@ref openvino_docs_deployment_optimization_guide_dldt_optimization_guide).
|
||||
|
||||
> **NOTE**: When it comes to latency, keep in mind that running only one stream on a multi-socket platform may introduce additional overheads on data transfer between NUMA nodes.
|
||||
> In that case it is better to use ov::hint::PerformanceMode::LATENCY performance hint (please see [performance hints overview](@ref openvino_docs_OV_UG_Performance_Hints) for details).
|
||||
> **NOTE**: When it comes to latency, be aware that running only one stream on a multi-socket platform may introduce additional overhead on data transfer between NUMA nodes.
|
||||
> In that case it is better to use the `ov::hint::PerformanceMode::LATENCY` performance hint. For more details see the [performance hints](@ref openvino_docs_OV_UG_Performance_Hints) overview.
|
||||
|
||||
### Dynamic shapes
|
||||
The CPU device plugin provides full functional support for models with dynamic shapes in terms of the opset coverage.
|
||||
### Dynamic Shapes
|
||||
The CPU plugin provides full functional support for models with dynamic shapes in terms of the opset coverage.
|
||||
|
||||
> **NOTE**: CPU does not support tensors with a dynamically changing rank. If you try to infer a model with such tensors, an exception will be thrown.
|
||||
> **NOTE**: The CPU plugin does not support tensors with dynamically changing rank. In case of an attempt to infer a model with such tensors, an exception will be thrown.
|
||||
|
||||
Dynamic shapes support introduces additional overhead on memory management and may limit internal runtime optimizations.
|
||||
The more degrees of freedom are used, the more difficult it is to achieve the best performance.
|
||||
The most flexible configuration and the most convenient approach is the fully undefined shape, where no constraints to the shape dimensions are applied.
|
||||
But reducing the level of uncertainty brings gains in performance.
|
||||
You can reduce memory consumption through memory reuse and achieve better cache locality, leading to better inference performance, if you explicitly set dynamic shapes with defined upper bounds.
|
||||
The most flexible configuration, and the most convenient approach, is the fully undefined shape, which means that no constraints to the shape dimensions are applied.
|
||||
However, reducing the level of uncertainty results in performance gains.
|
||||
You can reduce memory consumption through memory reuse, achieving better cache locality and increasing inference performance. To do so, set dynamic shapes explicitly, with defined upper bounds.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -153,7 +155,7 @@ You can reduce memory consumption through memory reuse and achieve better cache
|
||||
@endsphinxtabset
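A hedged sketch of setting upper-bounded dynamic shapes for a model with a single 4D input (the bounds are illustrative):

```cpp
#include <openvino/openvino.hpp>

void set_bounded_shapes(const std::shared_ptr<ov::Model>& model) {
    // Batch and spatial dimensions are dynamic but limited by upper bounds,
    // which lets the plugin preallocate and reuse memory
    model->reshape(ov::PartialShape{ov::Dimension(1, 8),     // batch: 1..8
                                    3,                       // channels: static
                                    ov::Dimension(1, 512),   // height: up to 512
                                    ov::Dimension(1, 512)}); // width: up to 512
}
```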
|
||||
|
||||
> **NOTE**: Using fully undefined shapes may result in significantly higher memory consumption compared to inferring the same model with static shapes.
|
||||
> If the level of memory consumption is unacceptable but dynamic shapes are still required, you can reshape the model using shapes with defined upper bounds to reduce memory footprint.
|
||||
> If memory consumption is unacceptable but dynamic shapes are still required, the model can be reshaped using shapes with defined upper bounds to reduce memory footprint.
|
||||
|
||||
Some runtime optimizations work better if the model shapes are known in advance.
|
||||
Therefore, if the input data shape is not changed between inference calls, it is recommended to use a model with static shapes or reshape the existing model with the static input shape to get the best performance.
|
||||
@ -170,12 +172,12 @@ Therefore, if the input data shape is not changed between inference calls, it is
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
See [dynamic shapes guide](../ov_dynamic_shapes.md) for more details.
|
||||
For more details, see the [dynamic shapes guide](../ov_dynamic_shapes.md).
|
||||
|
||||
### Preprocessing acceleration
|
||||
### Preprocessing Acceleration
|
||||
The CPU plugin supports the full set of preprocessing operations, providing high-performance implementations for them.
|
||||
|
||||
See [preprocessing API guide](../preprocessing_overview.md) for more details.
|
||||
For more details, see [preprocessing API guide](../preprocessing_overview.md).
|
||||
|
||||
@sphinxdirective
|
||||
.. dropdown:: The CPU plugin support for handling tensor precision conversion is limited to the following ov::element types:
|
||||
@ -195,51 +197,52 @@ See [preprocessing API guide](../preprocessing_overview.md) for more details.
|
||||
* boolean
|
||||
@endsphinxdirective
|
||||
|
||||
### Model caching
|
||||
The CPU device plugin supports Import/Export network capability. If model caching is enabled via the common OpenVINO™ `ov::cache_dir` property, the plugin will automatically create a cached blob inside the specified directory during model compilation.
|
||||
### Model Caching
|
||||
The CPU plugin supports the Import/Export network capability. If model caching is enabled via the common OpenVINO™ `ov::cache_dir` property, the plugin automatically creates a cached blob inside the specified directory during model compilation.
|
||||
This cached blob contains a partial representation of the network, with common runtime optimizations and low-precision transformations already applied.
|
||||
At the next attempt to compile the model, the cached representation will be loaded to the plugin instead of the initial IR, so the aforementioned steps will be skipped.
|
||||
These operations take a significant amount of time during model compilation, so caching their results makes subsequent compilations of the model much faster, thus reducing first inference latency (FIL).
|
||||
The next time the model is compiled, the cached representation will be loaded to the plugin instead of the initial OpenVINO IR, so the aforementioned transformation steps will be skipped.
|
||||
These transformations take a significant amount of time during model compilation, so caching this representation reduces time spent for subsequent compilations of the model,
|
||||
thereby reducing first inference latency (FIL).
|
||||
|
||||
See [model caching overview](@ref openvino_docs_OV_UG_Model_caching_overview) for more details.
|
||||
For more details, see the [model caching](@ref openvino_docs_OV_UG_Model_caching_overview) overview.
|
||||
|
||||
### Extensibility
|
||||
The CPU device plugin supports fallback on `ov::Op` reference implementation if it lacks own implementation of such operation.
|
||||
This means that [OpenVINO™ Extensibility Mechanism](@ref openvino_docs_Extensibility_UG_Intro) can be used for the plugin extension as well.
|
||||
To enable fallback on a custom operation implementation, override the `ov::Op::evaluate` method in the derived operation class (see [custom OpenVINO™ operations](@ref openvino_docs_Extensibility_UG_add_openvino_ops) for details).
|
||||
The CPU plugin supports fallback on the `ov::Op` reference implementation if the plugin does not have its own implementation of an operation.
|
||||
That means that [OpenVINO™ Extensibility Mechanism](@ref openvino_docs_Extensibility_UG_Intro) can be used for the plugin extension as well.
|
||||
Enabling fallback on a custom operation implementation is possible by overriding the `ov::Op::evaluate` method in the derived operation class (see [custom OpenVINO™ operations](@ref openvino_docs_Extensibility_UG_add_openvino_ops) for details).
|
||||
|
||||
> **NOTE**: At the moment, custom operations with internal dynamism (when the output tensor shape can only be determined as a result of performing the operation) are not supported by the plugin.
|
||||
|
||||
### Stateful models
|
||||
The CPU device plugin supports stateful models without any limitations.
|
||||
### Stateful Models
|
||||
The CPU plugin supports stateful models without any limitations.
|
||||
|
||||
See [stateful models guide](@ref openvino_docs_OV_UG_network_state_intro) for details.
|
||||
For details, see [stateful models guide](@ref openvino_docs_OV_UG_network_state_intro).
|
||||
|
||||
## Supported properties
|
||||
## Supported Properties
|
||||
The plugin supports the following properties:
|
||||
|
||||
### Read-write properties
|
||||
### Read-write Properties
|
||||
In order to take effect, all parameters must be set before calling `ov::Core::compile_model()`, or passed as an additional argument to `ov::Core::compile_model()` (see the sketch after the list):
|
||||
|
||||
- ov::enable_profiling
|
||||
- ov::hint::inference_precision
|
||||
- ov::hint::performance_mode
|
||||
- ov::hint::num_request
|
||||
- ov::num_streams
|
||||
- ov::affinity
|
||||
- ov::inference_num_threads
|
||||
- `ov::enable_profiling`
|
||||
- `ov::hint::inference_precision`
|
||||
- `ov::hint::performance_mode`
|
||||
- `ov::hint::num_request`
|
||||
- `ov::num_streams`
|
||||
- `ov::affinity`
|
||||
- `ov::inference_num_threads`
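A hedged sketch of passing some of these properties directly to `ov::Core::compile_model()` (the values and helper name are illustrative):

```cpp
#include <openvino/openvino.hpp>

ov::CompiledModel compile_with_properties(ov::Core& core,
                                          const std::shared_ptr<ov::Model>& model) {
    return core.compile_model(model, "CPU",
                              ov::enable_profiling(true),
                              ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY),
                              ov::inference_num_threads(8));
}
```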
|
||||
|
||||
|
||||
### Read-only Properties
|
||||
- ov::cache_dir
|
||||
- ov::supported_properties
|
||||
- ov::available_devices
|
||||
- ov::range_for_async_infer_requests
|
||||
- ov::range_for_streams
|
||||
- ov::device::full_name
|
||||
- ov::device::capabilities
|
||||
- `ov::cache_dir`
|
||||
- `ov::supported_properties`
|
||||
- `ov::available_devices`
|
||||
- `ov::range_for_async_infer_requests`
|
||||
- `ov::range_for_streams`
|
||||
- `ov::device::full_name`
|
||||
- `ov::device::capabilities`
|
||||
|
||||
## External dependencies
|
||||
## External Dependencies
|
||||
For some performance-critical DL operations, the CPU plugin uses optimized implementations from the oneAPI Deep Neural Network Library ([oneDNN](https://github.com/oneapi-src/oneDNN)).
|
||||
|
||||
@sphinxdirective
|
||||
|
@ -22,19 +22,19 @@ The OpenVINO Runtime provides capabilities to infer deep learning models on the
|
||||
|[CPU](CPU.md) |Intel® Xeon®, Intel® Core™ and Intel® Atom® processors with Intel® Streaming SIMD Extensions (Intel® SSE4.2), Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel® Vector Neural Network Instructions (Intel® AVX512-VNNI) and bfloat16 extension for AVX-512 (Intel® AVX-512_BF16 Extension)|
|
||||
|[GPU](GPU.md) |Intel® Graphics, including Intel® HD Graphics, Intel® UHD Graphics, Intel® Iris® Graphics, Intel® Xe Graphics, Intel® Xe MAX Graphics |
|
||||
|[VPUs](VPU.md) |Intel® Neural Compute Stick 2 powered by the Intel® Movidius™ Myriad™ X, Intel® Vision Accelerator Design with Intel® Movidius™ VPUs |
|
||||
|[GNA](GNA.md) |[Intel® Speech Enabling Developer Kit](https://www.intel.com/content/www/us/en/support/articles/000026156/boards-and-kits/smart-home.html); [Amazon Alexa\* Premium Far-Field Developer Kit](https://developer.amazon.com/en-US/alexa/alexa-voice-service/dev-kits/amazon-premium-voice); [Intel® Pentium® Silver Processors N5xxx, J5xxx and Intel® Celeron® Processors N4xxx, J4xxx (formerly codenamed Gemini Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/83915/gemini-lake.html): [Intel® Pentium® Silver J5005 Processor](https://ark.intel.com/content/www/us/en/ark/products/128984/intel-pentium-silver-j5005-processor-4m-cache-up-to-2-80-ghz.html), [Intel® Pentium® Silver N5000 Processor](https://ark.intel.com/content/www/us/en/ark/products/128990/intel-pentium-silver-n5000-processor-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® J4005 Processor](https://ark.intel.com/content/www/us/en/ark/products/128992/intel-celeron-j4005-processor-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® J4105 Processor](https://ark.intel.com/content/www/us/en/ark/products/128989/intel-celeron-j4105-processor-4m-cache-up-to-2-50-ghz.html), [Intel® Celeron® J4125 Processor](https://ark.intel.com/content/www/us/en/ark/products/197305/intel-celeron-processor-j4125-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® Processor N4100](https://ark.intel.com/content/www/us/en/ark/products/128983/intel-celeron-processor-n4100-4m-cache-up-to-2-40-ghz.html), [Intel® Celeron® Processor N4000](https://ark.intel.com/content/www/us/en/ark/products/128988/intel-celeron-processor-n4000-4m-cache-up-to-2-60-ghz.html); [Intel® Pentium® Processors N6xxx, J6xxx, Intel® Celeron® Processors N6xxx, J6xxx and Intel Atom® x6xxxxx (formerly codenamed Elkhart Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/128825/products-formerly-elkhart-lake.html); [Intel® Core™ Processors (formerly codenamed Cannon Lake)](https://ark.intel.com/content/www/us/en/ark/products/136863/intel-core-i3-8121u-processor-4m-cache-up-to-3-20-ghz.html); [10th Generation Intel® Core™ Processors (formerly codenamed Ice Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/74979/ice-lake.html): [Intel® Core™ i7-1065G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i71065g7-processor-8m-cache-up-to-3-90-ghz.html), [Intel® Core™ i7-1060G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/197120/intel-core-i71060g7-processor-8m-cache-up-to-3-80-ghz.html), [Intel® Core™ i5-1035G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/196591/intel-core-i51035g4-processor-6m-cache-up-to-3-70-ghz.html), [Intel® Core™ i5-1035G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/196592/intel-core-i51035g7-processor-6m-cache-up-to-3-70-ghz.html), [Intel® Core™ i5-1035G1 Processor](https://ark.intel.com/content/www/us/en/ark/products/196603/intel-core-i51035g1-processor-6m-cache-up-to-3-60-ghz.html), [Intel® Core™ i5-1030G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/197119/intel-core-i51030g7-processor-6m-cache-up-to-3-50-ghz.html), [Intel® Core™ i5-1030G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/197121/intel-core-i51030g4-processor-6m-cache-up-to-3-50-ghz.html), [Intel® Core™ i3-1005G1 Processor](https://ark.intel.com/content/www/us/en/ark/products/196588/intel-core-i31005g1-processor-4m-cache-up-to-3-40-ghz.html), [Intel® Core™ i3-1000G1 
Processor](https://ark.intel.com/content/www/us/en/ark/products/197122/intel-core-i31000g1-processor-4m-cache-up-to-3-20-ghz.html), [Intel® Core™ i3-1000G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/197123/intel-core-i31000g4-processor-4m-cache-up-to-3-20-ghz.html); [11th Generation Intel® Core™ Processors (formerly codenamed Tiger Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/88759/tiger-lake.html); [12th Generation Intel® Core™ Processors (formerly codenamed Alder Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/147470/products-formerly-alder-lake.html)|
|
||||
|[GNA](GNA.md) |[Intel® Speech Enabling Developer Kit](https://www.intel.com/content/www/us/en/support/articles/000026156/boards-and-kits/smart-home.html); [Amazon Alexa Premium Far-Field Developer Kit](https://developer.amazon.com/en-US/alexa/alexa-voice-service/dev-kits/amazon-premium-voice); [Intel® Pentium® Silver Processors N5xxx, J5xxx and Intel® Celeron® Processors N4xxx, J4xxx (formerly codenamed Gemini Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/83915/gemini-lake.html): [Intel® Pentium® Silver J5005 Processor](https://ark.intel.com/content/www/us/en/ark/products/128984/intel-pentium-silver-j5005-processor-4m-cache-up-to-2-80-ghz.html), [Intel® Pentium® Silver N5000 Processor](https://ark.intel.com/content/www/us/en/ark/products/128990/intel-pentium-silver-n5000-processor-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® J4005 Processor](https://ark.intel.com/content/www/us/en/ark/products/128992/intel-celeron-j4005-processor-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® J4105 Processor](https://ark.intel.com/content/www/us/en/ark/products/128989/intel-celeron-j4105-processor-4m-cache-up-to-2-50-ghz.html), [Intel® Celeron® J4125 Processor](https://ark.intel.com/content/www/us/en/ark/products/197305/intel-celeron-processor-j4125-4m-cache-up-to-2-70-ghz.html), [Intel® Celeron® Processor N4100](https://ark.intel.com/content/www/us/en/ark/products/128983/intel-celeron-processor-n4100-4m-cache-up-to-2-40-ghz.html), [Intel® Celeron® Processor N4000](https://ark.intel.com/content/www/us/en/ark/products/128988/intel-celeron-processor-n4000-4m-cache-up-to-2-60-ghz.html); [Intel® Pentium® Processors N6xxx, J6xxx, Intel® Celeron® Processors N6xxx, J6xxx and Intel Atom® x6xxxxx (formerly codenamed Elkhart Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/128825/products-formerly-elkhart-lake.html); [Intel® Core™ Processors (formerly codenamed Cannon Lake)](https://ark.intel.com/content/www/us/en/ark/products/136863/intel-core-i3-8121u-processor-4m-cache-up-to-3-20-ghz.html); [10th Generation Intel® Core™ Processors (formerly codenamed Ice Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/74979/ice-lake.html): [Intel® Core™ i7-1065G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i71065g7-processor-8m-cache-up-to-3-90-ghz.html), [Intel® Core™ i7-1060G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/197120/intel-core-i71060g7-processor-8m-cache-up-to-3-80-ghz.html), [Intel® Core™ i5-1035G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/196591/intel-core-i51035g4-processor-6m-cache-up-to-3-70-ghz.html), [Intel® Core™ i5-1035G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/196592/intel-core-i51035g7-processor-6m-cache-up-to-3-70-ghz.html), [Intel® Core™ i5-1035G1 Processor](https://ark.intel.com/content/www/us/en/ark/products/196603/intel-core-i51035g1-processor-6m-cache-up-to-3-60-ghz.html), [Intel® Core™ i5-1030G7 Processor](https://ark.intel.com/content/www/us/en/ark/products/197119/intel-core-i51030g7-processor-6m-cache-up-to-3-50-ghz.html), [Intel® Core™ i5-1030G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/197121/intel-core-i51030g4-processor-6m-cache-up-to-3-50-ghz.html), [Intel® Core™ i3-1005G1 Processor](https://ark.intel.com/content/www/us/en/ark/products/196588/intel-core-i31005g1-processor-4m-cache-up-to-3-40-ghz.html), [Intel® Core™ i3-1000G1 
Processor](https://ark.intel.com/content/www/us/en/ark/products/197122/intel-core-i31000g1-processor-4m-cache-up-to-3-20-ghz.html), [Intel® Core™ i3-1000G4 Processor](https://ark.intel.com/content/www/us/en/ark/products/197123/intel-core-i31000g4-processor-4m-cache-up-to-3-20-ghz.html); [11th Generation Intel® Core™ Processors (formerly codenamed Tiger Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/88759/tiger-lake.html); [12th Generation Intel® Core™ Processors (formerly codenamed Alder Lake)](https://ark.intel.com/content/www/us/en/ark/products/codename/147470/products-formerly-alder-lake.html)|
|
||||
|[Arm® CPU](ARM_CPU.md) |Raspberry Pi™ 4 Model B, Apple® Mac mini with M1 chip, NVIDIA® Jetson Nano™, Android™ devices |
|
||||
|
||||
OpenVINO Runtime also offers several execution modes which work on top of other devices:
|
||||
OpenVINO Runtime also has several execution capabilities which work on top of other devices:
|
||||
|
||||
| Capability | Description |
|
||||
|------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
|[Multi-Device execution](../multi_device.md) |Multi-Device enables simultaneous inference of the same model on several devices in parallel |
|
||||
|[Auto-Device selection](../auto_device_selection.md) |Auto-Device selection enables selecting Intel® device for inference automatically |
|
||||
|[Heterogeneous execution](../hetero_execution.md) |Heterogeneous execution enables automatic inference splitting between several devices (for example if a device doesn't [support certain operation](#supported-layers))|
|
||||
|[Automatic Batching](../automatic_batching.md) | the Auto-Batching plugin enables batching (on top of the specified device) that is completely transparent to the application |
|
||||
|[Multi-Device execution](../multi_device.md) |Multi-Device enables simultaneous inference of the same model on several devices in parallel. |
|
||||
|[Auto-Device selection](../auto_device_selection.md) |Auto-Device selection enables selecting Intel device for inference automatically. |
|
||||
|[Heterogeneous execution](../hetero_execution.md) |Heterogeneous execution enables automatic inference splitting between several devices (for example if a device doesn't [support certain operation](#supported-layers)).|
|
||||
|[Automatic Batching](../automatic_batching.md) |The Auto-Batching plugin enables batching (on top of the specified device) that is completely transparent to the application. |
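As a rough sketch, each of these modes is selected simply by passing the corresponding virtual device name to `ov::Core::compile_model()` (the model path is illustrative; the exact device lists depend on the system):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");

    auto multi     = core.compile_model(model, "MULTI:GPU.1,GPU.0");  // Multi-Device execution
    auto automatic = core.compile_model(model, "AUTO");               // Auto-Device selection
    auto hetero    = core.compile_model(model, "HETERO:GPU,CPU");     // Heterogeneous execution
    auto batched   = core.compile_model(model, "BATCH:GPU");          // Automatic Batching
    return 0;
}
```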
|
||||
|
||||
Devices similar to the ones we use for benchmarking can be accessed using [Intel® DevCloud for the Edge](https://devcloud.intel.com/edge/), a remote development environment with access to Intel® hardware and the latest versions of the Intel® Distribution of the OpenVINO™ Toolkit. [Learn more](https://devcloud.intel.com/edge/get_started/devcloud/) or [Register here](https://inteliot.force.com/DevcloudForEdge/s/).
|
||||
Devices similar to the ones used for benchmarking can be accessed using [Intel® DevCloud for the Edge](https://devcloud.intel.com/edge/), a remote development environment with access to Intel® hardware and the latest versions of the Intel® Distribution of the OpenVINO™ Toolkit. [Learn more](https://devcloud.intel.com/edge/get_started/devcloud/) or [Register here](https://inteliot.force.com/DevcloudForEdge/s/).
|
||||
|
||||
@anchor features_support_matrix
|
||||
## Feature Support Matrix
|
||||
@ -55,8 +55,6 @@ The table below demonstrates support of key features by OpenVINO device plugins.
|
||||
|
||||
For more details on plugin-specific feature limitations, see the corresponding plugin pages.
|
||||
|
||||
|
||||
|
||||
## Enumerating Available Devices
|
||||
The OpenVINO Runtime API features dedicated methods of enumerating devices and their capabilities. See the [Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md). This is an example output from the sample (truncated to device names only):
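A rough programmatic equivalent of that enumeration, sketched with the standard `ov::Core` API:

```cpp
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Prints entries such as "CPU", "GPU", "GNA", depending on the system.
    for (const std::string& device : core.get_available_devices()) {
        std::cout << device << ": "
                  << core.get_property(device, ov::device::full_name) << std::endl;
    }
    return 0;
}
```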
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
# GNA device {#openvino_docs_OV_UG_supported_plugins_GNA}
|
||||
# GNA Device {#openvino_docs_OV_UG_supported_plugins_GNA}
|
||||
|
||||
The Intel® Gaussian & Neural Accelerator (GNA) is a low-power neural coprocessor for continuous inference at the edge.
|
||||
|
||||
@ -9,17 +9,17 @@ to save power and free CPU resources.
|
||||
|
||||
The GNA plugin provides a way to run inference on Intel® GNA, as well as in the software execution mode on CPU.
|
||||
|
||||
For more details on how to configure a machine to use GNA plugin, see [GNA configuration page](@ref openvino_docs_install_guides_configurations_for_intel_gna).
|
||||
For more details on how to configure a machine to use GNA plugin, see the [GNA configuration page](@ref openvino_docs_install_guides_configurations_for_intel_gna).
|
||||
|
||||
## Intel® GNA Generational Differences
|
||||
|
||||
The first (1.0) and second (2.0) versions of Intel® GNA found in 10th and 11th generation Intel® Core™ Processors may be considered to be functionally equivalent. Intel® GNA 2.0 provided performance improvement with respect to Intel® GNA 1.0. Starting with 12th Generation Intel® Core™ Processors (formerly codenamed Alder Lake), support for Intel® GNA 3.0 features is being added.
|
||||
The first (1.0) and second (2.0) versions of Intel® GNA found in 10th and 11th generation Intel® Core™ Processors may be considered functionally equivalent. Intel® GNA 2.0 provided performance improvement with respect to Intel® GNA 1.0. Starting with 12th Generation Intel® Core™ Processors (formerly codenamed Alder Lake), support for Intel® GNA 3.0 features is being added.
|
||||
|
||||
In the rest of this documentation, "GNA 2.0" refers to Intel® GNA hardware delivered on 10th and 11th generation Intel® Core™ processors, and the term "GNA 3.0" refers to GNA hardware delivered on 12th generation Intel® Core™ processors.
|
||||
In this documentation, "GNA 2.0" refers to Intel® GNA hardware delivered on 10th and 11th generation Intel® Core™ processors, and the term "GNA 3.0" refers to GNA hardware delivered on 12th generation Intel® Core™ processors.
|
||||
|
||||
### Intel® GNA Forward and Backward Compatibility
|
||||
|
||||
When you run a model using the GNA plugin, it is compiled internally for the specific hardware target. It is possible to export compiled model using <a href="#import-export">Import/Export</a> functionality to use it later, but in the general case, there is no guarantee that a model compiled and exported for GNA 2.0 runs on GNA 3.0, or vice versa.
|
||||
When a model is run, using the GNA plugin, it is compiled internally for the specific hardware target. It is possible to export a compiled model, using <a href="#import-export">Import/Export</a> functionality to use it later. In general, there is no guarantee that a model compiled and exported for GNA 2.0 runs on GNA 3.0 or vice versa.
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -31,37 +31,36 @@ When you run a model using the GNA plugin, it is compiled internally for the spe
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
> **NOTE**: In most cases, networks compiled for GNA 2.0 runs as expected on GNA 3.0, although the performance may be worse compared to the case when a network is compiled specifically for the latter. The exception is networks with convolutions with the number of filters greater than 8192 (see the <a href="#models-and-operations-limitations">Models and Operations Limitations</a> section).
|
||||
> **NOTE**: In most cases, a network compiled for GNA 2.0 runs as expected on GNA 3.0. However, the performance may be worse compared to when a network is compiled specifically for the latter. The exception is a network with convolutions with the number of filters greater than 8192 (see the <a href="#models-and-operations-limitations">Models and Operations Limitations</a> section).
|
||||
|
||||
For optimal work with POT quantized models which includes 2D convolutions on GNA 3.0 hardware, the <a href="#support-for-2d-convolutions-using-pot">following requirements</a> should be satisfied.
|
||||
For optimal work with POT quantized models, which include 2D convolutions on GNA 3.0 hardware, the <a href="#support-for-2d-convolutions-using-pot">following requirements</a> should be satisfied.
|
||||
|
||||
Choose a compile target depending on the priority: cross-platform execution, performance, memory, or power optimization..
|
||||
Choose a compile target based on your priority: cross-platform execution, performance, memory, or power optimization.
|
||||
|
||||
Use the following properties to check interoperability in your application: `ov::intel_gna::execution_target` and `ov::intel_gna::compile_target`
|
||||
Use the following properties to check interoperability in your application: `ov::intel_gna::execution_target` and `ov::intel_gna::compile_target`.
|
||||
|
||||
[Speech C++ Sample](@ref openvino_inference_engine_samples_speech_sample_README) can be used for experiments (see `-exec_target` and `-compile_target` command line options).
|
||||
[Speech C++ Sample](@ref openvino_inference_engine_samples_speech_sample_README) can be used for experiments (see the `-exec_target` and `-compile_target` command line options).
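A sketch of setting both properties at compilation time (assuming the GNA property header and enumeration names below; the model path is illustrative):

```cpp
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gna/properties.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    // Execute on GNA 3.0 hardware, but compile the exported blob for GNA 2.0
    // so that it stays usable on older platforms.
    auto compiled = core.compile_model(model, "GNA",
        ov::intel_gna::execution_target(ov::intel_gna::HWGeneration::GNA_3_0),
        ov::intel_gna::compile_target(ov::intel_gna::HWGeneration::GNA_2_0));
    return 0;
}
```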
|
||||
|
||||
## Software emulation mode
|
||||
## Software Emulation Mode
|
||||
|
||||
On platforms without GNA hardware support plugin chooses software emulation mode by default. It means, model runs even if you do not have GNA HW within your platform.
|
||||
GNA plugin enables you to switch the execution between software emulation mode and hardware execution mode after the model is loaded.
|
||||
For details, see description of the `ov::intel_gna::execution_mode` property.
|
||||
Software emulation mode is used by default on platforms without GNA hardware support. Therefore, a model runs even if there is no GNA HW on the platform.
|
||||
GNA plugin enables switching the execution between software emulation mode and hardware execution mode once the model has been loaded.
|
||||
For details, see a description of the `ov::intel_gna::execution_mode` property.
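A brief sketch of selecting and later changing the execution mode (the model path is illustrative; the property is assumed to be declared in the GNA properties header):

```cpp
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gna/properties.hpp>

int main() {
    ov::Core core;
    // Load the model in bit-exact software emulation mode.
    auto compiled = core.compile_model("model.xml", "GNA",
        ov::intel_gna::execution_mode(ov::intel_gna::ExecutionMode::SW_EXACT));

    // Switch the already loaded model to hardware execution.
    compiled.set_property(ov::intel_gna::execution_mode(ov::intel_gna::ExecutionMode::HW));
    return 0;
}
```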
|
||||
|
||||
## Recovery from Interruption by High-Priority Windows Audio Processes\*
|
||||
## Recovery from Interruption by High-Priority Windows Audio Processes
|
||||
|
||||
GNA is designed for real-time workloads such as noise reduction.
|
||||
For such workloads, processing should be time constrained, otherwise extra delays may cause undesired effects such as
|
||||
*audio glitches*. To make sure that processing can satisfy real-time requirements, the GNA driver provides a Quality of Service
|
||||
(QoS) mechanism, which interrupts requests that might cause high-priority Windows audio processes to miss
|
||||
the schedule, thereby causing long running GNA tasks to terminate early.
|
||||
GNA is designed for real-time workloads, such as noise reduction.
|
||||
For such workloads, processing should be time constrained. Otherwise, extra delays may cause undesired effects such as
|
||||
*audio glitches*. The GNA driver provides a Quality of Service (QoS) mechanism to ensure that processing can satisfy real-time requirements.
|
||||
The mechanism interrupts requests that might cause high-priority Windows audio processes to miss
|
||||
the schedule. As a result, long running GNA tasks terminate early.
|
||||
|
||||
To prepare applications correctly, use the Automatic QoS Feature described below.
|
||||
|
||||
### Automatic QoS Feature on Windows*
|
||||
### Automatic QoS Feature on Windows
|
||||
|
||||
Starting with 2021.4.1 release of OpenVINO and 03.00.00.1363 version of Windows* GNA driver, a new execution mode `ov::intel_gna::ExecutionMode::HW_WITH_SW_FBACK` is introduced
|
||||
to assure that workloads satisfy real-time execution. In this mode, the GNA driver automatically falls back on CPU for a particular infer request
|
||||
if the HW queue is not empty, so there is no need for explicitly switching between GNA and CPU.
|
||||
Starting with the 2021.4.1 release of OpenVINO™ and the 03.00.00.1363 version of the Windows GNA driver, a new execution mode, `ov::intel_gna::ExecutionMode::HW_WITH_SW_FBACK`, is available to ensure that workloads satisfy real-time execution. In this mode, the GNA driver automatically falls back on CPU for a particular infer request
|
||||
if the HW queue is not empty. Therefore, there is no need for explicitly switching between GNA and CPU.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -83,42 +82,41 @@ if the HW queue is not empty, so there is no need for explicitly switching betwe
|
||||
|
||||
> **NOTE**: Due to the "first come - first served" nature of GNA driver and the QoS feature, this mode may lead to increased CPU consumption
|
||||
if there are several clients using GNA simultaneously.
|
||||
Even a lightweight competing infer request which has not been cleared at the time when the user's GNA client process makes its request,
|
||||
can cause the user's request to be executed on CPU, thereby unnecessarily increasing CPU utilization and power.
|
||||
Even a lightweight competing infer request, not cleared at the time when the user's GNA client process makes its request,
|
||||
can cause the user's request to be executed on CPU, unnecessarily increasing CPU utilization and power.
|
||||
|
||||
## Supported inference data types
|
||||
## Supported Inference Data Types
|
||||
|
||||
Intel® GNA essentially operates in the low-precision mode which represents a mix of 8-bit (`i8`), 16-bit (`i16`), and 32-bit (`i32`) integer computations.
|
||||
|
||||
GNA plugin users are encouraged to use the [Post-Training Optimization Tool](@ref pot_introduction) to get a model with quantization hints based on statistics for the provided dataset.
|
||||
|
||||
Unlike other plugins supporting low-precision execution, the GNA plugin can calculate quantization factors at the model loading time, so you can run a model without calibration. However, this mode may not provide satisfactory accuracy because the internal quantization algorithm is based on heuristics which may or may not be efficient, depending on the model and dynamic range of input data and this mode is going to be deprecated soon.
|
||||
Unlike other plugins supporting low-precision execution, the GNA plugin can calculate quantization factors at the model loading time. Therefore, a model can be run without calibration. However, this mode may not provide satisfactory accuracy because the internal quantization algorithm is based on heuristics, the efficiency of which depends on the model and dynamic range of input data. This mode is going to be deprecated soon.
|
||||
|
||||
GNA plugin supports the following data types as inference precision of internal primitives
|
||||
* Quantized data types:
|
||||
- i16
|
||||
- i8
|
||||
GNA plugin supports the `i16` and `i8` quantized data types as inference precision of internal primitives.
|
||||
|
||||
[Hello Query Device C++ Sample](@ref openvino_inference_engine_samples_hello_query_device_README) can be used to print out supported data types for all detected devices.
|
||||
|
||||
[POT API Usage sample for GNA](@ref pot_example_speech_README) demonstrates how a model can be quantized for GNA using POT API in 2 modes:
|
||||
[POT API Usage sample for GNA](@ref pot_example_speech_README) demonstrates how a model can be quantized for GNA using the POT API in two modes:
|
||||
* Accuracy (i16 weights)
|
||||
* Performance (i8 weights)
|
||||
|
||||
For POT quantized model `ov::hint::inference_precision` property has no effect except cases described in <a href="#support-for-2d-convolutions-using-pot">Support for 2D Convolutions using POT</a>.
|
||||
For a POT quantized model, the `ov::hint::inference_precision` property has no effect, except for the cases described in <a href="#support-for-2d-convolutions-using-pot">Support for 2D Convolutions using POT</a>.
|
||||
|
||||
## Supported features
|
||||
## Supported Features
|
||||
|
||||
### Models caching
|
||||
Cache for GNA plugin may be enabled via common OpenVINO `ov::cache_dir` property due to import/export functionality support (see below).
|
||||
The plugin supports the features listed below:
|
||||
|
||||
See [Model caching overview page](@ref openvino_docs_OV_UG_Model_caching_overview) for more details.
|
||||
### Models Caching
|
||||
Due to import/export functionality support (see below), the cache for the GNA plugin may be enabled via the common OpenVINO™ `ov::cache_dir` property.
|
||||
|
||||
For more details, see the [Model caching overview](@ref openvino_docs_OV_UG_Model_caching_overview).
|
||||
|
||||
### Import/Export
|
||||
|
||||
The GNA plugin supports import/export capability which helps to significantly decrease first inference time. The model compile target is the same as the execution target by default. The default value for the execution target corresponds to available hardware, or latest hardware version supported by the plugin (i.e., GNA 3.0) if there is no GNA HW in the system.
|
||||
The GNA plugin supports import/export capability, which helps decrease first inference time significantly. By default, the model compile target is the same as the execution target. The default value for the execution target corresponds to the available hardware, or to the latest hardware version supported by the plugin (i.e., GNA 3.0) if there is no GNA HW in the system.
|
||||
|
||||
If you are willing to export a model for a specific version of GNA HW, please use the `ov::intel_gna::compile_target` property and then export the model:
|
||||
To export a model for a specific version of GNA HW, use the `ov::intel_gna::compile_target` property and then export the model:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -154,18 +152,17 @@ Import model:
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
[Compile Tool](@ref openvino_inference_engine_tools_compile_tool_README) or [Speech C++ Sample](@ref openvino_inference_engine_samples_speech_sample_README) can be used to compile model.
|
||||
To compile a model, use either the [Compile Tool](@ref openvino_inference_engine_tools_compile_tool_README) or the [Speech C++ Sample](@ref openvino_inference_engine_samples_speech_sample_README).
|
||||
|
||||
### Stateful models
|
||||
GNA plugin natively supports stateful models.
|
||||
### Stateful Models
|
||||
The GNA plugin natively supports stateful models. For more details on such models, refer to the [Stateful models](@ref openvino_docs_OV_UG_network_state_intro) guide.
|
||||
|
||||
Please refer to [Stateful models] (@ref openvino_docs_OV_UG_network_state_intro) for more details about such models.
|
||||
|
||||
> **NOTE**: Typically, GNA is used in streaming scenarios, when minimizing the latency is important. Taking into account that POT does not support the `TensorIterator` operation, the recommendation is to use the `--transform` option of the Model Optimizer to apply `LowLatency2` transformation when converting an original model.
|
||||
> **NOTE**: The GNA is typically used in streaming scenarios when minimizing latency is important. Taking into account that POT does not support the `TensorIterator` operation, the recommendation is to use the `--transform` option of the Model Optimizer to apply `LowLatency2` transformation when converting an original model.
|
||||
|
||||
### Profiling
|
||||
The GNA plugin allows to turn on profiling using the `ov::enable_profiling` property.
|
||||
With the following methods, you can collect profiling information that provides various performance data about execution on GNA:
|
||||
The GNA plugin allows turning on profiling using the `ov::enable_profiling` property.
|
||||
With the following methods, you can collect profiling information with various performance data about execution on GNA:
|
||||
|
||||
@sphinxdirective
|
||||
.. tab:: C++
|
||||
@ -178,7 +175,7 @@ With the following methods, you can collect profiling information that provides
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
The current GNA implementation calculates counters for the whole utterance scoring and does not provide per-layer information. The API enables you to retrieve counter units in cycles, you can convert cycles to seconds as follows:
|
||||
The current GNA implementation calculates counters for the whole utterance scoring and does not provide per-layer information. The API enables you to retrieve counter units in cycles. You can convert cycles to seconds as follows:
|
||||
|
||||
```
|
||||
seconds = cycles / frequency
|
||||
@ -197,17 +194,15 @@ Refer to the table below to learn about the frequency of Intel® GNA inside a pa
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
Performance counters provided for the time being:
|
||||
The following inference request performance counters are currently provided:
|
||||
|
||||
* Inference request performance results
|
||||
* Number of total cycles spent on scoring in hardware including compute and memory stall cycles
|
||||
* Number of stall cycles spent in hardware
|
||||
* The number of total cycles spent on scoring in hardware, including compute and memory stall cycles
|
||||
* The number of stall cycles spent in hardware
|
||||
|
||||
## Supported properties
|
||||
The plugin supports the properties listed below.
|
||||
## Supported Properties
|
||||
|
||||
### Read-write properties
|
||||
The following parameters must be set before model compilation in order to take effect or passed as additional argument to `ov::Core::compile_model()`:
|
||||
### Read-write Properties
|
||||
In order to take effect, the following parameters must be set before model compilation or passed as additional arguments to `ov::Core::compile_model()`:
|
||||
|
||||
- ov::cache_dir
|
||||
- ov::enable_profiling
|
||||
@ -225,7 +220,7 @@ These parameters can be changed after model compilation `ov::CompiledModel::set_
|
||||
- ov::intel_gna::execution_mode
|
||||
- ov::log::level
|
||||
|
||||
### Read-only properties
|
||||
### Read-only Properties
|
||||
- ov::available_devices
|
||||
- ov::device::capabilities
|
||||
- ov::device::full_name
|
||||
@ -236,19 +231,20 @@ These parameters can be changed after model compilation `ov::CompiledModel::set_
|
||||
|
||||
## Limitations
|
||||
|
||||
### Models and Operations Limitations
|
||||
|
||||
Because of specifics of hardware architecture, Intel® GNA supports a limited set of operations, their kinds and combinations.
|
||||
For example, you should not expect the GNA Plugin to be able to run computer vision models, except those specifically adapted for the GNA Plugin, because the plugin does not fully support 2D convolutions.
|
||||
### Model and Operation Limitations
|
||||
|
||||
Due to the specifics of its hardware architecture, Intel® GNA supports a limited set of operations (including their kinds and combinations).
|
||||
For example, GNA Plugin should not be expected to run computer vision models because the plugin does not fully support 2D convolutions. The exceptions are models specifically adapted for the GNA Plugin.
|
||||
|
||||
Limitations include:
|
||||
|
||||
- Only 1D convolutions are natively supported on the HW prior to GNA 3.0; 2D convolutions have specific limitations (see the table below).
|
||||
- Prior to GNA 3.0, only 1D convolutions are natively supported on the HW; 2D convolutions have specific limitations (see the table below).
|
||||
- The number of output channels for convolutions must be a multiple of 4.
|
||||
- The maximum number of filters is 65532 for GNA 2.0 and 8192 for GNA 3.0.
|
||||
- Transpose layer support is limited to the cases where no data reordering is needed or when reordering is happening for two dimensions, at least one of which is not greater than 8.
|
||||
- *Transpose* layer support is limited to the cases where no data reordering is needed or when reordering is happening for two dimensions, at least one of which is not greater than 8.
|
||||
- Splits and concatenations are supported for continuous portions of memory (e.g., split of 1,2,3,4 to 1,1,3,4 and 1,1,3,4 or concats of 1,2,3,4 and 1,2,3,5 to 2,2,3,4).
|
||||
- For Multiply, Add and Subtract layers, auto broadcasting is only supported for constant inputs.
|
||||
- For *Multiply*, *Add* and *Subtract* layers, auto broadcasting is only supported for constant inputs.
|
||||
|
||||
#### Support for 2D Convolutions
|
||||
|
||||
@ -256,11 +252,11 @@ The Intel® GNA 1.0 and 2.0 hardware natively supports only 1D convolutions. How
|
||||
|
||||
Initially, a limited subset of Intel® GNA 3.0 features is added to the previous feature set, including the following:
|
||||
|
||||
* **2D VALID Convolution With Small 2D Kernels:** Two-dimensional convolutions with the following kernel dimensions [H,W] are supported: [1,1], [2,2], [3,3], [2,1], [3,1], [4,1], [5,1], [6,1], [7,1], [1,2], or [1,3]. Input tensor dimensions are limited to [1,8,16,16] <= [N,C,H,W] <= [1,120,384,240]. Up to 384 channels C may be used with a subset of kernel sizes (see table below). Up to 256 kernels (output channels) are supported. Pooling is limited to pool shapes of [1,1], [2,2], or [3,3]. Not all combinations of kernel shape and input tensor shape are supported (see the tables below for exact limitations).
|
||||
* **2D VALID Convolution With Small 2D Kernels:** Two-dimensional convolutions with the following kernel dimensions [`H`,`W`] are supported: [1,1], [2,2], [3,3], [2,1], [3,1], [4,1], [5,1], [6,1], [7,1], [1,2], or [1,3]. Input tensor dimensions are limited to [1,8,16,16] <= [`N`,`C`,`H`,`W`] <= [1,120,384,240]. Up to 384 `C` channels may be used with a subset of kernel sizes (see the table below). Up to 256 kernels (output channels) are supported. Pooling is limited to pool shapes of [1,1], [2,2], or [3,3]. Not all combinations of kernel shape and input tensor shape are supported (see the tables below for exact limitations).
|
||||
|
||||
The tables below show that the exact limitation on the input tensor width W depends on the number of input channels C (indicated as Ci below) and the kernel shape. There is much more freedom to choose the input tensor height and number of output channels.
|
||||
The tables below show that the exact limitation on the input tensor width W depends on the number of input channels *C* (indicated as *Ci* below) and the kernel shape. There is much more freedom to choose the input tensor height and number of output channels.
|
||||
|
||||
The following tables provide a more explicit representation of the Intel(R) GNA 3.0 2D convolution operations initially supported. The limits depend strongly on number of input tensor channels (Ci) and the input tensor width (W). Other factors are kernel height (KH), kernel width (KW), pool height (PH), pool width (PW), horizontal pool step (SH), and vertical pool step (PW). For example, the first table shows that for a 3x3 kernel with max pooling, only square pools are supported, and W is limited to 87 when there are 64 input channels.
|
||||
The following tables provide a more explicit representation of the Intel(R) GNA 3.0 2D convolution operations initially supported. The limits depend strongly on number of input tensor channels (*Ci*) and the input tensor width (*W*). Other factors are kernel height (*KH*), kernel width (*KW*), pool height (*PH*), pool width (*PW*), horizontal pool step (*SH*), and vertical pool step (*PW*). For example, the first table shows that for a 3x3 kernel with max pooling, only square pools are supported, and *W* is limited to 87 when there are 64 input channels.
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -275,16 +271,16 @@ The following tables provide a more explicit representation of the Intel(R) GNA
|
||||
#### Support for 2D Convolutions using POT
|
||||
|
||||
For POT to work successfully with models that include GNA 3.0 2D convolutions, the following requirements must be met:
|
||||
* All convolution parameters are natively supported by HW (see tables above)
|
||||
* All convolution parameters are natively supported by HW (see tables above).
|
||||
* The runtime precision is explicitly set by the `ov::hint::inference_precision` property as `i8` for the models produced by the `performance mode` of POT, and as `i16` for the models produced by the `accuracy mode` of POT.
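A short sketch of setting the runtime precision explicitly for such a model (`i8` for the performance mode, `i16` for the accuracy mode; the model path is illustrative):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("quantized_model.xml");
    // For a model produced by the POT performance mode; use ov::element::i16
    // for a model produced by the accuracy mode.
    auto compiled = core.compile_model(model, "GNA",
        ov::hint::inference_precision(ov::element::i8));
    return 0;
}
```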
|
||||
|
||||
### Batch Size Limitation
|
||||
|
||||
Intel® GNA plugin supports the processing of context-windowed speech frames in batches of 1-8 frames.
|
||||
|
||||
Please refer to [Layout API overview](@ref openvino_docs_OV_UG_Layout_Overview) to determine batch dimension.
|
||||
Refer to the [Layout API overview](@ref openvino_docs_OV_UG_Layout_Overview) to determine batch dimension.
|
||||
|
||||
To set layout of model inputs in runtime use [Optimize Preprocessing](@ref openvino_docs_OV_UG_Preprocessing_Overview) guide:
|
||||
To set the layout of model inputs at runtime, follow the [Optimize Preprocessing](@ref openvino_docs_OV_UG_Preprocessing_Overview) guide:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
# GPU device {#openvino_docs_OV_UG_supported_plugins_GPU}
|
||||
# GPU Device {#openvino_docs_OV_UG_supported_plugins_GPU}
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -16,15 +16,15 @@ For an in-depth description of the GPU plugin, see:
|
||||
- [OpenVINO Runtime GPU plugin source files](https://github.com/openvinotoolkit/openvino/tree/master/src/plugins/intel_gpu/)
|
||||
- [Accelerate Deep Learning Inference with Intel® Processor Graphics](https://software.intel.com/en-us/articles/accelerating-deep-learning-inference-with-intel-processor-graphics).
|
||||
|
||||
It is a part of the Intel® Distribution of OpenVINO™ toolkit. For more information on how to configure a system to use it, see [GPU configuration page](@ref openvino_docs_install_guides_configurations_for_intel_gpu).
|
||||
The GPU plugin is a part of the Intel® Distribution of OpenVINO™ toolkit. For more information on how to configure a system to use it, see the [GPU configuration](@ref openvino_docs_install_guides_configurations_for_intel_gpu).
|
||||
|
||||
## Device Naming Convention
|
||||
* Devices are enumerated as `"GPU.X"` where `X={0, 1, 2,...}`. Only Intel® GPU devices are considered.
|
||||
* If the system has an integrated GPU, its 'id' is always '0' (`"GPU.0"`).
|
||||
* Other GPUs' order is not predefined and depends on the GPU driver.
|
||||
* `"GPU"` is an alias for `"GPU.0"`
|
||||
* If the system doesn't have an integrated GPU, devices are enumerated starting from 0.
|
||||
* For GPUs with multi-tile architecture (multiple sub-devices in OpenCL terms) a specific tile may be addressed as `"GPU.X.Y"` where `X,Y={0, 1, 2,...}`, `X` - id of the GPU device, `Y` - id of the tile within device `X`
|
||||
* Devices are enumerated as `GPU.X`, where `X={0, 1, 2,...}` (only Intel® GPU devices are considered).
|
||||
* If the system has an integrated GPU, its `id` is always 0 (`GPU.0`).
|
||||
* The order of other GPUs is not predefined and depends on the GPU driver.
|
||||
* The `GPU` is an alias for `GPU.0`.
|
||||
* If the system does not have an integrated GPU, devices are enumerated, starting from 0.
|
||||
* For GPUs with multi-tile architecture (multiple sub-devices in OpenCL terms), a specific tile may be addressed as `GPU.X.Y`, where `X,Y={0, 1, 2,...}`, `X` is the id of the GPU device and `Y` is the id of the tile within device `X`.
|
||||
|
||||
For demonstration purposes, see the [Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) that can print out the list of available devices with associated indices. Below is an example output (truncated to the device names only):
|
||||
|
||||
@ -40,7 +40,7 @@ Available devices:
|
||||
Device: HDDL
|
||||
```
|
||||
|
||||
Then device name can be passed to `ov::Core::compile_model()` method:
|
||||
Then, the device name can be passed to the `ov::Core::compile_model()` method:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -94,7 +94,7 @@ Then device name can be passed to `ov::Core::compile_model()` method:
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
## Supported inference data types
|
||||
## Supported Inference Data Types
|
||||
The GPU plugin supports the following data types as inference precision of internal primitives:
|
||||
|
||||
- Floating-point data types:
|
||||
@ -106,20 +106,22 @@ The GPU plugin supports the following data types as inference precision of inter
|
||||
- u1
|
||||
|
||||
Selected precision of each primitive depends on the operation precision in IR, quantization primitives, and available hardware capabilities.
|
||||
u1/u8/i8 data types are used for quantized operations only, i.e. those are not selected automatically for non-quantized operations.
|
||||
For more details on how to get a quantized model, refer to [Model Optimization](@ref openvino_docs_model_optimization_guide) document.
|
||||
The `u1`/`u8`/`i8` data types are used for quantized operations only, which means that they are not selected automatically for non-quantized operations.
|
||||
For more details on how to get a quantized model, refer to the [Model Optimization guide](@ref openvino_docs_model_optimization_guide).
|
||||
|
||||
Floating-point precision of a GPU primitive is selected based on operation precision in IR except [compressed f16 IR form](../../MO_DG/prepare_model/FP16_Compression.md) which is executed in the f16 precision.
|
||||
Floating-point precision of a GPU primitive is selected based on operation precision in the OpenVINO IR, except for the [compressed f16 OpenVINO IR form](../../MO_DG/prepare_model/FP16_Compression.md), which is executed in the `f16` precision.
|
||||
|
||||
> **NOTE**: Hardware acceleration for i8/u8 precision may be unavailable on some platforms. In that case a model is executed in the floating-point precision taken from IR. Hardware support of u8/i8 acceleration can be queried via the `ov::device::capabilities` property.
|
||||
> **NOTE**: Hardware acceleration for `i8`/`u8` precision may be unavailable on some platforms. In such cases, a model is executed in the floating-point precision taken from IR. Hardware support of `u8`/`i8` acceleration can be queried via the `ov::device::capabilities` property.
|
||||
|
||||
[Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) can be used to print out the supported data types for all detected devices.
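The same information can also be queried programmatically; a sketch (the capability strings in the comment are typical examples):

```cpp
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Prints capability strings such as "FP32", "FP16", "INT8", "BIN".
    for (const auto& capability : core.get_property("GPU", ov::device::capabilities)) {
        std::cout << capability << std::endl;
    }
    return 0;
}
```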
|
||||
|
||||
## Supported features
|
||||
## Supported Features
|
||||
|
||||
### Multi-device execution
|
||||
The GPU plugin supports the following features:
|
||||
|
||||
### Multi-device Execution
|
||||
If a system has multiple GPUs (for example, an integrated and a discrete Intel GPU), then any supported model can be executed on all GPUs simultaneously.
|
||||
It is done by specifying `"MULTI:GPU.1,GPU.0"` as a target device.
|
||||
It is done by specifying `MULTI:GPU.1,GPU.0` as a target device.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -133,12 +135,12 @@ It is done by specifying `"MULTI:GPU.1,GPU.0"` as a target device.
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
See [Multi-device execution page](../multi_device.md) for more details.
|
||||
For more details, see the [Multi-device execution](../multi_device.md).
|
||||
|
||||
### Automatic batching
|
||||
### Automatic Batching
|
||||
The GPU plugin is capable of reporting `ov::max_batch_size` and `ov::optimal_batch_size` metrics with respect to the current hardware
|
||||
platform and model. Thus, automatic batching is enabled by default when `ov::optimal_batch_size` is > 1 and `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` is set.
|
||||
Alternatively, it can be enabled explicitly via the device notion, e.g. `"BATCH:GPU"`.
|
||||
platform and model. Therefore, automatic batching is enabled by default when `ov::optimal_batch_size` is `> 1` and `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` is set.
|
||||
Alternatively, it can be enabled explicitly via the device notion, for example `BATCH:GPU`.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -176,22 +178,22 @@ Alternatively, it can be enabled explicitly via the device notion, e.g. `"BATCH:
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
See [Automatic batching page](../automatic_batching.md) for more details.
|
||||
For more details, see the [Automatic batching](../automatic_batching.md).
|
||||
|
||||
### Multi-stream execution
|
||||
If either `ov::num_streams(n_streams)` with `n_streams > 1` or `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` property is set for the GPU plugin,
|
||||
### Multi-stream Execution
|
||||
If either the `ov::num_streams(n_streams)` with `n_streams > 1` or the `ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT)` property is set for the GPU plugin,
|
||||
multiple streams are created for the model. In the case of the GPU plugin, each stream has its own host thread and an associated OpenCL queue,
|
||||
which means that the incoming infer requests can be processed simultaneously.
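A minimal sketch of enabling two GPU streams explicitly (the throughput performance hint alone would also create multiple streams; the model path is illustrative):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");
    // Two streams: two host threads and two OpenCL queues for this model.
    auto compiled = core.compile_model(model, "GPU", ov::num_streams(2));

    // Independent infer requests are scheduled to the available streams.
    auto request_1 = compiled.create_infer_request();
    auto request_2 = compiled.create_infer_request();
    request_1.start_async();
    request_2.start_async();
    request_1.wait();
    request_2.wait();
    return 0;
}
```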
|
||||
|
||||
> **NOTE**: Simultaneous scheduling of kernels to different queues doesn't mean that the kernels are actually executed in parallel on the GPU device. The actual behavior depends on the hardware architecture and in some cases the execution may be serialized inside the GPU driver.
|
||||
> **NOTE**: Simultaneous scheduling of kernels to different queues does not mean that the kernels are actually executed in parallel on the GPU device. The actual behavior depends on the hardware architecture and in some cases the execution may be serialized inside the GPU driver.
|
||||
|
||||
When multiple inferences of the same model need to be executed in parallel, the multi-stream feature is preferred to multiple instances of the model or application.
|
||||
That's because implementation of streams in the GPU plugin supports weight memory sharing across streams, thus, memory consumption may be lower, compared to the other approaches.
|
||||
The reason for this is that the implementation of streams in the GPU plugin supports weight memory sharing across streams, thus, memory consumption may be lower, compared to the other approaches.
|
||||
|
||||
See [optimization guide](@ref openvino_docs_deployment_optimization_guide_dldt_optimization_guide) for more details.
|
||||
For more details, see the [optimization guide](@ref openvino_docs_deployment_optimization_guide_dldt_optimization_guide).
|
||||
|
||||
### Dynamic shapes
|
||||
The GPU plugin supports dynamic shapes for batch dimension only (specified as 'N' in the [layouts terms](../layout_overview.md)) with a fixed upper bound. Any other dynamic dimensions are unsupported. Internally, GPU plugin creates
|
||||
### Dynamic Shapes
|
||||
The GPU plugin supports dynamic shapes for batch dimension only (specified as `N` in the [layouts terms](../layout_overview.md)) with a fixed upper bound. Any other dynamic dimensions are unsupported. Internally, GPU plugin creates
|
||||
`log2(N)` (where `N` is the upper bound for the batch dimension) low-level execution graphs for batch sizes equal to powers of 2 to emulate dynamic behavior, so that an incoming infer request with a specific batch size is executed via a minimal combination of internal networks.
|
||||
For example, batch size 33 may be executed via 2 internal networks with batch sizes 32 and 1.
|
||||
|
||||
@ -211,11 +213,11 @@ The code snippet below demonstrates how to use dynamic batching in simple scenar
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
See [dynamic shapes guide](../ov_dynamic_shapes.md) for more details.
|
||||
For more details, see the [dynamic shapes guide](../ov_dynamic_shapes.md).
|
||||
|
||||
### Preprocessing acceleration
|
||||
### Preprocessing Acceleration
|
||||
The GPU plugin has the following additional preprocessing options:
|
||||
- `ov::intel_gpu::memory_type::surface` and `ov::intel_gpu::memory_type::buffer` values for `ov::preprocess::InputTensorInfo::set_memory_type()` preprocessing method. These values are intended to be used to provide a hint for the plugin on the type of input Tensors that will be set in runtime to generate proper kernels.
|
||||
- The `ov::intel_gpu::memory_type::surface` and `ov::intel_gpu::memory_type::buffer` values for the `ov::preprocess::InputTensorInfo::set_memory_type()` preprocessing method. These values are intended to be used to provide a hint for the plugin on the type of input Tensors that will be set in runtime to generate proper kernels.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -229,31 +231,31 @@ The GPU plugin has the following additional preprocessing options:
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
With such preprocessing GPU plugin will expect `ov::intel_gpu::ocl::ClImage2DTensor` (or derived) to be passed for each NV12 plane via `ov::InferRequest::set_tensor()` or `ov::InferRequest::set_tensors()` methods.
|
||||
With such preprocessing, GPU plugin will expect `ov::intel_gpu::ocl::ClImage2DTensor` (or derived) to be passed for each NV12 plane via `ov::InferRequest::set_tensor()` or `ov::InferRequest::set_tensors()` methods.
|
||||
|
||||
Refer to [RemoteTensor API](./GPU_RemoteTensor_API.md) for usage examples.
|
||||
For usage examples, refer to the [RemoteTensor API](./GPU_RemoteTensor_API.md).
|
||||
|
||||
See [preprocessing API guide](../preprocessing_overview.md) for more details.
|
||||
For more details, see the [preprocessing API](../preprocessing_overview.md).
|
||||
|
||||
### Model caching
|
||||
### Model Caching
|
||||
Cache for the GPU plugin may be enabled via the common OpenVINO `ov::cache_dir` property. GPU plugin implementation supports only caching of compiled kernels,
|
||||
so all plugin-specific model transformations are executed on each `ov::Core::compile_model()` call regardless of the `cache_dir` option.
|
||||
Still, since kernel compilation is a bottleneck in the model loading process, a significant load time reduction can be achieved with the `ov::cache_dir` property enabled.
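A short sketch of enabling the kernel cache (the cache directory and model path are arbitrary examples):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Compiled OpenCL kernels are stored here and reused on subsequent runs.
    core.set_property(ov::cache_dir("gpu_cache"));

    auto model = core.read_model("model.xml");
    auto compiled = core.compile_model(model, "GPU");  // later runs load kernels from the cache
    return 0;
}
```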
|
||||
|
||||
See [Model caching overview page](../Model_caching_overview.md) for more details.
|
||||
For more details, see the [Model caching overview](../Model_caching_overview.md).
|
||||
|
||||
### Extensibility
|
||||
See [GPU Extensibility](@ref openvino_docs_Extensibility_UG_GPU) page.
|
||||
For information on this subject, see the [GPU Extensibility](@ref openvino_docs_Extensibility_UG_GPU).
|
||||
|
||||
### GPU context and memory sharing via RemoteTensor API
|
||||
See [RemoteTensor API of GPU Plugin](GPU_RemoteTensor_API.md).
|
||||
### GPU Context and Memory Sharing via RemoteTensor API
|
||||
For information on this subject, see the [RemoteTensor API of GPU Plugin](GPU_RemoteTensor_API.md).
|
||||
|
||||
|
||||
## Supported properties
|
||||
## Supported Properties
|
||||
The plugin supports the properties listed below.
|
||||
|
||||
### Read-write properties
|
||||
All parameters must be set before calling `ov::Core::compile_model()` in order to take effect or passed as additional argument to `ov::Core::compile_model()`
|
||||
All parameters must be set before calling `ov::Core::compile_model()` in order to take effect or passed as additional argument to `ov::Core::compile_model()`.
|
||||
|
||||
- ov::cache_dir
|
||||
- ov::enable_profiling
|
||||
@ -268,7 +270,7 @@ All parameters must be set before calling `ov::Core::compile_model()` in order t
|
||||
- ov::intel_gpu::hint::queue_throttle
|
||||
- ov::intel_gpu::enable_loop_unrolling
|
||||
|
||||
### Read-only properties
|
||||
### Read-only Properties
|
||||
- ov::supported_properties
|
||||
- ov::available_devices
|
||||
- ov::range_for_async_infer_requests
|
||||
@ -285,7 +287,7 @@ All parameters must be set before calling `ov::Core::compile_model()` in order t
|
||||
- ov::intel_gpu::memory_statistics
|
||||
|
||||
## Limitations
|
||||
In some cases, the GPU plugin may implicitly execute several primitives on CPU using internal implementations which may lead to increase of CPU utilization.
|
||||
In some cases, the GPU plugin may implicitly execute several primitives on CPU using internal implementations, which may lead to an increase in CPU utilization.
|
||||
Below is a list of such operations:
|
||||
- Proposal
|
||||
- NonMaxSuppression
|
||||
@ -294,16 +296,16 @@ Below is a list of such operations:
|
||||
The behavior depends on specific parameters of the operations and hardware configuration.
|
||||
|
||||
## GPU Performance Checklist: Summary <a name="gpu-checklist"></a>
|
||||
Since OpenVINO relies on the OpenCL™ kernels for the GPU implementation, many general OpenCL tips apply:
|
||||
- Prefer `FP16` inference precision over `FP32`, as Model Optimizer can generate both variants and the `FP32` is the default. Also, consider using the [Post-training Optimization Tool](https://docs.openvino.ai/latest/pot_introduction.html).
|
||||
Since OpenVINO relies on the OpenCL kernels for the GPU implementation, many general OpenCL tips apply:
|
||||
- Prefer `FP16` inference precision over `FP32`, as Model Optimizer can generate both variants, and the `FP32` is the default. Also, consider using the [Post-training Optimization Tool](https://docs.openvino.ai/latest/pot_introduction.html).
|
||||
- Try to group individual infer jobs by using [automatic batching](../automatic_batching.md).
|
||||
- Consider [caching](../Model_caching_overview.md) to minimize model load time.
|
||||
- If your application performs inference on the CPU alongside the GPU, or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. You can use [CPU configuration options](./CPU.md) to limit the number of inference threads for the CPU plugin.
|
||||
- Even in the GPU-only scenario, a GPU driver might occupy a CPU core with spin-looped polling for completion. If CPU load is a concern, consider the dedicated `queue_throttle` property mentioned previously. Notice that this option may increase inference latency, so consider combining with multiple GPU streams or [throughput performance hints](../performance_hints.md).
|
||||
- When operating media inputs consider [remote tensors API of the GPU Plugin](./GPU_RemoteTensor_API.md).
|
||||
- If your application performs inference on the CPU alongside the GPU, or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. [CPU configuration options](./CPU.md) can be used to limit the number of inference threads for the CPU plugin.
|
||||
- Even in the GPU-only scenario, a GPU driver might occupy a CPU core with spin-loop polling for completion. If CPU load is a concern, consider the dedicated `queue_throttle` property mentioned previously. Note that this option may increase inference latency, so consider combining it with multiple GPU streams or [throughput performance hints](../performance_hints.md).
|
||||
- When operating media inputs, consider [remote tensors API of the GPU Plugin](./GPU_RemoteTensor_API.md).
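A sketch combining several of these recommendations, with a throughput hint for the GPU and a limited thread budget for a CPU workload running alongside it (all values are illustrative):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Keep a co-running CPU workload from starving the OpenCL driver threads.
    core.set_property("CPU", ov::inference_num_threads(4));

    auto model = core.read_model("model.xml");
    // The throughput hint lets the GPU plugin choose streams and batching automatically.
    auto gpu_compiled = core.compile_model(model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    return 0;
}
```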
|
||||
|
||||
|
||||
## See Also
|
||||
## Additional Resources
|
||||
* [Supported Devices](Supported_Devices.md)
|
||||
* [Optimization guide](@ref openvino_docs_optimization_guide_dldt_optimization_guide)
|
||||
* [GPU plugin developers documentation](https://github.com/openvinotoolkit/openvino/wiki/GPUPluginDevelopersDocs)
|
||||
|
@ -4,7 +4,7 @@ The GPU plugin implementation of the `ov::RemoteContext` and `ov::RemoteTensor`
|
||||
pipeline developers who need video memory sharing and interoperability with existing native APIs,
|
||||
such as OpenCL, Microsoft DirectX, or VAAPI.
|
||||
Using these interfaces allows you to avoid any memory copy overhead when plugging OpenVINO™ inference
|
||||
into an existing GPU pipeline. It also enables OpenCL kernels participating in the pipeline to become
|
||||
into an existing GPU pipeline. It also enables OpenCL kernels that participate in the pipeline to become
|
||||
native buffer consumers or producers of the OpenVINO™ inference.
|
||||
|
||||
There are two interoperability scenarios supported by the Remote Tensor API:
|
||||
@ -14,27 +14,27 @@ handles and used to create the OpenVINO™ `ov::CompiledModel` or `ov::Tensor` o
|
||||
* The OpenCL context or buffer handles can be obtained from existing GPU plugin objects, and used in OpenCL processing on the application side.
|
||||
|
||||
Class and function declarations for the API are defined in the following files:
|
||||
* Windows\*: `openvino/runtime/intel_gpu/ocl/ocl.hpp` and `openvino/runtime/intel_gpu/ocl/dx.hpp`
|
||||
* Linux\*: `openvino/runtime/intel_gpu/ocl/ocl.hpp` and `openvino/runtime/intel_gpu/ocl/va.hpp`
|
||||
* Windows -- `openvino/runtime/intel_gpu/ocl/ocl.hpp` and `openvino/runtime/intel_gpu/ocl/dx.hpp`
|
||||
* Linux -- `openvino/runtime/intel_gpu/ocl/ocl.hpp` and `openvino/runtime/intel_gpu/ocl/va.hpp`
|
||||
|
||||
The most common way to enable the interaction of your application with the Remote Tensor API is to use user-side utility classes
|
||||
and functions that consume or produce native handles directly.
|
||||
|
||||
## Context sharing between application and GPU plugin
|
||||
## Context Sharing Between Application and GPU Plugin
|
||||
|
||||
GPU plugin classes that implement the `ov::RemoteContext` interface are responsible for context sharing.
|
||||
Obtaining a context object is the first step of sharing pipeline objects.
|
||||
The context object of the GPU plugin directly wraps OpenCL context, setting a scope for sharing
|
||||
`ov::CompiledModel` and `ov::RemoteTensor` objects. `ov::RemoteContext` object can be either created on top of
|
||||
an existing handle from native api or retrieved from the GPU plugin.
|
||||
The context object of the GPU plugin directly wraps OpenCL context, setting a scope for sharing the
|
||||
`ov::CompiledModel` and `ov::RemoteTensor` objects. The `ov::RemoteContext` object can be either created on top of
|
||||
an existing handle from a native API or retrieved from the GPU plugin.
|
||||
|
||||
Once you obtain the context, you can use it to compile a new `ov::CompiledModel` or create `ov::RemoteTensor`
|
||||
Once you have obtained the context, you can use it to compile a new `ov::CompiledModel` or create `ov::RemoteTensor`
|
||||
objects.
|
||||
For network compilation, use a dedicated flavor of `ov::Core::compile_model()`, which accepts the context as an
|
||||
additional parameter.
|
||||
|
||||
### Creation of RemoteContext from native handle
|
||||
To create `ov::RemoteContext` object for user context, explicitly provide the context to the plugin using constructor for one
|
||||
### Creation of RemoteContext from Native Handle
|
||||
To create the `ov::RemoteContext` object for a user context, explicitly provide the context to the plugin, using the constructor of one
|
||||
of `ov::RemoteContext` derived classes.
|
||||
|
||||
@sphinxtabset
|
||||
@ -92,13 +92,13 @@ of `ov::RemoteContext` derived classes.
|
||||
@endsphinxtabset
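For orientation, the following is a minimal, hedged sketch of this flow (not the official snippet above). It assumes the `ClContext` constructor that accepts an existing `cl_context`, declared in `openvino/runtime/intel_gpu/ocl/ocl.hpp`:

```
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>

ov::CompiledModel compile_on_user_context(ov::Core& core,
                                          const std::shared_ptr<ov::Model>& model,
                                          cl_context user_cl_context) {
    // Wrap the existing OpenCL context; sharing scope is set by this object.
    ov::intel_gpu::ocl::ClContext gpu_context(core, user_cl_context);
    // Compile the model against the user-provided context.
    return core.compile_model(model, gpu_context);
}
```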
|
||||
|
||||
|
||||
### Getting RemoteContext from the plugin
|
||||
### Getting RemoteContext from the Plugin
|
||||
If you do not provide any user context, the plugin uses its default internal context.
|
||||
The plugin attempts to use the same internal context object as long as plugin options are kept the same.
|
||||
Therefore, all `ov::CompiledModel` objects created during this time share the same context.
|
||||
Once the plugin options are changed, the internal context is replaced by the new one.
|
||||
Once the plugin options have been changed, the internal context is replaced by the new one.
|
||||
|
||||
To request the current default context of the plugin use one of the following methods:
|
||||
To request the current default context of the plugin, use one of the following methods:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -116,15 +116,15 @@ To request the current default context of the plugin use one of the following me
|
||||
|
||||
@endsphinxtabset
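As an illustration, a minimal sketch of both retrieval paths, assuming the `as<ov::intel_gpu::ocl::ClContext>()` cast and the `get()` accessor for the underlying OpenCL handle:

```
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>

void inspect_default_contexts(ov::Core& core, ov::CompiledModel& compiled_model) {
    // Default internal context of the GPU plugin.
    auto core_context = core.get_default_context("GPU").as<ov::intel_gpu::ocl::ClContext>();
    // Context the compiled model has been created with.
    auto model_context = compiled_model.get_context().as<ov::intel_gpu::ocl::ClContext>();
    // The underlying OpenCL handle can be extracted for use on the application side.
    cl_context native_handle = core_context.get();
    (void)model_context;
    (void)native_handle;
}
```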
|
||||
|
||||
## Memory sharing between application and GPU plugin
|
||||
## Memory Sharing Between Application and GPU Plugin
|
||||
|
||||
The classes that implement the `ov::RemoteTensor` interface are the wrappers for native API
|
||||
memory handles (which can be obtained from them at any time).
|
||||
|
||||
To create a shared tensor from a native memory handle, use the dedicated `create_tensor` or `create_tensor_nv12` methods
|
||||
of the `ov::RemoteContext` sub-classes.
|
||||
`ov::intel_gpu::ocl::ClContext` has multiple overloads of `create_tensor` methods which allow to wrap pre-allocated native handles with `ov::RemoteTensor`
|
||||
object or request plugin to allocate specific device memory. See code snippets below for more details.
|
||||
`ov::intel_gpu::ocl::ClContext` has multiple overloads of the `create_tensor` method, which allow wrapping pre-allocated native handles with the `ov::RemoteTensor`
|
||||
object or requesting the plugin to allocate specific device memory. For more details, see the code snippets below:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -193,14 +193,14 @@ object or request plugin to allocate specific device memory. See code snippets b
|
||||
|
||||
@endsphinxtabset
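For orientation, a minimal sketch of wrapping a pre-allocated `cl_mem` buffer, assuming an FP32 input of shape `{1, 3, 224, 224}` and a hypothetical input name `"input"`:

```
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>

void bind_shared_buffer(ov::intel_gpu::ocl::ClContext& gpu_context,
                        ov::InferRequest& infer_request,
                        cl_mem shared_buffer) {
    ov::element::Type type = ov::element::f32;  // example precision
    ov::Shape shape = {1, 3, 224, 224};         // example input shape
    // Wrap the pre-allocated cl_mem buffer; no copy is performed.
    auto remote_tensor = gpu_context.create_tensor(type, shape, shared_buffer);
    // "input" is a hypothetical input name.
    infer_request.set_tensor("input", remote_tensor);
}
```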
|
||||
|
||||
`ov::intel_gpu::ocl::D3DContext` and `ov::intel_gpu::ocl::VAContext` classes are derived from `ov::intel_gpu::ocl::ClContext`,
|
||||
thus they provide the functionality described above and extend it
|
||||
The `ov::intel_gpu::ocl::D3DContext` and `ov::intel_gpu::ocl::VAContext` classes are derived from `ov::intel_gpu::ocl::ClContext`.
|
||||
Therefore, they provide the functionality described above and extend it
|
||||
to allow creation of `ov::RemoteTensor` objects from `ID3D11Buffer`, `ID3D11Texture2D` pointers or the `VASurfaceID` handle respectively.
|
||||
|
||||
## Direct NV12 video surface input
|
||||
## Direct NV12 Video Surface Input
|
||||
|
||||
To support the direct consumption of a hardware video decoder output, the plugin accepts two-plane video
|
||||
surfaces as arguments for the `create_tensor_nv12()` function, which creates a pair or `ov::RemoteTensor`
|
||||
surfaces as arguments for the `create_tensor_nv12()` function, which creates a pair of `ov::RemoteTensor`
|
||||
objects which represent the Y and UV planes.
|
||||
|
||||
To ensure that the plugin generates the correct execution graph for the NV12 dual-plane input, static preprocessing
|
||||
@ -208,7 +208,7 @@ should be added before model compilation:
|
||||
|
||||
@snippet snippets/gpu/preprocessing.cpp init_preproc
|
||||
|
||||
Since `ov::intel_gpu::ocl::ClImage2DTensor` (and derived classes) doesn't support batched surfaces, if batching and surface sharing are required
|
||||
Since the `ov::intel_gpu::ocl::ClImage2DTensor` and its derived classes do not support batched surfaces, if batching and surface sharing are required
|
||||
at the same time, inputs need to be set via the `ov::InferRequest::set_tensors` method with a vector of shared surfaces for each plane:
|
||||
|
||||
@sphinxtabset
|
||||
@ -230,15 +230,15 @@ at the same time, inputs need to be set via the `ov::InferRequest::set_tensors`
|
||||
|
||||
The I420 color format can be processed in a similar way.
|
||||
|
||||
## Context & queue sharing
|
||||
## Context & Queue Sharing
|
||||
|
||||
The GPU plugin supports creation of shared context from `cl_command_queue` handle. In that case
|
||||
opencl context handle is extracted from the given queue via OpenCL™ API, and the queue itself is used inside
|
||||
The GPU plugin supports creation of shared context from the `cl_command_queue` handle. In that case,
|
||||
the OpenCL context handle is extracted from the given queue via the OpenCL™ API, and the queue itself is used inside
|
||||
the plugin for further execution of inference primitives. Sharing the queue changes the behavior of the `ov::InferRequest::start_async()`
|
||||
method to guarantee that submission of inference primitives into the given queue is finished before
|
||||
returning control back to the calling thread.
|
||||
|
||||
This sharing mechanism allows to do pipeline synchronization on the app side and avoid blocking the host thread
|
||||
This sharing mechanism allows you to perform pipeline synchronization on the application side and avoid blocking the host thread
|
||||
on waiting for the completion of inference. The pseudo-code may look as follows:
|
||||
|
||||
@sphinxdirective
|
||||
@ -260,27 +260,27 @@ on waiting for the completion of inference. The pseudo-code may look as follows:
|
||||
### Limitations
|
||||
|
||||
- Some primitives in the GPU plugin may block the host thread while waiting for the previous primitives before adding their kernels
|
||||
to the command queue. In such cases the `ov::InferRequest::start_async()` call takes much more time to return control to the calling thread
|
||||
to the command queue. In such cases, the `ov::InferRequest::start_async()` call takes much more time to return control to the calling thread
|
||||
as internally it waits for a partial or full network completion.
|
||||
Examples of operations: `Loop`, `TensorIterator`, `DetectionOutput`, `NonMaxSuppression`.
|
||||
- Synchronization of pre/post processing jobs and inference pipeline inside a shared queue is user's responsibility
|
||||
- Throughput mode is not available when queue sharing is used, i.e. only a single stream can be used for each compiled model.
|
||||
- Synchronization of pre/post-processing jobs and the inference pipeline inside a shared queue is the user's responsibility.
|
||||
- Throughput mode is not available when queue sharing is used, i.e., only a single stream can be used for each compiled model.
|
||||
|
||||
## Low-Level Methods for RemoteContext and RemoteTensor creation
|
||||
## Low-Level Methods for RemoteContext and RemoteTensor Creation
|
||||
|
||||
The high-level wrappers mentioned above bring a direct dependency on native APIs to the user program.
|
||||
If you want to avoid the dependency, you still can directly use the `ov::Core::create_context()`,
|
||||
`ov::RemoteContext::create_tensor()`, and `ov::RemoteContext::get_params()` methods.
|
||||
On this level, native handles are re-interpreted as void pointers and all arguments are passed
|
||||
using `ov::AnyMap` containers that are filled with `std::string, ov::Any` pairs.
|
||||
Two types of map entries are possible: descriptor and container. The first map entry is a
|
||||
descriptor, which sets the expected structure and possible parameter values of the map.
|
||||
Two types of map entries are possible: descriptor and container.
|
||||
The descriptor sets the expected structure and possible parameter values of the map.
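A minimal sketch of the low-level path, assuming the `context_type` and `ocl_context` property names from `openvino/runtime/intel_gpu/remote_properties.hpp`:

```
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gpu/remote_properties.hpp>

ov::RemoteContext create_low_level_context(ov::Core& core, void* native_cl_context) {
    // Descriptor entry (context_type) plus container entry (ocl_context handle).
    ov::AnyMap context_params = {
        {ov::intel_gpu::context_type.name(), ov::intel_gpu::ContextType::OCL},
        {ov::intel_gpu::ocl_context.name(), native_cl_context}};
    return core.create_context("GPU", context_params);
}
```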
|
||||
|
||||
Refer to `openvino/runtime/intel_gpu/remote_properties.hpp` header file for possible low-level properties and their description.
|
||||
For possible low-level properties and their description, refer to the `openvino/runtime/intel_gpu/remote_properties.hpp` header file.
|
||||
|
||||
## Examples
|
||||
|
||||
Refer to the sections below to see pseudo-code of usage examples.
|
||||
To see pseudo-code of usage examples, refer to the sections below.
|
||||
|
||||
> **NOTE**: For low-level parameter usage examples, see the source code of user-side wrappers from the include files mentioned above.
|
||||
|
||||
|
@ -1,12 +1,12 @@
|
||||
# HDDL device {#openvino_docs_OV_UG_supported_plugins_HDDL}
|
||||
# HDDL Device {#openvino_docs_OV_UG_supported_plugins_HDDL}
|
||||
|
||||
## Introducing the HDDL Plugin
|
||||
|
||||
The OpenVINO Runtime HDDL plugin was developed for inference with neural networks on Intel® Vision Accelerator Design with Intel® Movidius™ VPUs. It is designed for use cases that require large throughput for deep learning inference, up to dozens of times more than the MYRIAD Plugin.
|
||||
The OpenVINO Runtime HDDL plugin was developed for inference with neural networks on Intel® Vision Accelerator Design with Intel® Movidius™ VPUs. It is designed for use cases that require large throughput for deep learning inference, up to dozens of times more than the MYRIAD Plugin.
|
||||
|
||||
## Configuring the HDDL Plugin
|
||||
|
||||
To configure your Intel® Vision Accelerator Design With Intel® Movidius™ on supported operating systems, refer to the Steps for Intel® Vision Accelerator Design with Intel® Movidius™ VPUs section in the installation guides for [Linux](../../install_guides/installing-openvino-linux.md) or [Windows](../../install_guides/installing-openvino-windows.md).
|
||||
To configure your Intel® Vision Accelerator Design With Intel® Movidius™ on supported operating systems, refer to the [installation guide](../../install_guides/installing-openvino-config-ivad-vpu).
|
||||
|
||||
> **NOTE**: The HDDL and Myriad plugins may cause conflicts when used at the same time.
|
||||
> To ensure proper operation in such a case, the number of booted devices needs to be limited in the `hddl_autoboot.config` file.
|
||||
@ -18,21 +18,21 @@ To see the list of supported networks for the HDDL plugin, refer to the list on
|
||||
|
||||
## Supported Configuration Parameters
|
||||
|
||||
See VPU common configuration parameters for [VPU Plugins](VPU.md).
|
||||
When specifying key values as raw strings (that is, when using the Python API), omit the `KEY_` prefix.
|
||||
For information on VPU common configuration parameters, see the [VPU Plugins](VPU.md).
|
||||
When specifying key values as raw strings (when using the Python API), omit the `KEY_` prefix.
|
||||
|
||||
In addition to common parameters for both VPU plugins, the HDDL plugin accepts the following options:
|
||||
|
||||
| Parameter Name | Parameter Values | Default | Description |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| KEY_PERF_COUNT | YES/NO | NO | Enable performance counter option. |
|
||||
| KEY_VPU_HDDL_GRAPH_TAG | string | empty string | Allows to execute network on specified count of devices. |
|
||||
| KEY_VPU_HDDL_STREAM_ID | string | empty string | Allows to execute inference on a specified device. |
|
||||
| KEY_VPU_HDDL_DEVICE_TAG | string | empty string | Allows to allocate/deallocate networks on specified devices. |
|
||||
| KEY_VPU_HDDL_BIND_DEVICE | YES/NO | NO | Whether the network should bind to a device. Refer to vpu_plugin_config.hpp. |
|
||||
| KEY_VPU_HDDL_RUNTIME_PRIORITY | signed int | 0 | Specify the runtime priority of a device among all devices running the same network. Refer to vpu_plugin_config.hpp. |
|
||||
| `KEY_PERF_COUNT` | `YES`/`NO` | `NO` | Enables performance counter option. |
|
||||
| `KEY_VPU_HDDL_GRAPH_TAG` | string | empty string | Allows executing a network on a specified number of devices. |
|
||||
| `KEY_VPU_HDDL_STREAM_ID` | string | empty string | Allows executing inference on a specified device. |
|
||||
| `KEY_VPU_HDDL_DEVICE_TAG` | string | empty string | Allows allocating/deallocating networks on specified devices. |
|
||||
| `KEY_VPU_HDDL_BIND_DEVICE` | `YES`/`NO` | `NO` | Enables the network to be bound to a device. Refer to the `vpu_plugin_config.hpp` file. |
|
||||
| `KEY_VPU_HDDL_RUNTIME_PRIORITY` | signed int | 0 | Specifies the runtime priority of a device among all devices running the same network. Refer to the `vpu_plugin_config.hpp` file. |
|
||||
|
||||
## See Also
|
||||
## Additional Resources
|
||||
|
||||
* [Supported Devices](Supported_Devices.md)
|
||||
* [VPU Plugins](VPU.md)
|
||||
|
@ -1,21 +1,20 @@
|
||||
# MYRIAD device {#openvino_docs_OV_UG_supported_plugins_MYRIAD}
|
||||
# MYRIAD Device {#openvino_docs_OV_UG_supported_plugins_MYRIAD}
|
||||
|
||||
## Introducing MYRIAD Plugin
|
||||
|
||||
The OpenVINO Runtime MYRIAD plugin has been developed for inference of neural networks on Intel® Neural Compute Stick 2.
|
||||
The OpenVINO Runtime MYRIAD plugin has been developed for inference of neural networks on Intel® Neural Compute Stick 2.
|
||||
|
||||
## Configuring the MYRIAD Plugin
|
||||
|
||||
To configure your Intel® Vision Accelerator Design With Intel® Movidius™ on supported operating systemss, refer to the Steps for Intel® Vision Accelerator Design with Intel® Movidius™ VPUs section in the installation guides for [Linux](../../install_guides/installing-openvino-linux.md) or [Windows](../../install_guides/installing-openvino-windows.md).
|
||||
To configure your Intel® Vision Accelerator Design With Intel® Movidius™ on supported operating systems, refer to the [installation guide](../../install_guides/installing-openvino-config-ivad-vpu).
|
||||
|
||||
> **NOTE**: The HDDL and MYRIAD plugins may cause conflicts when used at the same time.
|
||||
> To ensure proper operation in such a case, the number of booted devices needs to be limited in the 'hddl_autoboot.config' file.
|
||||
> **NOTE**: The HDDL and MYRIAD plugins may cause conflicts when used at the same time.
|
||||
> To ensure proper operation in such a case, the number of booted devices needs to be limited in the `hddl_autoboot.config` file.
|
||||
> Otherwise, the HDDL plugin will boot all available Intel® Movidius™ Myriad™ X devices.
|
||||
|
||||
## Supported Configuration Parameters
|
||||
|
||||
See VPU common configuration parameters for the [VPU Plugins](VPU.md).
|
||||
When specifying key values as raw strings (that is, when using the Python API), omit the `KEY_` prefix.
|
||||
For information on the VPU common configuration parameters, see the [VPU Plugins](VPU.md).
|
||||
When specifying key values as raw strings (when using the Python API), omit the `KEY_` prefix.
|
||||
|
||||
In addition to common parameters, the MYRIAD plugin accepts the following options:
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
# VPU devices {#openvino_docs_OV_UG_supported_plugins_VPU}
|
||||
# VPU Devices {#openvino_docs_OV_UG_supported_plugins_VPU}
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
@ -11,7 +11,7 @@
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
This chapter provides information on the OpenVINO Runtime plugins that enable inference of deep learning models on the supported VPU devices:
|
||||
This chapter provides information on the OpenVINO™ Runtime plugins that enable inference of deep learning models on the supported VPU devices:
|
||||
|
||||
* Intel® Neural Compute Stick 2 powered by the Intel® Movidius™ Myriad™ X — Supported by the [MYRIAD Plugin](MYRIAD.md)
|
||||
* Intel® Vision Accelerator Design with Intel® Movidius™ VPUs — Supported by the [HDDL Plugin](HDDL.md)
|
||||
@ -68,7 +68,7 @@ VPU plugins support layer fusion and decomposition.
|
||||
|
||||
#### Fusing Rules
|
||||
|
||||
Certain layers can be merged into convolution, ReLU, and Eltwise layers according to the patterns below:
|
||||
Certain layers can be merged into `Convolution`, `ReLU`, and `Eltwise` layers according to the patterns below:
|
||||
|
||||
- Convolution
|
||||
- Convolution + ReLU → Convolution
|
||||
@ -87,7 +87,7 @@ Certain layers can be merged into convolution, ReLU, and Eltwise layers accordin
|
||||
|
||||
#### Joining Rules
|
||||
|
||||
> **NOTE**: Application of these rules depends on tensor sizes and resources available.
|
||||
> **NOTE**: Application of these rules depends on tensor sizes and available resources.
|
||||
|
||||
Layers can be joined only when the two conditions below are met:
|
||||
|
||||
@ -96,38 +96,38 @@ Layers can be joined only when the two conditions below are met:
|
||||
|
||||
### Decomposition Rules
|
||||
|
||||
- Convolution and Pooling layers are tiled resulting in the following pattern:
|
||||
- A Split layer that splits tensors into tiles
|
||||
- A set of tiles, optionally with service layers like Copy
|
||||
- Depending on a tiling scheme, a Concatenation or Sum layer that joins all resulting tensors into one and restores the full blob that contains the result of a tiled operation
|
||||
- Convolution and Pooling layers are tiled, resulting in the following pattern:
|
||||
- A `Split` layer that splits tensors into tiles
|
||||
- A set of tiles, optionally with service layers like `Copy`
|
||||
- Depending on a tiling scheme, a `Concatenation` or `Sum` layer that joins all resulting tensors into one and restores the full blob that contains the result of a tiled operation
|
||||
|
||||
Names of tiled layers contain the `@soc=M/N` part, where `M` is the tile number and `N` is the number of tiles:
|
||||

|
||||
|
||||
> **NOTE**: Nominal layers, such as Shrink and Expand, are not executed.
|
||||
> **NOTE**: Nominal layers, such as `Shrink` and `Expand`, are not executed.
|
||||
|
||||
> **NOTE**: VPU plugins can add extra layers like Copy.
|
||||
> **NOTE**: VPU plugins can add extra layers like `Copy`.
|
||||
|
||||
## VPU Common Configuration Parameters
|
||||
|
||||
VPU plugins support the configuration parameters listed below.
|
||||
The parameters are passed as `std::map<std::string, std::string>` on `InferenceEngine::Core::LoadNetwork`
|
||||
or `InferenceEngine::Core::SetConfig`.
|
||||
When specifying key values as raw strings (that is, when using Python API), omit the `KEY_` prefix.
|
||||
When specifying key values as raw strings (when using Python API), omit the `KEY_` prefix.
|
||||
|
||||
| Parameter Name | Parameter Values | Default | Description |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `KEY_VPU_HW_STAGES_OPTIMIZATION` | `YES`/`NO` | `YES` | Turn on HW stages usage<br /> Applicable for Intel Movidius Myriad X and Intel Vision Accelerator Design devices only. |
|
||||
| `KEY_VPU_COMPUTE_LAYOUT` | `VPU_AUTO`, `VPU_NCHW`, `VPU_NHWC` | `VPU_AUTO` | Specify internal input and output layouts for network layers. |
|
||||
| `KEY_VPU_PRINT_RECEIVE_TENSOR_TIME` | `YES`/`NO` | `NO` | Add device-side time spent waiting for input to PerformanceCounts.<br />See <a href="#VPU_DATA_TRANSFER_PIPELINING">Data Transfer Pipelining</a> section for details. |
|
||||
| `KEY_VPU_IGNORE_IR_STATISTIC` | `YES`/`NO` | `NO` | VPU plugin could use statistic present in IR in order to try to improve calculations precision.<br /> If you don't want statistic to be used enable this option. |
|
||||
| `KEY_VPU_CUSTOM_LAYERS` | path to XML file | empty string | This option allows to pass XML file with custom layers binding.<br />If layer is present in such file, it would be used during inference even if the layer is natively supported. |
|
||||
| `KEY_VPU_PRINT_RECEIVE_TENSOR_TIME` | `YES`/`NO` | `NO` | Add device-side time spent waiting for input to PerformanceCounts.<br />See the <a href="#VPU_DATA_TRANSFER_PIPELINING">Data Transfer Pipelining</a> section for details. |
|
||||
| `KEY_VPU_IGNORE_IR_STATISTIC` | `YES`/`NO` | `NO` | The VPU plugin can use statistics present in the IR to try to improve calculation precision.<br /> Enable this option if you do not want the statistics to be used. |
|
||||
| `KEY_VPU_CUSTOM_LAYERS` | path to XML file | empty string | This option allows passing XML file with custom layers binding.<br />If a layer is present in such file, it will be used during inference even if the layer is natively supported. |
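A minimal sketch of passing such parameters as a raw-string map on `InferenceEngine::Core::LoadNetwork`, as described before the table; the key names assume the `KEY_` prefix is omitted for raw strings, and the model path is a placeholder:

```
#include <inference_engine.hpp>
#include <map>
#include <string>

int main() {
    InferenceEngine::Core ie;
    // "model.xml" is a placeholder path.
    auto network = ie.ReadNetwork("model.xml");
    // Raw-string keys omit the KEY_ prefix, as noted above.
    std::map<std::string, std::string> vpu_config = {
        {"VPU_HW_STAGES_OPTIMIZATION", "YES"},
        {"VPU_PRINT_RECEIVE_TENSOR_TIME", "NO"}};
    auto exec_network = ie.LoadNetwork(network, "MYRIAD", vpu_config);
    return 0;
}
```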
|
||||
|
||||
|
||||
## Data Transfer Pipelining <a name="VPU_DATA_TRANSFER_PIPELINING"> </a>
|
||||
|
||||
MYRIAD plugin tries to pipeline data transfer to/from device with computations.
|
||||
While one infer request is executed, the data for next infer request can be uploaded to device in parallel.
|
||||
The MYRIAD plugin tries to pipeline data transfer to/from the device with computations.
|
||||
While one infer request is executed, the data for the next infer request can be uploaded to a device in parallel.
|
||||
The same applies to result downloading.
|
||||
|
||||
The `KEY_VPU_PRINT_RECEIVE_TENSOR_TIME` configuration parameter can be used to check the efficiency of the current pipelining.
|
||||
@ -136,10 +136,10 @@ In a perfect pipeline this time should be near zero, which means that the data w
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**Get the following message when running inference with the VPU plugin: "[VPU] Cannot convert layer <layer_name> due to unsupported layer type <layer_type>"**
|
||||
**The following message appears when running inference with the VPU plugin: "[VPU] Cannot convert layer <layer_name> due to unsupported layer type <layer_type>"**
|
||||
|
||||
This means that your topology has a layer that is unsupported by your target VPU plugin. To resolve this issue, you can implement the custom layer for the target device using the [OpenVINO™ Extensibility mechanism](../../Extensibility_UG/Intro.md). Or, to quickly get a working prototype, you can use the heterogeneous scenario with the default fallback policy (see the [Heterogeneous execution](../hetero_execution.md) section). Use the HETERO mode with a fallback device that supports this layer, for example, CPU: `HETERO:MYRIAD,CPU`.
|
||||
For a list of VPU-supported layers, see the Supported Layers section of the [Supported Devices](Supported_Devices.md) page.
|
||||
This means that the topology has a layer unsupported by the target VPU plugin. To resolve this issue, implement a custom layer for the target device, using the [OpenVINO™ Extensibility mechanism](../../Extensibility_UG/Intro.md). Alternatively, to quickly get a working prototype, use the heterogeneous scenario with the default fallback policy (see the [Heterogeneous execution](../hetero_execution.md) section). Use the HETERO mode with a fallback device that supports this layer, for example, CPU: `HETERO:MYRIAD,CPU`.
|
||||
For a list of VPU-supported layers, see the **Supported Layers** section of the [Supported Devices](Supported_Devices.md) page.
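A minimal sketch of the HETERO fallback described above, using a placeholder model path:

```
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // "model.xml" is a placeholder path.
    auto model = core.read_model("model.xml");
    // Layers unsupported on MYRIAD fall back to CPU.
    auto compiled_model = core.compile_model(model, "HETERO:MYRIAD,CPU");
    return 0;
}
```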
|
||||
|
||||
## Known Layers Limitations
|
||||
|
||||
|
@ -1,25 +1,23 @@
|
||||
# Query device properties, configuration {#openvino_docs_OV_UG_query_api}
|
||||
# Query Device Properties - Configuration {#openvino_docs_OV_UG_query_api}
|
||||
|
||||
## Query device properties and device configuration
|
||||
|
||||
The OpenVINO™ toolkit supports inferencing with several types of devices (processors or accelerators).
|
||||
The OpenVINO™ toolkit supports inference with several types of devices (processors or accelerators).
|
||||
This section provides a high-level description of the process of querying different device properties and configuration values at runtime.
|
||||
|
||||
The OpenVINO Runtime has two types of properties:
|
||||
- Read only properties which provides information about the devices (such device name, termal, execution capabilities, etc) and information about ov::CompiledModel to understand what configuration values were used to compile the model with.
|
||||
- Mutable properties which are primarily used to configure ov::Core::compile_model process and affect final inference on the specific set of devices. Such properties can be set globally per device via ov::Core::set_property or locally for particular model in ov::Core::compile_model and ov::Core::query_model calls.
|
||||
- Read-only properties, which provide information about the devices (such as device name, thermal state, and execution capabilities) and about the configuration values used to compile the model (`ov::CompiledModel`).
|
||||
- Mutable properties, which are primarily used to configure the `ov::Core::compile_model` process and affect the final inference on a specific set of devices. Such properties can be set globally per device via `ov::Core::set_property` or locally for a particular model in the `ov::Core::compile_model` and `ov::Core::query_model` calls.
|
||||
|
||||
OpenVINO property is represented as a named constexpr variable with a given string name and type (see ). Example:
|
||||
An OpenVINO property is represented as a named constexpr variable with a given string name and a type. The following example represents a read-only property with a C++ name of `ov::available_devices`, a string name of `AVAILABLE_DEVICES` and a type of `std::vector<std::string>`:
|
||||
```
|
||||
static constexpr Property<std::vector<std::string>, PropertyMutability::RO> available_devices{"AVAILABLE_DEVICES"};
|
||||
```
|
||||
represents a read-only property with C++ name `ov::available_devices`, string name `AVAILABLE_DEVICES` and type `std::vector<std::string>`.
|
||||
|
||||
Refer to the [Hello Query Device C++ Sample](../../../samples/cpp/hello_query_device/README.md) sources and the [Multi-Device execution](../multi_device.md) documentation for examples of setting and getting properties in user applications.
|
||||
|
||||
### Get a set of available devices
|
||||
### Get a Set of Available Devices
|
||||
|
||||
Based on read-only property `ov::available_devices`, OpenVINO Core collects information about currently available devices enabled by OpenVINO plugins and returns information using the `ov::Core::get_available_devices` method:
|
||||
Based on the `ov::available_devices` read-only property, OpenVINO Core collects information about currently available devices enabled by OpenVINO plugins and returns information, using the `ov::Core::get_available_devices` method:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -48,13 +46,13 @@ GPU.0
|
||||
GPU.1
|
||||
```
|
||||
|
||||
If there are more than one instance of a specific device, the devices are enumerated with `.suffix` where `suffix` is a unique string identifier. Each device name can then be passed to:
|
||||
If there are multiple instances of a specific device, the devices are enumerated with a suffix comprising a full stop and a unique string identifier, such as `.suffix`. Each device name can then be passed to:
|
||||
|
||||
* `ov::Core::compile_model` to load the model to a specific device with specific configuration properties.
|
||||
* `ov::Core::get_property` to get common or device specific properties.
|
||||
* `ov::Core::get_property` to get common or device-specific properties.
|
||||
* All other methods of the `ov::Core` class that accept `deviceName`.
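A minimal sketch of this enumeration, querying the `ov::device::full_name` property for each discovered device:

```
#include <openvino/openvino.hpp>
#include <iostream>
#include <string>

int main() {
    ov::Core core;
    for (const std::string& device : core.get_available_devices()) {
        // ov::device::full_name is a read-only, device-level property.
        auto full_name = core.get_property(device, ov::device::full_name);
        std::cout << device << ": " << full_name << std::endl;
    }
    return 0;
}
```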
|
||||
|
||||
### Working with properties in Your Code
|
||||
### Working with Properties in Your Code
|
||||
|
||||
The `ov::Core` class provides the following methods to query device information and set or get device configuration properties:
|
||||
|
||||
@ -66,11 +64,11 @@ The `ov::CompiledModel` class is also extended to support the properties:
|
||||
* `ov::CompiledModel::get_property`
|
||||
* `ov::CompiledModel::set_property`
|
||||
|
||||
For documentation about OpenVINO common device-independent properties, refer to `openvino/runtime/properties.hpp`. Device specific configuration keys can be found in corresponding device folders (for example, `openvino/runtime/intel_gpu/properties.hpp`).
|
||||
For documentation about OpenVINO common device-independent properties, refer to the `openvino/runtime/properties.hpp` file. Device-specific configuration keys can be found in corresponding device folders (for example, `openvino/runtime/intel_gpu/properties.hpp`).
|
||||
|
||||
### Working with properties via Core
|
||||
### Working with Properties via Core
|
||||
|
||||
#### Getting device properties
|
||||
#### Getting Device Properties
|
||||
|
||||
The code below demonstrates how to query the `HETERO` device priority of the devices that will be used to infer the model:
|
||||
|
||||
@ -112,17 +110,17 @@ To extract device properties such as available devices (`ov::available_devices`)
|
||||
|
||||
A returned value appears as follows: `Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz`.
|
||||
|
||||
> **NOTE**: In order to understand a list of supported properties on `ov::Core` or `ov::CompiledModel` levels, use `ov::supported_properties` which contains a vector of supported property names. Properties which can be changed, has `ov::PropertyName::is_mutable` returning the `true` value. Most of the properites which are changable on ov::Core level, cannot be changed once the model is compiled, so it becomes immutable read-only property.
|
||||
> **NOTE**: To see the list of supported properties at the `ov::Core` or `ov::CompiledModel` level, use `ov::supported_properties`, which contains a vector of supported property names. Properties that can be changed have `ov::PropertyName::is_mutable` returning `true`. Most of the properties that are changeable at the `ov::Core` level cannot be changed once the model is compiled, so they become immutable read-only properties.
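A minimal sketch of listing the supported properties and their mutability for a given device:

```
#include <openvino/openvino.hpp>
#include <iostream>
#include <string>

void print_supported_properties(ov::Core& core, const std::string& device) {
    for (const auto& property : core.get_property(device, ov::supported_properties)) {
        std::cout << property << " is "
                  << (property.is_mutable() ? "mutable" : "read-only") << std::endl;
    }
}
```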
|
||||
|
||||
#### Configure a work with a model
|
||||
#### Configure Work with a Model
|
||||
|
||||
`ov::Core` methods like:
|
||||
The `ov::Core` methods like:
|
||||
|
||||
* `ov::Core::compile_model`
|
||||
* `ov::Core::import_model`
|
||||
* `ov::Core::query_model`
|
||||
|
||||
accept variadic list of properties as last arguments. Each property in such parameters lists should be used as function call to pass property value with specified property type.
|
||||
accept a selection of properties as the last arguments. Each of these properties should be used as a function call to pass a property value with a specified property type.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -140,11 +138,11 @@ accept variadic list of properties as last arguments. Each property in such para
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
The example below specifies hints that a model should be compiled to be inferenced with multiple inference requests in parallel to achive best throughput while inference should be performed without accuracy loss with FP32 precision.
|
||||
The example below specifies hints that a model should be compiled to be inferred with multiple inference requests in parallel to achieve the best throughput, while inference should be performed without accuracy loss, with FP32 precision.
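A minimal sketch of such a call, assuming the `ov::hint::performance_mode` and `ov::hint::inference_precision` properties and a placeholder model path:

```
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // "model.xml" is a placeholder path.
    auto model = core.read_model("model.xml");
    // Throughput-oriented execution with FP32 inference precision.
    auto compiled_model = core.compile_model(model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT),
        ov::hint::inference_precision(ov::element::f32));
    return 0;
}
```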
|
||||
|
||||
#### Setting properties globally
|
||||
#### Setting Properties Globally
|
||||
|
||||
`ov::Core::set_property` with a given device name should be used to set global configuration properties which are the same accross multiple `ov::Core::compile_model`, `ov::Core::query_model`, etc. calls, while setting property on the specific `ov::Core::compile_model` call applies properties only for current call:
|
||||
`ov::Core::set_property` with a given device name should be used to set global configuration properties, which are the same across multiple `ov::Core::compile_model`, `ov::Core::query_model`, and other calls. However, setting properties on a specific `ov::Core::compile_model` call applies properties only for the current call:
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -162,9 +160,9 @@ The example below specifies hints that a model should be compiled to be inferenc
|
||||
|
||||
@endsphinxtabset
|
||||
|
||||
### Properties on CompiledModel level
|
||||
### Properties on CompiledModel Level
|
||||
|
||||
#### Getting property
|
||||
#### Getting Property
|
||||
|
||||
The `ov::CompiledModel::get_property` method is used to get the property values the compiled model has been created with, or a compiled-model-level property such as `ov::optimal_number_of_infer_requests`:
|
||||
|
||||
@ -221,7 +219,7 @@ Or the number of threads that would be used for inference on `CPU` device:
|
||||
|
||||
@endsphinxtabset
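A minimal sketch of reading the compiled-model-level property mentioned above:

```
#include <openvino/openvino.hpp>
#include <cstdint>
#include <iostream>

void print_optimal_requests(const ov::CompiledModel& compiled_model) {
    uint32_t nireq = compiled_model.get_property(ov::optimal_number_of_infer_requests);
    std::cout << "Optimal number of infer requests: " << nireq << std::endl;
}
```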
|
||||
|
||||
#### Setting properties for compiled model
|
||||
#### Setting Properties for Compiled Model
|
||||
|
||||
The only mode that supports this method is [Multi-Device execution](../multi_device.md):
|
||||
|
||||
|
BIN
docs/benchmarks/files/Platform_list.pdf
Normal file
Binary file not shown.
@ -14,11 +14,11 @@
|
||||
|
||||
The [Intel® Distribution of OpenVINO™ toolkit](https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit.html) helps accelerate deep learning inference across a variety of Intel® processors and accelerators.
|
||||
|
||||
The benchmarks below demonstrate high performance gains on several public neural networks on multiple Intel® CPUs, GPUs and VPUs covering a broad performance range. Use this data to help you decide which hardware is best for your applications and solutions, or to plan your AI workload on the Intel computing already included in your solutions.
|
||||
The benchmark results below demonstrate high performance gains on several public neural networks on multiple Intel® CPUs, GPUs and VPUs, covering a broad performance range. The results may be helpful when deciding which hardware is best for your applications and solutions, or when planning an AI workload on the Intel computing already included in your solutions.
|
||||
|
||||
Use the links below to review the benchmarking results for each alternative:
|
||||
The following benchmarks are available:
|
||||
|
||||
* [Intel® Distribution of OpenVINO™ toolkit Benchmark Results](performance_benchmarks_openvino.md)
|
||||
* [OpenVINO™ Model Server Benchmark Results](performance_benchmarks_ovms.md)
|
||||
* [Intel® Distribution of OpenVINO™ toolkit Benchmark Results](performance_benchmarks_openvino.md).
|
||||
* [OpenVINO™ Model Server Benchmark Results](performance_benchmarks_ovms.md).
|
||||
|
||||
Performance for a particular application can also be evaluated virtually using [Intel® DevCloud for the Edge](https://devcloud.intel.com/edge/), a remote development environment with access to Intel® hardware and the latest versions of the Intel® Distribution of the OpenVINO™ Toolkit. [Learn more](https://devcloud.intel.com/edge/get_started/devcloud/) or [Register here](https://inteliot.force.com/DevcloudForEdge/s/).
|
||||
Performance of a particular application can also be evaluated virtually using [Intel® DevCloud for the Edge](https://devcloud.intel.com/edge/). It is a remote development environment with access to Intel® hardware and the latest versions of the Intel® Distribution of the OpenVINO™ Toolkit. To learn more about it, visit [the website](https://www.intel.com/content/www/us/en/developer/tools/devcloud/edge/overview.html) or [create an account](https://www.intel.com/content/www/us/en/forms/idz/devcloud-registration.html?tgt=https://www.intel.com/content/www/us/en/secure/forms/devcloud-enrollment/account-provisioning.html).
|
||||
|
@ -1,28 +1,29 @@
|
||||
# Performance Information Frequently Asked Questions {#openvino_docs_performance_benchmarks_faq}
|
||||
|
||||
The following questions and answers are related to [performance benchmarks](./performance_benchmarks.md) published on the documentation site.
|
||||
The following questions (Q#) and answers (A) are related to published [performance benchmarks](./performance_benchmarks.md).
|
||||
|
||||
#### 1. How often do performance benchmarks get updated?
|
||||
New performance benchmarks are typically published on every `major.minor` release of the Intel® Distribution of OpenVINO™ toolkit.
|
||||
#### Q1: How often do performance benchmarks get updated?
|
||||
**A**: New performance benchmarks are typically published on every `major.minor` release of the Intel® Distribution of OpenVINO™ toolkit.
|
||||
|
||||
#### 2. Where can I find the models used in the performance benchmarks?
|
||||
All of the models used are included in the toolkit's [Open Model Zoo](https://github.com/openvinotoolkit/open_model_zoo) GitHub repository.
|
||||
#### Q2: Where can I find the models used in the performance benchmarks?
|
||||
**A**: All models used are included in the GitHub repository of [Open Model Zoo](https://github.com/openvinotoolkit/open_model_zoo).
|
||||
|
||||
#### 3. Will there be new models added to the list used for benchmarking?
|
||||
The models used in the performance benchmarks were chosen based on general adoption and usage in deployment scenarios. We're continuing to add new models that support a diverse set of workloads and usage.
|
||||
#### Q3: Will there be any new models added to the list used for benchmarking?
|
||||
**A**: The models used in the performance benchmarks were chosen based on general adoption and usage in deployment scenarios. New models that support a diverse set of workloads and usage are added periodically.
|
||||
|
||||
#### 4. What does CF or TF in the graphs stand for?
|
||||
CF means Caffe*, while TF means TensorFlow*.
|
||||
#### Q4: What does "CF" or "TF" in the graphs stand for?
|
||||
**A**: The "CF" means "Caffe", and "TF" means "TensorFlow".
|
||||
|
||||
#### 5. How can I run the benchmark results on my own?
|
||||
All of the performance benchmarks were generated using the open-sourced tool within the Intel® Distribution of OpenVINO™ toolkit called `benchmark_app`, which is available in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md).
|
||||
#### Q5: How can I run the benchmark results on my own?
|
||||
**A**: All of the performance benchmarks were generated using the open-source tool within the Intel® Distribution of OpenVINO™ toolkit called `benchmark_app`. This tool is available in both [C++](../../samples/cpp/benchmark_app/README.md) and [Python](../../tools/benchmark_tool/README.md).
|
||||
|
||||
#### Q6: What image sizes are used for the classification network models?
|
||||
**A**: The image size used in inference depends on the benchmarked network. The table below presents the list of input sizes for each network model:
|
||||
|
||||
#### 6. What image sizes are used for the classification network models?
|
||||
The image size used in the inference depends on the network being benchmarked. The following table shows the list of input sizes for each network model.
|
||||
| **Model** | **Public Network** | **Task** | **Input Size** (Height x Width) |
|
||||
|------------------------------------------------------------------------------------------------------------------------------------|------------------------------------|-----------------------------|-----------------------------------|
|
||||
| [bert-base-cased](https://github.com/PaddlePaddle/PaddleNLP/tree/v2.1.1) | BERT | question / answer | 124 |
|
||||
| [bert-large-uncased-whole-word-masking-squad](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/intel/bert-large-uncased-whole-word-masking-squad-int8-0001) | BERT-large | question / answer | 384 |
|
||||
| [bert-large-uncased-whole-word-masking-squad-int8-0001](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/intel/bert-large-uncased-whole-word-masking-squad-int8-0001) | BERT-large | question / answer | 384 |
|
||||
| [bert-small-uncased-whole-masking-squad-0002](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/intel/bert-small-uncased-whole-word-masking-squad-0002) | BERT-small | question / answer | 384 |
|
||||
| [brain-tumor-segmentation-0001-MXNET](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/brain-tumor-segmentation-0001) | brain-tumor-segmentation-0001 | semantic segmentation | 128x128x128 |
|
||||
| [brain-tumor-segmentation-0002-CF2](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/brain-tumor-segmentation-0002) | brain-tumor-segmentation-0002 | semantic segmentation | 128x128x128 |
|
||||
@ -33,7 +34,7 @@ The image size used in the inference depends on the network being benchmarked. T
|
||||
| [Facedetection0200](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/intel/face-detection-0200) | FaceDetection0200 | detection | 256x256 |
|
||||
| [faster_rcnn_resnet50_coco-TF](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/faster_rcnn_resnet50_coco) | Faster RCNN Tf | object detection | 600x1024 |
|
||||
| [forward-tacotron-duration-prediction](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/forward-tacotron) | ForwardTacotron | text to speech | 241 |
|
||||
| [inception-v4-TF](https://github.com/openvinotoolkit/open_model_zoo/tree/develop/models/public/googlenet-v4-tf) | Inception v4 Tf (aka GoogleNet-V4) | classification | 299x299 |
|
||||
| [inception-v4-TF](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/googlenet-v4-tf) | Inception v4 Tf (aka GoogleNet-V4) | classification | 299x299 |
|
||||
| [inception-v3-TF](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/googlenet-v3) | Inception v3 Tf | classification | 299x299 |
|
||||
| [mask_rcnn_resnet50_atrous_coco](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/mask_rcnn_resnet50_atrous_coco) | Mask R-CNN ResNet50 Atrous | instance segmentation | 800x1365 |
|
||||
| [mobilenet-ssd-CF](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/mobilenet-ssd) | SSD (MobileNet)_COCO-2017_Caffe | object detection | 300x300 |
|
||||
@ -49,22 +50,22 @@ The image size used in the inference depends on the network being benchmarked. T
|
||||
| [yolo_v4-TF](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/yolo-v4-tf) | Yolo-V4 TF | object detection | 608x608 |
|
||||
| [ssd_mobilenet_v1_coco-TF](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/ssd_mobilenet_v1_coco) | ssd_mobilenet_v1_coco | object detection | 300x300 |
|
||||
| [ssdlite_mobilenet_v2-TF](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/ssdlite_mobilenet_v2) | ssdlite_mobilenet_v2 | object detection | 300x300 |
|
||||
| [unet-camvid-onnx-0001](https://github.com/openvinotoolkit/open_model_zoo/blob/master/models/intel/unet-camvid-onnx-0001/description/unet-camvid-onnx-0001.md) | U-Net | semantic segmentation | 368x480 |
|
||||
| [yolo-v3-tiny-tf](https://github.com/openvinotoolkit/open_model_zoo/tree/develop/models/public/yolo-v3-tiny-tf) | YOLO v3 Tiny | object detection | 416x416 |
|
||||
| [unet-camvid-onnx-0001](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/intel/unet-camvid-onnx-0001) | U-Net | semantic segmentation | 368x480 |
|
||||
| [yolo-v3-tiny-tf](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/yolo-v3-tiny-tf) | YOLO v3 Tiny | object detection | 416x416 |
|
||||
| [yolo-v3](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/yolo-v3-tf) | YOLO v3 | object detection | 416x416 |
|
||||
| [ssd-resnet34-1200-onnx](https://github.com/openvinotoolkit/open_model_zoo/tree/develop/models/public/ssd-resnet34-1200-onnx) | ssd-resnet34 onnx model | object detection | 1200x1200 |
|
||||
| [ssd-resnet34-1200-onnx](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/ssd-resnet34-1200-onnx) | ssd-resnet34 onnx model | object detection | 1200x1200 |
|
||||
|
||||
#### 7. Where can I purchase the specific hardware used in the benchmarking?
|
||||
Intel partners with various vendors all over the world. Visit the [Intel® AI: In Production Partners & Solutions Catalog](https://www.intel.com/content/www/us/en/internet-of-things/ai-in-production/partners-solutions-catalog.html) for a list of Equipment Makers and the [Supported Devices](../OV_Runtime_UG/supported_plugins/Supported_Devices.md) documentation. You can also remotely test and run models before purchasing any hardware by using [Intel® DevCloud for the Edge](http://devcloud.intel.com/edge/).
|
||||
#### Q7: Where can I purchase the specific hardware used in the benchmarking?
|
||||
**A**: Intel partners with vendors all over the world. For a list of Hardware Manufacturers, see the [Intel® AI: In Production Partners & Solutions Catalog](https://www.intel.com/content/www/us/en/internet-of-things/ai-in-production/partners-solutions-catalog.html). For more details, see the [Supported Devices](../OV_Runtime_UG/supported_plugins/Supported_Devices.md) documentation. Before purchasing any hardware, you can test and run models remotely, using [Intel® DevCloud for the Edge](http://devcloud.intel.com/edge/).
|
||||
|
||||
#### 8. How can I optimize my models for better performance or accuracy?
|
||||
We published a set of guidelines and recommendations to optimize your models available in the [optimization guide](../optimization_guide/dldt_optimization_guide.md). For further support, please join the conversation in the [Community Forum](https://software.intel.com/en-us/forums/intel-distribution-of-openvino-toolkit).
|
||||
#### Q8: How can I optimize my models for better performance or accuracy?
|
||||
**A**: A set of guidelines and recommendations for optimizing models is available in the [optimization guide](../optimization_guide/dldt_optimization_guide.md). For further support, join the conversation in the [Community Forum](https://software.intel.com/en-us/forums/intel-distribution-of-openvino-toolkit).
|
||||
|
||||
#### 9. Why are INT8 optimized models used for benchmarking on CPUs with no VNNI support?
|
||||
The benefit of low-precision optimization using the OpenVINO™ toolkit model optimizer extends beyond processors supporting VNNI through Intel® DL Boost. The reduced bit width of INT8 compared to FP32 allows Intel® CPU to process the data faster and thus offers better throughput on any converted model agnostic of the intrinsically supported low-precision optimizations within Intel® hardware. Refer to [Model Accuracy for INT8 and FP32 Precision](performance_int8_vs_fp32.md) for comparison on boost factors for different network models and a selection of Intel® CPU architectures, including AVX-2 with Intel® Core™ i7-8700T, and AVX-512 (VNNI) with Intel® Xeon® 5218T and Intel® Xeon® 8270.
|
||||
#### Q9: Why are INT8 optimized models used for benchmarking on CPUs with no VNNI support?
|
||||
**A**: The benefit of low-precision optimization using the OpenVINO™ toolkit model optimizer extends beyond processors supporting VNNI through Intel® DL Boost. The reduced bit width of INT8 compared to FP32 allows Intel® CPU to process the data faster. Therefore, it offers better throughput on any converted model, regardless of the intrinsically supported low-precision optimizations within Intel® hardware. For comparison on boost factors for different network models and a selection of Intel® CPU architectures, including AVX-2 with Intel® Core™ i7-8700T, and AVX-512 (VNNI) with Intel® Xeon® 5218T and Intel® Xeon® 8270, refer to the [Model Accuracy for INT8 and FP32 Precision](performance_int8_vs_fp32.md) article.
|
||||
|
||||
#### 10. Where can I search for OpenVINO™ performance results based on HW-platforms?
|
||||
The web site format has changed in order to support the more common search approach of looking for the performance of a given neural network model on different HW-platforms. As opposed to review a given HW-platform's performance on different neural network models.
|
||||
#### Q10: Where can I search for OpenVINO™ performance results based on HW-platforms?
|
||||
**A**: The website format has changed in order to support the more common approach of searching for the performance results of a given neural network model on different HW-platforms, as opposed to reviewing the performance of a given HW-platform when working with different neural network models.
|
||||
|
||||
#### 11. How is Latency measured?
|
||||
Latency is measured by running the OpenVINO™ Runtime in synchronous mode. In synchronous mode each frame or image is processed through the entire set of stages (pre-processing, inference, post-processing) before the next frame or image is processed. This KPI is relevant for applications where the inference on a single image is required, for example the analysis of an ultra sound image in a medical application or the analysis of a seismic image in the oil & gas industry. Other use cases include real-time or near real-time applications like an industrial robot's response to changes in its environment and obstacle avoidance for autonomous vehicles where a quick response to the result of the inference is required.
|
||||
#### Q11: How is Latency measured?
|
||||
**A**: Latency is measured by running the OpenVINO™ Runtime in synchronous mode. In this mode, each frame or image is processed through the entire set of stages (pre-processing, inference, post-processing) before the next frame or image is processed. This KPI is relevant for applications where the inference on a single image is required. For example, the analysis of an ultra sound image in a medical application or the analysis of a seismic image in the oil & gas industry. Other use cases include real-time or near real-time applications, e.g., the response of an industrial robot to changes in its environment or obstacle avoidance for autonomous vehicles, where a quick response to the result of the inference is required.
|
||||
|
@ -6,24 +6,44 @@
|
||||
:hidden:
|
||||
|
||||
openvino_docs_performance_benchmarks_faq
|
||||
Download Performance Data Spreadsheet in MS Excel* Format <https://docs.openvino.ai/downloads/benchmark_files/OV-2022.1-Download-Excel.xlsx>
|
||||
Download Performance Data Spreadsheet in MS Excel Format <https://docs.openvino.ai/downloads/benchmark_files/OV-2022.1-Download-Excel.xlsx>
|
||||
openvino_docs_performance_int8_vs_fp32
|
||||
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
This benchmark setup includes a single machine on which both the benchmark application and the OpenVINO™ installation reside.
|
||||
Features and benefits of Intel® technologies depend on system configuration and may require enabled hardware, software or service activation. More information on this subject may be obtained from the original equipment manufacturer (OEM), official [Intel® web page](https://www.intel.com) or retailer.
|
||||
|
||||
The benchmark application loads the OpenVINO™ Runtime and executes inferences on the specified hardware (CPU, GPU or VPU). The benchmark application measures the time spent on actual inferencing (excluding any pre or post processing) and then reports on the inferences per second (or Frames Per Second). For more information on the benchmark application, please also refer to the entry 5 of the [FAQ section](performance_benchmarks_faq.md).
|
||||
## Platform Configurations
|
||||
|
||||
Measuring inference performance involves many variables and is extremely use-case and application dependent. We use the below four parameters for measurements, which are key elements to consider for a successful deep learning inference application:
|
||||
@sphinxdirective
|
||||
|
||||
- **Throughput** - Measures the number of inferences delivered within a latency threshold. (for example, number of Frames Per Second - FPS). When deploying a system with deep learning inference, select the throughput that delivers the best trade-off between latency and power for the price and performance that meets your requirements.
|
||||
:download:`A full list of HW platforms used for testing (along with their configuration)<../../../docs/benchmarks/files/Platform_list.pdf>`
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
For more specific information, refer to the [Configuration Details](https://docs.openvino.ai/resources/benchmark_files/system_configurations_2022.1.html) document.
|
||||
|
||||
## Benchmark Setup Information
|
||||
|
||||
This benchmark setup includes a single machine on which both the benchmark application and the OpenVINO™ installation reside. The presented performance benchmark numbers are based on release 2022.1 of the Intel® Distribution of OpenVINO™ toolkit.
|
||||
|
||||
The benchmark application loads the OpenVINO™ Runtime and executes inferences on the specified hardware (CPU, GPU or VPU). It measures the time spent on actual inferencing (excluding any pre- or post-processing) and then reports on the inferences per second (or Frames Per Second - FPS). For additional information on the benchmark application, refer to entry 5 in the [FAQ section](performance_benchmarks_faq.md).
|
||||
|
||||
Measuring inference performance involves many variables and is extremely use case and application dependent. Below are four parameters used for measurements, which are key elements to consider for a successful deep learning inference application:
|
||||
|
||||
- **Throughput** - Measures the number of inferences delivered within a latency threshold (for example, number of FPS). When deploying a system with deep learning inference, select the throughput that delivers the best trade-off between latency and power for the price and performance that meets your requirements.
|
||||
- **Value** - While throughput is important, what is more critical in edge AI deployments is the performance efficiency or performance-per-cost. Application performance in throughput per dollar of system cost is the best measure of value.
|
||||
- **Efficiency** - System power is a key consideration from the edge to the data center. When selecting deep learning solutions, power efficiency (throughput/watt) is a critical factor to consider. Intel designs provide excellent power efficiency for running deep learning workloads.
|
||||
- **Latency** - This measures the synchronous execution of inference requests and is reported in milliseconds. Each inference request (for example: preprocess, infer, postprocess) is allowed to complete before the next is started. This performance metric is relevant in usage scenarios where a single image input needs to be acted upon as soon as possible. An example would be the healthcare sector where medical personnel only request analysis of a single ultra sound scanning image or in real-time or near real-time applications for example an industrial robot's response to actions in its environment or obstacle avoidance for autonomous vehicles.
|
||||
- **Latency** - This parameter measures the synchronous execution of inference requests and is reported in milliseconds. Each inference request (i.e., preprocess, infer, postprocess) is allowed to complete before the next one is started. This performance metric is relevant in usage scenarios where a single image input needs to be acted upon as soon as possible. An example of that kind of a scenario would be real-time or near real-time applications, i.e., the response of an industrial robot to its environment or obstacle avoidance for autonomous vehicles.
|
||||
|
||||
## bert-base-cased [124]
|
||||
## Benchmark Performance Results
|
||||
|
||||
Benchmark performance results below are based on testing as of March 17, 2022. They may not reflect all publicly available updates at the time of testing.
|
||||
<!-- See configuration disclosure for details. No product can be absolutely secure. -->
|
||||
Performance varies by use, configuration, and other factors, which are elaborated on further [here](https://www.intel.com/PerformanceIndex). The Intel optimizations used (for Intel® compilers or other products) may not optimize to the same degree for non-Intel products.
|
||||
|
||||
### bert-base-cased [124]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### bert-large-uncased-whole-word-masking-squad-int8-0001 [384]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### deeplabv3-TF [513x513]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### densenet-121-TF [224x224]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### efficientdet-d0 [512x512]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### faster-rcnn-resnet50-coco-TF [600x1024]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### inception-v4-TF [299x299]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### mobilenet-ssd-CF [300x300]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### mobilenet-v2-pytorch [224x224]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### resnet-18-pytorch [224x224]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### resnet_50_TF [224x224]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### ssd-resnet34-1200-onnx [1200x1200]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### unet-camvid-onnx-0001 [368x480]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### yolo-v3-tiny-tf [416x416]

@sphinxdirective

.. raw:: html

@endsphinxdirective


### yolo_v4-tf [608x608]

@sphinxdirective

.. raw:: html

@endsphinxdirective

## Platform Configurations
Intel® Distribution of OpenVINO™ toolkit performance benchmark numbers are based on release 2022.1.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. Performance results are based on testing as of March 17, 2022 and may not reflect all publicly available updates. See configuration disclosure for details. No product can be absolutely secure.
Performance varies by use, configuration and other factors. Learn more at [www.intel.com/PerformanceIndex](https://www.intel.com/PerformanceIndex).
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products.
Testing by Intel done on: see test date for each HW platform below.
**CPU Inference Engines**
| Configuration | Intel® Xeon® E-2124G | Intel® Xeon® W1290P |
| ------------------------------- | ---------------------- | --------------------------- |
| Motherboard | ASUS* WS C246 PRO | ASUS* WS W480-ACE |
| CPU | Intel® Xeon® E-2124G CPU @ 3.40GHz | Intel® Xeon® W-1290P CPU @ 3.70GHz |
| Hyper Threading | OFF | ON |
| Turbo Setting | ON | ON |
| Memory | 2 x 16 GB DDR4 2666MHz | 4 x 16 GB DDR4 @ 2666MHz |
| Operating System | Ubuntu* 20.04.3 LTS | Ubuntu* 20.04.3 LTS |
| Kernel Version | 5.4.0-42-generic | 5.4.0-42-generic |
| BIOS Vendor | American Megatrends Inc.* | American Megatrends Inc. |
| BIOS Version | 1901 | 2301 |
| BIOS Release | September 24, 2021 | July 8, 2021 |
| BIOS Settings | Select optimized default settings, <br>save & exit | Select optimized default settings, <br>save & exit |
| Batch size | 1 | 1 |
| Precision | INT8 | INT8 |
| Number of concurrent inference requests | 4 | 5 |
| Test Date | March 17, 2022 | March 17, 2022 |
| Rated maximum TDP/socket in Watt | [71](https://ark.intel.com/content/www/us/en/ark/products/134854/intel-xeon-e-2124g-processor-8m-cache-up-to-4-50-ghz.html#tab-blade-1-0-1) | [125](https://ark.intel.com/content/www/us/en/ark/products/199336/intel-xeon-w-1290p-processor-20m-cache-3-70-ghz.html) |
| CPU Price/socket on Feb 14, 2022, USD<br>Prices may vary | [213](https://ark.intel.com/content/www/us/en/ark/products/134854/intel-xeon-e-2124g-processor-8m-cache-up-to-4-50-ghz.html) | [539](https://ark.intel.com/content/www/us/en/ark/products/199336/intel-xeon-w-1290p-processor-20m-cache-3-70-ghz.html) |
**CPU Inference Engines (continue)**
| Configuration | Intel® Xeon® Silver 4216R | Intel® Xeon® Silver 4316 |
| ------------------------------- | ---------------------- | --------------------------- |
| Motherboard | Intel® Server Board S2600STB | Intel Corporation / WilsonCity |
| CPU | Intel® Xeon® Silver 4216R CPU @ 2.20GHz | Intel® Xeon® Silver 4316 CPU @ 2.30GHz |
| Hyper Threading | ON | ON |
| Turbo Setting | ON | ON |
| Memory | 12 x 32 GB DDR4 2666MHz | 16 x 32 GB DDR4 @ 2666MHz |
| Operating System | Ubuntu* 20.04.3 LTS | Ubuntu* 20.04.3 LTS |
| Kernel Version | 5.3.0-24-generic | 5.4.0-100-generic |
| BIOS Vendor | Intel Corporation | Intel Corporation |
| BIOS Version | SE5C620.86B.02.01.<br>0013.121520200651 | WLYDCRB1.SYS.0021.<br>P41.2109200451 |
| BIOS Release | December 15, 2020 | September 20, 2021 |
| BIOS Settings | Select optimized default settings, <br>change power policy <br>to "performance", <br>save & exit | Select optimized default settings, <br>save & exit |
| Batch size | 1 | 1 |
| Precision | INT8 | INT8 |
| Number of concurrent inference requests | 32 | 10 |
| Test Date | March 17, 2022 | March 17, 2022 |
| Rated maximum TDP/socket in Watt | [125](https://ark.intel.com/content/www/us/en/ark/products/193394/intel-xeon-silver-4216-processor-22m-cache-2-10-ghz.html#tab-blade-1-0-1) | [150](https://ark.intel.com/content/www/us/en/ark/products/215270/intel-xeon-silver-4316-processor-30m-cache-2-30-ghz.html)|
| CPU Price/socket on June 21, 2021, USD<br>Prices may vary | [1,002](https://ark.intel.com/content/www/us/en/ark/products/193394/intel-xeon-silver-4216-processor-22m-cache-2-10-ghz.html) | [1083](https://ark.intel.com/content/www/us/en/ark/products/215270/intel-xeon-silver-4316-processor-30m-cache-2-30-ghz.html)|
**CPU Inference Engines (continue)**
| Configuration | Intel® Xeon® Gold 5218T | Intel® Xeon® Platinum 8270 | Intel® Xeon® Platinum 8380 |
| ------------------------------- | ---------------------------- | ---------------------------- | -----------------------------------------|
| Motherboard | Intel® Server Board S2600STB | Intel® Server Board S2600STB | Intel Corporation / WilsonCity |
| CPU | Intel® Xeon® Gold 5218T CPU @ 2.10GHz | Intel® Xeon® Platinum 8270 CPU @ 2.70GHz | Intel® Xeon® Platinum 8380 CPU @ 2.30GHz |
| Hyper Threading | ON | ON | ON |
| Turbo Setting | ON | ON | ON |
| Memory | 12 x 32 GB DDR4 2666MHz | 12 x 32 GB DDR4 2933MHz | 16 x 16 GB DDR4 3200MHz |
| Operating System | Ubuntu* 20.04.3 LTS | Ubuntu* 20.04.3 LTS | Ubuntu* 20.04.1 LTS |
| Kernel Version | 5.3.0-24-generic | 5.3.0-24-generic | 5.4.0-64-generic |
| BIOS Vendor | Intel Corporation | Intel Corporation | Intel Corporation |
| BIOS Version | SE5C620.86B.02.01.<br>0013.121520200651 | SE5C620.86B.02.01.<br>0013.121520200651 | WLYDCRB1.SYS.0020.<br>P86.2103050636 |
| BIOS Release | December 15, 2020 | December 15, 2020 | March 5, 2021 |
| BIOS Settings | Select optimized default settings, <br>change power policy to "performance", <br>save & exit | Select optimized default settings, <br>change power policy to "performance", <br>save & exit | Select optimized default settings, <br>change power policy to "performance", <br>save & exit |
| Batch size | 1 | 1 | 1 |
| Precision | INT8 | INT8 | INT8 |
| Number of concurrent inference requests | 32 | 52 | 80 |
| Test Date | March 17, 2022 | March 17, 2022 | March 17, 2022 |
| Rated maximum TDP/socket in Watt | [105](https://ark.intel.com/content/www/us/en/ark/products/193953/intel-xeon-gold-5218t-processor-22m-cache-2-10-ghz.html#tab-blade-1-0-1) | [205](https://ark.intel.com/content/www/us/en/ark/products/192482/intel-xeon-platinum-8270-processor-35-75m-cache-2-70-ghz.html#tab-blade-1-0-1) | [270](https://mark.intel.com/content/www/us/en/secure/mark/products/212287/intel-xeon-platinum-8380-processor-60m-cache-2-30-ghz.html#tab-blade-1-0-1) |
| CPU Price/socket on Feb 14, 2022, USD<br>Prices may vary | [1,349](https://ark.intel.com/content/www/us/en/ark/products/193953/intel-xeon-gold-5218t-processor-22m-cache-2-10-ghz.html) | [7,405](https://ark.intel.com/content/www/us/en/ark/products/192482/intel-xeon-platinum-8270-processor-35-75m-cache-2-70-ghz.html) | [8,099](https://mark.intel.com/content/www/us/en/secure/mark/products/212287/intel-xeon-platinum-8380-processor-60m-cache-2-30-ghz.html#tab-blade-1-0-0) |
**CPU Inference Engines (continue)**
| Configuration | Intel® Core™ i9-10920X | Intel® Core™ i9-10900TE | Intel® Core™ i9-12900 |
| -------------------- | -------------------------------------| ----------------------- | -------------------------------------------------------------- |
| Motherboard | ASUS* PRIME X299-A II | B595 | Intel Corporation<br>internal/Reference<br>Validation Platform |
| CPU | Intel® Core™ i9-10920X CPU @ 3.50GHz | Intel® Core™ i9-10900TE CPU @ 1.80GHz | 12th Gen Intel® Core™ i9-12900 |
| Hyper Threading | ON | ON | OFF |
| Turbo Setting | ON | ON | - |
| Memory | 4 x 16 GB DDR4 2666MHz | 2 x 8 GB DDR4 @ 2400 MHz | 4 x 8 GB DDR4 4800MHz |
| Operating System | Ubuntu 20.04.3 LTS | Ubuntu 20.04.3 LTS | Microsoft Windows 10 Pro |
| Kernel Version | 5.4.0-42-generic | 5.4.0-42-generic | 10.0.19043 N/A Build 19043 |
| BIOS Vendor | American Megatrends Inc.* | American Megatrends Inc.* | Intel Corporation |
| BIOS Version | 1004 | Z667AR10.BIN | ADLSFWI1.R00.2303.<br>B00.2107210432 |
| BIOS Release | March 19, 2021 | July 15, 2020 | July 21, 2021 |
| BIOS Settings | Default Settings | Default Settings | Default Settings |
| Batch size | 1 | 1 | 1 |
| Precision | INT8 | INT8 | INT8 |
| Number of concurrent inference requests | 24 | 5 | 4 |
| Test Date | March 17, 2022 | March 17, 2022 | March 17, 2022 |
| Rated maximum TDP/socket in Watt | [165](https://ark.intel.com/content/www/us/en/ark/products/198012/intel-core-i9-10920x-x-series-processor-19-25m-cache-3-50-ghz.html) | [35](https://ark.intel.com/content/www/us/en/ark/products/203901/intel-core-i910900te-processor-20m-cache-up-to-4-60-ghz.html) | [65](https://ark.intel.com/content/www/us/en/ark/products/134597/intel-core-i912900-processor-30m-cache-up-to-5-10-ghz.html) |
| CPU Price/socket on Feb 14, 2022, USD<br>Prices may vary | [700](https://ark.intel.com/content/www/us/en/ark/products/198012/intel-core-i9-10920x-x-series-processor-19-25m-cache-3-50-ghz.html) | [444](https://ark.intel.com/content/www/us/en/ark/products/203901/intel-core-i910900te-processor-20m-cache-up-to-4-60-ghz.html) | [519](https://ark.intel.com/content/www/us/en/ark/products/134597/intel-core-i912900-processor-30m-cache-up-to-5-10-ghz.html)|
**CPU Inference Engines (continue)**
| Configuration | Intel® Core™ i7-8700T | Intel® Core™ i7-1185G7 |
| -------------------- | ----------------------------------- | -------------------------------- |
| Motherboard | GIGABYTE* Z370M DS3H-CF | Intel Corporation<br>internal/Reference<br>Validation Platform |
| CPU | Intel® Core™ i7-8700T CPU @ 2.40GHz | Intel® Core™ i7-1185G7 @ 3.00GHz |
| Hyper Threading | ON | ON |
| Turbo Setting | ON | ON |
| Memory | 4 x 16 GB DDR4 2400MHz | 2 x 8 GB DDR4 3200MHz |
| Operating System | Ubuntu 20.04.3 LTS | Ubuntu 20.04.3 LTS |
| Kernel Version | 5.4.0-42-generic | 5.8.0-050800-generic |
| BIOS Vendor | American Megatrends Inc.* | Intel Corporation |
| BIOS Version | F14c | TGLSFWI1.R00.4391.<br>A00.2109201819 |
| BIOS Release | March 23, 2021 | September 20, 2021 |
| BIOS Settings | Select optimized default settings, <br>set OS type to "other", <br>save & exit | Default Settings |
| Batch size | 1 | 1 |
| Precision | INT8 | INT8 |
| Number of concurrent inference requests | 4 | 4 |
| Test Date | March 17, 2022 | March 17, 2022 |
| Rated maximum TDP/socket in Watt | [35](https://ark.intel.com/content/www/us/en/ark/products/129948/intel-core-i7-8700t-processor-12m-cache-up-to-4-00-ghz.html#tab-blade-1-0-1) | [28](https://ark.intel.com/content/www/us/en/ark/products/208664/intel-core-i7-1185g7-processor-12m-cache-up-to-4-80-ghz-with-ipu.html) |
| CPU Price/socket on Feb 14, 2022, USD<br>Prices may vary | [303](https://ark.intel.com/content/www/us/en/ark/products/129948/intel-core-i7-8700t-processor-12m-cache-up-to-4-00-ghz.html) | [426](https://ark.intel.com/content/www/us/en/ark/products/208664/intel-core-i7-1185g7-processor-12m-cache-up-to-4-80-ghz-with-ipu.html) |
**CPU Inference Engines (continue)**
| Configuration | Intel® Core™ i3-8100 | Intel® Core™ i5-8500 | Intel® Core™ i5-10500TE |
| -------------------- |----------------------------------- | ---------------------------------- | ----------------------------------- |
| Motherboard | GIGABYTE* Z390 UD | ASUS* PRIME Z370-A | GIGABYTE* Z490 AORUS PRO AX |
| CPU | Intel® Core™ i3-8100 CPU @ 3.60GHz | Intel® Core™ i5-8500 CPU @ 3.00GHz | Intel® Core™ i5-10500TE CPU @ 2.30GHz |
| Hyper Threading | OFF | OFF | ON |
| Turbo Setting | OFF | ON | ON |
| Memory | 4 x 8 GB DDR4 2400MHz | 2 x 16 GB DDR4 2666MHz | 2 x 16 GB DDR4 @ 2666MHz |
| Operating System | Ubuntu* 20.04.3 LTS | Ubuntu* 20.04.3 LTS | Ubuntu* 20.04.3 LTS |
| Kernel Version | 5.3.0-24-generic | 5.4.0-42-generic | 5.4.0-42-generic |
| BIOS Vendor | American Megatrends Inc.* | American Megatrends Inc.* | American Megatrends Inc.* |
| BIOS Version | F8 | 3004 | F21 |
| BIOS Release | May 24, 2019 | July 12, 2021 | November 23, 2021 |
| BIOS Settings | Select optimized default settings, <br> set OS type to "other", <br>save & exit | Select optimized default settings, <br>save & exit | Select optimized default settings, <br>set OS type to "other", <br>save & exit |
| Batch size | 1 | 1 | 1 |
| Precision | INT8 | INT8 | INT8 |
| Number of concurrent inference requests | 4 | 3 | 4 |
| Test Date | March 17, 2022 | March 17, 2022 | March 17, 2022 |
| Rated maximum TDP/socket in Watt | [65](https://ark.intel.com/content/www/us/en/ark/products/126688/intel-core-i3-8100-processor-6m-cache-3-60-ghz.html#tab-blade-1-0-1)| [65](https://ark.intel.com/content/www/us/en/ark/products/129939/intel-core-i5-8500-processor-9m-cache-up-to-4-10-ghz.html#tab-blade-1-0-1)| [35](https://ark.intel.com/content/www/us/en/ark/products/203891/intel-core-i5-10500te-processor-12m-cache-up-to-3-70-ghz.html) |
| CPU Price/socket on Feb 14, 2022, USD<br>Prices may vary | [117](https://ark.intel.com/content/www/us/en/ark/products/126688/intel-core-i3-8100-processor-6m-cache-3-60-ghz.html) | [192](https://ark.intel.com/content/www/us/en/ark/products/129939/intel-core-i5-8500-processor-9m-cache-up-to-4-10-ghz.html) | [195](https://ark.intel.com/content/www/us/en/ark/products/203891/intel-core-i5-10500te-processor-12m-cache-up-to-3-70-ghz.html) |
**CPU Inference Engines (continue)**
| Configuration | Intel Atom® x5-E3940 | Intel Atom® x6425RE | Intel® Celeron® 6305E |
| -------------------- | --------------------------------------|------------------------------- |----------------------------------|
| Motherboard | Intel Corporation<br>internal/Reference<br>Validation Platform | Intel Corporation<br>internal/Reference<br>Validation Platform | Intel Corporation<br>internal/Reference<br>Validation Platform |
| CPU | Intel Atom® Processor E3940 @ 1.60GHz | Intel Atom® x6425RE<br>Processor @ 1.90GHz | Intel® Celeron®<br>6305E @ 1.80GHz |
| Hyper Threading | OFF | OFF | OFF |
| Turbo Setting | ON | ON | ON |
| Memory | 1 x 8 GB DDR3 1600MHz | 2 x 4GB DDR4 3200MHz | 2 x 8 GB DDR4 3200MHz |
| Operating System | Ubuntu* 20.04.3 LTS | Ubuntu* 20.04.3 LTS | Ubuntu 20.04.3 LTS |
| Kernel Version | 5.4.0-42-generic | 5.13.0-27-generic | 5.13.0-1008-intel |
| BIOS Vendor | American Megatrends Inc.* | Intel Corporation | Intel Corporation |
| BIOS Version | 5.12 | EHLSFWI1.R00.3273.<br>A01.2106300759 | TGLIFUI1.R00.4064.A02.2102260133 |
| BIOS Release | September 6, 2017 | June 30, 2021 | February 26, 2021 |
| BIOS Settings | Default settings | Default settings | Default settings |
| Batch size | 1 | 1 | 1 |
| Precision | INT8 | INT8 | INT8 |
| Number of concurrent inference requests | 4 | 4 | 4|
| Test Date | March 17, 2022 | March 17, 2022 | March 17, 2022 |
| Rated maximum TDP/socket in Watt | [9.5](https://ark.intel.com/content/www/us/en/ark/products/96485/intel-atom-x5-e3940-processor-2m-cache-up-to-1-80-ghz.html) | [12](https://mark.intel.com/content/www/us/en/secure/mark/products/207907/intel-atom-x6425e-processor-1-5m-cache-up-to-3-00-ghz.html#tab-blade-1-0-1) | [15](https://ark.intel.com/content/www/us/en/ark/products/208072/intel-celeron-6305e-processor-4m-cache-1-80-ghz.html)|
| CPU Price/socket on Feb 14, 2022, USD<br>Prices may vary | [34](https://ark.intel.com/content/www/us/en/ark/products/96485/intel-atom-x5-e3940-processor-2m-cache-up-to-1-80-ghz.html) | [59](https://ark.intel.com/content/www/us/en/ark/products/207899/intel-atom-x6425re-processor-1-5m-cache-1-90-ghz.html) |[107](https://ark.intel.com/content/www/us/en/ark/products/208072/intel-celeron-6305e-processor-4m-cache-1-80-ghz.html) |
**Accelerator Inference Engines**
| Configuration | Intel® Neural Compute Stick 2 | Intel® Vision Accelerator Design<br>with Intel® Movidius™ VPUs (Mustang-V100-MX8) |
| --------------------------------------- | ------------------------------------- | ------------------------------------- |
| VPU | 1 X Intel® Movidius™ Myriad™ X MA2485 | 8 X Intel® Movidius™ Myriad™ X MA2485 |
| Connection | USB 2.0/3.0 | PCIe X4 |
| Batch size | 1 | 1 |
| Precision | FP16 | FP16 |
| Number of concurrent inference requests | 4 | 32 |
| Rated maximum TDP/socket in Watt | 2.5 | [30](https://www.mouser.com/ProductDetail/IEI/MUSTANG-V100-MX8-R10?qs=u16ybLDytRaZtiUUvsd36w%3D%3D) |
| CPU Price/socket on Feb 14, 2022, USD<br>Prices may vary | [69](https://ark.intel.com/content/www/us/en/ark/products/140109/intel-neural-compute-stick-2.html) | [492](https://www.mouser.com/ProductDetail/IEI/MUSTANG-V100-MX8-R10?qs=u16ybLDytRaZtiUUvsd36w%3D%3D) |
| Host Computer | Intel® Core™ i7 | Intel® Core™ i5 |
| Motherboard | ASUS* Z370-A II | Uzelinfo* / US-E1300 |
| CPU | Intel® Core™ i7-8700 CPU @ 3.20GHz | Intel® Core™ i5-6600 CPU @ 3.30GHz |
| Hyper Threading | ON | OFF |
| Turbo Setting | ON | ON |
| Memory | 4 x 16 GB DDR4 2666MHz | 2 x 16 GB DDR4 2400MHz |
| Operating System | Ubuntu* 20.04.3 LTS | Ubuntu* 20.04.3 LTS |
| Kernel Version | 5.0.0-23-generic | 5.0.0-23-generic |
| BIOS Vendor | American Megatrends Inc.* | American Megatrends Inc.* |
| BIOS Version | 411 | 5.12 |
| BIOS Release | September 21, 2018 | September 21, 2018 |
| Test Date | March 17, 2022 | March 17, 2022 |
For more detailed configuration descriptions, see [Configuration Details](https://docs.openvino.ai/resources/benchmark_files/system_configurations_2022.1.html).
## Measurement Methodology

OpenVINO™ Model Server is measured in a multiple-client-single-server configuration using two hardware platforms connected by an Ethernet network. The network bandwidth depends on the platforms as well as models under investigation, and it is set not to be a bottleneck for workload intensity. This connection is dedicated only to the performance measurements. The benchmark setup consists of four main parts:


- **OpenVINO™ Model Server** - It is launched as a docker container on the server platform and it listens for, and answers, requests from clients. It is run on the same system as the OpenVINO™ toolkit benchmark application in corresponding benchmarking. Models served by it are placed in a local file system mounted into the docker container. The OpenVINO™ Model Server instance communicates with other components via ports over a dedicated docker network.

- **Clients** - They are run on a separate physical machine referred to as the client platform. Clients are implemented in the Python 3 programming language, based on the TensorFlow API, and run as parallel processes. Each client waits for a response from OpenVINO™ Model Server before it sends a new request. The clients are also responsible for verifying the responses.

- **Load Balancer** - It works on the client platform in a docker container, using HAProxy. It is mainly responsible for counting the requests forwarded from clients to OpenVINO™ Model Server, estimating their latency, and sharing this information via the Prometheus service. The reason for locating this part on the client side is to simulate a real-life scenario that includes the impact of a physical network on the reported metrics.

- **Execution Controller** - It is launched on the client platform. It is responsible for synchronizing the whole measurement process, downloading metrics from the Load Balancer, and presenting the final report of the execution.
## resnet-50-TF (INT8)


## Image Compression for Improved Throughput

OpenVINO™ Model Server supports compressed binary input data (images in JPEG and PNG formats) for vision processing models. This feature improves overall performance on networks where the bandwidth constitutes a system bottleneck. Some examples of such a use case are: wireless 5G communication, a typical 1 Gbit/sec Ethernet network, and a scenario of multiple client machines issuing a high rate of inference requests to a single, central OpenVINO Model Server. Generally, the performance improvement grows with increased compressibility of the data/image. Decompression on the server side is performed by the OpenCV library (see the supported image formats below).
### Supported Image Formats for OVMS Compression

- Always supported:
  - Portable image format - `*.pbm`, `*.pgm`, `*.ppm`, `*.pxm`, `*.pnm`.
  - Radiance HDR - `*.hdr`, `*.pic`.
  - Sun rasters - `*.sr`, `*.ras`.
  - Windows bitmaps - `*.bmp`, `*.dib`.

- Limited support (refer to OpenCV documentation):
  - Raster and Vector geospatial data supported by GDAL.
  - JPEG files - `*.jpeg`, `*.jpg`, `*.jpe`.
  - Portable Network Graphics - `*.png`.
  - TIFF files - `*.tiff`, `*.tif`.
  - OpenEXR Image files - `*.exr`.
  - JPEG 2000 files - `*.jp2`.
  - WebP - `*.webp`.

### googlenet-v4-tf (FP32)

# Model Accuracy for INT8 and FP32 Precision {#openvino_docs_performance_int8_vs_fp32}

The following table presents the absolute accuracy drop, calculated as the accuracy difference between the FP32 and INT8 representations of a model:
@sphinxdirective
.. raw:: html
@endsphinxdirective
The table below illustrates the speed-up factor for the performance gain achieved by switching from the FP32 representation of an OpenVINO™ supported model to its INT8 representation:
@sphinxdirective
.. raw:: html
# General Optimizations {#openvino_docs_deployment_optimization_guide_common}

This article covers application-level optimization techniques, such as asynchronous execution to improve data pipelining, pre-processing acceleration, and so on.
While the techniques (e.g. pre-processing) can be specific to end-user applications, the associated performance improvements are general and should improve any target scenario, for both latency and throughput.
@anchor inputs_pre_processing
## Inputs Pre-Processing with OpenVINO
In many cases, a network expects a pre-processed image. It is therefore advised not to perform any unnecessary steps in the code (see the sketch after this list):
- Model Optimizer can efficiently incorporate the mean and normalization (scale) values into a model (for example, into the weights of the first convolution). For more details, see the [relevant Model Optimizer command-line options](../MO_DG/prepare_model/Additional_Optimizations.md).
- Let OpenVINO accelerate other means of [Image Pre-processing and Conversion](../OV_Runtime_UG/preprocessing_overview.md).
- Data which is already in the "on-device" memory can be input directly by using the [remote tensors API of the GPU Plugin](../OV_Runtime_UG//supported_plugins/GPU_RemoteTensor_API.md).
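The snippet below is a minimal, hedged sketch of baking mean/scale and layout conversion into the model graph with `ov::preprocess::PrePostProcessor`. The model path, layouts, and the mean/scale values are placeholders used only for illustration; a single-input model is assumed.

```cpp
#include <memory>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Hypothetical model path; assumes the model has a single image input.
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");

    ov::preprocess::PrePostProcessor ppp(model);
    // Declare what the application actually provides: U8 data in NHWC layout.
    ppp.input().tensor()
        .set_element_type(ov::element::u8)
        .set_layout("NHWC");
    // Ask OpenVINO to convert the type and apply mean/scale inside the model,
    // so the application does not repeat these steps for every frame.
    ppp.input().preprocess()
        .convert_element_type(ov::element::f32)
        .mean(127.5f)   // placeholder values; use the ones your model expects
        .scale(127.5f);
    // The original model consumes NCHW; the layout conversion is baked in too.
    ppp.input().model().set_layout("NCHW");
    model = ppp.build();

    ov::CompiledModel compiled = core.compile_model(model, "CPU");
    return 0;
}
```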
@anchor async_api
## Prefer OpenVINO Async API
The API of the inference requests offers Sync and Async execution. While the `ov::InferRequest::infer()` is inherently synchronous and executes immediately (effectively serializing the execution flow in the current application thread), the Async "splits" the `infer()` into `ov::InferRequest::start_async()` and `ov::InferRequest::wait()`. For more information, see the [API examples](../OV_Runtime_UG/ov_infer_request.md).
A typical use case for the `ov::InferRequest::infer()` is running a dedicated application thread per source of inputs (e.g. a camera), so that every step (frame capture, processing, parsing the results, and associated logic) is kept serial within the thread.
In contrast, the `ov::InferRequest::start_async()` and `ov::InferRequest::wait()` allow the application to continue its activities and poll or wait for the inference completion when really needed. Therefore, one reason for using asynchronous code is "efficiency".
> **NOTE**: Although the Synchronous API can be somewhat easier to start with, prefer to use the Asynchronous (callbacks-based, below) API in production code. It is the most general and scalable way to implement the flow control for any possible number of requests (and hence both latency and throughput scenarios).
The key advantage of the Async approach is that when a device is busy with the inference, the application can do other things in parallel (e.g. populating inputs or scheduling other requests) rather than wait for the current inference to complete first.
In the example below, inference is applied to the results of the video decoding. It is possible to keep two parallel infer requests, and while the current one is processed, the input frame for the next one is being captured. This essentially hides the latency of capturing, so that the overall frame rate is determined only by the slowest part of the pipeline (decoding vs inference) and not by the sum of the stages.
Below are example code snippets for the regular and the async-based approaches, for comparison:
- Normally, the frame is captured with OpenCV and then immediately processed:<br>
@snippet snippets/dldt_optimization_guide8.cpp part8
> **NOTE**: Using the Asynchronous API is a must for [throughput-oriented scenarios](./dldt_deployment_optimization_tput.md).
### Notes on Callbacks
Keep in mind that the `ov::InferRequest::wait()` of the Async API waits for the specific request only. However, running multiple inference requests in parallel provides no guarantees on the completion order. This may complicate a possible logic based on the `ov::InferRequest::wait`. The most scalable approach is using callbacks (set via the `ov::InferRequest::set_callback`) that are executed upon completion of the request. The callback functions will be used by OpenVINO Runtime to notify you of the results (or errors).
This is a more event-driven approach.
A few important points on the callbacks (see the sketch after this list):
- It is the job of the application to ensure that any callback function is thread-safe.
- Although executed asynchronously by dedicated threads, the callbacks should NOT include heavy operations (e.g. I/O) and/or blocking calls. Work done by any callback should be kept to a minimum.
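The following is a minimal sketch of the callback-based flow. The model path and device are hypothetical placeholders, and the callback body is intentionally trivial; it illustrates the idea rather than a complete application.

```cpp
#include <exception>
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Hypothetical model path and device, for illustration only.
    ov::CompiledModel compiled = core.compile_model("model.xml", "CPU");
    ov::InferRequest request = compiled.create_infer_request();

    // The callback is invoked by the runtime when this request completes.
    // Keep it lightweight and thread-safe; hand heavy work to another thread.
    request.set_callback([&request](std::exception_ptr ex) {
        if (ex) {
            try { std::rethrow_exception(ex); }
            catch (const std::exception& e) { std::cerr << e.what() << '\n'; }
            return;
        }
        // Read the results here, then optionally schedule the next inference:
        // request.start_async();
    });

    // ... populate inputs ...
    request.start_async();
    request.wait();  // or continue with other work and synchronize later
    return 0;
}
```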
@anchor tensor_idiom
## The "get_tensor" Idiom
Each device within OpenVINO may have different internal requirements on the memory padding, alignment, etc., for intermediate tensors. The **input/output tensors** are also accessible by the application code.
As every `ov::InferRequest` is created by a particular instance of `ov::CompiledModel` (which is already device-specific), the requirements are respected and the input/output tensors of the requests are still device-friendly.
To sum it up:
* The `get_tensor` (which offers the `data()` method to get a system-memory pointer to the content of a tensor) is the recommended way to populate the inference inputs (and read back the outputs) **from/to the host memory**:
   * For example, for the GPU device, the **input/output tensors** are mapped to the host (which is fast) only when the `get_tensor` is used, while for the `set_tensor` a copy into the internal GPU structures may happen.
* In contrast, when the input tensors are already in the **on-device memory** (e.g. as a result of the video-decoding), prefer the `set_tensor` as a zero-copy way to proceed. For more details, see the [GPU device Remote tensors API](../OV_Runtime_UG//supported_plugins/GPU_RemoteTensor_API.md).

Consider the [API examples](@ref in_out_tensors) for the `get_tensor` and `set_tensor`, as well as the sketch below.
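Below is a minimal, hedged sketch of the `get_tensor` idiom for host-memory inputs and outputs. The model path is a placeholder, and a single f32 input and output are assumed for simplicity.

```cpp
#include <algorithm>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Hypothetical model path; assumes a single f32 input and a single f32 output.
    ov::CompiledModel compiled = core.compile_model("model.xml", "CPU");
    ov::InferRequest request = compiled.create_infer_request();

    // "get" the device-friendly input tensor and fill it in the host memory.
    ov::Tensor input = request.get_input_tensor();
    float* in_data = input.data<float>();
    std::fill_n(in_data, input.get_size(), 0.5f);  // placeholder input values

    request.infer();

    // "get" the output tensor and read the results back from the host memory.
    ov::Tensor output = request.get_output_tensor();
    const float* out_data = output.data<float>();
    // ... consume out_data[0 .. output.get_size() - 1] ...
    (void)out_data;
    return 0;
}
```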
Runtime optimizations, or deployment optimizations, focus on tuning inference parameters and execution means (e.g., the optimum number of requests executed simultaneously). Unlike model-level optimizations, they are highly specific to the hardware and the case they are used for, and often come at a cost.
`ov::hint::inference_precision` is a "typical runtime configuration" which trades accuracy for performance, allowing `fp16/bf16` execution for the layers that remain in `fp32` after quantization of the original `fp32` model.

Therefore, optimization should start with defining the use case. For example, if it is about processing millions of samples by overnight jobs in data centers, throughput could be prioritized over latency. On the other hand, real-time usages would likely trade off throughput to deliver the results at minimal latency. A combined scenario is also possible, targeting the highest possible throughput while maintaining a specific latency threshold.

It is also important to understand how the full-stack application would use the inference component "end-to-end": for example, what stages need to be orchestrated to save the workload devoted to fetching and preparing input data.

For more information on this topic, see the following articles:
* [feature support by device](@ref features_support_matrix),
* [Inputs Pre-processing with OpenVINO](@ref inputs_pre_processing),
* [Async API](@ref async_api),
* [The 'get_tensor' Idiom](@ref tensor_idiom),
* For variably-sized inputs, consider [dynamic shapes](../OV_Runtime_UG/ov_dynamic_shapes.md).

See the [latency](./dldt_deployment_optimization_latency.md) and [throughput](./dldt_deployment_optimization_tput.md) optimization guides for **use-case-specific optimizations**.

## Writing Performance-Portable Inference Applications
Although inference performed in OpenVINO Runtime can be configured with a multitude of low-level performance settings, doing so is not recommended in most cases. Firstly, achieving the best performance with such adjustments requires a deep understanding of the device architecture and the inference engine.

Secondly, such optimization may not translate well to other device-model combinations. In other words, one set of execution parameters is likely to result in different performance when used under different conditions. For example:
* Both the CPU and GPU support the notion of [streams](./dldt_deployment_optimization_tput_advanced.md), yet they deduce their optimal number very differently.
* Even among devices of the same type, different execution configurations can be considered optimal, as in the case of instruction sets or the number of cores for the CPU and the batch size for the GPU.
* Different models have different optimal parameter configurations, considering factors such as compute vs memory-bandwidth, inference precision, and possible model quantization.
* Execution "scheduling" impacts performance strongly and is highly device-specific; for example, GPU-oriented optimizations like batching, which combines multiple inputs to achieve the optimal throughput, [do not always map well to the CPU](dldt_deployment_optimization_internals.md).

To make the configuration process much easier and its performance optimization more portable, the option of [Performance Hints](../OV_Runtime_UG/performance_hints.md) has been introduced. It comprises two high-level "presets" focused on either **latency** or **throughput** and, essentially, makes execution specifics irrelevant.

The Performance Hints functionality makes configuration transparent to the application; for example, it eliminates the need for explicit (application-side) batching or streams and facilitates parallel processing of separate infer requests for different input sources.
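Below is a minimal sketch of applying the hints when compiling a model. The model path and device names are placeholders, and the choice of preset still depends on the target use case.

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // Hypothetical model path, for illustration only.
    auto model = core.read_model("model.xml");

    // Latency-oriented preset: the device configures itself for the
    // lowest time-to-result of a single request.
    ov::CompiledModel latency_cm = core.compile_model(
        model, "CPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));

    // Throughput-oriented preset: the device selects streams, batching, etc.
    // on its own; the application only has to run enough inference requests
    // in parallel (e.g. via the Async API). "GPU" assumes such a device exists.
    ov::CompiledModel tput_cm = core.compile_model(
        model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    return 0;
}
```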
## Additional Resources
* [Using Async API and running multiple inference requests in parallel to leverage throughput](@ref throughput_app_design).
* [The throughput approach implementation details for specific devices](dldt_deployment_optimization_internals.md)
* [Details on throughput](dldt_deployment_optimization_tput.md)
* [Details on latency](dldt_deployment_optimization_latency.md)
* [API examples and details](../OV_Runtime_UG/performance_hints.md).
## Throughput on the CPU: Internals
As explained in the [throughput-related section](./dldt_deployment_optimization_tput.md), OpenVINO streams are a means of running multiple requests in parallel.
In order to best serve multiple inference requests executed simultaneously, the inference threads are grouped/pinned to the particular CPU cores, constituting the "CPU" streams.
This provides much better performance for the networks than batching, especially for multiple-core systems:


Compared to the batching, the parallelism is somewhat transposed (i.e., performed over inputs with much less synchronization within CNN ops):


Keep in mind that [high-level performance hints](../OV_Runtime_UG/performance_hints.md) allow the implementation to select the optimal number of streams depending on the model's compute demands and CPU capabilities, including [int8 inference](@ref openvino_docs_model_optimization_guide) hardware acceleration, number of cores, etc.
## Automatic Batching Internals
[Automatic batching](../OV_Runtime_UG/automatic_batching.md) performs on-the-fly grouping of inference requests to improve device utilization.
It relaxes the requirement for an application to saturate devices such as GPU by explicitly using a large batch. It performs transparent input gathering from
individual inference requests followed by the actual batched execution, with no programming effort from the user:


Essentially, Automatic Batching shifts asynchronicity from individual requests to groups of requests that constitute the batches. Thus, for the execution to be efficient, it is very important that the requests arrive timely, without causing a batching timeout.
Normally, the timeout should never be hit. It is rather a graceful way to handle the application exit (when the inputs are not arriving anymore, so it is impossible to collect a full batch).
|
||||
|
||||
So if your workload experiences the timeouts (resulting in the performance drop, as the timeout value adds itself to the latency of every request), consider balancing the timeout value vs the batch size. For example in many cases having smaller timeout value and batch size may yield better performance than large batch size, but coupled with the timeout value that cannot guarantee accommodating the full number of the required requests.
|
||||
If a workload experiences timeouts, which add to the latency of every request and thus reduce performance, consider balancing the timeout value against the batch size. For example, a smaller batch size and timeout value may yield better results than a large batch size coupled with a timeout value that cannot guarantee accommodating all the required requests.
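For illustration only, here is a possible way to pin both values explicitly. This is a sketch under the assumption that the virtual `BATCH` device accepts the batch size in its name and the `AUTO_BATCH_TIMEOUT` property (in milliseconds), as covered in the Automatic Batching documentation:

```
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # placeholder path

# Assumed syntax: batch size of 8 on the GPU with a 50 ms collection timeout,
# trading a smaller batch for a bounded impact on per-request latency.
compiled_model = core.compile_model(model, "BATCH:GPU(8)", {"AUTO_BATCH_TIMEOUT": "50"})
```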
|
||||
|
||||
Finally, following the "get_tensor idiom" section from the [general optimizations](./dldt_deployment_optimization_common.md) helps the Automatic Batching to save on inputs/outputs copies. Thus, in your application always prefer the "get" versions of the tensors' data access APIs.
|
||||
Finally, following the `get_tensor` idiom section from the [general optimizations](./dldt_deployment_optimization_common.md) helps Automatic Batching to save on input/output copies. Therefore, always prefer the "get" versions of the tensor data access APIs in your applications.
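A minimal Python sketch of the "get" idiom (the model path, input type, and shape are illustrative):

```
import numpy as np
from openvino.runtime import Core

core = Core()
compiled_model = core.compile_model(core.read_model("model.xml"), "GPU")  # placeholder path
request = compiled_model.create_infer_request()

# Write the input directly into the tensor owned by the request, instead of setting an external one
input_tensor = request.get_input_tensor()
input_tensor.data[:] = np.random.rand(*input_tensor.shape).astype(input_tensor.data.dtype)

request.infer()

# Read the output in place as well
result = request.get_output_tensor().data
```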
|
||||
|
@ -10,27 +10,28 @@
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
## Latency Specifics
|
||||
A significant fraction of applications focused on the situations where typically a single model is loaded (and single input is used) at a time.
|
||||
This is a regular "consumer" use case.
|
||||
While an application can create more than one request if needed (for example to support [asynchronous inputs population](./dldt_deployment_optimization_common.md)), the inference performance depends on **how many requests are being inferenced in parallel** on a device.
|
||||
A significant portion of deep learning use cases involves applications loading a single model and using a single input at a time, which is the typical "consumer" scenario.
|
||||
While an application can create more than one request if needed, for example, to support [asynchronous input population](@ref async_api), its **inference performance depends on how many requests are being inferenced in parallel** on a device.
|
||||
|
||||
Similarly, when multiple models are served on the same device, it is important whether the models are executed simultaneously, or in chain (for example in the inference pipeline).
|
||||
As expected, the easiest way to achieve the lowest latency is **running only one concurrent inference at a moment** on the device. Accordingly, any additional concurrency usually results in the latency growing fast.
|
||||
Similarly, when multiple models are served on the same device, it is important whether the models are executed simultaneously or in a chain, for example, in the inference pipeline.
|
||||
As expected, the easiest way to achieve **low latency is by running only one inference at a time** on one device. Accordingly, any additional concurrency usually results in latency rising fast.
|
||||
|
||||
However, some conventional "root" devices (e.g. CPU or GPU) can be in fact internally composed of several "sub-devices". In many cases letting the OpenVINO to transparently leverage the "sub-devices" helps to improve the application throughput (e.g. serve multiple clients simultaneously) without degrading the latency. For example, multi-socket CPUs can deliver as high number of requests (at the same minimal latency) as there are NUMA nodes in the machine. Similarly, a multi-tile GPU (which is essentially multiple GPUs in a single package), can deliver a multi-tile scalability with the number of inference requests, while preserving the single-tile latency.
|
||||
However, some conventional "root" devices (e.g., CPU or GPU) can in fact be internally composed of several "sub-devices". In many cases, letting OpenVINO leverage the "sub-devices" transparently helps to improve application throughput (e.g., serve multiple clients simultaneously) without degrading latency. For example, multi-socket CPUs can deliver as many requests at the same minimal latency as there are NUMA nodes in the system. Similarly, a multi-tile GPU, which is essentially multiple GPUs in a single package, can deliver multi-tile scalability with the number of inference requests, while preserving the single-tile latency.
|
||||
|
||||
Thus, human expertise is required to get more _throughput_ out of the device even in the inherently latency-oriented cases. OpenVINO can take this configuration burden via [high-level performance hints](../OV_Runtime_UG/performance_hints.md), via `ov::hint::PerformanceMode::LATENCY` specified for the `ov::hint::performance_mode` property for the compile_model.
|
||||
Typically, human expertise is required to get more "throughput" out of the device, even in inherently latency-oriented cases. OpenVINO can take over this configuration burden via [high-level performance hints](../OV_Runtime_UG/performance_hints.md), with `ov::hint::PerformanceMode::LATENCY` specified for the `ov::hint::performance_mode` property of `compile_model`.
|
||||
|
||||
> **NOTE**: [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) are the recommended way of configuring performance, as they are both device-agnostic and future-proof.
|
||||
|
||||
In the case when there are multiple models to be used simultaneously, consider using different devices for inferencing the different models. Finally, when multiple models are executed in parallel on the device, using additional `ov::hint::model_priority` may help to define relative priorities of the models (please refer to the documentation on the [matrix features support for OpenVINO devices](@ref features_support_matrix) to check for the support of the feature by the specific device).
|
||||
When multiple models are to be used simultaneously, consider running inference on separate devices for each of them. Finally, when multiple models are executed in parallel on a device, using additional `ov::hint::model_priority` may help to define relative priorities of the models. Refer to the documentation on the [matrix features support for OpenVINO devices](@ref features_support_matrix) to check if your device supports the feature.
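As an illustration, a Python sketch of compiling two models with the latency hint and different relative priorities. The model paths are placeholders, and the `"MODEL_PRIORITY"` string values are assumptions mirroring the C++ `ov::hint::Priority` enum; verify them against the property reference of your release:

```
from openvino.runtime import Core

core = Core()
detector = core.read_model("detector.xml")      # placeholder models
classifier = core.read_model("classifier.xml")

# Latency-oriented hint, with the detector given a higher relative priority
compiled_det = core.compile_model(
    detector, "GPU", {"PERFORMANCE_HINT": "LATENCY", "MODEL_PRIORITY": "HIGH"})
compiled_cls = core.compile_model(
    classifier, "GPU", {"PERFORMANCE_HINT": "LATENCY", "MODEL_PRIORITY": "LOW"})
```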
|
||||
|
||||
## First-Inference Latency and Model Load/Compile Time
|
||||
There are cases when model loading/compilation are heavily contributing to the end-to-end latencies.
|
||||
For example when the model is used exactly once, or when due to on-device memory limitations the model is unloaded (to free the memory for another inference) and reloaded at some cadence.
|
||||
**First-Inference Latency and Model Load/Compile Time**
|
||||
|
||||
Such a "first-inference latency" scenario however may pose an additional limitation on the model load\compilation time, as inference accelerators (other than the CPU) usually require certain level of model compilation upon loading.
|
||||
The [model caching](../OV_Runtime_UG/Model_caching_overview.md) is a way to amortize the loading/compilation time over multiple application runs. If the model caching is not possible (as e.g. it requires write permissions for the applications), the CPU device is almost exclusively offers the fastest model load time. Also, consider using the [AUTO device](../OV_Runtime_UG/auto_device_selection.md). It allows to transparently use the CPU for inference, while the actual accelerator loads the model (upon that, the inference hot-swapping also happens automatically).
|
||||
In some cases, model loading and compilation contribute to the "end-to-end" latency more than usual.
|
||||
For example, when the model is used exactly once, or when it is unloaded and reloaded in a cycle, to free the memory for another inference due to on-device memory limitations.
|
||||
|
||||
Finally, notice that any [throughput-oriented options](./dldt_deployment_optimization_tput.md) may increase the model up time significantly.
|
||||
Such a "first-inference latency" scenario may pose an additional limitation on the model load\compilation time, as inference accelerators (other than the CPU) usually require a certain level of model compilation upon loading.
|
||||
The [model caching](../OV_Runtime_UG/Model_caching_overview.md) option is a way to lessen this impact over multiple application runs. If model caching is not possible (for example, because it requires write permissions for the application), the CPU almost always offers the fastest model load time.
|
||||
|
||||
Another way of dealing with first-inference latency is using the [AUTO device selection inference mode](../OV_Runtime_UG/auto_device_selection.md). It starts inference on the CPU, while waiting for the actual accelerator to load the model. At that point, it shifts to the new device seamlessly.
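A minimal Python sketch combining the two techniques (the cache directory and model path are placeholders; caching assumes the application has write permissions for that directory):

```
from openvino.runtime import Core

core = Core()
# Reuse compiled blobs between application runs
core.set_property({"CACHE_DIR": "model_cache"})

model = core.read_model("model.xml")  # placeholder path
# AUTO serves the first inferences on the CPU while the accelerator is still compiling the model,
# then switches to it transparently
compiled_model = core.compile_model(model, "AUTO:GPU,CPU")
```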
|
||||
|
||||
Finally, note that any [throughput-oriented options](./dldt_deployment_optimization_tput.md) may significantly increase the time needed to get the model up and running (i.e., its load and compilation time).
|
||||
|
@ -1,27 +1,24 @@
|
||||
# Optimizing for Throughput {#openvino_docs_deployment_optimization_guide_tput}
|
||||
|
||||
## General Throughput Considerations
|
||||
As described in the section on the [latency-specific considerations](./dldt_deployment_optimization_latency.md) one possible use-case is _delivering every single request at the minimal delay_.
|
||||
Throughput on the other hand, is about inference scenarios in which potentially large **number of inference requests are served simultaneously to improve the device utilization**.
|
||||
As described in the section on the [latency-specific considerations](./dldt_deployment_optimization_latency.md), one of the possible use cases is *delivering every single request at the minimal delay*.
|
||||
Throughput, on the other hand, is about inference scenarios in which a potentially **large number of inference requests is served simultaneously to improve device utilization**.
|
||||
|
||||
The associated increase in latency is not linearly dependent on the number of requests executed in parallel.
|
||||
Here, a trade-off between overall throughput and serial performance of individual requests can be achieved with the right OpenVINO performance configuration.
|
||||
A trade-off between overall throughput and serial performance of individual requests can be achieved with the right performance configuration of OpenVINO.
|
||||
|
||||
## Basic and Advanced Ways of Leveraging Throughput
|
||||
With the OpenVINO there are two means of leveraging the throughput with the individual device:
|
||||
* **Basic (high-level)** flow with [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) which is inherently **portable and future-proof**
|
||||
* **Advanced (low-level)** approach of explicit **batching** and **streams**, explained in the separate [document](dldt_deployment_optimization_tput_advanced.md).
|
||||
There are two ways of leveraging throughput with individual devices:
|
||||
* **Basic (high-level)** flow with [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) which is inherently **portable and future-proof**.
|
||||
* **Advanced (low-level)** approach of explicit **batching** and **streams**. For more details, see the [runtime inference optimizations](dldt_deployment_optimization_tput_advanced.md).
|
||||
|
||||
In both cases application should be designed to execute multiple inference requests in parallel as detailed in the [next section](@ref throughput_app_design).
|
||||
|
||||
Finally, consider the _automatic_ multi-device execution covered below.
|
||||
In both cases, the application should be designed to execute multiple inference requests in parallel, as described in the following section.
|
||||
|
||||
@anchor throughput_app_design
|
||||
## Throughput-Oriented Application Design
|
||||
Most generally, throughput-oriented inference applications should:
|
||||
* Expose substantial amounts of _inputs_ parallelism (e.g. process multiple video- or audio- sources, text documents, etc)
|
||||
* Decompose the data flow into a collection of concurrent inference requests that are aggressively scheduled to be executed in parallel
|
||||
* Setup the configuration for the _device_ (e.g. as parameters of the `ov::Core::compile_model`) via either [low-level explicit options](dldt_deployment_optimization_tput_advanced.md), introduced in the previous section or [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) (**preferable**):
|
||||
In general, most throughput-oriented inference applications should:
|
||||
* Expose substantial amounts of *input* parallelism (e.g., process multiple video or audio sources, text documents, etc.).
|
||||
* Decompose the data flow into a collection of concurrent inference requests that are aggressively scheduled to be executed in parallel.
|
||||
* Setup the configuration for the *device* (for example, as parameters of the `ov::Core::compile_model`) via either previously introduced [low-level explicit options](dldt_deployment_optimization_tput_advanced.md) or [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) (**preferable**):
|
||||
@sphinxdirective
|
||||
|
||||
.. tab:: C++
|
||||
@ -37,14 +34,15 @@ Most generally, throughput-oriented inference applications should:
|
||||
:fragment: [compile_model]
|
||||
|
||||
@endsphinxdirective
|
||||
* Query the `ov::optimal_number_of_infer_requests` from the `ov::CompiledModel` (resulted from compilation of the model for a device) to create the number of the requests required to saturate the device
|
||||
* Use the Async API with callbacks, to avoid any dependency on the requests' completion order and possible device starvation, as explained in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common)
|
||||
* Query the `ov::optimal_number_of_infer_requests` from the `ov::CompiledModel` (resulting from compiling the model for the device) to create the number of requests required to saturate the device.
|
||||
* Use the Async API with callbacks, to avoid any dependency on the completion order of the requests and possible device starvation, as explained in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common). A minimal end-to-end sketch follows this list.
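The following Python sketch ties these steps together (the model path, input shape, and the number of frames are placeholders):

```
import numpy as np
from openvino.runtime import Core, AsyncInferQueue

core = Core()
model = core.read_model("model.xml")  # placeholder path
compiled_model = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})

# Create just enough requests to saturate the device
nireq = compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
infer_queue = AsyncInferQueue(compiled_model, nireq)

def on_done(request, frame_id):
    # Process results independently of the completion order
    _ = request.get_output_tensor().data

infer_queue.set_callback(on_done)

frames = np.random.rand(32, 1, 3, 224, 224).astype(np.float32)  # placeholder inputs
for frame_id, frame in enumerate(frames):
    infer_queue.start_async({0: frame}, frame_id)
infer_queue.wait_all()
```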
|
||||
|
||||
## Multi-Device Execution
|
||||
OpenVINO offers automatic, [scalable multi-device inference](../OV_Runtime_UG/multi_device.md). This is simple _application-transparent_ way to improve the throughput. No need to re-architecture existing applications for any explicit multi-device support: no explicit network loading to each device, no separate per-device queues, no additional logic to balance the inference requests between devices, etc. From the application point of view, it is communicating to the single device that internally handles the actual machinery.
|
||||
Just like with other throughput-oriented scenarios, there are two major pre-requisites for optimal multi-device performance:
|
||||
* Using the [Asynchronous API](@ref openvino_docs_deployment_optimization_guide_common) and [callbacks](../OV_Runtime_UG/ov_infer_request.md) in particular
|
||||
* Providing the multi-device (and hence the underlying devices) with enough data to crunch. As the inference requests are naturally independent data pieces, the multi-device performs load-balancing at the “requests” (outermost) level to minimize the scheduling overhead.
|
||||
OpenVINO offers the automatic, scalable [multi-device inference mode](../OV_Runtime_UG/multi_device.md), which is a simple *application-transparent* way to improve throughput. There is no need to re-architecture existing applications for any explicit multi-device support: no explicit network loading to each device, no separate per-device queues, no additional logic to balance inference requests between devices, etc. For the application using it, multi-device is like any other device, as it manages all processes internally.
|
||||
Just like with other throughput-oriented scenarios, there are several major pre-requisites for optimal multi-device performance:
|
||||
* Using the [Asynchronous API](@ref async_api) and [callbacks](../OV_Runtime_UG/ov_infer_request.md) in particular.
|
||||
* Providing the multi-device (and hence the underlying devices) with enough data to crunch. As the inference requests are naturally independent data pieces, the multi-device performs load-balancing at the "requests" (outermost) level to minimize the scheduling overhead.
|
||||
|
||||
Notice that the resulting performance is usually a fraction of the “ideal” (plain sum) value, when the devices compete for a certain resources, like the memory-bandwidth which is shared between CPU and iGPU.
|
||||
> **NOTE**: While the legacy approach of optimizing the parameters of each device separately works, the [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) allow to configure all devices (that are part of the specific multi-device configuration) at once.
|
||||
Keep in mind that the resulting performance is usually a fraction of the "ideal" (plain sum) value when the devices compete for certain resources, such as the memory bandwidth shared between the CPU and the iGPU.
|
||||
|
||||
> **NOTE**: While the legacy approach of optimizing the parameters of each device separately works, the [OpenVINO performance hints](../OV_Runtime_UG/performance_hints.md) allow configuring all devices (that are part of the specific multi-device configuration) at once.
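For illustration, a Python sketch of the application-transparent setup (the device list and model path are placeholders; the hint is applied to all underlying devices at once, as noted above):

```
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # placeholder path

# One logical device; MULTI load-balances whole requests between GPU and CPU internally
compiled_model = core.compile_model(model, "MULTI:GPU,CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})
print(compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS"))
```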
|
||||
|
@ -1,77 +1,79 @@
|
||||
# Using Advanced Throughput Options: Streams and Batching {#openvino_docs_deployment_optimization_guide_tput_advanced}
|
||||
|
||||
## OpenVINO Streams
|
||||
As detailed in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common) running multiple inference requests asynchronously is important for general application efficiency.
|
||||
Internally, every device implements a queue. The queue acts as a buffer, storing the inference requests until retrieved by the device at its own pace.
|
||||
As explained in the [common-optimizations section](@ref openvino_docs_deployment_optimization_guide_common), running multiple inference requests asynchronously is important for general application efficiency.
|
||||
Internally, every device implements a queue, which acts as a buffer, storing the inference requests until retrieved by the device at its own pace.
|
||||
The devices may actually process multiple inference requests in parallel in order to improve the device utilization and overall throughput.
|
||||
This configurable mean of this device-side parallelism is commonly referred as **streams**.
|
||||
This configurable method of device-side parallelism is commonly referred to as **streams**.
|
||||
|
||||
> **NOTE**: Notice that streams are **really executing the requests in parallel, but not in the lock step** (as e.g. the batching does), which makes the streams fully compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) when individual requests can have different shapes.
|
||||
> **NOTE**: Be aware that streams **really execute the requests in parallel, but not in lockstep** (as batching does), which makes streams fully compatible with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md), as individual requests can have different shapes.
|
||||
|
||||
> **NOTE**: Most OpenVINO devices (including CPU, GPU and VPU) support the streams, yet the _optimal_ number of the streams is deduced very differently, please see the a dedicated section below.
|
||||
> **NOTE**: Most OpenVINO devices (including the CPU, GPU, and VPU) support streams, yet the *optimal* number of streams is deduced very differently for each of them. More information on this topic can be found in the section [below](@ref stream_considerations).
|
||||
|
||||
Few general considerations:
|
||||
* Using the streams does increase the latency of an individual request
|
||||
* When no number of streams is not specified, a device creates a bare minimum of streams (usually just one), as the latency-oriented case is default
|
||||
* Please find further tips for the optimal number of the streams [below](@ref throughput_advanced)
|
||||
* Streams are memory-hungry, as every stream duplicates the intermediate buffers to do inference in parallel to the rest of streams
|
||||
* Always prefer streams over creating multiple `ov:Compiled_Model` instances for the same model, as weights memory is shared across streams, reducing the memory consumption
|
||||
* Notice that the streams also inflate the model load (compilation) time.
|
||||
A few general considerations:
|
||||
* Using the streams does increase the latency of an individual request:
|
||||
* When the number of streams is not specified, a device creates a bare minimum of streams (usually just one), as the latency-oriented case is the default.
|
||||
* See further tips for the optimal number of the streams [below](@ref throughput_advanced).
|
||||
* Streams are memory-intensive, as every stream duplicates the intermediate buffers to do inference in parallel with the rest of the streams:
|
||||
* Always prefer streams over creating multiple `ov:Compiled_Model` instances for the same model, as weights memory is shared across streams, reducing the memory consumption.
|
||||
* Keep in mind that the streams also inflate the model load (compilation) time.
|
||||
|
||||
For efficient asynchronous execution, the streams are actually handling the inference with a special pool of threads (a thread per stream).
|
||||
Each time you start inference requests (potentially from different application threads), they are actually muxed into a inference queue of the particular `ov:Compiled_Model`.
|
||||
If there is a vacant stream, it pops the request from the queue and actually expedites that to the on-device execution.
|
||||
There are further device-specific details e.g. for the CPU, that you may find in the [internals](dldt_deployment_optimization_internals.md) section.
|
||||
Each time you start inference requests (potentially from different application threads), they are actually muxed into an inference queue of the particular `ov:Compiled_Model`.
|
||||
If there is a vacant stream, it pulls the request from the queue and actually expedites that to the on-device execution.
|
||||
Further device-specific details, for example for the CPU, can be found in the [internals](dldt_deployment_optimization_internals.md) section.
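For illustration, a Python sketch of the explicit, low-level configuration (the model path is a placeholder; the `"NUM_STREAMS"` string mirrors the C++ `ov::num_streams` property, and a special `ov::streams::AUTO` value exists on the C++ side for a portable choice):

```
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # placeholder path

# Request four CPU streams explicitly; every stream keeps its own set of intermediate buffers
compiled_model = core.compile_model(model, "CPU", {"NUM_STREAMS": "4"})
```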
|
||||
|
||||
## Batching
|
||||
Hardware accelerators like GPUs are optimized for massive compute parallelism, so the batching helps to saturate the device and leads to higher throughput.
|
||||
While the streams (described earlier) already help to hide the communication overheads and certain bubbles in the scheduling, running multiple OpenCL kernels simultaneously is less GPU-efficient, compared to calling a kernel on the multiple inputs at once.
|
||||
As explained in the next section, the batching is a must to leverage maximum throughput on the GPUs.
|
||||
Hardware accelerators such as GPUs are optimized for massive compute parallelism, so batching helps to saturate the device and leads to higher throughput.
|
||||
While the streams (described in the previous section) already help to hide the communication overheads and certain bubbles in the scheduling, running multiple OpenCL kernels simultaneously is less GPU-efficient compared to calling a kernel on multiple inputs at once.
|
||||
As explained in the next section, the batching is a must to leverage maximum throughput on the GPU.
|
||||
|
||||
There are two primary ways of using the batching to help application performance:
|
||||
* Collecting the inputs explicitly on the application side and then _sending these batched requests to the OpenVINO_
|
||||
* Although this gives flexibility with the possible batching strategies, the approach requires redesigning the application logic
|
||||
* _Sending individual requests_, while configuring the OpenVINO to collect and perform inference on the requests in batch [automatically](../OV_Runtime_UG/automatic_batching.md).
|
||||
In both cases, optimal batch size is very device-specific. Also as explained below, the optimal batch size depends on the model, inference precision and other factors.
|
||||
There are two primary ways of using batching to improve application performance:
|
||||
* Collecting the inputs explicitly on the application side and then **sending the batch requests to OpenVINO**:
|
||||
* Although this gives flexibility with the possible batching strategies, the approach requires redesigning the application logic.
|
||||
* **Sending individual requests**, while configuring OpenVINO to collect and perform inference on the requests in batch [automatically](../OV_Runtime_UG/automatic_batching.md).
|
||||
|
||||
In both cases, the optimal batch size is very device-specific. As explained below, the optimal batch size also depends on the model, inference precision and other factors.
|
||||
|
||||
@anchor throughput_advanced
|
||||
## Choosing the Number of Streams and/or Batch Size
|
||||
Predicting the inference performance is difficult and finding optimal execution parameters requires direct experiments with measurements.
|
||||
Run performance testing in the scope of development, and make sure to validate overall (end-to-end) application performance.
|
||||
Run performance testing in the scope of development, and make sure to validate overall (*end-to-end*) application performance.
|
||||
|
||||
Different devices behave differently with the batch sizes. The optimal batch size depends on the model, inference precision and other factors.
|
||||
Similarly, different devices require different number of execution streams to saturate.
|
||||
Finally, in some cases combination of streams and batching may be required to maximize the throughput.
|
||||
Similarly, different devices require a different number of execution streams to saturate.
|
||||
In some cases, combination of streams and batching may be required to maximize the throughput.
|
||||
|
||||
One possible throughput optimization strategy is to **set an upper bound for latency and then increase the batch size and/or number of the streams until that tail latency is met (or the throughput is not growing anymore)**.
|
||||
Also, consider [OpenVINO Deep Learning Workbench](@ref workbench_docs_Workbench_DG_Introduction) that builds handy latency vs throughput charts, iterating over possible values of the batch size and number of streams.
|
||||
Consider [OpenVINO Deep Learning Workbench](@ref workbench_docs_Workbench_DG_Introduction) that builds handy latency vs throughput charts, iterating over possible values of the batch size and number of streams.
|
||||
|
||||
> **NOTE**: When playing with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md) use only the streams (no batching), as they tolerate individual requests having different shapes.
|
||||
> **NOTE**: When playing with [dynamically-shaped inputs](../OV_Runtime_UG/ov_dynamic_shapes.md), use only the streams (no batching), as they tolerate individual requests having different shapes.
|
||||
|
||||
> **NOTE**: Using the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) is the alternative, portable and future-proof option, allowing the OpenVINO to find best combination of streams and batching for a given scenario and model.
|
||||
> **NOTE**: Using the [High-Level Performance Hints](../OV_Runtime_UG/performance_hints.md) is the alternative, portable and future-proof option, allowing OpenVINO to find the best combination of streams and batching for a given scenario and a model.
|
||||
|
||||
@anchor stream_considerations
|
||||
### Number of Streams Considerations
|
||||
* Select the number of streams is it is **less or equal** to the number of requests that your application would be able to runs simultaneously
|
||||
* To avoid wasting resources, the number of streams should be enough to meet the _average_ parallel slack rather than the peak load
|
||||
* As a more portable option (that also respects the underlying hardware configuration) use the `ov::streams::AUTO`
|
||||
* It is very important to keep these streams busy, by running as many inference requests as possible (e.g. start the newly-arrived inputs immediately)
|
||||
* Bare minimum of requests to saturate the device can be queried as `ov::optimal_number_of_infer_requests` of the `ov:Compiled_Model`
|
||||
* _Maximum number of streams_ for the device (per model) can be queried as the `ov::range_for_streams`
|
||||
* Select the number of streams that is **less than or equal** to the number of requests that the application would be able to run simultaneously.
|
||||
* To avoid wasting resources, the number of streams should be enough to meet the *average* parallel slack rather than the peak load.
|
||||
* Use the `ov::streams::AUTO` as a more portable option (that also respects the underlying hardware configuration).
|
||||
* It is very important to keep these streams busy, by running as many inference requests as possible (for example, start the newly-arrived inputs immediately):
|
||||
* A bare minimum of requests to saturate the device can be queried as the `ov::optimal_number_of_infer_requests` of the `ov:Compiled_Model`.
|
||||
* *The maximum number of streams* for the device (per model) can be queried as the `ov::range_for_streams` (see the query sketch after this list).
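A short Python sketch of these queries, assuming the string property names match their C++ counterparts (`ov::range_for_streams`, `ov::optimal_number_of_infer_requests`):

```
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # placeholder path

# Device-level bounds for the number of streams
print(core.get_property("CPU", "RANGE_FOR_STREAMS"))

compiled_model = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})
# Bare minimum of requests needed to saturate the streams that were actually created
print(compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS"))
```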
|
||||
|
||||
### Batch Size Considerations
|
||||
* Select the batch size that is **equal** to the number of requests that your application is able to runs simultaneously
|
||||
* Otherwise (or if the number of "available" requests fluctuates), you may need to keep several instances of the network (reshaped to the different batch size) and select the properly sized instance in the runtime accordingly
|
||||
* For OpenVINO devices that internally implement a dedicated heuristic, the `ov::optimal_batch_size` is a _device_ property (that accepts the actual model as a parameter) to query the recommended batch size for the model.
|
||||
* Select the batch size that is **equal** to the number of requests that your application is able to run simultaneously:
|
||||
* Otherwise (or if the number of "available" requests fluctuates), you may need to keep several instances of the network (reshaped to different batch sizes) and select the properly sized instance at runtime.
|
||||
* For OpenVINO devices that implement a dedicated heuristic internally, the `ov::optimal_batch_size` is a *device* property (that accepts the actual model as a parameter) to query the recommended batch size for the model.
|
||||
|
||||
|
||||
### Few Device Specific Details
|
||||
### A Few Device-specific Details
|
||||
* For the **GPU**:
|
||||
* When the parallel slack is small (e.g. only 2-4 requests executed simultaneously), then using only the streams for the GPU may suffice
|
||||
* Notice that the GPU runs 2 request per stream, so 4 requests can be served by 2 streams
|
||||
* Alternatively, consider single stream with with 2 requests (each with a small batch size like 2), which would total the same 4 inputs in flight
|
||||
* Typically, for 4 and more requests the batching delivers better throughput
|
||||
* Batch size can be calculated as "number of inference requests executed in parallel" divided by the "number of requests that the streams consume"
|
||||
* E.g. if you process 16 cameras (by 16 requests inferenced _simultaneously_) by the two GPU streams (each can process two requests), the batch size per request is 16/(2*2)=4
|
||||
* When the parallel slack is small, for example, only 2-4 requests executed simultaneously, using only the streams for the GPU may suffice:
|
||||
* The GPU runs 2 requests per stream, so 4 requests can be served by 2 streams.
|
||||
* Alternatively, consider a single stream with 2 requests (each with a small batch size like 2), which would total the same 4 inputs in flight.
|
||||
* Typically, for 4 and more requests the batching delivers better throughput.
|
||||
* A batch size can be calculated as "a number of inference requests executed in parallel" divided by the "number of requests that the streams consume":
|
||||
* For example, if you process 16 cameras (by 16 requests inferenced *simultaneously*) by 2 GPU streams (each can process two requests), the batch size per request is 16/(2*2)=4.
|
||||
|
||||
* For the **CPU always use the streams first**
|
||||
* On the high-end CPUs, using moderate (2-8) batch size _in addition_ to the maximum number of streams, may further improve the performance.
|
||||
* For the **CPU, always use the streams first**:
|
||||
* On high-end CPUs, using moderate (2-8) batch size *in addition* to the maximum number of streams may further improve the performance.
|
||||
|
@ -1,40 +1,36 @@
|
||||
# Introduction to Performance Optimization {#openvino_docs_optimization_guide_dldt_optimization_guide}
|
||||
Before exploring possible optimization techniques, let us first define what the inference performance is and how to measure that.
|
||||
Notice that reported inference performance often tends to focus on the speed of execution.
|
||||
In fact these are at least four connected factors of accuracy, throughput/latency and efficiency. The rest of the document discusses how to balance these key factors.
|
||||
Even though inference performance should be defined as a combination of many factors, including accuracy and efficiency, it is most often described as the speed of execution. Being the rate at which the model processes live data, it is based on two fundamentally interconnected metrics: latency and throughput.
|
||||
|
||||
|
||||
## What Is Inference Performance
|
||||
Generally, performance means how fast the model processes the live data. Two key metrics are used to measure the performance: latency and throughput are fundamentally interconnected.
|
||||
|
||||

|
||||
|
||||
**Latency** measures inference time (ms) required to process a single input. When it comes to the executing multiple inputs simultaneously (e.g. via batching) then the overall throughput (inferences per second, or frames per second, FPS, in the specific case of visual processing) is usually of more concern.
|
||||
To calculate **throughput**, divide number of inputs that were processed by the processing time.
|
||||
**Latency** measures inference time (in ms) required to process a single input. When it comes to executing multiple inputs simultaneously (for example, via batching), the overall throughput (inferences per second, or frames per second, FPS, in the specific case of visual processing) is usually more of a concern.
|
||||
**Throughput** is calculated by dividing the number of inputs that were processed by the processing time.
|
||||
|
||||
## End-to-End Application Performance
|
||||
It is important to separate the "pure" inference time of a neural network and the end-to-end application performance. For example data transfers between the host and a device may unintentionally affect the performance when a host input tensor is processed on the accelerator like dGPU.
|
||||
It is important to separate the "pure" inference time of a neural network and the end-to-end application performance. For example, data transfers between the host and a device may unintentionally affect the performance when a host input tensor is processed on an accelerator such as a dGPU.
|
||||
|
||||
Similarly, the input-preprocessing contributes significantly to the to inference time. As detailed in the [getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) section, when drilling into _inference_ performance, one option is to measure all such items separately.
|
||||
For the **end-to-end scenario** though, consider the image pre-processing thru the OpenVINO and the asynchronous execution as a way to amortize the communication costs like data transfers. You can find further details in the [general optimizations document](./dldt_deployment_optimization_common.md).
|
||||
Similarly, input preprocessing contributes significantly to the inference time. As described in the [getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) section, when evaluating *inference* performance, one option is to measure all such items separately.
|
||||
For the **end-to-end scenario**, though, consider image pre-processing with OpenVINO and the asynchronous execution as a way to lessen the communication costs (like data transfers). For more details, see the [general optimizations guide](./dldt_deployment_optimization_common.md).
|
||||
|
||||
**First-inference latency** is another specific case (e.g. when fast application start-up is required) where the resulting performance may be well dominated by the model loading time. Consider [model caching](../OV_Runtime_UG/Model_caching_overview.md) as a way to improve model loading/compilation time.
|
||||
Another specific case is **first-inference latency** (for example, when a fast application start-up is required), where the resulting performance may be well dominated by the model loading time. [Model caching](../OV_Runtime_UG/Model_caching_overview.md) may be considered as a way to improve model loading/compilation time.
|
||||
|
||||
Finally, **memory footprint** restrictions is another possible concern when designing an application. While this is a motivation for the _model_ optimization techniques referenced in the next section, notice that the the throughput-oriented execution is usually much more memory-hungry, as detailed in the [Runtime Inference Optimizations](../optimization_guide/dldt_deployment_optimization_guide.md).
|
||||
Finally, **memory footprint** restriction is another possible concern when designing an application. While this is a motivation for the use of the *model* optimization techniques, keep in mind that the throughput-oriented execution is usually much more memory consuming. For more details, see the [Runtime Inference Optimizations guide](../optimization_guide/dldt_deployment_optimization_guide.md).
|
||||
|
||||
|
||||
> **NOTE**: To get performance numbers for OpenVINO, as well as tips how to measure it and compare with native framework, check [Getting performance numbers](../MO_DG/prepare_model/Getting_performance_numbers.md) page.
|
||||
> **NOTE**: To get performance numbers for OpenVINO, along with the tips on how to measure and compare it with a native framework, see the [Getting performance numbers article](../MO_DG/prepare_model/Getting_performance_numbers.md).
|
||||
|
||||
## Improving the Performance: Model vs Runtime Optimizations
|
||||
## Improving Performance: Model vs Runtime Optimizations
|
||||
|
||||
> **NOTE**: Make sure that your model can be successfully inferred with OpenVINO Runtime.
|
||||
> **NOTE**: First, make sure that your model can be successfully inferred with OpenVINO Runtime.
|
||||
|
||||
With the OpenVINO there are two primary ways of improving the inference performance, namely model- and runtime-level optimizations. **These two optimizations directions are fully compatible**.
|
||||
There are two primary optimization approaches to improving inference performance with OpenVINO: model- and runtime-level optimizations. They are **fully compatible** and can be done independently.
|
||||
|
||||
- **Model optimizations** includes model modification, such as quantization, pruning, optimization of preprocessing, etc. Fore more details, refer to this [document](./model_optimization_guide.md).
|
||||
- Notice that the model optimizations directly improve the inference time, even without runtime parameters tuning, described below
|
||||
- **Model optimizations** include model modifications, such as quantization, pruning, optimization of preprocessing, etc. For more details, refer to this [document](./model_optimization_guide.md).
|
||||
- The model optimizations directly improve the inference time, even without runtime parameters tuning (described below).
|
||||
|
||||
- **Runtime (Deployment) optimizations** includes tuning of model _execution_ parameters. To read more visit the [Runtime Inference Optimizations](../optimization_guide/dldt_deployment_optimization_guide.md).
|
||||
- **Runtime (Deployment) optimizations** include tuning of model *execution* parameters. For more details, see the [Runtime Inference Optimizations guide](../optimization_guide/dldt_deployment_optimization_guide.md).
|
||||
|
||||
## Performance benchmarks
|
||||
To estimate the performance and compare performance numbers, measured on various supported devices, a wide range of public models are available at [Performance benchmarks](../benchmarks/performance_benchmarks.md) section.
|
||||
A wide range of public models for estimating performance and comparing the numbers (measured on various supported devices) are available in the [Performance benchmarks section](../benchmarks/performance_benchmarks.md).
|
@ -12,11 +12,11 @@
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
Model optimization is an optional offline step of improving final model performance by applying special optimization methods like quantization, pruning, preprocessing optimization, etc. OpenVINO provides several tools to optimize models at different steps of model development:
|
||||
Model optimization is an optional offline step of improving final model performance by applying special optimization methods, such as quantization, pruning, preprocessing optimization, etc. OpenVINO provides several tools to optimize models at different steps of model development:
|
||||
|
||||
- **Model Optimizer** implements optimization to a model, most of them added by default, but you can configure mean/scale values, batch size, RGB vs BGR input channels, and other parameters to speed up preprocess of a model ([Embedding Preprocessing Computation](../MO_DG/prepare_model/Additional_Optimizations.md)).
|
||||
- **Model Optimizer** implements most of the optimization parameters for a model by default. Yet, you are free to configure mean/scale values, batch size, RGB vs BGR input channels, and other parameters to speed up preprocessing of a model ([Embedding Preprocessing Computation](../MO_DG/prepare_model/Additional_Optimizations.md)).
|
||||
|
||||
- **Post-training Optimization tool** [(POT)](../../tools/pot/docs/Introduction.md) is designed to optimize the inference of deep learning models by applying post-training methods that do not require model retraining or fine-tuning, for example, post-training 8-bit quantization.
|
||||
- **Post-training Optimization tool** [(POT)](../../tools/pot/docs/Introduction.md) is designed to optimize inference of deep learning models by applying post-training methods that do not require model retraining or fine-tuning, for example, post-training 8-bit quantization.
|
||||
|
||||
- **Neural Network Compression Framework** [(NNCF)](./nncf_introduction.md) provides a suite of advanced methods for training-time model optimization within DL frameworks, such as PyTorch and TensorFlow. It supports methods like Quantization-aware Training and Filter Pruning. NNCF-optimized models can be inferred with OpenVINO using all the available workflows.
|
||||
|
||||
@ -25,15 +25,15 @@
|
||||
|
||||

|
||||
|
||||
To understand which development optimization tool you need, refer to the diagram:
|
||||
|
||||
Post-training methods are limited in terms of achievable accuracy and for challenging use cases accuracy might degrade. In this case, training-time optimization with NNCF is an option.
|
||||
|
||||
Once the model is optimized using the aforementioned tools it can be used for inference using the regular OpenVINO inference workflow. No changes to the code are required.
|
||||
The diagram below will help you understand which development optimization tool you need to use:
|
||||
|
||||

|
||||
|
||||
If you are not familiar with model optimization methods, we recommend starting from [post-training methods](@ref pot_introduction).
|
||||
Post-training methods are limited in terms of achievable accuracy, which may degrade for certain scenarios. In such cases, training-time optimization with NNCF may give better results.
|
||||
|
||||
## See also
|
||||
Once the model has been optimized using the aforementioned tools, it can be used for inference using the regular OpenVINO inference workflow. No changes to the code are required.
|
||||
|
||||
If you are not familiar with model optimization methods, refer to [post-training methods](@ref pot_introduction).
|
||||
|
||||
## Additional Resources
|
||||
- [Deployment optimization](./dldt_deployment_optimization_guide.md)
|
@ -1,40 +1,41 @@
|
||||
# Neural Network Compression Framework {#docs_nncf_introduction}
|
||||
|
||||
Neural Network Compression Framework (NNCF) is a set of advanced algorithms for optimizing Deep Neural Networks (DNN).
|
||||
It provides in-training optimization capabilities, which means that fine-tuning or even re-training the original model is necessary, and supports several optimization algorithms:
|
||||
|
||||
|Compression algorithm|PyTorch|TensorFlow 2.x|
|
||||
| :--- | :---: | :---: |
|
||||
|[8- bit quantization](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md) | Supported | Supported |
|
||||
|[Filter pruning](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Pruning.md) | Supported | Supported |
|
||||
|[Sparsity](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Sparsity.md) | Supported | Supported |
|
||||
|[Mixed-precision quantization](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#mixed_precision_quantization) | Supported | Not supported |
|
||||
|[Binarization](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Binarization.md) | Supported | Not supported |
|
||||
The Neural Network Compression Framework (NNCF) aims at optimizing Deep Neural Networks (DNN) by means of methods such as quantization and pruning. It provides in-training optimization capabilities, which means that the optimization methods require model fine-tuning or even re-training.
|
||||
|
||||
NNCF is distributed as a separate tool but is closely aligned with OpenVINO in terms of supported optimization features and models. It is open source and available on [GitHub](https://github.com/openvinotoolkit/nncf). The diagram below shows the model optimization workflow using NNCF.
|
||||
|
||||
The model optimization workflow using NNCF:
|
||||

|
||||
|
||||
The main NNCF characteristics:
|
||||
- Support for optimization of PyTorch and TensorFlow 2.x models.
|
||||
### Features
|
||||
- Support for optimization of PyTorch and TensorFlow 2.x models.
|
||||
- Support for various optimization algorithms, applied during a model fine-tuning process to achieve a better trade-off between performance and accuracy:
|
||||
|
||||
|Compression algorithm|PyTorch|TensorFlow 2.x|
|
||||
| :--- | :---: | :---: |
|
||||
|[8- bit quantization](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md) | Supported | Supported |
|
||||
|[Filter pruning](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Pruning.md) | Supported | Supported |
|
||||
|[Sparsity](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Sparsity.md) | Supported | Supported |
|
||||
|[Mixed-precision quantization](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#mixed_precision_quantization) | Supported | Not supported |
|
||||
|[Binarization](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Binarization.md) | Supported | Not supported |
|
||||
|
||||
- Stacking of optimization methods, for example: 8-bit quantization + Filter Pruning.
|
||||
- Support for [Accuracy-Aware model training](https://github.com/openvinotoolkit/nncf/blob/develop/docs/Usage.md#accuracy-aware-model-training) pipelines via the [Adaptive Compression Level Training](https://github.com/openvinotoolkit/nncf/tree/develop/docs/accuracy_aware_model_training/AdaptiveCompressionLevelTraining.md) and [Early Exit Training](https://github.com/openvinotoolkit/nncf/tree/develop/docs/accuracy_aware_model_training/EarlyExitTrainig.md).
|
||||
- Automatic and configurable model graph transformation to obtain the compressed model (limited support for TensorFlow models, only the ones created using Sequential or Keras Functional API, are supported).
|
||||
- Automatic, configurable model graph transformation to obtain the compressed model.
|
||||
> **NOTE**: Only models created using Sequential or Keras Functional API are supported. Support for TensorFlow models is limited.
|
||||
- GPU-accelerated layers for faster compressed model fine-tuning.
|
||||
- Distributed training support.
|
||||
- Configuration file examples for each supported compression algorithm.
|
||||
- Exporting PyTorch compressed models to ONNX checkpoints and TensorFlow compressed models to SavedModel or Frozen Graph format, ready to use with [OpenVINO™ toolkit](https://github.com/openvinotoolkit/).
|
||||
- Open source, available on [GitHub](https://github.com/openvinotoolkit/nncf).
|
||||
- Git patches for prominent third-party repositories ([huggingface-transformers](https://github.com/huggingface/transformers)) demonstrating the process of integrating NNCF into custom training pipelines.
|
||||
- Examples of configuration files for each supported compression algorithm.
|
||||
- Exporting PyTorch compressed models to ONNX checkpoints and TensorFlow compressed models to SavedModel or Frozen Graph format, ready to use with [OpenVINO toolkit](https://github.com/openvinotoolkit/).
|
||||
- Git patches for prominent third-party repositories ([huggingface-transformers](https://github.com/huggingface/transformers)) demonstrating the process of integrating NNCF into custom training pipelines.
|
||||
|
||||
## Get started
|
||||
### Installation
|
||||
NNCF provides the packages available for installation through the PyPI repository. To install the latest version via pip manager run the following command:
|
||||
## Installation
|
||||
NNCF provides packages available for installation through the PyPI repository. To install the latest version via the pip package manager, run the following command:
|
||||
```
|
||||
pip install nncf
|
||||
```
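Below is an illustrative PyTorch sketch of enabling 8-bit quantization-aware training with NNCF. The model, dataset, and fine-tuning loop are placeholders, and the configuration keys should be checked against the NNCF documentation for your version:

```
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models
from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

model = models.resnet18()  # placeholder model
train_loader = DataLoader(  # placeholder dataset, used here only for quantizer initialization
    TensorDataset(torch.randn(8, 3, 224, 224), torch.zeros(8, dtype=torch.long)), batch_size=4)

nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "quantization"},
})
nncf_config = register_default_init_args(nncf_config, train_loader)

# Wraps the model with fake-quantization operations; fine-tune it as usual afterwards
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)

# ... regular fine-tuning of compressed_model goes here ...

compression_ctrl.export_model("resnet18_int8.onnx")  # ONNX checkpoint ready for OpenVINO
```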
|
||||
|
||||
### Usage examples
|
||||
NNCF provides various examples and tutorials that demonstrate usage of optimization methods.
|
||||
## Usage examples
|
||||
NNCF provides various examples and tutorials that demonstrate usage of optimization methods:
|
||||
|
||||
### Tutorials
|
||||
- [Quantization-aware training of PyTorch model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/302-pytorch-quantization-aware-training)
|
||||
@ -52,7 +53,7 @@ NNCF provides various examples and tutorials that demonstrate usage of optimizat
|
||||
- [Instance Segmentation sample](https://github.com/openvinotoolkit/nncf/blob/develop/examples/tensorflow/segmentation/README.md)
|
||||
|
||||
|
||||
## See also
|
||||
## Additional Resources
|
||||
- [Compressed Model Zoo](https://github.com/openvinotoolkit/nncf#nncf-compressed-model-zoo)
|
||||
- [NNCF in HuggingFace Optimum](https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/optimum)
|
||||
- [Post-training optimization](../../tools/pot/docs/Introduction.md)
|
||||
|
@ -1,25 +1,22 @@
|
||||
# Compile Tool {#openvino_inference_engine_tools_compile_tool_README}
|
||||
|
||||
Compile tool is a C++ application that enables you to compile a model for inference on a specific device and export the compiled representation to a binary file.
|
||||
With the Compile Tool, you can compile a model using supported OpenVINO Runtime devices on a machine that doesn't have the physical device connected and then transfer a generated file to any machine with the target inference device available. See the [Features support matrix](../../docs/OV_Runtime_UG/supported_plugins/Device_Plugins.md) to understand which device support import / export functionality.
|
||||
With this tool, you can compile a model using supported OpenVINO Runtime devices on a machine that does not have the physical device connected, and then transfer a generated file to any machine with the target inference device available. To learn which device supports the import / export functionality, see the [feature support matrix](../../docs/OV_Runtime_UG/supported_plugins/Device_Plugins.md).
|
||||
|
||||
The tool compiles networks for the following target devices using corresponding OpenVINO Runtime plugins:
|
||||
* Intel® Neural Compute Stick 2 (MYRIAD plugin)
|
||||
The tool compiles networks for the following target device using the corresponding OpenVINO Runtime plugin: Intel® Neural Compute Stick 2 (MYRIAD plugin).
|
||||
|
||||
The tool is delivered as an executable file that can be run on both Linux* and Windows*.
|
||||
The tool is located in the `<INSTALLROOT>/tools/compile_tool` directory.
|
||||
The tool is delivered as an executable file that can be run on both Linux and Windows. It is located in the `<INSTALLROOT>/tools/compile_tool` directory.
|
||||
|
||||
## Workflow of the Compile tool
|
||||
|
||||
1. First, the application reads command-line parameters and loads a model to the OpenVINO Runtime device.
|
||||
2. Then the application exports a blob with the compiled model and writes it to the output file.
|
||||
First, the application reads command-line parameters and loads a model to the OpenVINO Runtime device. After that, the application exports a blob with the compiled model and writes it to the output file.
|
||||
|
||||
Also, the compile_tool supports the following capabilities:
|
||||
- Embedding [layout](../../docs/OV_Runtime_UG/layout_overview.md) and precision conversions (see [Optimize Preprocessing](../../docs/OV_Runtime_UG/preprocessing_overview.md)). To compile the model with advanced preprocessing capabilities, refer to [Use Case - Integrate and Save Preprocessing Steps Into IR](../../docs/OV_Runtime_UG/preprocessing_usecase_save.md) which shows how to have all the preprocessing in the compiled blob.
|
||||
- Compile blobs for OpenVINO Runtime API 2.0 by default or for Inference Engine API with explicit option `-ov_api_1_0`
|
||||
- Accepts device specific options for customizing the compilation process
|
||||
Also, the Compile tool supports the following capabilities:
|
||||
- Embedding [layout](../../docs/OV_Runtime_UG/layout_overview.md) and precision conversions (for more details, see the [Optimize Preprocessing](../../docs/OV_Runtime_UG/preprocessing_overview.md)). To compile the model with advanced preprocessing capabilities, refer to the [Use Case - Integrate and Save Preprocessing Steps Into OpenVINO IR](../../docs/OV_Runtime_UG/preprocessing_usecase_save.md), which shows how to have all the preprocessing in the compiled blob.
|
||||
- Compiling blobs for OpenVINO Runtime API 2.0 by default or for Inference Engine API with explicit option `-ov_api_1_0`.
|
||||
- Accepting device specific options for customizing the compilation process.
|
||||
|
||||
## Run the Compile Tool
|
||||
## Running the Compile Tool
|
||||
|
||||
Running the application with the `-h` option yields the following usage message:
|
||||
|
||||
|
@ -10,29 +10,31 @@
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
## Introduction
|
||||
This document assumes that you already tried [Default Quantization](@ref pot_default_quantization_usage) for the same model. In case when it introduces a significant accuracy degradation, the Accuracy-aware Quantization algorithm can be used to remain accuracy within the pre-defined range. This may cause a
|
||||
degradation of performance in comparison to [Default Quantization](@ref pot_default_quantization_usage) algorithm because some layers can be reverted back to the original precision.
|
||||
The Accuracy-aware Quantization algorithm allows performing quantization while keeping accuracy within a pre-defined range. Note that it should be used only if [Default Quantization](@ref pot_default_quantization_usage) introduces a significant accuracy degradation. The reason it is not the primary choice is its potential for performance degradation, as some layers may get reverted to the original precision.
|
||||
|
||||
> **NOTE**: In case of GNA `target_device`, the Accuracy-aware Quantization algorithm behavior is different. It is searching for the best configuration selecting between INT8 and INT16 precisions for weights of each layer. The algorithm works for the `performance` preset only. For the `accuracy` preset, this algorithm is not helpful since the whole model is already in INT16 precision.
|
||||
To proceed with this article, make sure you have read how to use [Default Quantization](@ref pot_default_quantization_usage).
|
||||
|
||||
> **NOTE**: The Accuracy-aware Quantization algorithm's behavior is different for the GNA `target_device`. In this case it searches for the best configuration and selects between INT8 and INT16 precisions for weights of each layer. The algorithm works for the `performance` preset only. It is not useful for the `accuracy` preset, since the whole model is already in INT16 precision.
|
||||
|
||||
A script for Accuracy-aware Quantization includes four steps:
|
||||
1. Prepare data and dataset interface
|
||||
2. Define accuracy metric
|
||||
3. Select quantization parameters
|
||||
4. Define and run quantization process
|
||||
1. Prepare data and dataset interface.
|
||||
2. Define accuracy metric.
|
||||
3. Select quantization parameters.
|
||||
4. Define and run quantization process.
|
||||
|
||||
## Prepare data and dataset interface
|
||||
This step is the same as in the case of [Default Quantization](@ref pot_default_quantization_usage). The only difference is that `__getitem__()` method should return `(data, annotation)` or `(data, annotation, metadata)` where `annotation` is required and its format should correspond to the expectations of the `Metric` class. `metadata` is an optional field that can be used to store additional information required for post-processing.
|
||||
This step is the same as in [Default Quantization](@ref pot_default_quantization_usage). The only difference is that `__getitem__()` should return `(data, annotation)` or `(data, annotation, metadata)`. The `annotation` is required and its format should correspond to the expectations of the `Metric` class. The `metadata` is an optional field that can be used to store additional information required for post-processing.
|
||||
|
||||
## Define accuracy metric
|
||||
To control accuracy during the optimization a `openvino.tools.pot.Metric` interface should be implemented. Each implementation should override the following properties:
|
||||
To control accuracy during optimization, the `openvino.tools.pot.Metric` interface should be implemented. Each implementation should override the following properties and methods:
|
||||
|
||||
**Properties**
|
||||
- `value` - returns the accuracy metric value for the last model output in a format of `Dict[str, numpy.array]`.
|
||||
- `avg_value` - returns the average accuracy metric over collected model results in a format of `Dict[str, numpy.array]`.
|
||||
- `higher_better` should return `True` if a higher value of the metric corresponds to better performance, otherwise, returns `False`. Default implementation returns `True`.
|
||||
- `higher_better` if a higher value of the metric corresponds to better performance, returns `True` , otherwise, `False`. The default implementation returns `True`.
|
||||
|
||||
and methods:
|
||||
- `update(output, annotation)` - calculates and updates the accuracy metric value using the last model output and annotation. The model output and annotation should be passed in this method. It should also contain the model-specific post-processing in case the model returns the raw output.
|
||||
**Methods**
|
||||
- `update(output, annotation)` - calculates and updates the accuracy metric value, using the last model output and annotation. The model output and annotation should be passed in this method. It should also contain the model-specific post-processing in case the model returns the raw output.
|
||||
- `reset()` - resets collected accuracy metric.
|
||||
- `get_attributes()` - returns a dictionary of metric attributes:
|
||||
```
|
||||
@ -41,7 +43,7 @@ and methods:
|
||||
Required attributes:
|
||||
- `direction` - (`higher-better` or `higher-worse`) a string parameter defining whether metric value
|
||||
should be increased in accuracy-aware algorithms.
|
||||
- `type` - a string representation of metric type. For example, 'accuracy' or 'mean_iou'.
|
||||
- `type` - a string representation of a metric type. For example, "accuracy" or "mean_iou".
|
||||
|
||||
Below is an example of the accuracy top-1 metric implementation with POT API:
|
||||
```python
|
||||
@ -103,13 +105,12 @@ engine = IEEngine(config=engine_config, data_loader=data_loader, metric=metric)
|
||||
```
|
||||
|
||||
## Select quantization parameters
|
||||
Accuracy-aware Quantization uses the Default Quantization algorithm at the initialization step so that all its parameters are also valid and can be specified. Here, we
|
||||
describe only Accuracy-aware Quantization required parameters:
|
||||
- `"maximal_drop"` - maximum accuracy drop which has to be achieved after the quantization. Default value is `0.01` (1%).
|
||||
Accuracy-aware Quantization uses the Default Quantization algorithm at the initialization step in such an order that all its parameters are also valid and can be specified. The only parameter required exclusively by Accuracy-aware Quantization is:
|
||||
- `"maximal_drop"` - the maximum accuracy drop which has to be achieved after the quantization. The default value is `0.01` (1%).
|
||||
|
||||
## Run quantization
|
||||
|
||||
The code example below shows a basic quantization workflow with accuracy control. `UserDataLoader()` is a placeholder for the implementation of `DataLoader`.
|
||||
The example code below shows a basic quantization workflow with accuracy control. `UserDataLoader()` is a placeholder for the implementation of `DataLoader`.
|
||||
|
||||
```python
|
||||
from openvino.tools.pot import IEEngine
|
||||
@ -140,13 +141,13 @@ algorithms = [
|
||||
}
|
||||
]
|
||||
|
||||
# Step 1: implement and create user's data loader
|
||||
# Step 1: Implement and create user's data loader.
|
||||
data_loader = UserDataLoader()
|
||||
|
||||
# Step 2: implement and create user's data loader
|
||||
# Step 2: Implement and create user's data loader.
|
||||
metric = Accuracy()
|
||||
|
||||
# Step 3: load model
|
||||
# Step 3: Load the model.
|
||||
model = load_model(model_config=model_config)
|
||||
|
||||
# Step 4: Initialize the engine for metric calculation and statistics collection.
|
||||
@ -161,7 +162,7 @@ compressed_model = pipeline.run(model=model)
|
||||
compress_model_weights(compressed_model)
|
||||
|
||||
# Step 7: Save the compressed model to the desired path.
|
||||
# Set save_path to the directory where the model should be saved
|
||||
# Set save_path to the directory where the model should be saved.
|
||||
compressed_model_paths = save_model(
|
||||
model=compressed_model,
|
||||
save_path="optimized_model",
|
||||
|
@ -10,45 +10,44 @@
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
## Introduction
|
||||
The [Default Quantization](@ref pot_default_quantization_usage) of the Post-training Optimization Tool (POT) is
|
||||
the fastest and easiest way to get a quantized model because it requires only some unannotated representative dataset to be provided in most cases. Thus, it is recommended to use it as a starting point when it comes to model optimization. However, it can lead to significant accuracy deviation in some cases. This document is aimed at providing tips to address this issue.
|
||||
the fastest and easiest way to get a quantized model. It requires only some unannotated representative dataset to be provided in most cases. Therefore, it is recommended to use it as a starting point when it comes to model optimization. However, it can lead to significant accuracy deviation in some cases. The purpose of this article is to provide tips to address this issue.
|
||||
|
||||
> **NOTE**: POT uses inference on the CPU during model optimization. It means the ability to infer the original
|
||||
> floating-point model is a prerequisite for model optimization.
|
||||
> It is also worth mentioning that in the case of 8-bit quantization it is recommended to run POT on the same CPU
|
||||
> **NOTE**: POT uses inference on the CPU during model optimization. It means that ability to infer the original
|
||||
> floating-point model is essential for model optimization.
|
||||
> It is also worth mentioning that in case of the 8-bit quantization, it is recommended to run POT on the same CPU
|
||||
> architecture when optimizing for CPU or VNNI-based CPU when quantizing for a non-CPU device, such as GPU, VPU, or GNA.
|
||||
> It should help to avoid the impact of the [saturation issue](@ref pot_saturation_issue) that occurs on AVX and SSE based CPU devices.
|
||||
|
||||
## Improving accuracy after the Default Quantization
|
||||
Parameters of the Default Quantization algorithm with basic settings are shown below:
|
||||
Parameters of the Default Quantization algorithm with basic settings are presented below:
|
||||
```python
|
||||
{
|
||||
"name": "DefaultQuantization", # Optimization algorithm name
|
||||
"params": {
|
||||
"preset": "performance", # Preset [performance, mixed] which controls
|
||||
# the quantization scheme. For the CPU:
|
||||
# performance - symmetric quantization of weights and activations
|
||||
# mixed - symmetric weights and asymmetric activations
|
||||
# performance - symmetric quantization of weights and activations.
|
||||
# mixed - symmetric weights and asymmetric activations.
|
||||
"stat_subset_size": 300 # Size of subset to calculate activations statistics that can be used
|
||||
# for quantization parameters calculation
|
||||
# for quantization parameters calculation.
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
In the case of substantial accuracy degradation after applying this method there are two alternatives:
|
||||
1. Hyperparameters tuning
|
||||
2. AccuracyAwareQuantization algorithm
|
||||
There are two alternatives in case of substantial accuracy degradation after applying this method:
|
||||
1. Hyperparameters tuning.
|
||||
2. AccuracyAwareQuantization algorithm.
|
||||
|
||||
### Tuning Hyperparameters of the Default Quantization
|
||||
The Default Quantization algorithm provides multiple hyperparameters which can be used in order to improve accuracy results for the fully-quantized model.
|
||||
Below is a list of best practices that can be applied to improve accuracy without a substantial performance reduction with respect to default settings:
|
||||
1. The first recommended option is to change the `preset` from `performance` to `mixed`. This enables asymmetric quantization of
|
||||
activations and can be helpful for models with non-ReLU activation functions, for example, YOLO, EfficientNet, etc.
|
||||
2. The next option is `use_fast_bias`. Setting this option to `false` enables a different bias correction method which is more accurate, in general,
|
||||
and applied after model quantization as a part of the Default Quantization algorithm.
|
||||
> **NOTE**: Changing this option can substantially increase quantization time in the POT tool.
|
||||
3. Another important option is a `range_estimator`. It defines how to calculate the minimum and maximum of quantization range for weights and activations.
|
||||
2. The second option is the `use_fast_bias`. Setting this option to `false` enables a different bias correction method which is generally more accurate
|
||||
and applied after model quantization, as a part of the Default Quantization algorithm.
|
||||
> **NOTE**: Changing this option can substantially increase quantization time in POT tool.
|
||||
3. Another important option is the `range_estimator`. It defines how to calculate the minimum and maximum of quantization range for weights and activations.
|
||||
For example, the following `range_estimator` for activations can improve the accuracy for Faster R-CNN based networks:
|
||||
```python
|
||||
{
|
||||
@ -72,16 +71,16 @@ For example, the following `range_estimator` for activations can improve the acc
|
||||
|
||||
Find the possible options and their description in the `configs/default_quantization_spec.json` file in the POT directory.
|
||||
|
||||
4. The next option is `stat_subset_size`. It controls the size of the calibration dataset used by POT to collect statistics for quantization parameters initialization.
|
||||
It is assumed that this dataset should contain a sufficient number of representative samples. Thus, varying this parameter may affect accuracy (higher is better).
|
||||
However, we empirically found that 300 samples are sufficient to get representative statistics in most cases.
|
||||
5. The last option is `ignored_scope`. It allows excluding some layers from the quantization process, i.e. their inputs will not be quantized. It may be helpful for some patterns for which it is known in advance that they drop accuracy when executing in low-precision.
|
||||
For example, `DetectionOutput` layer of SSD model expressed as a subgraph should not be quantized to preserve the accuracy of Object Detection models.
|
||||
One of the sources for the ignored scope can be the Accuracy-aware algorithm which can revert layers back to the original precision (see details below).
|
||||
4. The next option is the `stat_subset_size`. It controls the size of the calibration dataset used by POT to collect statistics for quantization parameters initialization.
|
||||
It is assumed that this dataset should contain a sufficient number of representative samples. Hence, varying this parameter may affect accuracy (higher is better).
|
||||
However, it proves that 300 samples are sufficient to get representative statistics in most cases.
|
||||
5. The last option is the `ignored_scope`. It allows excluding some layers from the quantization process, for example, their inputs will not be quantized. It may be helpful for some patterns, which are known in advance, that they drop accuracy when executing in low-precision.
|
||||
For example, the `DetectionOutput` layer of SSD model expressed as a subgraph should not be quantized to preserve the accuracy of Object Detection models.
|
||||
One of the sources for the ignored scope can be the Accuracy-aware algorithm, which can revert layers back to the original precision (see the details below).
|
||||
|
||||
## Accuracy-aware Quantization
|
||||
In case when the steps above do not lead to the accurate quantized model you may use the so-called [Accuracy-aware Quantization](@ref pot_accuracyaware_usage) algorithm which leads to mixed-precision models.
|
||||
A fragment of Accuracy-aware Quantization configuration with default settings is shown below below:
|
||||
If the steps above do not result in an accurate quantized model, you may use the so-called [Accuracy-aware Quantization](@ref pot_accuracyaware_usage) algorithm, which produces mixed-precision models.
|
||||
Here is a fragment of Accuracy-aware Quantization configuration with default settings:
|
||||
```python
|
||||
{
|
||||
"name": "AccuracyAwareQuantization",
|
||||
@ -97,10 +96,9 @@ A fragment of Accuracy-aware Quantization configuration with default settings is
|
||||
|
||||
Since the Accuracy-aware Quantization calls the Default Quantization at the first step it means that all the parameters of the latter one are also valid and can be applied to the accuracy-aware scenario.
|
||||
|
||||
> **NOTE**: In general case, possible speedup after applying the Accuracy-aware Quantization algorithm is less than after the Default Quantization when the model gets fully quantized.
|
||||
> **NOTE**: In general, the potential increase in speed with the Accuracy-aware Quantization algorithm is not as high as with the Default Quantization, when the model gets fully quantized.
|
||||
|
||||
### Reducing the performance gap of Accuracy-aware Quantization
|
||||
To improve model performance after Accuracy-aware Quantization, you can try the `"tune_hyperparams"` setting and set it to `True`. It will enable searching for optimal quantization parameters before reverting layers to the "backup" precision. Note, that this can increase the overall quantization time.
|
||||
To improve model performance after Accuracy-aware Quantization, try the `"tune_hyperparams"` setting and set it to `True`. It will enable searching for optimal quantization parameters before reverting layers to the "backup" precision. Note that this may impact the overall quantization time, though.
|
||||
|
||||
If you do not achieve the desired accuracy and performance after applying the
|
||||
Accuracy-aware Quantization algorithm or you need an accurate fully-quantized model, we recommend either using Quantization-Aware Training from [NNCF](@ref docs_nncf_introduction).
|
||||
If the Accuracy-aware Quantization algorithm does not provide the desired accuracy and performance or you need an accurate, fully-quantized model, use [NNCF](@ref docs_nncf_introduction) for Quantization-Aware Training.
|
||||
|
@ -10,30 +10,28 @@
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
## Introduction
|
||||
This document describes how to apply model quantization with the Default Quantization method without accuracy control using an unannotated dataset. To use this method, you need to create a Python* script using an API of Post-Training Optimization Tool (POT) and implement data preparation logic and quantization pipeline. In case you are not familiar with Python*, you can try [command-line interface](@ref pot_compression_cli_README) of POT which is designed to quantize models from OpenVINO™ [Model Zoo](https://github.com/openvinotoolkit/open_model_zoo). The figure below shows the common workflow of the quantization script implemented with POT API.
|
||||
This guide describes how to apply model quantization with the Default Quantization method without accuracy control, using an unannotated dataset. To use this method, you need to create a Python script using an API of Post-Training Optimization Tool (POT) and implement data preparation logic and quantization pipeline. If you are not familiar with Python, try [command-line interface](@ref pot_compression_cli_README) of POT which is designed to quantize models from OpenVINO [Model Zoo](https://github.com/openvinotoolkit/open_model_zoo). The figure below shows the common workflow of the quantization script implemented with POT API.
|
||||
|
||||

|
||||
|
||||
The script should include three basic steps:
|
||||
1. Prepare data and dataset interface
|
||||
2. Select quantization parameters
|
||||
3. Define and run quantization process
|
||||
1. Prepare data and dataset interface.
|
||||
2. Select quantization parameters.
|
||||
3. Define and run quantization process.
|
||||
|
||||
## Prepare data and dataset interface
|
||||
In most cases, it is required to implement only `openvino.tools.pot.DataLoader` interface which allows acquiring data from a dataset and applying model-specific pre-processing providing access by index. Any implementation should override the following methods:
|
||||
In most cases, it is required to implement only the `openvino.tools.pot.DataLoader` interface, which allows acquiring data from a dataset and applying model-specific pre-processing providing access by index. Any implementation should override the following methods:
|
||||
|
||||
- `__len__()`, returns the size of the dataset
|
||||
- `__getitem__()`, provides access to the data by index in range of 0 to `len(self)`. It also can encapsulate the logic of model-specific pre-processing. The method should return data in the following format:
|
||||
- `(data, annotation)`
|
||||
|
||||
where `data` is the input that is passed to the model at inference so that it should be properly preprocessed. `data` can be either `numpy.array` object or dictionary, where the key is the name of the model input and value is `numpy.array` which corresponds to this input. Since `annotation` is not used by the Default Quantization method this object can be `None` in this case.
|
||||
- The `__len__()`, returns the size of the dataset.
|
||||
- The `__getitem__()`, provides access to the data by index in range of 0 to `len(self)`. It can also encapsulate the logic of model-specific pre-processing. This method should return data in the `(data, annotation)` format, in which:
|
||||
* The `data` is the input that is passed to the model at inference so that it should be properly preprocessed. It can be either the `numpy.array` object or a dictionary, where the key is the name of the model input and value is `numpy.array` which corresponds to this input.
|
||||
* The `annotation` is not used by the Default Quantization method. Therfore, this object can be `None` in this case.
|
||||
|
||||
You can wrap framework data loading classes by `openvino.tools.pot.DataLoader` interface which is usually straightforward. For example, `torch.utils.data.Dataset` has a similar interface as `openvino.tools.pot.DataLoader` so that its TorchVision implementations can be easily wrapped by POT API.
|
||||
Framework data loading classes can be wrapped by the `openvino.tools.pot.DataLoader` interface which is usually straightforward. For example, the `torch.utils.data.Dataset` has a similar interface as the `openvino.tools.pot.DataLoader`, so that its TorchVision implementations can be easily wrapped by POT API.
|
||||
|
||||
> **NOTE**: Model-specific preprocessing, for example, mean/scale normalization can be embedded into the model at the conversion step using Model Optimizer component. This should be considered during the implementation of the DataLoader interface to avoid "double" normalization which can lead to the loss of accuracy after optimization.
|
||||
> **NOTE**: Model-specific preprocessing (for example, mean/scale normalization), can be embedded into the model at the conversion step, using Model Optimizer component. This should be considered during the implementation of the DataLoader interface to avoid "double" normalization, which can lead to the loss of accuracy after optimization.
|
||||
|
||||
The code example below defines `DataLoader` for three popular use cases: images, text, and audio.
|
||||
The example code below defines the `DataLoader` for three popular use cases: images, text, and audio.
|
||||
|
||||
@sphinxtabset
|
||||
|
||||
@ -70,15 +68,16 @@ Default Quantization algorithm has mandatory and optional parameters which are d
|
||||
},
|
||||
}
|
||||
```
|
||||
|
||||
- `"target_device"` - currently, only two options are available: `"ANY"` (or `"CPU"`) - to quantize model for CPU, GPU, or VPU, and `"GNA"` - for inference on GNA.
|
||||
- `"stat_subset_size"` - size of data subset to calculate activations statistics used for quantization. The whole dataset is used if no parameter specified. We recommend using not less than 300 samples.
|
||||
- `"stat_subset_size"` - size of data subset to calculate activations statistics used for quantization. The whole dataset is used if no parameter is specified. It is recommended to use not less than 300 samples.
|
||||
- `"stat_batch_size"` - size of batch to calculate activations statistics used for quantization. 1 if no parameter specified.
|
||||
|
||||
Full specification of the Default Quantization method is available in this [document](@ref pot_compression_algorithms_quantization_default_README).
|
||||
For full specification, see the the [Default Quantization method](@ref pot_compression_algorithms_quantization_default_README).
|
||||
|
||||
## Run quantization
|
||||
POT API provides its own methods to load and save model objects from OpenVINO Intermediate Representation: `load_model` and `save_model`. It also has a concept of `Pipeline` that sequentially applies specified optimization methods to the model. `create_pipeine` method is used to instantiate a `Pipeline` object.
|
||||
A code example below shows a basic quantization workflow:
|
||||
POT API provides methods to load and save model objects from OpenVINO Intermediate Representation: the `load_model` and `save_model`. It also has a concept of the `Pipeline` that sequentially applies specified optimization methods to the model. The `create_pipeline` method is used to instantiate a `Pipeline` object.
|
||||
An example code below shows a basic quantization workflow:
|
||||
|
||||
```python
|
||||
from openvino.tools.pot import IEEngine
|
||||
@ -86,7 +85,7 @@ from openvino.tools.pot load_model, save_model
|
||||
from openvino.tools.pot import compress_model_weights
|
||||
from openvino.tools.pot import create_pipeline
|
||||
|
||||
# Model config specifies the model name and paths to model .xml and .bin file
|
||||
# Model config specifies the name of the model and paths to .xml and .bin files of the model.
|
||||
model_config =
|
||||
{
|
||||
"model_name": "model",
|
||||
@ -94,7 +93,7 @@ model_config =
|
||||
"weights": path_to_bin,
|
||||
}
|
||||
|
||||
# Engine config
|
||||
# Engine config.
|
||||
engine_config = {"device": "CPU"}
|
||||
|
||||
algorithms = [
|
||||
@ -108,10 +107,10 @@ algorithms = [
|
||||
}
|
||||
]
|
||||
|
||||
# Step 1: Implement and create user's data loader
|
||||
# Step 1: Implement and create a user data loader.
|
||||
data_loader = ImageLoader("<path_to_images>")
|
||||
|
||||
# Step 2: Load model
|
||||
# Step 2: Load a model.
|
||||
model = load_model(model_config=model_config)
|
||||
|
||||
# Step 3: Initialize the engine for metric calculation and statistics collection.
|
||||
@ -126,7 +125,7 @@ compressed_model = pipeline.run(model=model)
|
||||
compress_model_weights(compressed_model)
|
||||
|
||||
# Step 6: Save the compressed model to the desired path.
|
||||
# Set save_path to the directory where the model should be saved
|
||||
# Set save_path to the directory where the model should be saved.
|
||||
compressed_model_paths = save_model(
|
||||
model=compressed_model,
|
||||
save_path="optimized_model",
|
||||
@ -136,10 +135,10 @@ compressed_model_paths = save_model(
|
||||
|
||||
The output of the script is the quantized model that can be used for inference in the same way as the original full-precision model.
|
||||
|
||||
If accuracy degradation after applying the Default Quantization method is high, it is recommended to try tips from [Quantization Best Practices](@ref pot_docs_BestPractices) document or use [Accuracy-aware Quantization](@ref pot_accuracyaware_usage) method.
|
||||
If high degradation of accuracy occurs after applying the Default Quantization method, it is recommended to follow the tips from [Quantization Best Practices](@ref pot_docs_BestPractices) article or use [Accuracy-aware Quantization](@ref pot_accuracyaware_usage) method.
|
||||
|
||||
## Quantizing cascaded models
|
||||
In some cases, when the optimizing model is a cascaded model, i.e. consists of several submodels, for example, MT-CNN, you will need to implement a complex inference pipeline that can properly handle different submodels and data flow between them. POT API provides an `Engine` interface for this purpose which allows customization of the inference logic. However, we suggest inheriting from `IEEngine` helper class that already contains all the logic required to do the inference based on OpenVINO™ Python API. See the following [example](@ref pot_example_face_detection_README).
|
||||
In some cases, when the optimized model is a cascaded one (consists of several submodels, for example, MT-CNN), you will need to implement a complex inference pipeline that can properly handle different submodels and data flow between them. POT API provides the `Engine` interface for this purpose, which allows customization of the inference logic. However, it is recommended to inherit from `IEEngine` helper class that already contains all the logic required to do the inference based on OpenVINO Python API. For more details, see the following [example](@ref pot_example_face_detection_README).
|
||||
|
||||
## Examples
|
||||
|
||||
|
@ -16,36 +16,35 @@
|
||||
|
||||
@endsphinxdirective
|
||||
|
||||
## Introduction
|
||||
|
||||
Post-training model optimization is the process of applying special methods without model retraining or fine-tuning, for example, post-training 8-bit quantization. Therefore, this process does not require a training dataset or a training pipeline in the source DL framework. To apply post-training methods in OpenVINO™, you need:
|
||||
* A floating-point precision model, FP32 or FP16, converted into the OpenVINO™ Intermediate Representation (IR) format
|
||||
and run on CPU with the OpenVINO™.
|
||||
Post-training model optimization is the process of applying special methods without model retraining or fine-tuning. Therefore, it does not require either a training dataset or a training pipeline in the source DL framework. In OpenVINO, post-training methods, such as post-training 8-bit quantization, require:
|
||||
* A floating-point precision model (FP32 or FP16), converted to the OpenVINO IR format (Intermediate Representation)
|
||||
and run on CPU with OpenVINO.
|
||||
* A representative calibration dataset representing a use case scenario, for example, 300 samples.
|
||||
* In case of accuracy constraints, a validation dataset and accuracy metrics should be available.
|
||||
|
||||
For the needs of post-training optimization, OpenVINO™ provides a Post-training Optimization Tool (POT) which supports the uniform integer quantization method. This method allows substantially increasing inference performance and reducing the model size.
|
||||
OpenVINO provides a Post-training Optimization Tool (POT) that supports the uniform integer quantization method. It can substantially increase inference performance and reduce the size of a model.
|
||||
|
||||
Figure below shows the optimization workflow with POT:
|
||||
The figure below shows the optimization workflow with POT:
|
||||

|
||||
|
||||
|
||||
## Quantizing models with POT
|
||||
|
||||
POT provides two main quantization methods that can be used depending on the user's needs and requirements:
|
||||
Depending on your needs and requirements, POT provides two main quantization methods that can be used:
|
||||
|
||||
* [Default Quantization](@ref pot_default_quantization_usage) is a recommended method that provides fast and accurate results in most cases. It requires only a unannotated dataset for quantization. For details, see the [Default Quantization algorithm](@ref pot_compression_algorithms_quantization_default_README) documentation.
|
||||
* [Default Quantization](@ref pot_default_quantization_usage) -- a recommended method that provides fast and accurate results in most cases. It requires only an unannotated dataset for quantization. For more details, see the [Default Quantization algorithm](@ref pot_compression_algorithms_quantization_default_README) documentation.
|
||||
|
||||
* [Accuracy-aware Quantization](@ref pot_accuracyaware_usage) is an advanced method that allows keeping accuracy at a predefined range at the cost of performance improvement in case when `Default Quantization` cannot guarantee it. The method requires annotated representative dataset and may require more time for quantization. For details, see the
|
||||
* [Accuracy-aware Quantization](@ref pot_accuracyaware_usage) -- an advanced method that allows keeping accuracy at a predefined range, at the cost of performance improvement, when `Default Quantization` cannot guarantee it. This method requires an annotated representative dataset and may require more time for quantization. For more details, see the
|
||||
[Accuracy-aware Quantization algorithm](@ref accuracy_aware_README) documentation.
|
||||
|
||||
HW platforms support different integer precisions and quantization parameters, for example 8-bit in CPU, GPU, VPU, 16-bit for GNA. POT abstracts this complexity by introducing a concept of "target device" that is used to set quantization settings specific to the device. The `target_device` parameter is used for this purpose.
|
||||
Different hardware platforms support different integer precisions and quantization parameters. For example, 8-bit is used by CPU, GPU, VPU, and 16-bit by GNA. POT abstracts this complexity by introducing a concept of the "target device" used to set quantization settings, specific to the device.
|
||||
|
||||
> **NOTE**: There is a special `target_device: "ANY"` which leads to portable quantized models compatible with CPU, GPU, and VPU devices. GNA-quantized models are compatible only with CPU.
|
||||
|
||||
For benchmarking results collected for the models optimized with the POT tool, refer to [INT8 vs FP32 Comparison on Select Networks and Platforms](@ref openvino_docs_performance_int8_vs_fp32).
|
||||
For benchmarking results collected for the models optimized with the POT tool, refer to the [INT8 vs FP32 Comparison on Select Networks and Platforms](@ref openvino_docs_performance_int8_vs_fp32).
|
||||
|
||||
## See Also
|
||||
## Additional Resources
|
||||
|
||||
* [Performance Benchmarks](https://docs.openvino.ai/latest/openvino_docs_performance_benchmarks_openvino.html)
|
||||
* [INT8 Quantization by Using Web-Based Interface of the DL Workbench](https://docs.openvino.ai/latest/workbench_docs_Workbench_DG_Int_8_Quantization.html)
|
||||
|
@ -1,19 +1,19 @@
|
||||
# Saturation (overflow) Issue Workaround {#pot_saturation_issue}
|
||||
|
||||
## Introduction
|
||||
8-bit instructions of previous generations of Intel® CPUs, namely those based on SSE, AVX-2, AVX-512 instruction sets, admit so-called saturation (overflow) of the intermediate buffer when calculating the dot product which is an essential part of Convolutional or MatMul operations. This saturation can lead to an accuracy drop on the mentioned architectures during the inference of 8-bit quantized models. However, it is not possible to predict such degradation since most of the computations are executed in parallel during DL model inference which makes this process non-deterministic. This problem is typical for models with non-ReLU activation functions and low level of redundancy, for example, optimized or efficient models. It can prevent deploying the model on legacy hardware or creating cross-platform applications. The problem does not occur on the CPUs with Intel Deep Learning Boost (VNNI) technology and further generations, as well as on GPUs.
|
||||
8-bit instructions of older Intel CPU generations (based on SSE, AVX-2, and AVX-512 instruction sets) are prone to so-called saturation (overflow) of the intermediate buffer when calculating the dot product, which is an essential part of Convolutional or MatMul operations. This saturation can lead to a drop in accuracy when running inference of 8-bit quantized models on the mentioned architectures. Additionally, it is impossible to predict if the issue occurs in a given setup, since most computations are executed in parallel during DL model inference, which makes this process non-deterministic. This is a common problem for models with non-ReLU activation functions and low level of redundancy (for example, optimized or efficient models). It can prevent deploying the model on legacy hardware or creating cross-platform applications. The problem does not occur on GPUs or CPUs with Intel Deep Learning Boost (VNNI) technology and further generations.
|
||||
|
||||
## Saturation Problem Detection
|
||||
The only way to detect saturation issue is to run inference on the CPU that admits it and on the hardware that does not have such problem (for example, VNNI-based CPU). If the accuracy difference is significant (more than 1%), this is the main indicator of the saturation issue impact.
|
||||
The only way to detect the saturation issue is to run inference on a CPU that allows it and then on one that does not (for example, a VNNI-based CPU). A significant difference in accuracy (more than 1%) will be the main indicator of the saturation issue impact.
|
||||
|
||||
## Workaround
|
||||
There is a workaround that helps fully address the saturation issue during the inference. The algorithm uses only 7 bits to represent weights (of Convolutional or Fully-Connected layers) while quantizing activations using the full range of 8-bit data types. However, this can lead to an accuracy degradation due to the reduced representation of weights. On the other hand, using this workaround for the first layer can help mitigate the saturation issue for many models.
|
||||
## Saturation Issue Workaround
|
||||
While quantizing activations use the full range of 8-bit data types, there is a workaround using only 7 bits to represent weights (of Convolutional or Fully-Connected layers). Using this algorithm for the first layer can help mitigate the saturation issue for many models. However, this can lead to lower accuracy due to reduced representation of weights.
|
||||
|
||||
POT tool provides three options to deal with the saturation issue. The options can be enabled in the POT configuration file using the "saturation_fix" parameter:
|
||||
POT tool provides three options to deal with the saturation issue. The options can be enabled in the POT configuration file using the `saturation_fix` parameter:
|
||||
|
||||
* (Default) Fix saturation issue for the first layer: "first_layer" option
|
||||
* Apply for all layers in the model: "all" option
|
||||
* Do not apply saturation fix at all: "no" option
|
||||
* "First_layer" option -- (default) fix saturation issue for the first layer.
|
||||
* "All" option -- apply for all layers in the model.
|
||||
* "No" option -- do not apply saturation fix at all.
|
||||
|
||||
Below is an example of the section in the POT configuration file with the `saturation_fix` option:
|
||||
```json
|
||||
@ -28,12 +28,12 @@ Below is an example of the section in the POT configuration file with the `satur
|
||||
}
|
||||
]
|
||||
```
|
||||
## Recommendations
|
||||
|
||||
If you observe the saturation issue, we recommend trying the option "all" during the model quantization. If it does not help improve the accuracy, we recommend using [Quantization-aware training from NNCF](https://github.com/openvinotoolkit/nncf) and fine-tuning the model.
|
||||
If you observe the saturation issue, try the "all" option during model quantization. If the accuracy problem still occurs, try using [Quantization-aware training from NNCF](https://github.com/openvinotoolkit/nncf) and fine-tuning the model.
|
||||
|
||||
If you are not planning to use legacy CPU HW, you can use the option "no", which might also lead to slightly better accuracy.
|
||||
Use the "no" option when leaving out legacy CPU HW. It might also lead to slightly better accuracy.
|
||||
|
||||
## Additional Resources
|
||||
|
||||
## See Also
|
||||
* [Lower Numerical Precision Deep Learning Inference and Training blogpost](https://www.intel.com/content/www/us/en/developer/articles/technical/lower-numerical-precision-deep-learning-inference-and-training.html)
|
||||
* [Configuration file description](@ref pot_configs_README)
|
@ -1,12 +1,12 @@
|
||||
# AccuracyAwareQuantization Algorithm {#accuracy_aware_README}
|
||||
|
||||
## Introduction
|
||||
AccuracyAwareQuantization algorithm is aimed at accurate quantization and allows the model's accuracy to stay within the
|
||||
pre-defined range. This may cause a
|
||||
degradation in performance in comparison to [DefaultQuantization](../default/README.md) algorithm because some layers can be reverted back to the original precision.
|
||||
The purpose of AccuracyAwareQuantization Algorithm is performing precise quantization, while keeping model accuracy within a
|
||||
pre-defined range. In comparison to [DefaultQuantization](../default/README.md) algorithm this may cause a
|
||||
degradation in performance because some layers can be reverted back to the original precision.
|
||||
|
||||
## Parameters
|
||||
Since the [DefaultQuantization](../default/README.md) algorithm is used as an initialization, all its parameters are also valid and can be specified. Here is an example of the definition of the `AccuracyAwareQuantization` method and its parameters:
|
||||
Since the [DefaultQuantization](../default/README.md) algorithm is used as an initialization, all its parameters are also valid and can be specified. Below is an example of the `AccuracyAwareQuantization` method and its parameters:
|
||||
```json
|
||||
{
|
||||
"name": "AccuracyAwareQuantization", // the name of optimization algorithm
|
||||
@ -16,40 +16,92 @@ Since the [DefaultQuantization](../default/README.md) algorithm is used as an in
|
||||
}
|
||||
```
|
||||
|
||||
Below is the description of AccuracyAwareQuantization-specific parameters:
|
||||
Below are the descriptions of AccuracyAwareQuantization-specific parameters:
|
||||
- `"ranking_subset_size"` - size of a subset that is used to rank layers by their contribution to the accuracy drop.
|
||||
Default value is `300`. The more samples it has the better ranking you have, potentially.
|
||||
- `"max_iter_num"` - maximum number of iterations of the algorithm, in other words maximum number of layers that may
|
||||
be reverted back to floating-point precision. By default it is limited by the overall number of quantized layers.
|
||||
- `"maximal_drop"` - maximum accuracy drop which has to be achieved after the quantization. Default value is `0.01` (1%).
|
||||
- `"drop_type"` - drop type of the accuracy metric:
|
||||
- `"absolute"` (default) - absolute drop with respect to the results of the full-precision model
|
||||
- `"relative"` - relative to the results of the full-precision model
|
||||
- `"use_prev_if_drop_increase"` - whether to use network snapshot from the previous iteration of in case if drop
|
||||
increases. Default value is `True`.
|
||||
- `"base_algorithm"` - name of the algorithm that is used to quantize model at the beginning. Default value is
|
||||
Default value is `300`, and more samples it has the better ranking, potentially.
|
||||
- `"max_iter_num"` - the maximum number of iterations of the algorithm. In other words, the maximum number of layers that may
|
||||
be reverted back to floating-point precision. By default, it is limited by the overall number of quantized layers.
|
||||
- `"maximal_drop"` - the maximum accuracy drop which has to be achieved after the quantization. The default value is `0.01` (1%).
|
||||
- `"drop_type"` - a drop type of the accuracy metric:
|
||||
- `"absolute"` - the (default) absolute drop with respect to the results of the full-precision model.
|
||||
- `"relative"` - relative to the results of the full-precision model.
|
||||
- `"use_prev_if_drop_increase"` - the use of network snapshot from the previous iteration when a drop
|
||||
increases. The default value is `True`.
|
||||
- `"base_algorithm"` - name of the algorithm that is used to quantize a model at the beginning. The default value is
|
||||
"DefaultQuantization".
|
||||
- `"convert_to_mixed_preset"` - whether to convert the model to "mixed" mode if the accuracy criteria for the model
|
||||
- `"convert_to_mixed_preset"` - set to convert the model to "mixed" mode if the accuracy criteria for the model
|
||||
quantized with "performance" preset are not satisfied. This option can help to reduce number of layers that are reverted
|
||||
to floating-point precision. Note: this is an experimental feature.
|
||||
- `"metrics"` - optional list of metrics that are taken into account during optimization. It consists of tuples with the
|
||||
to floating-point precision. Keep in mind that this is an **experimental** feature.
|
||||
- `"metrics"` - an optional list of metrics that are taken into account during optimization. It consists of tuples with the
|
||||
following parameters:
|
||||
- `"name"` - name of the metric to optimize
|
||||
- `"baseline_value"` - baseline metric value of the original model. This is the optional parameter. The validations on
|
||||
the whole validation will be initiated in the beginning if nothing specified.
|
||||
- `"metric_subset_ratio"` - part of the validation set that is used to compare original full-precision and
|
||||
fully quantized models when creating ranking subset in case of predefined metric values of the original model.
|
||||
Default value is `0.5`.
|
||||
- `"tune_hyperparams"` - enables quantization parameters tuning as a preliminary step before reverting layers back
|
||||
to the floating-point precision. It can bring additional performance and accuracy boost but increase overall
|
||||
quantization time. Default value is `False`.
|
||||
- `"name"` - name of the metric to optimize.
|
||||
- `"baseline_value"` - (optional parameter) a baseline metric value of the original model. The validations on
|
||||
The validation will be initiated entirely in the beginning if nothing specified.
|
||||
- `"metric_subset_ratio"` - a part of the validation set that is used to compare original full-precision and
|
||||
fully quantized models when creating a ranking subset in case of predefined metric values of the original model.
|
||||
The default value is `0.5`.
|
||||
- `"tune_hyperparams"` - enables tuning of quantization parameters as a preliminary step before reverting layers back
|
||||
to the floating-point precision. It can bring an additional boost in performance and accuracy, at the cost of increased overall
|
||||
quantization time. The default value is `False`.
|
||||
|
||||
## Examples
|
||||
## Additional Resources
|
||||
|
||||
Example:
|
||||
* [Quantization of Object Detection model with control of accuracy](https://github.com/openvinotoolkit/openvino/tree/master/tools/pot/openvino/tools/pot/api/samples/object_detection)
|
||||
|
||||
A template and full specification for AccuracyAwareQuantization algorithm for POT command-line interface:
|
||||
* [Template](https://github.com/openvinotoolkit/openvino/blob/master/tools/pot/configs/accuracy_aware_quantization_template.json)
|
||||
* [Full specification](https://github.com/openvinotoolkit/openvino/blob/master/tools/pot/configs/accuracy_aware_quantization_spec.json)
|
||||
Full specification and a template for AccuracyAwareQuantization algorithm for POT command-line interface:
|
||||
* [Full specification](https://github.com/openvinotoolkit/openvino/blob/master/tools/pot/configs/accuracy_aware_quantization_spec.json)
|
||||
|
||||
@sphinxdirective
|
||||
|
||||
.. dropdown:: Template
|
||||
|
||||
.. code-block:: javascript
|
||||
|
||||
/* This configuration file is the fastest way to get started with the accuracy aware
|
||||
quantization algorithm. It contains only mandatory options with commonly used
|
||||
values. All other options can be considered as an advanced mode and requires
|
||||
deep knowledge of the quantization process. An overall description of all possible
|
||||
parameters can be found in the accuracy_aware_quantization_spec.json */
|
||||
|
||||
{
|
||||
/* Model parameters */
|
||||
|
||||
"model": {
|
||||
"model_name": "model_name", // Model name
|
||||
"model": "<MODEL_PATH>", // Path to model (.xml format)
|
||||
"weights": "<PATH_TO_WEIGHTS>" // Path to weights (.bin format)
|
||||
},
|
||||
|
||||
/* Parameters of the engine used for model inference */
|
||||
|
||||
"engine": {
|
||||
"config": "<CONFIG_PATH>" // Path to Accuracy Checker config
|
||||
},
|
||||
|
||||
/* Optimization hyperparameters */
|
||||
|
||||
"compression": {
|
||||
"target_device": "ANY", // Target device, the specificity of which will be taken
|
||||
// into account during optimization
|
||||
"algorithms": [
|
||||
{
|
||||
"name": "AccuracyAwareQuantization", // Optimization algorithm name
|
||||
"params": {
|
||||
"preset": "performance", // Preset [performance, mixed, accuracy] which control the quantization
|
||||
// mode (symmetric, mixed (weights symmetric and activations asymmetric)
|
||||
// and fully asymmetric respectively)
|
||||
|
||||
"stat_subset_size": 300, // Size of subset to calculate activations statistics that can be used
|
||||
// for quantization parameters calculation
|
||||
|
||||
"maximal_drop": 0.01, // Maximum accuracy drop which has to be achieved after the quantization
|
||||
"tune_hyperparams": false // Whether to search the best quantization parameters for model
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@endsphinxdirective
|
||||
|
@ -1,10 +1,9 @@
|
||||
# DefaultQuantization Algorithm {#pot_compression_algorithms_quantization_default_README}
|
||||
|
||||
## Introduction
|
||||
DefaultQuantization algorithm is designed to do a fast and, in many cases, accurate quantization. It does not have any control of accuracy metric but provides a lot of knobs that can be used to improve it.
|
||||
The DefaultQuantization Algorithm is designed to perform fast and accurate quantization. It does not offer direct control over the accuracy metric itself but provides many options that can be used to improve it.
|
||||
|
||||
## Parameters
|
||||
DefaultQuantization algorithm has mandatory and optional parameters. For more details on how to use these parameters please refer to [Best Practices](@ref pot_docs_BestPractices) document. Here is an example of the definition of DefualtQuantization method and its parameters:
|
||||
DefaultQuantization Algorithm has mandatory and optional parameters. For more details on how to use these parameters, refer to [Best Practices](@ref pot_docs_BestPractices) article. Below is an example of the DefaultQuantization method and its parameters:
|
||||
```python
|
||||
{
|
||||
"name": "DefaultQuantization", # the name of optimization algorithm
|
||||
@ -15,90 +14,89 @@ DefaultQuantization algorithm has mandatory and optional parameters. For more de
|
||||
```
|
||||
|
||||
### Mandatory parameters
|
||||
- `"preset"` - preset which controls the quantization mode (symmetric and asymmetric). It can take two values:
|
||||
- `"preset"` - a preset which controls the quantization mode (symmetric and asymmetric). It can take two values:
|
||||
- `"performance"` (default) - stands for symmetric quantization of weights and activations. This is the most
|
||||
performant across all the HW.
|
||||
efficient across all the HW.
|
||||
- `"mixed"` - symmetric quantization of weights and asymmetric quantization of activations. This mode can be useful
|
||||
for quantization of NN which has both negative and positive input values in quantizing operations, e.g.
|
||||
for quantization of NN, which has both negative and positive input values in quantizing operations, for example
|
||||
non-ReLU based CNN.
|
||||
- `"stat_subset_size"` - size of subset to calculate activations statistics used for quantization. The whole dataset
|
||||
is used if no parameter specified. We recommend using not less than 300 samples.
|
||||
- `"stat_batch_size"` - size of batch to calculate activations statistics used for quantization. 1 if no parameter specified.
|
||||
|
||||
- `"stat_subset_size"` - size of a subset to calculate activations statistics used for quantization. The whole dataset
|
||||
is used if no parameter is specified. It is recommended to use not less than 300 samples.
|
||||
- `"stat_batch_size"` - size of a batch to calculate activations statistics used for quantization. It has a value of 1 if no parameter is specified.
|
||||
|
||||
### Optional parameters
|
||||
All other options can be considered as an advanced mode and require deep knowledge of the quantization process. Below
|
||||
All other options should be considered as an advanced mode and require deep knowledge of the quantization process. Below
|
||||
is an overall description of all possible parameters:
|
||||
- `"model type"` - An optional parameter, needed for additional patterns in the model, default value is None (supported only "Transformer" now)
|
||||
- `"inplace_statistic"` - An optional parameter, needed for change method collect statistics, reduces the amount of memory consumed, but increases the calibration time
|
||||
- `"ignored"` - NN subgraphs which should be excluded from the optimization process
|
||||
- `"scope"` - list of particular nodes to exclude
|
||||
- `"operations"` - list of operation types to exclude (expressed in OpenVINO IR notation). This list consists of
|
||||
- `"model type"` - an optional parameter, required for additional patterns in the model. The default value is "None" ("Transformer" is only other supported value now).
|
||||
- `"inplace_statistic"` - an optional parameter, required for change of collect statistics method. This parameter reduces the amount of memory consumed, but increases the calibration time.
|
||||
- `"ignored"` - NN subgraphs which should be excluded from the optimization process:
|
||||
- `"scope"` - a list of particular nodes to exclude.
|
||||
- `"operations"` - a list of operation types to exclude (expressed in OpenVINO IR notation). This list consists of
|
||||
the following tuples:
|
||||
- `"type"` - type of ignored operation
|
||||
- `"attributes"` - if attributes are defined they will be considered during the ignorance. They are defined by
|
||||
a dictionary of `"<NAME>": "<VALUE>"` pairs.
|
||||
- `"weights"` - this section manually defines quantization scheme for weights and the way to estimate the
|
||||
quantization range for that. It worth noting that changing the quantization scheme may lead to inability to infer such
|
||||
- `"type"` - a type of ignored operation.
|
||||
- `"attributes"` - if attributes are defined, they will be considered during the ignorance. They are defined by
|
||||
a dictionary of `"<NAME>": "<VALUE>"` pairs.
|
||||
- `"weights"` - this section describes quantization scheme for weights and the way to estimate the
|
||||
quantization range for that. It is worth noting that changing the quantization scheme may lead to inability to infer such
|
||||
mode on the existing HW.
|
||||
- `"bits"` - bit-width, default is 8
|
||||
- `"mode"` - quantization mode (symmetric or asymmetric)
|
||||
- `"level_low"` - minimum level in the integer range in which we quantize to, default is 0 for unsigned range, -2^(bit-1) - for signed
|
||||
- `"level_high"` - maximum level in the integer range in which we quantize to, default is 2^bits-1 for unsigned range, 2^(bit-1)-1 - for signed
|
||||
- `"granularity"` - quantization scale granularity and can take the following two values:
|
||||
- `"pertensor"` (default) - per-tensor quantization with one scale factor and zero-point
|
||||
- `"perchannel"` - per-channel quantization with per-channel scale factor and zero-point
|
||||
- `"bits"` - bit-width, the default value is "8".
|
||||
- `"mode"` - a quantization mode (symmetric or asymmetric).
|
||||
- `"level_low"` - the minimum level in the integer range to quantize. The default is "0" for an unsigned range, and "-2^(bit-1)" for a signed one .
|
||||
- `"level_high"` - the maximum level in the integer range to quantize. The default is "2^bits-1" for an unsigned range, and "2^(bit-1)-1" for a signed one.
|
||||
- `"granularity"` - quantization scale granularity. It can take the following values:
|
||||
- `"pertensor"` (default) - per-tensor quantization with one scale factor and zero-point.
|
||||
- `"perchannel"` - per-channel quantization with per-channel scale factor and zero-point.
|
||||
- `"range_estimator"` - this section describes parameters of range estimator that is used in MinMaxQuantization
|
||||
method to get the quantization ranges and filter outliers based on the collected statistics. These are the parameters
|
||||
that user can vary to get better accuracy results:
|
||||
method to get the quantization ranges and filter outliers based on the collected statistics. Below are the parameters
|
||||
that can be modified to get better accuracy results:
|
||||
- `"max"` - parameters to estimate top border of quantizing floating-point range:
|
||||
- `"type"` - type of the estimator:
|
||||
- `"max"` (default) - estimates the maximum in the quantizing set of value
|
||||
- `"quantile"` - estimates the quantile in the quantizing set of value
|
||||
- `"outlier_prob"` - outlier probability used in the "quantile" estimator
|
||||
- `"type"` - a type of the estimator:
|
||||
- `"max"` (default) - estimates the maximum in the quantizing set of value.
|
||||
- `"quantile"` - estimates the quantile in the quantizing set of value.
|
||||
- `"outlier_prob"` - outlier probability used in the "quantile" estimator.
|
||||
- `"min"` - parameters to estimate bottom border of quantizing floating-point range:
|
||||
- `"type"` - type of the estimator:
|
||||
- `"min"` (default) - estimates the minimum in the quantizing set of value
|
||||
- `"quantile"` - estimates the quantile in the quantizing set of value
|
||||
- `"outlier_prob"` - outlier probability used in the "quantile" estimator
|
||||
- `"activations"` - this section manually defines quantization scheme for activations and the way to estimate the
|
||||
quantization range for that. Again, changing the quantization scheme may lead to inability to infer such
|
||||
mode on the existing HW.
|
||||
- `"bits"` - bit-width, default is 8
|
||||
- `"mode"` - quantization mode (symmetric or asymmetric)
|
||||
- `"level_low"` - minimum level in the integer range in which we quantize to, default is 0 for unsigned range, -2^(bit-1) - for signed
|
||||
- `"level_high"` - maximum level in the integer range in which we quantize to, default is 2^bits-1 for unsigned range, 2^(bit-1)-1 - for signed
|
||||
- `"granularity"` - quantization scale granularity and can take the following two values:
|
||||
- `"pertensor"` (default) - per-tensor quantization with one scale factor and zero-point
|
||||
- `"perchannel"` - per-channel quantization with per-channel scale factor and zero-point
|
||||
- `"type"` - a type of the estimator:
|
||||
- `"min"` (default) - estimates the minimum in the quantizing set of value.
|
||||
- `"quantile"` - estimates the quantile in the quantizing set of value.
|
||||
- `"outlier_prob"` - outlier probability used in the "quantile" estimator.
|
||||
- `"activations"` - this section describes quantization scheme for activations and the way to estimate the
|
||||
quantization range for that. As before, changing the quantization scheme may lead to inability to infer such
|
||||
mode on the existing HW:
|
||||
- `"bits"` - bit-width, the default value is "8".
|
||||
- `"mode"` - a quantization mode (symmetric or asymmetric).
|
||||
- `"level_low"` - the minimum level in the integer range to quantize. The default is "0" for an unsigned range, and "-2^(bit-1)" for a signed one.
|
||||
- `"level_high"` - the maximum level in the integer range to quantize. The default is "2^bits-1" for an unsigned range, and "2^(bit-1)-1" for a signed one.
|
||||
- `"granularity"` - quantization scale granularity. It can take the following values:
|
||||
- `"pertensor"` (default) - per-tensor quantization with one scale factor and zero-point.
|
||||
- `"perchannel"` - per-channel quantization with per-channel scale factor and zero-point.
|
||||
- `"range_estimator"` - this section describes parameters of range estimator that is used in MinMaxQuantization
|
||||
method to get the quantization ranges and filter outliers based on the collected statistics. These are the parameters
|
||||
that user can vary to get better accuracy results:
|
||||
- `"preset"` - preset that defines the same estimator both for top and bottom borders of quantizing
|
||||
that can be modified to get better accuracy results:
|
||||
- `"preset"` - preset that defines the same estimator for both top and bottom borders of quantizing
|
||||
floating-point range. Possible value is `"quantile"`.
|
||||
- `"max"` - parameters to estimate top border of quantizing floating-point range:
|
||||
- `"aggregator"` - type of the function used to aggregate statistics obtained with estimator
|
||||
- `"aggregator"` - a type of the function used to aggregate statistics obtained with the estimator
|
||||
over the calibration dataset to get a value of the top border:
|
||||
- `"mean"` (default) - aggregates mean value
|
||||
- `"max"` - aggregates max value
|
||||
- `"min"` - aggregates min value
|
||||
- `"median"` - aggregates median value
|
||||
- `"mean_no_outliers"` - aggregates mean value after removal of extreme quantiles
|
||||
- `"median_no_outliers"` - aggregates median value after removal of extreme quantiles
|
||||
- `"hl_estimator"` - Hodges-Lehmann filter based aggregator
|
||||
- `"type"` - type of the estimator:
|
||||
- `"max"` (default) - estimates the maximum in the quantizing set of value
|
||||
- `"quantile"` - estimates the quantile in the quantizing set of value
|
||||
- `"outlier_prob"` - outlier probability used in the "quantile" estimator
|
||||
- `"mean"` (default) - aggregates mean value.
|
||||
- `"max"` - aggregates max value.
|
||||
- `"min"` - aggregates min value.
|
||||
- `"median"` - aggregates median value.
|
||||
- `"mean_no_outliers"` - aggregates mean value after removal of extreme quantiles.
|
||||
- `"median_no_outliers"` - aggregates median value after removal of extreme quantiles.
|
||||
- `"hl_estimator"` - Hodges-Lehmann filter based aggregator.
|
||||
- `"type"` - a type of the estimator:
|
||||
- `"max"` (default) - estimates the maximum in the quantizing set of value.
|
||||
- `"quantile"` - estimates the quantile in the quantizing set of value.
|
||||
- `"outlier_prob"` - outlier probability used in the "quantile" estimator.
|
||||
- `"min"` - parameters to estimate bottom border of quantizing floating-point range:
|
||||
- `"type"` - type of the estimator:
|
||||
- `"max"` (default) - estimates the maximum in the quantizing set of value
|
||||
- `"quantile"` - estimates the quantile in the quantizing set of value
|
||||
- `"outlier_prob"` - outlier probability used in the "quantile" estimator
|
||||
- `"type"` - a type of the estimator:
|
||||
- `"max"` (default) - estimates the maximum in the quantizing set of value.
|
||||
- `"quantile"` - estimates the quantile in the quantizing set of value.
|
||||
- `"outlier_prob"` - outlier probability used in the "quantile" estimator.
|
||||
- `"use_layerwise_tuning"` - enables layer-wise fine-tuning of model parameters (biases, Convolution/MatMul weights and FakeQuantize scales) by minimizing the mean squared error between original and quantized layer outputs.
|
||||
Enabling this option may increase compressed model accuracy, but will result in increased execution time and memory consumption.
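
As an illustration, below is a minimal, hypothetical sketch of how the quantization scheme options described above can be placed inside the `"params"` section of the `DefaultQuantization` algorithm entry. The field names follow the descriptions in this list, while the specific values (bit-width, mode, granularity) are placeholders rather than recommendations; any of them can be omitted to fall back to the defaults.

@sphinxdirective

.. code-block:: javascript

   "algorithms": [
       {
           "name": "DefaultQuantization",
           "params": {
               "preset": "performance",
               "stat_subset_size": 300,

               /* Optional per-group overrides of the quantization scheme;
                  the values below are placeholders, not recommendations */
               "weights": {
                   "bits": 8,
                   "mode": "symmetric",
                   "granularity": "perchannel"
               },
               "activations": {
                   "bits": 8,
                   "mode": "asymmetric",
                   "granularity": "pertensor"
               }
           }
       }
   ]

@endsphinxdirective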
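
In the same spirit, here is a hypothetical sketch of the range estimator and layer-wise tuning options. The field names come from the descriptions above; the quantile estimator and the `outlier_prob` value are illustrative placeholders, and `"preset": "quantile"` could be used instead of configuring the `"max"` and `"min"` borders individually.

@sphinxdirective

.. code-block:: javascript

   "params": {
       "preset": "performance",
       "stat_subset_size": 300,

       "activations": {
           "range_estimator": {
               "max": {
                   "aggregator": "mean",    // aggregates the estimator's statistics over the calibration dataset
                   "type": "quantile",
                   "outlier_prob": 0.0001   // placeholder outlier probability for the "quantile" estimator
               },
               "min": {
                   "type": "quantile",
                   "outlier_prob": 0.0001
               }
           }
       },

       "use_layerwise_tuning": true         // may improve accuracy at the cost of run time and memory
   }

@endsphinxdirective
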
## Additional Resources

Tutorials:
* [Quantization of Image Classification model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/301-tensorflow-training-openvino)
* [Quantization of Object Detection model from Model Zoo](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/111-detection-quantization)

Command-line example:
* [Quantization of Image Classification model](https://docs.openvino.ai/latest/pot_configs_examples_README.html)

Full specification and a template for the DefaultQuantization algorithm for the POT command-line interface:
* [Template](https://github.com/openvinotoolkit/openvino/blob/master/tools/pot/configs/default_quantization_template.json)
* [Full specification](https://github.com/openvinotoolkit/openvino/blob/master/tools/pot/configs/default_quantization_spec.json)

@sphinxdirective

.. dropdown:: Template

   .. code-block:: javascript

      /* This configuration file is the fastest way to get started with the default
         quantization algorithm. It contains only mandatory options with commonly used
         values. All other options can be considered as an advanced mode and require
         deep knowledge of the quantization process. An overall description of all possible
         parameters can be found in the default_quantization_spec.json */

      {
          /* Model parameters */

          "model": {
              "model_name": "model_name", // Model name
              "model": "<MODEL_PATH>", // Path to the model (.xml format)
              "weights": "<PATH_TO_WEIGHTS>" // Path to the weights (.bin format)
          },

          /* Parameters of the engine used for model inference */

          "engine": {
              "config": "<CONFIG_PATH>" // Path to the Accuracy Checker config
          },

          /* Optimization hyperparameters */

          "compression": {
              "target_device": "ANY", // Target device, the specificity of which will be taken
                                      // into account during optimization
              "algorithms": [
                  {
                      "name": "DefaultQuantization", // Optimization algorithm name
                      "params": {
                          "preset": "performance", // Preset [performance, mixed, accuracy] which controls the quantization
                                                   // mode (symmetric, mixed (weights symmetric and activations asymmetric),
                                                   // and fully asymmetric, respectively)

                          "stat_subset_size": 300 // Size of the subset used to calculate activation statistics
                                                  // for quantization parameter calculation
                      }
                  }
              ]
          }
      }

@endsphinxdirective