37 KiB
How to Implement Custom Layers for VPU (Intel® Neural Compute Stick 2)
To enable operations not supported by OpenVINO™ out of the box, you need a custom extension for Model Optimizer, a custom nGraph operation set, and a custom kernel for the target device. This page describes custom kernel support for one VPU, the Intel® Neural Compute Stick 2 device, which uses the MYRIAD device plugin.
NOTES:
- OpenCL* custom layer support is available in the preview mode.
- This section assumes you are familiar with developing kernels using OpenCL. To customize your topology with an OpenCL layer, carry out the tasks described on this page:
- Write and compile your OpenCL code with the standalone offline OpenCL compiler (`clc`).
- Write a configuration file to bind the OpenCL kernel to the topology file (`.xml`) of the model IR.
- Pass the configuration file to the OpenVINO™ Runtime with the model IR.
Compile OpenCL code for VPU (Intel® Neural Compute Stick 2)
Note: The OpenCL compiler, targeting Intel® Neural Compute Stick 2 for the SHAVE* processor only, is redistributed with OpenVINO. OpenCL support is provided by ComputeAorta* and is distributed under a license agreement between Intel® and Codeplay* Software Ltd.
The OpenCL toolchain for the Intel® Neural Compute Stick 2 supports offline compilation only, so first compile OpenCL C code using the standalone `clc` compiler. You can find the compiler binary at `<INSTALL_DIR>/tools/cl_compiler`.
Note: By design, custom OpenCL layers support any OpenCL kernels written with the assumption of OpenCL version 1.2. They also support the half float extension and are optimized for this type, because it is a native type for Intel® Movidius™ VPUs.
- Prior to running a compilation, make sure that the following variables are set:
  - `SHAVE_MA2X8XLIBS_DIR=<INSTALL_DIR>/tools/cl_compiler/lib/`
  - `SHAVE_LDSCRIPT_DIR=<INSTALL_DIR>/tools/cl_compiler/ldscripts/`
  - `SHAVE_MYRIAD_LD_DIR=<INSTALL_DIR>/tools/cl_compiler/bin/`
  - `SHAVE_MOVIASM_DIR=<INSTALL_DIR>/tools/cl_compiler/bin/`
- Run the compilation with the command below. Use `--strip-binary-header` to make an OpenCL runtime-agnostic binary runnable with the OpenVINO™ Runtime.
cd <INSTALL_DIR>/tools/cl_compiler/bin
./clc --strip-binary-header custom_layer.cl -o custom_layer.bin
Write a Configuration File
To bind the topology IR for a layer you customize, prepare a configuration file so that the OpenVINO™ Runtime can find the parameters for your kernel and the description of the execution work grid. For example, consider the following OpenCL kernel signature:
__kernel void reorg_nhwc(__global const half *src, __global half *out, int w, int h, int c, int stride);
A configuration file for this kernel might be the following:
<CustomLayer name="ReorgYolo" type="MVCL" version="1">
<Kernel entry="reorg_nhwc">
<Source filename="reorg.bin"/>
</Kernel>
<Parameters>
<Tensor arg-name="src" type="input" port-index="0" format="BYXF"/>
<Tensor arg-name="out" type="output" port-index="0" format="BYXF"/>
<Scalar arg-name="w" type="int" port-index="0" source="I.X" />
<Scalar arg-name="h" type="int" port-index="0" source="I.Y" />
<Scalar arg-name="c" type="int" port-index="0" source="I.F" />
<Scalar arg-name="stride" type="int" source="stride" />
</Parameters>
<WorkSizes dim="input,0" global="(Y+7)/8*8,1,1" local="8,1,1"/>
</CustomLayer>
Each custom layer is described with the CustomLayer node. It has the following nodes and attributes:
- Root node `CustomLayer` contains the following attributes:
  - `name` – (Required) The name of the OpenVINO™ Runtime layer to bind the kernel with.
  - `type` and `version` – (Required) Reserved for future use. Set them to `MVCL` and `1` respectively.
  - `max-shaves` – (Optional) The maximum number of SHAVE cores that should be dedicated to the layer. It is useful for debugging concurrency issues, or for saving resources when a memory-bound kernel does not scale well with the number of cores, so that more resources can be left for the rest of the topology.
- Sub-node `Kernel` must contain the following attributes:
  - `entry` – The name of your kernel function as you defined it in a source file. In the example above, it is `reorg_nhwc`.
  - Node `Source` must contain the following attributes:
    - `filename` – The path to a compiled binary relative to the XML configuration file.
- Sub-node `Parameters` – Describes parameter bindings. For more information, see the description below.
- Sub-node `WorkSizes` – Describes local and global work group sizes and the source for dimension deduction as a `direction,port` pair. In the example above, the work group is described relative to the dimension of the input tensor that comes through port 0 in the IR. The `global` and `local` work group configurations support any simple math expressions with `+`, `-`, `*`, `/`, and `()` over `B` (batch), `Y` (height), `X` (width), and `F` (channels).
- Sub-node `Where` – Allows customizing bindings with the `key="value"` attribute. For example, to substitute only 3x3 convolutions, write `<Where kernel="3,3"/>` in the binding XML.
The parameter description supports `Tensor` nodes of the `input`, `output`, `input_buffer`, `output_buffer`, or `data` type, as well as `Scalar` and `Data` nodes, and has the following format:
- Each `Tensor` node of `input` or `output` type must contain the following attributes:
  - `arg-name` – The name of a kernel parameter in the kernel signature.
  - `type` – Node type: `input` or `output`, as specified in the IR.
  - `port-index` – A number of input/output ports as specified in the IR.
  - `format` – The channel order in the tensor. Optional conversion layers are generated if the custom layer format is not compatible with the formats of neighboring layers. The `BFXY`, `BYXF`, and `ANY` formats are currently supported.
- Each `Tensor` node of `input_buffer` or `output_buffer` type must contain the following attributes:
  - `arg-name` – The name of a kernel parameter in the kernel signature.
  - `type` – Node type: `input_buffer` or `output_buffer`. Use the appropriate type to bind multiple kernels that correspond to different stages of the same layer.
  - `port-index` – The unique identifier to bind by.
  - `dim` – The dim source, in the same `direction,port` format used for `WorkSizes` bindings.
  - `size` – The amount of bytes needed. The current expression syntax supports only expressions over the dimensions of the selected input/output tensor or constants, and may be extended in the future.
Here is an example of multi-stage MVN layer binding:
<CustomLayer name="MVN" stage="0" type="MVCL" version="1">
<Kernel entry="reduction_mean">
<Source filename="mvn.bin"/>
</Kernel>
<Parameters>
<Tensor arg-name="src" type="input" port-index="0" format="BFYX"/>
<Tensor arg-name="mean" type="output_buffer" port-index="0" dim="output,0" size="Y*F*4"/>
<Tensor arg-name="variance" type="output_buffer" port-index="1" dim="output,0" size="Y*F*4"/>
<!--other parameters -->
</Parameters>
<WorkSizes dim="output,0" global="((Y+7)/8)*8,F,1" local="8,1,1"/>
</CustomLayer>
<CustomLayer name="MVN" stage="1" type="MVCL" version="1">
<Kernel entry="mvn_scale">
<Source filename="mvn_scale_changed_orded.bin"/>
</Kernel>
<Parameters>
<Tensor arg-name="src_data" type="input" port-index="0" format="BFYX"/>
<Tensor arg-name="dst_data" type="output" port-index="0" format="BFYX"/>
<Tensor arg-name="mean_part" type="input_buffer" port-index="0" dim="output,0" size="Y*F*4"/>
<Tensor arg-name="power_mean" type="input_buffer" port-index="1" dim="output,0" size="Y*F*4"/>
<!--other parameters -->
</Parameters>
<WorkSizes dim="output,0" global="((Y+7)/8)*8,F,1" local="8,1,1"/>
</CustomLayer>
- Each `Tensor` node of the `data` type must contain the following attributes:
  - `source` – A name of the blob as it is in the IR. A typical example is `weights` for convolution.
  - `format` – Specifies the channel order in the tensor. Optional conversion layers are generated if the custom layer format is not compatible with the formats of neighboring layers.
<CustomLayer name="BinaryConvolution" type="MVCL" version="1">
<Kernel entry="binary_convolution">
<Source filename="binary_layers.bin"/>
</Kernel>
<Parameters>
<Tensor arg-name="src_data" type="input" port-index="0" format="BFYX"/>
<Data arg-name="weights_data" type="data" source="weights" format="ANY"/>
<Tensor arg-name="dst_data" type="output" port-index="0" format="BFYX"/>
<!--other parameters -->
</Parameters>
<WorkSizes dim="output,0" global="X,Y,F" local="1,1,1"/>
</CustomLayer>
- Each `Scalar` node must contain the following attributes:
  - `arg-name` – The name of a kernel parameter in the kernel signature.
  - `type` – `int` or `float` value. It is used for correct argument extraction from IR parameters.
  - `source` – Contains the name of the parameter in the IR file or input/output (`I`/`O`, `In`/`On`, where `n` is a port number) followed by dimension `B` (batch), `Y` (height), `X` (width), or `F` (channels).
- Each `Data` node must contain the following attributes:
  - `arg-name` – The name of a kernel parameter in the kernel signature.
  - `type` – Node type. Currently, `local_data` is the only supported value, which defines a buffer allocated in fast local on-chip memory. It is limited to 100KB for all `__local` and `__private` arrays defined inside the kernel, as well as all `__local` parameters passed to the kernel. Note that a manual-DMA extension requires double buffering. If the custom layer is detected to run out of local memory, the inference fails.
  - `dim` – The dim source, in the same `direction,port` format used for `WorkSizes` bindings.
  - `size` – The amount of bytes needed. The current expression syntax supports only expressions over the dimensions of the selected input/output tensor or constants, and may be extended in the future.
The example binding below illustrates a kernel with two local buffers passed to it.
<CustomLayer name="GRN" type="MVCL" version="1">
<Kernel entry="grn_NCHW">
<Source filename="grn.bin"/>
</Kernel>
<Parameters>
<Tensor arg-name="src_data" type="input" port-index="0" format="BFYX"/>
<Tensor arg-name="dst_data" type="output" port-index="0" format="BFYX"/>
<Data arg-name="src" type="local_data" dim="input,0" size="X*F*2" />
<Data arg-name="dst" type="local_data" dim="input,0" size="X*F*2" />
<Scalar arg-name="C" type="int" port-index="0" source="I.F" />
<Scalar arg-name="bias" type="float" source="bias" />
</Parameters>
<WorkSizes dim="input,0" global="X,Y,1" local="X,1,1"/>
</CustomLayer>
Pass Configuration File to OpenVINO™ Runtime
Note: If both native and custom layer implementations are present, the custom kernel has priority over the native one.
Before loading the network that uses the custom operations to the plugin, provide a separate configuration file and load it using the ov::Core::set_property() method with the "CONFIG_KEY" key and the configuration file name as the value:
@snippet docs/snippets/vpu/custom_op.cpp part0
Optimizing Kernels with OpenCL for VPU (Intel® Neural Compute Stick 2)
This section provides optimization guidelines for writing custom layers with OpenCL for VPU devices. Familiarity with the general OpenCL programming model and the OpenCL kernel language is assumed; they are not a subject of this section. The mapping of the OpenCL model to the VPU is described in the table below.
| OpenCL Model | VPU Mapping |
|---|---|
| Device code | Executed on SHAVE cores |
| Private memory | Mapped to CMX internal memory, limited to 100KB per work group, valid only while the work group is executed |
| Local memory | Mapped to CMX internal memory, limited to 100KB per work group, valid only while the work group is executed |
| Global memory | Mapped to DDR, used to pass execution preserved parameters for inputs, outputs, and blobs |
| Work group | Executed on a single SHAVE core iterating over multiple work items |
Note that by the OpenCL specification, the work group execution order is not specified. This means that it is your responsibility to ensure that race conditions among work groups are not introduced. The custom layer runtime splits the work grid evenly among the available compute resources and executes the work groups in an arbitrary order. This static scheduling approach works best if the load is spread evenly across work groups, which is a typical case for deep learning kernels. The following guidelines are recommended for work group partitioning:
- Split work evenly across work groups.
- Adjust work group granularity to maintain an equal workload for all compute cores.
- Set the maximum number of cores using the `max-shaves` attribute for the `CustomLayer` node. This keeps more resources for the rest of the topology. It is also useful if the kernel scalability has reached its limits, which may happen while optimizing memory-bound kernels or kernels with poor parallelization.
- Try an alternative data layout (`BFXY`/`BYXF`) for the kernel if it improves work group partitioning or data access patterns. Consider not just the boost for the specific layer but full topology performance, because data conversion layers would be automatically inserted as appropriate.
The offline OpenCL compiler (`clc`) features automatic vectorization over `get_global_id(0)` usage if uniform access is detected.
For example, the kernel below could be automatically vectorized:
__kernel void cvtf32f16(__global float* restrict inImage, __global half* restrict outImage,
                        float scale, float bias)
{
    int idx = get_global_id(0) + get_global_id(1) * get_global_size(0) + get_global_id(2) * get_global_size(0) * get_global_size(1);
    outImage[idx] = convert_half(inImage[idx]*scale+bias);
}
However, this work-group based vectorizer (WGV) conflicts with the default LLVM vectorizer based on superword level parallelism (SLP) for the current compiler version. Manual vectorization is recommended to provide the best performance for non-uniform code patterns. WGV works if and only if vector types are not used in the code.
Here is a short list of optimization tips:
- Help the auto-vectorizer: ensure non-aliasing pointers for kernel parameters by putting `restrict` where possible.
  - This can give a performance boost, especially for kernels with unrolling, like `ocl_grn` from the example below.
  - Place `restrict` markers for kernels with manually vectorized code as well. In the `ocl_grn` kernel below, the unrolled version without `restrict` is up to 20% slower than the most optimal one, which combines unrolling and `restrict`.
- Put `#pragma unroll N` in your loop header. The compiler does not trigger unrolling by default, so it is your responsibility to annotate the code with pragmas as appropriate. The `ocl_grn` version with `#pragma unroll 4` is up to 50% faster, most of which comes from unrolling the first loop, because LLVM, in general, is better at scheduling 3-stage loops (load-compute-store), while the first loop `variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]);` is only 2-stage (load-compute). Pay attention to unrolling such cases first. The unrolling factor is loop-dependent: choose the smallest number that still improves performance, as an optimum between kernel size and execution speed. For this specific kernel, changing the unroll factor from `4` to `6` results in the same performance, so an unrolling factor of 4 is the optimum. For Intel® Neural Compute Stick 2, unrolling is conjugated with automatic software pipelining for the load, store, and compute stages:
__kernel void ocl_grn(__global const half* restrict src_data, __global half* restrict dst_data, int C, float bias)
{
int x = get_global_id(0);
int W = get_global_size(0);
int y = get_global_id(1);
int H = get_global_size(1);
float variance = bias + 1e-9f;
#pragma unroll 4
for (int c = 0; c < C; c++)
variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]);
variance = 1.f / native_sqrt(variance);
#pragma unroll 4
for (int c = 0; c < C; c++)
dst_data[c*H*W + y*W + x] = (half)((float)src_data[c*H*W + y*W + x] * variance);
}
To check the efficiency of WGV, you can compare performance of the kernel above with the kernel below, which is manually vectorized over width:
__kernel void ocl_grn_line(__global const half* restrict src_data, __global half* restrict dst_data, int C, int W, float bias)
{
int y = get_global_id(1);
int H = get_global_size(1);
for (int x = 0; x < W/8; x++)
{
float8 variance = (float8)(bias+1e-9f);
#pragma unroll 4
for (int c = 0; c < C; c++)
{
__global const half8* restrict src_line = ((__global const half8 * restrict)(src_data + c*H*W + y*W));
half8 sh = src_line[x];
variance += convert_float8(sh*sh);
}
variance = 1.f/native_sqrt(variance);
#pragma unroll 4
for (int c = 0; c < C; c++)
{
__global const half8* restrict src_line = ((__global const half8 * restrict)(src_data + c*H*W + y*W));
__global half8* restrict dst_line = ((__global half8 * restrict)(dst_data + c*H*W + y*W));
dst_line[x] = convert_half8(convert_float8(src_line[x])*variance);
}
}
for (int x = W/8*8; x < W; x++)
{
float variance = bias+1e-9f;
#pragma unroll 4
for (int c = 0; c < C; c++)
variance += (float)(src_data[c*H*W + y*W + x]*src_data[c*H*W + y*W + x]);
variance = 1.f/native_sqrt(variance);
#pragma unroll 4
for (int c = 0; c < C; c++)
dst_data[c*H*W + y*W + x] = (float)src_data[c*H*W + y*W + x]*variance;
}
}
Both versions perform the same, but the second one has more complex code.
- If it is easy to predict the work group size, you can also use the `reqd_work_group_size` kernel attribute to ask the compiler to unroll the code up to the local size of the work group. Note that if the kernel is actually executed with a different work group configuration, the result is undefined.
- Prefer the `half` compute if it keeps reasonable accuracy. 16-bit float is a native type for Intel® Neural Compute Stick 2, and most of the `half_*` functions are mapped to a single hardware instruction. Use the standard `native_*` functions for the rest of the types.
- Prefer the `convert_half` function over `vstore_half` if conversion to 32-bit float is required. `convert_half` is mapped to a single hardware instruction. For the `cvtf32f16` kernel above, replacing `convert_half` with `vstore_half` in the line `outImage[idx] = convert_half(inImage[idx]*scale+bias);` makes it eight times slower.
- Mind early exits. An early exit can be extremely costly for the current version of the `clc` compiler due to conflicts with the auto-vectorizer. The generic advice is to set the local size by the `x` dimension equal to the input and/or output width. If it is impossible to define a work grid that exactly matches the inputs and/or outputs, so that checks such as `if (get_global_id(0) >= width) return` cannot be eliminated, use the line-wise kernel variant with manual vectorization. The kernel example below demonstrates the impact of early exits on kernel performance.
// Initial version
__kernel void reorg(const __global half* restrict src, __global half* restrict out, int stride)
{
    int w = get_global_id(0);
    int W = get_global_size(0);
    int h = get_global_id(1);
    int H = get_global_size(1);
    int c = get_global_id(2);
    int C = get_global_size(2);
    int C2 = C/(stride*stride);
    int offset = c / C2;
    int c2 = c - C2 * offset;
    int H2 = H*stride;
    int W2 = W*stride;
    int h2 = h*stride + offset / stride;
    int w2 = w*stride + offset - stride * (offset / stride);
    out[W*H*c + W*h + w] = src[W2*H2*c2 + W2*h2 + w2];
}
This reorg kernel is auto-vectorizable, but the input for the YOLO v2 topology is NCHW=<1,64,26,26>, and 26 is not a multiple of the vector width, which is 8 for the half data type. As a result, the Inference Engine does not select the auto-vectorized kernel.
To compare the performance of the auto-vectorized and scalar versions of the kernel, change the input size to NCHW=<1,64,26,32>. This enables the auto-vectorized version to be selected by the Inference Engine and can give you about a 30% uplift.
Since the auto-vectorized version is faster, it makes sense to enable it for the YOLO v2 topology input size by setting the local size to a multiple of the vector width, for example 32, and adjusting the global sizes accordingly. As a result, the execution work grid exceeds the actual input dimensions, so out-of-bound checks must be inserted. See the updated kernel version below:
// Version with out-of-bound checks added
__kernel void reorg(const __global half* restrict src, __global half* restrict out, int W, int stride)
{
int w = get_global_id(0);
w = min(w, W-1);
int h = get_global_id(1);
int H = get_global_size(1);
int c = get_global_id(2);
int C = get_global_size(2);
int C2 = C/(stride*stride);
int offset = c / C2;
int c2 = c - C2 * offset;
int H2 = H*stride;
int W2 = W*stride;
int h2 = h*stride + offset / stride;
int w2 = w*stride + offset - stride * (offset / stride);
out[W*H*c + W*h + w] = src[W2*H2*c2 + W2*h2 + w2];
}
This code performs the same as the initial (scalar) kernel above due to branching overhead. If you replace the min/max expression `w = min(w, W-1);` with `if (w >= W) return;`, runtime increases up to 2x against the code without branching (the initial version).
If branching is inevitable for your element-based kernel, it is recommended to change the scheme to line-based. See the kernel variant below:
// Line-wise version
__kernel void reorg(const __global half* restrict src, __global half* restrict out, int H, int W, int stride)
{
int h = min((int)get_global_id(0), H-1);
int c = get_global_id(1);
int C = get_global_size(1);
int C2 = C/(stride*stride);
int offset = c / C2;
int c2 = c - C2 * offset;
int H2 = H*stride;
int W2 = W*stride;
for (int w = 0; w < W; ++w)
{
int h2 = h*stride + offset / stride;
int w2 = w*stride + offset - stride * (offset / stride);
out[W*H*c + W*h + w] = src[W2*H2*c2 + W2*h2 + w2];
}
}
This decreases the execution time by up to 40% against the best-performing vectorized kernel without early exits (the initial version).
7. Reuse computations among work items by using line-based kernels or by sharing values through `__local` memory.
8. Improve data access locality. Most custom kernels are memory-bound, while convolution and fully connected layers are hardware-implemented. The code below demonstrates a further optimized version of the reorg kernel, unrolled by stride:
// Unrolled line-wise version
__kernel void reorg_unrolled_by_stride(const __global half* restrict src, __global half* restrict dst,
int H, int W, int stride)
{
int h = min((int)get_global_id(0), H-1);
int c2 = get_global_id(1);
int C2 = get_global_size(1);
int C = C2*stride*stride;
int H2 = H*stride;
int W2 = W*stride;
for (int stride_y = 0; stride_y < stride; stride_y++)
for (int stride_x = 0; stride_x < stride; stride_x++)
for (int w2 = 0, w = 0; w < W; w2 += stride, w++)
dst[W*H*C2*(stride_y*stride+stride_x) + W*H*c2 + W*h + w] = src[W2*H2*c2 + W2*h*stride + W2*stride_y + w2 + stride_x];
}
The src data in this case is loaded only once. As a result, the cycle count drops by up to 45% against the line-wise version.
- Copy data from `__global` to `__local` or `__private` memory if the data is accessed more than once. Access to `__global` memory is orders of magnitude slower than access to `__local`/`__private` memory due to the statically scheduled pipeline, which stalls completely on memory access without any prefetch. The same recommendation applies to scalar load/store from/to a `__global` pointer, since work-group copying can be done in a vector fashion.
- Use a manual DMA extension. Local (on-chip) memory throughput is up to 24x higher than DDR throughput. Starting from OpenVINO™ 2020.1, VPU OpenCL features a manual-DMA kernel extension to copy the sub-tensor used by a work group into local memory and perform computations without DDR involved. Here is a simple GRN kernel implementation that runs over DDR. The local size is (width of the input tensor, 1, 1), to define a large enough work group to get the code automatically vectorized and unrolled, while the global size is (width of the input tensor, height of the input tensor, 1):
__kernel void grn_NCHW(
__global const half* restrict src_data,
__global half* restrict dst_data,
int C,
float bias)
{
float variance = bias + 1e-9f;
#pragma unroll 4
for (int c = 0; c < C; c++)
{
float val = (float) src_data[c*get_global_size(1)*get_global_size(0) + get_global_id(1)*get_global_size(0) + get_global_id(0)];
variance += val*val;
}
half hvariance = (half)(native_rsqrt((half)(variance/16.f))*0.25f);
#pragma unroll 4
for (int c = 0; c < C; c++)
{
dst_data[c*get_global_size(1)*get_global_size(0) + get_global_id(1)*get_global_size(0) + get_global_id(0)]
= src_data[c*get_global_size(1)*get_global_size(0) + get_global_id(1)*get_global_size(0) + get_global_id(0)] * hvariance;
}
}
This kernel can be rewritten to introduce the special data binding intrinsics __dma_preload and __dma_postwrite. This means that instead of one kernel, a group of three kernels should be implemented: kernelName, __dma_preload_kernelName, and __dma_postwrite_kernelName. __dma_preload_kernelName for a particular work group n is guaranteed to be executed before the n-th work group itself, while __dma_postwrite_kernelName is guaranteed to be executed after the corresponding work group. Define these functions to copy data between __global and __local memory. The syntax requires an exact function signature match. The example below illustrates how to prepare your kernel for manual DMA.
__kernel void __dma_preload_grn_NCHW(
__global const half* restrict src,
__global half* restrict dst,
__local half* restrict local_src,
__local half* restrict local_dst,
int C,
float bias)
{
// TODO: copy the required piece of the src tensor into local_src
}
__kernel void __dma_postwrite_grn_NCHW(
__global const half* restrict src,
__global half* restrict dst,
__local const half* restrict local_src,
__local half* restrict local_dst,
int C,
float bias)
{
// TODO: copy back the computed piece of local_dst into dst
}
__kernel void grn_NCHW(
__global const half* restrict src_data,
__global half* restrict dst_data,
__local half* restrict src,
__local half* restrict dst,
int C,
float bias)
{
// same as the example above
}
The GRN kernel operates on channel-major tensors: it computes the average over the full channel range and then normalizes the input elements to produce the output.
As part of the manual DMA extension, a group of work group copy functions is introduced in addition to async_work_group_copy, which is also mapped to a DMA call.
Here is the list of supported functions:
// 2D sub-tensor copy
event_t WorkGroupDmaCreateStrideTransaction(
const local T *src,
global T *dst,
size_t src_width, // width of the line of source in bytes
size_t dst_width, // width of the line of destination in bytes
size_t src_stride, // stride between corresponding 2 consecutive lines of source in bytes
size_t dst_stride, // stride between corresponding 2 consecutive lines of destination in bytes
size_t size, // total number of bytes loaded for all lines from source to destination
event_t event) __OVERLOAD;
event_t WorkGroupDmaCreateStrideTransaction(
const global T *src,
local T *dst,
size_t src_width, // width of the line of source in bytes
size_t dst_width, // width of the line of destination in bytes
size_t src_stride, // stride between corresponding 2 consecutive lines of source in bytes
size_t dst_stride, // stride between corresponding 2 consecutive lines of destination in bytes
size_t size, // total number of bytes loaded for all lines from source to destination
event_t event) __OVERLOAD;
// 3D sub-tensor copy
event_t WorkGroupDmaCreate3DTransaction(
const local T *src,
global T *dst,
size_t src_width, // width of the line of source in bytes
size_t dst_width, // width of the line of destination in bytes
size_t src_stride, // stride between corresponding 2 consecutive lines of source in bytes
size_t dst_stride, // stride between corresponding 2 consecutive lines of destination in bytes
size_t num_planes, // number of planes to be copied
size_t src_plane_stride, // stride between corresponding 2 consecutive planes of source in bytes
size_t dst_plane_stride, // stride between corresponding 2 consecutive planes of destination in bytes
size_t size, // size of the loaded plane in bytes, analogous to the size in the 2D case
event_t event) __OVERLOAD;
event_t WorkGroupDmaCreate3DTransaction(
const global T *src,
local T *dst,
size_t src_width, // width of the line of source in bytes
size_t dst_width, // width of the line of destination in bytes
size_t src_stride, // stride between corresponding 2 consecutive lines of source in bytes
size_t dst_stride, // stride between corresponding 2 consecutive lines of destination in bytes
size_t num_planes, // number of planes to be copied
size_t src_plane_stride, // stride between corresponding 2 consecutive planes of source in bytes
size_t dst_plane_stride, // stride between corresponding 2 consecutive planes of destination in bytes
size_t size, // size of the loaded plane in bytes, analogous to the size in the 2D case
event_t event) __OVERLOAD;
where T can be uchar, char, short, ushort, int, uint, long, ulong, half or float.
A modified version of the GRN kernel could be the following:
__kernel void __dma_preload_grn_NCHW(
__global const half* restrict src,
__global half* restrict dst,
__local half* restrict local_src,
__local half* restrict local_dst,
int C,
float bias)
{
WorkGroupDmaCreate3DTransaction(
src + get_group_id(0)*get_local_size(0)
+ get_group_id(1)*get_local_size(1)*get_global_size(0), // src
local_src, // dst
get_local_size(0) * sizeof(half), // src width
get_local_size(0) * sizeof(half), // dst width
get_global_size(0) * sizeof(half), // src stride
get_local_size(0) * sizeof(half), // dst stride
C, // num planes
get_global_size(0) * get_global_size(1) * sizeof(half), // src plane stride
get_local_size(0) * get_local_size(1) * sizeof(half), // dst plane stride
get_local_size(0) * get_local_size(1) * sizeof(half), // plane size
0);
}
__kernel void __dma_postwrite_grn_NCHW(
__global const half* restrict src,
__global half* restrict dst,
__local const half* restrict local_src,
__local half* restrict local_dst,
int C,
float bias)
{
WorkGroupDmaCreate3DTransaction(
local_dst, // src
dst + get_group_id(0)*get_local_size(0)
+ get_group_id(1)*get_local_size(1)*get_global_size(0), // dst
get_local_size(0) * sizeof(half), // src width
get_local_size(0) * sizeof(half), // dst width
get_local_size(0) * sizeof(half), // src stride
get_global_size(0) * sizeof(half), // dst stride
C, // num planes
get_local_size(0) * get_local_size(1) * sizeof(half), // src plane stride
get_global_size(0) * get_global_size(1) * sizeof(half), // dst plane stride
get_local_size(0) * get_local_size(1) * sizeof(half), // plane size
0);
}
__kernel void grn_NCHW(
__global const half* restrict src_data,
__global half* restrict dst_data,
__local half* restrict src,
__local half* restrict dst,
int C,
float bias)
{
float variance = bias + 1e-9f;
#pragma unroll 8
for (int c = 0; c < C; c++)
{
float val = (float) src[c*get_local_size(1)*get_local_size(0) + get_local_id(1)*get_local_size(0) + get_local_id(0)];
variance += val*val;
}
half hvariance = (half)(native_rsqrt((half)(variance/16.f))*0.25f);
#pragma unroll 8
for (int c = 0; c < C; c++)
{
dst[c*get_local_size(1)*get_local_size(0) + get_local_id(1)*get_local_size(0) + get_local_id(0)]
= src[c*get_local_size(1)*get_local_size(0) + get_local_id(1)*get_local_size(0) + get_local_id(0)] * hvariance;
}
}
Note the get_local_size and get_local_id usage inside the kernel. A 21x speedup is expected for this kernel on the enet-curbs setup, because the kernel was completely memory-bound.
An alternative to the DMA approach is the work item copy extension. These functions are executed inside a kernel and require work groups equal to a single work item.
Here is the list of supported work item functions:
item_dma_event_t WorkItemDmaCreateTransaction(
const global T *src,
private T *dst,
size_t size,
item_dma_event_t event) __OVERLOAD;
item_dma_event_t WorkItemDmaCreateTransaction(
const private T *src,
global T *dst,
size_t size,
item_dma_event_t event) __OVERLOAD;
item_dma_event_t WorkItemDmaCreateStrideTransaction(
const global T *src,
private T *dst,
size_t src_width,
size_t dst_width,
size_t src_stride,
size_t dst_stride,
size_t size,
item_dma_event_t event) __OVERLOAD;
item_dma_event_t WorkItemDmaCreateStrideTransaction(
const private T *src,
global T *dst,
size_t src_width,
size_t dst_width,
size_t src_stride,
size_t dst_stride,
size_t size,
item_dma_event_t event) __OVERLOAD;
item_dma_event_t WorkItemDmaCreate3DTransaction(
const global T *src,
private T *dst,
size_t src_width,
size_t dst_width,
size_t src_stride,
size_t dst_stride,
size_t num_planes,
size_t src_plane_stride,
size_t dst_plane_stride,
size_t size,
item_dma_event_t event) __OVERLOAD;
item_dma_event_t WorkItemDmaCreate3DTransaction(
const private T *src,
global T *dst,
size_t src_width,
size_t dst_width,
size_t src_stride,
size_t dst_stride,
size_t num_planes,
size_t src_plane_stride,
size_t dst_plane_stride,
size_t size,
item_dma_event_t event) __OVERLOAD;
where T can be uchar, char, short, ushort, int, uint, long, ulong, half or float.