* Changes according to feedback comments
* Replaced @ref's with html links
* Fixed links, added a title page for installing from repos and images, fixed formatting issues
* Added links
* minor fix
* Added DL Streamer to the list of components installed by default
* Link fixes
* ovms doc fix (#2988)
* added OpenVINO Model Server
* ovms doc fixes
Co-authored-by: Trawinski, Dariusz <dariusz.trawinski@intel.com>
* added OpenVINO Model Server to docs (#2541)
* added OpenVINO Model Server
* updated documentation to include valid links
* minor fixes
* Fixed links and style
* Update README.md
fixed links to model_server
* more corrections
* dropped reference in ie_docs and minor fixes
* Update README.md
Fixed links to Inference Engine pages
Co-authored-by: Alina Alborova <alina.alborova@intel.com>
Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com>
* Added Model Server docs to 2021/1
Co-authored-by: Trawinski, Dariusz <dariusz.trawinski@intel.com>
Co-authored-by: Alina Alborova <alina.alborova@intel.com>
* convert to doxygen comments
* layouts and code comments
* separate layout
* Changed layouts
* Removed FPGA from the documentation
* Updated according to CVS-38225
* some changes
* Made changes to benchmarks according to review comments
* Added logo info to the Legal_Information, updated Ubuntu, CentOS supported versions
* Updated supported Intel® Core™ processors list
* Fixed table formatting
* update api layouts
* Added new index page with overview
* Changed CMake and Python versions
* Fixed links
* some layout changes
* Converted svg images to png
* layouts
* update layout
* Added a label for nGraph_Python_API.md
* fixed links
* Fixed image
* removed links to ../IE_DG/Introduction.md
* Removed links to tools overview page as removed
* some changes
* Remove link to Integrate_your_kernels_into_IE.md
* remove openvino_docs_IE_DG_Graph_debug_capabilities from layout as it was removed
* update layouts
* Post-release fixes and installation path changes
* Added PIP installation and Build from Source to the layout
* Fixed formatting issue, removed broken link
* Renamed section EXAMPLES to RESOURCES according to review comments
* add mo faq navigation by url param
* Removed DLDT description
* Replaced wrong links
* Minor fix for path to the cpp samples
* fixes
* Update ops.py
* Fix style
Co-authored-by: Nikolay Tyukaev <ntyukaev_lo@jenkins.inn.intel.com>
Co-authored-by: Tyukaev <nikolay.tyukaev@intel.com>
Co-authored-by: aalborov <alina.alborova@intel.com>
Co-authored-by: Rafal Blaczkowski <rafal.blaczkowski@intel.com>
Co-authored-by: Alexander Zhogov <alexander.zhogov@intel.com>
* Update OpenVino ONNX CI
* Change parallel execution to single
* Enlarge timeout
* Remove timeout
* Add timeout to test execution
* Skip hanging test
* Add description to skip issue
* First draft of nGraph documentation
* updated according to review comments
* Updated
* Reviewed the nGraph Transformation section, added missing images
* Update nGraph_dg.md
* Delete python_api.md
Removed since there is already the nGraph_Python_API.md document with a comprehensive overview.
Co-authored-by: Andrey Zaytsev <andrey.zaytsev@intel.com>
Co-authored-by: CCR\avladimi <anastasiya.ageeva@intel.com>
* Initial changes for 2021.1
* Inserted Graphtool scripts, updated configurations info
* Updated FAQ and minor changes to performance_benchmarks.md
* Updated for 2021.1
* Updated
* incorporated review comments
* incorporated review comments for FAQ
* fixed link
* Update the menu to align with POT doc headers
It changes the menu to align with Post-training Optimization Toolkit documentation titles.
* Corrected one title
Run Examples => How to Run Examples
* Replaced direct links to docs.openvinotoolkit.org with relative links
* Replaced direct links to docs.openvinotoolkit.org with relative links. Added GSGs for Win and macOS
* Minor fixes in GSGs
* Removed links to OpenVINO markdown files that contain anchor - they don't work in the current implementation of the doc process
* Fixed Notes
* fixed link to installing-openvino-linux.md
* Added new GSG for macOS, made minor changes in Windows GSG
* Update get_started_macos.md
Co-authored-by: Anastasiya Ageeva <anastasiya.ageeva@intel.com>
* Downgrade cmake for samples
Downgraded cmake version to default version for Ubuntu 18.04
* Updated supported python version
The minimal python version in 2021.1 is 3.5
* Added notes about cmake requirements for samples and demo
* setupvars.sh: Added logic for exporting the path env variable if it is not defined
* setupvars: Removed duplicated colon
* install_openvino_dependencies: Updated copyrights
setupvars.bat: Updated the notification about an incorrect Python version. Removed the ICC2019 check
setupvars.sh: Removed the logic that chose the highest installed Python version. Added dynamic detection of the python3 major and minor version for setting the path. Added a check for the minimum required Python version (now 3.6)
* FQ+Mul fusion transform skeleton
* FQ+Mul fusion transform tests prep
* Basic UT for the transform
* Basic implementation of the transform
* Parametrized UTs for FQMul transform
* Parametrization of FQ+Mul UTs
* Make sure that the shapes of constants match
* Check if the mul constant matches FQ data
* CentOS compilation error fix
* PR feedback and adjusted tests
* NHWC layout of the mul constant
* UT: FQ output limits 4D
* Redundant CF pass removed
* Rewrite the graph in a different way
* Shape checking infrastructure skeleton
* Handle some negative cases
* Check the rt info in the fusion test
* Fuse all Mul nodes detected after FQ node
* Don't cast the original FQ node
* Don't throw if CF fails in new output range calculation
* More UTs
* Accept any type of input to FQ in the transformation
* Test the fusion when all FQ inputs are non-const
* Fusion test when only one output limit is const
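The FQ+Mul fusion entries above can be illustrated with a small scalar sketch. It assumes the usual FakeQuantize definition (clamp to the input range, snap to `levels` steps, map to the output range) and assumes the fusion folds the Mul constant into the FQ output limits, which is how the "new output range calculation" entry reads. The function names here are illustrative and not taken from the transformation code.

```python
def fake_quantize(x, in_low, in_high, out_low, out_high, levels):
    """Scalar reference of FakeQuantize (assumed definition, for illustration)."""
    if x <= min(in_low, in_high):
        return out_low
    if x > max(in_low, in_high):
        return out_high
    # Snap x to one of `levels` evenly spaced steps of the input range,
    # then map the step index linearly onto the output range.
    q = round((x - in_low) / (in_high - in_low) * (levels - 1))
    return q / (levels - 1) * (out_high - out_low) + out_low


def fq_then_mul(x, in_low, in_high, out_low, out_high, levels, c):
    """Original sub-graph: FakeQuantize followed by Multiply with a constant."""
    return fake_quantize(x, in_low, in_high, out_low, out_high, levels) * c


def fused_fq(x, in_low, in_high, out_low, out_high, levels, c):
    """Fused form: the constant is folded into the output limits."""
    return fake_quantize(x, in_low, in_high, out_low * c, out_high * c, levels)
```

Because the FakeQuantize result is linear in the output limits, scaling both limits by the constant gives the same value as multiplying the output, up to floating-point rounding.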
* setupvars.sh: Added logic for exporting the path env variable if it is not defined
* setupvars: Removed duplicated colon
* Kept quotes where they were
* setupvars: updated copyrights
* [GNA] fix scale factor calculation for unfused bias after fc
* change check
* add test
* apply requested changes
* cpplint fix
* apply test changes
* modify model for test to match ::op::
* Remove dead code.
* Protect device specific config options with device checks.
* Add missing space to precision parsing error message.
* Allow to switch FP32 input precision to U8.
* Commit.
* Added opset4 version in the class Interpolate.
* Added class ONNXResize11Op to read ONNX Resize with opset version >= 11.
* Added support for Interpolate-4 into transformations TestInterpolateReshapeWA and InterpolateConcat.
* Added support for Interpolate-4 into transformation InterpolateWithConcat.
* Deleted redundant checks from the transformation UpsampleToResample.
* Reverted last changes.
* Changed ONNX Resize extractor to support for Interpolate-4.
* Added conversion of ONNXResize11Op into Interpolate-4.
* Added support for Interpolate-4 into the transformation InterpolateSequenceToInterpolate.
* Small fix for formatting.
* Written tests for MO version of Interpolate-4 with shape_calculation_mode = sizes.
* Written tests for infer function of Interpolate-4.
* Now transformations InterpolateWithConcat, InterpolateConcat, InterpolateReshapeWA skip Interpolate-4.
* Used create_op_with_const_inputs in the transformation InterpolateSequenceToInterpolate.
* The transformation ONNXResize11ToInterpolate4 was rewritten using find_and_replace_pattern.
* Now the dictionary infers (dictionary of infer functions of Interpolate) is a class static attribute.
* Deleted unused variable.
* Restored original logic of find_and_replace_pattern method of the class InterpolateReshapeWA.
* Used create_op_with_const_inputs() in the transformation InterpolateSequenceToInterpolate for opset1 case.
* Replaced resize_name by resize.soft_get('name', resize.id).
* Small fixes.
* Added two tests for Interpolate-4 infer function.
* Fixed the transformation ONNXResize11ToInterpolateV4 for the case when ONNXResize11 operation has 3 inputs.
* Added conversion of ONNXResize11 with tf_crop_and_resize_mode to ROIPooling + ONNXResize11.
* Fixed bugs in the transformation ONNXResize11ToInterpolateV4 and in the infer function of the operation ONNXResize11.
* Small changes.
* Renamed transformation that converts ONNXResize11 into ROIPooling + ONNXResize11 and fixed BOM-file.
* Fixed tests for the transformation InterpolateSequenceToInterpolate.
* Small change.
* Now the transformation InterpolateSequenceToInterpolate preserves output layer name.
* Deleted the transformation ONNXResize11ToTFCropAndResize.
* Fix for concat layer with more than 2 inputs
Signed-off-by: Bartosz Sochacki <bartosz.sochacki@intel.com>
* Fixed check if affine is used for crop layer
Signed-off-by: Bartosz Sochacki <bartosz.sochacki@intel.com>
* code cleanup for fix affine layer check
Signed-off-by: Bartosz Sochacki <bartosz.sochacki@intel.com>
* added test for concat layer with multiple inputs
* simplified test to use less number of layers
* fixed code style
* fixed coding style
* addressed review comments and one more issue that appeared during testing
* fixed code style errors
* scale factor propagation for concat layer with multiple inputs
* fix for a case when all inputs to concat are activation layers
* fix for linux compilation - C++14 is not enabled and fails on lambda with auto parameters
* corrected current year in headers in concat multi input tests
* fixes for code review issues raised by Denis Orlov
* enabled integer mode computation in GNA concat multi input test
* removed 1 space per review comment
* a fix to fail when not all scale factors are equal
* added GNA_DEVICE_MODE config to concat multi input test
* corrected searching for a next input to concat layer
* changed selection of 2nd candidate for source quant value
* code style fix - else and brackets should be in the same line
* small code improvement
* fix for mixing line endings
* addressed the endless requantization loop and fixed failing tests
Special test case with input values which cannot be correctly processed via
decomposition with int AVG pool layer.
Signed-off-by: Alexander Peskov <alexander.peskov@intel.com>
* [IE TESTS] GatherTree op ref function has been created.
* [IE TESTS] Added GatherTree single layer test
* [IE TESTS] Fixed code styles.
* [IE TESTS] GatherTree test FP32 precision was enabled.
* [IE TESTS] Refactoring of Builder::makeConstant procedure
The refactoring is aimed at managing the range of random data for the constants initialization procedure.
* [IE TESTS] GatherTree test was extended with constants
* [IE TESTS] GatherTree ref rewritten to non-templated function.
* [IE TESTS] GatherTree test inp shape indx enum removed.
* Revert "[IE TESTS] Refactoring of Builder::makeConstatn procedure"
This reverts commit 2648172e00ccca266d39e8775b890b8a8395f57c.
* [IE TESTS] makeConstant was augmented with random data range parameters.
* [IE TESTS] GatherTree test was rewritten using makeConstant function.
* [IE TESTS] GatherTree test calls templated makeConstant
* [IE TESTS] GatherTree test code style fix
* Fix fusing Multiply node with Convolution in case group != 1
* Add transformation test
* Do not fuse if not possible to reshape const
* Update fuse_linear_ops.py
* Make pybind more verbose in debug on windows
* Remove the NDEBUG flag everywhere
* Code complexity reduction...
* Missing colon
* And now the missing empty line...
* Reusable functions
* Now the mood of the sentence was wrong...
* Free functions instead of methods
* Add host_tensor_2_vector() implementation and unit tests. One reference OP refactored to use it.
* Ngraph assertion message refactored.
Co-authored-by: Tomasz Dołbniak <tomasz.dolbniak@intel.com>
* Fix style.
Co-authored-by: Tomasz Dołbniak <tomasz.dolbniak@intel.com>
- U16toF32 conversion kernel converted to more generic ConvDepth one
- U16 <-> F32 conversion only are supported for now
- kernel is not used in the preprocessing graph yet
- tests are extended
* Moves splitLargeKernelConv tests to unit tests
Originally, file with tests has been placed in a wrong place
so it was not integrated into any testing application.
Now it is a part of unit tests on VPU.
Test itself has been disabled due to issue with NCE unit usage
described in #-33366
* Introduces pass I/O memory types annotation of stages
It is useful to see where inputs and outputs are located in
performance report for analysing possible issues.
* Introduces endsWith and tuple2Vector utilities
endsWith checks whether the source ends with the suffix given as the
second argument. tuple2Vector converts a tuple of arbitrary size whose
elements all have the same type to a vector. It can be useful when
working with gtest parameter generators that have
std::tuple as the return type.
* Introduces unit tests on annotating stages memory types
* Introduces missing format placeholders
* Makes memory types annotation optional
Enables the private option "enableMemoryTypesAnnotation", which is
disabled by default. Disabling annotation by default avoids
issues with tests that rely on stage names.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* Mish activation calculation costs more time than memory copy, so
allocate more shaves to the Mish activation.
Co-authored-by: Jiang, Renzhi <renzhi.jiang@intel.com>
* Fixed visitor for Interpolate-1 and Interpolate-4
* Code style fix
* Remove unnecessary changes
* Fixed compilation on Linux for Attribute visitor of vector<size_t>
* Added unit test for IE IR Reader for Interpolate-4
* Updated unit test for IR Reader for Interpolate-4
* Updated unit test
* [GNA] Fixed the case of an unconnected output of a split layer and added a test
* Updated ConcatOptimization transformation to work when one dimension of input to Concat is 0D
* Fixed ConcatOptimization transformation to reconnect input edges to Concat
* Completely re-written ConcatOptimization
* Updated Concat0D optimization transformation
* Fixed order of traversing Concat input ports
* Refactored ConcatOptimization transformation to use `delete_input_port` function
* Delete trailing unconnected ports in the ConcatOptimization.py
* Cleaner implementation of ConcatOptimization + unit test
* [GNA] check whether permute operation is last one in the model
* add assert for checking
* change casting to static
* check casting to ConvolutionLayer
* Do not remove convert after the topK
* Added debug message
* Removed xFail
* Revert "Added debug message"
This reverts commit a01ace4ade88d73e2797b47c58db33943b0f508d.
* Added test
Fix for issue 36693.
* Problem: One of the concat inputs is a constant. The Adjust_data_layout pass tries to duplicate all inputs that do not meet the strides requirements and then copies from the original input to the duplicate with strides. But duplicateData called with a constant argument also creates a constant, so the subsequent Copy fails because its output is a constant, which is not allowed.
* Solution: In addConvertedData, create intermediate data with the same description as the constant, and then copy the constant data into it with the required strides.
Co-authored-by: DariaMityagina <daria.mityagina@intel.com>
* validate_and_infer_types() implementation
* input parameter validation for LSTM, GRU and RNN
* style-check applied
* Add LSTMSequence dynamic shape validation and test props for RNNCell, GRUCell, LSTMCell and LSTMSequence.
* recurrent_sequence.hpp moved to ngraph/core/include/ngraph/op/util/
* style check applied
* removed unused variable from LSTMSequence::validate_and_infer_types
* Add missing newline mark at the end of file.
* Add suppression macro for FusedOp deprecation.
* Add element type initialization
* Apply PR review remarks
* Rewrite tests to use gtest exception assertions.
* Fixed prototype of evaluate method.
* Rewritten Interpolate-4 ctors (added argument output_shape). Corrected tests.
* Fixed typo.
* Fixed number of args of make_shared in op::v4::Interpolate::clone_with_new_inputs.
* Fixes in Interpolate-4 tests.
* Now ONNX Upsample-1 is read as Interpolate-4 with 4 inputs.
* Code style fixes.
* Some fixes in Interpolate-4 layer test.
* Now ONNX Upsample-9 is read as Interpolate-4 with 4 inputs.
* Small fixes.
* Some changes.
* Fixed processing of 'scales' input in evaluation of Interpolate-4: now 'scales' contains scales only from 'axes'.
* Fixes in documentation.
* Now reference implementation of Interpolate-4 is rewritten for using 3 required inputs.
* Some code style fixes.
* Small fix.
* Started to write tests for method evaluate() of Interpolate-4.
* Continued to write tests for evaluate() of Interpolate-4.
* Some fixes.
* Some additions.
* Written draft of tests for 'cubic' mode with using scales.
* Some changes.
* Started to write tests for 'nearest' mode.
* Started to write tests for 'linear_onnx' mode.
* Some changes.
* Small fixes.
* Added setup of output type.
* Small addition.
* Added debug print into Interpolate-4 evaluate.
* Now in Interpolate-4 evaluation tests 3 inputs of Interpolate are Constants.
* Small changes.
* Added some debug print.
* Added more debug print.
* Some fixes.
* Now 4th argument of runtime::interpolate has type std::vector<int64_t>.
* Added checks for result of calculations.
* Added another expected values vector for the mode 'cubic'.
* Temporarily commented result value checks for the cubic mode.
* Written tests for 'nearest' mode.
* Some reorganization.
* Written tests for 'linear_onnx' mode.
* Fixed padding loop.
* Fixed docs for 'linear_onnx' mode.
* Written tests for 'cubic' mode.
* Deleted debug print.
* Fixed code style.
* Enabled CPU layer tests for Interpolate-4.
* Reverted changes of this file.
* Now ONNX importer reads Resize-1 as Interpolate-4 with 4 inputs.
* Now ONNX importer reads Resize-11 as Interpolate-4 with 4 inputs.
* Small fixes.
* Fixed docs.
* Added small epsilon in the shape calculation in the function op::v4::Interpolate::infer_using_scales.
* Small fix.
* Reduced size of the template function eval().
* Now the 'nearest' mode is rewritten to use CoordinateTransform instead of NDim* classes
* Now the 'cubic' mode is rewritten to use CoordinateTransform instead of NDim classes.
* Started to rewrite 'linear' mode using CoordinateTransform.
* Started to write helper class.
* Small fix.
* Small changes.
* Some fix.
* Fixed typo.
* Now the preamble of 'linear_onnx' mode implementation is a method of helper class.
* Now the function clip_coord is the method of helper class, and the mode 'linear' uses CoordinateTransform instead of NDim classes.
* Deleted NDim classes.
* Some fixes.
* Some refactoring.
* Some refactoring: now inner calculation of 'linear' is in helper class.
* Moved reference implementation of Interpolate-4 into library with reference implementations.
* Small fix.
* Deleted commented tests.
* Code style fixes.
* Deleted redundant type prop tests for Interpolate-4.
* Documentation fixes.
* Disabled IE_CPU tests for ONNX Resize: Interpolate-4 is not implemented in plugins.
* Temporarily disabled some ONNX tests.
* Some refactoring: deleted redundant attributes of InterpolateEval class.
* Small fix.
* Added NGRAPH_RTTI_DECLARATION and NGRAPH_RTTI_DEFINITION.
* Added debug print to 'cubic' mode calculation.
* Some deletions.
* Small fix.
* Fixed typos.
* Added another debug print.
* Now indices_shape is constructed from std::vector<std::size_t>(num_of_axes, 4) again.
* Fixed CMakeLists.txt.
* Small fix.
* Added more debug print.
* Fixed typo.
* Fixed calculation of the first argument of helper.clip_coord in the inner loop of cubic_func.
* Some code style fixes.
* Alphabetically sorted operations of opset4.
* Deleted constant cannot_define_axes.
* Used std::iota instead of loop.
* Renamed structure InfoToCallReference.
* Now void op::v4::Interpolate::validate_and_infer_types() checks an element type of an input tensor.
* Code style fix.
* Changes in reading of ONNX Resize and Upsample: we generate Interpolate-4 without 'axes' input.
* Now bodies of functions evaluate_interpolate_v4() and inline bool evaluate() are moved in the method bool op::v4::Interpolate::evaluate.
* Fixes in example of the documentation of Interpolate-4.
* Fixed typos.
* Small fix.
* Some fixes.
* Deleted some type aliases.
* Uncommented a test for 'cubic' mode.
* Small code style fix in bool op::v4::Interpolate::evaluate.
* Uncommented more test for Interpolate-4 reference implementation.
* Added more debug print.
* Some changes.
* Uncommented all tests for Interpolate-4 evaluate().
* Deleted debug print.
* Deleted 'simple' mode from the map nearest_mode_map.
* Code style fixes.
* Disabled some CPU tests.
* Some fixes.
* Small fixes.
* Some fixes.
* Fixed typo.
* Fixed typo.
* Small change.
* Fixed some typos.
* Fixed some typos.
* Fix in operator() of the class GetOriginalCoordinate.
* Disabled some CPU tests.
* Small changes.
* Deleted template function from resize.cpp.
* Code style fixes.
* Refactored the method op::v4::Interpolate::evaluate.
* Added documentation for infer_using_scales() and infer_using_shapes().
* Added documentation for classes GetNearestPixel and GetOriginalCoordinate.
* Small fixes.
* Some code style fixes.
* Small fixes.
* Some changes.
* Added NGRAPH_SUPPRESS_DEPRECATED_START and NGRAPH_SUPPRESS_DEPRECATED_END for using v0::InterpolateAttrs; and using v0::Interpolate;
* Code style fix.
* Enabled ONNX import only tests for Resize-10, Upsample-8, Upsample-9.
* Fixed element type for scales_const. Fixed targetShapes and pads in single layer tests.
* Small changes.
* Added conversion from NGRAPH to CNNLayer.
* Added documentation for the class InterpolateEval.
* Now 'nearest' mode has special tests in cpu single layer tests.
* Small changes.
* Fixes in cpu single layer tests.
* Temporarily commented Interpolate-4 in ConvertFunctionToCNNNetworkTests.
* Added some docs.
* Enabled some tests for Resize-11.
* Added test.
* Corrected expected values in the resize_downsample_scales_align_corners case.
* Added more test for the 'cubic' mode.
* Added more test for linear_onnx mode.
* Deleted debug print for linear_onnx mode.
* Deleted debug prints. Added yet another test for 'nearest' mode.
* Fixes for import of ONNX Resize-10 and Upsamples.
* Applied Evgeny Lazarev's fix for Interpolate-4 infer function.
* Code style fixes.
* Some tests were deleted from unit_test.manifest file for INTERPRETER.
* Deleted test for downscale Resize-10: results of infer are correct, but comparison is not.
* Enabled test INTERPRETER.onnx_empty_initializers_handling.
* Some fixes.
* Added the method run_with_tolerance_as_fp() to the class TestCase and the method compare_results_with_tolerance_as_fp() to the class TestCaseEngine.
* Small fix.
* Small fix.
* Added yet another type_prop test.
* Disabled CPU test IE_CPU.onnx_empty_initializers_handling.
* Code style fixes.
* Enabled some ONNX tests.
* Some changes.
* Code style fix.
* Enabled test INTERPRETER.onnx_model_round.
* Disabled tests that behave like INTERPRETER.onnx_resize11_scales_down_linear.
* Changed tolerance for test onnx_empty_initializers_handling.
* Changed tolerance in the test onnx_resize11_sizes_linear (otherwise this test fails on macOS). Disabled the test INTERPRETER.onnx_resize11_sizes_nearest_asymmetric_floor, because it fails on macOS only.
* Multiple fixes.
1. Fixes SpaceToBatch transformation to not crash if inputs are not Constant
2. Fixed eliminate_squeeze, eliminate_unsqueeze to not crash when input has dynamic rank
3. Added reference implementation for the FloorMod operation
* Code style fixes
* Update floor_mod.hpp
Removed unnecessary function
Main purpose of this change is to fix weird behaviour of
fully_connected_gpu_fb_io_block_fp16 implementation where it shows
severe performance drop without bias.
Additionally assembly for case with bias is improved.
* initial commit
* first reshape-able variant
* right version for reshape
* comment update
* fixes for failed e2e
* set data type to ngraph TensorIterator
* Fix dynamic shapes for cells ops
* clean up
Co-authored-by: yegor.kruglov <ykruglov@nnlvdp-mkaglins.inn.intel.com>
* Implement reshapeable CTCGreedyDecoderPlusSparseToDense transformation and test
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix consts (after code-review #1)
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Add CTCGreedyDecoderTransformation with more generic pattern
Also it adds new middle-replacer for transforming sequence length to a mask
along with tests.
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Do fixes after review #2
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix after review #3
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix after review #4
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* first version of implementation
* added unit tests
* changed multiply to pow
* doc + unit tests
* more unit tests
* code review remarks
* missing new line
* remarks
* review remarks
* Build fix - update constant check function in HSwishFusionWithClamp
Co-authored-by: mitruska <katarzyna.mitrus@intel.com>
* Added HSwish operation
* Added HSwish fusing transformation
* Fixed BOM
* Added unit test for HSwish fusing transformation
* Fixed unit tests for transformations using 'build_graph_with_edge_attrs' function to build the graph
* Added fusion transformation for Swish operation
* Added fusing transformation for Softplus operation
* Added fusion transformation for Mish operation
* Added check for the node name in the unit tests
* Fixed Mish fusion pattern
* Updated Mish fusion transformation. Added unit test
* Updated HSwish fusing transformation
* Updated Swish fusion transformation and tests
* Fixed unit tests
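For readers unfamiliar with the fused activations listed above, here is a minimal scalar sketch of what such a fusion preserves. It uses the commonly cited definitions of HSwish, Swish, Softplus and Mish. The exact sub-graph patterns matched by the transformations are not stated in this log, so the `hswish_subgraph` pattern below is an assumption chosen for illustration.

```python
import math


def hswish_subgraph(x):
    """Unfused pattern (assumed): x * min(max(x + 3, 0), 6) / 6."""
    shifted = x + 3.0
    clamped = min(max(shifted, 0.0), 6.0)
    return x * clamped / 6.0


def hswish_fused(x):
    """Single-op HSwish written in its piecewise form."""
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return x
    return x * (x + 3.0) / 6.0


def softplus(x):
    """Softplus(x) = ln(1 + e^x)."""
    return math.log1p(math.exp(x))


def swish(x, beta=1.0):
    """Swish(x) = x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))


def mish(x):
    """Mish(x) = x * tanh(Softplus(x))."""
    return x * math.tanh(softplus(x))
```

A fusion of this kind is only valid if the single operation returns the same values as the sub-graph it replaces, which is what the unit tests mentioned above would need to confirm on the graph level.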
* install_NEO_OCL_driver.sh: Added verifying current driver version
* install_NEO_OCL_driver.sh: Updated removal oldest driver on Ubuntu. Updated logic on defining
* install_NEO_OCL_driver.sh: Fixed function name
* Introduce Quantize-Dequantize to FakeQuantize transformation
* Revert changes in DequantizeLinear
* apply code format
* Changes after review:
- description for transformation
- remove NGRAPH_CHECK and move some checks from callback to predicates in pattern
- check if out_low/high are broadcastable for FQ's first input
- fix params to copy_runtime_info
* Add type_matches and type_matches_any predicates
* Use get_single_value
* Changes after review:
- add brief description of transformation
- use get_pattern_value_map instead of get_pattern_map
- change opset1 to opset4
- fix params to copy_runtime_info
* Check result of dynamic_pointer_cast
* Fixed NCE hang due to input dimensions exceed HW limitation
* Added convolution test for big input dimensions
Signed-off-by: wenzengc <wenzeng.chen@intel.com>
* Fixed order of transformation to convert the TF OD API SSD models
* Refactored the sub-graph modification for the TF OD API models related to Squeeze/Reshape after SSD heads
This change fixes concatenation in place optimization where it may
interact with convolution that uses physical padding.
One of such cases is where input to optimized concatenation is also
input to convolution, so it should have padding to enable optimized
implementation.
Previously for all concatenation inputs padding was overridden with only
concatenation axis being padded.
This change fixes this issue by propagating padding across inputs and
output.
* partially revert a commit with reference implementations of PriorBox(Clustered), disable references for these ops
* ngraph codestyle
* disable const folding unit tests for PriorBox(Clustered)
* fix arm build
* fix unit test
* Revert "fix unit test"
This reverts commit 1fe59e55d6.
* Reduce number of ops needed to create InstanceNorm
InstanceNorm in onnx importer creates the same subgraph for Mean twice - once for Variance and once for actual Mean.
This change makes InstanceNorm use a single Mean which is shared by the numerator and Variance.
Also enables IE_CPU.onnx_model_instance_normalization test case
* Revert changes to .gitignore
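The InstanceNorm entry above describes reusing one Mean for both the numerator and the variance. A minimal sketch of that idea on a flat list of values follows. It is not the ONNX importer code: the function name, the scalar `scale` and `bias` arguments, and the default `eps` are assumptions made for the example.

```python
import math


def instance_norm(values, scale, bias, eps=1e-5):
    """Normalize one instance, computing the mean only once."""
    n = len(values)
    mean = sum(values) / n
    # The centered values feed both the variance and the numerator,
    # so no second mean computation is needed.
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered) / n
    inv_std = 1.0 / math.sqrt(variance + eps)
    return [scale * c * inv_std + bias for c in centered]
```

In graph terms this corresponds to one Mean node with two consumers instead of two identical Mean sub-graphs.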
* Replace Constant with Parameter in run_op_node.
* Pass inputs to function.
* Add func to get shape.
* Make constant if input is scalar.
* Add case for list.
* Fix test.
* Split tests for run_op_node and run_op_numeric_data.
* Split more tests.
* Split more and more tests.
* Mark tests with xfail.
* Mark more tests with xfail.
* Replace scalar with parameter.
* Code formatting.
* Set empty shape for scalar.
* Remove check for list.
Add `OUTPUT_STRIP_TRAILING_WHITESPACE` option to `execute_process` command.
Latest CMake (tested 3.18.1) doesn't strip new line from `protoc --version` call,
which leads to wrong `PROTOC_VERSION` variable and failure on git fetch step.
* 1d case optimization
* code refactor
* concat optimization
* removed using template for concat
* unit tests to concat constant folding
* synchro with current master
* [GNA] Propagate QoS timeout to the calling app (#1188)
* [GNA] Support timeout value set in Wait (#1499)
* [GNA] stored request id for completed sync infer request in order to get status later using wait() (#1458)
* stored request id for completed async infer request in order to get its status later
* preserved status not started for multiple sequential calls to wait()
Co-authored-by: Denis Orlov <denis.orlov@intel.com>
* [GNA] Fix callbacks (#1607)
Co-authored-by: Eugene Smirnov <eugene.smirnov@intel.com>
* v1::Pad reference implementation
* ut fix: pad_negative_exterior_1d
* ut fix: pad_negative_exterior_1d_check_limits & pad_edge_1d
* Code formatting
* ut fix: pad_edge_1d_top_neg & pad_edge_1d_top_neg_bigger_than_tensor
* More Pad UT fixes
* Pad UT fixes: REFLECT mode
* Fix all Pad UTs
* Switch Pad evaluation in INT backend
* Non-template solution to v1::Pad::evaluate
* Always create v1::Pad with 4 inputs
* VS compilation error fix
* Python test fix
* Remove the v0::Pad constant folding pass
* Some extra checks in v1::Pad evaluator
* Code formatting
* Remove an obsolete CF test
* Added bool to u8 conversion
* Added opset1::ShapeOf handler
* Added ReduceLogicalAnd/Or support in ConvertPrecision pass
* Moved static map inside function; Updated callbacks
* Removed header
* Fixed type relaxed for cases when the same output is consumed by multiple inputs in the same operation; added tests; fixed input types setting for already created type relaxed operations
* Removed reference implementations from public API
* Remove coordinate_transform from public API
* Introduced static library with reference implementations
* Extend MO for operation CTCLoss
* Change sequence length format to a mask format
* Add fixes after first-round review
* Add fixes after the second-round review
* Fixing CTCLossPlusCTCGreedyDecoder transformation
* Initial version of ReduceL1, ReduceL2 and ReduceLp enabling in the MO
* Added operations ReduceL1 and ReduceL2 to nGraph
* Removed ReduceLp. Added ReduceL1 and ReduceL2
* Separated specification of ReduceLp into ReduceL1 and ReduceL2
* Updated ReduceL1 and ReduceL2 specification
* Fixed ReduceL1 and ReduceL2 type prop tests
* Implemented nGraph transformation to decompose ReduceL1 and ReduceL2. Disabled them for CPU and GPU plugins
* Updated supported framework layers
* Added unit tests for ReduceL1 and ReduceL2 reference implementation
* Fixed ReduceXXX operations reference implementation by adding support for a new parameter 'keep_dims'
* Fixed constant folding for v0::Any
* Added ReduceL1 and ReduceL2 to Python API
* Implemented ReduceL1 and ReduceL2 decomposition tests and fixed ReduceL2 decomposition
* Added specific creator for ReduceXXX operations instead of NodeBuilders
* Fixed conversion ReduceXXX to CNNLayer
* Fixed parser for ReduceLogicalXXX operations
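A minimal sketch of what decomposing ReduceL1 and ReduceL2 can look like, assuming the usual norm definitions (sum of absolute values, and square root of the sum of squares). It reduces over the last axis of a 2D list only. The helper names are hypothetical and are not the nGraph transformation itself.

```python
import math


def reduce_sum_last_axis(rows, keep_dims=False):
    """Sum each row of a 2D list along its last axis."""
    sums = [sum(row) for row in rows]
    # keep_dims=True keeps the reduced axis with size 1.
    return [[s] for s in sums] if keep_dims else sums


def reduce_l1(rows, keep_dims=False):
    """L1 norm expressed as ReduceSum(Abs(x))."""
    absolute = [[abs(v) for v in row] for row in rows]
    return reduce_sum_last_axis(absolute, keep_dims)


def reduce_l2(rows, keep_dims=False):
    """L2 norm expressed as Sqrt(ReduceSum(x * x))."""
    squared = [[v * v for v in row] for row in rows]
    sums = reduce_sum_last_axis(squared, keep_dims=False)
    roots = [math.sqrt(s) for s in sums]
    return [[r] for r in roots] if keep_dims else roots
```

With `keep_dims=True` the result keeps rank 2, which is the behaviour the 'keep_dims' fix above refers to.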
renamed logits -> bbox_deltas
updated ngraph unittests for Proposal
removed validate_and_infer_types Proposal-4
removed validate_and_infer_types Proposal-4
changed validate_and_infer_types in parent class of Proposal
removed get_output_size
successfully inferred Proposal on SSH and Faster-RCNN
added unittests for Proposal-4
added unittests for Proposal-4
added unittests for Proposal-4
returned back default namespace for Proposal
reduced number of outputs in v0::Proposal
correct conversion of Proposal-4 -> proposal_ie with 2 outputs
removed creator for proposal v0
removed converter for proposal v0
added Proposal-4 to MO
removed `for_deformable` attribute
added Proposal-4 to MO and nGraph Python API
removed typo in Proposal-4 specification
style corrections
style corrections and removed some redundant code
rename proposal Python api test
removed 'attrs' context from visitor
returned back AttrVisitor to check if passes OpenVINO ONNX pipeline
Should pass OpenVINO ONNX pipeline (returned back AttrVisitor just to check)
python api for Proposal-4 works ok
(style correction) python api for Proposal-4 works ok
parametrized proposal_ie some other corrections
removed 'attrs.' context from nGraph Python API tests for Proposal
minor corrections in replacer proposal->proposal_ie
corrected Python API OpenVINO-ONNX tests should pass
Improved workaround for AttributeVisitor for Proposal
Add additional check of im_info tensor shape to Proposal node in MKLDNNPlugin
😠 removed 4 extra spaces from test_dyn_attributes.py to match The Style
added new nGraph RTTI declarations, removed throwing exception in transformation
added new nGraph RTTI declarations, removed throwing exception in transformation, corrected exception in MKLDNNplugin
corrected im_info size checking in Proposal node of MKLDNNPlugin
It was discovered that it is sometimes useful to mark fast stages (e.g. stages that process less than 100 elements) in order to be able to parse them from final performance report and estimate its contribution into performance.
* Added pass for marking fast stages
* Introduced unit tests
* Added new predicates for smart pattern matching
* Added ConvMul and GroupConvMul fusion passes based on opset4; Added CPU functional tests for comparing fusion accuracy
* Improved ConvMultiply fusion to support scalars; Added positive and negative tests
* Added ConvolutionBackprop/GroupConvolutionBackprop Multiply fusion; Added functional tests
* Added test
* working ManagerWrapper
* Clean-up in ManagerWrapper
* worksave
* fixed building error
* Finished test of constant folding
* remove unused param
* Added get_vector function
* clean up
* RTTI base for ngraph::Node; cherry-pick from another branch, draft
* Added comments, moved code, switched to custom RTTI-based version of is_type
* Move rtti definitions in ngraph op class to the beginning of each class definition as a preparation for the next replacement
* Migrate part of operations to new RTTI
* Migrate GroupConvolution and Concat to new RTTI
* Apply code style for ngraph part
* Rename RTTI_DECLARATION/DEFINITION to NGRAPH_RTTI_DECLARATION/DEFINITION
* Reverted accidentally updated version of mkldnn
* TMP: rewrite RTTI back to constexprions as an attempt to fix static objects initialization order issue
* Apply ngraph code style
* Finalize move back to constexpr for RTTI
* Applied code-style
* TypeRelaxed template class implementation and necessary changes in ngraph + tests.
* Applied code-style
* Fix in fast algorithm in GraphRewrite, add new tests for this and other cases
* Make parent optional parameter for NGRAPH_RTTI_DECLARATION and remove Node::type_info; remove ability to have Node as a parent for type_info
* Try to resolve compilation error on Windows
* The next attempt to fix Windows build: re-introduce get_type_info_static
* Removed file that was removed in master and kept in this branch by mistake
* Next attempt to fix Windows build: externConstexpr
* Attempt to fix win build: extra public (suspect icc bug), remove get_type_info_static as useless.
* Next attempt to fix Windows: proxy const and constexpr
* Fixed constexpr
* Next attempt: move get_type_info to cpp file
* Code style fix
* Re-implemented RTTI without use of constexpr; run-time initialization is used; removed global definitions to avoid issues with order of static objects initialization
* Removed externConstexpr flag and removed TRANSFORMATIONS_API for TypeRelaxed
* get_type_info_static initializes static local constant with type_info that is used for CLASS::type_info and CLASS::get_type_info
* Removed not needed debug output and useless comments
* Implemented better copy ctor for Node
* Fixed VisualizeTree issue for TypeRelaxed: stopped using < and > in type_info::name
* Better comments and names for methods
* Remove unused include
* Remove commented line
* Workaround for legacy conversion that uses Node::get_type_info().name as a type for the resulting CNNLayer leading to incorrect types for TypeRelaxed-based operations and then to fail in plugins
* Fixed typos, explicit ctor for TypeRelaxedBase, explanation for the need of get_overridden_output_type
* Fix typo
* Fixed issue with non-static name in type definition for TypeRelaxed and fixed WrapType to make it compatible with hierarchical relations between types
* Reverted default ctor for Output and reverted ability to reduce number of outputs for a Node; syntactically better debug message for a Node
* Cover methods of TypeRelaxedBase by tests
* Apply code-style
* Azure CI: Add Windows job with IncrediBuild
* Update IB version to 9.4.6
* Fix "Clone submodules"
* Update IB version to 9.5
* Update install link
* Add debug out
* Update debug out
* Remove debug out
* Disable initiator machine from acting as helpers
This change adds full support for asymmetric quantization to optimized
depthwise convolution, adds slm optimization and other minor
improvements.
Issue: CVS-25122
* unroll ti transformation, lstm sequence ie, rnn sequence ie
* Update unroll ti transformation, added GRUSequenceIE op, fixed several ti e2e tests
* apply ngraph codestyle
* fix naming after unroll transformation
* Added default constructor for RNNCellBase, fix conversions
* copy runtime info
* added UnrollTI unit tests
* clean up, move sequence ops in a separate PR
* clean up, ngraph code style
* temporary disable ngraph reader unit tests for ti
* fix unit tests on windows
* naming: use name of tensor after unroll tensor iteration transformation
* apply transformations to tensor iterator body, separate pass for ti transformations, fix naming issue
* fix build
* remove TensorIterationTransformations pass
* fix includes
* resolve conflicts
* fix build: incorrect includes
* remove split/concat for single iteration of TI, update to opset4, unit tests
* use matcher pass instead of graph rewrite
* try to enable UnrollTI transformation for all plugins
* disable unrollTI transformation for cpu plugin
* resolve review comments, enable unit tests
* update transformation description
* fix unit tests
* update transformation pipeline
* clean up
* clean up
* resolve review comments
* Separate MO configuration for TensorFlow 2 model conversion
Also, it updates documentation including steps to convert
TF2 model with a custom layer in Keras H5 format into SavedModel
* Do fixes based on the first-round code review
In one of the network it was the following pipeline:
```
FullyConnected -> Reshape -> FullyConnected
```
And the output of Reshape wasn't in the same order as input for this
layer. I found that the problem was connected with format of the layers.
During optimization passes this pipeline was transformed to the
following:
```
FullyConnected -> Reorder -> Reshape -> Reorder -> FullyConnected
```
Both `FullyConnected` layers work with `yxfb` format. This is why
Reorder layer after the Reshape has output layout with format `yxfb` and
`reshape_in_layout.format` returns `yxfb` format. But in this case we
have to convert Reshape to `bfyx` format because in this case we won't
change the order of elements.
I replaced `reshape_in_layout.format` (which returns `yxfb`) and
explicitly set `bfyx` format.
JIRA: 35288
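The effect described above can be reproduced with a toy model. This is only a sketch, assuming 2D tensors with y = x = 1, so `bfyx` means batch-major memory order and `yxfb` means feature-major order with batch innermost. The helper names are hypothetical and are not taken from the plugin.

```python
def to_flat(tensor, layout):
    """Flatten a logical [batch][feature] tensor in the given memory order."""
    b_count, f_count = len(tensor), len(tensor[0])
    if layout == "bfyx":  # batch outermost
        return [tensor[b][f] for b in range(b_count) for f in range(f_count)]
    if layout == "yxfb":  # feature outer, batch innermost
        return [tensor[b][f] for f in range(f_count) for b in range(b_count)]
    raise ValueError(layout)


def from_flat(flat, shape, layout):
    """Rebuild a logical [batch][feature] tensor from a flat buffer."""
    b_count, f_count = shape
    if layout == "bfyx":
        return [[flat[b * f_count + f] for f in range(f_count)]
                for b in range(b_count)]
    if layout == "yxfb":
        return [[flat[f * b_count + b] for f in range(f_count)]
                for b in range(b_count)]
    raise ValueError(layout)


def reshape_via(tensor, new_shape, layout):
    """Reshape by reinterpreting the flat buffer stored in `layout`."""
    return from_flat(to_flat(tensor, layout), new_shape, layout)


x = [[0, 1, 2], [10, 11, 12]]
in_bfyx = reshape_via(x, (3, 2), "bfyx")  # keeps the logical element order
in_yxfb = reshape_via(x, (3, 2), "yxfb")  # shuffles the logical element order
```

Reinterpreting the buffer in `bfyx` order gives the row-major result a framework would expect, while doing the same in `yxfb` order permutes the elements.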
* Draft version of the Swish nGraph operation and fusing transformations for different approaches to express the operation
* Swish fusing transformation refactoring
* Added Swish operation and extractor for TF. Removed unfolding transformation for the operation.
* Added SwishIE. Implemented transformation to convert Swish to SwishIE.
* Code style fixes
* Updated Swish reference implementation. Added tests for shape and value inference
* Fixed code style for Python API
* Fixed unit test
* Apply review comments
* Use matcher_pass_callback
* Make m_alpha attribute protected in the SwishIE operation
* Fixed Swish op PythonAPI test
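For reference, Swish is commonly defined as x * sigmoid(beta * x). The sketch below assumes that definition. The parameter name `beta` and its default of 1.0 are assumptions for illustration, and this is not the nGraph reference implementation.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)
```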
* Added Caffe Slice_ext
* Added TFSlice, AttributedSlice (both with extractors and replacers), corrected SliceConverter and added unittests for all cases
* added comments to each type of Slice operation; optimized shape inference; moved mxlice inside of slice.py; renamed slice_replacers
* removed type annotation for get_shape_after_slice routine
* replaced zeros_like with zeros
* Corrected preserving node names, renamed attributes names, added tests for slice_replacer onnx phase
* Renamed slice_replacers.py
* added more unittest cases
* added type annotations, moved to more relevant place routines for shape calculation, and some other minor corrections
* corrected a typo `normalize_slice_indices` comment
* corrected shape calculation for Nonconstant inputs
* corrected a few typos
* corrected type declarations
* corrected shape inference with rounding
* refactored unit-tests for front transforms of Slice
* added error raising for negative and zero shapes
* removed magic_num
* corrected AttributedSlice, clarified comments
* fixed unit-test for AttributedSliceToSlice
* typo in type hints corrected
* removed supported_attrs
* returned back default None for attrs of Slice
* Updated ConvertPrecision transformation to be executed for TI Body
* Added type fusion for GenericIE operation
* Added test for TensorIterator body precision conversion
This extends the resample optimization for 8-bit types, which uses
feature-packed mode to process multiple features in one work-item, to
feature counts that are not a multiple of the packing factor.
For nearest resampling it is safe to copy extra feature padding for
blocked formats, so this change only removes this condition.
* Minimized ngraph headers inclusion
* Added compilation of plugin api headers with strict flags
* Fixed -WPedantic issue in ngraph headers
* Fixed compilation
* Trying to fix Windows
* Fixed GNA unit tests compilation
* Disabled WX test on Windows
* Enable ngraph python tests
* Refactor and unify ngraph with onnx python tests
* Revert deprecated test cases
* Set ngraph and onnx python tests as a one test suite execution
* Change unstrict Xfails to strict ones
* Update after review:
- add model zoo to onnx tests,
- improvements of tests
* Revert mounting zoo models dir
Co-authored-by: Michał Karzyński <4430709+postrational@users.noreply.github.com>
* [CPU] Add support 4th and 5th input DetectionOutput
* fix any comments
* move reference to ngraph
* some changes for mx nms
* change namespace for ref impl
Number of ops went down by 4.
Also fewer floating point operations improves precision here, so we're able
to unblock some test cases from ngraph's suite.
* Implement unicode conversion using Windows native functions
* NOCPPLINT
* Fixed deprecated c++ api usage in tests
* Moved impl to cpp
* Moved Unicode utils to Plugin API
* Added missed include for Windows
* Fixes for unit tests; CentOS fixes
* Fixed Windows compilation
* Fixed unit tests on Unix
* Fixed unix 2
* Build dlls with INTEGRITYCHECK flag if ENABLE_INTEGRITYCHECK=ON
INTEGRITYCHECK flag enforces digital signature before loading the binary in Windows.
Also, refine /guard:cf flag enabling - MSVC, Intel, and clang compilers do support /guard:cf.
* first version
* fixed lower_bounds
* Added unit test
* Added support of negative axis
* Added more tests
* Slice refactor in order to reduce binary size
* removed unused headers
* added evaluate method to split
* review remarks. part 1
* review remarks. part 2
* review remarks
* sync with master
* Aligned SpaceToBatch/BatchToSpace with the spec, converted from fused_op to op
* Implemented transformation to decompose STB/BTS
* Added unit tests
* Added new mode (INTERPRETER_TRANSFOMATIONS) for functional tests
* RTTI base for ngraph::Node; cherry-pick from another branch, draft
* Added comments, moved code, switched to custom RTTI-based version of is_type
* Move rtti definitions in ngraph op class to the beginning of each class definition as a preparation for the next replacement
* Migrate part of operations to new RTTI
* Migrate GroupConvolution and Concat to new RTTI
* Apply code style for ngraph part
* Rename RTTI_DECLARATION/DEFINITION to NGRAPH_RTTI_DECLARATION/DEFINITION
* Reverted accidentally updated version of mkldnn
* TMP: rewrite RTTI back to constexprions as an attempt to fix static objects initialization order issue
* Apply ngraph code style
* Finalize move back to constexpr for RTTI
* Applied code-style
* Fix in fast algorithm in GraphRewrite, add new tests for this and other cases
* Make parent optional parameter for NGRAPH_RTTI_DECLARATION and remove Node::type_info; remove ability to have Node as a parent for type_info
* Try to resolve compilation error on Windows
* The next attempt to fix Windows build: re-introduce get_type_info_static
* Removed file that was removed in master and kept in this branch by mistake
* Next attempt to fix Windows build: externConstexpr
* Attempt to fix win build: extra public (suspect icc bug), remove get_type_info_static as useless.
* Next attempt to fix Windows: proxy const and constexpr
* Fixed constexpr
* Next attempt: move get_type_info to cpp file
* Code style fix
* Re-implemented RTTI without use of constexpr; run-time initialization is used; removed global definitions to avoid issues with order of static objects initialization
* Remove already unnecessary compiler flag for Windows
* get_type_info_static initializes static local constant with type_info that is used for CLASS::type_info and CLASS::get_type_info
* Rewrite comments for NGRAPH_RTTI_... macros, remove not used header
* In this PR I'll add ngraph::pass::ConvertPrecision transformation and change only CPU Plugin to decrease number of changes. Other plugins will be updated in separate PR.
* This PR also includes changes for TI body transformations. We need to call the same sequence of transformations including ConvertPrecision for TI body.
* Hide implementation of SharedObjectLoader to cpp files
* Fixed GPU tests compilation
* Fixes for Unix; check OpenCL headers with strict flags
* Fixed Windows
* More fixes for Windows
* Fixed Unit tests
* Enabled compilation with libVA for new GPU tests
* Fixes for case when libVA is not available
* Removed useless NOMINMAX
* Useless include
* Fix
* Fixes
* Fixes for Intel compiler
* Fix for Windows + Intel compiler
* Fixed samples compilation with Intel compiler
* [Stress] Support OMZ model_info.py in get_testdata.py
* [Stress] Copy IRs from OMZ models folder to IRs folder
* [Stress] Support modified configs in C++ tests
* [Stress] Deprecate support of --env_conf due refactoring of configs
* [Stress] Update configs:
1. Removed env configs due deprecation
2. Moved test configs to a new format
* [Stress] Extend MemCheck records with info from test config
* Specify, review and approve operation Proposal-4
* added types section and some other corrections
* Added example of Proposal-4 without reductions
* Corrected information about input tensors
* removed 'logits' from specification, added information about shapes
* removed `for_deformable` attribute
* changed `batch_size` to 7
* updated output dimension
* Remove unnecessary ir_version checks in the MO
* Cleaned up 'backend_attrs_v2' function
* Small clean up from the 'TFCustomSubgraphCall'
* Clean up the MO extractor attributes mapping
* Renamed PreluOp to PReLU
* [ci-skip][IE Myriad] ie::ICore pointer passed into FrontEnd from plugin
* [ci-skip][IE Myriad] Added MockICore to fix graph transformer tests
* [ci-skip][IE Myriad] IN renamed to I_N to avoid compile error in Windows build: C2513: 'int': no variable declared before '='
* Add default implementation that throws exception.
* Implement `createROI` for `TBlob` and existing compound blobs.
* Use reference counting for TBlob memory buffer to prolong its life time for ROI blobs.
* Add private extension for ND ROI and use it as implementation detail for now:
* Add `DimSlice` and `TensorSlice` structures for generic ND ROI support.
* Add `make_roi_desc` function to create `TensorDesc` for ROI.
* Removed legacy library includes from plugin api headers
* Removed IInferencePluginAPI interface; merged with IInferencePlugin
* Removed pluginAPIInterface usage in Core implementation
* First variant of tests for keep_constant_inputs
* Redone tests to check number of inputs
* Count inputs of layer via ngraph::Function
* Add additional transformations for CNNNetwork
* Modified work with CNNNetwork via iterators
* Add tests for FullyConnected Network
* Rename function for counting of inputs
* Debug output was deleted
* transformations_callback was removed
* Change ASSERT_GT on ASSERT_EQ
This PR introduces next changes:
1. Transformations *_tbl.hpp files were replaced with direct registration in cpp files.
2. Plugins use pass::Manager to call conversion passes.
3. Transformations callback was moved to PassBase class as there is no more need to keep it in separate class
4. All pattern based transformations must be inherited from MatcherPass class. GraphRewrite class will be used only for matchers registration and execution on function.
MatcherPass class adds new features to pattern-based transformations approach:
* Allows to run matcher pass on a single node.
* Operations that were created inside transformation callback can be added to execution list to be available for pattern matching within single GraphRewrite.
5. GraphRewrite MatchClosure was replaced with MatcherPass. So all matchers will be registered as a MatcherPass.
6. Added pass::Manager::clear_state() method to avoid dependency on nodes that no longer belong to function after replacement.
7. Some representative transformations were updated to use MatcherPass as an example.
8. Mul->Add sequence fusion transformation was replaced with LinOpSequenceFusion.
9. Pattern and callback registration code was moved to class c-tors (will be finished for remaining passes in other PR) .
10. Updated pass::Manager to get pass names only when NGRAPH_PROFILE_PASS_ENABLE enabled.
11. Moving towards removing PassProperty.
12. Added ngraph::pattern::wrap_type<T>(inputs, pred) to simplify pattern creation.
13. GraphRewrite was updated to execute MatcherPass more efficiently.
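The MatcherPass idea from item 4 can be pictured with a toy model. This is a sketch in Python, not the nGraph C++ API: the classes `Node`, `MatcherPass` and `GraphRewrite` below are hypothetical stand-ins that only borrow the names. A matcher is a predicate plus a callback, it can be applied to a single node, and a node created by a callback is matched again within the same rewrite.

```python
class Node:
    """Toy graph node; a "Mul" node multiplies its input by `value`."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = list(inputs)
        self.value = value


class MatcherPass:
    """Hypothetical stand-in: a pattern predicate plus a callback."""

    def __init__(self, matches, callback):
        self.matches = matches
        self.callback = callback

    def apply(self, node):
        """Run the matcher on a single node; return a replacement or None."""
        return self.callback(node) if self.matches(node) else None


class GraphRewrite:
    """Registers matchers and runs them over a graph, inputs first."""

    def __init__(self):
        self.matchers = []

    def add_matcher(self, matcher):
        self.matchers.append(matcher)

    def run(self, node):
        node.inputs = [self.run(i) for i in node.inputs]
        for matcher in self.matchers:
            replacement = matcher.apply(node)
            if replacement is not None:
                # A node created by a callback is matched again in this run.
                return self.run(replacement)
        return node


def is_mul_of_mul(node):
    return node.op == "Mul" and node.inputs[0].op == "Mul"


def fuse_mul(node):
    inner = node.inputs[0]
    return Node("Mul", inner.inputs, inner.value * node.value)


rewrite = GraphRewrite()
rewrite.add_matcher(MatcherPass(is_mul_of_mul, fuse_mul))
```

The fusion of two chained constant multiplications is only an example pattern chosen for brevity.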
* Implementation of Resize-11
* Added support to sizes input
* Add tests to sizes input
* Added missing comment
* fixed tests
* fixed tests
* Fixed test. part 2.
* review remarks. part 1.
* review remarks. part 2.
Co-authored-by: Tomasz Socha <tomasz.socha@intel.com>
* Added more tests
Co-authored-by: Tomasz Socha <tomasz.socha@intel.com>
* Network serializer for v7 is removed
* Fixed compilation
* Fixed Windows build
* WA for GPU
* Create function 2 times
* Fixed compilation
* Added return
* [Stress] Redesigned MemCheckTests: 1. Added MemCheckPipeline to encapsulate measures and logging. 2. Moved references to array
* [Stress] Added tracking of THREADS in MemCheckTests
* [IE][VPU]: Moves UpgradeNMS4ToNMSDynamic transformation into myriad plugin
* [IE][VPU]: Moves UpgradeNMS4ToNMSDynamic from common to vpu folder
* [IE][VPU]: Moves Dynamic NMS from common folder to vpu
* [VPU]: Makes NMS conversion unconditional
* [VPU][NGraph]: Changes dynamic NMS base class from v3 to v4
* [VPU]: Moves NMS4toDynamic transformation before common optimization
* Try fix parsing error.
* Small exception refinements during importing model.
* More exception refinements.
* Skip segfaulting tests.
* More clear error types and messages. Func rename.
* Fix typo.
* Check on CI whether test_onnx will work.
* Add only those file which pass tests or have failing ones skipped.
* Add mish op to ngraph
* Update mish op
* Set v4 namespace for tests
* Add mish to cmake
* Add comments for mish op.
* Refactoring code style
* Update version to v1 for Mish op
* Add value propagation test for Mish op
* Refactoring mish op according to review
* Fix mish version
* Update cmake file
* Fix mish value propagation unit test
* Add unit test for mish op
Co-authored-by: Your Name <you@example.com>
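For reference, Mish is commonly defined as x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x). The sketch below assumes that definition. The function names are illustrative and this is not the nGraph reference implementation.

```python
import math


def softplus(x):
    """ln(1 + e^x), rearranged so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))
```

A value propagation test can then compare the op output against `mish` on a handful of inputs with a small tolerance.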
* [Stress] Define Database constant arguments in memcheck_upload.py only
* [Stress] Simplify computations using HashableDict in `compare_memcheck_2_runs`
* [Stress] Add comparison using pandas
* Some pass creates datas duplicate with a different order from time to time (because of unordered_set usage). It leads to a different order in model->datas() list and affects the shape allocation process which relies on this order.
* Make shape allocation rely on the topological order of datas, which is stable and doesn't depend on the order in which datas are created during different passes.
Don't increment mapped_idx via prefix increment within the argument of the
potentially unsafe CPU_ISSET_S macro. If the macro is expanded so that the
increment expression is evaluated multiple times, it will return unexpected
results.
While the glibc implementation of CPU_ISSET_S macro seems to be safe, the musl
libc (v1.1.23) version is unsafe and will evaluate the first argument of
CPU_ISSET_S three times.
Co-authored-by: Christian Priebe <cp3213@ic.ac.uk>
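The hazard can be modelled outside C. Python has no preprocessor macros, so the sketch below stands in for one with a function that takes the argument as a zero-argument callable and re-evaluates it at every use. The three uses (bounds check, word lookup, bit lookup) are an illustrative assumption and are not copied from any libc header.

```python
BITS_PER_WORD = 64


class Counter:
    def __init__(self, value):
        self.value = value

    def pre_increment(self):
        """Model of `++value`: increment, then yield the new value."""
        self.value += 1
        return self.value


def isset_macro_like(index_expr, words):
    """`index_expr` is re-evaluated at each use, like macro argument text."""
    if index_expr() // BITS_PER_WORD >= len(words):     # use 1: bounds check
        return False
    word = words[index_expr() // BITS_PER_WORD]         # use 2: pick the word
    return bool(word >> (index_expr() % BITS_PER_WORD) & 1)  # use 3: pick the bit


words = [0b001]  # only CPU 0 is set

# Unsafe: the increment is part of the argument and runs three times.
unsafe = Counter(-1)
unsafe_result = isset_macro_like(unsafe.pre_increment, words)

# Safe: increment once, then pass a side-effect-free expression.
safe = Counter(-1)
safe.pre_increment()
safe_result = isset_macro_like(lambda: safe.value, words)
```

In the unsafe call the counter ends at 2 and the wrong bit is tested, while the safe call tests bit 0 and leaves the counter at 0.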
In some networks, mvTensor would request a large CMX-DMA transfer (<512K). That starves DMA for other timing critical tasks such as SIPP. Limit CMX-DMA request size as an option in myriad_compile:
* Add compile option TILING_CMX_LIMIT_KB
* Declare compile option TILING_CMX_LIMIT_KB in IE tools (compile_tool and vpu_compile)
* Add tests for compile option TILING_CMX_LIMIT_KB. Small fix for naming behavior tests.
* Specify operation CTCLoss-4
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Correct documentation for CTCLoss after #1 review
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Correct documentation for CTCLoss after #2 review
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Correct documentation for CTCLoss after #3 review
* Correct documentation for CTCLoss after #4 review
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Correct layout for logits and add more description for unique attribute
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Correct types for length and indices tensors
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Correct formulas and punctuation
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
Myriad plugin treats DSR operation in a way removing such operations
and connecting inputs with each other (replacing output with one of them).
The semantics of the connection is that one input contains the shape of the other.
Since the same data object can have exactly one shape it's prohibited
to have DSR inputs connected with another data objects
(the only allowed exception is inputs that are already connected between
each other).
As a result of nGraph -> CNN conversion some operations could be optimized
out which in turn could lead to subsequent DSR operations where each has
its own shape sub-graph. Even if shape sub-graphs are identical it's not
visible to plugin that sees incorrect inputs (inputs of DSR are already
connected, but not with each other, when second DSR is parsed).
To overcome such issue (the reason is when operations are optimized out,
their shape sub-graphs are still there), additional ngraph
transformation should be introduced to merge subsequent DSR into single
DSR operation.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
Previously, if Reshape had input pattern with values [0, -1] - it
propagated dynamic shape through a function. At the same time,
taking "0" and "-1" interpretation into consideration, it turns out
in such cases we could just propagate the same input shape in case of
2D input.
For Faster-RCNN this fix makes static dimensions on dynamic paths static.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
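A sketch of why the pattern [0, -1] reproduces a 2D input shape, assuming the usual interpretation of the special values: 0 copies the input dimension at the same index, and -1 is inferred from the remaining element count. The function name is hypothetical.

```python
def resolve_reshape_pattern(input_shape, pattern):
    """Resolve the special values 0 and -1 in a Reshape pattern."""
    total = 1
    for dim in input_shape:
        total *= dim

    # 0 means "take the input dimension at the same position".
    out = [input_shape[i] if p == 0 else p for i, p in enumerate(pattern)]

    if out.count(-1) > 1:
        raise ValueError("at most one -1 is allowed")
    if -1 in out:
        known = 1
        for dim in out:
            if dim != -1:
                known *= dim
        if known == 0 or total % known != 0:
            raise ValueError("cannot infer the -1 dimension")
        # -1 means "whatever makes the element counts match".
        out[out.index(-1)] = total // known
    return out
```

For any 2D input [a, b] the pattern resolves to [a, b], so the input shape can be propagated unchanged.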
* In case of Begin/End/Stride inputs of StridedSlice have rank less
than input data rank - remaining dimensions must be kept unchanged.
* Previously, the implementation had UB in such cases - out of bound
vector element access
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
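A sketch of the rule for short Begin/End/Stride inputs, assuming non-negative indices and no masks: missing trailing entries are filled so that the remaining dimensions are taken whole. The function names are hypothetical.

```python
def pad_strided_slice_inputs(data_shape, begin, end, stride):
    """Extend begin/end/stride to the data rank, keeping extra dims whole."""
    rank = len(data_shape)
    n = len(begin)
    if not (n == len(end) == len(stride)) or n > rank:
        raise ValueError("begin/end/stride must have equal length <= data rank")
    begin = list(begin) + [0] * (rank - n)      # start at the first element
    end = list(end) + list(data_shape[n:])      # stop at the full dimension
    stride = list(stride) + [1] * (rank - n)    # take every element
    return begin, end, stride


def strided_slice_shape(data_shape, begin, end, stride):
    """Output shape after padding the inputs; never indexes out of bounds."""
    b, e, s = pad_strided_slice_inputs(data_shape, begin, end, stride)
    return [len(range(bi, ei, si)) for bi, ei, si in zip(b, e, s)]
```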
* Added mish layer doc opset
* Refactoring mish spec
* Update mish spec
* Change output description of Mish layer
* Fix Mish according to review
* Refactoring Mish and GELU spec according to code review
* Update formula for ops in spec
* Refactoring spec text
* Update Mish opset
* Change Mish version from 1 to 4
* Sort opset4
Co-authored-by: Your Name <you@example.com>
* Added documentation for Interpolate-3.
* Some fixes.
* Fixed some typos.
* Now Interpolate-3 is Interpolate-4.
* Fixed typo.
* Deleted unused 'mode' 'area'.
* Fixed some typos.
* Now 'axes' attribute is an input of Interpolate.
* Added description of variants of nearest_mode.
* Added descriptions of coordinate transformation modes.
* Now 'axes' is an optional input.
* Fixed typo.
the point is that we should check the ORIGINALLY (largest) list of the devices (actually ExecutableNetworks for them) to see if the device is just added back
* [LPT] FuseFakeQuantizeAndScaleShift transformation for last layer fix
* [LPT] refactoring
* [LPT] FuseFakeQuantizeAndScaleShift test: last layer name validation was added
* [IE][nGraph]: Introduces PartialShape ctor from values vector
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* [IE][VPU][nGraph]: Moves evaluateTargetShape to common utilities
The same functionality - get upper-bound shape estimation for dynamic
input - is needed in dynamic Reshape along with dynamic Broadcast.
Return value type has been changed from PartialShape to vector<int64_t>.
The reason is Reshape encodes special values (0, -1) into input values
that define output shape. Representing those values (which upper-bound
provides evaluateTargetShape) as PartialShape leads to incorrect
representation vector with -1 as dynamic shape - which is not expected.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* [IE][VPU][nGraph]: Introduces StaticShapeReshape
In comparison with original Reshape StaticShapeReshape propagates
upper-bound shape through a function in case of dynamic input. To do so,
shape inference method gets upper-bound shape from evaluateTargetShape,
decodes special values (0, -1) in it and then propagate the result.
Output shape processing happens only once, because if shape inference
were called after ShapeOf operations have been optimized out on dynamic
path, then evaluateTargetShape will require evaluate method for all
operations that appear in function before current Reshape. Since
evaluate method is implemented not for all operations it led to
Faster-RCNN compilation error.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* [IE][VPU][nGraph]: Updates Reshape DTS on StaticShapeReshape
In case of non-const Reshape input that defines output shape DTS uses
StaticShapeReshape which propagates upper-bound shape evaluated from
this input through a function.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* [IE][VPU][nGraph][Tests]: Refactoring DTS Reshape tests
The only changes are:
* header files include reordering
* indentation/wrapping fixing
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* [IE][VPU][nGraph]: Moves ShapeOf transformation out of DTS scope
In comparison with DTS ShapeOf transformation needs to work on whole
function. Separating these 2 transformations makes testing easier since
now it's possible to call specific DTS without ShapeOf transformation
and vice versa.
Also DynamicToStaticShapeOf has been renamed into
EliminateShapeOfAfterDSR since transformation doesn't introduce new DSR
operations.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* [VPU][Tests]: Introduces DTS Reshape tests with non-const pattern
New StaticShapeReshape constructor has been added as well, since test
fixture should create it from reshape parameters, not reshape itself.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* Remove replacement of StridedSlice with other stages and execute it on device as one kernel.
* Refactor strided slice tests to be able to parametrize it by precision.
* Update firmware.
* added support for power layer with non-1 exponents to GNA plugin
* reverted a change caused by merge issue
* fixes for review comments (typo fix - lrelu instead of leru, unnamed structure instead of a named one in union with arguments of activation function, name fix - input instead of inputs),
scale-shift implementation based on affine layer instead of PWL,
* fixed code style
* fixes for coding style in scale_factor_calc.hpp
* added domain for power function
* fixed review comment - power function specific methods
* added check if dynamic casting was successful
* removed I16 as it is not supported by ngraph
* fixed initialization per review comment
SparseToDense used in Wide and Deep model is expressed through ScatterND operation.
ScatterND is more functional than SparseToDense. Hence, it was decided to replace SparseToDense
with ScatterND. ScatterND is more useful for other models.
Remove SparseToDense from the previous opset
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Check equality of shape data for the replaced and replacement input/output data in the model
* Connect data with shape in duplicateData method
* Disconnect shape with data which is being removed as unused.
* Check that disconnected shape still have child dataToShape edges or consumers
* Refactor cleanUp to use removeUnusedData and not duplicate code
* Added ctor for CNNNetworkImpl to convert from ngraphImpl
* Re-use in all places instead of manual conversion
* Hide convertToCNNNetworkImpl usage
* Remove useless test
* Fixed Gleb's comments
* DequantizeLinear 10 as a subgraph
* Enable DequantizeLinear from opset 13
* Exclude the failing tests
* Re-enable dequantize linear UTs
* Validation helper
* [VPU] Remove hardcoded shape type from MatMul dts
* [VPU] Forbid first GEMM input to be dynamic and transposed
* [VPU] Update DSR_MatMul tests to use DSR_TestsCommon base class
* [Stress] Fix missing retry for StressMemLeaksTests
* [Stress] Add smoothing with sliding average for StressMemleaksTests
* [Stress] Enable GPU in StressMemleaksTests precommit scope
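A sliding average replaces each sample with the mean of the most recent samples, which damps single-sample spikes in a memory series. The sketch below is an assumption about the general technique only. The window size and the function name are hypothetical and are not taken from the tests.

```python
from collections import deque


def sliding_average(samples, window):
    """Mean over the last `window` samples (fewer while warming up)."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)
    total = 0.0
    smoothed = []
    for sample in samples:
        if len(recent) == window:
            total -= recent[0]  # the oldest sample is about to be evicted
        recent.append(sample)
        total += sample
        smoothed.append(total / len(recent))
    return smoothed
```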
- add error reporting for failed kernel runs during auto-tune
- fix auto-tuning for asymmetric quantization
- add asymmetric quantization information to cache
- change auto-tuning metric from average to min
This change adds checks, macros and defines for two early/experimental
features:
- local memory block reads
- builtin optimization hints, ie: __builtin_assume
* [IE VPU] Set name for outDSR in DTS transformations
* [IE VPU] Enable NonZero_Transpose tests
* [IE VPU] Set name for outDSR in Reduce DTS
* [IE VPU] Use move semantic in DTS
* Specification for the NMS-4 operation (updated shape infer function)
* Enabled NMS-4 in the Model Optimizer
* Changed opset version for NMS with dynamic outputs and namespace to be "dynamic"
* Added NMS-4
* Added opset4 to the nGraph
* Added unit tests for NMS-4 type infer
* Renamed UpgradeNMS3ToNMS4 to UpgradeNMS3ToNMSDynamic. Added stub for ConvertNMS4ToLegacy
* Make IE aware of opset4 ops
* Updated NMSIE to have different shape infer function based on the NMS it was converted from. Implemented NMS4->NMSIE conversion
* Apply code style
* Updated StaticShapeNonMaximumSuppression op in the VPU
* Introduced new version of NMSIE operation with shape infer function from v4::NMS
* Fixed dynamicToStaticNonMaxSuppression transformation
* Added new version of NMSIE op with updated shape infer function
* Fixed NMS4 to NMSIE2 transformation
* Fixed constructors for nGraph ops v4::NMS and dynamic::NMS
* Updated text in the opset4 specification document
* Code style fixes
* Fixed constructors for StaticShapeNMS + fixed test
* Minor change to the NMS op in the MO
* Fixed typo in the dynamic_to_static_shape_non_max_suppression transformation
* Removed redundant checks
* Refactored NMS infer and validate functions
* Added more checks to the validate_and_infer_types functions for NMS-3 and NMS-4
* Fixed compilation issue on Windows for op NMS
* Code style fixes
* Fixed typos in the NMSIE and NMSIE2 to CNNLayer op conversion
* Fixed typo in the ie_cnn_layer_builder_ngraph.cpp
* Fixed the NMSToLegacyNMS transformation. Added unit tests
* Apply code review comments
* Refactored NMSIE to use visitors
* Removed calling ConvertNMS4ToLegacy in the common optimizations
* Moved NMS4ToNMSLegacy to convert1_to_legacy group of transformations
* Removed useless include statement
* Removed copy-paste issue
Co-authored-by: Evgeny Lazarev <elazarev.nnov@gmail.com>
* Fixed deleting Transpose layers after and before Interpolate layers.
* Added run_after() for the transformation InterpolateTranspose.
* Some checks were moved from the replacement function to the pattern.
* Added a check of the attribute 'axes' into the pattern.
The ExtractImagePatches operation collects patches from the input
tensor, as if applying a convolution. All extracted patches are stacked
in the depth dimension of the output.
JIRA: 30055
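As a rough illustration of the description above, here is a minimal pure-Python sketch for a single-channel 2-D image with no padding. The function name, the argument names and the row-major ordering of elements inside each patch are assumptions made for the example, not the operation's specified behavior.

```python
def extract_image_patches(image, size, stride):
    """Collect size x size patches from a 2-D list and stack them in depth.

    Returns a nested list shaped [size * size][out_h][out_w]: the values at
    output position (i, j) across all depth slices form one flattened patch.
    """
    h, w = len(image), len(image[0])
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(size * size)]
    for i in range(out_h):
        for j in range(out_w):
            for di in range(size):
                for dj in range(size):
                    # Depth index enumerates patch elements row by row (assumed order).
                    out[di * size + dj][i][j] = image[i * stride + di][j * stride + dj]
    return out


# 4x4 image holding the values 0..15
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = extract_image_patches(image, size=2, stride=2)
first_patch = [patches[d][0][0] for d in range(4)]
```

Reading the depth axis at one spatial position, as `first_patch` does, gives back the flattened patch that a convolution window would see at that position.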
* LayerNorm(PyTorch/HuggingFace pattern)->MVN+Mul+Add. Improves perf on BERT by 5%
* deducing the across_channels from axes passed to the MVN op.
axes are normalized. if no axes is specified, falling back to the (previously) default across_channel value
Co-authored-by: myshevts <maim.y.shevtsov@intel.com>
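The equivalence behind the LayerNorm fusion can be checked numerically. The sketch below is a pure-Python illustration over one vector; the helper names and the placement of `eps` inside the square root are assumptions for the example and are not taken from the transformation code.

```python
import math


def mvn(x, eps=1e-5):
    """Mean-variance normalization over all elements of x."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def layer_norm(x, gamma, beta, eps=1e-5):
    """Reference LayerNorm written as a single formula."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def mvn_mul_add(x, gamma, beta, eps=1e-5):
    """The same computation expressed as MVN, then Mul, then Add."""
    return [n * g + b for n, g, b in zip(mvn(x, eps), gamma, beta)]


x = [1.0, 2.0, 3.0, 4.0]
gamma = [0.5, 1.0, 1.5, 2.0]
beta = [0.1, 0.2, 0.3, 0.4]
reference = layer_norm(x, gamma, beta)
fused = mvn_mul_add(x, gamma, beta)
```

Both lists agree up to floating-point rounding, which is why the pattern can be replaced without changing model results.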
[GNA] Added fix multiple output with one go to memory and test
Added multi output
Update gna_pass_manager.cpp
test
tests
Added pass
Test
test
tests_2
return old
The problem was in order of freeing memory. _context was removed before
_device and it looks like cl::Device in destructor tries to read some
info from cl::Context. And in this case we got this problem with
addressing because the memory already was freed.
For fixing the problem I changed the order of constructing members. And
based on principle: "First constructed, last destructed", the problem
was fixed.
JIRA: 29649
* Fix kaldi models (batch > 1)
* ngraph codestyle
* fix ngraph to ie conversion
* Added comment
* apply review comments
* Added test for the case using the SetBatchSize function when ReadValue op is in the network
* Check status code instead of message
* Use new ngraph api
* Removed back phase transformations related to IRv7
* Fixed setting value for the input port using the 'set_value' method
* Removed front and middle phase transformations related to IRv7
* Cleanup the rest of the Model Optimizer transformations from IRv7 specific transformations
* Final cleanup of the deprecated IR v7 related code
* Removed 'blobs_as_input' usage in the Model Optimizer.
* Removed function '_fuse_add' from the Model Optimizer since it is not used anymore.
* Removed 'keep_in_IR' node attribute for FakeQuantize ops in the MO
* Disabled failing gpu_engine.user_context test
* Fix build issue
Why:
* Enable to build OpenVINO.
This change addresses the need by:
* Adding include directories,
* Removing IE::inference_engine_c_api dependency.
* Remove IE::inference_engine_nn_builder reference.
Why:
* Enable to build OpenVINO.
This change addresses the need by:
* Removing IE::inference_engine_nn_builder dependency.
Co-authored-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
For b_fs_yx_fsv16 format in reference kernel features for dispatch are
rounded to multiple of 16. This change adds correct check in kernel to
return work-items that are inside this dispatch padding.
Previously those work-items could corrupt memory expected to be filled
with 0s, and for parametrized activation due to bounds checking with
modulo operator they could have been corrupting actual layer output.
Issue: CVS-27672
* [IE VPU] Add evaluate method to DSR
* [IE VPU] Enable DSR_Reshape tests
* [IE VPU] Improvements in DSR op
* [IE VPU] Fix typo in copyBlobAccordingUpperBound
* [IE VPU] Support dynamic inputs
* [IE VPU] Use dynamic inputs in tests
* [IE VPU] Improve conditions in propogateDynamism pass
* [IE VPU] Fix Myriad2 tests via disabling reorder
* [IE VPU] make error message more explicit
* [IE VPU] Fix Win compilation: std::stoi in <string>
* [IE VPU] Improve data transferring to work with ND tensors
* [IE VPU] Avoid ODR in myriad common test utils
* [IE VPU] Split code in propagate dynamism into separate methods
* [IE VPU] Simplify conditions in DSR parsing
* [IE VPU] Emplace data in initialStages when remove stage order
- Named structures in bmp.h to avoid MSFT compiler error
- Fix for non-void function with missing return statement to avoid Intel compiler error
- Enabled "smoke_ExportUsingFileNameImportFromStreamNoThrowWithDeviceName" test
- Fix for MvncTest
* Execution graph via ngraph for CPU plugin
* Fixes
* Migrated to VariantImpl instead of Parameter
* Reverted to dedicated ExecutionNode once again
* Re-use new execution graph in tests
* Fixed one more tests to use execution graph via ngraph::Function
* [IE][VPU]: Enables dynamic output from middle of network support
This feature is very useful for debugging dynamic networks.
Changes include modification of existing addCopyForOutputsInsideNetwork
pass to respect dynamic outputs and moving propagateDynamismToOutputs
pass after addCopyForOutputsInsideNetwork. The motivation for last change
is to avoid unnecessary copy stages due to not synchronized logic, because
previously:
* First in Front-End (parseDSR) we mark shape data object as output
* Then in propagateDynamismToOutputs we insert copy stage for that case.
It's necessary if shape data object had other consumers
* Then in convertShapeNotation we insert Gather consumer for output data object
* Finally, addCopyForOutputsInsideNetwork inserts one more copy stage to leave
output data object without consumers.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* [IE][VPU]: Replaces attrs.has + attrs.get with attrs.getOrDefault
* [IE][VPU]: Fixes setting IE-notation and converted-notation to the same data object
* Modifications to support fp16 networks in KMB-plugin
* StridedSliceIE is removed
* One function convertFunctionToICNNNetwork with default parameter
* Some little changes in function convertFunctionToICNNNetwork()
* Delete some spaces in code (style changes)
* Edit code style
* Edit code style one more
* Edit code style again
* Remove row with Transpose()
This change:
- extends concat in-place optimization for resample on input
- adds resample primitive int8 support for bilinear mode
- fixes some potential issues with offset calculations with int8
* [IE VPU] Enable variable number of inputs for ExpPriorGridGenerator layer
* [IE VPU] Add test cases for ExpPriorGridGenerator layer with less than three inputs
* Fix preserving names of output layers after TopK NGraph transformation
It helps to infer semantic-segmentation-adas-0001 model. See CVS-31977.
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix a test for TopK
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix TopK NGraph transformation and its test
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Transformation to eliminate trivial permute
* Minor changes in unit tests
* Replace trivial permutation with copy if input and output dims is equal
* Fix mergePermuteStages tests
* Small changes in the loop
* Add const modifier, change dimsVector type to SizeVector
* Change loop condition, rename variable
* To reverse dimsVector
GCC and Clang *default* sanitizer linkage differs (static vs. dynamic).
Prefer the default behavior, as the alternative has been seen to have issues.
The default GNU linker fails with unresolved symbols when linking Clang-built
binaries with the sanitizer enabled. Force the use of the LLVM linker lld for
Clang builds.
Sanitizer instrumentation and link flags should be retained for all
binaries. Update the samples CMake configuration to keep those flags
after the unset logic in ie_build_samples().
* Fixed StridedSlice to Crop transformation to not apply when rank of data is changed
* Added unit test for StridedSlice to Crop transformation
Co-authored-by: Evgeny Lazarev <elazarev.nnov@gmail.com>
* [MO] Implement EmbeddingBag_3
* Transform dynamic sub-graph of Wide and Deep into EmbeddingSegmentsSum
- Expressed SparseWeightedSum sub-graph through EmbeddingSegmentsSum
- Removed experimental SparseWeightedSum layer
- Implemented tests for the transformation
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix EmbeddingBag shape infer
* Fix EmbeddingSegmentsSum transformation for Wide and Deep
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix EmbeddingSegmentSum replacer after ports swap
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Update package_BOM.txt
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Add unit tests for EmbeddingXXX shape infer
* Fix ATen resolver
* Remove deleted files from BOM
* Add opset version to embedding_bag
* Use base class for EmbeddingBag
* Fix per_sample_weights case
* Fix EmbeddingSegmentsSum transformation
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix EmbeddingBag checks
* Fix ATen front transformation and merge conflicts
* Fix BOM
* Work around limitation for I64 input of W&D model
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Clean up Where operation to fix effect of WhereDecomposition transform
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix BOM
* Correct EmbeddingSegmentSum transform for Wide and Deep
Add casting segment ids to i32 and remove ConstToResult sub-graph.
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Update BOM with RemoveConstToResult transform
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Add more comments for RemoveConstToResult transformation
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Remove useless logging in EmbeddingSegmentsSum transformation
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Small fixes
* Move EmbeddingBag resolving back to front phase
* Improve error messages
* Fix typo in unittests
* Reimplement sparse_reshape middle transform
Avoid deprecated API.
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Clean-up graph after sparse_reshape and ConstToResult transformation
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix clean-up for transformations
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Fix clean-up for transformation #2
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
Co-authored-by: Roman Kazantsev <roman.kazantsev@intel.com>
* Azure: Add Ninja
* Fix 'Install Ninja' on Linux
* Fix bin dir path on Windows
* Add -Wno-unused-variable on Mac
* Add -Wno-error=unused-command-line-argument on Mac
* Set CXXFLAGS for Mac
* Improvements
* Fix BIN_DIR on Linux
* [CPU] Updated DepthToSpace and SpaceToDepth layers to be conformant with the specification
The patch also includes n[d]hwc layout support as well as some optimizations
* [CPU][TESTS] Removed old DepthToSpace test since it doesn't correspond to the layer's specification
* [nGraph] Utilize CommonOptimizations pass with custom transformations callback
Implemented three operations: EmbeddingBagPackedSum,
EmbeddingBagOffsetsSum and EmbeddingSegmentsSum. These operations do
the same work but have a different format of inputs.
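To make the difference in input formats concrete, here is a pure-Python sketch of the three operations on the same data. It is illustrative only: the function signatures are assumptions for the example, and optional inputs such as per-sample weights or a default index are left out.

```python
def _bag_sum(table, rows):
    """Sum the embedding rows selected by `rows`; an empty bag yields zeros."""
    dim = len(table[0])
    total = [0.0] * dim
    for r in rows:
        for k in range(dim):
            total[k] += table[r][k]
    return total


def embedding_bag_packed_sum(table, indices):
    # indices is 2-D: one row of indices per bag, all bags the same length.
    return [_bag_sum(table, bag) for bag in indices]


def embedding_bag_offsets_sum(table, indices, offsets):
    # indices is flat; offsets[i] is the position where bag i starts.
    ends = offsets[1:] + [len(indices)]
    return [_bag_sum(table, indices[s:e]) for s, e in zip(offsets, ends)]


def embedding_segments_sum(table, indices, segment_ids, num_segments):
    # indices is flat; segment_ids[i] names the bag that indices[i] belongs to.
    bags = [[] for _ in range(num_segments)]
    for idx, seg in zip(indices, segment_ids):
        bags[seg].append(idx)
    return [_bag_sum(table, bag) for bag in bags]


table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
packed = embedding_bag_packed_sum(table, [[0, 2], [1, 3]])
by_offsets = embedding_bag_offsets_sum(table, [0, 2, 1, 3], [0, 2])
by_segments = embedding_segments_sum(table, [0, 2, 1, 3], [0, 0, 1, 1], 2)
```

All three calls describe the same two bags, rows {0, 2} and rows {1, 3}, so they return the same sums.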
- change repo name to openvino
- update driver version
- fix path to samples data
- remove section about Movidius driver installation
- change latest release to 2020.3
- merge fixes in install_dependencies.sh from 2020 branch
adds fusing support to all available pooling kernels
tests all possible input type/output type configurations
fixes minor bug in max pooling in pooling_gpu_test.cpp
fixed minor bug with yxbf format in pooling_gpu_ref and pooling_gpu_int8_ref kernels
fixes bug with b_fs_yx_fsv32 format in pooling_gpu kernel
resolves bug with max pooling accuracy mismatch in case of non-zero pad end layer parameter
resolves average pooling accuracy mismatch in case of non-zero pad end layer parameter
The problem behind this error was in program_impl::init_graph() where in calculate_prior_boxes we are trying to calculate output layout of an entire network recursively which causes stack overflow. Calculating output layouts beforehand in processing order fixes this issue.
fix the following compile error:
inference-engine/src/mkldnn_plugin/mkldnn_memory_solver.hpp:60:9: error: 'int64_t' does not name a type
| 60 | int64_t size;
| | ^~~~~~~
include stdint.h to fix this.
Signed-off-by: Liwei Song <liwei.song@windriver.com>
* Create generic RecurrentSequenceDirection enum.
* Helper class RecurrentSequenceOp.
* Add ONNX GRU & RNN operators.
* Use OutputVector.
* Update doc.
* Add UTs for GRU and skip them on IE_CPU
* Add UT for bidirectional mode and fix it.
* Normalize activation function name case.
* Add unit-tests for RNN operator.
* UT for GRU with linear_before_reset set to true.
* Fix ONNX GRU for linear_before_reset case.
* Remove unnecessary symbol export macro.
* Fix CentOS error.
* Update UTs.
- Update few tests accuracy tolerance
- Update rnn_fwd_activations with new reference values and model.
* Review comment: add check for static shape
* Add UT for RNN with constant inputs W, R.
* Skip UT with const W,R on IE_CPU
* [IE][VPU]: Enables pass for propagating dynamism to network outputs
If network had dynamic output and then myriad Front-End inserted
convert stage at the end (to convert FP16 -> FP32 - output precision)
then dynamism would not be propagated - we have convert stage that
has dynamic input, but static output. As a result, we have run-time
error in Convert kernel: input and output shapes do not match.
At the moment, pass supports only Convert stage as output stage
over which we should propagate dynamism to outputs.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* [IE][VPU]: Fixes parse DSR in case of output data
Replacing stage output must be done after replacing
data to shape parent, because the last one may access
original parent producer, but after replacing stage output
it'd not have one.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* [IE][VPU]: Fixes MacOS build
* [IE][VPU]: Fixes shape data naming convention
Plugin part assumes that if there is dynamic data object, that's
represented as 2 different data objects (data and shape), then
shape data object has name = data object name + @shape suffix.
Pass that creates new dynamic data object should respect that
assumption.
* [IE][VPU]: Fixes dis-alignment in names of data objects representing dynamic data object
MyriadInferRequest::GetResult assumes that in case of dynamic data object
"data" data object and "shape" data object will have aligned names:
"shape" name = "data" name + "@shape" suffix.
In order to meet that expectation propagating dynamism pass must use output
data object name as prefix. Additionally, propagating pass must be applied
before converting shape notation pass in order to make output shape in IE
notation, not MDK, as MyriadInferRequest::GetResult is expecting.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* Update activation layer test
Signed-off-by: Mikhail Treskin <mikhail.treskin@intel.com>
* Get rid of LayerTestsCommonDeprecated class
Signed-off-by: Mikhail Treskin <mikhail.treskin@intel.com>
* Fix activation tests instantiations for gpu and myriad plugins
* Remove leaking inferWithInterp function
WhereDecomposition transform is applied to the Where operation in the garbage sub-graph that remains after the SparseWeightedSum transform.
Signed-off-by: Roman Kazantsev <roman.kazantsev@intel.com>
* implemented depth_to_space transformation
* renaming
* added functional tests, fixed mistakes in implementation of the transformation
* disable ConvertSpaceToDepth/ConvertDepthToSpace transformation for CPU plugin, enable DepthToSpaceFusion for CPU plugin only, add specific creators
* fix wrong include
* fix for functional tests: set transformation callback
* revert callback calls for CPU plugin
* move functions to .cpp file
* Apply review comments
* Apply additional review comments
* fix cast to bool type
* Added explicit calling convention to CAPI callback
* Fixed typo spacing
* Renamed INFERENCE_ENGINE_CALLBACK to INFERENCE_ENGINE_C_API_CALLBACK to make the macro really specific to the C API
* [IE][VPU]: Fixes deallocation data for cases of CMX allocator run
The final loop tries to deallocate data objects that keep shape values for
other data objects that're outputs of a model. But the case when allocator
takes only CMX data into consideration was not handled and since allocation
could not happen, it led to a failure on deallocation of a data object that has
not been allocated.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* [IE][VPU]: Fixes allocator with work on data to shape edges
Since there is new relationship between data objects: some
data objects may contain shape of other data object - allocator
must properly respect that. The thing is if 2 data objects are
connected in such a way, they represent a single entity (dynamic
data object) and should have the same lifetime.
Signed-off-by: Gladilov, Gleb <gleb.gladilov@intel.com>
* Added ie_core_read_network_from_memory to the C ie_bridge.
* Added size argument for xml_content, fixed const correctness of the weight_blob, fixed unit test
* Removed debug message
* Changed variables names from model_xxx to weights_xxx to be more consistent with the argument name of the tested function.
* Added a description for xml_content_size in ie_core_read_network_from_memory.
* xml_content is now passed as uint8_t
* reading function factorized in the unit-test
* Small fix in the transformation ConvertGroupedStridedSlice. Now VariadicSplit is generated only in the case when node has at least 2 output nodes.
* Added unittests for the case when there is only one StridedSlice.
* Fixed ONNX Mask-RCNN conversion
* Fixed validate_and_infer_types for NMS ops: added check for number of connected inputs
* Updated NMS ops to properly handle optional input with index 2
* Fixed typo in the implementation
* Updated ConvertStridedSliceToStridedSliceIE transformation to support dynamic shapes
* Fixed StridedSlice to Crop transform not to fail with dynamic shapes
CumSum performs cumulative summation of the input elements along the given axis.
Details:
By default, it will do the sum inclusively meaning the first element is
copied as is. Through an "exclusive" attribute, this behavior can change
to exclude the first element. It can also perform summation in the
opposite direction of the axis. For that, set "reverse" attribute to
true.
JIRA: 29994
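The two attributes can be illustrated with a small pure-Python sketch for a 1-D input. The function name and signature are assumptions for the example. "Exclusive" is read here as "each output is the sum of the elements strictly before it", which is one interpretation of the description above.

```python
def cumsum(values, exclusive=False, reverse=False):
    """Cumulative sum of a 1-D list with optional exclusive and reverse modes."""
    seq = list(reversed(values)) if reverse else list(values)
    out = []
    running = 0
    for v in seq:
        if exclusive:
            out.append(running)  # sum of the elements strictly before this one
            running += v
        else:
            running += v         # inclusive: the first element is copied as is
            out.append(running)
    return list(reversed(out)) if reverse else out


data = [1, 2, 3, 4]
inclusive = cumsum(data)
exclusive = cumsum(data, exclusive=True)
reversed_sum = cumsum(data, reverse=True)
both = cumsum(data, exclusive=True, reverse=True)
```

Under these assumptions the four results are `[1, 3, 6, 10]`, `[0, 1, 3, 6]`, `[10, 9, 7, 4]` and `[9, 7, 4, 0]`.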
* [VPU][GT] Extract order manipulation into separate methods
* [VPU][GT] Rename data -> dependency
* [VPU][GT] Extend unit tests
* [VPU][GT] Introduce replacement and removal methods for StageDependency
* [VPU][GT] Update DataToShape connection methods
This change enables int8/uint8 standalone activation to use optimized
block format (b_fs_yx_fsv16). This should eliminate cases where such
activation had reorders before and after.
Support for this is already provided by activation_kernel_ref implementation.
Related JIRA: CVS-28494
* Updated Mul Add conversion to support dynamic shapes
* Keep changes
* Fix for cases when eltwise performs broadcasting via Constant
* Added comments;Fixed eltwise shape infer; Updated tests
When stdout is not a terminal, Python will buffer it by default. This
means that a consumer of MO's output will not see the argument information
until the buffer is flushed, which will normally only happen once MO
finishes (which might take a while).
Flushing stdout explicitly allows the consumer to see this info as soon
as it's printed.
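A minimal sketch of the idea, using only the standard `print(..., flush=True)` facility. The helper name and the sample argument lines are hypothetical and are not taken from the Model Optimizer sources.

```python
import io
import sys


def print_arguments(lines, stream=None):
    """Print each line and flush right away so a piped consumer sees it immediately."""
    stream = sys.stdout if stream is None else stream
    for line in lines:
        # flush=True forces the buffered text out even when the stream is not a terminal.
        print(line, file=stream, flush=True)


# Demonstration against an in-memory stream instead of a real pipe.
buffer = io.StringIO()
print_arguments(["Model Optimizer arguments:", "  example_argument: value"], stream=buffer)
```

Calling `sys.stdout.flush()` once after a group of ordinary `print` calls has the same effect with fewer system calls.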
If you want to contribute to the project documentation and make it better, your help is very welcome.
This guide explains how you can offer feedback and contribute to the documentation.
## Contribute in Multiple Ways
There are multiple ways to help improve our documentation:
* [Log an issue](https://jira.devtools.intel.com/projects/CVS/issues): Enter an issue for the OpenVINO™ documentation component for minor issues such as typos.
* Make a suggestion: Send your documentation suggestion to the mailing list.
* Contribute via GitHub: Submit pull requests in the [GitHub](https://github.com/openvinotoolkit/openvino/tree/master/docs) documentation repository.
## Contribute via GitHub
Use the following steps to contribute to the OpenVINO™ Toolkit documentation.
### Use Documentation Guidelines
The documentation for our project is written using Markdown. Use our [guidelines](./docs/documentation_guidelines.md) and best practices to write consistent, readable documentation.
> **NOTE**: Please check if that information can be added to existing documents instead of creating a new one.
1. Fork the [OpenVINO™ Toolkit](https://github.com/openvinotoolkit/openvino) repository.
2. Create a new branch.
3. Create a new markdown file in an appropriate folder.
> **REQUIRED**: The document title must contain a document label in a form: `{#openvino_docs_<name>}`. For example: `Deep Learning Network Intermediate Representation and Operation Sets in OpenVINO™ {#openvino_docs_MO_DG_IR_and_opsets}`.
4. Add your file to the documentation structure. Open the documentation structure file [docs/doxygen/ie_docs.xml](./docs/doxygen/ie_docs.xml) and add your file path to the appropriate section.
5. Commit changes to your branch.
6. Create a pull request.
7. Once the pull request is created, automatic checks are started. All checks must pass to continue.
8. Discuss, review, and update your contributions.
9. Get merged once the maintainer approves.
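The label requirement from step 3 can be checked before you open a pull request. The following is a hedged sketch and not part of the repository tooling: the helper name and the regular expression are assumptions based only on the `{#openvino_docs_<name>}` form quoted above.

```python
import re

# Matches a label such as {#openvino_docs_MO_DG_IR_and_opsets} anywhere in the title.
LABEL_PATTERN = re.compile(r"\{#openvino_docs_\w+\}")


def has_document_label(title_line):
    """Return True if the title line contains a document label of the required form."""
    return bool(LABEL_PATTERN.search(title_line))


good_title = ("# Deep Learning Network Intermediate Representation and "
              "Operation Sets in OpenVINO™ {#openvino_docs_MO_DG_IR_and_opsets}")
bad_title = "# My New Document"
results = (has_document_label(good_title), has_document_label(bad_title))
```

Here `results` is `(True, False)`: the first title carries the label and the second does not.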
### Edit Existing Document
1. Fork the [OpenVINO™ Toolkit](https://github.com/openvinotoolkit/openvino) repository.
2. Create a new branch.
3. Edit the documentation markdown file and commit changes to the branch.
4. Create a pull request.
5. Once the pull request is created, automatic checks are started. All checks must pass to continue.
6. Discuss, review, and update your contributions.
7. Get merged once the maintainer approves.
### Delete Document from the Documentation
1. Fork the [OpenVINO™ Toolkit](https://github.com/openvinotoolkit/openvino) repository.
2. Create a new branch.
3. Remove the documentation file.
4. Remove your file from the documentation structure. Open the documentation structure file [docs/doxygen/ie_docs.xml](./docs/doxygen/ie_docs.xml) and remove all occurrences of your file path.
5. Remove all references to that file from other documents or replace them with links to alternative topics (if any).
6. Commit changes to your branch.
7. Create a pull request.
8. Once the pull request is created, automatic checks are started. All checks must pass to continue.
9. Discuss, review, and update your contributions.
The Intel® Distribution of OpenVINO™ toolkit supports neural network model layers in multiple frameworks including TensorFlow*, Caffe*, MXNet*, Kaldi* and ONNX*. The list of known layers is different for each of the supported frameworks. To see the layers supported by your framework, refer to [supported frameworks](../MO_DG/prepare_model/Supported_Frameworks_Layers.md).
Custom layers are layers that are not included in the list of known layers. If your topology contains any layers that are not in the list of known layers, the Model Optimizer classifies them as custom.
This guide illustrates the workflow for running inference on topologies featuring custom layers, allowing you to plug in your own implementation for existing or completely new layers.
For a step-by-step example of creating and executing a custom layer, see the [Custom Layer Implementation Tutorials for Linux and Windows](https://github.com/david-drew/OpenVINO-Custom-Layers/tree/master/2019.r2.0).
## Terms used in this guide
- *Layer* — The abstract concept of a math function that is selected for a specific purpose (relu, sigmoid, tanh, convolutional). This is one of a sequential series of building blocks within the neural network.
- *Kernel* — The implementation of a layer function, in this case, the math programmed (in C++ and Python) to perform the layer operation for target hardware (CPU or GPU).
- *Intermediate Representation (IR)* — Neural Network used only by the Inference Engine in OpenVINO abstracting the different frameworks and describing topology, layer parameters and weights.
The original format will be a supported framework such as TensorFlow, Caffe, or MXNet.
- *Model Extension Generator* — Generates template source code files for each of the extensions needed by the Model Optimizer and the Inference Engine.
- *Inference Engine Extension* — Device-specific module implementing custom layers (a set of kernels).
## Custom Layer Overview
The [Model Optimizer](../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md) searches the list of known layers for each layer contained in the input model topology before building the model's internal representation, optimizing the model, and producing the Intermediate Representation files.
The [Inference Engine](../IE_DG/Deep_Learning_Inference_Engine_DevGuide.md) loads the layers from the input model IR files into the specified device plugin, which will search a list of known layer implementations for the device. If your topology contains layers that are not in the list of known layers for the device, the Inference Engine considers the layer to be unsupported and reports an error. To see the layers that are supported by each device plugin for the Inference Engine, refer to the [Supported Devices](../IE_DG/supported_plugins/Supported_Devices.md) documentation.
<br>
> **NOTE:** If a device doesn't support a particular layer, an alternative to creating a new custom layer is to target an additional device using the HETERO plugin. The [Heterogeneous Plugin](../IE_DG/supported_plugins/HETERO.md) may be used to run an inference model on multiple devices, allowing the layers that are unsupported on one device to fall back to another device (e.g., CPU) that does support them.
## Custom Layer Implementation Workflow
When implementing a custom layer for your pre-trained model in the Intel® Distribution of OpenVINO™ toolkit, you will need to add extensions to both the Model Optimizer and the Inference Engine.
## Custom Layer Extensions for the Model Optimizer
The following figure shows the basic processing steps for the Model Optimizer highlighting the two necessary custom layer extensions, the Custom Layer Extractor and the Custom Layer Operation.

The Model Optimizer first extracts information from the input model which includes the topology of the model layers along with parameters, input and output format, etc., for each layer. The model is then optimized from the various known characteristics of the layers, interconnects, and data flow which partly comes from the layer operation providing details including the shape of the output for each layer. Finally, the optimized model is output to the model IR files needed by the Inference Engine to run the model.
The Model Optimizer starts with a library of known extractors and operations for each [supported model framework](../MO_DG/prepare_model/Supported_Frameworks_Layers.md) which must be extended to use each unknown custom layer. The custom layer extensions needed by the Model Optimizer are:
- Custom Layer Extractor
- Responsible for identifying the custom layer operation and extracting the parameters for each instance of the custom layer. The layer parameters are stored per instance and used by the layer operation before finally appearing in the output IR. Typically the input layer parameters are unchanged, which is the case covered by this tutorial.
- Custom Layer Operation
- Responsible for specifying the attributes that are supported by the custom layer and computing the output shape for each instance of the custom layer from its parameters. <br> The `--mo-op` command-line argument shown in the examples below generates a custom layer operation for the Model Optimizer.
## Custom Layer Extensions for the Inference Engine
The following figure shows the basic flow for the Inference Engine highlighting two custom layer extensions for the CPU and GPU Plugins, the Custom Layer CPU extension and the Custom Layer GPU Extension.

Each device plugin includes a library of optimized implementations to execute known layer operations which must be extended to execute a custom layer. The custom layer extension is implemented according to the target device:
- Custom Layer CPU Extension
- A compiled shared library (.so or .dll binary) needed by the CPU Plugin for executing the custom layer on the CPU.
- Custom Layer GPU Extension
- OpenCL source code (.cl) for the custom layer kernel that will be compiled to execute on the GPU along with a layer description file (.xml) needed by the GPU Plugin for the custom layer kernel.
## Model Extension Generator
Using answers to interactive questions or a *.json* configuration file, the Model Extension Generator tool generates template source code files for each of the extensions needed by the Model Optimizer and the Inference Engine. To complete the implementation of each extension, the template functions may need to be edited to fill in details specific to the custom layer or the actual custom layer functionality itself.
### Command-line
The Model Extension Generator is included in the Intel® Distribution of OpenVINO™ toolkit installation and is run using the command (here with the "--help" option):
```bash
python3 /opt/intel/openvino/deployment_tools/tools/extension_generator/extgen.py new --help
```
where the output will appear similar to:
```
usage: You can use any combination of the following arguments:
Arguments to configure extension generation in the interactive mode:
optional arguments:
-h, --help show this help message and exit
--mo-caffe-ext generate a Model Optimizer Caffe* extractor
--mo-mxnet-ext generate a Model Optimizer MXNet* extractor
--mo-tf-ext generate a Model Optimizer TensorFlow* extractor
--mo-op generate a Model Optimizer operation
--ie-cpu-ext generate an Inference Engine CPU extension
--ie-gpu-ext generate an Inference Engine GPU extension
--output_dir OUTPUT_DIR
set an output directory. If not specified, the current
directory is used by default.
```
The command-line arguments specify which extension templates to generate for the Model Optimizer or the Inference Engine. The generated extension files for each argument will appear starting from the top of the output directory as follows:
The workflow for each generated extension follows the same basic steps:

**Step 1: Generate:** Use the Model Extension Generator to generate the Custom Layer Template Files.
**Step 2: Edit:** Edit the Custom Layer Template Files as necessary to create the specialized Custom Layer Extension Source Code.
**Step 3: Specify:** Specify the custom layer extension locations to be used by the Model Optimizer or Inference Engine.
## Caffe\* Models with Custom Layers <a name="caffe-models-with-custom-layers"></a>
If your Caffe\* model has custom layers:
**Register the custom layers as extensions to the Model Optimizer**. For instructions, see [Extending Model Optimizer with New Primitives](../MO_DG/prepare_model/customize_model_optimizer/Extending_Model_Optimizer_with_New_Primitives.md). When your custom layers are registered as extensions, the Model Optimizer generates a valid and optimized Intermediate Representation. You will need a bit of Python\* code that lets the Model Optimizer:
- Generate a valid Intermediate Representation according to the rules you specified.
- Be independent from the availability of Caffe on your computer.
If your model contains Custom Layers, it is important to understand the internal workflow of the Model Optimizer. Consider the following example.
**Example**:
The network has:
* One input layer (#1)
* One output layer (#5)
* Three internal layers (#2, 3, 4)
The custom and standard layer types are:
* Layers #2 and #5 are implemented as Model Optimizer extensions.
* Layers #1 and #4 are supported in Model Optimizer out of the box.
* Layer #3 is neither in the list of supported layers nor in extensions, but is specified in `CustomLayersMapping.xml`.
> **NOTE**: If any of the layers are not in one of the three categories described above, the Model Optimizer fails with an appropriate message and a link to the corresponding question in [Model Optimizer FAQ](../MO_DG/prepare_model/Model_Optimizer_FAQ.md).
**Step 1:** The example model is fed to the Model Optimizer that **loads the model** with the special parser built on top of the `caffe.proto` file. In case of failure, the Model Optimizer asks you to prepare the parser that can read the model. For more information, refer to the Model Optimizer <a href="MO_FAQ.html#FAQ1">FAQ #1</a>.
**Step 2:** The Model Optimizer **extracts the attributes of all layers** by going through the list of layers and attempting to find the appropriate extractor. In order of priority, the Model Optimizer checks if the layer is:
* A. Registered as a Model Optimizer extension
* B. Registered as a standard Model Optimizer layer
When the Model Optimizer finds a satisfying condition from the list above, it extracts the attributes according to the following rules:
* For A. - takes only the parameters specified in the extension
* For B. - takes only the parameters specified in the standard extractor
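
The priority rule above can be sketched in a few lines of Python. This is only an illustration, not Model Optimizer code: the registries, layer names, and the `pick_extractor` helper are hypothetical and stand in for whatever internal tables the Model Optimizer actually keeps.

```python
from typing import Callable, Dict, Optional, Tuple

Extractor = Callable[[dict], dict]


def pick_extractor(
    layer_type: str,
    extensions: Dict[str, Extractor],
    standard: Dict[str, Extractor],
) -> Optional[Tuple[str, Extractor]]:
    """Return (source, extractor) from the first registry that knows the layer type."""
    # Extensions (A) are checked before standard layers (B).
    for source, registry in (("extension", extensions), ("standard", standard)):
        if layer_type in registry:
            return source, registry[layer_type]
    return None


# Hypothetical registries: "Proposal" exists in both, so the extension wins.
extensions = {"Proposal": lambda p: {"feat_stride": p["feat_stride"]}}
standard = {
    "Proposal": lambda p: dict(p),
    "Convolution": lambda p: {"kernel": p["kernel"]},
}

params = {"feat_stride": 16, "kernel": 3, "unused": True}
source, extractor = pick_extractor("Proposal", extensions, standard)
proposal_attrs = extractor(params)  # only the parameters named in the extension
```

A layer type found in neither registry yields `None`, which corresponds to the failure case described in the note above.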
<br>
**Step 3:** The Model Optimizer **calculates the output shape of all layers**. The logic is the same as it is for the priorities. **Important:** the Model Optimizer always takes the first available option.
**Step 4:** The Model Optimizer **optimizes the original model and produces the two Intermediate Representation (IR) files in .xml and .bin**.
<br>
## TensorFlow\* Models with Custom Layers <a name="Tensorflow-models-with-custom-layers"></a>
You have two options for TensorFlow\* models with custom layers:
<br>
* **Register those layers as extensions to the Model Optimizer.** In this case, the Model Optimizer generates a valid and optimized Intermediate Representation.
* **If you have sub-graphs that should not be expressed with the analogous sub-graph in the Intermediate Representation, but another sub-graph should appear in the model, the Model Optimizer provides such an option.** This feature is helpful for many TensorFlow models. To read more, see [Sub-graph Replacement in the Model Optimizer](../MO_DG/prepare_model/customize_model_optimizer/Subgraph_Replacement_Model_Optimizer.md).
## MXNet\* Models with Custom Layers <a name="mxnet-models-with-custom-layers"></a>
There are two options to convert your MXNet* model that contains custom layers:
1. Register the custom layers as extensions to the Model Optimizer. For instructions, see [Extending MXNet Model Optimizer with New Primitives](../MO_DG/prepare_model/customize_model_optimizer/Extending_MXNet_Model_Optimizer_with_New_Primitives.md). When your custom layers are registered as extensions, the Model Optimizer generates a valid and optimized Intermediate Representation. You can create Model Optimizer extensions for both MXNet layers with op `Custom` and layers which are not standard MXNet layers.
2. If you have sub-graphs that should not be expressed with the analogous sub-graph in the Intermediate Representation, but another sub-graph should appear in the model, the Model Optimizer provides such an option. In MXNet, this feature is actively used for SSD models, where it provides an opportunity to search for the necessary sub-graph sequences and replace them. To read more, see [Sub-graph Replacement in the Model Optimizer](../MO_DG/prepare_model/customize_model_optimizer/Subgraph_Replacement_Model_Optimizer.md).
## Kaldi\* Models with Custom Layers <a name="Kaldi-models-with-custom-layers"></a>
For information on converting your Kaldi* model containing custom layers, see [Converting a Kaldi Model in the Model Optimizer Developer Guide](../MO_DG/prepare_model/convert_model/Convert_Model_From_Kaldi.md).
## ONNX\* Models with Custom Layers <a name="ONNX-models-with-custom-layers"></a>
For information on converting your ONNX* model containing custom layers, see [Converting an ONNX Model in the Model Optimizer Developer Guide](../MO_DG/prepare_model/convert_model/Convert_Model_From_ONNX.md).
## Step-by-Step Custom Layers Tutorial
For a step-by-step walk-through of creating and executing a custom layer, see [Custom Layer Implementation Tutorial for Linux and Windows](https://github.com/david-drew/OpenVINO-Custom-Layers/tree/master/2019.r2.0).
## Additional Resources
- Intel® Distribution of OpenVINO™ toolkit home page: [https://software.intel.com/en-us/openvino-toolkit](https://software.intel.com/en-us/openvino-toolkit)
Each group contains [parameterized](https://github.com/google/googletest/blob/master/googletest/docs/advanced.md) tests. The main idea is that to add a new test, you only need to add a new parameter, except for scenarios that differ from the generalized case.
## Classification and Detection tests
These groups contain two cases:
* For the generalized scenario (`VpuNoClassificationRegression`, `VpuNoDetectionRegression`)
* For specific scenarios (`VpuNoClassificationRegressionSpecific`, `VpuNoDetectionRegressionSpecific`)
### Generalized scenario
If you want to test a new parameter (batch, precision, model, and so on), edit the existing initialization of parameterized tests or create a new one.
If you need a test to perform some actions that are not provided in the generalized scenario, add a specific test case. As with the generalized scenario, you can change parameters for these tests.
There is no generalized scenario and recommendations are the same as for specific test cases for Classification/Detection groups.
## Compilation tests
The tests are in the `vpu_classification_regression.cpp` file and contain only one scenario, `VpuNoRegressionWithCompilation`. To add a new test, update the parameters as in the generalized scenario of the Classification/Detection test groups.
Bfloat16 inference in the Inference Engine is implemented on CPU and requires a processor that supports the `avx512_bf16` instruction set, and therefore the bfloat16 data format.
## Introduction
Bfloat16 (referred to as BF16) is the Brain Floating-Point format with 16 bits. It is a truncated 16-bit version of the 32-bit IEEE 754 single-precision floating-point format (FP32). BF16 keeps the same 8 exponent bits as FP32 but reduces the sign and mantissa from 24 bits to 8 bits, that is, it keeps the sign bit and shortens the mantissa from 23 bits to 7 bits.
![bf16_format]
Preserving the exponent bits gives BF16 the same range as FP32 (~1e-38 to ~3e38). This simplifies conversion between the two data types: you just need to skip or flush to zero the 16 low bits.
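
The conversion described above can be illustrated with a short Python sketch. It is not Inference Engine code; the helper names are made up for this example, and it uses plain truncation (no rounding) to mirror the "skip the 16 low bits" description.

```python
import struct


def fp32_to_bf16_bits(value: float) -> int:
    """Truncate an FP32 value to BF16 by dropping the 16 low bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits >> 16


def bf16_bits_to_fp32(bits16: int) -> float:
    """Widen BF16 back to FP32 by filling the 16 low bits with zeros."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


one_bits = fp32_to_bf16_bits(1.0)                            # sign, exponent, 7 mantissa bits
pi_bf16 = bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265))   # loses low mantissa bits
```

Here `pi_bf16` keeps only the leading digits of pi, which shows the precision loss from the shorter mantissa while the magnitude is unchanged.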
The truncated mantissa occasionally leads to lower precision, but according to [investigations](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus), neural networks are more sensitive to the size of the exponent than to the size of the mantissa. Also, in many models, precision is needed close to zero but not so much at the maximum range.
Another useful feature of BF16 is the possibility to encode an INT8 value in BF16 without loss of accuracy, because the INT8 range fits completely in the BF16 mantissa field. This reduces data flow when converting INT8 input image data to BF16 directly, without an intermediate representation in FP32, or when combining [INT8 inference](Int8Inference.md) with BF16 layers.
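
The INT8 claim can be checked with a small Python sketch. It assumes the same truncating conversion as described earlier, and the helper name is hypothetical: every signed 8-bit value survives a round trip through BF16, while a value that needs more significant bits does not.

```python
import struct


def round_trip_through_bf16(value: float) -> float:
    """Convert FP32 -> BF16 -> FP32 by zeroing the 16 low bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (result,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return result


# Collect every signed INT8 value that changes after the round trip.
lossy = [v for v in range(-128, 128) if round_trip_through_bf16(float(v)) != float(v)]

# 257 needs 9 significant bits, which is more than BF16 can hold.
beyond_int8 = round_trip_through_bf16(257.0)
```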
See [Intel's site](https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf) for more bfloat16 format details.
There are two ways to check if a CPU device can support bfloat16 computations for models:
1. Query the instruction set via system `lscpu | grep avx512_bf16` or `cat /proc/cpuinfo | grep avx512_bf16`.
2. Use [Query API](InferenceEngine_QueryAPI.md) with `METRIC_KEY(OPTIMIZATION_CAPABILITIES)`, which should return `BF16` in the list of CPU optimization options:
The current Inference Engine solution for bfloat16 inference uses Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and supports inference of the following layers in BF16 computation mode:
* Convolution
* FullyConnected
* InnerProduct
* LRN
* Pooling
This means that BF16 inference can only be performed with the CPU plugin on the layers listed above. All other layers are executed in FP32.
## Lowering Inference Precision
Lowering precision to increase performance is [widely used](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html) for optimization of inference. Using the bfloat16 data type on CPU opens, for the first time, the possibility of a default optimization approach.
This approach uses the optimization capabilities of the current platform to achieve maximum performance while keeping the accuracy of calculations within an acceptable range.
Bfloat16 data usage provides the following benefits that increase performance:
1. Faster multiplication of two BF16 numbers because of the shorter mantissa of bfloat16 data.
2. No need to support denormals or handle exceptions, which is a performance optimization.
3. Fast conversion of float32 to bfloat16 and vice versa.
4. Reduced size of data in memory. As a result, larger models fit in the same memory bounds.
5. Reduced amount of data that must be transferred. As a result, data transfer time is reduced.
For default optimization on CPU, the source model is converted from FP32 or FP16 to BF16 and executed internally on platforms with native BF16 support. In that case, `KEY_ENFORCE_BF16` is set to `YES`.
The code below demonstrates how to check if the key is set:
To disable BF16 internal transformations, set `KEY_ENFORCE_BF16` to `NO`. In this case, the model is inferred as is, without modifications, with the precisions that were set on each layer edge.
An exception with the message `Platform doesn't support BF16 format` is thrown if `KEY_ENFORCE_BF16` is set to `YES` on a CPU without native BF16 support.
Low-Precision 8-bit integer models do not convert to BF16, even if bfloat16 optimization is set by default.
## Performance Counters
Information about layer precision is stored in the performance counters that are
available from the Inference Engine API. The layers have the following marks:
* Suffix `BF16` for layers that had bfloat16 data type input and were computed in BF16 precision
* Suffix `FP32` for layers computed in 32-bit precision
For example, the performance counters table for the Inception model can look as follows:
Cross Check Tool is a console application that enables comparing accuracy and performance metrics for two successive
model inferences that are performed
on two different supported Intel® devices or with different precisions.
The Cross Check Tool can compare metrics per layer or for the whole model.
On Linux* OS, before running the Cross Check Tool binary, make sure your application can find the
Deep Learning Inference Engine libraries.
Navigate to the `<INSTALL_DIR>/deployment_tools/inference_engine/bin` folder and run the `setvars.sh` script to
set all necessary environment variables:
```sh
source setvars.sh
```
## Running the Cross Check Tool
Cross Check Tool is distributed as a binary file and there is no need to build it. To run the Cross Check Tool,
execute the tool's binary file with necessary parameters. Please note that the Inference Engine assumes that weights
are in the same folder as the _.xml_ file.
You can get the list of all available options using the `-h` option:
```sh
$./cross_check_tool -h
InferenceEngine:
API version ............ 1.0
Build .................. ###
[ INFO ] Parsing input parameters
./cross_check_tool [OPTION]
Options:
-h Prints a usage message.
-i "<path>" Optional. Path to an input image file or multi-input file to infer. Generates input(s) from normal distribution if empty
-m "<path>" Required. Path to an .xml file that represents the first IR of the trained model to infer.
-l "<absolute_path>" Required for MKLDNN (CPU)-targeted custom layers. Absolute path to a shared library with the kernels implementation.
Or
-c "<absolute_path>" Required for clDNN (GPU)-targeted custom kernels. Absolute path to the xml file with the kernels description.
-conf "<path>" Optional. Path to config file for -d device plugin
-ref_conf "<path>" Optional. Path to config file for -ref_d device plugin
-pp "<path>" Optional. Path to a plugin folder.
-d "<device>" Required. The first target device to infer the model specified with the -m option. CPU, GPU, HDDL or MYRIAD is acceptable.
-ref_m "<path>" Optional. Path to an .xml file that represents the second IR in different precision to compare the metrics.
-ref_d "<device>" Required. The second target device to infer the model and compare the metrics. CPU, GPU, HDDL or MYRIAD is acceptable.
-layers "<options>" Defines layers to check. Options: all, None - for output layers check, list of comma-separated layer names to check. Default value is None.
-eps "<float>" Optional. Threshold for filtering out those blob statistics that do not statify the condition: max_abs_diff < eps.
-dump Enables blobs statistics dumping
-load "<path>" Path to a file to load blobs from
```
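
To make the `-eps` condition concrete, the following Python sketch computes a `max_abs_diff` value per layer and evaluates `max_abs_diff < eps`. It is only an illustration under stated assumptions: the layer names and values are invented, blobs are modeled as flat lists, and the sketch does not reproduce how the tool itself computes or reports its statistics.

```python
from typing import Dict, Sequence


def max_abs_diff(blob_a: Sequence[float], blob_b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equally sized blobs."""
    if len(blob_a) != len(blob_b):
        raise ValueError("blobs must have the same number of elements")
    return max(abs(a - b) for a, b in zip(blob_a, blob_b))


# Hypothetical per-layer outputs from two devices.
reference: Dict[str, Sequence[float]] = {"conv1": [0.10, 0.20, 0.30], "fc": [1.00, 2.00]}
target: Dict[str, Sequence[float]] = {"conv1": [0.10, 0.21, 0.30], "fc": [1.00, 2.50]}

eps = 0.05
within_eps = {
    name: max_abs_diff(reference[name], target[name]) < eps for name in reference
}
```

In this made-up data, `conv1` satisfies the condition and `fc` does not.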
### Examples
1. To check per-layer accuracy and performance of inference in FP32 precision on the CPU against the GPU, run:
- [nGraph](../nGraph_DG/nGraph_dg.md) — graph representation and manipulation engine which is used to represent a model inside Inference Engine and allows the run-time model construction without using Model Optimizer.
* [OpenCV](https://docs.opencv.org/) — OpenCV* community version compiled for Intel® hardware.
Includes PVL libraries for computer vision.
* Drivers and runtimes for OpenCL™ version 2.1
* [Intel® Media SDK](https://software.intel.com/en-us/media-sdk)
* [OpenVX*](https://software.intel.com/en-us/cvsdk-ovx-guide) — Intel's implementation of OpenVX*
optimized for running on Intel® hardware (CPU, GPU, IPU).
* [Demos and samples](Samples_Overview.md).
This guide provides an overview of the Inference Engine, describing the typical workflow for performing
inference of a pre-trained and optimized deep learning model and a set of sample applications.
> **NOTES:**
> - Before you perform inference with the Inference Engine, your models should be converted to the Inference Engine format using the Model Optimizer or built directly in run-time using nGraph API. To learn about how to use Model Optimizer, refer to the [Model Optimizer Developer Guide](../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md). To learn about the pre-trained and optimized models delivered with the OpenVINO™ toolkit, refer to [Pre-Trained Models](@ref omz_models_intel_index).
> - [Intel® System Studio](https://software.intel.com/en-us/system-studio) is an all-in-one, cross-platform tool suite, purpose-built to simplify system bring-up and improve system and IoT device application performance on Intel® platforms. If you are using the Intel® Distribution of OpenVINO™ with Intel® System Studio, go to [Get Started with Intel® System Studio](https://software.intel.com/en-us/articles/get-started-with-openvino-and-intel-system-studio-2019).
## Table of Contents
* [Inference Engine API Changes History](API_Changes.md)
* [Introduction to Inference Engine](inference_engine_intro.md)
The Inference Engine Extension API allows you to register operation sets (opsets) with custom nGraph operations, which makes it possible to support networks with unknown operations.
## Operation Class
To add your custom nGraph operation, create a new class that extends `ngraph::Op`, which is in turn derived from `ngraph::Node`, the base class for all graph operations in nGraph. Follow the steps below:
1. Define a `NodeTypeInfo` object that identifies the type of the operation to the graph users and helps with dynamic type resolution. The type info of an nGraph operation currently consists of a string identifier and a version number, but this may change in the future.
2. Implement constructors that can optionally take the operation inputs and attributes as parameters.
3. Override the shape inference method `validate_and_infer_types`. This method is called multiple times during graph manipulations to determine the shapes and element types of the outputs of the operations. You can access the input shapes through the `get_input_partial_shape()` method and input element types through the `get_input_element_type()` method of `ngraph::Node`. Set the inferred shape and element type of the output using `set_output_type`.
4. Override the `clone_with_new_inputs` method, which allows graph manipulation routines to create copies of this operation and connect it to different nodes during optimization.
5. Override the `visit_attributes` method, which allows serialization and deserialization of attributes. An `AttributeVisitor` is passed to the method, and the implementation is expected to walk over all the attributes in the op using the type-aware `on_attribute` helper. Helpers are already implemented for standard C++ types like `int64_t`, `float`, `bool`, `vector` and for existing nGraph defined types.
6. Override `evaluate`, which is an optional method that enables the application of constant folding if there is a custom operation on the constant branch.
Based on that, the declaration of an operation class can look as follows:
@snippet op.hpp op:header
### Class Fields
The provided implementation has several fields:
* `add` of type `int64_t` is an attribute of the custom operation
* `type_info` of type `ngraph::NodeTypeInfo` defines the type and version of the operation
### Operation Constructors
An nGraph operation contains two constructors: a default constructor, which creates the operation without attributes, and a constructor that creates and validates the operation with specified inputs and attributes.
@snippet op.cpp op:ctor
### `validate_and_infer_types()`
The `ngraph::Node::validate_and_infer_types` method validates operation attributes and calculates output shapes using the attributes of the operation.
@snippet op.cpp op:validate
### `clone_with_new_inputs()`
The `ngraph::Node::clone_with_new_inputs` method creates a copy of the nGraph operation with new inputs.
@snippet op.cpp op:copy
### `visit_attributes()`
The `ngraph::Node::visit_attributes` method enables visiting all operation attributes.
@snippet op.cpp op:visit_attributes
### `evaluate()`
The `ngraph::Node::evaluate` method enables constant folding for an operation.
@snippet op.cpp op:evaluate
## Register Custom Operations in Extension Class
To add custom operations to the [Extension](Extension.md) class, create an operation set with custom operations and implement the `InferenceEngine::IExtension::getOpSets` method:
@snippet extension.cpp extension:getOpSets
This method returns a map of opsets that exist in the extension library.
nGraph provides an opset mechanism for operation versioning. Different opsets distinguish between different versions of one operation.
When specifying opset names, follow the rules below:
* Use unique opset names.
* Do not use the following built-in opset names: `extension`, `experimental`, `opset1`, `opset2`.
* Make sure that the Model Optimizer and your extension use the same opset names.
* IR v10 layers have the mandatory `version` attribute specifying the opset.
* `opset1` is the name of the default operation set.
Operations from the default opset cannot be redefined.
Use a custom opset to create a new operation or extend functionality of an existing operation from another opset.
# How to Implement Custom CPU Layers {#openvino_docs_IE_DG_Extensibility_DG_CPU_Kernel}
The primary vehicle for the performance of the CPU codepath in the Inference Engine is the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), and new CPU kernels extend the Inference Engine plugin for the Intel MKL-DNN. Implementing the InferenceEngine::ILayerExecImpl defines a general CPU-side extension. There are no Intel MKL-DNN specifics in the way you need to implement a kernel.
## Implementation Class
All custom kernels for the CPU plugin should be inherited from the InferenceEngine::ILayerExecImpl interface.
Based on that, the declaration of a kernel implementation class can look as follows:
@snippet cpu_kernel.hpp cpu_implementation:header
### Class Fields
The provided implementation has several fields:
* `add` of the type `int64_t` is an attribute of a custom operation
* `inShape` of the type `ngraph::Shape` is an input shape
* `outShape` of the type `ngraph::Shape` is an output shape
* `error` of the type `std::string` is a field to handle errors from a constructor
### Constructor of Implementation
An implementation constructor checks parameters of nGraph operation, stores needed attributes, and stores an error message in the case of an error.
@snippet cpu_kernel.cpp cpu_implementation:ctor
### `getSupportedConfigurations`
InferenceEngine::ILayerExecImpl::getSupportedConfigurations method returns all supported configuration formats (input/output tensor layouts) for your implementation. To specify formats of data, use InferenceEngine::TensorDesc. Refer to the [Memory Primitives](../Memory_primitives.md) section for instructions on how to do it.
InferenceEngine::ILayerExecImpl::init method gets a runtime-selected configuration from a vector that is populated from the `getSupportedConfigurations` method and checks the parameters:
@snippet cpu_kernel.cpp cpu_implementation:init
### `execute`
InferenceEngine::ILayerExecImpl::execute method accepts and processes the actual tensors as input/output blobs:
# How to Implement Custom GPU Layers {#openvino_docs_IE_DG_Extensibility_DG_GPU_Kernel}
The GPU codepath abstracts many details about OpenCL™. You need to provide the kernel code in OpenCL C and the configuration file that connects the kernel and its parameters to the parameters of the layer.
There are two options for using a custom layer configuration file:
* Include a section with your kernels into the global automatically-loaded `cldnn_global_custom_kernels/cldnn_global_custom_kernels.xml` file, which is hosted in the `<INSTALL_DIR>/deployment_tools/inference_engine/bin/intel64/{Debug/Release}` folder
* Call the `InferenceEngine::Core::SetConfig()` method from your application with the `InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE` key and the configuration file name as a value before loading the network that uses custom layers to the plugin:
All Inference Engine samples, except trivial `hello_classification`,
feature a dedicated command-line option `-c` to load custom kernels. For example, to load custom layers for the classification sample, run the command below:
`Kernel` node contains all kernel source code configuration. No kernel
node structure exists.
**Sub-nodes**: `Source` (1+), `Define` (0+)
### Source Node and Sub-node Structure
`Source` node points to a single OpenCL source file.
| Attribute Name | \# | Description |
|-----|-----|-----|
| `filename` | (1) | Name of the file containing OpenCL source code. Notice that path is relative to your executable. Multiple source nodes will have their sources concatenated in order. |
**Sub-nodes**: None
### Define Node and Sub-node Structure
`Define` node configures a single `#‍define` instruction to be added to
the sources during compilation (JIT).
| Attribute Name | \# | Description |
|------|-------|------|
| `name` | (1) | The name of the defined JIT. For static constants, this can include the value as well (taken as a string). |
| `param` | (0/1) | This parameter value is used as the value of this JIT definition. |
| `type` | (0/1) | The parameter type. Accepted values: `int`, `float`, and `int[]`, `float[]` for arrays. |
| `default` | (0/1) | The default value to be used if the specified parameter is missing from the layer in the IR. |
**Sub-nodes:** None
The resulting JIT has the following form:
`#‍define [name] [type] [value/default]`.
### Buffers Node and Sub-node Structure
`Buffers` node configures all input/output buffers for the OpenCL entry
function. No buffers node structure exists.
**Sub-nodes:** `Data` (0+), `Tensor` (1+)
### Data Node and Sub-node Structure
`Data` node configures a single input with static data (for example,
weights or biases).
| Attribute Name | \# | Description |
|----|-----|------|
| `name` | (1) | Name of a blob attached to a layer in the IR |
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to |
**Sub-nodes**: None
### Tensor Node and Sub-node Structure
`Tensor` node configures a single input or output tensor.
| Attribute Name | \# | Description |
|------|-------|-------|
| `arg-index` | (1) | 0-based index in the entry function arguments to be bound to. |
| `type` | (1) | `input` or `output` |
| `port-index` | (1) | 0-based index in the layer’s input/output ports in the IR |
| `format` | (0/1) | Data layout declaration for the tensor. Accepted values: `BFYX`, `BYXF`, `YXFB`, `FYXB` (also in all lowercase). Default value: `BFYX` |
### CompilerOptions Node and Sub-node Structure
`CompilerOptions` node configures the compilation flags for the OpenCL
sources.
| Attribute Name | \# | Description |
|--------|-----|------|
| `options` | (1) | Options string to be passed to the OpenCL compiler |
**Sub-nodes**: None
### WorkSizes Node and Sub-node Structure
`WorkSizes` node configures the global/local work sizes to be used when
queuing the OpenCL program for execution.
| Attribute Name | \# | Description |
|-----|------|-----|
| `global`<br>`local` | (0/1)<br>(0/1) | An array of up to 3 integers (or formulas) for defining the OpenCL work-sizes to be used during execution.<br> The formulas can use the values of the B,F,Y,X dimensions and contain the operators: +,-,/,\*,% (all evaluated in integer arithmetic). <br>Default value: `global="B*F*Y*X" local=""` |
| `dim` | (0/1) | A tensor to take the work size from. Accepted values: `input N`, `output`, where `N` is an index of input tensor starting with 0. Default value: `output` |
**Sub-nodes**: None
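
As an illustration of how such work-size formulas evaluate, here is a minimal Python sketch. It is not the plugin's parser: the `eval_formula` and `eval_work_size` helpers are hypothetical, the comma-separated array syntax is an assumption, and Python floor division stands in for integer division (the two agree for the non-negative values used here).

```python
import ast
import operator

# Supported binary operators; "/" is mapped to integer (floor) division.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.floordiv,
    ast.Mod: operator.mod,
}


def eval_formula(formula: str, dims: dict) -> int:
    """Evaluate one formula over the B, F, Y, X dimension values."""

    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name) and node.id in dims:
            return dims[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        raise ValueError(f"unsupported expression in formula: {formula!r}")

    return walk(ast.parse(formula.strip(), mode="eval").body)


def eval_work_size(spec: str, dims: dict) -> list:
    """Evaluate a comma-separated list of formulas (assumed array syntax)."""
    return [eval_formula(part, dims) for part in spec.split(",") if part.strip()]


dims = {"B": 1, "F": 96, "Y": 55, "X": 55}
default_global = eval_work_size("B*F*Y*X", dims)  # the documented default
three_d = eval_work_size("X,Y,B*F", dims)         # one value per work dimension
quarter = eval_work_size("Y*X/4", dims)           # integer division drops the remainder
```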
## Example Configuration File
The following code sample provides an example configuration file (in the
`.xml` format). For information on configuration file structure, see
The following table includes definitions that are attached before
the user sources, where `<TENSOR>` is the actual input and output, for
example, `INPUT0` or `OUTPUT0`.
For an example, see [Example Kernel](#example-kernel).
| Name | Value |
|---|---|
| `NUM_INPUTS` | Number of the input tensors bound to this kernel |
| `GLOBAL_WORKSIZE` | An array of global work sizes used to execute this kernel |
| `GLOBAL_WORKSIZE_SIZE` | The size of the `GLOBAL_WORKSIZE` array |
| `LOCAL_WORKSIZE` | An array of local work sizes used to execute this kernel |
| `LOCAL_WORKSIZE_SIZE` | The size of the `LOCAL_WORKSIZE` array |
| `<TENSOR>_DIMS`| An array of the tensor dimension sizes. Always ordered as `BFYX` |
| `<TENSOR>_DIMS_SIZE`| The size of the `<TENSOR>_DIMS` array.|
| `<TENSOR>_TYPE`| The datatype of the tensor: `float`, `half`, or `char`|
| `<TENSOR>_FORMAT_` | The format of the tensor, BFYX, BYXF, YXFB , FYXB, or ANY. The format is concatenated to the defined name. You can use the tensor format to define codepaths in your code with `#‍ifdef/#‍endif`. |
| `<TENSOR>_LOWER_PADDING` | An array of padding elements used for the tensor dimensions before they start. Always ordered as BFYX.|
| `<TENSOR>_LOWER_PADDING_SIZE` | The size of the `<TENSOR>_LOWER_PADDING` array |
| `<TENSOR>_UPPER_PADDING` | An array of padding elements used for the tensor dimensions after they end. Always ordered as BFYX. |
| `<TENSOR>_UPPER_PADDING_SIZE` | The size of the `<TENSOR>_UPPER_PADDING` array |
| `<TENSOR>_PITCHES` | The number of elements between adjacent elements in each dimension. Always ordered as BFYX.|
| `<TENSOR>_PITCHES_SIZE`| The size of the `<TENSOR>_PITCHES` array |
| `<TENSOR>_OFFSET`| The number of elements from the start of the tensor to the first valid element (bypassing the lower padding) |
All `<TENSOR>` values are automatically defined for every tensor
bound to this layer (`INPUT0`, `INPUT1`, `OUTPUT0`, and so on), as shown
in the following example:
```c
#define INPUT0_DIMS_SIZE 4
#define INPUT0_DIMS (int []){ 1,96,55,55, }
```
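
To show how the `_PITCHES` and `_OFFSET` definitions relate to element addressing, the Python sketch below derives them for a tensor and computes the linear index of one element. It assumes a dense `BFYX` layout in which padding is added around each dimension. The helper names are hypothetical, and the actual values are supplied by the plugin, not computed by user code.

```python
def pitches_and_offset(dims, lower_pad, upper_pad):
    """Return (pitches, offset) in elements for dims and paddings ordered B, F, Y, X."""
    padded = [lo + d + hi for lo, d, hi in zip(lower_pad, dims, upper_pad)]
    pitches = [1] * len(dims)
    # The innermost dimension (X) is contiguous; each outer pitch spans the padded inner size.
    for i in range(len(dims) - 2, -1, -1):
        pitches[i] = pitches[i + 1] * padded[i + 1]
    # The offset skips the lower padding in every dimension.
    offset = sum(lo * p for lo, p in zip(lower_pad, pitches))
    return pitches, offset


def element_index(coords, pitches, offset):
    """Linear index, in elements (not bytes), of the element at (b, f, y, x)."""
    return offset + sum(c * p for c, p in zip(coords, pitches))


dims = [1, 96, 55, 55]
pitches, offset = pitches_and_offset(dims, [0, 0, 0, 0], [0, 0, 0, 0])
index = element_index((0, 2, 3, 4), pitches, offset)

# Same tensor with one element of padding around Y and X.
padded_pitches, padded_offset = pitches_and_offset(dims, [0, 0, 1, 1], [0, 0, 1, 1])
```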
## Example Kernel<a name="example-kernel"></a>
```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void example_relu_kernel(
    const __global INPUT0_TYPE* input0,
    __global OUTPUT0_TYPE* output)
{
    const uint idx = get_global_id(0);
    const uint idy = get_global_id(1);
    const uint idbf = get_global_id(2); // batches*features, as OpenCL supports 3D nd-ranges only
    const uint feature = idbf % OUTPUT0_DIMS[1];
    const uint batch = idbf / OUTPUT0_DIMS[1];
    // notice that pitches are in elements, not in bytes!
    // ...
}
```
The Inference Engine Extensibility API allows you to add support for custom operations to the Inference Engine.
An extension should contain operation sets with custom operations and execution kernels for custom operations.
Physically, an extension library can be represented as a dynamic library exporting the single `CreateExtension` function, which creates a new extension instance.
An extension library can be loaded into the InferenceEngine::Core object using the InferenceEngine::Core::AddExtension method.
## Inference Engine Extension Library
Inference Engine Extension dynamic library contains several main components:
* [Extension class](Extension.md):
- Contains custom operation sets
- Provides CPU implementations for custom operations
* [Custom operations](Intro.md):
- Allow using InferenceEngine::Core::ReadNetwork to read an Intermediate Representation (IR) with unsupported operations
- Allow creating an `ngraph::Function` with unsupported operations
- Provide a shape inference mechanism for custom operations
> **NOTE**: This documentation is written based on the `Template extension`, which demonstrates extension
development details. Find the complete code of the `Template extension`, which is fully compilable and up-to-date,
at `<dldt source tree>/docs/template_extension`.
## Execution Kernels
The Inference Engine workflow involves the creation of custom kernels and either custom or existing operations.
An _Operation_ is a Network building block implemented in the training framework, for example, `Convolution` in Caffe*.
A _Kernel_ is defined as the corresponding implementation in the Inference Engine.
Refer to the [Custom Layers in the Model Optimizer](../../MO_DG/prepare_model/customize_model_optimizer/Customize_Model_Optimizer.md) section for details on how
mapping between framework layers and Inference Engine kernels is registered.
In short, you can plug your own kernel implementations into the Inference Engine and map them to the layers in the original framework.
The following pages describe how to integrate custom _kernels_ into the Inference Engine:
* [Introduction to development of custom CPU kernels](CPU_Kernel.md)
* [Introduction to development of custom GPU kernels](GPU_Kernel.md)
* [Introduction to development of custom VPU kernels](VPU_Kernel.md)
## Additional Resources
* [Build an extension library using CMake*](Building.md)
> **NOTE:** The OpenCL compiler, targeting Intel® Neural Compute Stick 2 for the SHAVE* processor only, is redistributed with OpenVINO.
OpenCL support is provided by ComputeAorta* and is distributed under a license agreement between Intel® and Codeplay* Software Ltd.
The OpenCL™ toolchain for the Intel® Neural Compute Stick 2 supports offline compilation only, so first compile OpenCL C code using the standalone `clc` compiler. You can find the compiler binary at `<INSTALL_DIR>/deployment_tools/tools/cl_compiler`.
> **NOTE:** By design, custom OpenCL layers support any OpenCL kernels written with OpenCL version 1.2 assumed. They also support the half float
extension and are optimized for this type, because it is a native type for Intel® Movidius™ VPUs.
1. Prior to running a compilation, make sure that the following variables are set:
2. Run the compilation with the command below. You should use `--strip-binary-header` to make an OpenCL runtime-agnostic binary runnable with the Inference Engine.
```bash
cd <INSTALL_DIR>/deployment_tools/tools/cl_compiler/bin
```
To bind a layer you customize to the topology IR, prepare a configuration file that tells the Inference Engine where to find the parameters for your kernel and describes the execution work grid.
For example, given the following OpenCL kernel signature:
Each custom layer is described with the `CustomLayer` node. It has the following nodes and attributes:
- Root node `CustomLayer` contains the following attributes:
    - `name` – (Required) The name of the Inference Engine layer to bind the kernel with.
    - `type` and `version` – (Required) Reserved for future use. Set them to `MVCL` and `1` respectively.
    - `max-shaves` – (Optional) The maximum number of SHAVE cores that should be dedicated for the layer. It is useful for debugging concurrency issues or for saving resources if a memory-bound kernel does not scale well with the number of cores, so more resources can be left for the rest of the topology.
- Sub-node `Kernel` must contain the following attributes:
    - `entry` – The name of your kernel function as you defined it in a source file (in the example above, it is `reorg_nhwc`).
    - Node `Source` must contain the following attributes:
        - `filename` – The path to a compiled binary relative to the `.xml` binding file.
- Sub-node `Parameters` – Describes parameter bindings. For more information, see the description below.
- Sub-node `WorkSizes` – Describes local and global work group sizes and the source for dimension deduction as a pair `direction,port`. In the example above, the work group is described relative to the dimensions of the input tensor that comes through port 0 in the IR. `global` and `local` work group configurations support any simple math expressions with +,-,\*,/, and () from `B`(batch), `Y`(height), `X`(width) and `F`(channels).
- Sub-node `Where` – Lets you customize bindings with the `key="value"` attribute. For example, to substitute only 3x3 convolutions, write `<Where kernel="3,3"/>` in the binding XML.
A parameter description supports `Tensor` nodes (of one of the tensor types `input`, `output`, `input_buffer`, `output_buffer`, or `data`), `Scalar` nodes, and `Data` nodes, and has the following format:
- Each `Tensor` node of `input` or `output` type must contain the following attributes:
    - `arg-name` – The name of a kernel parameter in the kernel signature.
    - `type` – Node type: `input` or `output` as in the IR.
    - `port-index` – The number of the input/output port as in the IR.
    - `format` – The channel order in the tensor. Optional conversion layers are generated if the custom layer format is not compatible with formats of neighboring layers. `BFXY`, `BYXF`, and `ANY` formats are supported currently.
- Each `Tensor` node of `input_buffer` or `output_buffer` type must contain the following attributes:
    - `arg-name` – The name of a kernel parameter in the kernel signature.
    - `type` – Node type: `input_buffer` or `output_buffer`. Use the appropriate type to bind multiple kernels that correspond to different stages of the same layer.
    - `port-index` – The unique identifier to bind by.
    - `dim` – The dim source with the same `direction,port` format used for `WorkSizes` bindings.
    - `size` – The number of bytes needed. The current expression syntax supports only expressions over dimensions of the selected input/output tensor or constants and might be extended in the future.
Here is an example of multi-stage MVN layer binding:
- Each `Scalar` node must contain the following attributes:
    - `arg-name` – The name of a kernel parameter in the kernel signature.
    - `type` – `int` or `float` value. It is used for correct argument extraction from IR parameters.
    - `source` – Contains the name of the parameter in the IR file or input/output (`I`/`O`, `In`/`On`, where `n` is a port number)
    followed by dimension `B`(batch), `Y`(height), `X`(width), or `F`(channels).
- Each `Data` node must contain the following attributes:
    - `arg-name` – The name of a kernel parameter in the kernel signature.
    - `type` – Node type. Currently, `local_data` is the only supported value, which defines a buffer allocated in fast local on-chip memory. It is limited to 100K for all `__local` and
    `__private` arrays defined inside the kernel as well as all `__local` parameters passed to the kernel. Keep in mind that a manual-DMA extension requires double buffering.
    If the custom layer is detected to run out of local memory, the inference fails.
    - `dim` – The dim source with the same `direction,port` format used for `WorkSizes` bindings.
    - `size` – The number of bytes needed. The current expression syntax supports only expressions over dimensions of the selected input/output tensor or constants and may be extended in the future.
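Because inference fails when a custom layer runs out of local memory, it can help to estimate the budget on the host before writing the `size` expressions. The sketch below is a hypothetical helper, not part of the Inference Engine API. It assumes that "100K" means 100 * 1024 bytes and models the double buffering of the manual-DMA extension by counting every buffer twice.

```cpp
#include <cstddef>
#include <vector>

// Assumption: the "100K" limit is read as 100 * 1024 bytes.
constexpr std::size_t kLocalMemoryLimit = 100 * 1024;

// Total bytes requested by a set of `local_data` buffers (hypothetical helper).
// With manual DMA, every buffer is counted twice to model double buffering.
std::size_t localBytesNeeded(const std::vector<std::size_t>& bufferBytes, bool manualDma) {
    std::size_t total = 0;
    for (std::size_t bytes : bufferBytes) {
        total += manualDma ? 2 * bytes : bytes;
    }
    return total;
}

// True if the requested buffers fit into the assumed local memory budget.
bool fitsLocalMemory(const std::vector<std::size_t>& bufferBytes, bool manualDma) {
    return localBytesNeeded(bufferBytes, manualDma) <= kLocalMemoryLimit;
}
```

For example, two buffers of 208 x 64 `half` elements take 26624 bytes each. They fit without manual DMA (53248 bytes in total), but exceed the assumed budget once double buffering is counted (106496 bytes).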
The example binding below illustrates a kernel with two local buffers passed to the kernel.
> **NOTE**: If both native and custom layer implementations are present, the custom kernel has a priority over the native one.
Before loading the network that features the custom layers, provide a separate configuration file and load it using the InferenceEngine::Core::SetConfig() method with the PluginConfigParams::KEY_CONFIG_FILE key and the configuration file name as a value:
```cpp
InferenceEngine::Core core;
// Load custom layers
core.SetConfig({ { InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE, "<path to the xml file>" } }, "MYRIAD");
```
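Optionally, set a path to a custom layers description with a pair of `VPU_CUSTOM_LAYERS` and `/path/to/your/customLayers.xml` as a network configuration:
```cpp
// networkConfig is a std::map<std::string, std::string> that holds the pair described above
auto exeNetwork = core.LoadNetwork(cnnNetwork, "MYRIAD", networkConfig);
```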
## Optimizing Kernels with OpenCL™ for VPU (Intel® Neural Compute Stick 2)
This section provides optimization guidelines on writing custom layers with OpenCL for VPU devices. Knowledge of the general OpenCL
programming model and the OpenCL kernel language is assumed and is not covered in this section. The OpenCL model mapping to VPU is described in the table below.
| OpenCL Model | VPU Mapping|
|-----|----|
| Device code | Executed on SHAVE cores |
| Private memory | Mapped to CMX internal memory, limited to 100KB per work group, valid only while the work group is executed |
| Local memory | Mapped to CMX internal memory, limited to 100KB per work group, valid only while the work group is executed |
| Global memory | Mapped to DDR, used to pass execution preserved parameters for inputs, outputs, and blobs |
| Work group | Executed on a single SHAVE core iterating over multiple work items |
Note that by the OpenCL specification, the work group execution order is not specified. This means that it is your
responsibility to ensure that race conditions among work groups are not introduced. The custom layer runtime splits the work grid evenly
among available compute resources and executes the work groups in an arbitrary order. This static scheduling approach works best if the load is evenly spread out across work groups, which is a typical case for Deep Learning kernels. The following guidelines are recommended for work group partitioning:
1. Split work evenly across work groups.
2. Adjust work group granularity to maintain equal workload for all compute cores.
3. Set the maximum number of cores (using the `max-shaves` attribute for the `CustomLayer` node). This keeps more resources for the rest of the topology. It is also useful if the kernel scalability has reached its limits, which may happen while optimizing memory-bound kernels or kernels with poor parallelization.
4. Try an alternate data layout (`BFXY`/`BYXF`) for the kernel if it improves work group partitioning or data access patterns.
Consider full topology performance (not just specific layer boost) since data conversion layers would be automatically inserted
as appropriate.
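The effect of the first three guidelines can be estimated with a simple host-side model. The sketch below is an illustration only and does not reproduce the actual runtime scheduler: it assumes that every work group costs the same and that groups are assigned to cores statically, so the busiest core determines the execution time. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Distributes `numGroups` work groups over `numCores` cores as evenly as possible.
std::vector<std::size_t> splitWorkGroups(std::size_t numGroups, std::size_t numCores) {
    assert(numCores > 0);
    std::vector<std::size_t> perCore(numCores, numGroups / numCores);
    for (std::size_t i = 0; i < numGroups % numCores; ++i) {
        ++perCore[i];  // the remainder goes to the first cores
    }
    return perCore;
}

// Fraction of core time spent on useful work when every group costs the same:
// total work divided by (number of cores * work of the busiest core).
double utilization(std::size_t numGroups, std::size_t numCores) {
    const std::vector<std::size_t> perCore = splitWorkGroups(numGroups, numCores);
    const std::size_t busiest = *std::max_element(perCore.begin(), perCore.end());
    if (busiest == 0) {
        return 0.0;
    }
    return static_cast<double>(numGroups) / static_cast<double>(numCores * busiest);
}
```

Under this model, 26 work groups on 16 cores keep the cores busy only 81.25% of the time, while limiting the layer to 13 cores gives full utilization at the same execution time and leaves three cores for the rest of the topology.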
The offline OpenCL compiler (`clc`) features automatic vectorization over `get_global_id(0)` usage if uniform access is detected.
For example, the kernel below could be automatically vectorized:
However, this work-group based vectorizer (WGV) conflicts with the default LLVM vectorizer based on superword level parallelism
(SLP) for the current compiler version. Manual vectorization is recommended to provide the best performance for non-uniform code
patterns. WGV works if and only if vector types are not used in the code.
Here is a short list of optimization tips:
1. Help the auto-vectorizer by declaring kernel parameter pointers as non-aliasing: put `restrict` where possible.
    - This may give a performance boost, especially for kernels with unrolling, like `ocl_grn` from the example below.
    - Place `restrict` markers for kernels with manually vectorized code. In the `ocl_grn` kernel below, the unrolled version without `restrict` is up to 20% slower than the most optimal one, which combines unrolling and `restrict`.
2. Put `#pragma unroll N` in your loop header. Since the compiler does not trigger unrolling by default, it is your responsibility to
annotate the code with pragmas as appropriate. The `ocl_grn` version with `#pragma unroll 4` is up to 50% faster, most of which comes from unrolling the first loop, because LLVM, in general, is better at scheduling 3-stage loops (load-compute-store), while the first loop
`variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]);` is only 2-stage (load-compute). Pay
attention to unrolling such cases first. The unrolling factor is loop-dependent. Choose the smallest number that
still improves performance as an optimum between the kernel size and execution speed. For this specific kernel, changing the unroll factor from `4` to `6` results in the same performance, so an unrolling factor of 4 is the optimum. For Intel® Neural Compute Stick 2, unrolling is combined with automatic software pipelining for load, store, and compute stages:
Both versions perform the same, but the second one has more complex code.
3. If it is easy to predict the work group size, you can also use the `reqd_work_group_size` kernel attribute to ask the compiler
to unroll the code up to the local size of the work group. Note that if the kernel is actually executed with a
different work group configuration, the result is undefined.
4. Prefer `half` compute if it keeps reasonable accuracy. 16-bit float is a native type for Intel® Neural Compute Stick 2, and most of the `half_*` functions are mapped to a single hardware instruction.
Use the standard `native_*` functions for the rest of the types.
5. Prefer to use the `convert_half` function over `vstore_half` if conversion to 32-bit float is required. `convert_half` is mapped to a single hardware instruction. For the `cvtf32f16` kernel above, the line `outImage[idx] = convert_half(inImage[idx]*scale+bais);` is 8 times slower than the code with `vstore_half`.
6. Mind early exits. An early exit may be extremely costly for the current version of the `clc` compiler due to conflicts with the
auto-vectorizer. The generic advice is to set the local size along the `x` dimension equal to the input and/or output width.
If it is impossible to define a work grid that exactly matches the inputs and/or outputs and thus eliminates checks such as
`if (get_global_id(0) >= width) return`, use a line-wise kernel variant with manual vectorization.
The kernel example below demonstrates the impact of early exits on kernel performance.
This `reorg` kernel is auto-vectorizable, but the input for the YOLO v2 topology is `NCHW=<1,64,26,26>`, and its width is not a multiple of the vector width (which is `8` for the `half` data type). As a result, the Inference Engine does not select the auto-vectorized kernel.
To compare the performance of the auto-vectorized and scalar versions of the kernel, change the input size to `NCHW=<1,64,26,32>`. This allows the auto-vectorized version to be selected by the Inference Engine and can give you about 30% uplift.
Since the auto-vectorized version is faster, it makes sense to enable it for the YOLO v2 topology input size by setting the local size to a multiple of the vector width (for example, 32) and adjusting the global sizes accordingly. As a result, the execution work grid exceeds the actual input dimension, so out-of-bound checks should be inserted. See the updated kernel version below:
```cpp
// Version with out-of-bound checks added
__kernel void reorg(const __global half* restrict src, __global half* restrict out, int W, int stride)
{
    // ...
}
```
This code performs the same as the initial (scalar) kernel above due to branching overhead. If you replace the min/max expression `w = min(w, W-1);` with `if (w >= W) return;`, the runtime increases up to 2x compared to the code without branching (initial version).<br>
If branching is inevitable for your element-based kernel, it is recommended to change the scheme to line-based. See the kernel variant below:
```cpp
// Line-wise version
__kernel void reorg(const __global half* restrict src, __global half* restrict out, int H, int W, int stride)
{
    // ...
}
```
This decreases the execution time by up to 40% compared to the best performing vectorized kernel without early exits (initial version).
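The work-grid adjustment discussed in this tip comes down to two small pieces of arithmetic: rounding the global size up to a multiple of the local size, and clamping the column index instead of exiting early. The host-side sketch below only illustrates that arithmetic. The helper names are hypothetical and are not part of any OpenCL or Inference Engine API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Rounds `size` up to the nearest multiple of `multiple`.
std::size_t roundUp(std::size_t size, std::size_t multiple) {
    assert(multiple > 0);
    return (size + multiple - 1) / multiple * multiple;
}

// Clamps a work item index to the last valid column, mirroring `w = min(w, W-1)`.
int clampColumn(int w, int width) {
    return std::min(w, width - 1);
}
```

For a row of 26 elements and a local size of 32, `roundUp(26, 32)` yields a global size of 32. Work items 26 to 31 are then clamped to column 25, so they repeat the work of the last column rather than taking a branch.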
7. Reuse computations among work items by using line-based kernels or sharing values through `__local` memory.
8. Improve data access locality. Most custom kernels are memory bound, while convolution and fully connected layers are hardware-implemented. The code below demonstrates a further optimized version of the `reorg` kernel unrolled by `stride`:
`src` data in this case is loaded only once. As a result, the cycle count drops by up to 45% compared to the line-wise version.
9. Copy data from `__global` to `__local` or `__private` memory if the data is accessed more than once. Access to
`__global` memory is orders of magnitude slower than access to `__local`/`__private` memory due to the statically scheduled pipeline, which
stalls completely on memory access without any prefetch. The same recommendation applies to scalar load/store
from/to a `__global` pointer, since work-group copying can be done in a vector fashion.
10. Use a manual DMA extension. Local (on-chip) memory throughput is up to 24x higher than DDR throughput. Starting from OpenVINO™ 2020.1, VPU OpenCL features a manual-DMA kernel extension to copy the sub-tensor used by a work group into local memory and perform compute without DDR involved. Here is a simple GRN kernel implementation that runs over DDR. The local size is equal to (width of the input tensor, 1, 1) to define a large enough work group to get the code automatically vectorized and unrolled, while the global size is (width of the input tensor, height of the input tensor, 1):
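```cpp
__kernel void grn_NCHW(
    __global const half* restrict src_data,
    __global       half* restrict dst_data,
    int C,
    float bias)
{
    // Accumulate the sum of squares over all channels for this work item
    float variance = bias + 1e-9f;
    #pragma unroll 4
    for (int c = 0; c < C; c++)
    {
        float val = (float) src_data[c*get_global_size(1)*get_global_size(0) + get_global_id(1)*get_global_size(0) + get_global_id(0)];
        variance += val*val;
    }
    // Normalize every channel value by 1/sqrt(sum of squares + bias)
    const float scale = native_rsqrt(variance);
    #pragma unroll 4
    for (int c = 0; c < C; c++)
    {
        const int idx = c*get_global_size(1)*get_global_size(0) + get_global_id(1)*get_global_size(0) + get_global_id(0);
        dst_data[idx] = (half)((float) src_data[idx] * scale);
    }
}
```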
This kernel can be rewritten to introduce the special data binding `__dma_preload` and `__dma_postwrite` intrinsics. This means that instead of one kernel, a group of three kernels should be implemented: `kernelName`, `__dma_preload_kernelName`, and `__dma_postwrite_kernelName`. `__dma_preload_kernelName` for a particular work group `n` is guaranteed to be executed before the `n`-th work group itself, while `__dma_postwrite_kernelName` is guaranteed to be executed after the corresponding work group. You can define either of those functions, which are intended to copy data between `__global` and `__local` memory. The syntax requires an exact function signature match. The example below illustrates how to prepare your kernel for manual DMA.
```cpp
__kernel void __dma_preload_grn_NCHW(
    __global const half* restrict src,
    __global       half* restrict dst,
    __local        half* restrict local_src,
    __local        half* restrict local_dst,
    int C,
    float bias)
{
    // TODO: copy required piece of src tensor into local_src
}

__kernel void __dma_postwrite_grn_NCHW(
    __global const half* restrict src,
    __global       half* restrict dst,
    __local  const half* restrict local_src,
    __local        half* restrict local_dst,
    int C,
    float bias)
{
    // TODO: copy back computed piece of local_dst into dst
}

__kernel void grn_NCHW(
    __global const half* restrict src_data,
    __global       half* restrict dst_data,
    __local        half* restrict src,
    __local        half* restrict dst,
    int C,
    float bias)
{
    // same as the example above
}
```
The GRN kernel operates on channel-major tensors to compute the average over the full channel range and then normalizes input elements to produce the output.
As a part of the manual DMA extension, a group of work group copy functions is introduced in addition to `async_work_group_copy`, which is also mapped to a DMA call.
Here is the list of supported functions:
```cpp
// 2D sub-tensor copy
event_t WorkGroupDmaCreateStrideTransaction(
const local T *src,
global T *dst,
size_t src_width, // width of the line of source in bytes
size_t dst_width, // width of the line of destination in bytes
size_t src_stride, // stride between corresponding 2 consecutive lines of source in bytes
size_t dst_stride, // stride between corresponding 2 consecutive lines of destination in bytes
size_t size, // total number of bytes loaded for all lines from source to destination
event_t event) __OVERLOAD;
event_t WorkGroupDmaCreateStrideTransaction(
const global T *src,
local T *dst,
size_t src_width, // width of the line of source in bytes
size_t dst_width, // width of the line of destination in bytes
size_t src_stride, // stride between corresponding 2 consecutive lines of source in bytes
size_t dst_stride, // stride between corresponding 2 consecutive lines of destination in bytes
size_t size, // total number of bytes loaded for all lines from source to destination
event_t event) __OVERLOAD;
// 3D sub-tensor copy
event_t WorkGroupDmaCreate3DTransaction(
const local T *src,
global T *dst,
size_t src_width, // width of the line of source in bytes
size_t dst_width, // width of the line of destination in bytes
size_t src_stride, // stride between corresponding 2 consecutive lines of source in bytes
size_t dst_stride, // stride between corresponding 2 consecutive lines of destination in bytes
size_t num_planes, // number of planes to be copied
size_t src_plane_stride, // stride between corresponding 2 consecutive planes of source in bytes
size_t dst_plane_stride, // stride between corresponding 2 consecutive planes of destination in bytes
size_t size, // size of the loaded plane in bytes, analogous to the size in 2D case
event_t event) __OVERLOAD;
event_t WorkGroupDmaCreate3DTransaction(
const global T *src,
local T *dst,
size_t src_width, // width of the line of source in bytes
size_t dst_width, // width of the line of destination in bytes
size_t src_stride, // stride between corresponding 2 consecutive lines of source in bytes
size_t dst_stride, // stride between corresponding 2 consecutive lines of destination in bytes
size_t num_planes, // number of planes to be copied
size_t src_plane_stride, // stride between corresponding 2 consecutive planes of source in bytes
size_t dst_plane_stride, // stride between corresponding 2 consecutive planes of destination in bytes
size_t size, // size of the loaded plane in bytes, analogous to the size in 2D case
event_t event) __OVERLOAD;
```
where `T` can be `uchar`, `char`, `short`, `ushort`, `int`, `uint`, `long`, `ulong`, `half` or `float`.
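To make the meaning of the width and stride arguments more concrete, the sketch below models a 2D strided copy on the host. It is a reference model written for illustration, not the DMA implementation. It assumes the semantics suggested by the parameter comments above: `size` bytes are moved in total, the source is read in lines of `srcWidth` bytes that start `srcStride` bytes apart, and the destination is written in lines of `dstWidth` bytes that start `dstStride` bytes apart. The names `stridedCopy` and `copyWindowExample` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference model of a 2D strided copy. All sizes are in bytes.
void stridedCopy(const std::uint8_t* src, std::uint8_t* dst,
                 std::size_t srcWidth, std::size_t dstWidth,
                 std::size_t srcStride, std::size_t dstStride,
                 std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
        // Position of byte i within the line structure of each side
        const std::size_t srcOffset = (i / srcWidth) * srcStride + i % srcWidth;
        const std::size_t dstOffset = (i / dstWidth) * dstStride + i % dstWidth;
        dst[dstOffset] = src[srcOffset];
    }
}

// Gathers a window of 2 lines x 4 bytes from an 8-byte-wide source
// into a densely packed destination.
std::vector<std::uint8_t> copyWindowExample() {
    std::vector<std::uint8_t> src(16);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<std::uint8_t>(i);
    }
    std::vector<std::uint8_t> dst(8, 0);
    // The window starts at byte 2 of the first source line.
    stridedCopy(src.data() + 2, dst.data(), 4, 4, 8, 4, 8);
    return dst;
}
```

In `copyWindowExample`, the destination ends up holding bytes 2 to 5 of the first source line followed by bytes 10 to 13 of the second one, which is the kind of sub-tensor gather a preload kernel performs before the compute kernel runs.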
Modified version of the GRN kernel could be the following:
Note the use of `get_local_size` and `get_local_id` inside the kernel. A 21x speedup is expected for this kernel on the enet-curbs setup, since it was completely limited by memory usage.
An alternative way to use DMA is the work item copy extension. Those functions are executed inside a kernel and require work groups equal to a single work item.
Here is the list of supported work item functions:
Using GPU Kernels Tuning {#openvino_docs_IE_DG_GPU_Kernels_Tuning}
======================
GPU Kernels Tuning allows you to tune models so that the computationally heavy layers are configured to better fit
the hardware on which the tuning was done. It is required to achieve the best performance on GPU.
> **NOTE** Currently only convolution and fully connected layers undergo the tuning process. This means that the performance boost depends on the number of such layers in the model.
OpenVINO™ releases include the `<INSTALL_DIR>/inference_engine/bin/intel64/Release/cache.json` file with pretuned data for current state-of-the-art models. It is highly recommended to do the
tuning for new kinds of models, hardware, or drivers.
## Tuned data
GPU tuning data is saved in JSON format.
The file content consists of two types of attributes and one type of value:
1. Execution units number - this attribute splits the content into different EU sections.
2. Hash - hashed tuned kernel data.
Key: Array with kernel name and kernel's mode index.
## Usage
---
You can activate the Kernels Tuning process by setting the `KEY_TUNING_MODE` flag to `TUNING_CREATE` and `KEY_TUNING_FILE` to `<"filename">` in a configuration map that is
passed to the plugin while loading a network.
This configuration modifies the behavior of the `ExecutableNetwork` object. Instead of standard network compilation, it runs the tuning process.
Keep in mind that tuning can be very time consuming. The bigger the network, the longer it takes.
The result of this step is a file with tuned data.
> **NOTE** If a filename passed to `KEY_TUNING_FILE` points to existing tuned data and you are tuning a new model, then this file will be extended by new data. This allows you to extend existing `cache.json` provided in the OpenVINO™ release package.
The example below shows how to set and use the key files:
| WINAPI | Windows Application Programming Interface |
## Terms
Glossary of terms used in the Inference Engine
| Term | Description |
| :--- | :--- |
| Batch | Number of images to analyze during one call of infer. Maximum batch size is a property of the network and it is set before loading of the network to the plugin. In NHWC, NCHW and NCDHW image data layout representation, the N refers to the number of images in the batch |
| Blob | Memory container used for storing inputs, outputs of the network, weights and biases of the layers |
| Device (Affinity) | A preferred Intel(R) hardware device to run the inference (CPU, GPU, etc.) |
| Extensibility mechanism, Custom layers | The mechanism that provides you with capabilities to extend the Inference Engine and Model Optimizer so that they can work with topologies containing layers that are not yet supported |
| <code>ICNNNetwork</code> | An Interface of the Convolutional Neural Network that Inference Engine reads from IR. Consists of topology, weights and biases |
| <code>IExecutableNetwork</code> | An instance of the loaded network which allows the Inference Engine to request (several) infer requests and perform inference synchronously or asynchronously |
| <code>IInferRequest</code> | Interface that represents the end point of inference on the model loaded to the plugin and represented by executable network. Inputs are set here, outputs should be requested from this interface as well |
| <code>InferenceEngineProfileInfo</code> | Represents basic inference profiling information per layer |
| Inference Engine | A C++ library with a set of classes that you can use in your application to infer input data (images) and get the result |
| Inference Engine API | The basic default API for all supported devices, which allows you to load a model from Intermediate Representation, set input and output formats and execute the model on various devices |
| Inference Engine <code>Core</code> | Inference Engine Core is a software component that manages inference on certain Intel(R) hardware devices: CPU, GPU, MYRIAD, GNA, etc. |
| Layer catalog or Operations specification | A list of supported layers or operations and their parameters. Sets of supported layers are different for different plugins. Check the documentation on plugins to verify whether the Inference Engine supports a certain layer on the dedicated hardware |
| <code>Layout</code> | Image data layout refers to the representation of a batch of images. Layout shows a sequence of 4D or 5D tensor data in memory. A typical NCHW format represents pixels in the horizontal direction, rows in the vertical dimension, planes by channel, and images in a batch |
| <code>OutputsDataMap</code> | Structure which contains information about output precisions and layouts |
| Precision | Represents data precision. For example, FP32 is 32-bit floating point, FP16 is 16-bit floating point. Precision can be changed before loading the network to the plugin |
| <code>PreProcessInfo</code> | Class that represents input data for the network. It contains information about input precision, its layout, and pre-processing |
| <code>ResponseDesc</code> | Represents debug information for an error |
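The `Layout` entry above can be made concrete with the offset arithmetic for two of the 4D layouts mentioned in the glossary. The sketch below is a standalone illustration. The `Dims` structure and function names are hypothetical and are not Inference Engine types.

```cpp
#include <cstddef>

// Tensor dimensions: batch, channels, height, width (hypothetical helper type).
struct Dims {
    std::size_t N, C, H, W;
};

// NCHW: width is the fastest-changing dimension, then height, channels, and batch.
std::size_t offsetNCHW(const Dims& d, std::size_t n, std::size_t c, std::size_t h, std::size_t w) {
    return ((n * d.C + c) * d.H + h) * d.W + w;
}

// NHWC: channels are the fastest-changing dimension, then width, height, and batch.
std::size_t offsetNHWC(const Dims& d, std::size_t n, std::size_t c, std::size_t h, std::size_t w) {
    return ((n * d.H + h) * d.W + w) * d.C + c;
}
```

For a tensor with `N=1, C=3, H=2, W=4`, the element at channel 1, row 0, column 2 is stored at offset 10 in NCHW and at offset 7 in NHWC.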
## See Also
* [Deep Learning Model Optimizer IR Operations Catalog](../ops/opset.md)
Introduction to Inference Engine Device Query API {#openvino_docs_IE_DG_InferenceEngine_QueryAPI}
===============================
This section provides a high-level description of the process of querying different device properties and configuration values.
Refer to the [Hello Query Device Sample](../../inference-engine/samples/hello_query_device/README.md) sources and the [Multi-Device Plugin guide](supported_plugins/MULTI.md) for examples of using the Inference Engine Query API in user applications.
## Using the Inference Engine Query API in Your Code
The Inference Engine `Core` class provides the following API to query device information, set or get different device configuration properties:
* <code>InferenceEngine::Core::GetAvailableDevices</code> - Provides a list of available devices. If there is more than one instance of a specific device, the devices are enumerated with `.suffix` where `suffix` is a unique string identifier. The device name can be passed to all methods of the `InferenceEngine::Core` class that work with devices, for example `InferenceEngine::Core::LoadNetwork`.
* <code>InferenceEngine::Core::GetMetric</code> - Provides information about a specific device.
* <code>InferenceEngine::Core::GetConfig</code> - Gets the current value of a specific configuration key.
* <code>InferenceEngine::Core::SetConfig</code> - Sets a new value for the configuration key.
The `InferenceEngine::ExecutableNetwork` class is also extended to support the Query API:
For documentation about common configuration keys, refer to `ie_plugin_config.hpp`. Device specific configuration keys can be found in corresponding plugin folders.
### GetMetric()
* To extract device properties such as available device, device name, supported configuration keys, and others, use the `InferenceEngine::Core::GetMetric` method:
A returned value looks as follows: `Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz`.
> **NOTE**: All metrics have specific type, which is specified during metric instantiation. The list of common device-agnostic metrics can be found in `ie_plugin_config.hpp`. Device specific metrics (for example, for `HDDL`, `MYRIAD` devices) can be found in corresponding plugin folders.
## Query API in the ExecutableNetwork Class
### GetMetric()
The method is used to get executable network specific metric such as `METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS)`:
Inference Engine with low-precision 8-bit integer inference requires the following prerequisites to be satisfied:
- Inference Engine [CPU Plugin](supported_plugins/CPU.md) must be built with the Intel® Math Kernel Library (Intel® MKL) dependency. In the Intel® Distribution of OpenVINO™, this requirement is
satisfied by default. It mostly matters if you use the [open source version of OpenVINO™](https://github.com/openvinotoolkit/openvino), which can be built with OpenBLAS*, and that is unacceptable if you want to use 8-bit integer inference.
- Intel® platforms that support at least one extension to x86 instruction set from the following list:
- A model must be quantized. To quantize the model, you can use the [Post-Training Optimization Tool](@ref pot_README) delivered with the Intel® Distribution of OpenVINO™ toolkit release package.
The 8-bit inference feature was validated on the following topologies:
A lot of research in the field of deep learning has explored low precision computations during inference as a way to speed up deep learning pipelines and achieve higher performance. For example, one of the popular approaches is to shrink the precision of activation and weight values from `fp32` to smaller precisions, for example, to `fp11` or `int8`. For more information about this approach, refer to the
**Brief History of Lower Precision in Deep Learning** section in [this whitepaper](https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training).
8-bit computations (referred to as `int8`) offer better performance compared to the results of inference in higher precision (for example, `fp32`), because they allow loading more data into a single processor instruction. Usually the cost of a significant boost is reduced accuracy. However, it has been shown that the accuracy drop can be negligible and depends on task requirements, so that the application engineer can set the maximum accuracy drop that is acceptable.
The current Inference Engine solution for low-precision inference uses Intel MKL-DNN and supports inference of the following layers in 8-bit integer computation mode:
* Convolution
* FullyConnected
* ReLU
* ReLU6
* Reshape
* Permute
* Pooling
* Squeeze
* Eltwise
* Concat
* Resample
* MVN
This means that 8-bit inference can only be performed with the CPU plugin on the layers listed above. All other layers are executed in the format supported by the CPU plugin: 32-bit floating point format (`fp32`).
## Low-Precision 8-bit Integer Inference Workflow
For 8-bit integer computations, a model must be quantized. If the model is not quantized then you can use the [Post-Training Optimization Tool](@ref pot_README) to quantize the model. The quantization process adds `FakeQuantize` layers on activations and weights for most layers. Read more about mathematical computations under the hood in the [white paper](https://intel.github.io/mkl-dnn/ex_int8_simplenet.html).
The 8-bit inference pipeline includes two stages (also refer to the figure below):
1. *Offline stage*, or *model quantization*. During this stage, `FakeQuantize` layers are added before most layers to have quantized tensors before layers in a way that the low-precision accuracy drop for 8-bit integer inference satisfies the specified threshold. The output of this stage is a quantized model. Quantized model precision is not changed, and quantized tensors are in the original precision range (`fp32`). The `FakeQuantize` layer has the `Quantization Levels` attribute, which defines the quants count. The quants count defines the precision that is used during inference. For the `int8` range, the `Quantization Levels` attribute value has to be 255 or 256.
2. *Run-time stage*. This stage is an internal procedure of the [CPU Plugin](supported_plugins/CPU.md). During this stage, the quantized model is loaded to the plugin. The plugin updates each `FakeQuantize` layer on activations and weights to have `FakeQuantize` output tensor values in the low precision range.
![int8_flow]
### Offline Stage: Model Quantization
To infer a layer in low precision and get maximum performance, the input tensor for the layer has to be quantized and each value has to be in the target low precision range. For this purpose, `FakeQuantize` layer is used in the OpenVINO™ intermediate representation file (IR). To quantize the model, you can use the [Post-Training Optimization Tool](@ref pot_README) delivered with the Intel® Distribution of OpenVINO™ toolkit release package.
When you pass the calibrated IR to the [CPU plugin](supported_plugins/CPU.md), the plugin automatically recognizes it as a quantized model and performs 8-bit inference. Note that if you pass a quantized model to another plugin that does not support 8-bit inference, the model is inferred in the precision that this plugin supports.
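The effect of a `FakeQuantize` layer on a single value can be sketched as follows. This is a minimal illustration, assuming the element-wise formula from the `FakeQuantize` operation description; the function name `fake_quantize` and its parameter names are chosen for this example and are not part of the Inference Engine API.

```cpp
#include <algorithm>
#include <cmath>

// Element-wise FakeQuantize sketch: snaps x to one of `levels` evenly spaced
// values between output_low and output_high. The result stays in fp32,
// which matches the statement that the model precision is not changed.
float fake_quantize(float x, float input_low, float input_high,
                    float output_low, float output_high, int levels) {
    // Values at or below the input range map to the lowest output value.
    if (x <= std::min(input_low, input_high)) return output_low;
    // Values above the input range map to the highest output value.
    if (x > std::max(input_low, input_high)) return output_high;
    // Inside the range: pick the nearest of the (levels - 1) steps.
    const float steps = static_cast<float>(levels - 1);
    const float q = std::round((x - input_low) / (input_high - input_low) * steps);
    return q / steps * (output_high - output_low) + output_low;
}
```

With `levels` set to 256, the output takes at most 256 distinct values, which is why such a tensor can later be stored in an 8-bit type without further loss.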
### Run-Time Stage: Quantization
This is the second stage of the 8-bit integer inference. After you load the quantized model IR to a plugin, the plugin uses the `Low Precision Transformation` component to update the model to infer it in low precision:
* `FakeQuantize` layers are updated to have quantized output tensors in the low-precision range, and dequantization layers are added to compensate for the update. Dequantization layers are pushed through as many layers as possible so that more layers run in low precision. After that, most layers have quantized input tensors in the low-precision range and can be inferred in low precision. Ideally, dequantization layers should be fused into the next `FakeQuantize` or `ScaleShift` layers.
* Weights are quantized and stored in `Const` layers.
* Biases are updated to avoid shifts in dequantization layers.
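The role of dequantization can be illustrated with a dot product computed on 8-bit data. The sketch below is a simplification under stated assumptions: it uses symmetric quantization with one scale per tensor, and the helper names `quantize_int8` and `int8_dot` are hypothetical. It does not reproduce what the `Low Precision Transformation` component does internally.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric quantization of an fp32 vector to int8 with a single scale
// (assumed scheme for this example only).
std::vector<std::int8_t> quantize_int8(const std::vector<float>& values, float scale) {
    std::vector<std::int8_t> out;
    out.reserve(values.size());
    for (float v : values) {
        const float q = std::round(v / scale);
        // Clamp to the symmetric int8 range before narrowing.
        out.push_back(static_cast<std::int8_t>(std::max(-127.0f, std::min(127.0f, q))));
    }
    return out;
}

// Dot product on int8 data with int32 accumulation. The final multiplication
// by both scales plays the role of the dequantization step.
float int8_dot(const std::vector<std::int8_t>& a, float scale_a,
               const std::vector<std::int8_t>& b, float scale_b) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return static_cast<float>(acc) * scale_a * scale_b;
}
```

Because the scales are applied once after the integer accumulation, the expensive part of the computation runs entirely on 8-bit inputs, which is the reason dequantization is pushed as far down the graph as possible.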
## Performance Counters
Information about layer precision is stored in the performance counters that are
available from the Inference Engine API. The layers have the following marks:
* Suffix `I8` for layers that had 8-bit data type input and were computed in 8-bit precision
* Suffix `FP32` for layers computed in 32-bit precision
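An application that collects performance counters can use these suffixes to count how many layers actually ran in 8-bit precision. The helper below is a sketch only: it works on plain strings, and an execution-type string such as `jit_avx512_I8` is a made-up example rather than a guaranteed output format.

```cpp
#include <string>

// Returns true when `name` ends with `suffix`.
bool ends_with(const std::string& name, const std::string& suffix) {
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Maps an execution-type string taken from the performance counters
// to a precision label based on the suffixes described above.
std::string precision_from_exec_type(const std::string& exec_type) {
    if (ends_with(exec_type, "I8")) return "int8";
    if (ends_with(exec_type, "FP32")) return "fp32";
    return "unknown";
}
```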
For example, the performance counters table for the Inception model can look as follows:
Integrate the Inference Engine with Your Application {#openvino_docs_IE_DG_Integrate_with_customer_application_new_API}
===============================
This section provides a high-level description of the process of integrating the Inference Engine into your application.
Refer to the [Hello Classification Sample](../../inference-engine/samples/hello_classification/README.md) sources
for an example of using the Inference Engine in applications.
## Use the Inference Engine API in Your Code
The core `libinference_engine.so` library implements loading and parsing a model Intermediate Representation (IR), and triggers inference using a specified device. The core library has the following API:
The C++ Inference Engine API wraps the capabilities of the core library:
* `InferenceEngine::CNNNetwork`
* `InferenceEngine::ExecutableNetwork`
* `InferenceEngine::InferRequest`
## Integration Steps
The integration process includes the following steps:
![integration_process]
1) **Create Inference Engine Core** to manage available devices and read network objects:
```cpp
InferenceEngine::Core core;
```
2) **Read a model IR** created by the Model Optimizer (.xml is supported format):
```cpp
auto network = core.ReadNetwork("Model.xml");
```
**Or read the model from ONNX format** (.onnx and .prototxt are supported formats). You can find more information about the ONNX format support in the document [ONNX format support in the OpenVINO™](./ONNX_Support.md).
```cpp
auto network = core.ReadNetwork("model.onnx");
```
3) **Configure input and output**. Request input and output information using `InferenceEngine::CNNNetwork::getInputsInfo()`, and `InferenceEngine::CNNNetwork::getOutputsInfo()`
methods:
```cpp
/** Take information about all topology inputs **/
```
Optionally, set the number format (precision) and memory layout for inputs and outputs. Refer to the
[Supported configurations](supported_plugins/Supported_Devices.md) chapter to choose the relevant configuration.
You can also allow input of any size. To do this, mark each input as resizable by setting a desired resize algorithm (for example, `BILINEAR`) inside of the appropriate input info.
Basic color format conversions are supported as well. By default, the Inference Engine assumes
that the input color format is `BGR` and color format conversions are disabled. The Inference
Engine supports the following color format conversions:
* `RGB->BGR`
* `RGBX->BGR`
* `BGRX->BGR`
* `NV12->BGR`

where `X` is a channel that is ignored during inference. To enable the conversions, set a
desired color format (for example, `RGB`) for each input inside of the appropriate input info.
If you want to run inference for multiple images at once, you can use the built-in batch
pre-processing functionality.
> **NOTE**: Batch pre-processing is not supported if input color format is set to `ColorFormat::NV12`.

You can use the following code snippet to configure input and output:
```cpp
/** output_buffer[] - accessing output blob data **/
```
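To make the `RGB->BGR` and `RGBX->BGR` color format conversions from step 3 concrete, the sketch below shows their effect on interleaved 8-bit pixel data. It is an illustration only, not the Inference Engine implementation: the function name `to_bgr` is hypothetical, and the interleaved layout is an assumption of this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts interleaved pixels with `src_channels` channels
// (3 for RGB, 4 for RGBX) to interleaved BGR.
// The fourth channel (X), if present, is dropped.
std::vector<std::uint8_t> to_bgr(const std::vector<std::uint8_t>& src,
                                 std::size_t src_channels) {
    std::vector<std::uint8_t> bgr;
    if (src_channels < 3) return bgr;  // not enough channels to form a BGR pixel
    bgr.reserve(src.size() / src_channels * 3);
    for (std::size_t i = 0; i + src_channels <= src.size(); i += src_channels) {
        bgr.push_back(src[i + 2]);  // B
        bgr.push_back(src[i + 1]);  // G
        bgr.push_back(src[i]);      // R
    }
    return bgr;
}
```

When the conversion is enabled through the input info, the application passes the original `RGB` or `RGBX` data and does not need a helper like this one.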
## Build Your Application
For details about building your application, refer to the CMake files for the sample applications.
All samples source code is located in the `<INSTALL_DIR>/openvino/inference_engine/samples` directory, where `INSTALL_DIR` is the OpenVINO™ installation directory.
### CMake project creation
1. **Create a structure** for the project:
``` sh
project/
   ├── CMakeLists.txt  - CMake file to build
   ├── ...             - Additional folders like includes/
   └── src/            - source folder
       └── main.cpp
build/                 - build directory
   ...
```
2. **Include Inference Engine, nGraph and OpenCV libraries** in `project/CMakeLists.txt`
[OpenCV](https://docs.opencv.org/master/db/df5/tutorial_linux_gcc_cmake.html) integration is needed mostly for pre-processing input data, and nGraph is needed for more complex applications that use the [nGraph API](../nGraph_DG/nGraph_dg.md).
3. **To build your project** using CMake with the default build tools currently available on your machine, execute the following commands:
> **NOTE**: Make sure the **Set the Environment Variables** step in the [OpenVINO Installation](../../inference-engine/samples/hello_nv12_input_classification/README.md) document is applied to your terminal. Otherwise, the `InferenceEngine_DIR` and `OpenCV_DIR` variables are not configured properly and the `find_package` calls fail.
```sh
cd build/
cmake ../project
cmake --build .
```
You can also specify additional build options (for example, to build the CMake project on Windows with specific build tools). Refer to the [CMake page](https://cmake.org/cmake/help/latest/manual/cmake.1.html#manual:cmake(1)) for details.
### Run Your Application
> **NOTE**: Before running, make sure you completed the **Set the Environment Variables** section in the [OpenVINO Installation](../../inference-engine/samples/hello_nv12_input_classification/README.md) document so that the application can find the libraries.

To run compiled applications on Microsoft* Windows* OS, make sure that the Microsoft* Visual C++ 2017
Redistributable and Intel® C++ Compiler 2017 Redistributable packages are installed and
the `<INSTALL_DIR>/bin/intel64/Release/*.dll` files are placed in the
application folder or are accessible via the `%PATH%` environment variable.
# Introduction to the Performance Topics {#openvino_docs_IE_DG_Intro_to_Performance}
This section is a shorter version of the
[Optimization Guide](supported_plugins/MULTI.md) for the Intel Deep Learning Deployment Toolkit.
## Precision
Inference precision directly affects the performance.
Model Optimizer can produce an IR with different precision. For example, float16 IR initially targets VPU and GPU devices, while, for example, the CPU can also execute regular float32.
Also, further device-specific inference precision settings are available, for example, [8-bit integer](Int8Inference.md) or [bfloat16](Bfloat16Inference.md) inference on the CPU.
Note that for [MULTI device](supported_plugins/MULTI.md) that supports automatic inference on multiple devices in parallel, you can use the FP16 IR.
You can find more information, including preferred data types for specific devices, in the
By default, the CPU uses an optimization that runs inference in lower precision when the platform allows it, to reach better performance within an acceptable accuracy range.
This approach is used for the CPU device if the platform supports the AVX512_BF16 instruction. In this case, a regular float32 model is converted to the [bfloat16](Bfloat16Inference.md) internal representation and inference runs with bfloat16 layers.
Below is an example command line that disables this feature on a CPU device with the AVX512_BF16 instruction and executes regular float32.
```
$ benchmark_app -m <model.xml> -enforcebf16=false
```
## Latency vs. Throughput
One way to increase computational efficiency is batching, which combines many (potentially tens) of
input images to achieve optimal throughput. However, high batch size also comes with a
latency penalty. So, for more real-time oriented usages, lower batch sizes (as low as a single input) are used.
Refer to the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample, which lets you measure latency vs. throughput.
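The trade-off can be put in numbers with a one-line formula: throughput is the batch size divided by the time it takes to process the batch. The sketch below uses hypothetical timings that are not measurements of any device.

```cpp
// Throughput in inputs per second for a batch that takes
// `batch_latency_ms` milliseconds to process.
double throughput_fps(int batch_size, double batch_latency_ms) {
    return batch_size * 1000.0 / batch_latency_ms;
}
```

For instance, if a single input took 10 ms and a batch of 8 took 40 ms (assumed values), throughput would double from 100 to 200 inputs per second, while every input in the batch would wait 40 ms instead of 10 ms for its result.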
## Using Async API
To gain better performance on accelerators, such as VPU, the Inference Engine uses the asynchronous approach (see
[Integrating Inference Engine in Your Application (current API)](Integrate_with_customer_application_new_API.md)).
The point is to amortize the cost of data transfers by pipelining; see [Async API explained](@ref omz_demos_object_detection_demo_ssd_async_README).
Since pipelining relies on the availability of parallel slack, running multiple inference requests in parallel is essential.
Refer to the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample, which enables running a number of inference requests in parallel. Specifying a different number of requests produces different throughput measurements.
## Best Latency on the Multi-Socket CPUs
Note that when latency is of concern, there are additional tips for multi-socket systems.
When input is limited to the single image, the only way to achieve the best latency is to limit execution to the single socket.
The reason is that single image is simply not enough
to saturate more than one socket. Also NUMA overheads might dominate the execution time.
Below is the example command line that limits the execution to the single socket using numactl for the best *latency* value
(assuming the machine with 28 phys cores per socket):
Note that if you have more than one input, running as many inference requests as you have NUMA nodes (or sockets)
usually gives the same best latency as a single request on the single socket, but much higher throughput. Assuming two NUMA nodes machine:
```
$ benchmark_app -m <model.xml> -nstreams 2
```
The number of NUMA nodes on the machine can be queried with `lscpu`.
Please see more on the NUMA support in the [Optimization Guide](supported_plugins/MULTI.md).
## Throughput Mode for CPU
Unlike most accelerators, CPU is perceived as an inherently latency-oriented device.
The 2018 R5 release of the Inference Engine introduced the "throughput" mode, which allows the Inference Engine to efficiently run multiple inference requests on the CPU simultaneously, greatly improving the throughput.
Internally, the execution resources are split/pinned into execution "streams".
This feature gives much better performance for networks that do not scale well with the number of threads (for example, lightweight topologies). This is especially pronounced for many-core server machines.
Run the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) and experiment with the number of infer requests running in parallel, as described in the next section.
Try different values of the `-nstreams` argument from `1` to the number of CPU cores and find the one that provides the best performance.
In addition to the number of streams, you can also vary the batch size to find the throughput sweet spot.
The throughput mode relaxes the requirement to saturate the CPU by using a large batch: running multiple independent inference requests in parallel often gives much better performance than using a batch only.
This allows you to simplify the app-logic, as you don't need to combine multiple inputs into a batch to achieve good CPU performance.
Instead, it is possible to keep a separate infer request per camera or another source of input and process the requests in parallel using Async API.
## Benchmark App
[Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample is the best performance reference.
It has a lot of device-specific knobs, but the primary usage is as simple as:
```bash
$ ./benchmark_app -d GPU -m <model> -i <input>
```
to measure the performance of the model on the GPU.
Or
```bash
$ ./benchmark_app -d CPU -m <model> -i <input>
```
to execute on the CPU instead.
For example, for the CPU throughput mode from the previous section, you can experiment with the number of streams (the `-nstreams` command-line parameter).
Try different values of the `-nstreams` argument from `1` to the number of CPU cores and find the one that provides the best performance. For example, on an 8-core CPU, compare `-nstreams 1` (a latency-oriented scenario) to `2`, `4` and `8` streams. Notice that `benchmark_app` automatically queries, creates and runs the number of requests required to saturate the given number of streams.
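A sweep over stream counts can be prepared programmatically. The sketch below generates candidate `-nstreams` values and the matching `benchmark_app` command lines. Trying powers of two is only a heuristic chosen for this example, and the helper names `candidate_streams` and `benchmark_commands` are hypothetical.

```cpp
#include <string>
#include <vector>

// Candidate values for -nstreams: powers of two from 1 up to the core count,
// plus the core count itself when it is not a power of two.
std::vector<unsigned> candidate_streams(unsigned cores) {
    std::vector<unsigned> result;
    if (cores == 0) return result;
    unsigned n = 1;
    while (true) {
        result.push_back(n);
        if (n > cores / 2) break;  // doubling again would exceed the core count
        n *= 2;
    }
    if (result.back() != cores) result.push_back(cores);
    return result;
}

// Builds one benchmark_app command line per candidate stream count.
std::vector<std::string> benchmark_commands(const std::string& model, unsigned cores) {
    std::vector<std::string> commands;
    for (unsigned n : candidate_streams(cores)) {
        commands.push_back("./benchmark_app -d CPU -m " + model +
                           " -nstreams " + std::to_string(n));
    }
    return commands;
}
```

Running each generated command and comparing the reported throughput shows which stream count suits a given model and machine.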
Finally, notice that when you don't specify the number of streams with `-nstreams`, the "AUTO" value for the streams is used. For example, for the CPU this is [CPU_THROUGHPUT_AUTO](supported_plugins/CPU.md). You can spot the actual value behind "AUTO" for your machine in the application output.
Notice that the "AUTO" number is not necessarily optimal, so it is generally recommended to experiment either with the `-nstreams` option of `benchmark_app` as described above, or with the [new Workbench tool](@ref workbench_docs_Workbench_DG_Introduction).
## Kernels Tuning for GPU
The GPU backend comes with a feature that allows model tuning, so that the workload is configured to fit the hardware better.
Tuning is a time-consuming process that internally executes every layer several (or even hundreds of) times to find the most performant configuration.
This configuration is saved into a JSON-formatted file, whose name can be passed as a plugin parameter to the network. The GPU backend processes this data to configure kernels for the best performance.
For more details about kernels tuning and a how-to, refer to [GPU Kernels Tuning](GPU_Kernels_Tuning.md).
4. [Integrate Inference Engine](Integrate_with_customer_application_new_API.md) in your application to deploy the model in the target environment.
## Model Optimizer <a name = "MO"></a>
Model Optimizer is a cross-platform command line tool that facilitates the transition between the training and
deployment environment, performs static model analysis and automatically adjusts deep learning
models for optimal execution on end-point target devices.
Model Optimizer is designed to support multiple deep learning [frameworks and formats](#SupportedFW).
While running Model Optimizer, you do not need to consider which target device you wish to use: the same Model Optimizer output can be used on all targets.
### Model Optimizer Workflow
The process assumes that you have a network model trained using one of the [supported frameworks](#SupportedFW).
The Model Optimizer workflow can be described as follows:
* [Configure Model Optimizer](../MO_DG/prepare_model/Config_Model_Optimizer.md) for the supported deep learning framework that was used to train the model.
* Provide as input a trained network that contains a certain network topology, and the adjusted weights and
biases (with some optional parameters).
* [Run Model Optimizer](../MO_DG/prepare_model/convert_model/Converting_Model.md) to perform specific model optimizations (for example, horizontal fusion of certain network layers). Exact optimizations
are framework-specific, refer to appropriate documentation pages: [Converting a Caffe Model](../MO_DG/prepare_model/convert_model/Convert_Model_From_Caffe.md),
[Converting a TensorFlow Model](../MO_DG/prepare_model/convert_model/Convert_Model_From_TensorFlow.md), [Converting a MXNet Model](../MO_DG/prepare_model/convert_model/Convert_Model_From_MxNet.md), [Converting a Kaldi Model](../MO_DG/prepare_model/convert_model/Convert_Model_From_Kaldi.md),
[Converting an ONNX Model](../MO_DG/prepare_model/convert_model/Convert_Model_From_ONNX.md).
* Model Optimizer produces as output an [Intermediate Representation (IR)](../MO_DG/IR_and_opsets.md) of the network which is used as an input for the Inference Engine on all targets.
### Supported Frameworks and Formats <a name = "SupportedFW"></a>
* Caffe* (most public branches)
* TensorFlow*
* MXNet*
* Kaldi*
* ONNX*
### Supported Models
For the list of supported models refer to the framework or format specific page:
The Intermediate Representation (IR) that describes a deep learning model plays an important role in connecting the OpenVINO™ toolkit components.
The IR is a pair of files:
* `.xml`: The topology file - an XML file that describes the network topology
* `.bin`: The trained data file - a .bin file that contains the weights and biases binary data
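Because the two files belong together, applications often derive one path from the other. The helper below is a sketch that assumes both files share a base name and differ only in extension, which is a convention taken for granted in this example. The function name `weights_path_for` is hypothetical.

```cpp
#include <string>

// Derives the path of the weights file that sits next to an IR topology file,
// for example "model.xml" -> "model.bin".
std::string weights_path_for(const std::string& xml_path) {
    const std::string ext = ".xml";
    if (xml_path.size() >= ext.size() &&
        xml_path.compare(xml_path.size() - ext.size(), ext.size(), ext) == 0) {
        // Replace the ".xml" extension with ".bin".
        return xml_path.substr(0, xml_path.size() - ext.size()) + ".bin";
    }
    // No ".xml" extension: append ".bin" to the path as given.
    return xml_path + ".bin";
}
```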
Intermediate Representation (IR) files can be read, loaded and inferred with the [Inference Engine](#IE).
Inference Engine API offers a unified API across a number of [supported Intel® platforms](#SupportedTargets).
IR is also consumed, modified and written by Post-Training Optimization Tool which provides quantization capabilities.
Refer to a dedicated description about [Intermediate Representation and Operation Sets](../MO_DG/IR_and_opsets.md) for further details.
## nGraph Integration
OpenVINO toolkit is powered by nGraph capabilities for Graph construction API, Graph transformation engine and Reshape.
nGraph Function is used as an intermediate representation for a model in the run-time underneath the CNNNetwork API.
The conventional representation for CNNNetwork is still available if requested for backward compatibility when some conventional API methods are used.
Please refer to the [Overview of nGraph](../nGraph_DG/nGraph_dg.md) describing the details of nGraph representation.
## Inference Engine <a name = "IE"></a>
Inference Engine is a runtime that delivers a unified API to integrate the inference with application logic:
* Takes a model as an input. The model can be presented in [the native ONNX format](./ONNX_Support.md) or in the specific form of [Intermediate Representation (IR)](../MO_DG/IR_and_opsets.md)
produced by Model Optimizer.
* Optimizes inference execution for target hardware.
* Delivers inference solution with reduced footprint on embedded inference platforms.
The Inference Engine supports inference of multiple image classification networks,
including AlexNet, GoogLeNet, VGG and ResNet families of networks, fully convolutional networks like FCN8 used for image
segmentation, and object detection networks like Faster R-CNN.
For the full list of supported hardware, refer to the
For Intel® Distribution of OpenVINO™ toolkit, the Inference Engine package contains [headers](files.html), runtime libraries, and
[sample console applications](Samples_Overview.md) demonstrating how you can use
the Inference Engine in your applications.
The open source version is available in the [OpenVINO™ toolkit GitHub repository](https://github.com/openvinotoolkit/openvino) and can be built for supported platforms using the <a href="https://github.com/openvinotoolkit/openvino/wiki/BuildingCode">Inference Engine Build Instructions</a>.
## See Also
- [Inference Engine Samples](Samples_Overview.md)
- [Intel® Deep Learning Deployment Toolkit Web Page](https://software.intel.com/en-us/computer-vision-sdk)
[scheme]: img/workflow_steps.png
#### Optimization Notice
<sup>For complete information about compiler optimizations, see our [Optimization Notice](https://software.intel.com/en-us/articles/optimization-notice#opt-en).</sup>
# Known Issues and Limitations {#openvino_docs_IE_DG_Known_Issues_Limitations}
## Multiple OpenMP Loadings
If the application uses the Inference Engine with third-party components that depend on Intel OpenMP, multiple loadings of the libiomp library may occur and cause OpenMP runtime initialization conflicts. This may happen, for example, if the application uses Intel® Math Kernel Library (Intel® MKL) through the “Single Dynamic Library” (<code>libmkl_rt.so</code>) mechanism and calls Intel MKL after loading the Inference Engine plugin.
The error log looks as follows:
```sh
OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
```
Possible workarounds:
* Preload the OpenMP runtime using the <code>LD_PRELOAD</code> variable:
# Legal Information {#openvino_docs_IE_DG_Legal_Information}
<sup>No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.</sup><br/>
<sup>Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.</sup><br/>
<sup>This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.</sup><br/>
<sup>The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.</sup><br/>
<sup>Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting [<b>www.intel.com/design/literature.htm</b>](http://www.intel.com/design/literature.htm).</sup><br/>
<sup>Intel, Intel logo, Intel Core, VTune, Xeon are trademarks of Intel Corporation in the U.S. and other countries.</sup><br/>
<sup>\* Other names and brands may be claimed as the property of others.</sup><br/>
<sup>This software and the related documents are Intel copyrighted materials, and your use of them is governed by the express license under which they were provided to you (License). Unless the License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or transmit this software or the related documents without Intel's prior written permission.</sup><br/>
<sup>This software and the related documents are provided as is, with no express or implied warranties, other than those that are expressly stated in the License.</sup><br/>
<code>InferenceEngine::TensorDesc</code> is a special class that provides a layout format description.
This class lets you create planar layouts using the standard formats (like <code>InferenceEngine::Layout::NCDHW</code>, <code>InferenceEngine::Layout::NCHW</code>, <code>InferenceEngine::Layout::NC</code>, <code>InferenceEngine::Layout::C</code>, and so on) and also non-planar layouts using <code>InferenceEngine::BlockingDesc</code>.
To create a complex layout, use <code>InferenceEngine::BlockingDesc</code>, which lets you define blocked memory with offsets and strides.
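How blocked dimensions and an order translate into memory offsets can be sketched without the Inference Engine headers. The code below assumes a dense layout in which the order is a plain permutation of the logical dimensions (as for NHWC), so it does not cover layouts where one dimension is split into blocks. The function names `dense_strides` and `permuted_offset` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Strides of a dense layout: the last blocked dimension is contiguous.
std::vector<std::size_t> dense_strides(const std::vector<std::size_t>& blocked_dims) {
    std::vector<std::size_t> strides(blocked_dims.size(), 1);
    for (std::size_t i = blocked_dims.size(); i-- > 1;) {
        strides[i - 1] = strides[i] * blocked_dims[i];
    }
    return strides;
}

// Offset of a logical element in memory described by blocked dimensions and
// an order. order[i] names the logical dimension stored at position i.
std::size_t permuted_offset(const std::vector<std::size_t>& blocked_dims,
                            const std::vector<std::size_t>& order,
                            const std::vector<std::size_t>& logical_index) {
    const std::vector<std::size_t> strides = dense_strides(blocked_dims);
    std::size_t offset = 0;
    for (std::size_t i = 0; i < order.size(); ++i) {
        offset += logical_index[order[i]] * strides[i];
    }
    return offset;
}
```

For an NHWC blob with logical dimensions {N: 1, C: 25, H: 20, W: 20}, the blocked dimensions are {1, 20, 20, 25} and the order is {0, 2, 3, 1}. The element at n = 0, c = 3, h = 2, w = 5 then sits at offset 2 * 500 + 5 * 25 + 3 = 1128.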
## Examples
1. You can define a blob with dimensions {N: 1, C: 25, H: 20, W: 20} and format NHWC using the following parameters:<br/>
<pre class="brush:cpp">
InferenceEngine::BlockingDesc({1, 20, 20, 25}, {0, 2, 3, 1}); // or
</pre>
2. If you have memory with real dimensions {N: 1, C: 25, H: 20, W: 20} but with channels blocked by 8, you can define it using the following parameters:<br/>
[DEPRECATED] Migration from Inference Engine Plugin API to Core API {#openvino_docs_IE_DG_Migration_CoreAPI}
===============================
For 2019 R2 Release, the new Inference Engine Core API is introduced. This guide is updated to reflect the new API approach. The Inference Engine Plugin API is still supported, but is going to be deprecated in future releases.
This section provides common steps to migrate your application written using the Inference Engine Plugin API (`InferenceEngine::InferencePlugin`) to the Inference Engine Core API (`InferenceEngine::Core`).
To learn how to write a new application using the Inference Engine, refer to [Integrate the Inference Engine Request API with Your Application](Integrate_with_customer_application_new_API.md) and [Inference Engine Samples Overview](Samples_Overview.md).
## Inference Engine Core Class
The Inference Engine Core class is implemented on top of the existing Inference Engine Plugin API and handles plugins internally.
The main responsibility of the `InferenceEngine::Core` class is to hide plugin specifics inside and provide a new layer of abstraction that works with devices (`InferenceEngine::Core::GetAvailableDevices`). Almost all methods of this class accept `deviceName` as an additional parameter that denotes an actual device you are working with. Plugins are listed in the `plugins.xml` file, which is loaded during constructing `InferenceEngine::Core` objects:
```bash
<ie>
<plugins>
<plugin name="CPU" location="libMKLDNNPlugin.so">
</plugin>
...
</ie>
```
## Migration Steps
Common migration process includes the following steps:
1. Migrate from the `InferenceEngine::InferencePlugin` initialization:
# ONNX* Importer API Tutorial {#openvino_docs_IE_DG_OnnxImporterTutorial}
> **NOTE**: This tutorial is deprecated. Since OpenVINO™ 2020.4 version, Inference Engine enables reading ONNX models via the Inference Engine Core API
> and there is no need to use directly the low-level ONNX* Importer API anymore.
> To read ONNX\* models, it's recommended to use the `Core::ReadNetwork()` method, which provides a uniform way to read models from IR or ONNX format.
This tutorial demonstrates how to use the ONNX\* Importer API.
This API makes it possible to create an nGraph `Function` object from an imported ONNX model.
All functions of the ONNX Importer API are in the [onnx.hpp][onnx_header] header file.
The API functions fall into two categories:
* Helper functions that check which ONNX ops are supported in a current version of the ONNX Importer
* Functions that read ONNX models from a stream or file and result in an nGraph function, which can be executed using the Inference Engine
## Check Which ONNX Ops Are Supported
To list all supported ONNX ops in a specific version and domain, use the `get_supported_operators`
The above code produces a list of all the supported operators for the `version` and `domain` you specified and outputs a list similar to this:
```cpp
Abs
Acos
...
Xor
```
To determine whether a specific ONNX operator in a particular version and domain is supported by the importer, use the `is_operator_supported` function as shown in the example below:
Once you create the `ng_function`, you can use it to run computation on the Inference Engine.
As it was shown in [Build a Model with nGraph Library](../nGraph_DG/build_function.md), `std::shared_ptr<ngraph::Function>` can be transformed into a `CNNNetwork`.
### <a name="stream">Stream as Input</a>
The code below shows how to convert the ONNX ResNet50 model to the nGraph function using `import_onnx_model` with the stream as an input:
The Inference Engine sample applications are simple console applications that show how to use specific Inference Engine capabilities within an application and assist developers in executing specific tasks such as loading a model, running inference, querying specific device capabilities, and so on.
After installation of Intel® Distribution of OpenVINO™ toolkit, C, C++ and Python* sample applications are available in the following directories, respectively:
* `<INSTALL_DIR>/inference_engine/samples/c`
* `<INSTALL_DIR>/inference_engine/samples/cpp`
* `<INSTALL_DIR>/inference_engine/samples/python`
Inference Engine sample applications include the following:
- **[Automatic Speech Recognition C++ Sample](../../inference-engine/samples/speech_sample/README.md)** – Acoustic model inference based on Kaldi neural networks and speech feature vectors.
- **Benchmark Application** – Estimates deep learning inference performance on supported devices for synchronous and asynchronous modes.
- [Benchmark C++ Application](../../inference-engine/samples/benchmark_app/README.md)
- **Hello Classification Sample** – Inference of image classification networks like AlexNet and GoogLeNet using Synchronous Inference Request API. Input of any size and layout can be set to an infer request which will be pre-processed automatically during inference (the sample supports only images as inputs and supports Unicode paths).
- [Hello Classification C++ Sample](../../inference-engine/samples/hello_classification/README.md)
- [Hello Classification C Sample](../../inference-engine/ie_bridges/c/samples/hello_classification/README.md)
- **Hello NV12 Input Classification Sample** – Input of any size and layout can be provided to an infer request. The sample transforms the input to the NV12 color format and pre-process it automatically during inference. The sample supports only images as inputs.
- [Hello NV12 Input Classification C++ Sample](../../inference-engine/samples/hello_nv12_input_classification/README.md)
- [Hello NV12 Input Classification C Sample](../../inference-engine/ie_bridges/c/samples/hello_nv12_input_classification/README.md)
- **Hello Query Device Sample** – Query of available Inference Engine devices and their metrics, configuration values.
- [Hello Query Device C++ Sample](../../inference-engine/samples/hello_query_device/README.md)
- **[Hello Reshape SSD C++ Sample**](../../inference-engine/samples/hello_reshape_ssd/README.md)** – Inference of SSD networks resized by ShapeInfer API according to an input size.
- **Image Classification Sample Async** – Inference of image classification networks like AlexNet and GoogLeNet using Asynchronous Inference Request API (the sample supports only images as inputs).
- [Image Classification C++ Sample Async](../../inference-engine/samples/classification_sample_async/README.md)
- **[Image Classification Python* Sample](../../inference-engine/ie_bridges/python/sample/classification_sample/README.md)** – Inference of image classification networks like AlexNet and GoogLeNet using Synchronous Inference Request API (the sample supports only images as inputs).
- **Neural Style Transfer Sample** – Style Transfer sample (the sample supports only images as inputs).
- [Neural Style Transfer C++ Sample](../../inference-engine/samples/style_transfer_sample/README.md)
- [Neural Style Transfer Python* Sample](../../inference-engine/ie_bridges/python/sample/style_transfer_sample/README.md)
- **[nGraph Function Creation C++ Sample](../../inference-engine/samples/ngraph_function_creation_sample/README.md)** – Construction of the LeNet network using the nGraph function creation sample.
- **Object Detection for SSD Sample** – Inference of object detection networks based on SSD. This sample is a simplified version that supports only images as inputs.
- [Object Detection for SSD C++ Sample](../../inference-engine/samples/object_detection_sample_ssd/README.md)
- [Object Detection for SSD C Sample](../../inference-engine/ie_bridges/c/samples/object_detection_sample_ssd/README.md)
- [Object Detection for SSD Python* Sample](../../inference-engine/ie_bridges/python/sample/object_detection_sample_ssd/README.md)
## Media Files Available for Samples
To run the sample applications, you can use images and videos from the media files collection available at https://github.com/intel-iot-devkit/sample-videos.
## Samples that Support Pre-Trained Models
You can download the [pre-trained models](@ref omz_models_intel_index) using the OpenVINO [Model Downloader](@ref omz_tools_downloader_README) or from [https://download.01.org/opencv/](https://download.01.org/opencv/).
## Build the Sample Applications
### <a name="build_samples_linux"></a>Build the Sample Applications on Linux*
The officially supported Linux* build environment is the following:
> **NOTE**: For building samples from the open-source version of OpenVINO™ toolkit, see the [build instructions on GitHub](https://github.com/openvinotoolkit/openvino/wiki/BuildingCode).
To build the C or C++ sample applications for Linux, go to the `<INSTALL_DIR>/inference_engine/samples/c` or `<INSTALL_DIR>/inference_engine/samples/cpp` directory, respectively, and run the `build_samples.sh` script:
```sh
build_samples.sh
```
Once the build is completed, you can find sample binaries in the following folders:
* C samples: `~/inference_engine_c_samples_build/intel64/Release`
* C++ samples: `~/inference_engine_cpp_samples_build/intel64/Release`
You can also build the sample applications manually:
> **NOTE**: If you have installed the product as a root user, switch to root mode before you continue: `sudo -i`
1. Navigate to a directory that you have write access to and create a samples build directory. This example uses a directory named `build`:
```sh
mkdir build
```
> **NOTE**: If you ran the Image Classification verification script during the installation, the C++ samples build directory was already created in your home directory: `~/inference_engine_samples_build/`
2. Go to the created directory:
```sh
cd build
```
3. Run CMake to generate the Make files for release or debug configuration. For example, for C++ samples:
For the release configuration, the sample application binaries are in `<path_to_build_directory>/intel64/Release/`;
for the debug configuration, they are in `<path_to_build_directory>/intel64/Debug/`.
### <a name="build_samples_windows"></a>Build the Sample Applications on Microsoft Windows* OS
The recommended Windows* build environment is the following:
* Microsoft Windows* 10
* Microsoft Visual Studio* 2017 or 2019
* CMake* version 3.10 or higher
> **NOTE**: If you want to use Microsoft Visual Studio 2019, you are required to install CMake 3.14.
To build the C or C++ sample applications on Windows, go to the `<INSTALL_DIR>\inference_engine\samples\c` or `<INSTALL_DIR>\inference_engine\samples\cpp` directory, respectively, and run the `build_samples_msvc.bat` batch file:
```sh
build_samples_msvc.bat
```
By default, the script automatically detects the highest Microsoft Visual Studio version installed on the machine and uses it to create and build
a solution for the sample code. Optionally, you can also specify the preferred Microsoft Visual Studio version to be used by the script. Supported
versions are `VS2017` and `VS2019`. For example, to build the C++ samples using Microsoft Visual Studio 2017, use the following command:
Once the build is completed, you can find sample binaries in the following folders:
* C samples: `C:\Users\<user>\Documents\Intel\OpenVINO\inference_engine_c_samples_build\intel64\Release`
* C++ samples: `C:\Users\<user>\Documents\Intel\OpenVINO\inference_engine_cpp_samples_build\intel64\Release`
You can also build a generated solution manually. For example, if you want to build C++ sample binaries in Debug configuration, run the appropriate version of
Microsoft Visual Studio and open the generated solution file
`C:\Users\<user>\Documents\Intel\OpenVINO\inference_engine_cpp_samples_build\Samples.sln`.
## Get Ready for Running the Sample Applications
### Get Ready for Running the Sample Applications on Linux*
Before running compiled binary files, make sure your application can find the
Inference Engine and OpenCV libraries.
Run the `setupvars` script to set all necessary environment variables:
```sh
source <INSTALL_DIR>/bin/setupvars.sh
```
**(Optional)**: The OpenVINO environment variables are removed when you close the
shell. As an option, you can permanently set the environment variables as follows:
1. Open the `.bashrc` file in `<user_home_directory>`:
```sh
vi <user_home_directory>/.bashrc
```
2. Add this line to the end of the file:
```sh
source /opt/intel/openvino/bin/setupvars.sh
```
3. Save and close the file: press the **Esc** key, type `:wq` and press the **Enter** key.
4. To test your change, open a new terminal. You will see `[setupvars.sh] OpenVINO environment initialized`.
You are ready to run sample applications. To learn about how to run a particular
sample, read the sample documentation by clicking the sample name in the samples
list above.
### Get Ready for Running the Sample Applications on Windows*
Before running compiled binary files, make sure your application can find the
Inference Engine and OpenCV libraries.
Use the `setupvars` script, which sets all necessary environment variables:
```sh
<INSTALL_DIR>\bin\setupvars.bat
```
To debug or run the samples on Windows in Microsoft Visual Studio, make sure you
have properly configured **Debugging** environment settings for the **Debug**
and **Release** configurations. Set correct paths to the OpenCV libraries, and
debug and release versions of the Inference Engine libraries.
For example, for the **Debug** configuration, open the project's
**Configuration Properties**, go to the **Debugging** category, and set the `PATH`
variable in the **Environment** field to the following:
Using Shape Inference {#openvino_docs_IE_DG_ShapeInference}
==========================================
Inference Engine takes three kinds of model description as an input, each of which is converted into an `InferenceEngine::CNNNetwork` object:
1. [Intermediate Representation (IR)](../MO_DG/IR_and_opsets.md) through `InferenceEngine::Core::ReadNetwork`
2. [ONNX model](../IE_DG/OnnxImporterTutorial.md) through `InferenceEngine::Core::ReadNetwork`
3. [nGraph::Function](../nGraph_DG/nGraph_dg.md) through the constructor of `InferenceEngine::CNNNetwork`
`InferenceEngine::CNNNetwork` keeps an `ngraph::Function` object with the model description internally.
The object should have fully defined input shapes to be successfully loaded to the Inference Engine plugins.
To resolve undefined input dimensions of a model, call the `CNNNetwork::reshape` method providing new input shapes before loading to the Inference Engine plugin.
Run the following code right after `InferenceEngine::CNNNetwork` creation to explicitly check for model input names and shapes:
```cpp
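CNNNetwork network = ...; // read IR / ONNX model or create from nGraph::Function explicitly
for (const auto& parameter : network.getFunction()->get_parameters()) {
    std::cout << "name: " << parameter->get_friendly_name() << " shape: " << parameter->get_partial_shape() << std::endl;
    if (parameter->get_partial_shape().is_dynamic())
        std::cout << "ATTENTION: Input shape is not fully defined. Use the CNNNetwork::reshape method to resolve it." << std::endl;
}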
```
To feed input data of a shape that is different from the model input shape, reshape the model first.
OpenVINO™ provides the following methods for runtime model reshaping:
* **Set a new input shape** with the `InferenceEngine::CNNNetwork::reshape` method.<br>
The `InferenceEngine::CNNNetwork::reshape` method updates input shapes and propagates them down to the outputs of the model through all intermediate layers.
You can reshape a model multiple times like in this application scheme:
> - Starting with the 2021.1 release, the Model Optimizer converts topologies keeping shape-calculating sub-graphs by default, which enables correct shape propagation during reshaping.
> - Older versions of IRs are not guaranteed to reshape successfully. Please regenerate them with the Model Optimizer of the latest version of OpenVINO™.<br>
> - If an ONNX model does not have a fully defined input shape and the model was imported with the ONNX importer, reshape the model before loading it to the plugin.
* **Set a new batch dimension value** with the `InferenceEngine::CNNNetwork::setBatchSize` method.<br>
The meaning of a model batch may vary depending on the model design.
The `InferenceEngine::CNNNetwork::setBatchSize` method deduces the index of a batch dimension based only on the input rank.
This method does not work for models with a non-zero index batch placement or models with inputs without a batch dimension.
The batch-setting algorithm does not involve the shape inference mechanism.
Batch of input and output shapes for all layers is set to a new batch value without layer validation.
It may cause both positive and negative side effects.
Due to the limitations described above, using this method is not recommended.
If you need to set a new batch size for the model, use the `CNNNetwork::reshape` method instead.
Do not use the runtime reshaping methods together. In particular, do not call the `CNNNetwork::reshape` method after you use `InferenceEngine::CNNNetwork::setBatchSize`.
The `InferenceEngine::CNNNetwork::setBatchSize` method causes irreversible conversion of the internal model representation into the legacy model representation.
The method does not use nGraph for shape inference, which leads to reduced reshape opportunities and may affect the performance of the model.
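
As a hedged illustration of the recommendation above, changing the batch through `CNNNetwork::reshape` comes down to editing the batch dimension in a map of input shapes yourself, with the batch axis named explicitly rather than deduced from the input rank. The sketch below uses a plain `std::map` as a stand-in for the shapes map so that it compiles without the Inference Engine headers; the `InputShapes` alias, `withBatch`, and `batchAxis` are hypothetical names introduced for this example.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the shapes map accepted by CNNNetwork::reshape
// (assumption: input name -> vector of dimensions).
using InputShapes = std::map<std::string, std::vector<std::size_t>>;

// Returns a copy of `shapes` in which dimension `batchAxis` of every input
// is set to `batch`. The caller chooses the axis, because only the model
// design tells where the batch dimension actually is.
InputShapes withBatch(InputShapes shapes, std::size_t batch, std::size_t batchAxis = 0) {
    for (auto& entry : shapes) {
        if (batchAxis >= entry.second.size())
            throw std::invalid_argument("input has no such axis: " + entry.first);
        entry.second[batchAxis] = batch;
    }
    return shapes;
}
```

In a real application, the resulting map would be passed to `CNNNetwork::reshape`, so the new batch is validated by shape propagation instead of being written into every layer unchecked. Applying one axis to all inputs is a simplification; models whose inputs place the batch differently need per-input handling.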
There are other approaches to reshape the model during the stage of <a href="_docs_MO_DG_prepare_model_convert_model_Converting_Model_General.html#when_to_specify_input_shapes">IR generation</a> or [nGraph::Function creation](../nGraph_DG/build_function.md).
In practice, some models are not ready to be reshaped. In this case, a new input shape cannot be set with the Model Optimizer or the `InferenceEngine::CNNNetwork::reshape` method.
## Troubleshooting Reshape Errors
Operation semantics may impose restrictions on input shapes of the operation.
Shape collision during shape propagation may be a sign that a new shape does not satisfy the restrictions.
Changing the model input shape may result in a shape collision in intermediate operations.
Examples of such operations:
- <a href="_docs_MO_DG_prepare_model_convert_model_IR_V10_opset1.html#Reshape">`Reshape` operation</a> with a hard-coded output shape value
- <a href="_docs_MO_DG_prepare_model_convert_model_IR_V10_opset1.html#MatMul">`MatMul` operation</a> with the `Const` second input cannot be resized by spatial dimensions due to operation semantics
Model structure and logic should not change significantly after model reshaping.
- The Global Pooling operation is commonly used to reduce the output feature map of classification models.
Having the input of the shape [N, C, H, W], Global Pooling returns the output of the shape [N, C, 1, 1].
Model architects usually express Global Pooling with the help of the `Pooling` operation with the fixed kernel size [H, W].
During spatial reshape, having the input of the shape [N, C, H1, W1], Pooling with the fixed kernel size [H, W] returns the output of the shape [N, C, H2, W2], where H2 and W2 are commonly not equal to `1`.
It breaks the classification model structure.
For example, [publicly available Inception family models from TensorFlow*](https://github.com/tensorflow/models/tree/master/research/slim#pre-trained-models) have this issue.
- Changing the model input shape may significantly affect its accuracy.
For example, Object Detection models from TensorFlow have resizing restrictions by design.
To keep the model valid after the reshape, choose a new input shape that satisfies conditions listed in the `pipeline.config` file.
For details, refer to the <a href="_docs_MO_DG_prepare_model_convert_model_tf_specific_Convert_Object_Detection_API_Models.html#tf_od_custom_input_shape">TensorFlow Object Detection API models resizing techniques</a>.
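
The Global Pooling case described above can be checked with simple arithmetic. The following is a minimal sketch, assuming a pooling window without padding and with floor rounding; the helper names `pooledExtent` and `pooledShape` and the numbers in the usage note are illustrative and do not come from any particular model or from the Inference Engine API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Output extent of a pooling window along one spatial axis,
// assuming no padding and floor rounding (assumptions of this sketch).
inline std::size_t pooledExtent(std::size_t input, std::size_t kernel, std::size_t stride) {
    assert(stride > 0 && input >= kernel);
    return (input - kernel) / stride + 1;
}

// Maps an input shape [N, C, H, W] to the output shape [N, C, H2, W2]
// of a pooling layer whose kernel size is fixed at model creation time.
inline std::array<std::size_t, 4> pooledShape(const std::array<std::size_t, 4>& in,
                                              std::size_t kernelH, std::size_t kernelW,
                                              std::size_t stride = 1) {
    return {in[0], in[1],
            pooledExtent(in[2], kernelH, stride),
            pooledExtent(in[3], kernelW, stride)};
}
```

With a kernel fixed at 8x8, a feature map of shape [1, 2048, 8, 8] pools to [1, 2048, 1, 1], while a feature map of shape [1, 2048, 12, 12] produced after a spatial reshape pools to [1, 2048, 5, 5], so the layers that follow no longer receive the 1x1 map they were designed for.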
## Usage of Reshape Method <a name="usage_of_reshape_method"></a>
The primary method of the feature is `InferenceEngine::CNNNetwork::reshape`.
It takes new input shapes and propagates them from input to output through all intermediate layers of the given network.
The method takes `InferenceEngine::ICNNNetwork::InputShapes`, a map of pairs: the name of the input data and its dimensions.
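
To make the shape of that argument concrete, here is a minimal sketch that edits such a map. It uses a plain `std::map` from input name to a vector of dimensions as a stand-in for `InputShapes` so that it compiles without the Inference Engine headers; the alias, the helper name `withSpatialSize`, and the NCHW layout are assumptions of this example.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for InferenceEngine::ICNNNetwork::InputShapes
// (assumption: input name -> vector of dimensions).
using InputShapes = std::map<std::string, std::vector<std::size_t>>;

// Returns a copy of `shapes` in which the named 4D input, assumed to use
// the NCHW layout, gets a new height and width. Batch and channels are kept.
InputShapes withSpatialSize(InputShapes shapes, const std::string& inputName,
                            std::size_t height, std::size_t width) {
    auto it = shapes.find(inputName);
    if (it == shapes.end() || it->second.size() != 4)
        throw std::invalid_argument("expected an existing 4D input: " + inputName);
    it->second[2] = height;  // H
    it->second[3] = width;   // W
    return shapes;
}
```

In a real application, the starting map would come from `InferenceEngine::CNNNetwork::getInputShapes`, and the edited map would be passed to `InferenceEngine::CNNNetwork::reshape`.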
The algorithm for resizing a network is the following:
1) **Collect the map of input names and shapes from Intermediate Representation (IR)** using the helper method `InferenceEngine::CNNNetwork::getInputShapes`
2) **Set new input shapes**
3) **Call reshape**
Here is a code example:
```cpp
InferenceEngine::Core core;
// ------------- 0. Read IR and image ----------------------------------------------
```
OpenVINO™ tools are C++ and Python\* console command line applications that can be used for models downloading, accuracy measurement, calibration and checking.
The OpenVINO™ toolkit installation includes the following tools: