# Introduction to the Performance Topics {#openvino_docs_IE_DG_Intro_to_Performance}

This section is a shorter version of the
[Optimization Guide](../optimization_guide/dldt_optimization_guide.md) for the Intel Deep Learning Deployment Toolkit.
## Precision

Inference precision directly affects the performance.

Model Optimizer can produce IRs with different precision. For example, a float16 IR initially targets VPU and GPU devices, while the CPU can also execute regular float32.
Also, further device-specific inference precision settings are available, for example, [8-bit integer](Int8Inference.md) or [bfloat16](Bfloat16Inference.md) inference on the CPU.
Note that the [MULTI device](supported_plugins/MULTI.md), which supports automatic inference on multiple devices in parallel, can use the FP16 IR.
You can find more information, including preferred data types for specific devices, in the
[Supported Devices](supported_plugins/Supported_Devices.md) section.
## Lowering Inference Precision

By default, the CPU plugin lowers the inference precision when this reaches better performance on a given platform with an acceptable accuracy range.
This approach is used for the CPU device if the platform supports the AVX512_BF16 instruction. In this case, a regular float32 model is converted to a [bfloat16](Bfloat16Inference.md) internal representation, and inference runs with bfloat16 layers.
Below is an example command line that disables this feature on a CPU with the AVX512_BF16 instruction and executes regular float32:
```
$ benchmark_app -m <model.xml> -enforcebf16=false
```
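
If you configure the plugin from code rather than from the command line, the same effect can be achieved through the plugin configuration. Below is a minimal C++ sketch, assuming the `KEY_ENFORCE_BF16` key described in [Bfloat16 Inference](Bfloat16Inference.md); the model path is a placeholder:
```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork("model.xml");

    // Assumption: KEY_ENFORCE_BF16 ("ENFORCE_BF16") controls the bfloat16
    // conversion; setting it to NO keeps regular float32 execution.
    auto execNetwork = core.LoadNetwork(network, "CPU",
        {{InferenceEngine::PluginConfigParams::KEY_ENFORCE_BF16,
          InferenceEngine::PluginConfigParams::NO}});
    return 0;
}
```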

## Latency vs. Throughput

One way to increase computational efficiency is batching, which combines many (potentially tens of)
input images to achieve optimal throughput. However, a high batch size also comes with a
latency penalty. So, for more real-time oriented usages, lower batch sizes (as low as a single input) are used.
Refer to the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample, which allows measuring latency vs. throughput.
## Using Async API

To gain better performance on accelerators such as the VPU, the Inference Engine uses the asynchronous approach (see
[Integrating Inference Engine in Your Application (current API)](Integrate_with_customer_application_new_API.md)).
The point is to amortize the costs of data transfers by pipelining; see [Async API explained](@ref omz_demos_object_detection_demo_ssd_async_README).
Since the pipelining relies on the availability of parallel slack, running multiple inference requests in parallel is essential.
Refer to the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample, which enables running a number of inference requests in parallel. Specifying different numbers of requests produces different throughput measurements.
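
In application code, the pipelining boils down to keeping several infer requests in flight with `StartAsync`/`Wait` instead of the blocking `Infer`. Below is a minimal C++ sketch; the model path, device, and request count are placeholders:
```cpp
#include <inference_engine.hpp>
#include <vector>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork("model.xml");
    auto execNetwork = core.LoadNetwork(network, "CPU");

    // Several requests in flight provide the parallel slack the pipeline needs.
    std::vector<InferenceEngine::InferRequest> requests;
    for (int i = 0; i < 4; ++i)
        requests.push_back(execNetwork.CreateInferRequest());

    for (auto& request : requests)
        request.StartAsync();  // non-blocking submission

    for (auto& request : requests)
        request.Wait(InferenceEngine::IInferRequest::WaitMode::RESULT_READY);
    return 0;
}
```
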
## Best Latency on the Multi-Socket CPUs

Note that when latency is of concern, there are additional tips for multi-socket systems.
When the input is limited to a single image, the only way to achieve the best latency is to limit execution to a single socket.
The reason is that a single image is simply not enough
to saturate more than one socket. Also, NUMA overheads might dominate the execution time.
Below is an example command line that limits the execution to a single socket using `numactl` for the best *latency* value
(assuming a machine with 28 physical cores per socket):
```
$ numactl -m 0 --physcpubind 0-27 benchmark_app -m <model.xml> -api sync -nthreads 28
```
Note that if you have more than one input, running as many inference requests as you have NUMA nodes (or sockets)
usually gives the same best latency as a single request on a single socket, but much higher throughput. Assuming a machine with two NUMA nodes:
```
$ benchmark_app -m <model.xml> -nstreams 2
```
The number of NUMA nodes on the machine can be queried via `lscpu`.
Please see more on the NUMA support in the [Optimization Guide](../optimization_guide/dldt_optimization_guide.md).

## Throughput Mode for CPU

Unlike most accelerators, the CPU is perceived as an inherently latency-oriented device.
Since the 2018 R5 release, the Inference Engine has featured the "throughput" mode, which allows it to efficiently run multiple inference requests on the CPU simultaneously, greatly improving the throughput.

Internally, the execution resources are split/pinned into execution "streams".
Using this feature gains much better performance for networks that originally do not scale well with the number of threads (for example, lightweight topologies). This is especially pronounced on many-core server machines.

Run the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) and play with the number of infer requests running in parallel, as described in the next section.
Try different values of the `-nstreams` argument from `1` to the number of CPU cores and find the one that provides the best performance.

In addition to the number of streams, it is also possible to play with the batch size to find the throughput sweet spot.

The throughput mode relaxes the requirement to saturate the CPU by using a large batch: running multiple independent inference requests in parallel often gives much better performance than using a batch only.
This allows you to simplify the application logic, as you don't need to combine multiple inputs into a batch to achieve good CPU performance.
Instead, it is possible to keep a separate infer request per camera or other input source and process the requests in parallel using the Async API, as in the sketch below.
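
Programmatically, streams are requested via the `KEY_CPU_THROUGHPUT_STREAMS` configuration key. Below is a minimal C++ sketch; the stream count of `4` is an arbitrary example to be tuned as described above:
```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork("model.xml");

    // Request 4 execution streams; CPU_THROUGHPUT_AUTO lets the plugin decide instead.
    auto execNetwork = core.LoadNetwork(network, "CPU",
        {{InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS, "4"}});

    // One infer request per stream keeps all streams busy.
    auto request = execNetwork.CreateInferRequest();
    return 0;
}
```
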
## Benchmark App

The [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample is the best performance reference.
It has a lot of device-specific knobs, but the primary usage is as simple as:
```bash
$ ./benchmark_app -d GPU -m <model> -i <input>
```
to measure the performance of the model on the GPU.
Or
```bash
$ ./benchmark_app -d CPU -m <model> -i <input>
```
to execute on the CPU instead.

For example, for the CPU throughput mode from the previous section, you can play with the number of streams (the `-nstreams` command-line param).
Try different values of the `-nstreams` argument from `1` to the number of CPU cores and find the one that provides the best performance. For example, on an 8-core CPU, compare `-nstreams 1` (which is a latency-oriented scenario) to the `2`, `4`, and `8` streams. Notice that `benchmark_app` automatically queries/creates/runs the number of requests required to saturate the given number of streams.

Finally, notice that when you don't specify the number of streams with `-nstreams`, the "AUTO" value for the streams is used, e.g. for the CPU this is [CPU_THROUGHPUT_AUTO](supported_plugins/CPU.md). You can spot the actual value behind "AUTO" for your machine in the application output.
Notice that the "AUTO" number is not necessarily the most optimal, so it is generally recommended to play either with the benchmark_app's `-nstreams` as described above, or via the [new Workbench tool](@ref workbench_docs_Workbench_DG_Introduction).
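
When loading the network from code, a sketch like the following can reveal the value behind "AUTO" (it assumes the executable network reports the resolved stream count via `GetConfig`, as `benchmark_app` does):
```cpp
#include <inference_engine.hpp>
#include <iostream>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork("model.xml");
    auto execNetwork = core.LoadNetwork(network, "CPU",
        {{InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS,
          InferenceEngine::PluginConfigParams::CPU_THROUGHPUT_AUTO}});

    // The executable network reports the stream count resolved from "AUTO"...
    std::string streams = execNetwork.GetConfig(
        InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS).as<std::string>();
    // ...and the number of requests needed to saturate those streams.
    unsigned int nireq = execNetwork.GetMetric(
        METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS)).as<unsigned int>();
    std::cout << streams << " streams, " << nireq << " requests\n";
    return 0;
}
```
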
## Kernels Tuning for GPU

The GPU backend comes with a feature that allows model tuning, so the workload is configured to better fit the hardware.

Tuning is a time-consuming process that internally executes every layer several (or even hundreds of) times to find the most performant configuration.

This configuration is saved into a JSON-formatted file, whose name can be passed as a plugin parameter to the network. The GPU backend will process this data to configure kernels for the best performance.

For more details about kernels tuning and how to use it, please refer to [GPU Kernels Tuning](GPU_Kernels_Tuning.md).
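
As a rough C++ sketch of how tuning is typically enabled (the generic `TUNING_MODE`/`TUNING_FILE` config keys and the `TUNING_USE_AND_UPDATE` value are assumptions to verify against the guide above):
```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;
    // Assumed keys: TUNING_MODE selects the tuning behavior, TUNING_FILE names the
    // JSON cache; TUNING_USE_AND_UPDATE both applies and extends existing data.
    core.SetConfig({{CONFIG_KEY(TUNING_MODE), CONFIG_VALUE(TUNING_USE_AND_UPDATE)}}, "GPU");
    core.SetConfig({{CONFIG_KEY(TUNING_FILE), "model_tuning.json"}}, "GPU");

    auto network = core.ReadNetwork("model.xml");
    auto execNetwork = core.LoadNetwork(network, "GPU");  // layers get tuned here
    return 0;
}
```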