* auto-batching POC squashed (all commits from auto-batch-2021.3 branch)
(cherry picked from commit d7742f2c747bc514a126cc9a4d5b99f0ff5cbbc7)
* applying/accommodating the API changes after rebase to the master
* replaying modified version of actual batch selection
* early experiments with model mem footprint
* changes from rebasing to the latest master
* experimenting with DG1 on the batch size selection, also collecting the mem footprint
* WIP: moving the auto-batching to the icore to let the MULTI/AUTO support that, ALLOW_AUTO_BATCHING as a conventional config key. still fails hot device swap
* quick-n-dirty batch footprint vs device total mem
* code style
* testing which models perform badly due to kernels and NOT (batched) footprint
* stub pipeline task to communicate the readiness rather than promise/future
* quick-n-dirty timeout impl
* explicit _completionTasks,reverting BA to use the timeout
* input/output copies, works with AUTO and demo now
* accommodate the config per device-id, after rebase to the latest master
* allowing the auto-batching only with tput hint to let more conventional tests pass
* fix the premature timeout restarting via waiting for batch1 requests completion
* moved the batched request starting (along with input copies) to the dedicated thread
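The timeout-plus-dedicated-thread mechanism from the commits above can be sketched as follows. This is a hypothetical illustration, not the plugin's actual code: names like `BatchCollector` are invented, and the real implementation's request plumbing is far richer. The idea it shows is the one described: requests accumulate until the batch fills or a timeout fires, whichever comes first, with a graceful fall-through to fewer-than-batch (down to batch 1) requests.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <vector>

// Hypothetical sketch of the batching trigger: incoming infer requests are
// collected until either the batch is full or the timeout expires.
class BatchCollector {
public:
    explicit BatchCollector(size_t batch_size) : batch_size_(batch_size) {}

    // Called by each incoming infer request; returns true when this call
    // completes the batch.
    bool add_request(int request_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(request_id);
        if (pending_.size() >= batch_size_) {
            cv_.notify_one();
            return true;
        }
        return false;
    }

    // Runs on the dedicated thread: waits until the batch fills or the
    // timeout fires, then takes whatever has accumulated (possibly fewer
    // than batch_size_ requests, down to a single batch-1 request).
    std::vector<int> collect(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait_for(lock, timeout,
                     [this] { return pending_.size() >= batch_size_; });
        std::vector<int> batch;
        batch.swap(pending_);  // take everything; minimize time under the lock
        return batch;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::vector<int> pending_;
    const size_t batch_size_;
};
```

The `wait_for` predicate guards against spurious wakeups, and the `swap` under the lock mirrors the later commit about minimizing time spent in the critical section.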
* [IE CLDNN] Disable bs_fs_yx_bsv16_fsv16 format for int8 convolution
* code style
* increasing the timeout to test the ssd_* models perf (timeout?) issues
* reducing number of output stuff in BA to avoid bloating the logs in experiments
* more aggressive batching for experiments, not limited to 32 and also 4 as a min
* more accurate timeout debugging info
* getting the reqs limitation from the plugin SetConfig as well
* refactor the reshape logic a bit to accommodate CPU for batching, also added remote context
* let the benchmark_app consume specific batch values for the auto-batching such as BATCH:GPU(4)
* auto-batching functional test (with results check vs ref) and GPU instance for that
* fixed arithmetic on blobs ptrs
* clang
* handling possible batched network failure
* BATCH as the constants device name in test
* ENABLE_BATCH
* func tests for CPU, also DetectionOutput hetero tests (CPU and GPU)
* DetectionOutput hetero test for the CPU
* reenabling the Auto-Batching in the AUTO
* auto-batching device enabled in the test
* fixed the DO test
* improve the loading loop logic
* brushed the config keys
* allow hetero code-path for explicit device name like BATCH:GPU(4), used in the hetero code-path tests
* fix the test after refactoring
* clang
* moving ThreadSafeQueue to the ie_parallel, as it is re-used in the AUTO/MULTI and BATCH now
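Since several commits revolve around the shared `ThreadSafeQueue` (moved to `ie_parallel`, later to a dev-API header, with the unsafe `size()` usage removed), here is a minimal sketch of what such a container looks like. This is an illustrative stand-in, not the actual OpenVINO class: the point is the `try_pop` shape, where checking for emptiness and taking an element happen atomically under one lock, instead of a racy separate `size()` check.

```cpp
#include <mutex>
#include <queue>
#include <utility>

// Minimal sketch of a thread-safe queue in the spirit of the one the
// commits describe (the real one lives in the OpenVINO dev API).
template <typename T>
class ThreadSafeQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(value));
    }

    // Returns false instead of blocking when the queue is empty. Callers
    // must not pair this with a separate size() check: between the check
    // and the pop another thread could drain the queue.
    bool try_pop(T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty())
            return false;
        value = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<T> queue_;
};
```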
* auto-batching hetero test (subgraph with DetectionOutput)
* fixed minor changes that resulted from experiments with the impl
* code-style
* brushing, disabling CPU's HETERO tests until planned activity for 22.2
* removing home-baked MAX_BATCH_SIZE and switching to the official impl by GPU team
* remote blobs tests for the auto-batching (old API)
* brushed names a bit
* CreateContext and LoadNetwork with context for the Auto-Batching plus remote-blobs tests
* fixed the ieUnitTests with adding CreateContext stub to the MockICore
* clang
* improved remote-blobs tests
* reverted BA back from experiments with AB + device_use_mem
* conformance tests for BATCH, also batch size 1 is default for BATCH:DEVICE
* remote blobs 2.0 tests, issue with context having the orig device name
* debugging DG1 perf drop (presumably due to non-fitting the device-mem)
* disabling WA with batch/=2 for excessive mem footprint, leaving only streams 2
* remote blobs 2.0 tests for different tensor sharing types
* converting assert to throw to accommodate legacy API where lock() could be called
* revert the timeout back to avoid mixing the studies, fixed the footprint calc
* reverting to estimating the max batch by extrapolating from batch1 size
* more conservative footprint estimation (with batch1), graceful batch 1 handling without duplication
* even more graceful batch 1 handling without duplication
* WA for MAX_BATCH_SIZE failure, removing batch4 as a min for the auto-batching
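The batch-size estimation the commits above keep iterating on ("extrapolating from batch1 size", "more conservative footprint estimation") amounts to simple arithmetic against the device's total memory. The sketch below is a hypothetical reconstruction: the `safety_margin` value, the power-of-two clamping, and the function name are illustrative assumptions, not the plugin's actual policy.

```cpp
#include <cstdint>

// Hypothetical sketch: extrapolate the max batch from the measured batch-1
// memory footprint, keep a safety margin against the device's total memory,
// and round down to a power of two (batched kernels usually prefer that).
inline uint64_t estimate_batch(uint64_t device_total_mem,
                               uint64_t batch1_footprint,
                               double safety_margin = 0.5) {
    if (batch1_footprint == 0)
        return 1;
    uint64_t fit = static_cast<uint64_t>(
        (device_total_mem * safety_margin) / batch1_footprint);
    if (fit <= 1)
        return 1;  // graceful batch-1 fallback: no batching at all
    uint64_t batch = 1;
    while (batch * 2 <= fit)
        batch *= 2;  // round down to a power of two
    return batch;
}
```

Making the margin more conservative shrinks `fit`, which matches the "more conservative footprint estimation" direction taken after the DG1 perf-drop debugging.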
* AutoBatchPlugin -> ov_auto_batch_plugin
* WA for gcc 4.8
* clang
* fix misprint
* fixed errors resulting from the recent OV Variant to Any transition
* skip auto-batching for already-batched networks
* AUTO_BATCH_TIMEOUT and tests
* GPU-specific L3
* switched to pure config, also improved ALLOW_AUTO_BATCHING config key handling logic
* debugging device info
* enabling the config tests for the GPU and fixing the Auto-batching tests to pass
* making the default (when the driver is not recognized) cache size more aggressive, to accommodate recent HW with old drivers
* skip auto-batching for RNNs and the like (e.g. single CHW input)
* fixed the fallback to batch1 and moved the HETERO path under a condition to avoid bloating
* brushing
* Auto plugin GetMetric support gpu auto-batch
Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>
* add test case
Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>
* add comments on test
Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>
* brushing the var names, also adding the exception handling
* disabling the auto-batching for networks with non-batched outputs and faster-rcnn and the like (CVS-74085) to minimize the # of failures
* add try catch
Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>
* brushing the code changed in the GPU plugin
* Auto-Batch requests tests
* brushed variables a bit (ref)
* cleaned debug output from the ie_core
* cleaned cmake for the Auto-Batch
* removed batchN estimation from batch1
* cleaned from debug printf
* comments, cleanup
* WA the mock test errors introduced with merging the https://github.com/myshevts/openvino/pull/13
* Adding back removed batchN estimation from batch1 to debug degradations on DG1 (resulted from too optimistic MAX_BATCH_SIZE?). This partially reverts commit e8f1738ac1.
* brushing ie_core.cpp
* fix 32bit compilation
* Code review: ENABLE_AUTO_BATCH
* consolidate the auto-batching logic in ie_core.cpp into a single ApplyAutoBatching
* renamed/brushed the OPTIMAL_BATCH (now with _SIZE) and mimicked the MAX_BATCH_SIZE wrt MODEL_PTR
* default value for the OPTIMAL_BATCH_SIZE
* clang
* accommodate new func tests location
* fix shuffle of headers after clang + copyrights
* fixed misprint made during code refactoring
* moving the common thread-safe containers (like ThreadSafeQueue) to the dedicated dev_api header
* switch from the device name to the OPTIMAL_BATCH_SIZE metric presence as a condition to consider Auto-Batching
* switching from the unsafe size() and minimizing time under lock
* code style
* brushed the ApplyAutoBatching
* brushed the metric/config names and descriptions
* completed the core integration tests for the auto-batching
* ExecGraphInfo and check for incorrect cfg
* removed explicit dependencies from cmake file of the plugin
* disabling Auto-Batching thru the tput hint (to preserve the current product default), only explicit like BATCH:GPU used in the tests
Co-authored-by: Roman Lyamin <roman.lyamin@intel.com>
Co-authored-by: Hu, Yuan2 <yuan2.hu@intel.com>
* Remove some legacy targets
* Replace some targets
* Removed inference_engine_plugin_api dependency
* Minor comment for developer config
* Fixed include paths
* Small fixes for static build
* Try to fix build pyopenvino
* Fixed comments
* Try to fix build
* Include OpenVINODeveloperPackage inside InferenceEngineDeveloperPackageConfig
* Try to fix GAPI tests
* Implement the proposal and experimental_detecron_generate_proposals
* Implement the proposal shape infer
* Add ROI_Align OP shape infer implement.
* Fix building issue
* Fix bug.
* Update test cases.
* Add test cases for the OPs
* Apply the CI coding style check.
* Move the shape_infer API to the new folder.
* Update some fix.
* Applied review comments
* Move the shape infer tests into new folder.
* Apply review comments.
* Fix missing header when merging with master
+ Modified a way to add padding in prepare_padding
+ Changed condition of assertion for onednn padding
Signed-off-by: Min, Byungil <byungil.min@intel.com>
* Fix some bugs of cldnn eltwise kernel at bs_fs_yx_bsv32_fsv16 format.
+ Add a condition so that the eltwise_simple_vload8 kernel is not selected when the tensor is not aligned as bsv32_fsv16 or bsv32_fsv32.
+ Optimize gws/lws of eltwise_ref kernel for bsv32_fsv16 format.
* Check feature align of b_fs_yx_fsv32 for eltwise vload8 kernel
+ Minor fix for OV_GPU_Help option
* [GPU] Implement Random Uniform kernel
* [GPU] Add random uniform kernel selector
* [GPU] Add random uniform primitive
* [GPU] Add random uniform instantiation helpers
* [GPU] Implement unit test for the random uniform operation
* [GPU] Add random uniform operation builder in cldnn_engine plugin
* [GPU] Add single layer tests for the random uniform operation
* [GPU] Add conversion for Random Uniform output type
* [GPU] Add RTTI macros to RandomUniform operation
* Create ov_runtime library
* Added dev target
* Fixed SOURCE_DIR property for interface lib
* Fixed missed include path
* Fixed SOURCES for plugins and frontends
* Fixed Windows export
* Try to fix preproc build
* Fixed MKLDNNPlugin
* Fix C API
* Try to fix public CI
* Fixed comments
* Use TARGET_NAME variable
* Fixed static build
* Fixed target name in config
* Revert "Fixed target name in config"
This reverts commit f61c1e0c09.
* Try to remove WA with include path
* Try to fix TBB dependency for old CMake
* Revert install to lib for legacy tests
* Try to fix public tests
* Add WA for v7 reader
* Fixed dev export name
Co-authored-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
* Add DumpLayersLimitBatch config to GPU debug utils.
+ Support OV_GPU_Help config
+ Only run first inference if OV_GPU_DumpLayersPath is set.
+ Fix dump graph bug.
* Apply some comments
* Remove unnecessary code.