* auto-batching POC squashed (all commits from auto-batch-2021.3 branch)
(cherry picked from commit d7742f2c747bc514a126cc9a4d5b99f0ff5cbbc7)
* applying/accommodating the API changes after rebasing to master
* replaying a modified version of the actual batch selection
* early experiments with model mem footprint
* changes from rebasing to the latest master
* experimenting with DG1 on the batch size selection, also collecting the mem footprint
* WIP: moving the auto-batching to the ICore to let MULTI/AUTO support it, ALLOW_AUTO_BATCHING as a conventional config key; still fails hot device swap
* quick-n-dirty batch footprint vs device total mem
* code style
* testing which models perform badly due to kernels and NOT (batched) footprint
* stub pipeline task to communicate the readiness rather than promise/future
* quick-n-dirty timeout impl
* explicit _completionTasks, reverting BA to use the timeout
* input/output copies; works with AUTO and the demo now
* accommodate the config per device-id, after rebasing to the latest master
* allowing the auto-batching only with the tput hint to let more conventional tests pass
* fix the premature timeout restarting by waiting for batch1 requests' completion
* moved the batched request starting (along with input copies) to a dedicated thread
* [IE CLDNN] Disable bs_fs_yx_bsv16_fsv16 format for int8 convolution
* code style
* increasing the timeout to test the ssd_* models' perf (timeout?) issues
* reducing the amount of output in BA to avoid bloating the logs in experiments
* more aggressive batching for experiments: not limited to 32 as a max, and with 4 as a min
* more accurate timeout debugging info
* getting the reqs limitation from the plugin's SetConfig as well
* refactor the reshape logic a bit to accommodate the CPU for batching; also added remote context
* let the benchmark_app consume specific batch values for the auto-batching, such as BATCH:GPU(4)
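  (Illustrative sketch, not part of the diff: with the explicit BATCH device the
  batch size is encoded in the device name, so benchmark_app gets it via
  "-d BATCH:GPU(4)" and an application would load it roughly like below, assuming
  the classic InferenceEngine 1.0 API of that period.)

      #include <ie_core.hpp>

      int main() {
          InferenceEngine::Core ie;
          auto network = ie.ReadNetwork("model.xml");
          // "BATCH:GPU(4)" = run on the GPU with the batch size fixed to 4;
          // plain "BATCH:GPU" lets the plugin choose the batch size itself.
          auto execNet = ie.LoadNetwork(network, "BATCH:GPU(4)");
          auto request = execNet.CreateInferRequest();
          return 0;
      }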
* auto-batching functional test (with results checked vs the ref) and a GPU instance for that
* fixed arithmetic on blob ptrs
* clang
* handling possible batched network failure
* BATCH as the constant device name in tests
* ENABLE_BATCH
* func tests for CPU, also DetectionOutput hetero tests (CPU and GPU)
* DetectionOutput hetero test for the CPU
* reenabling the Auto-Batching in the AUTO
* auto-batching device enabled in the test
* fixed the DO test
* improve the loading loop logic
* brushed the config keys
* allow hetero code-path for explicit device name like BATCH:GPU(4), used in the hetero code-path tests
* fix the test after refactoring
* clang
* moving ThreadSafeQueue to the ie_parallel, as it is re-used in the AUTO/MULTI and BATCH now
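  (The shared ThreadSafeQueue is, in essence, a mutex-guarded std::queue; a
  minimal sketch of the idea, not the actual header:)

      #include <mutex>
      #include <queue>

      template <typename T>
      class ThreadSafeQueueSketch {
      public:
          void push(T value) {
              std::lock_guard<std::mutex> guard{_mutex};
              _queue.push(std::move(value));
          }
          // Non-blocking pop: returns false if the queue is empty.
          bool try_pop(T& value) {
              std::lock_guard<std::mutex> guard{_mutex};
              if (_queue.empty())
                  return false;
              value = std::move(_queue.front());
              _queue.pop();
              return true;
          }
      private:
          std::queue<T> _queue;
          std::mutex _mutex;
      };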
* auto-batching hetero test (subgraph with DetectionOutput)
* fixed minor changes that were a result of experiments with the impl
* code-style
* brushing, disabling CPU's HETERO tests until planned activity for 22.2
* removing the home-baked MAX_BATCH_SIZE and switching to the official impl by the GPU team
* remote blobs tests for the auto-batching (old API)
* brushed names a bit
* CreateContext and LoadNetwork with context for the Auto-Batching, plus remote-blobs tests
* fixed the ieUnitTests by adding a CreateContext stub to the MockICore
* clang
* improved remote-blobs tests
* reverting the BA back from experiments with AB + device_use_mem
* conformance tests for BATCH; also, batch size 1 is the default for BATCH:DEVICE
* remote blobs 2.0 tests, issue with context having the orig device name
* debugging DG1 perf drop (presumably due to not fitting into the device mem)
* disabling the WA with batch/=2 for excessive mem footprint, leaving only 2 streams
* remote blobs 2.0 tests for different tensor sharing types
* converting assert to throw to accommodate the legacy API where lock() could be called
* reverted the timeout back to avoid mixing the studies; fixed the footprint calc
* reverting to estimating the max batch by extrapolating from the batch1 size
* more conservative footprint estimation (with batch1); graceful batch 1 handling without duplication
* even more graceful batch 1 handling without duplication
* WA for MAX_BATCH_SIZE failure, removing batch4 as a min for the auto-batching
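  (The estimation discussed in the commits above boils down to dividing the
  device memory budget by the batch-1 footprint and rounding down to a power of
  two; an illustrative sketch with made-up names, not the actual code:)

      #include <algorithm>
      #include <cstddef>

      // Extrapolate the max batch from the batch-1 memory footprint,
      // cap it by the device total, keep it a power of two.
      static size_t estimate_max_batch(size_t batch1_footprint_bytes,
                                       size_t device_total_mem_bytes,
                                       size_t upper_limit = 32) {
          if (batch1_footprint_bytes == 0)
              return 1;
          const size_t by_mem = device_total_mem_bytes / batch1_footprint_bytes;
          size_t batch = 1;
          while (batch * 2 <= std::min(by_mem, upper_limit))
              batch *= 2;
          return batch;  // 1 means batching is not worth it
      }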
* AutoBatchPlugin -> ov_auto_batch_plugin
* WA for gcc 4.8
* clang
* fix misprint
* fixed errors resulting from OV's recent Variant-to-Any transition
* skip auto-batching for already-batched networks
* AUTO_BATCH_TIMEOUT and tests
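  (AUTO_BATCH_TIMEOUT bounds how long, in milliseconds, the plugin collects
  requests before running a smaller-than-full batch; a hedged usage sketch in
  the 1.0-style config, exact accepted values may differ:)

      #include <ie_core.hpp>
      #include <map>
      #include <string>

      int main() {
          InferenceEngine::Core ie;
          auto network = ie.ReadNetwork("model.xml");
          // Wait up to ~100 ms for enough requests to form a full batch.
          std::map<std::string, std::string> config{{"AUTO_BATCH_TIMEOUT", "100"}};
          auto execNet = ie.LoadNetwork(network, "BATCH:GPU", config);
          return 0;
      }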
* GPU-specific L3
* switched to pure config, also improved ALLOW_AUTO_BATCHING config key handling logic
* debugging device info
* enabling the config tests for the GPU and fixing the Auto-batching tests to pass
* making the default cache size (when the driver is not recognized) more aggressive, to accommodate recent HW with old drivers
* skip auto-batching for RNNs and the like (e.g. single CHW input)
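  (A toy version of such a gate, not the actual check: treat models whose
  inputs leave no room for a batch dimension, e.g. a single rank-3 CHW input,
  as unsuitable for auto-batching.)

      #include <memory>
      #include <openvino/openvino.hpp>

      static bool looks_batchable(const std::shared_ptr<ov::Model>& model) {
          for (const auto& param : model->get_parameters()) {
              const auto& pshape = param->get_partial_shape();
              // Rank < 4 (e.g. CHW) -> likely an RNN-style, non-batchable input.
              if (pshape.rank().is_static() && pshape.rank().get_length() < 4)
                  return false;
          }
          return true;
      }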
* fixed the fallback to batch1 and moved the HETERO path under a condition to avoid bloating
* brushing
* Auto plugin GetMetric supports GPU auto-batch
Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>
* add test case
Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>
* add comments on test
Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>
* brushing the var names, also adding the exception handling
* disabling the auto-batching for networks with non-batched outputs and faster-rcnn and the like (CVS-74085) to minimize the # of failures
* add try catch
Signed-off-by: Hu, Yuan2 <yuan2.hu@intel.com>
* brushing the code changed in the GPU plugin
* Auto-Batch requests tests
* brushed variables a bit (ref)
* cleaned debug output from the ie_core
* cleaned cmake for the Auto-Batch
* removed batchN estimation from batch1
* cleaned up the debug printfs
* comments, cleanup
* WA for the mock test errors introduced by merging https://github.com/myshevts/openvino/pull/13
* Adding back the removed batchN estimation from batch1 to debug degradations on DG1 (resulting from a too-optimistic MAX_BATCH_SIZE?). This partially reverts commit e8f1738ac1.
* brushing ie_core.cpp
* fix 32bit compilation
* Code review: ENABLE_AUTO_BATCH
* consolidate the auto-batching logic in ie_core.cpp into a single ApplyAutoBatching
* renamed/brushed the OPTIMAL_BATCH (now with _SIZE) and mimicked the MAX_BATCH_SIZE wrt MODEL_PTR
* default value for the OPTIMAL_BATCH_SIZE
* clang
* accommodate the new func tests location
* fix shuffle of headers after clang + copyrights
* fixed misprint made during code refactoring
* moving the common thread-safe containers (like ThreadSafeQueue) to the dedicated dev_api header
* switch from the device name to the OPTIMAL_BATCH_SIZE metric presence as the condition to consider Auto-Batching
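  (Illustrative probing logic, not the actual ie_core.cpp code: the presence of
  the OPTIMAL_BATCH_SIZE metric marks a device as auto-batching-capable, and the
  metric is then queried with the model passed via the MODEL_PTR option; whether
  the GetMetric-with-options overload looked exactly like this is an assumption.)

      #include <algorithm>
      #include <map>
      #include <string>
      #include <vector>
      #include <ie_core.hpp>

      int main() {
          InferenceEngine::Core ie;
          auto network = ie.ReadNetwork("model.xml");
          std::vector<std::string> metrics =
              ie.GetMetric("GPU", METRIC_KEY(SUPPORTED_METRICS));
          if (std::count(metrics.begin(), metrics.end(), "OPTIMAL_BATCH_SIZE")) {
              std::map<std::string, InferenceEngine::Parameter> options{
                  {"MODEL_PTR", network.getFunction()}};
              unsigned int batch = ie.GetMetric("GPU", "OPTIMAL_BATCH_SIZE", options);
              (void)batch;  // the core would auto-batch with this size
          }
          return 0;
      }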
* switching away from the unsafe size() and minimizing time under the lock
* code style
* brushed the ApplyAutoBatching
* brushed the metric/config names and descriptions
* completed the core integration tests for the auto-batching
* ExecGraphInfo and check for incorrect cfg
* removed explicit dependencies from cmake file of the plugin
* disabling Auto-Batching thru the tput hint (to preserve the current product default); only explicit BATCH:GPU is used in the tests
Co-authored-by: Roman Lyamin <roman.lyamin@intel.com>
Co-authored-by: Hu, Yuan2 <yuan2.hu@intel.com>
* Squashed commit of previous work
* Fix mock tests
* clang
* Fix rebase errors
* remove unnecessary changes
* One more finding
* Copy ov::Model runtime info as well
* Fix review comments
* Commit missing file
* Copy m_shared_object when cloning model
* removed copy_shared_objects and use clone_model(model, NodeMap) as a friend for ov::Model
* Added OPENVINO_API to forward declaration
* add OPENVINO_API to friend function declaration
* Update setupvars.bat with CMAKE_BUILD_TYPE
setupvars.bat paths are set depending on CMAKE_BUILD_TYPE.
RelWithDebInfo didn't work correctly.
* Check binaries path before patching setupvars
* Fix losing semicolon in setupvars.bat
* Shape as value propagation in i32
* Comments addressed
* code style
* Modifies the test to cover both NumPy and Bidirectional broadcasting
* MYR dynamic tests: made cases truly dynamic. Due to better shape inference, it turned out that the test cases were actually static.
* Deleting static shape test cases
* Fixes in the infer function of MO operation Select.
* Fixes in the nGraph transformation SharedShapeOf.
* Deleted commented code.
* Added more tests for the infer function of the MO operation Select.
* Started to write tests for the transformation SharedShapeOf.
* Added more tests.
* Now the transformation can correctly process a mix of opset1::ShapeOf and opset8::ShapeOf.
* Small change.
* Used opset1 and opset3 instead of opset1 and opset8.
* Used get_output_element_type(0) instead of checking the version of ShapeOf.
* Remove some legacy targets
* Replace some targets
* Removed inference_engine_plugin_api dependency
* Minor comment for developer config
* Fixed include paths
* Small fixes for static build
* Try to fix the pyopenvino build
* Fixed comments
* Try to fix build
* Include OpenVINODeveloperPackage inside InferenceEngineDeveloperPackageConfig
* Try to fix GAPI tests
* Implement the proposal and experimental_detectron_generate_proposals
* Implement the proposal shape infer
* Add ROI_Align OP shape infer implementation.
* Fix build issue
* Fix bug.
* Update test cases.
* Add test cases for the OPs
* Apply the CI coding style check.
* Move the shape_infer API to the new folder.
* Update some fixes.
* Applied review comments
* Move the shape infer tests into new folder.
* Apply review comments.
* Fix missing header when merging with master
* Fix incomprehensible error message during layout conversion when layout rank doesn't match with shape rank
* Stash
* stash
* Memcpy implementation
Added tests
* Revert "Fix incomprehensible error message during layout conversion when layout rank doesn't match with shape rank"
This reverts commit 37064741b2.
* Fix clang-format and remove redundant headers
* Covered "cached" case (+ tested on Myriad)
* Apply review comments
Introduced 'applyBatchedBlob' function which allows overriding 'memcpy' at inference time
* clang-format fix
* Added dynamic shape case
* - Review comments
- Deep copy of parameters/results for caching from cnnNetwork. Deep copy logic is moved to Utils
- Caching Tests: return correct inputs/outputs map after ImportNetwork mock call
* Reworked according to discussion
Also introduced 'SetBlobsImpl' which throws 'Not implemented' exception by default.
Template plugin updates internal '_batched_inputs' map
* Updated according to moved tests
* don't support 'memcpy' for ROI tensors
* Fix caching tests
* Just to retrigger CI
* Correct offset padding (however there is no test update, as the current implementation will not hit this path due to other checks)
* Fix clang-format
* Applied review comments
* Added check that 'get_tensor' throws if set_tensors/set_input_tensors is used
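  (Usage sketch in the 2.0 API: per-sample tensors supplied via set_tensors
  form one batched input, and reading the merged tensor back via get_tensor is
  expected to throw; the model path and input name here are made up.)

      #include <openvino/openvino.hpp>

      int main() {
          ov::Core core;
          auto model = core.read_model("model.xml");  // assumes a batch-2 input "data"
          auto compiled = core.compile_model(model, "TEMPLATE");
          auto request = compiled.create_infer_request();

          ov::Tensor t0(ov::element::f32, {1, 3, 224, 224});
          ov::Tensor t1(ov::element::f32, {1, 3, 224, 224});
          request.set_tensors("data", {t0, t1});  // two samples -> one batched input
          // request.get_tensor("data");          // would throw after set_tensors
          request.infer();
          return 0;
      }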
* Fix review comments - part 1
* Fix caching tests - mock implementation becomes more complicated
Cached mock model shall identify its inputs/outputs, otherwise core will assert at the SetExeNetworkInfo stage
* More comments fix
* More comments fixes
* More cleanup
* And more style comment
* typo fix
* Try fix caching windows tests
* Blind attempt to fix Ubuntu20 CI