* [GPU] Optimize stable_diffusion performance in iGPU.
Change the existing heuristic shape condition so that, in the transpose-gemm case, an explicit permute plus a non-transpose gemm is used instead.
Signed-off-by: hyunback <hyunback.kim@intel.com>
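The rewrite this commit relies on can be sketched numerically: a gemm that reads its input transposed is equivalent to materializing the transpose (the "permute") and calling a plain gemm. Minimal row-major helpers for illustration only, not the cldnn kernels:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<float>; // row-major storage

// Materialized transpose (the explicit "permute").
Mat transpose(const Mat& a, std::size_t rows, std::size_t cols) {
    Mat t(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            t[c * rows + r] = a[r * cols + c];
    return t;
}

// C[m x n] = A * B, where A is read transposed (stored k x m) if trans_a.
Mat gemm(const Mat& a, const Mat& b, std::size_t m, std::size_t n,
         std::size_t k, bool trans_a) {
    Mat c(m * n, 0.f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p) {
                float av = trans_a ? a[p * m + i] : a[i * k + p];
                c[i * n + j] += av * b[p * n + j];
            }
    return c;
}
```

Both paths produce the same result, which is what lets the heuristic choose whichever form is faster on the target GPU.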
* add dynamic shape support for dgpu in prepare_buffer_fusing
* add unit test
* add space between test cases
* update condition of impl create() for concat dynamic shape
* update unit test
* add comment and update unit test
* add impl_param.is_type() function
* [GPU] Impl cldnn::condition to support dynamic shape (#18051)
* Impl CreateIfOp
* Update calc_output_layouts and execute_impl
* Enable gpu unit test
* Create gpu functional test
* [GPU] Follow-up code review (#18051)
* remove redundant codes
* create custom execute method for condition_inst
* change name from update_loop_primitive_map to update_inner_program_io_map
* [GPU] Fix gpu func test failures for fp16
* Add more test-cases to support fp16 and nested if case
* [GPU] remove redundant codes
* refactoring var names
* fix windows build error
* [GPU] Fix windows build issue
* [GPU] update calc_output_layouts
* [GPU] remove custom condition_inst::execute
* Remove virtual keyword from primitive_inst::execute()
* [GPU] Share single task executor between main program and inner program
* [GPU] Fix input rank issue for const inner network in condition op
* [GPU] apply calc_output_layouts for roi_align
Co-authored-by: Vladimir Paramuzov <vladimir.paramuzov@intel.com>
* [GPU] avoid checking allow_new_shape_infer for inner program
---------
Co-authored-by: Vladimir Paramuzov <vladimir.paramuzov@intel.com>
* Fix get_partial_shape tensor API to access the correct index of dimensions
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Update the rule specifying output_type to the legacy one by referring to calc_output_layout
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Add reproducible TCs related to issues for ov_gpu_unit_tests
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Fix failed fc dynamic i8 TCs for ov_gpu_unit_tests
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Fix are_data_types_sutable_for_onednn not to invalidate output layout
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Apply comment
Signed-off-by: Andrew Park <andrew.park@intel.com>
---------
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Do not add a sync if the node is within a shape-of subgraph
Because the dependency has a CPU impl, its execution has already finished.
* Fixed as per review comment: skip clFinish only when the runtime dep is in a shape-of subgraph, not the current node
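The rule from the two commits above can be sketched as follows (names are illustrative, not the actual cldnn API): a device-side wait is skipped only for runtime dependencies that belong to a shape-of subgraph, since those are computed synchronously on the host by a CPU impl.

```cpp
#include <cassert>

// Hypothetical per-node flag: true when the node's value is produced
// on the host by a CPU impl inside a shape-of subgraph.
struct NodeInfo {
    bool in_shape_of_subgraph;
};

// A GPU node must synchronize with a runtime dependency unless that
// dependency is a shape-of subgraph node, whose result is already
// available by the time the GPU kernel is enqueued.
bool needs_sync_with(const NodeInfo& runtime_dep) {
    return !runtime_dep.in_shape_of_subgraph;
}
```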
* [IE TESTS] Add Global test config for Subgraph base test
* Replace using option by function redefinition
* fix build
* remove extra changes for gna/template
* code style
* add nvidia to devices
* Fix debian
* remove nvidia
* Fixed to use input shape rank when calculating output layout, added unit test case
* Fixed to use input shape rank when creating shape_of primitive, added functional tests
* [GPU] Fix skipped GemmBaseTests in iGPU.
Current GemmBaseTests on iGPU are skipped: they show as passing but do not actually run.
Signed-off-by: hyunback <hyunback.kim@intel.com>
* keep Const+DecompressionConvert pattern for CPU
* temporary disabled failing unit-tests
* disable CF by modifying bounds evaluate as well; minor corrections
* added TODOs with ticket numbers
* join const+decompression markings
* minimized convert_precision.cpp changes
* minor corrections
* refactor fp16 transformations: moved into separate fp16_compression folder
* style-fix
* minor fixes
* do not disable evaluate and CF in shape path
* safer disabling of Const conversion
* style-fix and minor corrections
* restore original placement of ConvertPrecision
* [GPU] Unique-10 operation implementation.
* Handled flattened case.
* Created results for all outputs in single layer test.
* Save total unique count as fifth output.
* Handled axis case.
* Added unique reshape kernel.
* Moved data types to unique primitive constructor.
* Added shape agnostic Unique ref kernel.
* Added blocked layout support to Unique-10.
* Use int in bubble sort.
* Added unit tests.
* Added support for blocked layouts to flattened mode.
* Fixed usage of shape_info in kernel.
* Use correct total data size for dynamic shapes.
* Commented some functional tests.
For some reason, big shapes cause std::bad_alloc.
* Initialize out_counts with zeros.
* Implemented new approach for reducing memory footprint.
Changed first kernel to only count unique values and changed second kernel to fill all outputs.
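The two-pass scheme described above can be illustrated on the host (simplified sketch; the real implementation is a pair of GPU kernels, and the function names here are hypothetical): pass 1 only counts the unique values so the outputs can be sized exactly, pass 2 actually fills them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pass 1: count unique values without producing any output buffers.
std::size_t unique_count(std::vector<int> data) {
    std::sort(data.begin(), data.end());
    return static_cast<std::size_t>(
        std::unique(data.begin(), data.end()) - data.begin());
}

// Pass 2: fill the output, whose size was determined by pass 1.
std::vector<int> unique_gather(std::vector<int> data, std::size_t count) {
    std::sort(data.begin(), data.end());
    data.erase(std::unique(data.begin(), data.end()), data.end());
    assert(data.size() == count); // the count from pass 1 sizes the output
    return data;
}
```

Allocating from the pass-1 count avoids reserving worst-case (input-sized) buffers for every output, which is the memory-footprint reduction the commit refers to.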
* Revert "Commented some functional tests."
This reverts commit a7f9763c575e71e14b85ee37adf1e98f10785c15.
* Fixed calc output layouts for flattened case when rank is greater than 4.
* Added temporary fix for axis case when rank is greater than 4.
* Revert "Added temporary fix for axis case when rank is greater than 4."
This reverts commit 236640d2f0e9d5b1f8dcbbf9482763badd7fde66.
* Renamed "unique" to "unique_count" and "unique_reshape" to "unique_gather" primitives.
* Quick fix for add_intermediate_node to consider dep_idx of multiple output
* Fix bug for multiple output:
1) get_reorder was getting reorder from cache regardless of the dep_idx.
2) remove_redundant_reorder was not considering original dep_idx
* Fixed conflicts.
* Fixed win build issue.
* Fixed build issue.
* Revert "Fix bug for multiple output:"
This reverts commit d4a2c4f32eabe9108df31d4837fed8995c93bd1c.
* Revert "Quick fix for add_intermediate_node to consider dep_idx of multiple output"
This reverts commit 2dfd2aaefdf32067a7469505b35f7096632ac5f2.
* Added some tests to skip config.
---------
Co-authored-by: Taylor Yeonbok Lee <taylor.lee@intel.com>
* Remove NV12 and I420 blobs and deprecate some legacy API
* Fixed some errors
* Remove NV12 blobs
* Remove NV12 conversion
* Fixed other warnings
* Suppress version
* Fix some warnings
* Fixed version
* Try to fix some warnings
* Suppress warnings in C header
* Suppress warnings in C
* Fixed Windows exceptions
* Try to fix warnings
* Try to fix C bindings build
* Suppress InferRequest
* Fixed some build issues
* Fixed some errors
* Fuse convert reorder to prev MVN/Concat node
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Add dynamic TCs for ov_gpu_unit_test
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Add descriptions for changes
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Fix kernel selection failure
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Add is_type_conversion_only function for reorder_node
Signed-off-by: Andrew Park <andrew.park@intel.com>
---------
Signed-off-by: Andrew Park <andrew.park@intel.com>
* [GPU] Add shape of subgraphs markup and initial cpu implementations for some of primitives
* Apply review comments
* Exclude eltwise with boolean mode types from shape of subgraphs and fix leftovers
* There were two issues in runtime buffer fusing:
1) Missing condition in the matcher for dynamic tensors
2) If the node is marked can_be_optimized = true at build time and then turns out to be false at runtime, kernel compilation was skipped because the check used node->can_be_optimized
=> To resolve this issue, added can_be_optimized to impl_param and let impl create() check can_be_optimized in impl_param instead of the one in the node.
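The fix described above amounts to the following (a minimal sketch with illustrative type names, not the real cldnn interfaces): the impl factory consults the runtime-updated flag carried in the impl params, rather than the stale build-time flag cached on the node.

```cpp
#include <cassert>

// Runtime-updated parameters handed to the impl factory.
struct ImplParams {
    bool can_be_optimized = false;
};

struct Impl {
    bool has_compiled_kernel;
};

// Compile a kernel whenever the runtime says the node is NOT optimized
// out, even if it looked optimizable at build time.
Impl create_impl(const ImplParams& params) {
    return Impl{!params.can_be_optimized};
}
```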
* Fixed primitive::can_be_optimized to be set through a function
* [GPU] Optimized out permute in permute-gemm(onednn) pattern.
The permute can be optimized out when its input and output layouts are compatible and the gemm uses oneDNN.
Signed-off-by: hyunback <hyunback.kim@intel.com>
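The "compatible input/output" condition can be sketched under a simplified row-major model (the real cldnn check also accounts for formats and padding): a permute leaves the physical element order unchanged exactly when the non-unit axes keep their relative order after permutation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// order[i] is the input axis placed at output position i.
bool permute_is_noop(const std::vector<std::size_t>& shape,
                     const std::vector<std::size_t>& order) {
    long prev = -1;
    for (std::size_t axis : order) {
        if (shape[axis] == 1)
            continue; // unit axes never affect the linear layout
        if (static_cast<long>(axis) < prev)
            return false; // two non-unit axes swapped -> data must move
        prev = static_cast<long>(axis);
    }
    return true;
}
```

For example, swapping a unit axis with its neighbor ({2, 1, 3} with order {0, 2, 1}) moves no data, while transposing a genuine 2x3 matrix does.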
* Initial impl for runtime buffer fusing
Passing unittest with static kernel
* pass unittest with dynamic impl
* Refactor allocate_output
* Separate header of buffer fusing
* Refactored buffer fusing :: matcher/optimize
* More cleanup
* Fix crash in dolly
* Reset can_be_optimized of primitive_inst when it no longer applies
* Fix empty tensor: a primitive with empty data should be skipped
* Fix issue in dynamic padding : Static kernel should not contain dynamic padding dims
Fix missing reset of update_shape_done_by_other flag
* Do not add an empty kernel to the cache for an optimized-out inst
* Fix corner case error in buffer fusing
- Shapes of some preds may be unchanged, but update_impl is still needed because 1) paddings have changed and 2) output memory should be updated
- An optimizable impl should not be added to the cache
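The update rule in the corner-case fix above can be condensed into one predicate (field names are hypothetical, for illustration): a shape check alone is not enough to decide whether the impl must be refreshed.

```cpp
#include <cassert>

// Per-predecessor state observed at runtime.
struct PredState {
    bool shape_changed;
    bool padding_changed;
    bool output_mem_changed;
};

// update_impl must run if ANY of these changed, not just the shape.
bool needs_update_impl(const PredState& s) {
    return s.shape_changed || s.padding_changed || s.output_mem_changed;
}
```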
* Allow reorder & permute_ref to be optimized as concat predecessors
* Some more fixes:
runtime buffer fusing is available only when all preds and the concat are dynamic
runtime buffer fusing is executed only if the node is dynamic
* Fix allocate_output parameter called by get_estimated_device_mem_usage according to the new change
* Fixed error in cascaded concat
* Need to reinterpret even though the size is the same