* [GPU] Change lws to avoid synchronization issue in nonzero_count (#16116)
* [GPU] Add unit test (#16116)
* [GPU] update count_nonzero_ref kernel (#16116)
- Support the case where the total data size exceeds the max work group size (see the sketch below)
- Add dynamic shape test case
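A minimal serial C++ sketch of the chunking idea, assumed from the notes above (the actual count_nonzero_ref OpenCL kernel is different): when the element count exceeds the maximum work group size, each "work group" counts the non-zeros in its own chunk and the per-group partial counts are reduced at the end.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Emulates one "work group" per chunk of at most max_wg_size elements,
// followed by a final reduction over the partial counts.
size_t count_nonzero_chunked(const std::vector<float>& data, size_t max_wg_size) {
    const size_t num_groups = (data.size() + max_wg_size - 1) / max_wg_size;
    std::vector<size_t> partial(num_groups, 0);

    for (size_t g = 0; g < num_groups; ++g) {
        const size_t begin = g * max_wg_size;
        const size_t end = std::min(begin + max_wg_size, data.size());
        for (size_t i = begin; i < end; ++i)
            partial[g] += (data[i] != 0.0f) ? 1 : 0;   // per-group partial count
    }
    // Final reduction over the per-group partial counts.
    return std::accumulate(partial.begin(), partial.end(), size_t{0});
}
```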
* [GPU] Change input indexing calculation and add random input generator in unit test (#16116)
* [GPU] update random input generation function in nonzero_count (#16116)
* [GPU] update unit test (#16116)
* [GPU] cldnn unit test: update random generation function to fix another test failure (fusings_gpu/conv_fp32_multi_eltwise_quantization.basic/0) (#16116)
* [GPU] Enabled ComparisonLayerTest in single layer tests.
It seems these tests were previously disabled because of some failures. I can no longer see any errors, so I have enabled all of them.
* [GPU] Run clang format for comparison single layer tests.
* [GPU] Added handling of f16 type to IsInfLayerTest.
* [GPU] Added single-layer tests for IsFinite and IsNaN operations.
* [GPU] Added single-layer test for IsInf operation.
* [GPU] Implemented IsFinite, IsInf, and IsNaN operations as activation functions.
But note that currently the activation kernel only supports an output data type equal to the input data type, so an additional reorder is needed to convert to the correct output data type for these ops. Also worth noting: activation functions can be fused into the reorder kernel, but for now that does not work for these ops, because the reorder's activation call performs a hard conversion of the input data to the output data type before applying the activation. I don't know why that conversion was added, but it breaks the fusion, so this activation fusion either needs to be fixed or disabled for these ops.
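A simplified illustration of the order-of-operations problem described above (not the actual reorder kernel code; the saturating conversion helper is a stand-in for the hard type conversion): converting the input to the output type before applying an IsInf-style activation destroys the information the activation needs.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <limits>

// Stand-in for the hard f32 -> u8 conversion done by the reorder before the fused activation.
uint8_t convert_u8_sat(float v) {
    if (std::isnan(v)) return 0;
    return static_cast<uint8_t>(std::min(std::max(v, 0.0f), 255.0f));
}

int main() {
    const float in = std::numeric_limits<float>::infinity();

    // Broken order (as fused in reorder): convert first, then apply IsInf.
    const bool fused = std::isinf(static_cast<float>(convert_u8_sat(in)));  // false

    // Intended order: apply IsInf on the original input, then convert the result.
    const bool expected = std::isinf(in);                                   // true

    std::cout << "fused=" << fused << " expected=" << expected << "\n";
}
```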
* Revert "[GPU] Implemented IsFinite, IsInf, and IsNaN operations as activation functions."
This reverts commit 3f9ffe617ecddce6dbbcdeab9584a7ddeb6d1845.
* [GPU] Implemented IsFinite, IsInf, and IsNaN operations as eltwise ops.
* [GPU] Changed CLDNN_ERROR_MESSAGE to OPENVINO_ASSERT in check_inputs_count method.
* [GPU] Minor fix for dynamic bert-base-uncased-qqp
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Fix to check full tensor only for static shapes when creating onednn gemm
Signed-off-by: Andrew Park <andrew.park@intel.com>
---------
Signed-off-by: Andrew Park <andrew.park@intel.com>
- Previously, PR15386 changed the memory allocation of primitives that are used as shape infer dependencies to host memory, for better shape inference performance.
- However, this causes a cache coherence issue on dGPU.
- Reverting this change so that the memory is allocated on the device
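A hypothetical before/after sketch of the policy being reverted; the enum and function names are illustrative, not the actual cldnn allocation API.

```cpp
enum class allocation_type { usm_host, usm_device };

// PR15386: primitives used as shape infer dependencies were placed in host memory
// so shape inference could read them faster on the host side.
allocation_type choose_allocation_pr15386(bool is_shape_infer_dependency) {
    return is_shape_infer_dependency ? allocation_type::usm_host
                                     : allocation_type::usm_device;
}

// This change: revert to device memory, because the host allocation caused a
// cache coherence issue on dGPU.
allocation_type choose_allocation_reverted(bool /*is_shape_infer_dependency*/) {
    return allocation_type::usm_device;
}
```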
* [dGPU] Enable stable diffusion
+ Prevent fusing swish into oneDNN reorder.
+ Make concat explicit if the batch size is greater than 1 and the siblings are oneDNN impls.
* [GPU] Added shape agnostic optimized SoftMax kernel
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Update SoftmaxKernelBaseBF::Validate policy for shape agnostic kernel
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Add softmax_gpu_bf shape agnostic TC for ov_gpu_unit_tests
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Fix failed TCs for ie-tests-linux-ubuntu20-gpu
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Update to use stack array instead of global buffer
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Remove global buffer usage completely
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Add #undef directive
Signed-off-by: Andrew Park <andrew.park@intel.com>
---------
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Enable crop shape agnostic kernel
* Added unit test
* Added a new scalar argument for crop (eltwise) to be used as the runtime input offset in the shape agnostic kernel (see the sketch below)
* Fix eltwise to use the runtime offset only for crop
* Fix unit test error
* Applied review comment
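A minimal sketch of the runtime input offset idea referenced above; plain C++ with illustrative names rather than the actual crop (eltwise) kernel. The offset arrives as a per-enqueue scalar argument, so one compiled shape agnostic kernel serves any crop position instead of baking the offset in at compile time.

```cpp
#include <cstddef>
#include <vector>

// 1D stand-in for crop: the input offset is a scalar argument set at enqueue time,
// not a compile-time constant, so the same kernel handles dynamically shaped inputs.
std::vector<float> crop_1d(const std::vector<float>& input,
                           size_t runtime_offset,   // scalar argument per enqueue
                           size_t output_size) {
    std::vector<float> output(output_size);
    for (size_t i = 0; i < output_size; ++i)
        output[i] = input[runtime_offset + i];
    return output;
}
```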
* [GPU] Fix output format not changing at runtime
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Add remove_redundant_reorders pass TC for ov_gpu_unit_tests
Signed-off-by: Andrew Park <andrew.park@intel.com>
---------
Signed-off-by: Andrew Park <andrew.park@intel.com>
* [GPU] Apply multi-threads for async compilation context (#15683)
- Use CPUStreamExecutor in compilation context
- Use a single compilation context, impl_cache, and kernels_cache for multiple streams (see the sketch below)
- Move compilation context to cldnn::program
- Move impl_cache to cldnn::program
- Create thread-safe impl_cache
- Create thread independent compilation function in kernels_cache
- Use kernels_cache in program and remove it from network
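A minimal sketch of the async compilation idea from the bullets above, assuming a hand-rolled worker pool in place of the actual CPUStreamExecutor and cldnn classes: kernel builds are queued on one shared context so multiple streams reuse the same workers, and cancel() drops pending work on shutdown.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class compilation_context_sketch {
public:
    explicit compilation_context_sketch(size_t num_threads) {
        for (size_t i = 0; i < num_threads; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }

    ~compilation_context_sketch() {
        cancel();                       // stop accepting work and wake the workers
        for (auto& w : workers_)
            w.join();
    }

    // Called by a stream that hits a not-yet-compiled kernel: the build runs
    // asynchronously instead of blocking the inference stream.
    void push_task(std::function<void()> build_kernels_task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (stopped_)
                return;
            tasks_.push(std::move(build_kernels_task));
        }
        cv_.notify_one();
    }

    void cancel() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopped_ = true;
        }
        cv_.notify_all();
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
                if (stopped_)
                    return;             // cancelled: drop any remaining queued builds
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();                     // compile outside the lock
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stopped_ = false;
};
```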
* [GPU] Fix segfault issue: ocl_engine and ocl_device are released while remaining compilation context tasks are still running (#15683)
- compilation context has own CPUStreamExecutor
* [GPU] Follow-up codereview (#15683)
- LruCacheThreadSafe inherits from LruCache (see the sketch below)
- FuncRemoveItem takes std::pair<Key,Value> as input
- Change prepare_tools to init_program
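A rough sketch of the thread-safe LRU wrapper shape described above; the real cldnn LruCache interface differs, this only shows the inheritance, the locking, and an eviction callback that receives a std::pair<Key, Value>.

```cpp
#include <functional>
#include <list>
#include <mutex>
#include <unordered_map>
#include <utility>

template <typename Key, typename Value>
class LruCache {
public:
    // Eviction callback takes the removed std::pair<Key, Value>.
    using FuncRemoveItem = std::function<void(std::pair<Key, Value>)>;

    explicit LruCache(size_t capacity) : capacity_(capacity) {}

    void set_remove_item_callback(FuncRemoveItem cb) { on_evict_ = std::move(cb); }

    void add(const Key& key, const Value& value) {
        auto it = index_.find(key);
        if (it != index_.end()) {                 // refresh an existing entry
            items_.erase(it->second);
            index_.erase(it);
        }
        items_.emplace_front(key, value);
        index_[key] = items_.begin();
        if (items_.size() > capacity_) {
            auto evicted = items_.back();         // least recently used entry
            index_.erase(evicted.first);
            items_.pop_back();
            if (on_evict_) on_evict_(evicted);
        }
    }

    bool get(const Key& key, Value& out) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        items_.splice(items_.begin(), items_, it->second);  // mark as recently used
        out = it->second->second;
        return true;
    }

private:
    size_t capacity_;
    std::list<std::pair<Key, Value>> items_;
    std::unordered_map<Key, typename std::list<std::pair<Key, Value>>::iterator> index_;
    FuncRemoveItem on_evict_;
};

// Derived wrapper adds a mutex so one impl_cache can be shared by multiple streams.
template <typename Key, typename Value>
class LruCacheThreadSafe : public LruCache<Key, Value> {
public:
    using LruCache<Key, Value>::LruCache;

    void add(const Key& key, const Value& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        LruCache<Key, Value>::add(key, value);
    }
    bool get(const Key& key, Value& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        return LruCache<Key, Value>::get(key, out);
    }

private:
    std::mutex mutex_;
};
```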
* [GPU] Create primitive_impl::build_kernels (#15683)
* [GPU] Fix unit test build error (#15683)
* [GPU] Remove redundant code (#15683)
- Remove try catch for debug
- Call compilation_context.cancel() in destructor of network
* [GPU] Combine two atomic counters in kernels_cache (#15683)
* [GPU] Follow-up code review (#15683)
* [GPU] Fix nullptr exception in unit test (#15683)
* [GPU] Follow-up code review (#15683)
- Modify mutex lock in compilation context
* [GPU] Fix windows build issue (#15683)
* use kernel caching for dynamic models
* replaced cl_cache with blob
* updated to serialize dims info of input and output
* updated to skip unicode tests on Windows
* Fix "C++ exception with description 'write lock_type' thrown in the test body" failures:
Use get_output_values_to_float() for the following tests:
* fusings_gpu/gemm_2in_act_scale_quantize_eltwise_i8.basic/2
* fusings_gpu/gemm_2in_act_scale_eltwise.basic/2
* Remove WA test code from "[GPU][DG2] Fix fusings_gpu/gemm_2in_scale.basic/7" (#15353)
* Non full-tensor post-ops are now broadcast
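An illustrative C++ sketch of what broadcasting a non full-tensor post-op means here (assumed semantics, not the oneDNN post-op API): a per-channel scale, conceptually of shape [1, C, 1, 1], is applied across the full [N, C, H, W] output.

```cpp
#include <cstddef>
#include <vector>

// Multiplies an N*C*H*W output by a per-channel scale of C elements,
// i.e. the non full-tensor post-op is broadcast over the N, H and W dimensions.
void apply_per_channel_scale(std::vector<float>& out,
                             const std::vector<float>& scale,   // C elements
                             size_t N, size_t C, size_t H, size_t W) {
    for (size_t n = 0; n < N; ++n)
        for (size_t c = 0; c < C; ++c)
            for (size_t hw = 0; hw < H * W; ++hw)
                out[(n * C + c) * H * W + hw] *= scale[c];
}
```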