* [GPU] Added a shape-agnostic optimized SoftMax kernel
Signed-off-by: Andrew Park <andrew.park@intel.com>
* Update the SoftmaxKernelBaseBF::Validate policy for the shape-agnostic kernel
* Add a softmax_gpu_bf shape-agnostic TC for ov_gpu_unit_tests
* Fix failing TCs for ie-tests-linux-ubuntu20-gpu
* Update to use a stack array instead of a global buffer
* Remove global buffer usage completely
* Add #undef directive
---------
* Enable the crop shape-agnostic kernel
* Added unit test
* Added a new scalar argument for crop (eltwise), used as the runtime input offset in the shape-agnostic kernel
* Fix eltwise to apply the runtime offset only for crop
* Fix unit test error
* Applied review comment
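The runtime input offset mentioned above can be illustrated with a small sketch (the function name and the row-major layout assumption are hypothetical, not the actual OpenVINO code): the crop's begin coordinate is folded into a single flat element offset, which can then be passed to the shape-agnostic kernel as a scalar argument at enqueue time, once the real input shape is known.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: compute the flat element offset of a crop's begin
// coordinate inside a row-major input tensor. In a shape-agnostic kernel
// this value would be supplied as a scalar kernel argument at runtime
// rather than baked into the compiled kernel.
size_t crop_runtime_offset(const std::vector<size_t>& input_dims,
                           const std::vector<size_t>& crop_begin) {
    size_t offset = 0;
    size_t stride = 1;
    // Walk dims from innermost to outermost, accumulating begin * stride.
    for (size_t i = input_dims.size(); i-- > 0;) {
        offset += crop_begin[i] * stride;
        stride *= input_dims[i];
    }
    return offset;
}
```

Because the offset is a plain scalar, the same compiled kernel binary serves every input shape; only this argument changes between enqueues.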
* [GPU] Fix output format not changing at runtime
* Add remove_redundant_reorders pass TC for ov_gpu_unit_tests
---------
* [GPU] Apply multi-threading to the async compilation context (#15683)
- Use CPUStreamExecutor in compilation context
- Use a single compilation context, impl_cache, and kernels_cache for multiple streams
- Move compilation context to cldnn::program
- Move impl_cache to cldnn::program
- Create thread-safe impl_cache
- Create thread independent compilation function in kernels_cache
- Use kernels_cache in program and remove it from network
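A minimal sketch of the idea behind the points above, assuming a generic async executor rather than the actual CPUStreamExecutor API: kernel builds are pushed as tasks onto background threads that all streams share, instead of each stream compiling synchronously.

```cpp
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Simplified illustration (not the real OpenVINO class): an async
// compilation context that offloads kernel-build tasks to background
// threads. One instance lives in the program and is shared by streams.
struct compilation_context {
    std::vector<std::future<void>> tasks;

    // Queue a build task to run asynchronously.
    void push_task(std::function<void()> fn) {
        tasks.emplace_back(std::async(std::launch::async, std::move(fn)));
    }

    // Block until every queued build has finished.
    void wait_all() {
        for (auto& t : tasks) t.get();
        tasks.clear();
    }
};
```

In the real design the executor also supports cancellation (see the `compilation_context.cancel()` call in the network destructor below); this sketch only shows the submit/wait shape.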
* [GPU] Fix segfault: ocl_engine and ocl_device were released while remaining compilation context tasks were still running (#15683)
- The compilation context has its own CPUStreamExecutor
* [GPU] Follow-up to code review (#15683)
- LruCacheThreadSafe inherits from LruCache
- FuncRemoveItem takes std::pair<Key,Value> as input
- Change prepare_tools to init_program
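The cache hierarchy described above can be sketched roughly as follows (a simplified illustration, not the actual implementation): a plain LruCache base class, plus a derived LruCacheThreadSafe that guards each operation with a mutex so one impl_cache can serve many streams.

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <unordered_map>
#include <utility>

// Base LRU cache: most-recently-used keys live at the front of order_.
template <typename K, typename V>
class LruCache {
public:
    explicit LruCache(size_t cap) : capacity_(cap) {}
    virtual ~LruCache() = default;

    virtual void add(const K& k, const V& v) {
        auto it = map_.find(k);
        if (it != map_.end()) order_.erase(it->second.second);
        order_.push_front(k);
        map_[k] = {v, order_.begin()};
        if (map_.size() > capacity_) {          // evict least recently used
            map_.erase(order_.back());
            order_.pop_back();
        }
    }

    virtual bool try_get(const K& k, V& out) {
        auto it = map_.find(k);
        if (it == map_.end()) return false;
        // Move the hit entry to the front (mark as most recently used).
        order_.splice(order_.begin(), order_, it->second.second);
        out = it->second.first;
        return true;
    }

private:
    size_t capacity_;
    std::list<K> order_;
    std::unordered_map<K, std::pair<V, typename std::list<K>::iterator>> map_;
};

// Thread-safe variant: same interface, every call holds a mutex.
template <typename K, typename V>
class LruCacheThreadSafe : public LruCache<K, V> {
public:
    using LruCache<K, V>::LruCache;
    void add(const K& k, const V& v) override {
        std::lock_guard<std::mutex> g(m_);
        LruCache<K, V>::add(k, v);
    }
    bool try_get(const K& k, V& out) override {
        std::lock_guard<std::mutex> g(m_);
        return LruCache<K, V>::try_get(k, out);
    }
private:
    std::mutex m_;
};
```

Inheriting rather than duplicating the cache keeps the eviction logic in one place; only the locking is layered on top.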
* [GPU] Create primitive_impl::build_kernels (#15683)
* [GPU] Fix unit test build error (#15683)
* [GPU] Remove redundant code (#15683)
- Remove try/catch blocks used for debugging
- Call compilation_context.cancel() in destructor of network
* [GPU] Combine two atomic counters in kernels_cache (#15683)
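One possible way to combine two counters into a single atomic, as the commit above describes (the exact scheme in the real kernels_cache may differ): pack both 32-bit counts into one 64-bit word, so the pair can be updated and observed consistently without a lock.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical packing scheme: pending builds in the upper 32 bits,
// finished builds in the lower 32 bits, one atomic word for both.
std::atomic<uint64_t> packed_counters{0};

void add_pending()  { packed_counters.fetch_add(1ull << 32); }
void add_finished() { packed_counters.fetch_add(1ull); }

uint32_t pending_count() {
    return static_cast<uint32_t>(packed_counters.load() >> 32);
}
uint32_t finished_count() {
    return static_cast<uint32_t>(packed_counters.load() & 0xFFFFFFFFu);
}
```

A single load yields a mutually consistent snapshot of both counts, which two separate atomics cannot guarantee.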
* [GPU] Follow-up to code review (#15683)
* [GPU] Fix nullptr exception in unit test (#15683)
* [GPU] Follow-up to code review (#15683)
- Modify mutex lock in compilation context
* [GPU] Fix windows build issue (#15683)
* Use kernel caching for dynamic models
* Replaced cl_cache with blob
* Updated to serialize the dims info of input and output
* Updated to skip unicode tests on Windows
* Fix `C++ exception with description "write lock_type" thrown in the test body`: use get_output_values_to_float()
* fusings_gpu/gemm_2in_act_scale_quantize_eltwise_i8.basic/2
* fusings_gpu/gemm_2in_act_scale_eltwise.basic/2
* Remove the WA test code from "[GPU][DG2] Fix fusings_gpu/gemm_2in_scale.basic/7" (#15353)
* Non-full-tensor post-ops are now broadcast
* Fix remote blob creation to use original shape
* Revert "Fix remote blob creation to use original shape"
This reverts commit 35c674aa97.
* Fix the cldnn-tensor-adjusted blob to be reinterpreted with the actual input layout
* GPU model caching unit tests
* Added serialization unit tests
* Added save and load for quantize primitive_inst
* Reduced the range of inputs for Gemm tests
* Updated the copyright year
* [GPU] Fix a bug in permute optimization
For int8 models, if there is a FakeQuantize between permute and convolution, an operation such as a data type cast can be fused into the permute. In that case, do not optimize out the permute.
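The guard described above can be sketched as a simple predicate (names and the struct are hypothetical, not the actual pass code): a permute may only be optimized out when none of its fused operations changes the element data type.

```cpp
#include <vector>

// Hypothetical model of a fused op: we only track whether it changes
// the element data type (e.g. a cast fused in from a FakeQuantize).
struct fused_op {
    bool changes_dtype;
};

// A permute with a type-changing fused op must be kept, because
// removing it would drop the fused cast along with the reorder.
bool can_optimize_permute(const std::vector<fused_op>& fused) {
    for (const auto& op : fused)
        if (op.changes_dtype) return false;
    return true;
}
```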