Is SYCL-implemented kernel able to utilize SSE/AVX on CPU? #9547
Replies: 1 comment
Hi @ghostplant, the idea of SYCL is to provide high-level abstractions over low-level HW details. For your example of operating on vectors of data with the corresponding SSE/AVX instructions, the following language mechanisms are available to you and to the implementation for getting efficient code:
In our implementation, we don't perform any vectorizing/widening optimizations at the SYCL compiler level; we offload that to device-specific backend compilers. So the question is really whether the device-specific backend compiler is able to vectorize the input generated by the SYCL compiler. The answer is yes, that is generally done. The actual level of optimization and vectorization depends on the particular backend compiler and its version. For example, for OpenCL CPU RT we have the following documentation available.
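As a rough illustration of the point above, here is a minimal sketch of a scalar, one-element-per-work-item kernel with no intrinsics, which is the form a backend compiler would be left to vectorize. `add_vectors` is a hypothetical helper, not something from the linked sample. The SYCL path assumes a SYCL 2020 compiler that predefines `SYCL_LANGUAGE_VERSION`; otherwise the block falls back to an equivalent plain C++ loop. Whether SSE/AVX instructions are actually emitted still depends on the backend compiler and its version.

```cpp
#include <cstddef>
#include <vector>

#ifdef SYCL_LANGUAGE_VERSION  // assumption: predefined by a SYCL 2020 compiler
#include <sycl/sycl.hpp>
#endif

// Hypothetical helper (not from the thread): element-wise z = x + y.
// Assumes x and y have the same length.
std::vector<int> add_vectors(const std::vector<int>& x,
                             const std::vector<int>& y) {
    std::vector<int> z(x.size());
    if (x.empty()) return z;
#ifdef SYCL_LANGUAGE_VERSION
    const sycl::range<1> n(x.size());
    sycl::queue q;  // default device selection
    {
        sycl::buffer<int, 1> bx(x.data(), n);
        sycl::buffer<int, 1> by(y.data(), n);
        sycl::buffer<int, 1> bz(z.data(), n);
        q.submit([&](sycl::handler& h) {
            sycl::accessor ax(bx, h, sycl::read_only);
            sycl::accessor ay(by, h, sycl::read_only);
            sycl::accessor az(bz, h, sycl::write_only, sycl::no_init);
            // Scalar body per work-item; any widening to SIMD lanes is
            // left to the device-specific backend compiler.
            h.parallel_for(n, [=](sycl::id<1> i) { az[i] = ax[i] + ay[i]; });
        });
    }  // buffers are destroyed here, which copies the results back into z
#else
    // Fallback without SYCL: the same scalar body as a plain loop.
    for (std::size_t i = 0; i < x.size(); ++i) z[i] = x[i] + y[i];
#endif
    return z;
}
```

The kernel body is deliberately the plain `z[i] = x[i] + y[i]` form with no manual unrolling, since unrolling by hand should not be needed for a backend that vectorizes across work-items.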
A SYCL kernel doesn't allow non-standard instructions or functions. Is there a way to let a SYCL kernel use SSE/AVX, and if so, how? Let's look at this example: https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/C%2B%2BSYCL/ParallelPatterns/PrefixSum/src/PrefixSum.cpp#L76-L80
Assume I am writing SYCL kernel code like that in the link above. When I want to do a 256-bit int32 add, writing _mm_adds_epi16(xx) is not allowed there, and the compiler reports that the instruction is not allowed by SYCL IR. Is there an alternative way of coding in SYCL kernel bodies that enables such SSE-like efficient adds?
Is this something current SYCL doesn't support? Or should I write "z[i] = x[i] + y[i]; z[i + 1] = x[i + 1] + y[i + 1]; z[i + 2] = x[i + 2] + y[i + 2]; z[i + 3] = x[i + 3] + y[i + 3];" so that the oneAPI compiler can turn it into SSE-optimized code for the CPU?