Is SYCL-implemented kernel able to utilize SSE/AVX on CPU? #9547

ghostplant · 2023-04-23T01:35:31Z

SYCL kernels don't allow non-standard instructions or functions. Is there a way to let a SYCL kernel use SSE/AVX, and if so, how? Consider this example: https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/C%2B%2BSYCL/ParallelPatterns/PrefixSum/src/PrefixSum.cpp#L76-L80

Suppose I am writing the SYCL kernel code in the link above and want to do a 256-bit int32 add. Writing _mm_adds_epi16(xx) is not allowed there; the compiler reports that the instruction is not allowed by SYCL IR. Is there an alternative way of writing SYCL kernel bodies that enables this kind of SSE-like efficient addition?

Is this simply not supported by current SYCL? Or should I write "z[i] = x[i] + y[i]; z[i + 1] = x[i + 1] + y[i + 1]; z[i + 2] = x[i + 2] + y[i + 2]; z[i + 3] = x[i + 3] + y[i + 3];" so that the oneAPI compiler can turn it into SSE-optimized code for the CPU?

AlexeySachkov (Collaborator) · 2023-04-27T18:11:25Z

Hi @ghostplant,

The idea of SYCL is to provide high-level abstractions over low-level HW details. For your example of operating on vectors of data with the corresponding SSE/AVX instructions, the following language mechanisms are available, both to you and to the implementation, for getting efficient code:

- SYCL kernel execution model: essentially, you write your kernel for a single work-item and then submit a bunch of them to be executed. That ND-range model allows the underlying implementation to try to vectorize by merging several work-items together into a single SIMD instruction. Sub-groups (3.9.4. Work-group data parallel kernels) are often mapped to SIMD lanes by implementations.
- The vec class is specifically designed to be a high-level abstraction over low-level vector data types.
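
To make the vec point concrete, here is a minimal sketch (not taken from the linked sample) of a 256-bit-wide int32 add expressed with sycl::vec<int, 8>. The function name add_i32, the buffer/accessor setup, and the requirement that the length is a multiple of 8 are illustrative assumptions. The non-SYCL branch is only a stand-in so the snippet still builds with a plain C++ compiler. Whether the vec addition ends up as a single AVX instruction is up to the backend compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>

// 8 x 32-bit int = 256 bits, the width of one AVX2 register.
using int8v = sycl::vec<int, 8>;

// Hypothetical helper: element-wise z = x + y, eight ints per work-item.
std::vector<int> add_i32(const std::vector<int>& x, const std::vector<int>& y) {
    assert(x.size() == y.size() && x.size() % 8 == 0);
    const std::size_t n = x.size();
    std::vector<int> z(n);
    sycl::queue q;  // default device; a CPU device is assumed for SSE/AVX
    {
        sycl::buffer<int, 1> bx(x.data(), sycl::range<1>(n));
        sycl::buffer<int, 1> by(y.data(), sycl::range<1>(n));
        sycl::buffer<int, 1> bz(z.data(), sycl::range<1>(n));
        q.submit([&](sycl::handler& h) {
            sycl::accessor ax(bx, h, sycl::read_only);
            sycl::accessor ay(by, h, sycl::read_only);
            sycl::accessor az(bz, h, sycl::write_only, sycl::no_init);
            h.parallel_for(sycl::range<1>(n / 8), [=](sycl::id<1> idx) {
                const std::size_t base = idx[0] * 8;
                int8v a, b;
                for (int k = 0; k < 8; ++k) {
                    a[k] = ax[base + k];
                    b[k] = ay[base + k];
                }
                const int8v c = a + b;  // one vector add, no intrinsics
                for (int k = 0; k < 8; ++k) az[base + k] = c[k];
            });
        });
    }  // buffers are destroyed here, which copies the result back into z
    return z;
}
#else
// Stand-in used only when no SYCL headers are available: same result,
// computed with a plain scalar loop on the host.
std::vector<int> add_i32(const std::vector<int>& x, const std::vector<int>& y) {
    assert(x.size() == y.size() && x.size() % 8 == 0);
    std::vector<int> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) z[i] = x[i] + y[i];
    return z;
}
#endif
```

Calling add_i32 on two 16-element vectors launches two work-items in the SYCL branch, each performing one 8-lane addition. The kernel body contains no _mm_* intrinsics, which is the property the question asks for.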

In our implementation, we don't perform any vectorizing/widening optimizations at the SYCL compiler level; we offload that to device-specific backend compilers. So the question is actually whether the device-specific backend compiler is able to vectorize the input generated by the SYCL compiler. The answer is yes, it generally does. The actual level of optimization and vectorization depends on the particular backend compiler and its version.
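
As a rough mental model of what "merging work-items" means, the plain C++ sketch below separates the per-work-item body from a loop that runs it over a range. The names work_item and run_range are hypothetical, and the loop is only a conceptual stand-in. It is not how any particular backend or runtime is actually implemented.

```cpp
#include <cstddef>

// Kernel body as the user writes it: one work-item handles one element.
inline void work_item(std::size_t i, const int* x, const int* y, int* z) {
    z[i] = x[i] + y[i];
}

// Conceptual stand-in for executing a 1-D range of n work-items on a CPU.
// Once work_item is inlined, this is an ordinary element-wise loop, the
// kind of code an optimizing compiler can often turn into SIMD adds that
// process several consecutive work-items per instruction.
void run_range(std::size_t n, const int* x, const int* y, int* z) {
    for (std::size_t i = 0; i < n; ++i) {
        work_item(i, x, y, z);
    }
}
```

Under this model there is no need to hand-unroll the body into z[i], z[i + 1], z[i + 2], z[i + 3]; the scalar form already exposes the parallelism, and how wide the generated SIMD code is depends on the compiler, its version, and the target flags.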

For example, for OpenCL CPU RT we have the following documentation available.
