feature: enabling oneDPL and sort primitive refactoring #3046
Before merging, please remember to add this new dependency to the installation instructions in
/intelci: run
/azp run CI
Azure Pipelines failed to run 1 pipeline(s).
@@ -23,6 +23,7 @@ Required Software:
* BLAS and LAPACK libraries - both provided by oneMKL
* Python version 3.9 or higher
* TBB library (repository contains script to download it)
* oneDPL library
@Alexandr-Solovev Please also remember to update the conda instructions that appear towards the end of this file. The conda package name should be `onedpl-devel`, and the list of environment variables needs to be updated to include `DPL_ROOT`.
/intelci: run
@icfaust currently it works only with custom ci branch:
@@ -113,9 +114,25 @@ is available as an alternative to the manual setup.

./dev/download_tbb.sh

6. Download and install Python (version 3.9 or higher).
6. Set up Intel(R) Threading Building Blocks (Intel(R) TBB):
6. Set up Intel(R) Threading Building Blocks (Intel(R) TBB):
6. Set up Intel(R) OneDPL
sycl::buffer<std::int64_t, 1> num_buf{
    &sum_result,
    sycl::range<1>(1)
}; // Create buffer with a single element

const sycl::nd_range<1> nd_range =
    bk::make_multiple_nd_range_1d(ctx.selected_row_total_count_, 1);

queue_
    .submit([&](sycl::handler& h) {
        // Create an accessor for the buffer
        sycl::accessor<std::int64_t,
                       1,
                       sycl::access::mode::read_write,
                       sycl::access::target::device>
            acc(num_buf, h);
Why are buffer-accessor APIs used here instead of USM?
auto event = oneapi::dpl::experimental::kt::gpu::esimd::radix_sort_by_key<true, 8>(
    queue,
    val_in.get_mutable_data(),
    val_in.get_mutable_data() + val_in.get_count(),
    ind_in.get_mutable_data(),
    dpl::experimental::kt::kernel_param<256, 32>{});
The magic values 8, 256, and 32 should at least be explained here.
Also, it might be beneficial not to hardcode them, but let the algorithmic kernel define them. I think different values can be chosen for different algorithms, or for different hardware platforms for better performance.
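To illustrate the suggestion, here is a minimal CPU-only sketch of a key-value radix sort whose tuning values come from a named traits struct rather than from bare literals. It is not the oneDPL kernel and not code from this PR: `radix_sort_tuning`, `narrow_tuning`, and `radix_sort_by_key_cpu` are hypothetical names, and the meaning attached to each constant (radix bits, data per work item, work-group size) is an assumption about what 8, 256, and 32 stand for in the reviewed call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tuning struct: gives names to the three literals from the
// reviewed call. The meaning of each value is an assumption, not confirmed
// by the PR.
struct radix_sort_tuning {
    static constexpr std::uint32_t radix_bits = 8;           // key bits consumed per pass
    static constexpr std::uint32_t data_per_work_item = 256; // GPU-only knob, unused below
    static constexpr std::uint32_t work_group_size = 32;     // GPU-only knob, unused below
};

// A second hypothetical configuration that another algorithm or another
// hardware platform could select instead of the default.
struct narrow_tuning {
    static constexpr std::uint32_t radix_bits = 4;
    static constexpr std::uint32_t data_per_work_item = 64;
    static constexpr std::uint32_t work_group_size = 16;
};

// Stable LSD radix sort of 32-bit keys that carries the values along.
// Only Tuning::radix_bits affects this CPU sketch.
template <typename Tuning = radix_sort_tuning>
void radix_sort_by_key_cpu(std::vector<std::uint32_t>& keys,
                           std::vector<std::int64_t>& vals) {
    constexpr std::uint32_t bits = Tuning::radix_bits;
    static_assert(bits >= 1 && bits <= 16, "radix_bits must be in [1, 16]");
    constexpr std::size_t buckets = std::size_t{ 1 } << bits;
    constexpr std::uint32_t mask = static_cast<std::uint32_t>(buckets - 1);

    std::vector<std::uint32_t> keys_tmp(keys.size());
    std::vector<std::int64_t> vals_tmp(vals.size());

    for (std::uint32_t shift = 0; shift < 32; shift += bits) {
        // Histogram of the current digit, stored shifted by one slot.
        std::vector<std::size_t> offsets(buckets + 1, 0);
        for (std::uint32_t k : keys) {
            ++offsets[((k >> shift) & mask) + 1];
        }
        // Prefix sum: offsets[d] becomes the first output slot of digit d.
        for (std::size_t b = 0; b < buckets; ++b) {
            offsets[b + 1] += offsets[b];
        }
        // Stable scatter of keys and their values.
        for (std::size_t i = 0; i < keys.size(); ++i) {
            const std::size_t dst = offsets[(keys[i] >> shift) & mask]++;
            keys_tmp[dst] = keys[i];
            vals_tmp[dst] = vals[i];
        }
        keys.swap(keys_tmp);
        vals.swap(vals_tmp);
    }
}
```

A caller would pick the configuration at the call site, for example `radix_sort_by_key_cpu<narrow_tuning>(keys, vals)`, so that each algorithmic kernel documents and owns its own constants.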
const auto col_count = val_in.get_dimension(1);
sycl::event radix_sort_event;

for (std::int64_t row = 0; row < row_count; ++row) {
It looks like the previous implementation can be more performant for matrices with a large number of rows and a small number of columns.
I would not delete it. Instead, please provide some performance considerations that explain when the old implementation is preferable and when the new one is.
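One way to keep both implementations is a small dispatch helper that chooses between them based on the matrix shape. The sketch below only illustrates that idea: `sort_strategy`, `dispatch_thresholds`, and `choose_sort_strategy` are hypothetical names, and the threshold values are placeholders that have not been measured. Real values would have to come from benchmarks on the target hardware.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical labels for the two code paths discussed in this thread.
enum class sort_strategy {
    batched_rows, // previous implementation: sorts all rows in one launch
    per_row_loop  // new implementation: one library sort call per row
};

// Placeholder thresholds. These numbers are NOT measured; they only make
// the sketch concrete.
struct dispatch_thresholds {
    std::int64_t min_rows_for_batched = 1024;
    std::int64_t max_cols_for_batched = 256;
};

// Prefer the batched path for many short rows, where the overhead of one
// launch per row is likely to dominate; otherwise use the per-row path.
inline sort_strategy choose_sort_strategy(std::int64_t row_count,
                                          std::int64_t col_count,
                                          dispatch_thresholds t = {}) {
    const bool many_short_rows = row_count >= t.min_rows_for_batched &&
                                 col_count <= t.max_cols_for_batched;
    return many_short_rows ? sort_strategy::batched_rows
                           : sort_strategy::per_row_loop;
}
```

Keeping the thresholds in a struct makes it possible to tune them per device without touching the dispatch logic.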
Description:
Feature: enabling oneDPL and sort primitive refactoring
Summary:
This PR enables oneDPL and replaces the radix sort primitive.
PR completeness and readability
Testing
Performance