
xe: conv_v2: enable Stream-K kernels #2345

Open · wants to merge 14 commits into main
Conversation

echeresh (Contributor) commented Jan 7, 2025

Jira: MFDNN-11721

This PR updates the performance modeling and benchmarking logic to handle Stream-K, and also updates the kernel registry to use the new Stream-K kernels. More details to be added later.

ResNet-50 performance data on PVC is below. A few comments:

  • Excluded cases: the first convolution and strided backward-by-data
    • The first convolution is currently very slow; it needs a special tiling approach and possibly a reorder (to be covered in the next month)
    • Strided backward-by-data is not supported; it needs dynamic filter indexing (also to be covered soon)
  • Small-batch forward performance is low and needs a deeper look. Expecting ~80% for the current level of optimizations (currently only at 50-60%)
  • Large-batch performance is relatively good, though this is mostly a baseline; more optimization work is underway
| Propagation | Batch size | Data type | # of layers | Weighted ratio (v2 to v1; > 1 means v2 is faster) |
|---|---|---|---|---|
| FWD_I | 1 | s8 | 22 | 0.47 |
| FWD_I | 1 | bf16 | 22 | 0.47 |
| FWD_I | 1 | f32 | 22 | 0.60 |
| FWD_I | 32 | s8 | 22 | 0.92 |
| FWD_I | 32 | bf16 | 22 | 0.85 |
| FWD_I | 32 | f32 | 22 | 0.92 |
| BWD_D | 32 | bf16 | 16 | 1.35 |
| BWD_D | 32 | f32 | 16 | 0.96 |
| BWD_W | 32 | bf16 | 22 | 0.80 |
| BWD_W | 32 | f32 | 22 | 1.06 |
| FWD_I | 128 | s8 | 22 | 1.00 |
| FWD_I | 128 | bf16 | 22 | 0.88 |
| FWD_I | 128 | f32 | 22 | 0.93 |
| BWD_D | 128 | bf16 | 16 | 0.89 |
| BWD_D | 128 | f32 | 16 | 0.95 |
| BWD_W | 128 | bf16 | 22 | 0.72 |
| BWD_W | 128 | f32 | 22 | 0.88 |
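For readers unfamiliar with the technique, the core idea of Stream-K is to split the total k-iteration space of all output tiles evenly across a fixed number of workgroups, so a workgroup's slice may span tile boundaries (partial tiles are then combined in a fixup step). A minimal host-side sketch of that partitioning, with illustrative names that are not oneDNN's actual implementation:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of Stream-K work partitioning. The full iteration
// space is (num_tiles * k_iters_per_tile) k-iterations; each of the
// `num_wgs` workgroups receives a contiguous slice of nearly equal size.
struct sk_slice_t {
    int64_t iter_beg; // first global k-iteration (inclusive)
    int64_t iter_end; // last global k-iteration (exclusive)
};

std::vector<sk_slice_t> stream_k_partition(
        int64_t num_tiles, int64_t k_iters_per_tile, int64_t num_wgs) {
    int64_t total = num_tiles * k_iters_per_tile;
    std::vector<sk_slice_t> slices;
    for (int64_t wg = 0; wg < num_wgs; wg++) {
        // Even split: slice boundaries are computed so adjacent slices
        // are contiguous and sizes differ by at most one iteration.
        int64_t beg = wg * total / num_wgs;
        int64_t end = (wg + 1) * total / num_wgs;
        slices.push_back({beg, end});
    }
    return slices;
}
```

Because slices are contiguous, a workgroup whose slice starts or ends mid-tile only computes a partial accumulation for that tile, which is what eliminates the load imbalance of classic tile-per-workgroup scheduling.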

@echeresh echeresh added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Jan 7, 2025
@echeresh echeresh requested review from a team as code owners January 7, 2025 01:54
@github-actions github-actions bot added the documentation A request to change/fix/improve the documentation. Codeowner: @oneapi-src/onednn-doc label Jan 7, 2025
Base automatically changed from echeresh/streamk to main January 7, 2025 22:03
@echeresh echeresh force-pushed the echeresh/echeresh/streamk-enable branch 2 times, most recently from 04c449b to acce223 Compare January 7, 2025 22:13
@echeresh echeresh force-pushed the echeresh/echeresh/streamk-enable branch from acce223 to f766ccd Compare January 7, 2025 23:00
@echeresh echeresh requested a review from a team as a code owner January 7, 2025 23:00
@github-actions github-actions bot added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Jan 7, 2025
echeresh (Contributor, Author) commented Jan 7, 2025

make test
disable device_cpu
enable device_gpu
disable benchdnn_all
enable benchdnn_nightly
enable benchdnn_conv
enable benchdnn_deconv
enable benchdnn_reorder
enable benchdnn_sum
enable arch_xe2-lpg
enable arch_xe-hpg
enable arch_xe-hpc

```diff
 oss << std::uppercase << std::hex << std::setw(2) << std::setfill('0')
-        << (int)d;
+        << (int)v;
```

nit: into<int>(v)

This conversion is perfectly safe; the suggestion is largely to reduce noise when searching for conversion issues. On the other hand, all unsafe conversions should go through into<T> to enable runtime validation in debug builds.
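For context, a self-contained version of the formatting in question (the cast to int is needed because streaming an unsigned char directly would print a character rather than a number); the helper name is illustrative:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format a byte as a two-digit uppercase hex string, mirroring the
// stream manipulators used in the diff above.
std::string hex_byte(unsigned char v) {
    std::ostringstream oss;
    oss << std::uppercase << std::hex << std::setw(2) << std::setfill('0')
        << (int)v; // without the cast, the char itself would be printed
    return oss.str();
}
```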

@@ -239,6 +239,8 @@ struct deserializer_t {
}
}

bool empty() const { return idx >= s.get_data().size(); }

Suggested change
bool empty() const { return idx >= s.get_data().size(); }
bool empty() const { return s.get_data().empty(); }
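Note that the two predicates are not equivalent once some bytes have been consumed; whether that matters depends on where empty() is called. A minimal model of the difference (field and type names hypothetical, not oneDNN's actual code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal model of the deserializer: `idx` is the read
// cursor into the serialized byte stream.
struct mini_deserializer_t {
    std::vector<unsigned char> data;
    size_t idx = 0;

    // Original predicate: true once the cursor has consumed all bytes.
    bool consumed() const { return idx >= data.size(); }
    // Suggested predicate: true only if the stream holds no bytes at all.
    bool no_data() const { return data.empty(); }
};
```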

@@ -614,6 +614,7 @@ bench_data_t bench(const bench_manager_t &bench_mger,

bool try_create(
const bench_manager_t &bench_mger, const kernel_desc_t &kernel_desc) {
clear_primitive_cache();

Would it be reasonable to just set the primitive cache capacity to 0?
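The idea behind the question: with the cache capacity set to 0 (oneDNN exposes this via dnnl::set_primitive_cache_capacity or the DNNL_PRIMITIVE_CACHE_CAPACITY environment variable), insertions never stick, so no explicit clear between benchmark runs would be needed. A toy illustration of that behavior, not oneDNN's actual cache:

```cpp
#include <list>
#include <string>
#include <utility>

// Toy cache sketch: capacity 0 makes insert() a no-op, so stale entries
// can never affect a subsequent benchmark run.
class toy_cache_t {
public:
    explicit toy_cache_t(size_t capacity) : capacity_(capacity) {}

    void insert(const std::string &key, int value) {
        if (capacity_ == 0) return; // capacity 0 disables caching entirely
        entries_.emplace_front(key, value);
        if (entries_.size() > capacity_) entries_.pop_back(); // evict oldest
    }

    bool contains(const std::string &key) const {
        for (auto &e : entries_)
            if (e.first == key) return true;
        return false;
    }

private:
    size_t capacity_;
    std::list<std::pair<std::string, int>> entries_;
};
```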

rjoursler (Contributor) commented:

Thanks for the data comparing to v1. What is the before and after performance improvement from this optimization?
