
xe: conv_v2: enable Stream-K kernels #2345

Open · wants to merge 14 commits into main
Conversation

echeresh (Contributor) commented Jan 7, 2025

Jira: MFDNN-11721

This PR updates the performance modeling and benchmarking logic to handle Stream-K, and also updates the kernel registry to use the new Stream-K kernels. More details to be added later.

ResNet-50 performance data on PVC is below. A few comments:

  • Excluded cases: the first convolution and strided backward-by-data
    • The first convolution is currently very slow; it needs a special tiling approach and possibly a reorder (to be covered in the next month)
    • Strided backward-by-data is not supported; it needs dynamic filter indexing (also to be covered soon)
  • Small-batch forward performance is low and needs a deeper look. Expecting ~80% for the current level of optimizations (currently only at 50-60%)
  • Large-batch performance is relatively good, though this is mostly a baseline; more optimization work is underway
| Propagation | Batch size | Data type | # of layers | Weighted ratio (v2 to v1; > 1 means v2 is faster) |
|---|---|---|---|---|
| FWD_I | 1 | s8 | 22 | 0.47 |
| FWD_I | 1 | bf16 | 22 | 0.47 |
| FWD_I | 1 | f32 | 22 | 0.60 |
| FWD_I | 32 | s8 | 22 | 0.92 |
| FWD_I | 32 | bf16 | 22 | 0.85 |
| FWD_I | 32 | f32 | 22 | 0.92 |
| BWD_D | 32 | bf16 | 16 | 1.35 |
| BWD_D | 32 | f32 | 16 | 0.96 |
| BWD_W | 32 | bf16 | 22 | 0.80 |
| BWD_W | 32 | f32 | 22 | 1.06 |
| FWD_I | 128 | s8 | 22 | 1.00 |
| FWD_I | 128 | bf16 | 22 | 0.88 |
| FWD_I | 128 | f32 | 22 | 0.93 |
| BWD_D | 128 | bf16 | 16 | 0.89 |
| BWD_D | 128 | f32 | 16 | 0.95 |
| BWD_W | 128 | bf16 | 22 | 0.72 |
| BWD_W | 128 | f32 | 22 | 0.88 |
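For readers unfamiliar with the technique, the core idea of Stream-K is to split the total k-iteration space of all output tiles evenly across a fixed number of workgroups, so a workgroup's slice may span tile boundaries (partial tiles are then combined in a fixup step). A minimal host-side sketch of that partitioning, with illustrative names that are not oneDNN's actual implementation:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of Stream-K work partitioning. The full iteration
// space is (num_tiles * k_iters_per_tile) k-iterations; each of the
// `num_wgs` workgroups receives a contiguous slice of nearly equal size.
struct sk_slice_t {
    int64_t iter_beg; // first global k-iteration (inclusive)
    int64_t iter_end; // last global k-iteration (exclusive)
};

std::vector<sk_slice_t> stream_k_partition(
        int64_t num_tiles, int64_t k_iters_per_tile, int64_t num_wgs) {
    int64_t total = num_tiles * k_iters_per_tile;
    std::vector<sk_slice_t> slices;
    for (int64_t wg = 0; wg < num_wgs; wg++) {
        // Even split: slice boundaries are computed so adjacent slices
        // are contiguous and sizes differ by at most one iteration.
        int64_t beg = wg * total / num_wgs;
        int64_t end = (wg + 1) * total / num_wgs;
        slices.push_back({beg, end});
    }
    return slices;
}
```

Because slices are contiguous, a workgroup whose slice starts or ends mid-tile only computes a partial accumulation for that tile, which is what eliminates the load imbalance of classic tile-per-workgroup scheduling.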

@echeresh echeresh added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Jan 7, 2025
@echeresh echeresh requested review from a team as code owners January 7, 2025 01:54
@github-actions github-actions bot added the documentation A request to change/fix/improve the documentation. Codeowner: @oneapi-src/onednn-doc label Jan 7, 2025
Base automatically changed from echeresh/streamk to main January 7, 2025 22:03
@echeresh echeresh force-pushed the echeresh/echeresh/streamk-enable branch 2 times, most recently from 04c449b to acce223 Compare January 7, 2025 22:13
@echeresh echeresh force-pushed the echeresh/echeresh/streamk-enable branch from acce223 to f766ccd Compare January 7, 2025 23:00
@echeresh echeresh requested a review from a team as a code owner January 7, 2025 23:00
@github-actions github-actions bot added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Jan 7, 2025
echeresh (Contributor, Author) commented Jan 7, 2025

make test
disable device_cpu
enable device_gpu
disable benchdnn_all
enable benchdnn_nightly
enable benchdnn_conv
enable benchdnn_deconv
enable benchdnn_reorder
enable benchdnn_sum
enable arch_xe2-lpg
enable arch_xe-hpg
enable arch_xe-hpc

```diff
 oss << std::uppercase << std::hex << std::setw(2) << std::setfill('0')
-        << (int)d;
+        << (int)v;
```

nit: into<int>(v)

This conversion is perfectly safe; the suggestion is largely to reduce noise when searching for conversion issues. On the other hand, all unsafe conversions should go through into<T> to enable runtime validation in debug builds.
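For context, a self-contained version of the formatting in question (the cast to int is needed because streaming an unsigned char directly would print a character rather than a number); the helper name is illustrative:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format a byte as a two-digit uppercase hex string, mirroring the
// stream manipulators used in the diff above.
std::string hex_byte(unsigned char v) {
    std::ostringstream oss;
    oss << std::uppercase << std::hex << std::setw(2) << std::setfill('0')
        << (int)v; // without the cast, the char itself would be printed
    return oss.str();
}
```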

@@ -239,6 +239,8 @@ struct deserializer_t {
}
}

bool empty() const { return idx >= s.get_data().size(); }

Suggested change
bool empty() const { return idx >= s.get_data().size(); }
bool empty() const { return s.get_data().empty(); }
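Note that the two predicates are not equivalent once some bytes have been consumed; whether that matters depends on where empty() is called. A minimal model of the difference (field and type names hypothetical, not oneDNN's actual code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal model of the deserializer: `idx` is the read
// cursor into the serialized byte stream.
struct mini_deserializer_t {
    std::vector<unsigned char> data;
    size_t idx = 0;

    // Original predicate: true once the cursor has consumed all bytes.
    bool consumed() const { return idx >= data.size(); }
    // Suggested predicate: true only if the stream holds no bytes at all.
    bool no_data() const { return data.empty(); }
};
```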

@@ -614,6 +614,7 @@ bench_data_t bench(const bench_manager_t &bench_mger,

bool try_create(
const bench_manager_t &bench_mger, const kernel_desc_t &kernel_desc) {
clear_primitive_cache();

Would it be reasonable to just set the primitive cache capacity to 0?
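The idea behind the question: with the cache capacity set to 0 (oneDNN exposes this via dnnl::set_primitive_cache_capacity or the DNNL_PRIMITIVE_CACHE_CAPACITY environment variable), insertions never stick, so no explicit clear between benchmark runs would be needed. A toy illustration of that behavior, not oneDNN's actual cache:

```cpp
#include <list>
#include <string>
#include <utility>

// Toy cache sketch: capacity 0 makes insert() a no-op, so stale entries
// can never affect a subsequent benchmark run.
class toy_cache_t {
public:
    explicit toy_cache_t(size_t capacity) : capacity_(capacity) {}

    void insert(const std::string &key, int value) {
        if (capacity_ == 0) return; // capacity 0 disables caching entirely
        entries_.emplace_front(key, value);
        if (entries_.size() > capacity_) entries_.pop_back(); // evict oldest
    }

    bool contains(const std::string &key) const {
        for (auto &e : entries_)
            if (e.first == key) return true;
        return false;
    }

private:
    size_t capacity_;
    std::list<std::pair<std::string, int>> entries_;
};
```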

rjoursler (Contributor) commented:

Thanks for the data comparing to v1. What is the before and after performance improvement from this optimization?
