Unified GEMM on Nvidia Tensor Cores, Intel XMX of PVC and DG2, and Intel AMX of SPR using SYCL joint matrix

joint_matrix_bf16_fill_k_cache.cpp:

Portable Optimizations:

cache tiling of i and j
cache tiling on k as well (so no reordering is needed)
data reuse of A and B in physical layer
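
To make the tiling idea concrete, here is a minimal host-side sketch in plain C++, not the SYCL joint matrix kernel itself. It tiles all three loops (i, j and k) so that each block of A and B is reused while it is still in cache. The function name `gemm_tiled` and the tile sizes `TM`, `TN`, `TK` are hypothetical and chosen only for illustration; they are not the tuned values used by this repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes, for illustration only (not the repository's tuned values).
constexpr std::size_t TM = 2, TN = 2, TK = 2;

// Computes C += A * B for row-major A (M x K), B (K x N), C (M x N),
// walking the iteration space tile by tile in i, j and k.
void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t N,
                std::size_t K) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  for (std::size_t i0 = 0; i0 < M; i0 += TM) {
    for (std::size_t j0 = 0; j0 < N; j0 += TN) {
      for (std::size_t k0 = 0; k0 < K; k0 += TK) {
        // Clamp tile ends so sizes that are not multiples of the tile work too.
        const std::size_t i1 = std::min(i0 + TM, M);
        const std::size_t j1 = std::min(j0 + TN, N);
        const std::size_t k1 = std::min(k0 + TK, K);
        for (std::size_t i = i0; i < i1; ++i) {
          for (std::size_t k = k0; k < k1; ++k) {
            const float a = A[i * K + k];  // reused across the whole j tile
            for (std::size_t j = j0; j < j1; ++j) {
              C[i * N + j] += a * B[k * N + j];
            }
          }
        }
      }
    }
  }
}
```

In the actual kernel the same blocking is applied at two cache levels and at the register level, with the joint matrix multiply-add as the innermost operation.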

Specific Optimizations for PVC:

Out-of-bounds checking is enabled for PVC with -DOOB
Prefetch is enabled for PVC with -DPREFETCH

Specific options for AMX and DG2

Both row major and VNNI transform options are supported. For row major, omit -DVNNI
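
For readers unfamiliar with the VNNI layout, the following is a small sketch of the packing for 16-bit elements such as bf16, assuming the usual VNNI factor of 2: each pair of consecutive rows of B is interleaved element by element, so a K x N matrix becomes K/2 rows of 2N elements. The function name `pack_vnni` is hypothetical, and the bf16 values are held as raw `std::uint16_t` bit patterns; this illustrates the layout only and is not the code path used by the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs a row-major K x N matrix of 16-bit elements into VNNI layout
// (factor 2): element (k, n) moves to row k/2, column 2*n + (k % 2).
std::vector<std::uint16_t> pack_vnni(const std::vector<std::uint16_t>& B,
                                     std::size_t K, std::size_t N) {
  constexpr std::size_t vnni = 2;  // two 16-bit values per 32-bit lane
  assert(K % vnni == 0);
  assert(B.size() == K * N);
  std::vector<std::uint16_t> out(K * N);
  for (std::size_t k = 0; k < K; ++k) {
    for (std::size_t n = 0; n < N; ++n) {
      out[(k / vnni) * (N * vnni) + n * vnni + (k % vnni)] = B[k * N + n];
    }
  }
  return out;
}
```

For a 4 x 2 input with rows [0 1], [2 3], [4 5], [6 7], the packed result has the rows [0 2 1 3] and [4 6 5 7].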

Missing optimizations:

no reordering, no SLM for DG2/Nvidia

Important:

For maximum performance, the cache and register blocking parameters differ across Nvidia Tensor Cores, AMX, DPAS on DG2, and DPAS on PVC. See the specific parameters below:

Build Command lines

Nvidia (~70 Tflops) Add -DNVIDIA

2048

icpx -fsycl -fsycl-targets=nvidia_gpu_sm_80 joint_matrix_bf16_fill_k_cache.cpp -DNVIDIA -DMCACHE1=64 -DNCACHE1=64 -DMCACHE2=128 -DNCACHE2=128

4096

icpx -fsycl -fsycl-targets=nvidia_gpu_sm_80 joint_matrix_bf16_fill_k_cache.cpp -DMATRIX_SIZE=4096 -DNVIDIA -DMCACHE1=64 -DNCACHE1=64 -DMCACHE2=128 -DNCACHE2=128

PVC row major (~220 TFlops)

2048

icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DPREFETCH -DOOB

4096

icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DPREFETCH -DOOB -DMATRIX_SIZE=4096

DG2 VNNI (~45 Tflops)

2048

icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DNCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=16 -DVNNI

4096 VNNI

icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DNCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=16 -DMATRIX_SIZE=4096 -DVNNI

SPR VNNI (~60 Tflops)

2048

icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DNCACHE1=32 -DKCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=1024 -DVNNI

4096

icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DNCACHE1=32 -DKCACHE1=32 -DMCACHE2=256 -DNCACHE2=256 -DKCACHE2=1024 -DMATRIX_SIZE=4096 -DVNNI

Execution command lines

To run on Nvidia GPU:

ONEAPI_DEVICE_SELECTOR=cuda:0 ./a.out

To run on Intel GPU:

SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file" ./a.out

To run on CPU:

DPCPP_CPU_NUM_CUS=112 ./a.out
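
To relate a measured run time to the throughput figures quoted above, the sketch below uses the conventional GEMM operation count of 2*M*N*K floating-point operations (one multiply and one add per inner-product term). The helper name `gemm_tflops` is hypothetical, and how the benchmark itself times the kernel is not shown in this README, so treat this as an assumption about the metric rather than the program's own reporting code.

```cpp
#include <cassert>
#include <cstddef>

// Returns the throughput in TFLOPS for an M x N x K GEMM that took
// `seconds` to run, counting 2*M*N*K floating-point operations.
double gemm_tflops(std::size_t M, std::size_t N, std::size_t K,
                   double seconds) {
  assert(seconds > 0.0);
  const double flops = 2.0 * static_cast<double>(M) *
                       static_cast<double>(N) * static_cast<double>(K);
  return flops / seconds / 1e12;
}
```

For example, `gemm_tflops(4096, 4096, 4096, elapsed)` gives the figure for the 4096 configuration, where `elapsed` is the kernel time in seconds.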


dkhaldi/sycl_joint_matrix_kernels
