Unified GEMM on Nvidia Tensor Cores, Intel XMX of PVC and DG2, and Intel AMX of SPR using SYCL joint matrix
- Gives the best performance so far.
- No reordering was needed.
- Blocking is done on the k dimension as well (see the kernel sketch after this list).
- Supports both the row-major and the VNNI-transformed B layouts; for row major, omit -DVNNI (a VNNI packing sketch also follows this list).
- Missing optimizations: no prefetch, no reordering, and no SLM on DG2/Nvidia.
- For maximum performance, the cache- and register-blocking parameters differ between Nvidia Tensor Cores, AMX, and the DPAS units of DG2 vs. PVC.
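A minimal sketch of the kernel structure, assuming the sycl_ext_oneapi_matrix extension API. The tile shape (tM/tN/tK), sub-group size, matrix size, and the two k-blocking constants are illustrative stand-ins for the sample's MCACHE*/NCACHE*/KCACHE* macros, not the sample's actual defaults, and the m/n cache blocking is left out for brevity (each sub-group owns one C tile):

```c++
#include <sycl/sycl.hpp>
#include <cassert>

using namespace sycl;
namespace jm = sycl::ext::oneapi::experimental::matrix;
using bfloat16 = sycl::ext::oneapi::bfloat16;

// Illustrative sizes only; the real sample tunes these per device through
// the -DMCACHE1/-DNCACHE1/-DKCACHE1/... macros shown in the compile lines.
constexpr size_t tM = 8, tN = 16, tK = 16;   // joint_matrix shape (PVC-like)
constexpr size_t KCACHE1 = 32, KCACHE2 = 64; // two-level blocking on k
constexpr size_t SZ = 256;                   // square matrix dimension
constexpr size_t SG_SZ = 16;                 // 16 on PVC; 8 on DG2, 32 on Nvidia

int main() {
  queue q;
  auto *A = malloc_shared<bfloat16>(SZ * SZ, q);
  auto *B = malloc_shared<bfloat16>(SZ * SZ, q); // held in VNNI layout
  auto *C = malloc_shared<float>(SZ * SZ, q);
  for (size_t i = 0; i < SZ * SZ; i++) {
    A[i] = bfloat16(1.0f);
    B[i] = bfloat16(1.0f);
    C[i] = 0.0f;
  }

  // One sub-group per tM x tN tile of C.
  q.parallel_for(
       nd_range<2>{{SZ / tM, SZ / tN * SG_SZ}, {1, SG_SZ}},
       [=](nd_item<2> it) [[sycl::reqd_sub_group_size(SG_SZ)]] {
         auto sg = it.get_sub_group();
         const size_t m = it.get_group(0) * tM; // tile row in C
         const size_t n = it.get_group(1) * tN; // tile column in C
         auto pA = address_space_cast<access::address_space::global_space,
                                      access::decorated::no>(A);
         auto pB = address_space_cast<access::address_space::global_space,
                                      access::decorated::no>(B);
         auto pC = address_space_cast<access::address_space::global_space,
                                      access::decorated::no>(C);

         jm::joint_matrix<sub_group, float, jm::use::accumulator, tM, tN> tC;
         jm::joint_matrix_fill(sg, tC, 0.0f);

         // Two blocking levels on k: KCACHE2 targets the cache, KCACHE1 the
         // registers; the innermost step is the hardware tile depth tK.
         for (size_t k2 = 0; k2 < SZ; k2 += KCACHE2)
           for (size_t k1 = k2; k1 < k2 + KCACHE2; k1 += KCACHE1)
             for (size_t k = k1; k < k1 + KCACHE1; k += tK) {
               jm::joint_matrix<sub_group, bfloat16, jm::use::a, tM, tK,
                                jm::layout::row_major> tA;
               jm::joint_matrix<sub_group, bfloat16, jm::use::b, tK, tN,
                                jm::layout::ext_intel_packed> tB;
               jm::joint_matrix_load(sg, tA, pA + m * SZ + k, SZ);
               // VNNI: pairs of B rows are interleaved, so the row stride
               // seen by the load is 2 * N elements.
               jm::joint_matrix_load(sg, tB, pB + (k / 2) * SZ * 2 + n * 2,
                                     SZ * 2);
               jm::joint_matrix_mad(sg, tC, tA, tB, tC);
             }
         jm::joint_matrix_store(sg, tC, pC + m * SZ + n, SZ,
                                jm::layout::row_major);
       })
      .wait();

  // With A = B = 1, every element of C should equal SZ.
  assert(C[0] == float(SZ));
  free(A, q); free(B, q); free(C, q);
  return 0;
}
```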
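For reference, a sketch of the VNNI packing that -DVNNI assumes for a 16-bit element type: consecutive pairs of B rows are interleaved column by column so the hardware can consume two k-elements per lane in one load. The function name and signature are hypothetical (uint16_t stands in for the bf16 bit pattern); only the index mapping matters:

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack a row-major K x N matrix of 16-bit elements into VNNI layout:
// element (k, n) moves to packed[(k / 2) * (N * 2) + n * 2 + (k % 2)],
// i.e. rows are processed in pairs and interleaved column by column.
std::vector<uint16_t> vnni_pack(const std::vector<uint16_t> &b,
                                size_t K, size_t N) {
  std::vector<uint16_t> packed(K * N);
  for (size_t k = 0; k < K; k += 2)
    for (size_t n = 0; n < N; n++) {
      packed[(k / 2) * (N * 2) + n * 2 + 0] = b[k * N + n];
      packed[(k / 2) * (N * 2) + n * 2 + 1] = b[(k + 1) * N + n];
    }
  return packed;
}
```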
To compile for Nvidia Tensor Cores (sm_80), at the default and 4096 matrix sizes:
icpx -fsycl -fsycl-targets=nvidia_gpu_sm_80 joint_matrix_bf16_fill_k_cache.cpp -DNVIDIA -DMCACHE1=64 -DNCACHE1=64 -DMCACHE2=128 -DNCACHE2=128
icpx -fsycl -fsycl-targets=nvidia_gpu_sm_80 joint_matrix_bf16_fill_k_cache.cpp -DMATRIX_SIZE=4096 -DNVIDIA -DMCACHE1=64 -DNCACHE1=64 -DMCACHE2=128 -DNCACHE2=128

To compile with the VNNI-transformed B layout:
icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DVNNI
icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DMATRIX_SIZE=4096 -DVNNI

To compile with the row-major B layout (omit -DVNNI):
icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp
icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DMATRIX_SIZE=4096

To compile with non-default cache-blocking parameters:
icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DNCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=16 -DVNNI
icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DNCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=16 -DMATRIX_SIZE=4096 -DVNNI
icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DNCACHE1=32 -DKCACHE1=32 -DMCACHE2=128 -DNCACHE2=128 -DKCACHE2=1024 -DVNNI
icpx -fsycl joint_matrix_bf16_fill_k_cache.cpp -DNCACHE1=32 -DKCACHE1=32 -DMCACHE2=256 -DNCACHE2=256 -DKCACHE2=1024 -DMATRIX_SIZE=4096 -DVNNI
To run on the Nvidia GPU:
ONEAPI_DEVICE_SELECTOR=cuda:0 ./a.out

To run on an Intel GPU with the large register file enabled:
SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file" ./a.out

To run on the CPU:
DPCPP_CPU_NUM_CUS=112 ./a.out