We are planning to replace the underlying kernel implementation with the newly developed CK tile-programming fmha kernel. Its performance is much better on MI200/MI300, especially in the MI300 cases. Once this is done, the current implementation in the main branch will be deprecated.
fwd integration with hdim=64/128: support for mask and varlen, with different kernels for the padding case (see the sketch after this list)
fwd extension to other hdims
dropout support
bwd integration (to be planned)
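For reference, here is a minimal sketch of what a masked, variable-length forward call looks like from the user side, assuming the public Python API stays compatible with upstream flash-attn's `flash_attn_varlen_func`; the function name, argument names, and shapes below are assumptions for illustration, not something confirmed in this thread:

```python
# Sketch only: assumes the ROCm/CK-backed build keeps the upstream
# flash_attn_varlen_func interface (packed varlen layout + causal mask).
import torch
from flash_attn import flash_attn_varlen_func

nheads, hdim = 16, 128            # hdim=64/128 are the first fwd targets listed above
seqlens = [5, 9, 3]               # three sequences of different lengths (varlen)
total = sum(seqlens)

q = torch.randn(total, nheads, hdim, dtype=torch.float16, device="cuda")
k = torch.randn(total, nheads, hdim, dtype=torch.float16, device="cuda")
v = torch.randn(total, nheads, hdim, dtype=torch.float16, device="cuda")

# Cumulative sequence lengths mark where each packed sequence starts/ends.
cu_seqlens = torch.tensor([0, 5, 14, 17], dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    dropout_p=0.0,                # dropout support is a later item on the list
    causal=True,                  # causal mask
)
```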
@carlushuang Our top-priority ask on FA should be enabling E2E training for GPT3/LLAMA2/Qwen/MPT first.
Are you proposing to start the integration based on the current ROCm FA or on the latest upstream FA?