support grouped query attention(MQA & GQA) for flash_attn #22

iclementine · 2024-05-16T02:14:46Z

Support grouped query attention (GQA) for flash_attn (related kernels: fwd, bwd, split_kv, total_attention).

The MQA paper

Shazeer, Noam. “Fast Transformer Decoding: One Write-Head Is All You Need.” arXiv, November 5, 2019. https://doi.org/10.48550/arXiv.1911.02150.

The GQA paper

Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” arXiv, December 23, 2023. https://doi.org/10.48550/arXiv.2305.13245.

Mind the layout of the heads in the query.
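To illustrate why the head layout matters, here is a minimal sketch of two possible ways to map query heads onto shared k/v heads. Both function names are hypothetical, and this description does not state which convention the kernels use, so check the implementation before relying on either one.

```python
# Illustration only: two hypothetical conventions for grouping query heads.
# Neither is confirmed here as the layout used by the flash_attn kernels.

def kv_head_contiguous(q_head, num_q_heads, num_kv_heads):
    """Adjacent query heads share a k/v head: [0, 0, ..., 1, 1, ...]."""
    assert num_q_heads % num_kv_heads == 0, "query heads must be a multiple of k/v heads"
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def kv_head_interleaved(q_head, num_q_heads, num_kv_heads):
    """Query heads cycle through the k/v heads: [0, 1, ..., 0, 1, ...]."""
    assert num_q_heads % num_kv_heads == 0, "query heads must be a multiple of k/v heads"
    return q_head % num_kv_heads


num_q_heads, num_kv_heads = 8, 2
contiguous = [kv_head_contiguous(h, num_q_heads, num_kv_heads) for h in range(num_q_heads)]
interleaved = [kv_head_interleaved(h, num_q_heads, num_kv_heads) for h in range(num_q_heads)]
print(contiguous)   # [0, 0, 0, 0, 1, 1, 1, 1]
print(interleaved)  # [0, 1, 0, 1, 0, 1, 0, 1]
```

With a single k/v head (MQA) the two conventions coincide. For GQA they differ, so a query tensor prepared under one convention produces wrong results under the other.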


tongxin

LG

tongxin · 2024-05-16T06:30:15Z

src/flag_attn/flash.py

 # --------------------------- public API ---------------------------
 class FlashAttention(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v, causal, sm_scale, return_log_normalizer, return_total_attention):
+        # size, stride, dtype checking
        Dq, Dk, Dv = q.shape[-1], k.shape[-1], v.shape[-1]


I think we could first run an all-equal test over the input shapes, using an iterator, as a fast path to avoid interpreter overhead.
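A rough sketch of what such a fast path might look like, working on plain shape tuples. The helper names are hypothetical, and the `(batch, heads, seqlen, headdim)` layout and the specific slow-path checks are assumptions for illustration, not the checks in this PR.

```python
def shapes_all_equal(*shapes):
    """Return True when every shape equals the first (True for 0 or 1 shapes)."""
    it = iter(shapes)
    first = next(it, None)
    return all(s == first for s in it)


def check_qkv_shapes(q_shape, k_shape, v_shape):
    """Hypothetical shape check; returns which path was taken."""
    # Fast path: plain multi-head attention where q, k and v have identical shapes.
    if shapes_all_equal(q_shape, k_shape, v_shape):
        return "fast"
    # Slow path: per-dimension checks, assuming (batch, heads, seqlen, headdim).
    Bq, Hq, M, Dq = q_shape
    Bk, Hk, N, Dk = k_shape
    Bv, Hv, Nv, Dv = v_shape
    assert Bq == Bk == Bv, "batch sizes must match"
    assert Dq == Dk == Dv, "head dims must match"
    assert Hk == Hv and N == Nv, "k and v must share heads and sequence length"
    assert Hq % Hk == 0, "query heads must be a multiple of k/v heads"
    return "slow"
```

Since `torch.Size` is a tuple subclass, `q.shape`, `k.shape` and `v.shape` could be passed in directly. Whether the fast path saves measurable time would need benchmarking.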


support grouped query attention(GQA) for flash_attn(fwd, bwd, split_kv, total_attention)

6923543

tongxin approved these changes May 16, 2024


add mqa/gqa into feature list; update documentations and testings for flash attention.

05d3a92

iclementine changed the title ~~support grouped query attention(GQA) for flash_attn~~ support grouped query attention(MQA & GQA) for flash_attn May 27, 2024

iclementine merged commit 13664fc into FlagOpen:main May 27, 2024
1 check passed
