Misc. bug: Loop range computation question of Vulkan matmul shaders #12082

Open

blurSong opened this issue Feb 26, 2025 · 1 comment
blurSong commented Feb 26, 2025

Name and Version

build: 9b17d3b (4778)

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

No response

Command line

Problem description & steps to reproduce

I am trying to tune the performance of the Vulkan backend's matmul shaders, but I cannot get the expected results. So I checked the shader mul_mm.comp itself, and I am now confused by this:

// line:176
#ifdef MUL_MAT_ID
    const uint start_k = 0;
    const uint end_k = p.K;
#else
    const uint start_k = ik * p.k_split;
    const uint end_k = min(p.K, (ik + 1) * p.k_split);
#endif

If my understanding is correct, start_k and end_k are used to control the range of the K loop:

    for (uint block = start_k; block < end_k; block += BK) {
        [[unroll]] for (uint l = 0; l < BM; l += loadstride_a) {

This confuses me because in my kernel I have k_split = 1 and ik = 0, so end_k ends up being 1. That would make the matmul main loop run only once! Is this line bugged, or is my understanding wrong? Looking forward to an answer!

First Bad Commit

No response

Relevant log output

0cc4m (Collaborator) commented Feb 26, 2025

The push constant p.k_split is not actually the value of split_k in the C++ function call. When split_k == 1, it gets set to the value of k; otherwise it is CEIL_DIV(k, split_k). The push constant name is a little confusing.

static void ggml_vk_matmul(
        ggml_backend_vk_context * ctx, vk_context& subctx, vk_pipeline& pipeline,
        vk_subbuffer&& a, vk_subbuffer&& b, vk_subbuffer&& d, vk_subbuffer&& split_k_buffer,
        uint32_t m, uint32_t n, uint32_t k, uint32_t stride_a, uint32_t stride_b, uint32_t stride_d,
        uint32_t batch_stride_a, uint32_t batch_stride_b, uint32_t batch_stride_d,
        uint32_t split_k, uint32_t batch, uint32_t ne02, uint32_t ne12, uint32_t broadcast2, uint32_t broadcast3) {
    VK_LOG_DEBUG("ggml_vk_matmul(a: (" << a.buffer->buffer << ", " << a.offset << ", " << a.size << "), b: (" << b.buffer->buffer << ", " << b.offset << ", " << b.size << "), d: (" << d.buffer->buffer << ", " << d.offset << ", " << d.size << "), split_k: (" << (split_k_buffer.buffer != nullptr ? split_k_buffer.buffer->buffer : VK_NULL_HANDLE) << ", " << split_k_buffer.offset << ", " << split_k_buffer.size << "), m: " << m << ", n: " << n << ", k: " << k << ", stride_a: " << stride_a << ", stride_b: " << stride_b << ", stride_d: " << stride_d << ", batch_stride_a: " << batch_stride_a << ", batch_stride_b: " << batch_stride_b << ", batch_stride_d: " << batch_stride_d << ", split_k: " << split_k << ", batch: " << batch << ", ne02: " << ne02 << ", ne12: " << ne12 << ", broadcast2: " << broadcast2 << ", broadcast3: " << broadcast3 << ")");
    ggml_vk_sync_buffers(subctx);
    if (split_k == 1) {
        const vk_mat_mat_push_constants pc = { m, n, k, stride_a, stride_b, stride_d, batch_stride_a, batch_stride_b, batch_stride_d, k, ne02, ne12, broadcast2, broadcast3 };
        ggml_vk_dispatch_pipeline(ctx, subctx, pipeline, { a, b, d }, sizeof(vk_mat_mat_push_constants), &pc, { m, n, batch });
        return;
    }

    GGML_ASSERT(batch_stride_d == m * n);

    const vk_mat_mat_push_constants pc1 = { m, n, k, stride_a, stride_b, stride_d, batch_stride_a, batch_stride_b, batch_stride_d, CEIL_DIV(k, split_k), ne02, ne12, broadcast2, broadcast3 };
    // Make sure enough workgroups get assigned for split k to work
    ggml_vk_dispatch_pipeline(ctx, subctx, pipeline, { a, b, split_k_buffer }, sizeof(vk_mat_mat_push_constants), &pc1, { (CEIL_DIV(m, pipeline->wg_denoms[0]) * pipeline->wg_denoms[0]) * split_k, n, batch });
    ggml_vk_sync_buffers(subctx);
    const std::array<uint32_t, 2> pc2 = { (uint32_t)(m * n * batch), split_k };
    ggml_vk_dispatch_pipeline(ctx, subctx, ctx->device->pipeline_matmul_split_k_reduce, { split_k_buffer, d }, pc2.size() * sizeof(uint32_t), pc2.data(), { m * n * batch, 1, 1 });
}
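
To make the slicing concrete, here is a small standalone sketch (the example values K = 4096, the split_k choices, and the local CEIL_DIV macro are mine, mirroring the host code above, and are not from the repository) that reproduces the shader's start_k/end_k computation for both cases:

#include <algorithm>
#include <cstdint>
#include <cstdio>

#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))

int main() {
    const uint32_t K = 4096;

    // Case 1: split_k == 1 -> the "k_split" push constant slot carries K itself,
    // so the single slice (ik = 0) covers the whole range [0, K).
    {
        const uint32_t pc_ksplit = K;       // what ggml_vk_matmul passes when split_k == 1
        const uint32_t ik        = 0;
        const uint32_t start_k   = ik * pc_ksplit;
        const uint32_t end_k     = std::min(K, (ik + 1) * pc_ksplit);
        printf("split_k=1 ik=%u -> K range [%u, %u)\n", ik, start_k, end_k);
    }

    // Case 2: split_k == 4 -> the push constant is CEIL_DIV(K, split_k),
    // and each workgroup index ik in [0, split_k) covers one contiguous K slice.
    {
        const uint32_t split_k   = 4;
        const uint32_t pc_ksplit = CEIL_DIV(K, split_k);   // 1024
        for (uint32_t ik = 0; ik < split_k; ++ik) {
            const uint32_t start_k = ik * pc_ksplit;
            const uint32_t end_k   = std::min(K, (ik + 1) * pc_ksplit);
            printf("split_k=%u ik=%u -> K range [%u, %u)\n", split_k, ik, start_k, end_k);
        }
    }
    return 0;
}

The second case prints [0, 1024), [1024, 2048), [2048, 3072), [3072, 4096): the slices tile the full K dimension, which is why p.k_split behaves as a per-split K range rather than as the split count.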
