
[QST] Why use column-major tv layout to encode the mma? #1226

Closed
mammoth831 opened this issue Dec 3, 2023 · 2 comments

@mammoth831

CuTe uses a column-major encoding for the TV layouts of the mma's multiplicands. The docs explain:

Since CuTe layouts return indices rather than coordinates, we choose a column-major encoding of the (m,n) coordinates.

But this seems to lead to wrong results for a row-major matrix.

E.g., for multiplicand A of this mma instruction:

template <>
struct MMA_Traits<SM80_16x8x8_F16F16F16F16_TN>
{
  using ElementDVal = half_t;
  using ElementAVal = half_t;
  using ElementBVal = half_t;
  using ElementCVal = half_t;
  using Shape_MNK = Shape<_16,_8,_8>;
  using ThrID     = Layout<_32>;
  using ALayout   = SM80_16x8_Row;
  using BLayout   = SM80_8x8_Row;
  using CLayout   = SM80_16x8_Row;
};
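For context, ALayout above is a (thread, value) -> index map. In the CuTe sources from around this release it is defined roughly as below (quoted from memory, so treat the exact definition as an assumption):

// (T32,V4) -> (M16,N8): the result is a column-major index into the 16x8 tile, i.e. idx = m + 16*n
using SM80_16x8_Row = Layout<Shape <Shape <_4,_8>, Shape <_2,_2>>,
                             Stride<Stride<_32,_1>, Stride<_16,_8>>>;

Evaluating this layout at thread 0, values 0..3 gives the indices 0, 16, 8, 24, which decode (column-major) to the coordinates (m,n) = (0,0), (0,1), (8,0), (8,1) -- the same fragment assignment the PTX manual lists for m16n8k8.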

Then I use the following code to read values from global memory directly.

#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>
#include <cute/atom/copy_atom.hpp>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

using namespace cute;

template <class TiledMma, class T>
__global__ void mma_test(const T* A) {
        TiledMma tiled_mma;
        using GmemCopyAtom = Copy_Atom<DefaultCopy, T>;
        auto inst_m = size<0>(typename TiledMma::AtomShape_MNK{});
        auto inst_k = size<2>(typename TiledMma::AtomShape_MNK{});

        // 16x8 view of A in global memory
        auto gA = make_tensor(make_gmem_ptr(A), make_shape(inst_m, inst_k));

        // Per-thread register fragment of A, partitioned by the tiled MMA
        auto thr_mma = tiled_mma.get_thread_slice(threadIdx.x);
        Tensor tCrA  = thr_mma.partition_fragment_A(gA);

        // Tiled copy gmem -> registers, retiled to match the MMA partitioning
        auto thr_copy_A       = make_tiled_copy_A(GmemCopyAtom{}, tiled_mma).get_thread_slice(threadIdx.x);
        Tensor tCgA           = thr_copy_A.partition_S(gA);
        Tensor tCrA_copy_view = thr_copy_A.retile_D(tCrA);

        clear(tCrA);
        copy(tCgA, tCrA_copy_view);

        // Print the four A values owned by thread 0
        if (thread0()) {
          for (int i = 0; i < 4; ++i)
            printf("t0v%d: %f \n", i, float(tCrA_copy_view(i)));
        }
}

int main() {
        using TiledMma = TiledMMA<MMA_Atom<SM80_16x8x8_F16F16F16F16_TN>>;
        auto inst_m = size<0>(TiledMma::AtomShape_MNK{});
        auto inst_k = size<2>(TiledMma::AtomShape_MNK{});
        thrust::host_vector<half_t> h_A(inst_m * inst_k);

        for (int i = 0; i < inst_m * inst_k; ++i) h_A[i] = half_t(i);
        thrust::device_vector<half_t> d_A = h_A;

        mma_test<TiledMma><<<1, 32>>>(thrust::raw_pointer_cast(d_A.data()));
        cudaDeviceSynchronize();  // flush device-side printf output before the host prints

        printf("Multiplicand A(16x8):\n");
        for (int i = 0; i < inst_m; ++i) {
          for (int j = 0; j < inst_k; ++j) {
            // row-major matrix
            printf("%3.1f ", float(h_A[i * inst_k + j]));
          }
          printf("\n");
        }
}

Since I fill the matrix with naturally increasing elements and A is a row-major matrix, I think thread 0 should read 0, 1, 64, 65 (per the PTX reference: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m16n8k8).
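As a cross-check (my own arithmetic under that row-major assumption, not something taken from the PTX doc), the expected thread-0 values can be computed directly from the fragment coordinates:

#include <cstdio>
int main() {
  // Per the PTX fragment layout, thread 0 of m16n8k8 owns A coordinates
  // (m,n) = (0,0), (0,1), (8,0), (8,1).
  // For a row-major 16x8 A filled with A[m][n] = m*8 + n, that gives:
  int coords[4][2] = {{0,0}, {0,1}, {8,0}, {8,1}};
  for (auto& c : coords) printf("%d ", c[0] * 8 + c[1]);  // prints: 0 1 64 65
}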

But the output is:

t0v0: 0.000000
t0v1: 16.000000
t0v2: 8.000000
t0v3: 24.000000

Should we use a row-major TV layout to encode a row-major multiplicand? Or is there something wrong in my code?

@ccecka

ccecka commented Dec 3, 2023

The TV Layouts in the MMAs are not "row-major" or "col-major"; they describe the partitioning pattern of each instruction and can be applied to any Tensor with any Layout.

If you would like to use row-major data, then you should use row-major data:

Tensor gA = make_tensor(make_gmem_ptr(A), make_shape(inst_m, inst_k), make_stride(inst_k, Int<1>{}));
// or
Tensor gA = make_tensor(make_gmem_ptr(A), make_shape(inst_m, inst_k), GenRowMajor{});
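Applied to the kernel above, a sketch of the intended change (not tested here) would be:

// Row-major A: coordinate (m,n) maps to offset m*inst_k + n
auto gA = make_tensor(make_gmem_ptr(A),
                      make_shape(inst_m, inst_k),
                      make_stride(inst_k, Int<1>{}));

With this layout, thread 0's fragment coordinates (0,0), (0,1), (8,0), (8,1) pick up the values 0, 1, 64, 65, matching the PTX fragment figure. Incidentally, the 0, 16, 8, 24 in the original output are exactly the column-major offsets m + 16*n of those same coordinates, consistent with gA having defaulted to a column-major layout.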

@mammoth831 (Author)


Thank you! It solved my problem.
