What is your question?
Getting the error "CopyAtom, src / dst layout doesn't vectorize into registers" when trying to implement the following tiled copy:
using g2s_copy_op = SM80_CP_ASYNC_CACHEGLOBAL<cute::uint128_t>;
using g2s_copy_traits = Copy_Traits<g2s_copy_op>;
using g2s_copy_atom = Copy_Atom<g2s_copy_traits, T>;
using G2SCopyA =
    decltype(make_tiled_copy(g2s_copy_atom{},
                             make_layout(make_shape(Int<16>{}, Int<2>{}),
                                         make_stride(Int<2>{}, Int<1>{})),
                             make_layout(make_shape(Int<1>{}, Int<8>{}))));
Tensor gA = make_tensor(make_gmem_ptr(A), make_layout(make_shape(Int<M>{}, Int<K>{}), make_stride(Int<K>{}, 1)));
Tensor sA = make_tensor(make_smem_ptr(smemA), make_layout(make_shape(Int<M>{}, Int<K>{}), make_stride(Int<K>{}, 1)));
G2SCopyA g2s_tiled_copy_a;
auto g2s_thr_copy_a = g2s_tiled_copy_a.get_slice(threadIdx.x);
auto tAgA_copy = g2s_thr_copy_a.partition_S(gA);
auto tAsA_copy = g2s_thr_copy_a.partition_D(sA);
cute::copy(g2s_tiled_copy_a, tAgA_copy, tAsA_copy);
In the above, M and K are multiples of 16 and T = cutlass::half_t. smemA is a statically allocated shared-memory array of T with M x K elements.
My understanding is that the tiled_copy I've defined uses 32 threads (16 x 2), with each thread copying 8 elements (1 x 8), so that each copy tile has shape 16 x 16. The intent is to have each thread do a vectorized copy from global to shared memory. Where am I going wrong?
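To make the arithmetic behind that understanding explicit, here is a minimal plain C++ sketch of the intended mapping. It does not use CuTe at all. The names (`num_threads`, `tile_rows`, `first_element`, and so on) are hypothetical helpers introduced only for illustration, and the sketch assumes that `sizeof(cutlass::half_t)` is 2 bytes and that each thread's 8 values are contiguous along the column mode of the tile.

```cpp
// Thread layout from the question: shape (16, 2), stride (2, 1).
constexpr int kThrRows = 16;
constexpr int kThrCols = 2;
constexpr int kThrStrideRow = 2;
constexpr int kThrStrideCol = 1;

// Value layout from the question: shape (1, 8).
constexpr int kValRows = 1;
constexpr int kValCols = 8;

// Assumption: cutlass::half_t occupies 2 bytes.
constexpr int kElemBytes = 2;

constexpr int num_threads() { return kThrRows * kThrCols; }
constexpr int tile_rows() { return kThrRows * kValRows; }
constexpr int tile_cols() { return kThrCols * kValCols; }
constexpr int bytes_per_thread() { return kValRows * kValCols * kElemBytes; }

// Thread index for thread coordinate (r, c) under stride (2, 1).
constexpr int thread_id(int r, int c) {
  return r * kThrStrideRow + c * kThrStrideCol;
}

struct Coord {
  int row;
  int col;
};

// First tile element owned by thread `tid`, assuming the thread's
// 8 values sit next to each other along the column mode.
constexpr Coord first_element(int tid) {
  return Coord{(tid / kThrStrideRow) * kValRows, (tid % kThrCols) * kValCols};
}

static_assert(num_threads() == 32, "16 x 2 threads");
static_assert(tile_rows() == 16 && tile_cols() == 16, "16 x 16 copy tile");
static_assert(bytes_per_thread() == 16, "8 halves = 128 bits per thread");
```

Under these assumptions, each thread moves 16 bytes, which matches the width of the `cute::uint128_t` used in `g2s_copy_op`, and thread 3, for example, starts at row 1, column 8 of the tile. The sketch only checks the intended sizes; it says nothing about what CuTe can prove about the layouts at compile time.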