[BUG] make_tiled_copy should not assume 2d data, Thr and Val layouts #1284
Comments
In summary, the secret behaviour limits the usage of make_tiled_copy.
The problem appears to be with two of the layouts involved: both should be explicitly rank-1 rather than rank-2 (and implicitly intended to be interpreted as 1-D). I'll include this in my documentation + pre/postconditions update coming shortly.
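(Editorial aside: a minimal sketch of the rank-1 vs. rank-2 distinction being discussed; the sizes and strides below are illustrative, not taken from the original comment.)
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // An explicitly rank-1 layout: a single mode.
  auto lay_1d = Layout<Shape<_32>, Stride<_1>>{};
  // A rank-2 layout carrying a padded size-1 mode: same size, different rank,
  // and rank-sensitive algebra such as raked_product treats it differently.
  auto lay_2d = Layout<Shape<_32, _1>, Stride<_1, _0>>{};

  print(rank(lay_1d)); print("  "); print(lay_1d); print("\n");  // _1  _32:_1
  print(rank(lay_2d)); print("  "); print(lay_2d); print("\n");  // _2  (_32,_1):(_1,_0)
}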
I happened to create my own utility function:
#include "cute/tensor.hpp"
#include <vector>
using namespace cute;
template <class ThrLayout, class ValLayout, class Tensor>
auto make_tv_view(Tensor&& tensor, const ThrLayout& thr_layout = {}, const ValLayout& val_layout = {}) {
auto layout = raked_product(thr_layout, val_layout);
auto with_iter = zipped_divide(tensor.layout(), layout);
auto layout_per_iter = get<0>(with_iter);
auto iter_layout = get<1>(with_iter);
auto layout_tv_iter = make_layout(
right_inverse(layout_per_iter).with_shape(make_shape(size(thr_layout), size(val_layout))),
iter_layout
);
return std::forward<Tensor>(tensor).compose(layout_tv_iter);
}
int main() {
std::vector<int> tensor_buffer(12);
for (int i = 0; i < 12; i++) {
tensor_buffer[i] = i;
}
{
print("====================================== case 5 ======================================\n");
// auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_4{}, _3{})));
auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_12{})));
Layout<Shape<_2>, Stride<_1>> thr;
Layout<Shape<_2>, Stride<_1>> val;
print_tensor(make_tv_view(t, thr, val)(make_coord(0, _), _));
}
}
CuTe should somehow provide a utility function like this.
If it works for you, that's great. I'll take a closer look soon. Until then, it appears it can be simplified to:
template <class ThrLayout, class ValLayout, class Tensor>
auto make_tv_view(Tensor && tensor,
ThrLayout const& thr_layout = {}, // (m,n,...) -> thr_idx
ValLayout const& val_layout = {}) // (m,n,...) -> val_idx
{
auto layout_mn = raked_product(thr_layout, val_layout);
auto layout_tv = right_inverse(layout_mn).with_shape(make_shape(size(thr_layout), size(val_layout)));
return zipped_divide(tensor, layout_mn).compose(layout_tv, _);
}
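(Editorial aside: a self-contained usage sketch of the simplified make_tv_view above, reusing the 1d "case 5" setup from the earlier comment; only the driver code is illustrative, no new API is assumed.)
#include <cute/tensor.hpp>
#include <vector>
using namespace cute;

template <class ThrLayout, class ValLayout, class Tensor>
auto make_tv_view(Tensor&& tensor,
                  ThrLayout const& thr_layout = {},  // (m,n,...) -> thr_idx
                  ValLayout const& val_layout = {})  // (m,n,...) -> val_idx
{
  auto layout_mn = raked_product(thr_layout, val_layout);
  auto layout_tv = right_inverse(layout_mn).with_shape(make_shape(size(thr_layout), size(val_layout)));
  return zipped_divide(tensor, layout_mn).compose(layout_tv, _);
}

int main() {
  std::vector<int> buf(12);
  for (int i = 0; i < 12; ++i) buf[i] = i;

  auto t  = make_tensor(buf.data(), make_layout(make_shape(_12{})));
  auto tv = make_tv_view(t, Layout<Shape<_2>, Stride<_1>>{}, Layout<Shape<_2>, Stride<_1>>{});

  // ((thr,val), iter): slice out thread 0's values across all iterations
  print_tensor(tv(make_coord(0, _), _));
}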
Great to know that I can use an underscore in compose.
EDIT: removed an unnecessary compose.
@ccecka The two versions give different results in the 3d case below; here is a comparison:
#include <cute/tensor.hpp>
using namespace cute;
template <typename ThrLayout, typename ValLayout, typename Layout>
__host__ __device__ constexpr auto make_tv_layout(
const Layout& target, // the target layout we want to compose with
const ThrLayout& thr_layout = {}, // (t0,...) -> thr_idx
const ValLayout& val_layout = {} // (v0,...) -> val_idx
) {
auto layout = raked_product(thr_layout, val_layout);
auto with_iter = zipped_divide(target, layout);
auto layout_per_iter = get<0>(with_iter);
auto iter_layout = get<1>(with_iter);
auto layout_tv_iter = make_layout(
right_inverse(layout_per_iter).with_shape(make_shape(size(thr_layout), size(val_layout))),
iter_layout
);
return layout_tv_iter;
}
template <typename ThrLayout, typename ValLayout, typename Layout>
__host__ __device__ constexpr auto make_tv_layout_ccecka(
const Layout& target, // the target layout we want to compose with
const ThrLayout& thr_layout = {}, // (t0,...) -> thr_idx
const ValLayout& val_layout = {} // (v0,...) -> val_idx
) {
auto layout = raked_product(thr_layout, val_layout); // (t0*v0,t1*v1,...) -> (thr_idx,val_idx)
auto layout_tv = right_inverse(layout).with_shape(make_shape(size(thr_layout), size(val_layout)));
return zipped_divide(target, layout).compose(layout_tv, _);
}
int main() {
constexpr int XSize = 16;
constexpr int YSize = 64;
int zsize = 32; // each warp processes a separate z idx; zsize is a runtime value
// The plan
constexpr int XVec = 4;
constexpr int YVec = 8;
constexpr int NumXVec = 4;
constexpr int NumYVec = 8;
// make_tile causes a per-mode product
auto layout = make_layout(make_shape(make_shape(make_shape(Int<XSize>{}, Int<YSize>{})), zsize));
auto tcoord = make_identity_tensor(layout.shape());
auto [layout_tvi_mine, layout_tvi_ccecka] = [&]() {
constexpr int NumThreads = 128;
constexpr int WarpSize = 32;
static_assert(NumXVec * NumYVec == WarpSize);
constexpr int NumWarps = NumThreads / WarpSize;
constexpr auto thr_layout = make_layout(make_shape(make_shape(Int<NumXVec>{}, Int<NumYVec>{}), Int<NumWarps>{}));
constexpr auto val_layout = make_layout(make_shape(make_shape(Int<XVec>{}, Int<YVec>{}), _1{}));
// X Y Z
// 16 64 32 target tensor
// 4 8 4 thr_layout
// 4 8 1 val_layout
// 16 64 4 raked_product(thr_layout, val_layout)
// 1 1 8 iter
print("thr_layout:"), print(thr_layout), print("\n");
print("val_layout:"), print(val_layout), print("\n");
return std::make_pair(make_tv_layout(layout, thr_layout, val_layout), make_tv_layout_ccecka(layout, thr_layout, val_layout));
}();
// Let's see the mapped coords of thread 0
auto t0_mine = tcoord.compose(layout_tvi_mine)(make_coord(0, _), _);
auto t0_ccecka = tcoord.compose(layout_tvi_ccecka)(make_coord(0, _), _);
print("size (mine): "), print(size(t0_mine)), print("size (ccecka): "), print(size(t0_ccecka)), print("\n");
for(int i=0;i<size(t0_mine); i++) {
print(i), print("\tcoord (mine): "), print(t0_mine(i)), print("\tcoord (ccecka): "), print(t0_ccecka(i)), print("\n");
}
// size (mine): 256size (ccecka): 256
// 0 coord (mine): (((0,0)),0) coord (ccecka): (((0,0)),0)
// 1 coord (mine): (((1,0)),0) coord (ccecka): (((0,8)),0)
// 2 coord (mine): (((2,0)),0) coord (ccecka): (((0,16)),0)
// 3 coord (mine): (((3,0)),0) coord (ccecka): (((0,24)),0)
// 4 coord (mine): (((4,0)),0) coord (ccecka): (((0,32)),0)
// 5 coord (mine): (((5,0)),0) coord (ccecka): (((0,40)),0)
// 6 coord (mine): (((6,0)),0) coord (ccecka): (((0,48)),0)
// 7 coord (mine): (((7,0)),0) coord (ccecka): (((0,56)),0)
// 8 coord (mine): (((8,0)),0) coord (ccecka): (((0,0)),1)
// 9 coord (mine): (((9,0)),0) coord (ccecka): (((0,8)),1)
// 10 coord (mine): (((10,0)),0) coord (ccecka): (((0,16)),1)
// 11 coord (mine): (((11,0)),0) coord (ccecka): (((0,24)),1)
// 12 coord (mine): (((12,0)),0) coord (ccecka): (((0,32)),1)
// 13 coord (mine): (((13,0)),0) coord (ccecka): (((0,40)),1)
// 14 coord (mine): (((14,0)),0) coord (ccecka): (((0,48)),1)
// 15 coord (mine): (((15,0)),0) coord (ccecka): (((0,56)),1)
return 0;
}
I found this because I applied the simplification, and it works pretty well for the 1d and 2d cases. For 3d, my sliced tensor is pretty weird, and I scratched my head for a whole day before I finally decided to try the old version of the TV layout maker.
I don't know what you mean by "works". Neither approach appears to be giving you what you actually intend. This is why I don't recommend attempting to reinvent partitioning.
The difference between the two is that yours composes with
auto layout_tv_iter = make_layout(
right_inverse(layout_per_iter).with_shape(make_shape(size(thr_layout), size(val_layout))),
iter_layout
);
which is a single flat Layout applied to the whole tensor, whereas the simplification composes with make_tile(layout_tv, _), which is a Tiler applied mode-by-mode that leaves the iteration mode untouched. There are other problems that will prevent this from generalizing as well.
For these kinds of partitions, you have to ask yourself what you actually care about.
The pattern that CuTe uses for partitioning looks like this:
#include <cute/tensor.hpp>
using namespace cute;
template <class Layout, class Tiler, class LayoutTV>
__host__ __device__ constexpr auto
tv_tiling(Layout const& target, // the target layout we want to compose with
Tiler const& tiler, // the tiler to extract out relevant elements
LayoutTV const& layout_tv) // the TV transform to apply to the tile of data
{
return zipped_divide(target, tiler).compose(layout_tv, _);
}
int main() {
auto layout = make_layout(make_shape(make_shape(make_shape(Int<16>{}, Int<64>{})), 32));
auto tcoord = make_identity_tensor(layout.shape());
// Select out of the target the tile of data we want to work on
auto tiler = Shape<Shape<Shape<_16,_64>>, _4>{}; // (M,N,L) -> tensor coords
// Transform the tile of data to a TV format
auto thr_layout = Layout<Shape<_4,_8,_4>>{}; // (m,n,l) -> thr_idx
auto val_layout = Layout<Shape<_4,_8,_1>>{}; // (m,n,l) -> val_idx
// (M,N,L) -> (thr_idx,val_idx)
auto thrval_layout = raked_product(thr_layout, val_layout);
// (thr_idx, val_idx) -> (M,N,L)
auto layout_tv = right_inverse(thrval_layout).with_shape(size(thr_layout), size(val_layout));
// We've constructed the layout_tv to index into the coords of the tiler
CUTE_STATIC_ASSERT_V(cosize(layout_tv) == size(tiler));
auto result = tv_tiling(tcoord, tiler, layout_tv); // ((thr, val), iter)
// Let's see the mapped coords of thread 0
auto t0_result = result(make_coord(0, _), _);
for(int i = 0; i < size(t0_result); ++i) {
print(i), print("\t"), print(t0_result(i)), print("\n");
}
// 0 (((0,0)),0) // Each thread gets a (((4,8)),1) "block" of data 8 times in the target
// 1 (((1,0)),0)
// 2 (((2,0)),0)
// 3 (((3,0)),0)
// 4 (((0,1)),0)
// 5 (((1,1)),0)
// 6 (((2,1)),0)
// 7 (((3,1)),0)
// 8 (((0,2)),0)
// 9 (((1,2)),0)
// 10 (((2,2)),0)
// 11 (((3,2)),0)
// 12 (((0,3)),0)
// 13 (((1,3)),0)
// 14 (((2,3)),0)
// 15 (((3,3)),0)
// ...
}
This is exactly the interface that CuTe's tiled partitioners (TiledCopy, TiledMMA) use.
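(Editorial aside: a sketch that applies the tv_tiling pattern above to an actual data tensor instead of an identity coordinate tensor; the (16,8) extent, 32 threads, and 4 values per thread are illustrative choices, not from the original comment.)
#include <cute/tensor.hpp>
#include <vector>
using namespace cute;

// tv_tiling with the same body as in the comment above (template parameter renamed)
template <class Target, class Tiler, class LayoutTV>
constexpr auto tv_tiling(Target const& target, Tiler const& tiler, LayoutTV const& layout_tv)
{
  return zipped_divide(target, tiler).compose(layout_tv, _);
}

int main() {
  // A (16,8) column-major data tensor, partitioned across 32 threads with 4 values each.
  std::vector<int> buf(16 * 8);
  for (int i = 0; i < int(buf.size()); ++i) buf[i] = i;
  auto data = make_tensor(buf.data(), make_layout(make_shape(_16{}, _8{})));

  auto tiler      = Shape<_16, _8>{};                    // one full tile
  auto thr_layout = Layout<Shape<_8, _4>>{};             // (m,n) -> thr_idx
  auto val_layout = Layout<Shape<_2, _2>>{};             // (m,n) -> val_idx
  auto layout_tv  = right_inverse(raked_product(thr_layout, val_layout))
                      .with_shape(size(thr_layout), size(val_layout));

  auto parts = tv_tiling(data, tiler, layout_tv);        // ((thr,val), iter)
  auto t0    = parts(make_coord(0, _), _);               // the values owned by thread 0
  for (int i = 0; i < size(t0); ++i) { print(t0(i)); print(" "); }
  print("\n");
}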
I start to see some blind spots that were previously ignored by me. The very important one is in
derived_tiler = raked_product(thr_layout, val_layout); // this is the confusing part; it should be explicit, however
thrval_layout = raked_product(thr_layout, val_layout); // forward mapping
layout_tv = right_inverse(thrval_layout).with_shape(size(thr_layout), size(val_layout)) // reversed mapping
In my previous example, each warp processes a whole (x, y) tile. I also see some blind spots of yours now.
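(Editorial aside: a tiny sketch of the two mapping directions named above; raked_product gives coords -> (thr_idx, val_idx), and its reshaped right_inverse gives (thr_idx, val_idx) -> coords. The 2x2 layouts are illustrative.)
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto thr = Layout<Shape<_2, _2>>{};
  auto val = Layout<Shape<_2, _2>>{};

  auto thrval_layout = raked_product(thr, val);               // (m,n) -> (thr_idx, val_idx)
  auto layout_tv     = right_inverse(thrval_layout)
                         .with_shape(size(thr), size(val));   // (thr_idx, val_idx) -> (m,n) index

  print(thrval_layout); print("\n");
  print(layout_tv);     print("\n");
}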
I wouldn't call it
Sorry for the confusion; this is some uncleaned code left over from copying.
As for the rank-sensitivity of raked_product, I could imagine a utility for it.
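(Editorial aside: a sketch of the rank-sensitivity in question; raking two rank-1 layouts versus the same layouts padded with a (_1,_0) mode yields differently shaped results. The layouts are illustrative.)
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto thr_1d = Layout<Shape<_2>, Stride<_1>>{};
  auto val_1d = Layout<Shape<_2>, Stride<_1>>{};
  auto thr_2d = Layout<Shape<_2, _1>, Stride<_1, _0>>{};  // what the hidden padding produces
  auto val_2d = Layout<Shape<_2, _1>, Stride<_1, _0>>{};

  print(raked_product(thr_1d, val_1d)); print("\n");  // rank-1 result
  print(raked_product(thr_2d, val_2d)); print("\n");  // rank-2 result
}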
This issue has been labeled as inactive.
Closing due to inactivity.
@mnicely Sorry, but I think this is still relevant. Inactivity does not imply it is solved.
The bugs I pointed out in #1284 (comment) were fixed with the 3.4 docs update. Those fixes strengthen and improve the post-conditions of the rank-sensitive operations involved. The rest of the concepts in this thread remain intact, but the functionality of the original code should be more consistent with expectations.
I manually verified with the latest main @ bbe579a. Case 1 works now (i.e., for 1d data, Thr, and Val). Case 3 works as expected now (i.e., for 2d data with 1d Thr and Val and automatic padding). I'd assume higher-rank cases will work as expected also. @ccecka I had been thinking that this was caused by a design that intrinsically only supported 2d, but it turned out to be a bug. Sorry for the misunderstanding.
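(Editorial aside: a sketch of the kind of rank-1 usage that case 1 describes, assuming the fixed behavior verified in the comment above; the UniversalCopy atom, the 32-thread and 4-value layouts, and the 128-element tensor are illustrative choices, not from the original report.)
#include <cute/tensor.hpp>
#include <cute/atom/copy_atom.hpp>
#include <vector>
using namespace cute;

int main() {
  // 1d data, rank-1 Thr and Val layouts.
  std::vector<int> buf(128);
  auto t = make_tensor(buf.data(), make_layout(make_shape(Int<128>{})));

  auto tiled_copy = make_tiled_copy(Copy_Atom<UniversalCopy<int>, int>{},
                                    Layout<Shape<_32>>{},  // 32 threads, rank-1
                                    Layout<Shape<_4>>{});  // 4 values per thread, rank-1
  auto thr_copy = tiled_copy.get_slice(0);                 // partition for thread 0
  auto t0       = thr_copy.partition_S(t);                 // thread 0's view of the source
  print(t0.layout()); print("\n");
}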
Describe the bug
make_tiled_copy also should not secretly pad Thr and Val. See code sample and discussion.
Steps/Code to reproduce bug
Expected behavior
Note: the tensor is 1d for some reason; threads access data in an interleaved style, over multiple iterations.
case 1
Case 1 is a perfect use case of "compose then slice" with a 1d Tensor, a 1d Thr and a 1d Val, all along the same direction. The hidden append in the implementation prevents make_tiled_copy from being useful in this case.
case 2
Case 2 is what we want and it works as expected. It works because we make everything explicit and correctly express the ordering.
case 3
Case 3 is extremely weird and confusing due to the append<R>: it produces (_2,_1):(_1,_0) from _2:_1 (see the sketch after this case list). It produces an incomprehensible iteration mode. This should not happen...
case 4
Case 4 is what case 3 should have been. It is a manually fixed version of case 3.
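(Editorial aside, referenced from case 3: a sketch of the hidden padding, where append pads a rank-1 layout with a (_1,_0) mode; only the _2:_1 example from the report is used.)
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto v = Layout<Shape<_2>, Stride<_1>>{};
  print(v);            print("\n");  // _2:_1
  print(append<2>(v)); print("\n");  // (_2,_1):(_1,_0)
}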
Environment details (please complete the following information):
a75b4ac