Update XeGPU.md
Jianhui-Li authored Feb 9, 2025
1 parent 991fba7 commit a2a25cd
Showing 1 changed file (docs/rfcs/XeGPU.md) with 64 additions and 54 deletions.
@@ -95,11 +95,6 @@ create_nd_tdesc also accepts a memref as input instead of a memory address, shap

The example below accepts a memory address and an offset and creates a 1D tensor_desc. The tensor_desc describes a 1D vector that is loaded by all work items combined within the subgroup.
```mlir
#sg_map_a = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
#tdesc_attr1 = !xegpu.block_tdesc_attr<memory_space=slm, boundary_check=false, sg= #sg_map_a>
%tdesc1 = xegpu.create_nd_tdesc %mem_addr, %offset :
uint64, index into tensor_desc<16xbf16, #tdesc_attr1>
#tdesc_attr2 = !xegpu.block_tdesc_attr<memory_space=slm, boundary_check=false>
%tdesc2 = xegpu.create_nd_tdesc %mem_addr, %offset :
uint64, index into tensor_desc<16xbf16, #tdesc_attr2>
@@ -344,52 +339,64 @@ Within the sg_map, `wi_layout` specifies the layout of work items, describing th
into tensor_desc<8x16xbf16, #sg_map_a>
```

wi_data_size refers to the data size mapped to an individual work item, and sg_map_size to the collective size covered by all the work items as specified by sg_map. distribute_unit_size represents the minimum size of the 2D tensor to be distributed to work items in a subgroup. tensor_size refers to the size of the tensor specified by tensor_desc.
In the example above, wi_data_size is 1, sg_map_size is 16, tensor_size is 128.
wi_data_size refers to the data size mapped to an individual work item, and distribute_unit_size represents the minimum size of the 2D tensor to be distributed to all work items in a subgroup. tensor_size refers to the size of the tensor specified by tensor_desc. In the example above, wi_data_size is 1, distribute_unit_size is 16, and tensor_size is 128.
```mlir
wi_data_size = wi_data[0] x wi_data[1]
subgroup_size == wi_layout[0] x wi_layout[1]
sg_map_size[0] = wi_layout[0] x wi_data[0]
sg_map_size[1] = wi_layout[1] x wi_data[1]
distribute_unit_size = sg_map_size[0] x sg_map_size[1] = subgroup_size x wi_data_size
distribute_unit_size = subgroup_size x wi_data_size
tensor_size = tensor_desc[0] x tensor_desc[1]
```
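
As a concrete check, here is a sketch of the arithmetic for the 8x16 bf16 example above (wi_layout = [1, 16], wi_data = [1, 1]); it simply restates the numbers the text already gives:
```mlir
// worked numbers for tensor_desc<8x16xbf16, #sg_map_a>
wi_data_size         = 1 x 1  = 1
subgroup_size        = 1 x 16 = 16
distribute_unit_size = 16 x 1 = 16
tensor_size          = 8 x 16 = 128  // divisible by distribute_unit_size, so the tensor can be distributed
```
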
wi_data_size can be larger than 1, meaning that each work item operates on multiple elements, which is eventually lowered to a "SIMT-flavor" vector, such as a SPIR-V vector or LLVM vector, or packed into a storage data type for matrix operations. The multiple elements indicated by wi_data can only be from one dimension and must be contiguous in memory along either dimension.
wi_data_size can be greater than 1, meaning each work item can own multiple elements in a single distribution. These multiple elements, indicated by wi_data, must come from one dimension and must be contiguous along that dimension.

To distribute a tensor, tensor_size must be divisible by distribute_unit_size. More specifically, tensor_desc[0] must be divisible by wi_layout[0] x wi_data[0], and tensor_desc[1] by wi_layout[1] x wi_data[1]. The 2D subtensor is evenly distributed to work items, so each work item gets a 2D data fragment, which may contain multiple distributions of wi_data elements.
To distribute a tensor, tensor_size must be divisible by distribute_unit_size. Specifically, tensor_desc[0] must be divisible by wi_layout[0] x wi_data[0], and tensor_desc[1] must be divisible by wi_layout[1] x wi_data[1]. The 2D subtensor is distributed in a round-robin manner, meaning each work item receives multiple distributions. As a result, the distribution assigns a 2D data fragment to each work item, where the first dimension corresponds to the number of distributions, and the second corresponds to wi_data_size. The 2D data fragment is ultimately lowered to a "SIMT-flavored" vector, such as a SPIR-V vector or LLVM vector, where each vector element contains the packed data of the wi_data. The size of packed data could be 32-bit or 16-bit.

The size of the result data fragment per work item can be computed as follows:
```mlir
WI_data_frag[0] = tensor_desc[0]/wi_layout[0]
WI_data_frag[1] = tensor_desc[1]/wi_layout[1]
WI_data_frag[0] = tensor_size / distribute_unit_size
WI_data_frag[1] = wi_data_size
```
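
Applying these formulas to the 8x16 bf16 example above gives the per-work-item fragment shape described below; this is a worked restatement, not new behavior:
```mlir
WI_data_frag[0] = 128 / 16 = 8
WI_data_frag[1] = 1
// so each work item receives a <8x1xbf16> fragment
```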

The WI distribution is represented by the shape of the result vector being loaded, which is reduced from [tensor_desc[0], tensor_desc[1]] to [WI_data_frag[0], WI_data_frag[1]].

With the `sg_map` attribute attached to tensor_desc, xegpu.load_nd operates in SIMT flavor and returns a fragment associated with an individual work item. The tensor_desc in the example below specifies a tensor of 8x16 elements, which is decomposed to 8x1 subtensors, each with sg_map_size 1x16. The result vector <8x16xbf16> is loaded and distributed to each WI as <8x1xbf16>.
With the `sg_map` attribute attached to tensor_desc, xegpu.load_nd operates in SIMT flavor and returns a fragment associated with an individual work item. The tensor_desc in the first example below specifies a tensor of 8x16 elements, which is distributed 8 times so each work item gets <8x1xbf16>. The second example shows that each work item gets <8x2xint8>, with 2 int8 elements packed as one 16-bit data element.
```mlir
#sg_map_a = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
%vector_a = xegpu.load_nd %tdesc_a:
tensor_desc<8x16xbf16, #sg_map_a> into vector<8x1xbf16>
#sg_map_a = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]>
%vector_a = xegpu.load_nd %tdesc_a:
tensor_desc<8x32xint8, #sg_map_a> into vector<8x2xint8>
```
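
For symmetry, a sketch of the corresponding SIMT-flavor store, which consumes the distributed fragment; this assumes the same #sg_map_a as above and follows the store_nd form used later in this document:
```mlir
#sg_map_a = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
xegpu.store_nd %vector_a, %tdesc_a:
   vector<8x1xbf16>, tensor_desc<8x16xbf16, #sg_map_a>
```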

For load_nd with the `packed` attribute, wi_data[0] must equal the size required to "pack" the low-precision data into 32-bit, also known as `vnni_factor`. The result vector takes wi_data[0] as its inner dimension size, to indicate the effects of the layout change known as the "VNNI" transformation. The data fragment <16x1xbf16> is loaded and packed as <8x1x2xbf16>.
The example below shows that wi_data contains 2 elements in the first dimension. The result vector takes wi_data_size as its inner dimension size; the data fragment <16x1xbf16> is loaded and packed as <8x2xbf16>, a process also known as the "VNNI" transformation.

```mlir
#sg_map_b = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]>
%vector_b = xegpu.load_nd {packed} %tdesc1:
tensor_desc<16x16xbf16, #sg_map_b> into vector<8x1x2xbf16>
tensor_desc<16x16xbf16, #sg_map_b> into vector<8x2xbf16>
```
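
An analogous sketch for 8-bit data is shown below. It assumes the PVC matrix-B int8 mapping listed in the load rules later in this document (wi_data = [4, 1], i.e. four 8-bit elements pack into one 32-bit element); the value names are illustrative:
```mlir
#sg_map_b_i8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]>
%vector_b_i8 = xegpu.load_nd {packed} %tdesc_b:
   tensor_desc<32x16xint8, #sg_map_b_i8> into vector<8x4xint8>
```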

For load_nd with the `transpose` attribute, wi_layout is transposed to match the tensor dimension swap. The tensor is first distributed to WIs using `sg_map`, so each WI gets 1x8xf32 in the example below, and then transposed to 8x1xf32. The data fragment <1x8xf32> is loaded and transposed as <8x1xf32>.
For load_nd with the `transpose` attribute, wi_layout is transposed to match the tensor dimension swap. The tensor is distributed 8 times, each time getting one f32 element, so each WI gets <8x1xf32>.
```mlir
#sg_map_at = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>
%at = xegpu.load_nd %tdesc1 {transpose = [1,0]} :
tensor_desc<16x8xf32, #sg_map_at> into vector<8x1xf32>
```
The examples below demonstrate how wi_data can be used to model the transpose_bit_width. When wi_data is [1, 2], the transpose treats the matrix as consisting of 32-bit data elements. In this case, each work item receives 8x2 bf16 elements, rather than 16x1 bf16.
```mlir
#sg_map_at = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>
%at = xegpu.load_nd %tdesc1 {transpose = [1,0]} :
tensor_desc<16x16xfp16, #sg_map_at> into vector<16x1xfp16>
#sg_map_at = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 2]>
%at = xegpu.load_nd %tdesc1 {transpose = [1,0], transpose_bit_width = 32} :
tensor_desc<16x16xbf16, #sg_map_at> into vector<8x2xbf16>
```
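
The fragment shape in the transpose_bit_width case follows the same arithmetic; a worked sketch for the 16x16 bf16 example above:
```mlir
// wi_layout = [16, 1], wi_data = [1, 2], transpose_bit_width = 32
wi_data_size    = 1 x 2          = 2
tensor_size     = 16 x 16        = 256
WI_data_frag[0] = 256 / (16 x 2) = 8
WI_data_frag[1] = 2
// so each work item receives a <8x2xbf16> fragment
```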

`xegpu.sg_map` is also applied to 1d vector load for WI data distribution. When the tensor_desc only specifies a 1d tensor, `sg_map.wi_layout[0]` and `sg_map.wi_data[0]` must be 1, and they are ignored in the WI distribution.
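
A minimal sketch of a 1d block load under this rule, assuming 32-bit elements and the PVC subgroup size of 16 (the value names are illustrative):
```mlir
#sg_map_1d = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
%vector_1d = xegpu.load_nd %tdesc_1d:
   tensor_desc<16xfp32, #sg_map_1d> into vector<1xfp32>
```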

`xegpu.sg_map` is also used to describe the WI data distribution for regular load. The example below shows that each WI loads one fp32 data element. The result vector <16xfp32> is loaded and distributed to each WI as <1xfp32>.
`xegpu.sg_map` is also used to describe the WI data distribution for regular loads. The example below shows that each work item (WI) loads one fp32 data element. The resulting vector <16xfp32> is loaded and distributed to each WI as <1xfp32>.
```mlir
#sg_map_t = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
#scatter_attr = !xegpu.tdesc_attr< memory_space=slm, scattered=true>
@@ -400,9 +407,9 @@ For load_nd with `transpose` attribute, wi_layout is transposed to match with th
   tensor_desc<16xfp32, #scatter_attr, #sg_map_t>, vector<1xi1> -> vector<1xfp32>
```

The example below shows that each WI loads 4 fp32 data elements with the chunk_size. This load with chunk_size effectively loads a 2D tensor and transposes it. The data fragment <1x4xf32> is loaded and transposed as <4x1xf32>.
The example below illustrates how each work item loads 4 fp32 data elements with the chunk_size. This loading process, combined with the chunk_size, effectively loads a 2D tensor and performs a transpose, resulting in the transposition of the wi_layout.
```mlir
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 4]>
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>
#scatter_attr = !xegpu.tdesc_attr< memory_space=slm, scattered=true>
%scatter_tdesc_chunk = xegpu.create_tdesc, %src_addr, %offsets
{chunk_size=4} :
@@ -412,6 +419,18 @@ Below example shows that each WI loads 4 fp32 data element with the chunk_size.
   tensor_desc<16x4xfp32, #scatter_attr, #sg_map_t>, vector<1xi1> -> vector<4x1xfp32>
```

The load with chunk_size packs the low-precision data into 32-bit data using wi_data = [1, 2].
```mlir
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 2]>
#scatter_attr = !xegpu.tdesc_attr< memory_space=slm, scattered=true>
%scatter_tdesc_chunk = xegpu.create_tdesc, %src_addr, %offsets
{chunk_size=4} :
uint64, vector<16xindex> into tensor_desc<16x8xbf16, #scatter_attr, #sg_map_t>
%result = xegpu.load_gather %scatter_tdesc_chunk, %mask {L1 = cached, L2 = uncached, transpose=[1,0]} :
   tensor_desc<16x8xbf16, #scatter_attr, #sg_map_t>, vector<1xi1> -> vector<4x2xbf16>
```

Users must use legal sg_map values for the WI data distribution for certain operations on PVC and ARC. These include load_nd/store_nd, load/store with chunk_size, and DPAS.

## Rules of sg_map setting for load and store on PVC and ARC
@@ -425,11 +444,11 @@ For matrix A load
#sg_map_a_ui8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]> // WI data distribute from [8, 32] to [8, 2]
#sg_map_a_si8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]> // WI data distribute from [8, 32] to [8, 2]
For matrix B load
#sg_map_b_bf16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]> // WI data distribute from [16, 16] to [16, 1], packed as [8, 1, 2]
#sg_map_b_f16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]> // WI data distribute from [16, 16] to [16, 1], packed as [8, 1, 2]
#sg_map_b_bf16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]> // WI data distribute from [16, 16] to [8, 2]
#sg_map_b_f16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]> // WI data distribute from [16, 16] to [8, 2]
#sg_map_b_tf32 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]> // WI data distribute from [8, 16] to [8, 1]
#sg_map_b_ui8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]> // WI data distribute from [32, 16] to [32, 1], packed as [8, 1, 4]
#sg_map_b_si8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]> // WI data distribute from [32, 16] to [32, 1], packed as [8, 1, 4]
#sg_map_b_ui8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]> // WI data distribute from [32, 16] to [8, 4]
#sg_map_b_si8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]> // WI data distribute from [32, 16] to [8, 4]
For matrix C load
#sg_map_c_f32 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]> // WI data distribute from [8, 16] to [8, 1]
#sg_map_c_si32 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]> // WI data distribute from [8, 16] to [8, 1]
@@ -450,11 +469,11 @@ For matrix A load
#sg_map_a_ui8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]> // WI data distribute from [8, 32] to [8, 4]
#sg_map_a_si8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]> // WI data distribute from [8, 32] to [8, 4]
For matrix B load
#sg_map_b_bf16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [2, 1]> // WI data distribute from [16, 8] to [16, 1], packed as [8, 1, 2]
#sg_map_b_f16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [2, 1]> // WI data distribute from [16, 8] to [16, 1], packed as [8, 1, 2]
#sg_map_b_bf16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [2, 1]> // WI data distribute from [16, 8] to [8, 2]
#sg_map_b_f16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [2, 1]> // WI data distribute from [16, 8] to [8, 2]
#sg_map_b_tf32 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]> // WI data distribute from [8, 8] to [8, 1]
#sg_map_b_ui8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [4, 1]> // WI data distribute from [32, 8] to [32, 1], packed as [8, 1, 4]
#sg_map_b_si8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [4, 1]> // WI data distribute from [32, 8] to [32, 1], packed as [8, 1, 4]
#sg_map_b_ui8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [4, 1]> // WI data distribute from [32, 8] to [8, 4]
#sg_map_b_si8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [4, 1]> // WI data distribute from [32, 8] to [8, 4]
For matrix C load
#sg_map_c_f32 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]> // WI data distribute from [8, 8] to [8, 1]
#sg_map_c_si32 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]> // WI data distribute from [8, 8] to [8, 1]
@@ -471,48 +490,39 @@ user must use for the WI data distribution of 1d block load and regular load wit
```mlir
For 1d block load
# assert (wi_layout[0] x wi_layout[1] == subgroup_size) // PVC subgroup_size = 16
#sg_map = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
#sg_map = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]> // for 32-bit data element
For regular load with chunk_size // PVC subgroup_size = 16
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, chunk_size]>
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]> // for 32-bit data element
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 2]> // for 16-bit data element like bf16, f16
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 4]> // for 8-bit data element like uint8, sint8
For 1d block load
# assert (wi_layout[0] x wi_layout[1] == subgroup_size) // ARC subgroup_size = 8
#sg_map = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]>
#sg_map = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]> // for 32-bit data element
For regular load with chunk_size // ARC subgroup_size = 8
#sg_map_t = xegpu.sg_map<wi_layout = [8, 1], wi_data = [1, chunk_size]>
#sg_map_t = xegpu.sg_map<wi_layout = [8, 1], wi_data = [1, 1]> // for 32-bit data element
#sg_map_t = xegpu.sg_map<wi_layout = [8, 1], wi_data = [1, 2]> // for 16-bit data element like bf16, f16
#sg_map_t = xegpu.sg_map<wi_layout = [8, 1], wi_data = [1, 4]> // for 8-bit data element like uint8, sint8
```

## Rules of sg_map setting for DPAS on PVC and ARC
The sg_map setting rule for DPAS is applied to the input and output vector operands. The sg_map setting rules of 2d block load for matrices A, B, and C/D are reused. For matrix B, since the data being loaded from memory is VNNI transformed, the wi_data needs to change accordingly so that it is consistent with the vector for operand B. It should use the following.

For PVC subgroup_size = 16
#sg_map_b_reg_bf16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]> // WI data distribute from [8, 16, 2] to [8, 1, 2]
#sg_map_b_reg_f16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]> // WI data distribute from [8, 16, 2] to [8, 1, 2]
#sg_map_b_reg_ui8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 4]> // WI data distribute from [8, 16, 4] to [8, 1, 4]
#sg_map_b_reg_si8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 4]> // WI data distribute from [8, 16, 4] to [8, 1, 4]

For ARC subgroup_size = 8
#sg_map_b_bf16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 2]> // WI data distribute from [8, 8, 2] to [8, 1, 2]
#sg_map_b_f16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 2]> // WI data distribute from [8, 8, 2] to [8, 1, 2]
#sg_map_b_ui8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]> // WI data distribute from [8, 8, 4] to [8, 1, 4]
#sg_map_b_si8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]> // WI data distribute from [8, 8, 4] to [8, 1, 4]
The DPAS operation can be associated with sg_map temporarily within the sg_map propagation pass, as DPAS requires each work item to hold the correct data fragments for both the input and output tensors. The sg_map does not describe the in-register data layout; rather, it describes the data fragments based on the plain layout of the input and output tensors. Therefore, DPAS reuses the same sg_map settings as the 2D block loads of matrices A, B, and C/D. Below are a few examples of DPAS's sg_map.

```mlir
PVC BF16 example
#sg_map_a = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
#sg_map_c = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
#sg_map_b_reg = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]>
#sg_map_b = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a #sg_map_b_reg #sg_map_c} :vector<8x1xbf16>, vector<8x1x2xbf16> into vector<8x1xfloat>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a #sg_map_b_reg #sg_map_c} :vector<8x1xbf16>, vector<8x2xbf16> into vector<8x1xfloat>
ARC int8 example
#sg_map_a_ui8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]>
#sg_map_c = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]>
#sg_map_b_ui8_reg = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]>
#sg_map_b_ui8_reg = xegpu.sg_map<wi_layout = [1, 8], wi_data = [4, 1]>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a_ui #sg_map_b_ui8_reg #sg_map_c} :vector<8x4xui8>, vector<8x1x4xui8> into vector<8x1xfloat>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a_ui #sg_map_b_ui8_reg #sg_map_c} :vector<8x4xui8>, vector<8x4xui8> into vector<8x1xfloat>
```
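
A similar PVC int8 sketch, assuming the PVC int8 load mappings above (A distributed to <8x2xui8>, B to <8x4xui8>) and an si32 accumulator per #sg_map_c_si32; this is an illustrative combination, not an additional rule:
```mlir
PVC int8 example
  #sg_map_a_i8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]>
  #sg_map_b_i8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]>
  #sg_map_c_i32 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
  %vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a_i8 #sg_map_b_i8 #sg_map_c_i32} :vector<8x2xui8>, vector<8x4xui8> into vector<8x1xsi32>
```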

## sg_map use case - 2d load
@@ -538,7 +548,7 @@ An example on how to load a 2d block, perform dpas, and store back to memory.
%vector_b = xegpu.load_nd %tdesc1:
tensor_desc<16x16xbf16, #sg_map_b> into vector<8x2xbf16>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a #sg_map_b_reg #sg_map_c} :vector<8x1xbf16>, vector<8x1x2xbf16> into vector<8x1xfloat>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a #sg_map_b_reg #sg_map_c} :vector<8x1xbf16>, vector<8x2xbf16> into vector<8x1xfloat>
xegpu.store_nd %vector_c, %tdesc2:
vector<8x1xfloat>, tensor_desc<8x16xfloat, #sg_map_c>
@@ -553,7 +563,7 @@ An example on how to perform transpose using load_gather with chunk_size in SIMT

```mlir
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 4]>
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>
#scatter_attr = !xegpu.tdesc_attr< memory_space=slm, scattered=true>
%scatter_tdesc_chunk = xegpu.create_tdesc, %src_addr, %offsets
{chunk_size=4} :
