Update XeGPU.md
Jianhui-Li authored Feb 9, 2025
1 parent 991fba7 commit a2a25cd
Showing 1 changed file (docs/rfcs/XeGPU.md) with 64 additions and 54 deletions.
@@ -95,11 +95,6 @@ create_nd_tdesc also accepts a memref as input instead of a memory address, shap

The example below accepts a memory address and an offset and creates a 1D tensor_desc. The tensor_desc describes a 1D vector that is loaded by all work items combined within the subgroup.
```mlir
#sg_map_a = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
#tdesc_attr1 = !xegpu.block_tdesc_attr<memory_space=slm, boundary_check=false, sg= #sg_map_a>
%tdesc1 = xegpu.create_nd_tdesc %mem_addr, %offset :
uint64, index into tensor_desc<16xbf16, #tdesc_attr1>
#tdesc_attr2 = !xegpu.block_tdesc_attr<memory_space=slm, boundary_check=false>
%tdesc2 = xegpu.create_nd_tdesc %mem_addr, %offset :
uint64, index into tensor_desc<16xbf16, #tdesc_attr2>
@@ -344,52 +339,64 @@ Within the sg_map, `wi_layout` specifies the layout of work items, describing th
into tensor_desc<8x16xbf16, #sg_map_a>
```

wi_data_size refers to the data size mapped to an individual work item, and sg_map_size to the collective size covered by all the work items as specified by sg_map. distribute_unit_size represents the minimum size of the 2D tensor to be distributed to work items in a subgroup. tensor_size refers to the size of the tensor specified by tensor_desc.
In the example above, wi_data_size is 1, sg_map_size is 16, tensor_size is 128.
wi_data_size refers to the data size mapped to an individual work item, and distribute_unit_size represents the minimum size of the 2D tensor to be distributed to all work items in a subgroup. tensor_size refers to the size of the tensor specified by tensor_desc. In the example above, wi_data_size is 1, distribute_unit_size is 16, and tensor_size is 128.
```mlir
wi_data_size = wi_data[0] x wi_data[1]
subgroup_size == wi_layout[0] x wi_layout[1]
sg_map_size[0] = wi_layout[0] x wi_data[0]
sg_map_size[1] = wi_layout[1] x wi_data[1]
distribute_unit_size = sg_map_size[0] x sg_map_size[1] = subgroup_size x wi_data_size
distribute_unit_size = subgroup_size x wi_data_size
tensor_size = tensor_desc[0] x tensor_desc[1]
```
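
As a concrete check, here is a sketch of the arithmetic for the 8x16 bf16 example above (wi_layout = [1, 16], wi_data = [1, 1]); it simply restates the numbers the text already gives:
```mlir
// worked numbers for tensor_desc<8x16xbf16, #sg_map_a>
wi_data_size         = 1 x 1  = 1
subgroup_size        = 1 x 16 = 16
distribute_unit_size = 16 x 1 = 16
tensor_size          = 8 x 16 = 128  // divisible by distribute_unit_size, so the tensor can be distributed
```
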
wi_data_size can be larger than 1, meaning that each work item operates on multiple elements, which is eventually lowered to a "SIMT-flavor" vector, such as a SPIR-V vector or LLVM vector, or packed into a storage data type for matrix operations. The multiple elements indicated by wi_data can only be from one dimension and must be contiguous in memory along either dimension.
wi_data_size can be greater than 1, meaning each work item can own multiple elements in a single distribution. These multiple elements, indicated by wi_data, must come from one dimension and must be contiguous along that dimension.

To distribute a tensor, tensor_size must be divisible by distribute_unit_size. More specifically, tensor_desc[0] must be divisible by wi_layout[0] x wi_data[0], and tensor_desc[1] by wi_layout[1] x wi_data[1]. The 2D subtensor is evenly distributed to work items, so each work item gets a 2D data fragment, which may contain multiple distributions of wi_data elements.
To distribute a tensor, tensor_size must be divisible by distribute_unit_size. Specifically, tensor_desc[0] must be divisible by wi_layout[0] x wi_data[0], and tensor_desc[1] must be divisible by wi_layout[1] x wi_data[1]. The 2D subtensor is distributed in a round-robin manner, meaning each work item receives multiple distributions. As a result, the distribution assigns a 2D data fragment to each work item, where the first dimension corresponds to the number of distributions, and the second corresponds to wi_data_size. The 2D data fragment is ultimately lowered to a "SIMT-flavored" vector, such as a SPIR-V vector or LLVM vector, where each vector element contains the packed data of the wi_data. The size of packed data could be 32-bit or 16-bit.

The size of the result data fragment per work item can be computed as follows:
```mlir
WI_data_frag[0] = tensor_desc[0]/wi_layout[0]
WI_data_frag[1] = tensor_desc[1]/wi_layout[1]
WI_data_frag[0] = tensor_size / distribute_unit_size
WI_data_frag[1] = wi_data_size
```
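
Applying these formulas to the 8x16 bf16 example above gives the per-work-item fragment shape described below; this is a worked restatement, not new behavior:
```mlir
WI_data_frag[0] = 128 / 16 = 8
WI_data_frag[1] = 1
// so each work item receives a <8x1xbf16> fragment
```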

The WI distribution is represented by the shape of the result vector being loaded, which is reduced from [tensor_desc[0], tensor_desc[1]] to [WI_data_frag[0], WI_data_frag[1]].

With the `sg_map` attribute attached to tensor_desc, xegpu.load_nd operates in SIMT flavor and returns a fragment associated with an individual work item. The tensor_desc in the example below specifies a tensor of 8x16 elements, which is decomposed to 8x1 subtensors, each with sg_map_size 1x16. The result vector <8x16xbf16> is loaded and distributed to each WI as <8x1xbf16>.
With the `sg_map` attribute attached to tensor_desc, xegpu.load_nd operates in SIMT flavor and returns a fragment associated with an individual work item. The tensor_desc in the first example below specifies a tensor of 8x16 elements, which is distributed 8 times so each work item gets <8x1xbf16>. The second example shows that each work item gets <8x2xint8>, with 2 int8 elements packed as one 16-bit data element.
```mlir
#sg_map_a = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
%vector_a = xegpu.load_nd %tdesc_a:
tensor_desc<8x16xbf16, #sg_map_a> into vector<8x1xbf16>
#sg_map_a = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]>
%vector_a = xegpu.load_nd %tdesc_a:
tensor_desc<8x32xint8, #sg_map_a> into vector<8x2xint8>
```
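
For symmetry, a sketch of the corresponding SIMT-flavor store, which consumes the distributed fragment; this assumes the same #sg_map_a as above and follows the store_nd form used later in this document:
```mlir
#sg_map_a = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
xegpu.store_nd %vector_a, %tdesc_a:
   vector<8x1xbf16>, tensor_desc<8x16xbf16, #sg_map_a>
```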

For load_nd with the `packed` attribute, wi_data[0] must equal the size required to "pack" the low-precision data into 32-bit, also known as `vnni_factor`. The result vector takes wi_data[0] as its inner dimension size, to indicate the effects of the layout change known as the "VNNI" transformation. The data fragment <16x1xbf16> is loaded and packed as <8x1x2xbf16>.
The example below shows that wi_data contains 2 elements in the first dimension. The result vector takes wi_data_size as its inner dimension size; the data fragment <16x1xbf16> is loaded and packed as <8x2xbf16>, a process also known as the "VNNI" transformation.

```mlir
#sg_map_b = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]>
%vector_b = xegpu.load_nd {packed} %tdesc1:
tensor_desc<16x16xbf16, #sg_map_b> into vector<8x1x2xbf16>
tensor_desc<16x16xbf16, #sg_map_b> into vector<8x2xbf16>
```
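
An analogous sketch for 8-bit data is shown below. It assumes the PVC matrix-B int8 mapping listed in the load rules later in this document (wi_data = [4, 1], i.e. four 8-bit elements pack into one 32-bit element); the value names are illustrative:
```mlir
#sg_map_b_i8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]>
%vector_b_i8 = xegpu.load_nd {packed} %tdesc_b:
   tensor_desc<32x16xint8, #sg_map_b_i8> into vector<8x4xint8>
```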

For load_nd with the `transpose` attribute, wi_layout is transposed to match the tensor dimension swap. The tensor is first distributed to WIs using `sg_map`, so each WI gets 1x8xf32 in the example below, and then transposed to 8x1xf32. The data fragment <1x8xf32> is loaded and transposed as <8x1xf32>.
For load_nd with the `transpose` attribute, wi_layout is transposed to match the tensor dimension swap. The tensor is distributed 8 times, each time getting one f32 element, so each WI gets <8x1xf32>.
```mlir
#sg_map_at = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>
%at = xegpu.load_nd %tdesc1 {transpose = [1,0]} :
tensor_desc<16x8xf32, #sg_map_at> into vector<8x1xf32>
```
The examples below demonstrate how wi_data can be used to model the transpose_bit_width. When wi_data is [1, 2], the transpose treats the matrix as consisting of 32-bit data elements. In this case, each work item receives 8x2 bf16 elements, rather than 16x1 bf16.
```mlir
#sg_map_at = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>
%at = xegpu.load_nd %tdesc1 {transpose = [1,0]} :
tensor_desc<16x16xfp16, #sg_map_at> into vector<16x1xfp16>
#sg_map_at = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 2]>
%at = xegpu.load_nd %tdesc1 {transpose = [1,0], transpose_bit_width = 32} :
tensor_desc<16x16xbf16, #sg_map_at> into vector<8x2xbf16>
```
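
The fragment shape in the transpose_bit_width case follows the same arithmetic; a worked sketch for the 16x16 bf16 example above:
```mlir
// wi_layout = [16, 1], wi_data = [1, 2], transpose_bit_width = 32
wi_data_size    = 1 x 2          = 2
tensor_size     = 16 x 16        = 256
WI_data_frag[0] = 256 / (16 x 2) = 8
WI_data_frag[1] = 2
// so each work item receives a <8x2xbf16> fragment
```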

`xegpu.sg_map` is also applied to 1d vector load for WI data distribution. When the tensor_desc only specifies a 1d tensor, `sg_map.wi_layout[0]` and `sg_map.wi_data[0]` must be 1, and they are ignored in the WI distribution.
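
A minimal sketch of a 1d block load under this rule, assuming 32-bit elements and the PVC subgroup size of 16 (the value names are illustrative):
```mlir
#sg_map_1d = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
%vector_1d = xegpu.load_nd %tdesc_1d:
   tensor_desc<16xfp32, #sg_map_1d> into vector<1xfp32>
```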

`xegpu.sg_map` is also used to describe the WI data distribution for regular load. The example below shows that each WI loads one fp32 data element. The result vector <16xfp32> is loaded and distributed to each WI as <1xfp32>.
`xegpu.sg_map` is also used to describe the WI data distribution for regular loads. The example below shows that each work item (WI) loads one fp32 data element. The resulting vector <16xfp32> is loaded and distributed to each WI as <1xfp32>.
```mlir
#sg_map_t = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
#scatter_attr = !xegpu.tdesc_attr< memory_space=slm, scattered=true>
@@ -400,9 +407,9 @@ For load_nd with `transpose` attribute, wi_layout is transposed to match with th
   tensor_desc<16xfp32, #scatter_attr, #sg_map_t>, vector<1xi1> -> vector<1xfp32>
```

The example below shows that each WI loads 4 fp32 data elements with the chunk_size. This load with chunk_size effectively loads a 2D tensor and transposes it. The data fragment <1x4xf32> is loaded and transposed as <4x1xf32>.
The example below illustrates how each work item loads 4 fp32 data elements with the chunk_size. This loading process, combined with the chunk_size, effectively loads a 2D tensor and performs a transpose, resulting in the transposition of the wi_layout.
```mlir
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 4]>
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>
#scatter_attr = !xegpu.tdesc_attr< memory_space=slm, scattered=true>
%scatter_tdesc_chunk = xegpu.create_tdesc, %src_addr, %offsets
{chunk_size=4} :
@@ -412,6 +419,18 @@ Below example shows that each WI loads 4 fp32 data element with the chunk_size.
   tensor_desc<16x4xfp32, #scatter_attr, #sg_map_t>, vector<1xi1> -> vector<4x1xfp32>
```

The load with chunk_size packs the low-precision data into 32-bit data using wi_data = [1, 2].
```mlir
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 2]>
#scatter_attr = !xegpu.tdesc_attr< memory_space=slm, scattered=true>
%scatter_tdesc_chunk = xegpu.create_tdesc, %src_addr, %offsets
{chunk_size=4} :
uint64, vector<16xindex> into tensor_desc<16x8xbf16, #scatter_attr, #sg_map_t>
%result = xegpu.load_gather %scatter_tdesc_chunk, %mask {L1 = cached, L2 = uncached, transpose=[1,0]} :
   tensor_desc<16x8xbf16, #scatter_attr, #sg_map_t>, vector<1xi1> -> vector<4x2xbf16>
```

Users must use legal sg_map values for the WI data distribution for certain operations on PVC and ARC. These include load_nd/store_nd, load/store with chunk_size, and DPAS.

## Rules of sg_map setting for load and store on PVC and ARC
@@ -425,11 +444,11 @@ For matrix A load
#sg_map_a_ui8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]> // WI data distribute from [8, 32] to [8, 2]
#sg_map_a_si8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]> // WI data distribute from [8, 32] to [8, 2]
For matrix B load
#sg_map_b_bf16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]> // WI data distribute from [16, 16] to [16, 1], packed as [8, 1, 2]
#sg_map_b_f16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]> // WI data distribute from [16, 16] to [16, 1], packed as [8, 1, 2]
#sg_map_b_bf16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]> // WI data distribute from [16, 16] to [8, 2]
#sg_map_b_f16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]> // WI data distribute from [16, 16] to [8, 2]
#sg_map_b_tf32 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]> // WI data distribute from [8, 16] to [8, 1]
#sg_map_b_ui8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]> // WI data distribute from [32, 16] to [32, 1], packed as [8, 1, 4]
#sg_map_b_si8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]> // WI data distribute from [32, 16] to [32, 1], packed as [8, 1, 4]
#sg_map_b_ui8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]> // WI data distribute from [32, 16] to [8, 4]
#sg_map_b_si8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]> // WI data distribute from [32, 16] to [8, 4]
For matrix C load
#sg_map_c_f32 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]> // WI data distribute from [8, 16] to [8, 1]
#sg_map_c_si32 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]> // WI data distribute from [8, 16] to [8, 1]
@@ -450,11 +469,11 @@ For matrix A load
#sg_map_a_ui8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]> // WI data distribute from [8, 32] to [8, 4]
#sg_map_a_si8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]> // WI data distribute from [8, 32] to [8, 4]
For matrix B load
#sg_map_b_bf16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [2, 1]> // WI data distribute from [16, 8] to [16, 1], packed as [8, 1, 2]
#sg_map_b_f16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [2, 1]> // WI data distribute from [16, 8] to [16, 1], packed as [8, 1, 2]
#sg_map_b_bf16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [2, 1]> // WI data distribute from [16, 8] to [8, 2]
#sg_map_b_f16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [2, 1]> // WI data distribute from [16, 8] to [8, 2]
#sg_map_b_tf32 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]> // WI data distribute from [8, 8] to [8, 1]
#sg_map_b_ui8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [4, 1]> // WI data distribute from [32, 8] to [32, 1], packed as [8, 1, 4]
#sg_map_b_si8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [4, 1]> // WI data distribute from [32, 8] to [32, 1], packed as [8, 1, 4]
#sg_map_b_ui8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [4, 1]> // WI data distribute from [32, 8] to [8, 4]
#sg_map_b_si8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [4, 1]> // WI data distribute from [32, 8] to [8, 4]
For matrix C load
#sg_map_c_f32 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]> // WI data distribute from [8, 8] to [8, 1]
#sg_map_c_si32 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]> // WI data distribute from [8, 8] to [8, 1]
@@ -471,48 +490,39 @@ user must use for the WI data distribution of 1d block load and regular load wit
```mlir
For 1d block load
# assert (wi_layout[0] x wi_layout[1] == subgroup_size) // PVC subgroup_size = 16
#sg_map = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
#sg_map = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]> // for 32-bit data element
For regular load with chunk_size // PVC subgroup_size = 16
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, chunk_size]>
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]> // for 32-bit data element
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 2]> // for 16-bit data element like bf16, f16
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 4]> // for 8-bit data element like uint8, sint8
For 1d block load
# assert (wi_layout[0] x wi_layout[1] == subgroup_size) // ARC subgroup_size = 8
#sg_map = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]>
#sg_map = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]> // for 32-bit data element
For regular load with chunk_size // ARC subgroup_size = 8
#sg_map_t = xegpu.sg_map<wi_layout = [8, 1], wi_data = [1, chunk_size]>
#sg_map_t = xegpu.sg_map<wi_layout = [8, 1], wi_data = [1, 1]> // for 32-bit data element
#sg_map_t = xegpu.sg_map<wi_layout = [8, 1], wi_data = [1, 2]> // for 16-bit data element like bf16, f16
#sg_map_t = xegpu.sg_map<wi_layout = [8, 1], wi_data = [1, 4]> // for 8-bit data element like uint8, sint8
```

## Rules of sg_map setting for DPAS on PVC and ARC
The sg_map setting rule for DPAS is applied to the input and output vector operands. The sg_map setting rules of 2d block load for matrices A, B, and C/D are reused. For matrix B, since the data being loaded from memory is VNNI transformed, the wi_data needs to change accordingly so that it is consistent with the vector for operand B. It should use the following.

For PVC subgroup_size = 16
#sg_map_b_reg_bf16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]> // WI data distribute from [8, 16, 2] to [8, 1, 2]
#sg_map_b_reg_f16 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]> // WI data distribute from [8, 16, 2] to [8, 1, 2]
#sg_map_b_reg_ui8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 4]> // WI data distribute from [8, 16, 4] to [8, 1, 4]
#sg_map_b_reg_si8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 4]> // WI data distribute from [8, 16, 4] to [8, 1, 4]

For ARC subgroup_size = 8
#sg_map_b_bf16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 2]> // WI data distribute from [8, 8, 2] to [8, 1, 2]
#sg_map_b_f16 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 2]> // WI data distribute from [8, 8, 2] to [8, 1, 2]
#sg_map_b_ui8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]> // WI data distribute from [8, 8, 4] to [8, 1, 4]
#sg_map_b_si8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]> // WI data distribute from [8, 8, 4] to [8, 1, 4]
The DPAS operation can be associated with sg_map temporarily within the sg_map propagation pass, as DPAS requires each work item to hold the correct data fragments for both the input and output tensors. The sg_map does not describe the in-register data layout; rather, it describes the data fragments based on the plain layout of the input and output tensors. Therefore, DPAS reuses the same sg_map settings as the 2D block loads of matrices A, B, and C/D. Below are a few examples of DPAS's sg_map.

```mlir
PVC BF16 example
#sg_map_a = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
#sg_map_c = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
#sg_map_b_reg = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]>
#sg_map_b = xegpu.sg_map<wi_layout = [1, 16], wi_data = [2, 1]>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a #sg_map_b_reg #sg_map_c} :vector<8x1xbf16>, vector<8x1x2xbf16> into vector<8x1xfloat>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a #sg_map_b_reg #sg_map_c} :vector<8x1xbf16>, vector<8x2xbf16> into vector<8x1xfloat>
ARC int8 example
#sg_map_a_ui8 = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]>
#sg_map_c = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 1]>
#sg_map_b_ui8_reg = xegpu.sg_map<wi_layout = [1, 8], wi_data = [1, 4]>
#sg_map_b_ui8_reg = xegpu.sg_map<wi_layout = [1, 8], wi_data = [4, 1]>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a_ui #sg_map_b_ui8_reg #sg_map_c} :vector<8x4xui8>, vector<8x1x4xui8> into vector<8x1xfloat>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a_ui #sg_map_b_ui8_reg #sg_map_c} :vector<8x4xui8>, vector<8x4xui8> into vector<8x1xfloat>
```
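
A similar PVC int8 sketch, assuming the PVC int8 load mappings above (A distributed to <8x2xui8>, B to <8x4xui8>) and an si32 accumulator per #sg_map_c_si32; this is an illustrative combination, not an additional rule:
```mlir
PVC int8 example
  #sg_map_a_i8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 2]>
  #sg_map_b_i8 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [4, 1]>
  #sg_map_c_i32 = xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
  %vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a_i8 #sg_map_b_i8 #sg_map_c_i32} :vector<8x2xui8>, vector<8x4xui8> into vector<8x1xsi32>
```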

## sg_map use case - 2d load
@@ -538,7 +548,7 @@ An example on how to load a 2d block, perform dpas, and store back to memory.
%vector_b = xegpu.load_nd %tdesc1:
tensor_desc<16x16xbf16, #sg_map_b> into vector<8x2xbf16>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a #sg_map_b_reg #sg_map_c} :vector<8x1xbf16>, vector<8x1x2xbf16> into vector<8x1xfloat>
%vector_c = xegpu.dpas %vector_a, %vector_b {#sg_map_a #sg_map_b_reg #sg_map_c} :vector<8x1xbf16>, vector<8x2xbf16> into vector<8x1xfloat>
xegpu.store_nd %vector_c, %tdesc2:
vector<8x1xfloat>, tensor_desc<8x16xfloat, #sg_map_c>
@@ -553,7 +563,7 @@ An example on how to perform transpose using load_gather with chunk_size in SIMT

```mlir
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 4]>
#sg_map_t = xegpu.sg_map<wi_layout = [16, 1], wi_data = [1, 1]>
#scatter_attr = !xegpu.tdesc_attr< memory_space=slm, scattered=true>
%scatter_tdesc_chunk = xegpu.create_tdesc, %src_addr, %offsets
{chunk_size=4} :
