
Broadcast support for elementwise ops #148

Merged 6 commits into mlir on Jul 25, 2024
Conversation

slyalin (Owner) commented Jul 24, 2024

TODO:

  • Take dynamic dims as values on demand instead of pre-initializing all dynamic dims. (Will be covered in a follow-up PR; it should have no negative impact while we are lowering to linalg, which always requires all dynamic dimensions to construct the outputs, and dead code elimination in the MLIR pipeline should remove any unused ones.)
  • Restrict element types for the new binary eltwise convertor.
  • Fix incorrect shape collapsing in case of broadcast with unaligned ranks.
  • Reuse the dynamic-dimensions map in the MatMul conversion.

Now the entire graph (except tensor constants) from #146 (comment) is converted as a single MLIR function:

module @fragment_name {
  func.func @entry(%arg0: memref<?x?xf32>, %arg1: memref<1x1xf32>, %arg2: memref<128x1024xf32>, %arg3: memref<1x128xf32>, %arg4: memref<?x128xf32>) {
    %0 = bufferization.to_tensor %arg0 restrict : memref<?x?xf32>
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %0, %c0 : tensor<?x?xf32>
    %c1 = arith.constant 1 : index
    %dim_0 = tensor.dim %0, %c1 : tensor<?x?xf32>
    %1 = bufferization.to_tensor %arg1 restrict : memref<1x1xf32>
    %2 = bufferization.to_tensor %arg2 restrict : memref<128x1024xf32>
    %3 = bufferization.to_tensor %arg3 restrict : memref<1x128xf32>
    %4 = tensor.empty(%dim, %dim_0) : tensor<?x?xf32>
    %5 = linalg.mul ins(%0, %0 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%4 : tensor<?x?xf32>) -> tensor<?x?xf32>
    %collapsed = tensor.collapse_shape %1 [] : tensor<1x1xf32> into tensor<f32>
    %6 = tensor.empty(%dim, %dim_0) : tensor<?x?xf32>
    %broadcasted = linalg.broadcast ins(%collapsed : tensor<f32>) outs(%6 : tensor<?x?xf32>) dimensions = [0, 1] 
    %7 = tensor.empty(%dim, %dim_0) : tensor<?x?xf32>
    %8 = linalg.add ins(%5, %broadcasted : tensor<?x?xf32>, tensor<?x?xf32>) outs(%7 : tensor<?x?xf32>) -> tensor<?x?xf32>
    %9 = tensor.empty(%dim, %dim_0) : tensor<?x?xf32>
    %10 = linalg.sub ins(%0, %8 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%9 : tensor<?x?xf32>) -> tensor<?x?xf32>
    %11 = tensor.empty(%dim, %dim_0) : tensor<?x?xf32>
    %12 = linalg.add ins(%0, %0 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%11 : tensor<?x?xf32>) -> tensor<?x?xf32>
    %13 = tensor.empty(%dim, %dim_0) : tensor<?x?xf32>
    %14 = linalg.mul ins(%12, %10 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%13 : tensor<?x?xf32>) -> tensor<?x?xf32>
    %15 = tensor.empty(%dim, %dim_0) : tensor<?x?xf32>
    %16 = linalg.div ins(%14, %0 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%15 : tensor<?x?xf32>) -> tensor<?x?xf32>
    %17 = tensor.empty(%dim) : tensor<?x128xf32>
    %cst = arith.constant 0.000000e+00 : f32
    %18 = linalg.fill ins(%cst : f32) outs(%17 : tensor<?x128xf32>) -> tensor<?x128xf32>
    %19 = linalg.matmul_transpose_b ins(%16, %2 : tensor<?x?xf32>, tensor<128x1024xf32>) outs(%18 : tensor<?x128xf32>) -> tensor<?x128xf32>
    %collapsed_1 = tensor.collapse_shape %3 [[0, 1]] : tensor<1x128xf32> into tensor<128xf32>
    %20 = tensor.empty(%dim) : tensor<?x128xf32>
    %broadcasted_2 = linalg.broadcast ins(%collapsed_1 : tensor<128xf32>) outs(%20 : tensor<?x128xf32>) dimensions = [0] 
    %21 = tensor.empty(%dim) : tensor<?x128xf32>
    %22 = linalg.add ins(%19, %broadcasted_2 : tensor<?x128xf32>, tensor<?x128xf32>) outs(%21 : tensor<?x128xf32>) -> tensor<?x128xf32>
    bufferization.materialize_in_destination %22 in restrict writable %arg4 : (tensor<?x128xf32>, memref<?x128xf32>) -> ()
    return
  }
}
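The collapse/broadcast pairs in the listing above follow a simple plan: right-align the input shape against the output rank, collapse away every size-1 (or missing) dim, then let linalg.broadcast re-introduce exactly those dims. A minimal Python sketch of that planning step (the helper name and the NumPy-style rank alignment are my assumptions for illustration, not code from this PR):

```python
def broadcast_lowering(in_shape, out_rank):
    """Plan a collapse+broadcast lowering for NumPy-style broadcasting.

    Right-align in_shape against an output of rank out_rank, then:
      - keep dims where the input contributes its own size,
      - mark dims where the input is missing or has size 1 for broadcast.
    Returns (kept_shape, broadcast_dims): the shape left after collapsing
    all size-1/missing dims, and the `dimensions` list linalg.broadcast
    would fill back in.
    """
    pad = out_rank - len(in_shape)
    aligned = [1] * pad + list(in_shape)  # implicit leading 1s for lower rank
    kept, bcast = [], []
    for i, d in enumerate(aligned):
        if d == 1:
            bcast.append(i)  # dropped by collapse_shape, re-added by broadcast
        else:
            kept.append(d)
    return kept, bcast
```

This reproduces both pairs in the listing: a 1x1 operand collapses to a scalar and broadcasts over dimensions [0, 1], while a 1x128 operand collapses to 128 and broadcasts over dimension [0].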


auto squeezed = builder.create<tensor::CollapseShapeOp>(loc, inputs[i], squeeze_map);
slyalin (Owner, Author)

Discussed this way of "preparation for broadcast" with @adam-smnk.
I think this approach is inconvenient: the whole reason for using named ops is simplicity. If widely used broadcast semantics requires tricks like this, the dialect is not simple to use and is too low-level to serve as an entry-point dialect in the OV case.

@adam-smnk, @rengolin, please propose a better way of doing this.

Collaborator

Digging a bit deeper, this is the intended approach, in line with the original op design:
https://discourse.llvm.org/t/rfc-primitive-ops-add-broadcastop-to-linalg/66313

Size-1 expansion is deliberately forbidden.
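The forbidden expansion is why the collapse step is needed at all. A sketch of the rule described in that RFC (my own modeling, not the actual MLIR verifier): the input shape must equal the output shape with the broadcast `dimensions` deleted, so a size-1 dim can never be stretched in place.

```python
def valid_linalg_broadcast(in_shape, out_shape, dimensions):
    """Model linalg.broadcast's shape rule (a sketch, not the real verifier).

    The input must match the output with the broadcast `dimensions`
    removed; implicit size-1 expansion is rejected.
    """
    drop = set(dimensions)
    reduced = [d for i, d in enumerate(out_shape) if i not in drop]
    return list(in_shape) == reduced
```

Under this rule, broadcasting a 128-vector into ?x128 with dimensions = [0] is legal, but feeding a 1x128 tensor directly and hoping the leading 1 stretches is not, which is exactly what forces the preceding tensor.collapse_shape.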

Collaborator

You can use generics for now, but we're discussing adding broadcast and transpose to named ops:
https://discourse.llvm.org/t/rfc-transpose-attribute-for-linalg-matmul-operations/80092/
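For reference, a linalg.generic can express the same broadcast without any collapse by giving the broadcast operand an indexing map that pins its size-1 dims to constant 0 (e.g. affine_map<(i, j) -> (0, j)>). A pure-Python model of that indexing behavior, restricted to 2-D nested lists for brevity (illustrative only, not this PR's code):

```python
def generic_add_2d(a, b):
    """Model a linalg.generic elementwise add where b's indexing map sends
    each size-1 dim to constant 0, so b is read in place without any
    collapse_shape/broadcast ops.
    """
    rows, cols = len(a), len(a[0])
    return [
        [
            a[i][j] + b[i if len(b) > 1 else 0][j if len(b[0]) > 1 else 0]
            for j in range(cols)
        ]
        for i in range(rows)
    ]
```

A 1x1 or 1xN operand is consumed directly, which is the simplicity argument for generics in this thread, at the cost of writing the indexing maps by hand in C++.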

slyalin (Owner, Author)

I don't have experience using linalg generics in C++, so I cannot quickly produce this code. I will merge it as-is with collapse/broadcast.

@slyalin slyalin marked this pull request as ready for review July 25, 2024 15:53
@slyalin slyalin merged commit f0f76b9 into mlir Jul 25, 2024
14 of 30 checks passed
3 participants