Eliminate temporary buffer in binary op conversion #130

Merged: 3 commits, Jun 28, 2024
36 changes: 31 additions & 5 deletions src/common/transformations/src/transformations/mlir/convert.cpp
@@ -70,7 +70,15 @@ static void prepareMLIRKernelWithoutWrapper(mlir::OwningOpRef<mlir::ModuleOp>& m

#else // Simplified default lowering to LLVM from LLVM tests

// Remove empty tensors to avoid converting them into temporary buffers.
pm.addPass(bufferization::createEmptyTensorEliminationPass());

pm.addPass(bufferization::createOneShotBufferizePass());
// TODO: Add deallocation pass/pipeline to avoid memory leaks.

// Cleanup after bufferization - possibly remove redundant copies.
pm.addNestedPass<func::FuncOp>(createCanonicalizerPass());
pm.addNestedPass<func::FuncOp>(createCSEPass());

// Blanket-convert any remaining high-level vector ops to loops if any remain.
pm.addNestedPass<func::FuncOp>(createConvertVectorToSCFPass());
@@ -553,11 +561,22 @@ template <typename TargetOp>
struct ConvertBinary {
void operator()(ConversionContext& context, std::shared_ptr<ov::Node> node) {
auto loc = createLocation(context.context, node);
auto& builder = context.builder();
// TODO: Support broadcasts
const auto inputs = context.getInputs(node);
auto op = context.builder().create<TargetOp>(loc,
mlir::ValueRange{inputs[0], inputs[1]},
/* FIXME: Use linalg.fill or tensor.empty */ mlir::ValueRange{inputs[0]});
auto outType = cast<mlir::ShapedType>(inputs[0].getType());
// Named binary ops directly overwrite data in the `outs` buffer, so there is no need to provide a
// non-empty destination at the tensor level.
// Use `tensor.empty` to avoid temporary buffer allocation and memcpy after bufferization.
llvm::SmallVector<Value> dynamicSizes;
for (auto [idx, dim] : llvm::enumerate(outType.getShape())) {
if (!mlir::ShapedType::isDynamic(dim))
continue;
auto dimSize = builder.create<tensor::DimOp>(loc, inputs[0], idx);
dynamicSizes.push_back(dimSize);
}
auto empty = builder.create<tensor::EmptyOp>(loc, outType, dynamicSizes);
Comment on lines +568 to +578
Owner

But we still need to ensure that the output tensor is properly sized, even though the op itself implies strict shape alignment between inputs and outputs. As we see here, this involves extra handling of dynamic dimensions, yet semantically the iteration space is already fully defined by the shape of the input, which shouldn't require any special decoding of dynamic dimensions (that would be part of normal tensor lowering with dynamic dimensions). I would like to see more compactness from the named ops, at least to avoid writing this dynamic-dimensions code block.

Collaborator Author

You're right that eltwise named ops' semantics are sufficient to automatically derive the sizes.
However, hiding these details involves tradeoffs in terms of implicit vs explicit information. Overall, MLIR design leans more toward the explicit side as it allows for simpler IR verification, analysis, and pattern matching - same motivation for having named ops vs generic, or explicit broadcast semantics through linalg.broadcast.

In this case, you can view tensor.empty as a temporary buffer, so we need to provide all the information required for a potential allocation. Having to infer that dynamically would imply stronger coupling between ops/dialects (tensor.empty would have to understand all potential consumers) and would bring more infrastructure boilerplate. Also, having an implicit output is not possible as it goes against linalg's DPS (destination-passing style) interface.

Overall, we can't (and probably shouldn't) remove this extra dynamic logic from IR.
However, we could provide simpler named ops builders that only take inputs and create correct tensor.empty automatically.
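For illustration, a minimal sketch of what such a builder could look like, using the same MLIR C++ APIs as the diff above; the helper name makeNamedEltwiseOp is hypothetical and not part of this PR:

// Hypothetical helper (illustrative sketch only): builds a destination-style named
// eltwise op and derives the `outs` tensor.empty from the first input's shape,
// including its dynamic dimensions.
template <typename TargetOp>
TargetOp makeNamedEltwiseOp(mlir::OpBuilder& builder, mlir::Location loc, mlir::Value lhs, mlir::Value rhs) {
    auto outType = mlir::cast<mlir::ShapedType>(lhs.getType());
    llvm::SmallVector<mlir::Value> dynamicSizes;
    for (auto [idx, dim] : llvm::enumerate(outType.getShape())) {
        if (!mlir::ShapedType::isDynamic(dim))
            continue;
        // Each dynamic output dimension mirrors the corresponding input dimension.
        dynamicSizes.push_back(builder.create<mlir::tensor::DimOp>(loc, lhs, idx));
    }
    auto empty = builder.create<mlir::tensor::EmptyOp>(loc, outType, dynamicSizes);
    return builder.create<TargetOp>(loc, mlir::ValueRange{lhs, rhs}, mlir::ValueRange{empty});
}

With such a helper, ConvertBinary would shrink to a single call per node, while the explicit tensor.dim/tensor.empty ops would still be materialized in the IR.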

Owner

Overall, MLIR design leans more toward the explicit side as it allows for simpler IR verification, ...

That is well understood for a single operation, but it raises questions when a sub-graph of multiple ops repeats the same thing again and again throughout the program, because we already have tensors that carry the dynamic dimensions.

However, we could provide simpler named ops builders that only take inputs and create correct tensor.empty automatically.

That would be a good thing that we are missing in the current approach, at least for tensors (not memrefs). Otherwise it just becomes boilerplate code in the lowering part instead of being written once as part of the dialect/infrastructure, and similar boilerplate will reappear in every project that uses Linalg on tensors as the next dialect in its lowering path. If each OV operation is lowered individually, it will generate many such sub-expressions that query various dimensions dynamically just to tailor a shape and pass it to the next op, so you will need to optimize these additionally generated expressions even though they deal with the same dynamic dimensions along the whole program.

auto op = builder.create<TargetOp>(loc, mlir::ValueRange{inputs[0], inputs[1]}, mlir::ValueRange{empty});
context.addOutputs(node, op);
}
};
@@ -606,8 +625,15 @@ mlir::OwningOpRef<mlir::ModuleOp> ngraph_to_mlir(MLIRContext* context,
auto tensor = conversion_context.nodeOutputMap.at(outputs[i]);
auto memref = func.getArgument(i + inputs.size());
auto loc = createLocation(context, outputs[i].get_node_shared_ptr());
auto materialize = block_builder.create<bufferization::MaterializeInDestinationOp>(loc, tensor, memref);
materialize.setWritable(true); // TODO: Can I set it as ctor argument above?
// Ensure the result is stored in the provided function argument.
// Mark as restrict to avoid temporary buffer and copy.
// Mark as writable to ensure the output can be written to the buffer.
block_builder.create<bufferization::MaterializeInDestinationOp>(loc,
TypeRange{},
tensor,
memref,
/*restrict=*/true,
/*writable=*/true);
}

const auto retLoc = createLayerLocation(context, "output", "Output");