[Snippets] Add debug caps for dumping snippets parameters #28378

Status: Open. Wants to merge 3 commits into base `master`.
20 changes: 20 additions & 0 deletions src/common/snippets/docs/debug_capabilities/parameters_dump.md
@@ -0,0 +1,20 @@
# Snippet parameters dump

The pass dumps selected properties of performance-critical operations inside Subgraphs. Only MatMul operations are currently supported.

To enable the parameters dump, use the following environment variable:
```sh
OV_SNIPPETS_DUMP_BRGEMM_PARAMS="path=<path_to_csv_dump_file>" binary ...
```

Example:
```sh
OV_SNIPPETS_DUMP_BRGEMM_PARAMS="path=brgemm.csv" binary ...
```

Output example:

> **Reviewer:** Please add a line break, so the output will be displayed as a table.
> **Author:** Done

| subgraph_name | name | in_type | out_type | in_shapes | out_shapes | in_layouts | out_layouts | M | N | K | m_block | n_block | k_block | acc_max_time | avg_max_time |
|--------------------|------------|-------------|----------|-------------------------------------|----------------------|--------------------------|-------------|-----|-----|-----|---------|----------|----------|---------------|---------------|
| FakeQuantize_457 | MatMul_438 | i8;i8;f32 | i32 | 1 16 128 64;1 16 64 128;1 16 64 128 | 1 16 128 128 | 0 2 1 3;0 1 2 3;0 1 2 3; | 0 1 2 3; | 128 | 128 | 64 | 32 | FULL_DIM | FULL_DIM | 41482 | 5185 |
| FakeQuantize_457 | MatMul_452 | u8;i8 | i32 | 1 16 128 128;1 16 128 64 | 1 16 128 64 | 0 1 2 3;0 1 2 3; | 0 1 2 3; | 128 | 64 | 128 | 32 | FULL_DIM | FULL_DIM | 39427 | 4928 |
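Since the dump is plain CSV, it can be post-processed with any tooling. As a hypothetical example (not part of this PR; `slowest_matmul` and `split_csv_row` are illustration-only helpers), a small C++ routine that picks the MatMul with the largest `avg_max_time` from such a file:

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split one dump row into fields: ',' separates columns, ';' is used inside a column.
std::vector<std::string> split_csv_row(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ','))
        fields.push_back(field);
    return fields;
}

// Return the "name" column of the row with the largest avg_max_time (the last column).
std::string slowest_matmul(std::istream& csv) {
    std::string line;
    std::getline(csv, line);  // skip the header row
    std::string best_name;
    uint64_t best_avg = 0;
    while (std::getline(csv, line)) {
        const auto fields = split_csv_row(line);
        if (fields.size() < 2)
            continue;
        const auto avg = std::stoull(fields.back());
        if (avg >= best_avg) {
            best_avg = avg;
            best_name = fields[1];  // column 1 holds the MatMul friendly name
        }
    }
    return best_name;
}
```

For the table above, this would report `MatMul_438` (avg 5185 ns vs. 4928 ns).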
@@ -0,0 +1,170 @@
// Copyright (C) 2025 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#ifdef SNIPPETS_DEBUG_CAPS

#pragma once

#include "snippets/itt.hpp"
#include "snippets/lowered/loop_manager.hpp"
#include "snippets/lowered/specific_loop_iter_handlers.hpp"
#include "snippets/lowered/pass/iter_handler.hpp"
#include "snippets/op/brgemm.hpp"
#include "snippets/utils/utils.hpp"

namespace ov {
namespace snippets {
namespace lowered {
namespace pass {

/**
* @interface BrgemmDebugParams
* @brief Brgemm parameters dump pass
* @ingroup snippets
*/
template <typename BRGEMM_TYPE,
typename std::enable_if<std::is_base_of<ov::snippets::op::Brgemm, BRGEMM_TYPE>::value, bool>::type = true>
class BrgemmDebugParams : public snippets::lowered::pass::RangedPass {
> **Reviewer (on lines +26 to +28):** Do we really need to implement this pass as a template? What is the reasoning behind this? I'm sure that in the absolute majority of use cases we'll be interested in BrgemmCPU. We can also map onto snippets::op::Brgemm if we want to support all Brgemm types.

public:
BrgemmDebugParams(const std::string& subgraph_name) : m_subgraph_name(subgraph_name) {}
> **Reviewer:** Let's try to come up with more descriptive naming. Since this pass inserts some expressions into LIR, it should probably have "insert" in its name (like insert_buffers or insert_load_store). Moreover, there is already a very similar pass InsertPerfCount, so we should name this one in a similar way: something like InsertPerfCountVerbose or InsertPerfCountCsvDump.

OPENVINO_RTTI("BrgemmDebugParams", "", RangedPass);

bool run(snippets::lowered::LinearIR& linear_ir,
snippets::lowered::LinearIR::constExprIt begin,
snippets::lowered::LinearIR::constExprIt end) override final { // NOLINT
> **Reviewer:** Are you sure that we'll never want to override this method?

OV_ITT_SCOPED_TASK(ov::pass::itt::domains::SnippetsTransform, "Snippets::BrgemmDebugParams")
if (linear_ir.get_config().debug_config.dumpParams.csv_path.empty()) {
return false;
}
static size_t seq_number = 0;
bool modified = false;
auto csv_path = linear_ir.get_config().debug_config.dumpParams.csv_path;
for (auto expr_it = begin; expr_it != end; expr_it++) {
const auto& brgemm_expr = *expr_it;
const auto brgemm = ov::as_type_ptr<BRGEMM_TYPE>(brgemm_expr->get_node());
if (!brgemm)
continue;
// Collect brgemm parameters
auto params = collect_params(brgemm_expr, linear_ir);
const auto& perf_count_begin = std::make_shared<snippets::op::PerfCountBegin>();
perf_count_begin->set_friendly_name(std::string("PerfCount_Begin_") + std::to_string(seq_number) +
"_DebugParams");

> **Reviewer:** It's not a very good practice to use the node name to communicate information ("_DebugParams"). If we need a PerfCountBegin with some special properties/functionality, then we should probably create a derived class for this.
> **UPD:** But even if we don't want to create a dedicated class for some reason, I still don't see why we should rename the node, since we already "marked" it by adding meta information to rt_info.
const auto empty_inputs = std::vector<PortConnectorPtr>{};
linear_ir.insert_node(perf_count_begin, empty_inputs, expr_it->get()->get_loop_ids(), false, expr_it);

const auto& perf_count_end = std::make_shared<snippets::op::PerfCountEnd>(perf_count_begin->output(0));
perf_count_end->set_friendly_name(std::string("PerfCount_End_") + std::to_string(seq_number) +
"_DebugParams");
// Attach brgemm parameters to PerfCountEnd node
perf_count_end->get_rt_info()["brgemm_params"] = params;
perf_count_end->get_rt_info()["brgemm_params_csv_path"] = csv_path;
> **Reviewer:** I have two concerns with respect to csv_path:
> 1. All of the perf count nodes should have the same csv_path. This assumption is used implicitly in the node implementation (dump once, when the last node is destroyed), but never enforced. For example, I could update this pass so that a different csv_path is set in rt_info for every counter (or a group of counters), but this won't work as expected.
> 2. It's excessive to store csv_path in every rt_info when we actually need one (and only one). It looks like we indeed need a dedicated class with a proper setter for this static variable.

linear_ir.insert_node(perf_count_end, empty_inputs, expr_it->get()->get_loop_ids(), false, next(expr_it));
seq_number++;
modified = true;
}
return modified;
}

private:
std::string collect_params(const ov::snippets::lowered::ExpressionPtr& brgemm_expr,
const snippets::lowered::LinearIR& linear_ir) {
const auto brgemm = ov::as_type_ptr<BRGEMM_TYPE>(brgemm_expr->get_node());
OPENVINO_ASSERT(brgemm, "Brgemm is nullptr!");
std::stringstream ss;
ss << m_subgraph_name << ',';
ss << brgemm_expr->get_node()->get_friendly_name() << ',';
for (size_t i = 0; i < brgemm->get_input_size(); ++i) {
ss << brgemm->get_input_element_type(i);
if (i != brgemm->get_input_size() - 1) {
ss << ';';
}
}
ss << ',';
for (size_t i = 0; i < brgemm->get_output_size(); ++i) {
ss << brgemm->get_output_element_type(i);
if (i != brgemm->get_output_size() - 1) {
ss << ';';
}
}
ss << ',';
for (size_t i = 0; i < brgemm->inputs().size(); ++i) {
const auto& port_desc = brgemm_expr->get_input_port_descriptor(i);
const auto& shape = ov::snippets::utils::get_planar_vdims(port_desc->get_shape(), port_desc->get_layout());
ss << utils::tensor2str(shape, " ");
ss << ';';
}
ss.seekp(-1, ss.cur);
ss << ',';
for (size_t i = 0; i < brgemm->outputs().size(); ++i) {
const auto& port_desc = brgemm_expr->get_output_port_descriptor(i);
const auto& shape =
ov::snippets::utils::get_preordered_vdims(port_desc->get_shape(), port_desc->get_layout());
ss << utils::tensor2str(shape, " ");
ss << ';';
}
ss.seekp(-1, ss.cur);
ss << ',';
for (size_t i = 0; i < brgemm->inputs().size(); ++i) {
const auto& port_desc = brgemm_expr->get_input_port_descriptor(i);
ss << utils::tensor2str(port_desc->get_layout(), " ");
ss << ';';
}
ss << ',';
for (size_t i = 0; i < brgemm->outputs().size(); ++i) {
const auto& port_desc = brgemm_expr->get_output_port_descriptor(i);
ss << utils::tensor2str(port_desc->get_layout(), " ");
ss << ';';
}
ss << ',';

const auto& in_0_desc = brgemm_expr->get_input_port_descriptor(0);
const auto& in_1_desc = brgemm_expr->get_input_port_descriptor(1);
const auto& out_desc = brgemm_expr->get_output_port_descriptor(0);

const auto& in_0_planar_dims =
ov::snippets::utils::get_planar_vdims(in_0_desc->get_shape(), in_0_desc->get_layout());
const auto& in_1_planar_dims =
ov::snippets::utils::get_planar_vdims(in_1_desc->get_shape(), in_1_desc->get_layout());
const auto& out_preordered_dims =
ov::snippets::utils::get_preordered_vdims(out_desc->get_shape(), out_desc->get_layout());

const auto& m = *++out_preordered_dims.rbegin();
const auto& n = *out_preordered_dims.rbegin();
const auto& k0 = *in_0_planar_dims.rbegin();
const auto& k1 = *++in_1_planar_dims.rbegin();
size_t k = 0;
OPENVINO_ASSERT(utils::merge_dynamic_dim(k, k0, k1),
"Brgemm input descriptors have incompatible K dimension value.");
ss << static_cast<int64_t>(m) << ',' << static_cast<int64_t>(n) << ',' << static_cast<int64_t>(k) << ',';

size_t m_block = in_0_desc->get_subtensor().front();
size_t n_block = in_1_desc->get_subtensor().back();
size_t k_block = out_desc->get_subtensor().back();

auto append_block_info = [&](size_t block) {
if (block == utils::get_full_dim_value()) {
ss << "FULL_DIM";
} else if (block == utils::get_dynamic_value<size_t>()) {
ss << "?";
} else {
ss << block;
}
ss << ',';
};

append_block_info(m_block);
append_block_info(n_block);
append_block_info(k_block);
return ss.str();
}
> **Reviewer (on lines +72 to +160):** I don't see anything specific to a particular BRGEMM_TYPE here, so it's another point to consider mapping onto snippets::op::Brgemm.


std::string m_subgraph_name;
};

} // namespace pass
} // namespace lowered
} // namespace snippets
} // namespace ov

#endif // SNIPPETS_DEBUG_CAPS
13 changes: 9 additions & 4 deletions src/common/snippets/include/snippets/op/perf_count.hpp
@@ -74,20 +74,25 @@ class PerfCountEnd : public PerfCountEndBase {
public:
OPENVINO_OP("PerfCountEnd", "SnippetsOpset", PerfCountEndBase);
PerfCountEnd(const Output<Node>& pc_begin);
PerfCountEnd() = default;
~PerfCountEnd() {
output_perf_count();
}
PerfCountEnd();
~PerfCountEnd();

void output_perf_count();
std::shared_ptr<Node> clone_with_new_inputs(const OutputVector& inputs) const override;

void init_pc_begin();
void set_accumulated_time();

void dump_brgemm_params_to_csv();

private:
ov::threading::ThreadLocal<uint64_t> accumulation;
ov::threading::ThreadLocal<uint32_t> iteration;
std::shared_ptr<PerfCountBegin> m_pc_begin = nullptr;

static std::string brgemm_csv_path;
static std::map<std::string, std::string> m_debug_params_map;
static size_t nodes_count;
};

} // namespace op
@@ -60,6 +60,15 @@ class DebugCapsConfig {
}
} dumpLIR;

struct : PropertyGroup {
std::string csv_path;
std::vector<PropertySetterPtr> getPropertySetters() override {
return {
PropertySetterPtr(new StringPropertySetter("path", csv_path, "path to dumped brgemm params")),
};
}
} dumpParams;

// Snippets performance count mode
// Disabled - default, w/o perf count for snippets
// Chrono - perf count with chrono call. This is a universal method, and support multi-thread case to output perf
9 changes: 9 additions & 0 deletions src/common/snippets/include/snippets/utils/utils.hpp
@@ -324,6 +324,15 @@ void visit_path(const lowered::ExpressionPtr& expr,
std::function<void(lowered::ExpressionPtr)> func,
bool visit_parent_path);

/**
* @brief Converts a tensor to a string representation.
* Each value in the tensor is converted to a string. If the value is a full dimension, it is represented as
* "FULL_DIM". If the value is dynamic, it is represented as "?".
* @param tensor The tensor to be converted to a string.
* @return A string representation of the tensor.
*/
std::string tensor2str(const VectorDims& tensor, const std::string& delimiter = ", ");

} // namespace utils
} // namespace snippets
} // namespace ov
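The body of `tensor2str` is not visible in this diff, but judging from the `subtensor2str` lambda it replaces in expression.cpp, it presumably behaves like the sketch below. The sentinel constants are stand-ins for `utils::get_full_dim_value()` and `utils::get_dynamic_value<size_t>()`; their actual values are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

using VectorDims = std::vector<size_t>;

// Assumed sentinels; the real ones come from snippets/utils.
constexpr size_t kFullDim = std::numeric_limits<size_t>::max();
constexpr size_t kDynamic = std::numeric_limits<size_t>::max() - 1;

// Render each dim as a number, "FULL_DIM" for a full dimension, or "?" for a dynamic one.
std::string tensor2str(const VectorDims& tensor, const std::string& delimiter = ", ") {
    std::stringstream ss;
    for (size_t i = 0; i < tensor.size(); ++i) {
        const auto& v = tensor[i];
        const auto v_str = v == kFullDim  ? std::string("FULL_DIM")
                           : v == kDynamic ? std::string("?")
                                           : std::to_string(v);
        ss << v_str << (i + 1 < tensor.size() ? delimiter : "");
    }
    return ss.str();
}
```

With the `" "` delimiter this produces exactly the shape/layout cells seen in the CSV example, e.g. `1 16 128 64`.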
16 changes: 2 additions & 14 deletions src/common/snippets/src/lowered/expression.cpp
@@ -170,18 +170,6 @@ ExpressionPtr Expression::clone() const {
}

bool Expression::visit_attributes(AttributeVisitor &visitor) {
auto subtensor2str = [](const VectorDims& subtensor) {
std::stringstream ss;
for (size_t i = 0; i < subtensor.size(); ++i) {
const auto& v = subtensor[i];
const auto v_str = utils::is_full_dim_value(v) ? "FULL_DIM" :
utils::is_dynamic_value(v) ? "?" : std::to_string(v);
const auto del = i < subtensor.size() - 1 ? ", " : "";
ss << v_str << del;
}
return ss.str();
};

std::ostringstream in_regs, out_regs;
std::vector<std::pair<std::string, ov::PartialShape>> shapes;
std::vector<std::pair<std::string, std::string>> subtensors;
@@ -194,7 +182,7 @@ bool Expression::visit_attributes(AttributeVisitor &visitor) {

const auto& subtensor = desc->get_subtensor();
if (!subtensor.empty())
subtensors.emplace_back("in_subtensor_" + std::to_string(i), subtensor2str(subtensor));
subtensors.emplace_back("in_subtensor_" + std::to_string(i), utils::tensor2str(subtensor));

const auto& layout = desc->get_layout();
if (!layout.empty() && !utils::is_planar_layout(layout))
Expand All @@ -210,7 +198,7 @@ bool Expression::visit_attributes(AttributeVisitor &visitor) {

const auto& subtensor = desc->get_subtensor();
if (!subtensor.empty())
subtensors.emplace_back("out_subtensor_" + std::to_string(i), subtensor2str(subtensor));
subtensors.emplace_back("out_subtensor_" + std::to_string(i), utils::tensor2str(subtensor));

const auto& layout = desc->get_layout();
if (!layout.empty() && !utils::is_planar_layout(layout))
11 changes: 8 additions & 3 deletions src/common/snippets/src/lowered/pass/validate.cpp
@@ -153,9 +153,14 @@ bool Validate::run(LinearIR& linear_ir, lowered::LinearIR::constExprIt begin, lo
if (found != m_validation_map.cend()) {
(found->second)(expr, linear_ir);
}
OPENVINO_ASSERT(expr->get_output_count() == node->get_output_size() ||
ov::is_type<op::LoopEnd>(node) ||
ov::is_type<ov::op::v0::Result>(node), "Incorrect count of output port descriptors!");
bool bypass_output_size_check =
#ifdef SNIPPETS_DEBUG_CAPS
ov::is_type<snippets::op::PerfCountBegin>(node) || ov::is_type<snippets::op::PerfCountEnd>(node) ||
#endif // SNIPPETS_DEBUG_CAPS
ov::is_type<op::LoopEnd>(node) || ov::is_type<ov::op::v0::Result>(node);

OPENVINO_ASSERT(expr->get_output_count() == node->get_output_size() || bypass_output_size_check,
"Incorrect count of output port descriptors!");
expr->validate();
// Loop expr doesn't have shapes and layouts
if (!ov::is_type<op::LoopBase>(node))
56 changes: 55 additions & 1 deletion src/common/snippets/src/op/perf_count.cpp
@@ -3,6 +3,8 @@
//
#ifdef SNIPPETS_DEBUG_CAPS

#include <fstream>

#include "snippets/op/perf_count.hpp"

namespace ov {
@@ -62,9 +64,30 @@ void PerfCountBegin::set_start_time() {
}

//////////////////PerfCountEnd///////////////
PerfCountEnd::PerfCountEnd(const Output<Node>& pc_begin) : PerfCountEndBase({pc_begin}), accumulation(0ul), iteration(0u) {

size_t PerfCountEnd::nodes_count = 0;
std::map<std::string, std::string> PerfCountEnd::m_debug_params_map;
std::string PerfCountEnd::brgemm_csv_path; // NOLINT

PerfCountEnd::PerfCountEnd() : PerfCountEndBase() {
++nodes_count;
}

PerfCountEnd::PerfCountEnd(const Output<Node>& pc_begin)
: PerfCountEndBase({pc_begin}),
accumulation(0ul),
iteration(0u) {
constructor_validate_and_infer_types();
init_pc_begin();
++nodes_count;
}

PerfCountEnd::~PerfCountEnd() {
output_perf_count();
--nodes_count;
if (nodes_count == 0) {
dump_brgemm_params_to_csv();
}
> **Reviewer (on lines +87 to +90):** It's quite an elegant solution to minimize dumping overheads 👍
> I think we should align the dumping behavior for the default and csv modes. One way to do it is to implement the csv-related functionality in a derived class and use counters like here.
> But if you think about it, what you do here is mimicking shared-pointer functionality in a way. So why not take another step and move the dumping functionality to a separate class altogether? It would be more scalable and convenient, since different types of output could be supported, or different groups of nodes could output to different files (it might be convenient to dump MatMul params and, say, generic counters to separate files). Here is how it might look:
> 1. A developer creates a dumper class first and passes relevant arguments to its constructor (the csv path in our case, or nothing for a std::cout dumper).
> 2. The dumper instance is then passed to every PerfCountEnd node in the constructor (through a shared_ptr).
> 3. PerfCountEnd calls dumper::update(this) in its destructor to add a row to the table (or m_debug_params_map).
> 4. When the last PerfCountEnd is deleted, the dumper destructor is called, where we can conveniently write all accumulated table rows to whatever output we support.
>
> What do you think?
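The shared-ownership dumper described in this suggestion could be sketched roughly as follows. `PerfCountCsvDumper` and `PerfCountEndSketch` are hypothetical names used only for illustration, not types from this PR:

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <map>
#include <memory>
#include <sstream>
#include <string>

// Step 1: the developer constructs the dumper with its output target.
class PerfCountCsvDumper {
public:
    explicit PerfCountCsvDumper(std::string csv_path) : m_csv_path(std::move(csv_path)) {}

    // Step 3: every PerfCountEnd contributes its row on destruction.
    void update(const std::string& name, uint64_t acc_max, uint64_t avg_max) {
        std::ostringstream row;
        row << name << ',' << acc_max << ',' << avg_max;
        m_rows[name] = row.str();
    }

    // Step 4: when the last shared_ptr owner is gone, flush everything at once.
    ~PerfCountCsvDumper() {
        if (m_rows.empty())
            return;
        std::ofstream csv(m_csv_path);
        csv << "name,acc_max_time,avg_max_time\n";
        for (const auto& kv : m_rows)
            csv << kv.second << '\n';
    }

private:
    std::string m_csv_path;
    std::map<std::string, std::string> m_rows;
};

// Step 2: a stand-in for PerfCountEnd that holds the dumper through a shared_ptr,
// so the dumper outlives every node and flushes exactly once.
struct PerfCountEndSketch {
    std::shared_ptr<PerfCountCsvDumper> dumper;
    std::string name;
    uint64_t acc_max = 0;
    uint64_t avg_max = 0;
    ~PerfCountEndSketch() { dumper->update(name, acc_max, avg_max); }
};
```

This keeps the "dump on last destruction" behavior without static state: the csv path lives in one object, and a std::cout dumper (or a second file for generic counters) would just be another dumper instance.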

}

std::shared_ptr<Node> PerfCountEnd::clone_with_new_inputs(const OutputVector& inputs) const {
@@ -109,6 +132,37 @@ void PerfCountEnd::output_perf_count() {
std::cout << "max accumulated time:" << acc_max << "ns" << std::endl;
// max avg
std::cout << "max avg time:" << avg_max << "ns" << std::endl;

// Dump brgemm debug parameters to csv file
if (acc_max != 0 && avg_max != 0 && get_friendly_name().find("_DebugParams") != std::string::npos) {
> **Reviewer:** What is the reason for these conditions on acc_max and avg_max? Why can't they be 0? Also, "_DebugParams" should not be used like this, as discussed in the transformation file.

const auto& rt_info = get_rt_info();
auto brgemm_params_it = rt_info.find("brgemm_params");
if (brgemm_params_it == rt_info.end()) {
return;
}
if (brgemm_csv_path.empty()) {
auto brgemm_csv_path_it = rt_info.find("brgemm_params_csv_path");
if (brgemm_csv_path_it != rt_info.end()) {
brgemm_csv_path = brgemm_csv_path_it->second.as<std::string>();
}
}
m_debug_params_map[get_friendly_name()] =
brgemm_params_it->second.as<std::string>() + std::to_string(acc_max) + ',' + std::to_string(avg_max);
}
}

void PerfCountEnd::dump_brgemm_params_to_csv() {
if (m_debug_params_map.empty() || brgemm_csv_path.empty()) {
return;
}
std::ofstream csv_file(brgemm_csv_path);
OPENVINO_ASSERT(csv_file.is_open(), "Failed to open csv file for brgemm debug parameters.");
csv_file << "subgraph_name,name,in_type,out_type,in_shapes,out_shapes,in_layouts,out_layouts,M,N,K,m_block,"
"n_block,k_block,acc_max_time,avg_max_time\n";
for (const auto& [_, params] : m_debug_params_map) {
csv_file << params << '\n';
}
csv_file.close();
}

} // namespace op
3 changes: 3 additions & 0 deletions src/common/snippets/src/utils/debug_caps_config.cpp
@@ -22,6 +22,9 @@ void DebugCapsConfig::readProperties() {
dumpLIR.parseAndSet(envVarValue);
OPENVINO_ASSERT(!dumpLIR.passes.empty(), "Passes option in OV_SNIPPETS_DUMP_LIR must be provided.");
}
if ((envVarValue = readEnv("OV_SNIPPETS_DUMP_BRGEMM_PARAMS"))) {
dumpParams.parseAndSet(envVarValue);
}
}

void DebugCapsConfig::PropertyGroup::parseAndSet(const std::string& str) {