
Update Kafka Connect metrics and dashboard #9331

Merged
8 commits merged into strimzi:main on Nov 13, 2023

Conversation

katheris
Contributor

@katheris katheris commented Nov 7, 2023

Type of change

  • Enhancement

Description

Update Kafka Connect metrics and dashboard

  • Collect coordinator metrics.
  • Simplify metric regex.
  • Update dashboard to add panels and display more connector information and rebalance metrics.
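For context, these metrics are scraped via the Prometheus JMX exporter, whose rules map JMX MBean patterns to Prometheus metric names. A minimal sketch of the kind of rule involved; the exact pattern and name here are illustrative, not necessarily the ones in this PR:

```yaml
rules:
  # Map Connect worker rebalance MBean attributes through one rule,
  # e.g. kafka.connect:type=connect-worker-rebalance-metrics,
  # attribute "rebalancing" becomes the Prometheus metric
  # kafka_connect_worker_rebalance_rebalancing.
  - pattern: "kafka.connect<type=connect-worker-rebalance-metrics>([a-z-]+)"
    name: "kafka_connect_worker_rebalance_$1"
    type: GAUGE
```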

Dashboard before

KC_Dashboard_Orig

Dashboard after

KC_Dashboard_v2_1
KC_Dashboard_v2_2
KC_Dashboard_v2_3
KC_Dashboard_v2_4

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

* Collect coordinator metrics.
* Simplify metric regex.
* Update dashboard to add panels and display more connector information and rebalance metrics.

Signed-off-by: Katherine Stanley <[email protected]>
Member

@scholzj scholzj left a comment


You need to update the Helm Chart dashboards as well I guess?

Signed-off-by: Katherine Stanley <[email protected]>
@scholzj scholzj added this to the 0.39.0 milestone Nov 7, 2023
Member

@ppatierno ppatierno left a comment


Good stuff! LGTM. Thanks!

@mimaison
Contributor

mimaison commented Nov 8, 2023

The dashboard looks much nicer!

Some minor suggestions:

  • What about slightly increasing the width of both graphs in the first row of "Connectors and Tasks" so they fully fill the row?
  • Could we use the table view for the legend of the "Connection count" and "Connector task state" graphs? I think that makes the values easier to read like you've done for the other graphs.
  • Have you tried tweaking the "Connector task state with issues" panel to print "0" in case there's no data?

@katheris
Contributor Author

These are the updated screenshots from my latest commit:
KC_Dashboard_v3_1
KC_Dashboard_v3_2
KC_Dashboard_v3_3
KC_Dashboard_v3_4

I've been investigating whether I can display the connector task issues in a better way, but apart from that this addresses the other review comments

Contributor

@mimaison mimaison left a comment


Thanks for the updates. The dashboard looks great!

@scholzj
Member

scholzj commented Nov 13, 2023

@katheris So does this do anything about #9317 as well? Or is this change completely unrelated?

@katheris
Contributor Author

@scholzj this is unrelated

@mimaison do you think this panel would be a better visualisation for the tasks with issues? The panel will also display tasks that are returned from the connector-destroyed-task-count and connector-unassigned-task-count metrics
Screenshot 2023-11-13 at 13 59 08

@mimaison
Contributor

Yes this seems fine

Member

@scholzj scholzj left a comment


It looks nice, thanks for the improvements. The chart showing the assignment of connectors and tasks to nodes is nice (although I'm not sure how it would look for those running 50-node clusters, but not sure I care about that anyway).

Some comments:

  • Would it be worth adding a CHANGELOG record? Improved Kafka Connect metrics and dashboard or something like that?
  • What is the expectation from the Rebalancing gauge? When is it expected to show something, given rebalances should IMHO normally be short and the scraping of metrics happens roughly every 30 seconds? My Prometheus does not seem to have any values other than 0 while scaling, changing connectors, etc. So I wonder when this shows something.
  • The Connector Status and Connector Task Status do not seem to have a fixed ordering and seem to change the order every other refresh. Can we sort them by name for example? It does not seem nice, because my brain registers the change in the dashboard even though nothing really changed. (To be fair, that was an issue in the previous dashboard as well, but it would be great to fix this)
  • A lot of the charts you touched now don't seem to have a simple legend but a table with the values. Even with a small cluster, that seems to require scrolling to actually see the legend and the values. Was this intentional? Is this something we really want? I do not have a strong opinion, so if everyone wants it that way, fine. But to me, it seemed better when I saw all the legends at once without scrolling (and the actual values are shown when I hover over the chart) rather than this format.
    • This is especially obvious for the charts with a static number of items: Connection Count, or Connector Task state. There, you see only the first 2 / 3 options. For the other charts, one might argue that for a big cluster the legend would not fit anyway.
  • When I have a connector with all 3 failed tasks, I see that correctly in the chart and in the task status table. But the table with "Connector task state with issues" shows just one record for the connector, with the number 3. Is that expected? Is that a sum of failed tasks? I think it is not obvious from the table. Also, what is the point of the table when it pretty much just shows what the chart and the status table already show?
  • Again - not your fault as it was there already before. But maybe we can use this to align the capitalization of the titles? Right now it seems to use two different styles for the chart titles.
  • How does this impact MM2? What is the expectation with regards to MM2? I guess this is exclusively for regular Connect clusters and MM2 clusters should use the MM2 chart despite running on top of Connect, right?
  • Can you elaborate more on the changes to the exported metrics? There seems to be a significant difference in what is scraped. The old YAML scrapes 182787 bytes of data in my cluster. Your new file scrapes 426466 bytes -> more than twice as much. Yet, the dashboard seems to work fine with the old YAML. Given that the JMX exporter is perceived as too slow and resource hungry in big clusters, is that intentional / desired?
  • Moreover, by using the YAML as you updated it, the dashboard stops working because the templating seems to look for the kafka_connect_worker_connector_count metric. But you expose it as kafka_connect_connect_worker_metrics_connector_count. I expect there will be more naming issues, but due to this one it does not let me pick the cluster.

@katheris
Contributor Author

Hey @scholzj, thanks for the review.

Would it be worth adding a CHANGELOG record? Improved Kafka Connect metrics and dashboard or something like that?

Yes I can add that

What is the expectation from the Rebalancing gauge? When is it expected to show something, given rebalances should IMHO normally be short and the scraping of metrics happens roughly every 30 seconds? My Prometheus does not seem to have any values other than 0 while scaling, changing connectors, etc. So I wonder when this shows something.

It should show 1 when a rebalance is in progress, but that is a good point. Maybe a graph with it as a time series would be more useful? Then people can see the trend of how often things rebalance.
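Since the gauge only reads 1 while a rebalance is actually in progress, a 30-second scrape interval can easily miss short rebalances. One way a time-series panel could still surface them is a window query like the following; the metric name is an assumption based on the exporter mapping, not confirmed by this PR:

```promql
# 1 if any scrape in the trailing 5-minute window caught a
# rebalance in progress, 0 otherwise (metric name assumed)
max_over_time(kafka_connect_worker_rebalance_rebalancing[5m])
```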

The Connector Status and Connector Task Status do not seem to have a fixed ordering and seem to change the order every other refresh. Can we sort them by name for example? It does not seem nice, because my brain registers the change in the dashboard even though nothing really changed. (To be fair, that was an issue in the previous dashboard as well, but it would be great to fix this)

I hadn't spotted that, but yes I can fix that

A lot of the charts you touched now don't seem to have a simple legend but a table with the values. Even with a small cluster, that seems to require scrolling to actually see the legend and the values. Was this intentional? Is this something we really want? I do not have a strong opinion, so if everyone wants it that way, fine. But to me, it seemed better when I saw all the legends at once without scrolling (and the actual values are shown when I hover over the chart) rather than this format.
This is especially obvious for the charts with a static number of items: Connection Count, or Connector Task state. There, you see only the first 2 / 3 options. For the other charts, one might argue that for a big cluster the legend would not fit anyway.

Yes, it was intentional; the idea was to make it easier to see the current levels at a glance. In the first version of this PR I had left the legends as is for Connection count and Connector task state, but I updated it based on Mickael's feedback. I prefer the table legend view for most graphs, but for those two it wasn't obvious to me which version worked better. What do you think @mimaison?

When I have a connector with all 3 failed tasks, I see that correctly in the chart and in the task status table. But the table with "Connector task state with issues" shows just one record for the connector, with the number 3. Is that expected? Is that a sum of failed tasks? I think it is not obvious from the table. Also, what is the point of the table when it pretty much just shows what the chart and the status table already show?

In the PR as it is now I haven't edited that panel, but I've posted a separate comment with a proposed updated version. It took me a while to get it into a state that I thought was better. I do prefer the new version I posted to the original, and it is nice being able to see problem connectors at a glance, but I agree that maybe it's not really needed anymore. Do you like the new version I shared? If so I can push that change, otherwise I can remove that panel.

Again - not your fault as it was there already before. But maybe we can use this to align the capitalization of the titles? Right now it seems to use two different styles for the chart titles.

Yes definitely! Does Strimzi have a general rule of thumb for capitalization of Grafana panel titles?

How does this impact MM2? What is the expectation with regards to MM2? I guess this is exclusively for regular Connect clusters and MM2 clusters should use the MM2 chart despite running on top of Connect, right?

Yes, this is designed for regular Connect clusters; we could probably do a similar thing for the MM2 dashboard, basically a review to spot any gaps.

Can you elaborate more on the changes to the exported metrics? There seems to be a significant difference in what is scraped. The old YAML scrapes 182787 bytes of data in my cluster. Your new file scrapes 426466 bytes -> more than twice as much. Yet, the dashboard seems to work fine with the old YAML. Given that the JMX exporter is perceived as too slow and resource hungry in big clusters, is that intentional / desired?

I made three kinds of changes:

  • tweaking the collection to be more explicit with fewer wildcards
  • collecting missing rate and total metrics
  • collecting missing rebalance and coordinator metrics

I'm surprised the size is so much bigger; I think it must be from adding the rate collection for producers and consumers. Since we are calculating the rate on the Grafana side, we don't technically need that to determine the rate of incoming/outgoing bytes etc., so I can remove those. How do you see the size of the metrics, so I can test whether removing those helps?
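The "rate on the Grafana side" point can be sketched as a PromQL query: given only a cumulative counter, the rate is derived at query time, so scraping a separate pre-computed rate metric adds bytes without adding information. The metric name below is an illustrative assumption, not one confirmed by this PR:

```promql
# Per-second byte rate derived at query time from a cumulative
# *_total counter, making a scraped *-rate metric redundant
rate(kafka_connect_connect_metrics_incoming_byte_total[5m])
```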

Moreover, by using the YAML as you updated it, the dashabord stops working because the templating seems to look for kafka_connect_worker_connector_count metric. But you expose it as kafka_connect_connect_worker_metrics_connector_count. I expect there will be more naming issues, but due to this one it does not let me pick the cluster.

Hmm, that's odd, I don't see that error. I've been deleting and reimporting the YAML quite often to check it still works. In Prometheus I see kafka_connect_worker_connector_count but not kafka_connect_connect_worker_metrics_connector_count

@scholzj
Member

scholzj commented Nov 13, 2023

It should show 1 when a rebalance is in progress, but that is a good point. Maybe a graph with it as a time series would be more useful? Then people can see the trend of how often things rebalance.

I never managed to have anything else than 0 there even in Prometheus.

Yes, it was intentional; the idea was to make it easier to see the current levels at a glance. In the first version of this PR I had left the legends as is for Connection count and Connector task state, but I updated it based on Mickael's feedback. I prefer the table legend view for most graphs, but for those two it wasn't obvious to me which version worked better. What do you think @mimaison?

Well, the problem is that it is not easier to see the current levels at a glance, because you need to scroll even with a basic cluster. @ppatierno What do you think about this?

Yes definitely! Does Strimzi have a general rule of thumb for capitalization of Grafana panel titles?

I don't think we have any rule, so I'm not sure which one is correct. It just hit me that they are not the same (although as I said, this was there already before and is not something you changed). Maybe @PaulRMellor can suggest which style is better.

How do you see the size of the metrics so I can test if removing those changes helps?

I just dump it into a file with curl -s localhost:9404/metrics. It is not completely easy to diff, but comparing the size of the files is easy.
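Concretely, a hedged sketch of that comparison workflow (port 9404 comes from the comment above; access to the pod's metrics endpoint, e.g. via a port-forward, is assumed):

```shell
# Dump the scrape output before and after swapping the metrics ConfigMap,
# then compare the file sizes (curl output is empty if nothing listens).
curl -s localhost:9404/metrics > metrics-old.txt
# ...apply the updated metrics YAML and wait for a scrape, then:
curl -s localhost:9404/metrics > metrics-new.txt
wc -c metrics-old.txt metrics-new.txt

# Rough diff of which metric families were added or removed:
grep -oE '^[a-zA-Z_:][a-zA-Z0-9_:]*' metrics-old.txt | sort -u > names-old.txt
grep -oE '^[a-zA-Z_:][a-zA-Z0-9_:]*' metrics-new.txt | sort -u > names-new.txt
diff names-old.txt names-new.txt
```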

Hmm, that's odd, I don't see that error. I've been deleting and reimporting the YAML quite often to check it still works. In Prometheus I see kafka_connect_worker_connector_count but not kafka_connect_connect_worker_metrics_connector_count

Did you try to actually refresh your Grafana? And not just keep the selected cluster on, showing the old data?

@mimaison
Contributor

I have a strong preference for using tables on most graphs. It's what we mostly used while I was running clusters so I have a bias based on previous habits.

For graphs with a fixed number of series you can resize the graphs (there's a way to set a height) to always see all the rows of the table and avoid scrolling.

@mimaison
Contributor

For the ordering, you can set the sorting on the tables so the order is deterministic and consistent if the rows stay the same.
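In the panel's JSON model this corresponds to the legend options of a Grafana time series panel; a minimal sketch (field names per recent Grafana versions, values illustrative):

```json
{
  "options": {
    "legend": {
      "displayMode": "table",
      "placement": "bottom",
      "sortBy": "Name",
      "sortDesc": false
    }
  }
}
```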

@scholzj
Member

scholzj commented Nov 13, 2023

For graphs with a fixed number of series you can resize the graphs (there's a way to set a height) to always see all the rows of the table and avoid scrolling.

If it meant you could see all the values, it would make some sense. But not if they are just hidden under a scrollbar anyway.

@scholzj
Member

scholzj commented Nov 13, 2023

Hmm, that's odd, I don't see that error. I've been deleting and reimporting the YAML quite often to check it still works. In Prometheus I see kafka_connect_worker_connector_count but not kafka_connect_connect_worker_metrics_connector_count

Did you try to actually refresh your Grafana? And not just keep the selected cluster on, showing the old data?

I tried it once again, but I seem to have the same issue :-(.

@scholzj
Member

scholzj commented Nov 13, 2023

I tried it once again, but I seem to have the same issue :-(.

Ok, it looks like editing a ConfigMap is too hard for a person with my intellect :-/. So the updated YAML now works for me. And the size seems to be only 206729 bytes. So much closer to the original, so I think both of these comments can be ignored. Sorry. 🤦

@katheris
Contributor Author

@scholzj @mimaison for the legend discussion, what about putting the legend to the side for Connection Count and Connector Task State? I know this makes the actual graph smaller, but I think for these two metrics the current values are just as important as being able to see the trend over time
Screenshot 2023-11-13 at 16 44 52
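The side-legend layout corresponds to setting the legend placement to the right in the panel options; a minimal sketch of the relevant fragment of the panel JSON (values illustrative):

```json
{
  "options": {
    "legend": {
      "displayMode": "table",
      "placement": "right"
    }
  }
}
```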

@mimaison
Contributor

Pretty resourceful! That would work for me.

@scholzj
Member

scholzj commented Nov 13, 2023

@scholzj @mimaison for the legend discussion, what about putting the legend to the side for Connection Count and Connector Task State? I know this makes the actual graph smaller, but I think for these two metrics the current values are just as important as being able to see the trend over time

I think that is fine for me.

@katheris
Contributor Author

@scholzj I think I've addressed all your review comments. Here are the latest screenshots:
KC_Dashboard_v4_1
KC_Dashboard_v4_2
KC_Dashboard_v4_3

@katheris
Contributor Author

@mimaison I think I've also addressed yours. I wasn't able to find a way to make the tables display sensibly when no data is returned, unfortunately. I think it's due to the kind of query they are using.

@scholzj
Member

scholzj commented Nov 13, 2023

Looks good, thanks.

@scholzj
Member

scholzj commented Nov 13, 2023

@mimaison You didn't approve, so I guess I will wait for you in case you have something more before merging this.

@mimaison
Contributor

Sorry, done!

@scholzj scholzj merged commit 6785564 into strimzi:main Nov 13, 2023
13 checks passed