
Update Kafka Connect metrics and dashboard #9331

Merged
8 commits merged into strimzi:main on Nov 13, 2023

Conversation

katheris
Contributor

@katheris katheris commented Nov 7, 2023

Type of change

  • Enhancement

Description

Update Kafka Connect metrics and dashboard

  • Collect coordinator metrics.
  • Simplify metric regex.
  • Update dashboard to add panels and display more connector information and rebalance metrics.
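For context, these metrics are scraped via the Prometheus JMX exporter, whose rules map JMX MBean patterns to Prometheus metric names. A minimal sketch of the kind of rule involved; the exact pattern and name here are illustrative, not necessarily the ones in this PR:

```yaml
rules:
  # Map Connect worker rebalance MBean attributes through one rule,
  # e.g. kafka.connect:type=connect-worker-rebalance-metrics,
  # attribute "rebalancing" becomes the Prometheus metric
  # kafka_connect_worker_rebalance_rebalancing.
  - pattern: "kafka.connect<type=connect-worker-rebalance-metrics>([a-z-]+)"
    name: "kafka_connect_worker_rebalance_$1"
    type: GAUGE
```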

Dashboard before

KC_Dashboard_Orig

Dashboard after

KC_Dashboard_v2_1
KC_Dashboard_v2_2
KC_Dashboard_v2_3
KC_Dashboard_v2_4

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

* Collect coordinator metrics.
* Simplify metric regex.
* Update dashboard to add panels and display more connector information and rebalance metrics.

Signed-off-by: Katherine Stanley <[email protected]>
Member

@scholzj scholzj left a comment


You need to update the Helm Chart dashboards as well I guess?

Signed-off-by: Katherine Stanley <[email protected]>
@scholzj scholzj added this to the 0.39.0 milestone Nov 7, 2023
Member

@ppatierno ppatierno left a comment


Good stuff! LGTM. Thanks!

@mimaison
Contributor

mimaison commented Nov 8, 2023

The dashboard looks much nicer!

Some minor suggestions:

  • What about slightly increasing the width of both graphs in the first row of "Connectors and Tasks" so they fully fill the row?
  • Could we use the table view for the legend of the "Connection count" and "Connector task state" graphs? I think that makes the values easier to read like you've done for the other graphs.
  • Have you tried tweaking the "Connector task state with issues" panel to print "0" in case there's no data?

@katheris
Contributor Author

These are the updated screenshots from my latest commit:
KC_Dashboard_v3_1
KC_Dashboard_v3_2
KC_Dashboard_v3_3
KC_Dashboard_v3_4

I've been investigating whether I can display the connector task issues in a better way, but apart from that this addresses the other review comments

Contributor

@mimaison mimaison left a comment


Thanks for the updates. The dashboard looks great!

@scholzj
Member

scholzj commented Nov 13, 2023

@katheris So does this do anything about #9317 as well? Or is this change completely unrelated?

@katheris
Contributor Author

@scholzj this is unrelated

@mimaison do you think this panel would be a better visualisation for the tasks with issues? The panel will also display tasks that are returned from the connector-destroyed-task-count and connector-unassigned-task-count metrics
Screenshot 2023-11-13 at 13 59 08

@mimaison
Contributor

Yes this seems fine

Member

@scholzj scholzj left a comment


It looks nice, thanks for the improvements. The chart showing the assignment of connectors and tasks to nodes is nice (although I'm not sure how it would look for those running 50-node clusters, but not sure I care about that anyway).

Some comments:

  • Would it be worth adding a CHANGELOG record? Improved Kafka Connect metrics and dashboard or something like that?
  • What is the expectation from the Rebalancing gauge? When is it expected to show something, given rebalances should IMHO normally be short and the scraping of metrics happens roughly every 30 seconds? My Prometheus does not seem to have any values other than 0 while scaling, changing connectors, etc. So I wonder when this shows something.
  • The Connector Status and Connector Task Status do not seem to have a fixed ordering and seem to change the order every other refresh. Can we sort them by name for example? It does not seem nice, because my brain registers the change in the dashboard even though nothing really changed. (To be fair, that was an issue in the previous dashboard as well, but it would be great to fix this)
  • A lot of the charts you touched now don't seem to have a simple legend but a table with the values. Even with a small cluster, that seems to require scrolling to actually see the legend and the values. Was this intentional? Is this something we really want? I do not have a strong opinion, so if everyone wants it that way, fine. But to me, it seemed better when I saw all the legends at once without scrolling (and the actual values are shown when I hover over the chart) rather than this format.
    • This is especially obvious for the charts with a static number of items: Connection Count, or Connector Task state. There, you see only the first 2 / 3 options. For the other charts, one might argue that for a big cluster the legend would not fit anyway.
  • When I have a connector with all 3 failed tasks, I see that correctly in the chart and in the task status table. But the table with "Connector task state with issues" shows just one record for the connector, with the number 3. Is that expected? Is that a sum of failed tasks? I think it is not obvious from the table. Also, what is the point of the table when it pretty much just shows what the chart and the status table already show?
  • Again - not your fault as it was there already before. But maybe we can use this to align the capitalization of the titles? Right now it seems to use two different styles for the chart titles.
  • How does this impact MM2? What is the expectation with regards to MM2? I guess this is exclusively for regular Connect clusters and MM2 clusters should use the MM2 chart despite running on top of Connect, right?
  • Can you elaborate more on the changes to the exported metrics? There seems to be a significant difference in what is scraped. The old YAML scrapes 182787 bytes of data in my cluster. Your new file scrapes 426466 bytes -> more than twice as much. Yet, the dashboard seems to work fine with the old YAML. Given that the JMX exporter is perceived as too slow and resource hungry in big clusters, is that intentional / desired?
  • Moreover, by using the YAML as you updated it, the dashboard stops working because the templating seems to look for the kafka_connect_worker_connector_count metric. But you expose it as kafka_connect_connect_worker_metrics_connector_count. I expect there will be more naming issues, but due to this one it does not let me pick the cluster.

@katheris
Contributor Author

Hey @scholzj, thanks for the review.

Would it be worth adding a CHANGELOG record? Improved Kafka Connect metrics and dashboard or something like that?

Yes I can add that

What is the expectation from the Rebalancing gauge? When is it expected to show something, given rebalances should IMHO normally be short and the scraping of metrics happens roughly every 30 seconds? My Prometheus does not seem to have any values other than 0 while scaling, changing connectors, etc. So I wonder when this shows something.

It should show 1 when a rebalance is in progress, but that is a good point. Maybe a graph with it as a time series would be more useful? Then people can see the trend of how often things rebalance.
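Since the gauge only reads 1 while a rebalance is actually in progress, a 30-second scrape interval can easily miss short rebalances. One way a time-series panel could still surface them is a window query like the following; the metric name is an assumption based on the exporter mapping, not confirmed by this PR:

```promql
# 1 if any scrape in the trailing 5-minute window caught a
# rebalance in progress, 0 otherwise (metric name assumed)
max_over_time(kafka_connect_worker_rebalance_rebalancing[5m])
```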

The Connector Status and Connector Task Status do not seem to have a fixed ordering and seem to change the order every other refresh. Can we sort them by name for example? It does not seem nice, because my brain registers the change in the dashboard even though nothing really changed. (To be fair, that was an issue in the previous dashboard as well, but it would be great to fix this)

I hadn't spotted that, but yes I can fix that

A lot of the charts you touched now don't seem to have a simple legend but a table with the values. Even with a small cluster, that seems to require scrolling to actually see the legend and the values. Was this intentional? Is this something we really want? I do not have a strong opinion, so if everyone wants it that way, fine. But to me, it seemed better when I saw all the legends at once without scrolling (and the actual values are shown when I hover over the chart) rather than this format.
This is especially obvious for the charts with a static number of items: Connection Count, or Connector Task state. There, you see only the first 2 / 3 options. For the other charts, one might argue that for a big cluster the legend would not fit anyway.

Yes, it was intentional; the idea was to make it easier to see the current levels at a glance. In the first version of this PR I had left the legends as is for Connection count and Connector task state, but I updated it based on Mickael's feedback. I prefer the table legend view for most graphs, but for those two it wasn't obvious to me which version worked better. What do you think @mimaison?

When I have a connector with all 3 failed tasks, I see that correctly in the chart and in the task status table. But the table with "Connector task state with issues" shows just one record for the connector, with the number 3. Is that expected? Is that a sum of failed tasks? I think it is not obvious from the table. Also, what is the point of the table when it pretty much just shows what the chart and the status table already show?

In the PR as it is now I haven't edited that panel, but I've posted a separate comment with a proposed updated version. It took me a while to get it into a state that I thought was better. I do prefer the new version I posted to the original, and it is nice being able to see problem connectors at a glance, but I agree that maybe it's not really needed anymore. Do you like the new version I shared? If so I can push that change, otherwise I can remove that panel.

Again - not your fault as it was there already before. But maybe we can use this to align the capitalization of the titles? Right now it seems to use two different styles for the chart titles.

Yes definitely! Does Strimzi have a general rule of thumb for capitalization of Grafana panel titles?

How does this impact MM2? What is the expectation with regards to MM2? I guess this is exclusively for regular Connect clusters and MM2 clusters should use the MM2 chart despite running on top of Connect, right?

Yes, this is designed for regular Connect clusters; we could probably do a similar thing for the MM2 dashboard, basically a review to spot any gaps.

Can you elaborate more on the changes to the exported metrics? There seems to be a significant difference in what is scraped. The old YAML scrapes 182787 bytes of data in my cluster. Your new file scrapes 426466 bytes -> more than twice as much. Yet, the dashboard seems to work fine with the old YAML. Given that the JMX exporter is perceived as too slow and resource hungry in big clusters, is that intentional / desired?

I made three kinds of changes:

  • tweaking the collection to be more explicit with fewer wildcards
  • collecting missing rate and total metrics
  • collecting missing rebalance and coordinator metrics

I'm surprised the size is so much bigger; I think it must be from adding the rate collection for producers and consumers. Since we are calculating the rate on the Grafana side, we don't technically need that to determine the rate of incoming/outgoing bytes etc., so I can remove those. How do you see the size of the metrics, so I can test whether removing those helps?
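The "rate on the Grafana side" point can be sketched as a PromQL query: given only a cumulative counter, the rate is derived at query time, so scraping a separate pre-computed rate metric adds bytes without adding information. The metric name below is an illustrative assumption, not one confirmed by this PR:

```promql
# Per-second byte rate derived at query time from a cumulative
# *_total counter, making a scraped *-rate metric redundant
rate(kafka_connect_connect_metrics_incoming_byte_total[5m])
```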

Moreover, by using the YAML as you updated it, the dashabord stops working because the templating seems to look for kafka_connect_worker_connector_count metric. But you expose it as kafka_connect_connect_worker_metrics_connector_count. I expect there will be more naming issues, but due to this one it does not let me pick the cluster.

Hmm, that's odd, I don't see that error. I've been deleting and reimporting the YAML quite often to check it still works. In Prometheus I see kafka_connect_worker_connector_count but not kafka_connect_connect_worker_metrics_connector_count

@scholzj
Member

scholzj commented Nov 13, 2023

It should show 1 when a rebalance is in progress, but that is a good point. Maybe a graph with it as a time series would be more useful? Then people can see the trend of how often things rebalance.

I never managed to have anything else than 0 there even in Prometheus.

Yes, it was intentional; the idea was to make it easier to see the current levels at a glance. In the first version of this PR I had left the legends as is for Connection count and Connector task state, but I updated it based on Mickael's feedback. I prefer the table legend view for most graphs, but for those two it wasn't obvious to me which version worked better. What do you think @mimaison?

Well, the problem is that it is not easier to see the current levels at a glance, because you need to scroll even with a basic cluster. @ppatierno What do you think about this?

Yes definitely! Does Strimzi have a general rule of thumb for capitalization of Grafana panel titles?

I don't think we have any rule, so I'm not sure which one is correct. It just hit me that they are not the same (although as I said, this was there already before and is not something you changed). Maybe @PaulRMellor can suggest which style is better.

How do you see the size of the metrics so I can test if removing those changes helps?

I just dump it into a file with curl -s localhost:9404/metrics. It is not completely easy to diff, but comparing the size of the files is easy.
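Concretely, a hedged sketch of that comparison workflow (port 9404 comes from the comment above; access to the pod's metrics endpoint, e.g. via a port-forward, is assumed):

```shell
# Dump the scrape output before and after swapping the metrics ConfigMap,
# then compare the file sizes (curl output is empty if nothing listens).
curl -s localhost:9404/metrics > metrics-old.txt
# ...apply the updated metrics YAML and wait for a scrape, then:
curl -s localhost:9404/metrics > metrics-new.txt
wc -c metrics-old.txt metrics-new.txt

# Rough diff of which metric families were added or removed:
grep -oE '^[a-zA-Z_:][a-zA-Z0-9_:]*' metrics-old.txt | sort -u > names-old.txt
grep -oE '^[a-zA-Z_:][a-zA-Z0-9_:]*' metrics-new.txt | sort -u > names-new.txt
diff names-old.txt names-new.txt
```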

Hmm, that's odd, I don't see that error. I've been deleting and reimporting the YAML quite often to check it still works. In Prometheus I see kafka_connect_worker_connector_count but not kafka_connect_connect_worker_metrics_connector_count

Did you try to actually refresh your Grafana? And not just keep the selected cluster on, showing the old data?

@mimaison
Contributor

I have a strong preference for using tables on most graphs. It's what we mostly used while I was running clusters so I have a bias based on previous habits.

For graphs with a fixed number of series you can resize the graphs (there's a way to set a height) to always see all the rows of the table and avoid scrolling.

@mimaison
Contributor

For the ordering, you can set the sorting on the tables so the order is deterministic and consistent if the rows stay the same.
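In the panel's JSON model this corresponds to the legend options of a Grafana time series panel; a minimal sketch (field names per recent Grafana versions, values illustrative):

```json
{
  "options": {
    "legend": {
      "displayMode": "table",
      "placement": "bottom",
      "sortBy": "Name",
      "sortDesc": false
    }
  }
}
```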

@scholzj
Member

scholzj commented Nov 13, 2023

For graphs with a fixed number of series you can resize the graphs (there's a way to set a height) to always see all the rows of the table and avoid scrolling.

If it meant you could see all the values, it would make some sense. But not if they are just hidden under a scrollbar anyway.

@scholzj
Member

scholzj commented Nov 13, 2023

Hmm, that's odd, I don't see that error. I've been deleting and reimporting the YAML quite often to check it still works. In Prometheus I see kafka_connect_worker_connector_count but not kafka_connect_connect_worker_metrics_connector_count

Did you try to actually refresh your Grafana? And not just keep the selected cluster on, showing the old data?

I tried it once again, but I seem to have the same issue :-(.

@scholzj
Member

scholzj commented Nov 13, 2023

I tried it once again, but I seem to have the same issue :-(.

Ok, it looks like editing a ConfigMap is too hard for a person with my intellect :-/. So the updated YAML now works for me. And the size seems to be only 206729 bytes. So much closer to the original, so I think both of these comments can be ignored. Sorry. 🤦

@katheris
Contributor Author

@scholzj @mimaison for the legend discussion, what about putting the legend to the side for Connection Count and Connector Task State? I know this makes the actual graph smaller, but I think for these two metrics the current values are just as important as being able to see the trend over time
Screenshot 2023-11-13 at 16 44 52
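The side-legend layout corresponds to setting the legend placement to the right in the panel options; a minimal sketch of the relevant fragment of the panel JSON (values illustrative):

```json
{
  "options": {
    "legend": {
      "displayMode": "table",
      "placement": "right"
    }
  }
}
```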

@mimaison
Contributor

Pretty resourceful! That would work for me.

@scholzj
Member

scholzj commented Nov 13, 2023

@scholzj @mimaison for the legend discussion, what about putting the legend to the side for Connection Count and Connector Task State? I know this makes the actual graph smaller, but I think for these two metrics the current values are just as important as being able to see the trend over time

I think that is fine for me.

@katheris
Contributor Author

@scholzj I think I've addressed all your review comments. Here are the latest screenshots:
KC_Dashboard_v4_1
KC_Dashboard_v4_2
KC_Dashboard_v4_3

@katheris
Contributor Author

@mimaison I think I've also addressed yours. I wasn't able to find a way to make the tables display sensibly when no data is returned, unfortunately. I think it's due to the kind of query they are using.

@scholzj
Member

scholzj commented Nov 13, 2023

Looks good, thanks.

@scholzj
Member

scholzj commented Nov 13, 2023

@mimaison You didn't approve, so I guess I will wait for you in case you have something more before merging this.

@mimaison
Contributor

Sorry, done!

@scholzj scholzj merged commit 6785564 into strimzi:main Nov 13, 2023
13 checks passed