-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Kafka Connect metrics and dashboard #9331
Conversation
* Collect coordinator metrics. * Simplify metric regex. * Update dashboard to add panels and display more connector information and rebalance metrics. Signed-off-by: Katherine Stanley <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to update the Helm Chart dasboards as well I guess?
Signed-off-by: Katherine Stanley <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff! LGTM. Thanks!
The dashboard looks much nicer! Some minor suggestions:
|
Signed-off-by: Katherine Stanley <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates. The dashboard looks great!
Yes this seems fine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks nice, thanks for the improvements. The chart showing the assignment of connectors and tasks to nodes is nice (although I'm not sure how it wold look for those running 50 node clusters, but not sure I care about that anyway).
Some comments:
- Would it be worth adding a CHANGELOG record?
Improved Kafka Connect metrics and dashboard
or something like that? - What is the expectation from the
Rebalancing
gauge? When is it expected to show something given rebalances should IMHO be normally short and the scraping of metrics happens roughly every 30 seconds? My Prometheus does not seem to have any other values than 0 while scaling, changing connectors, etc. So I wonder when does this show something. - The Connector Status and Connector Task Status do not seem to have a fixed ordering and seem to change the order every other refresh. Can we sort them by name for example? It does not seem nice, because my brain registers the change in the dashabord even through nothing really changed. (To be fair, that was an issue in the previous dashboard as well, but would be great to fix this)
- Lot of the charts you touched now don't seem to have a simple legend but a table with the values. Even with a small cluster, that seems to now require scrolling to actually see the legend and the values. Was this intentional? Is this something we really want? I do not have a strong opinion, so if everyone wants it that way, fine. But to me, it seemed better when I saw all the legends at once without scrolling (and the actual values are show when I scroll over the chart) rather then this format.
- This is especially obvious for the charts with static number of items: Connection Count, or Connector Task state. There, you see only the first 2 / 3 options. For the other charts, one might argue that for a big cluter the legend would not fit anyway.
- When I have a connector with all 3 failed tasks, I see that correctly in the chart and in the task status table. But the table with "Connector task state with issues" shows just one record for the connector with number 3. Is that expected? Is that a sum of failed tasks? I think it is not obvious from the table. Also, What is the point of the table when it pretty much just shows what the chart and the table shows?
- Again - not your fault as it was there already before. But maybe we can use this to align the capitalization of the titles? Right now it seems to use two different styles for the chart titles.
- How does this impact MM2? What is the expectation with regards to MM2? I guess this is exclusively for regular Connect clusters and MM2 clusters should use the MM2 chart despite running on top of Connect, right?
- Can you elaborate more on the changes to the exported metrics? There seems to be a singificant difference in what is scraped. Old YAML scrapes in my cluster 182787 bytes of data. Your new file scrapes 426466 bytes -> more than twice as much. Yet, the dashboard seems to work fine with the old YAML. Given that the MX exporter is rercieved as too slow and resource hungry in big clusters, is that intentionakl / desired?
- Moreover, by using the YAML as you updated it, the dashabord stops working because the templating seems to look for
kafka_connect_worker_connector_count
metric. But you expose it askafka_connect_connect_worker_metrics_connector_count
. I expect there will be more naming issues, but due to this one it does not let me pick the cluster.
Hey @scholzj, thanks for the review.
Yes I can add that
It should show 1 when a rebalance is in progress, but that is a good point, maybe a graph with it as time series would be more useful? Then people can see the trend of how often things rebalance
I hadn't spotted that, but yes I can fix that
Yes it was intentional, the idea was to make it easier to see the current levels at a glance. In the first version of this PR I had left the legends as is for Connection count and Connector task state, but I updated it based on Mickael's feedback. I prefer the table legend view for most graphs but for those two it wasn't obvious to me which version worked better. What do you think @mimaison?
In the PR as it is now I haven't edited that panel, but I've posted a separate comment with a proposed updated version of this panel, it took me a while to get it in a state that I thought was better. I do prefer the new version I posted to the original, and it is nice being able to see problem connectors at a glance, but I agree that maybe it's not really needed anymore. Do you like the new version I shared? If so I can push that change, otherwise I can remove that panel.
Yes definitely! Does Strimzi have a general rule of thumb for capitalization of Grafana panel titles?
Yes this is designed for regular Connect clusters, we could probably do a similar thing to the MM2 chart, basically a review to spot any gaps.
I made three kinds of changes:
I'm surprised the size is so much bigger, I think it must be from adding the rate collection for producers and consumers. Since we are calculating the rate on the Grafana side we don't technically need that to determine the rate of bytes incoming/outgoing etc so I can remove those. How do you see the size of the metrics so I can test if removing those changes helps?
Hmm, that's odd, I don't see that error, I've been deleting and reimporting the yaml quite often to check it still works. In prometheus I see kafka_connect_worker_connector_count but not kafka_connect_connect_worker_metrics_connector_count |
I never managed to have anything else than 0 there even in Prometheus.
Well, the problem is that it is not easier to see the current levels at a glance because you need to scroll even with a b asic cluster. @ppatierno What do you think about this?
I don't think we have any rule. So not sure which one is correct. It just hit me that they are not the same (although as I said, this was there already before and is not something you changed). Maybe @PaulRMellor can suggest which style is better.
I just dump it into file with
Did you tried to actually refresh your Grafana? And not just keep the selected cluster onshowing the old data? |
I have a strong preference for using tables on most graphs. It's what we mostly used while I was running clusters so I have a bias based on previous habits. For graphs with a fixed number of series you can resize the graphs (there's a way to set a height) to always see all the rows of the table and avoid scrolling. |
For the ordering, you can set the sorting on the tables so the order is deterministic and consistent if the rows stay the same. |
If it was that you can see all the values, it would make some sense. But not if you anyway just hide it under a scrollbar. |
I tried it once again, but I seem to have the same issue :-(. |
Ok, it looks like editing a ConfigMap is too hard for a person with my intellect :-/. So the updated YAML now works for me. And the size seems to be only |
Signed-off-by: Katherine Stanley <[email protected]>
Signed-off-by: Katherine Stanley <[email protected]>
Signed-off-by: Katherine Stanley <[email protected]>
Pretty resourceful! That would work for me. |
I think that is fine for me. |
Signed-off-by: Katherine Stanley <[email protected]>
Signed-off-by: Katherine Stanley <[email protected]>
@scholzj I think I've addressed all your review comments. Here are the latest screenshots: |
@mimaison I think I've also addressed yours. I wasn't able to find a way to make the tables display sensibly when no data is returned unfortunately. I think it's due to the kind of query they are using |
Looks good, thanks. |
@mimaison You didn't approve, so I guess will wait for you if you have something more before merging this. |
Sorry, done! |
Type of change
Description
Update Kafka Connect metrics and dashboard
Dashboard before
Dashboard after
Checklist
Please go through this checklist and make sure all applicable tasks have been done