Is your feature request related to a problem? Please describe.
As we settle into running Razee in production, we have found a few things that would be nice to monitor and alert on.
We are trying to answer questions like:
How many locked resources are in the cluster?
Are runs completing successfully?
When was the last time a controller completed a run?
How long are runs taking?
How long is each resource taking? (This is more for future exploration and enhancements.)
Describe the solution you'd like
It would be nice to have each controller expose an OpenMetrics-compatible set of metrics that could easily be scraped by Prometheus, Sysdig, or other OpenMetrics agents. The types of metrics we think might help address the above questions include:
Number of resources from last run, with a breakdown by
success
failed
skipped due to debug flag
A heat map of time to process each resource
Boolean cluster lock state
Last run completion time
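To make the request more concrete, here is a rough sketch of what a scrape of one controller might return, written in stdlib-only Python just to show the shape of the output. Everything in it is an assumption for illustration: the metric names (`razee_run_resources`, `razee_cluster_locked`, and so on), the `RunStats` fields, and the bucket boundaries are hypothetical and not part of Razee today. It renders the Prometheus text exposition format by hand, and for simplicity the histogram covers only the last run, whereas a real exporter would normally accumulate across runs.

```python
from dataclasses import dataclass, field

# Hypothetical histogram bucket upper bounds, in seconds.
BUCKETS = (0.1, 0.5, 1.0, 5.0, 30.0)


@dataclass
class RunStats:
    """Outcome of one controller run (all field names are illustrative)."""
    success: int = 0
    failed: int = 0
    skipped_debug: int = 0          # resources skipped due to the debug flag
    cluster_locked: bool = False
    completed_at: float = 0.0       # Unix timestamp of last run completion
    durations: list = field(default_factory=list)  # seconds per resource


def render_metrics(stats: RunStats) -> str:
    """Render the stats in Prometheus text exposition format."""
    lines = ["# TYPE razee_run_resources gauge"]
    for result, count in (
        ("success", stats.success),
        ("failed", stats.failed),
        ("skipped_debug", stats.skipped_debug),
    ):
        lines.append(f'razee_run_resources{{result="{result}"}} {count}')

    lines.append("# TYPE razee_cluster_locked gauge")
    lines.append(f"razee_cluster_locked {int(stats.cluster_locked)}")

    lines.append("# TYPE razee_last_run_completion_timestamp_seconds gauge")
    lines.append(
        f"razee_last_run_completion_timestamp_seconds {stats.completed_at}"
    )

    # Histogram buckets are cumulative: each counts durations <= its bound.
    lines.append("# TYPE razee_resource_duration_seconds histogram")
    for bound in BUCKETS:
        n = sum(1 for d in stats.durations if d <= bound)
        lines.append(
            f'razee_resource_duration_seconds_bucket{{le="{bound}"}} {n}'
        )
    total = len(stats.durations)
    lines.append(f'razee_resource_duration_seconds_bucket{{le="+Inf"}} {total}')
    lines.append(f"razee_resource_duration_seconds_sum {sum(stats.durations)}")
    lines.append(f"razee_resource_duration_seconds_count {total}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = RunStats(
        success=3, failed=1, skipped_debug=2, cluster_locked=True,
        completed_at=1700000000.0, durations=[0.05, 0.3, 2.0],
    )
    print(render_metrics(example))
```

With output like this served on a `/metrics` endpoint, the questions above would map onto simple alert conditions, for example a non-zero `failed` count or a completion timestamp that has stopped advancing. The duration buckets are what a dashboard would use to draw the heat map.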
Describe alternatives you've considered
We've thought about trying to figure some of this out strictly from logs or by writing scripts to scrape the environment. We will probably do some of that here, but it's less "native" and shareable.