Add runbook for PrometheusOperatorReconcileError alert
1 parent 0cbf19d, commit a69fa34
Showing 1 changed file with 29 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
--- | ||
title: How to Investigate PrometheusOperatorReconcile Errors | ||
weight: 218 | ||
last_reviewed_on: 2024-06-17 | ||
review_in: 6 months | ||
--- | ||
|
||
# <%= current_page.data.title %> | ||
|
||
When you see a `PrometheusOperatorReconcile` alert in the `low-priority-alerts` channel, it means that the Prometheus Operator is unable to reconcile the state of the Prometheus resources in the cluster. | ||
In practice, one or more Prometheus rules or alerts are invalid and have not been applied. | ||
|
||
## Troubleshooting | ||
|
||
Check the logs of the Prometheus Operator pod to see if there are any errors: | ||
|
||
```bash | ||
kubectl logs -n monitoring prometheus-operator-kube-p-operator-<pod-id> -f | ||
``` | ||
|
||
Look for `Invalid rule` errors like the one below: | ||
``` | ||
level=info ts=2024-02-23T10:31:29.0543824Z caller=rules.go:345 component=prometheusoperator msg="Invalid rule" err="group \"XXX-elasticache\", rule 1, \"elasticache-enginecpu-utilisation\": annotation \"message\": template: __alert_elasticache-enginecpu-utilisation:1: undefined variable \"$clusterId\"" | ||
``` | ||
|
||
An invalid rule like this can stop Prometheus from sending alerts to certain channels, and can stop changed or new alerts from being created. If that happens, you may also see a `PrometheusErrorSendingAlertsToSomeAlertmanagers` alert. | ||
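When the operator log is noisy, it can help to reduce it to the failing group and rule names. The following is a minimal sketch that assumes the log lines have the same shape as the example above. `invalid_rules` is a hypothetical helper name, not part of any existing tooling.

```bash
#!/usr/bin/env bash

# Hypothetical helper: reads Prometheus Operator log lines on stdin and
# prints "<group> <rule>" for every "Invalid rule" message.
# Assumes the err field looks like: err="group \"G\", rule N, \"R\": ..."
invalid_rules() {
  grep -F 'msg="Invalid rule"' |
    sed -n 's/.*err="group \\"\([^\\]*\)\\", rule [0-9]*, \\"\([^\\]*\)\\".*/\1 \2/p'
}

# Example usage (replace <pod-id> with the real pod suffix):
#   kubectl logs -n monitoring prometheus-operator-kube-p-operator-<pod-id> | invalid_rules
```

Log lines that do not match the assumed format are dropped, so an empty result does not prove there are no errors. Check the raw log if in doubt.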
|
||
You will need to fix the failing PrometheusRule. If the rule is not configured in the [cloud-platform-environments](https://github.com/ministryofjustice/cloud-platform-environments) repository, | ||
find the namespace the rule is applied in, then contact the owning team through their Slack channel, or contact the last person who changed the rule, and ask them to fix it. |
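The lookup described above could be sketched as follows. Both function names are hypothetical, and the `cloud-platform.justice.gov.uk/slack-channel` namespace annotation is an assumption about how namespaces are labelled in this cluster. Verify it against a real namespace before relying on it.

```bash
#!/usr/bin/env bash

# Hypothetical helper: prints "<namespace>/<PrometheusRule name>" for every
# PrometheusRule that defines an alert with the given name.
find_prometheus_rule() {
  local alert_name="$1"
  kubectl get prometheusrules --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.groups[*].rules[*].alert}{"\n"}{end}' |
    awk -F'\t' -v want="$alert_name" '{
      n = split($3, alerts, " ")
      for (i = 1; i <= n; i++)
        if (alerts[i] == want) { print $1 "/" $2; break }
    }'
}

# Hypothetical helper: prints the Slack channel annotation of a namespace.
# The annotation key is an assumption; adjust it if your namespaces differ.
namespace_slack_channel() {
  local namespace="$1"
  kubectl get namespace "$namespace" \
    -o jsonpath='{.metadata.annotations.cloud-platform\.justice\.gov\.uk/slack-channel}'
}

# Example usage:
#   find_prometheus_rule elasticache-enginecpu-utilisation
#   namespace_slack_channel <namespace-from-previous-command>
```

Matching on the alert name rather than the group name keeps the search independent of how teams name their rule groups. Recording rules have no `alert` field, so this sketch does not find them.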