Increases PodDown alert threshold from 1h to 4h (#905)
1h is too susceptible to catching transient errors that clear themselves after a
while.
nkinkade authored Sep 23, 2024
1 parent e34577f commit 64c276f
Showing 1 changed file with 2 additions and 4 deletions.
6 changes: 2 additions & 4 deletions config/prometheus/alerts.yml
@@ -386,16 +386,14 @@ groups:
gmx_machine_maintenance == 1 or
up{job="kubernetes-nodes"} == 0
)
- for: 1h
+ for: 4h
labels:
repo: ops-tracker
severity: ticket
cluster: platform
annotations:
summary: A {{ $labels.deployment }} pod is down or broken.
- description: A {{ $labels.deployment }} pod is down or broken. Verify that the
- DaemonSet or Deployment is healthy. Check the status of the node that the
- pod is scheduled on. Check the status of the pod itself, if it exists.
+ description: https://github.com/m-lab/ops-tracker/wiki/Alerts-&-Troubleshooting#platformcluster_poddown
dashboard: https://grafana.mlab-staging.measurementlab.net/d/rJ7z2Suik/k8s-site-overview

# Etcd alerts.
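
For context, the `for:` clause changed in this commit sets how long the alert expression must hold continuously before the alert moves from pending to firing. The fragment below is a minimal sketch of a rule using the new threshold. The group name, the alert name, and the simplified expression are assumptions for illustration only; the real PodDown expression is longer and is only partly visible in this hunk.

```yaml
groups:
  - name: example-alerts          # hypothetical group name
    rules:
      - alert: ExamplePodDown     # hypothetical alert name
        # Placeholder expression for illustration, not the real PodDown query.
        expr: up{job="kubernetes-nodes"} == 0
        # The expression must be true at every rule evaluation for 4h
        # before the alert fires. If it stops being true earlier, the
        # pending alert is dropped and the timer starts over.
        for: 4h
        labels:
          severity: ticket
```

With `for: 1h`, a failure lasting a little over an hour would already open a ticket; with `for: 4h`, failures that clear on their own within four hours never leave the pending state.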