Add alerting on systemctl is-system-running
Systemd is quite good at supervising failing processes, so this signal
is a useful "generic" alert. Individual systemd units are not monitored
yet, as the utility of that data is unclear. See also #220 and #226.
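
For illustration only, here is a minimal Python sketch of the signal this alert relies on. It assumes that `node_systemd_system_running` is 1 only when `systemctl is-system-running` reports `running` and 0 for every other state (`degraded`, `maintenance`, `starting`, and so on). The function names are hypothetical and are not part of node_exporter or of this repository.

```python
import subprocess


def system_running_value(state: str) -> int:
    """Map a `systemctl is-system-running` state to the assumed 0/1 gauge value."""
    # Assumption: only the exact state "running" counts as healthy.
    return 1 if state.strip() == "running" else 0


def should_alert(state: str) -> bool:
    """Mirror the alert expression `node_systemd_system_running != 1`."""
    return system_running_value(state) != 1


def read_system_state() -> str:
    """Ask systemd for the overall system state (requires a systemd host)."""
    # systemctl exits non-zero for any state other than "running",
    # so the exit code is deliberately not treated as an error here.
    result = subprocess.run(
        ["systemctl", "is-system-running"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=False,
    )
    return result.stdout.strip()
```

On a systemd host, `should_alert(read_system_state())` would be true whenever at least one unit has failed, which is why the alert summary points at `systemctl list-units | grep failed`.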
darkk committed Oct 31, 2018
1 parent 4a52e2d commit 5b443d1
Showing 3 changed files with 20 additions and 0 deletions.
3 changes: 3 additions & 0 deletions ansible/roles/node_exporter/defaults/main.yml
@@ -9,6 +9,9 @@ node_exporter_collectors: >
--no-collector.arp --no-collector.bcache --no-collector.infiniband
--no-collector.ipvs --no-collector.wifi --no-collector.zfs
--collector.ntp --collector.ntp.local-offset-tolerance=5ms
{% if ansible_service_mgr == 'systemd' %}
--collector.systemd --collector.systemd.unit-whitelist=^$
{% endif %}
node_exporter_disk_ignored: ""

12 changes: 12 additions & 0 deletions ansible/roles/node_exporter/tasks/main.yml
@@ -29,6 +29,18 @@
owner: root
group: root
remote_src: true # file is ALREADY on the remote system. sigh.
creates: '{{ node_exporter_base }}/{{ node_exporter_basename }}/node_exporter'

# for some unknown reason some nodes do not have `dbus`, but systemd depends on it :-/
- name: install dbus to punch hole to systemd
apt:
name: dbus
state: present
update_cache: yes
cache_valid_time: 28800
install_recommends: false
tags: debug
when: ansible_service_mgr == 'systemd'

- name: Install node_exporter systemd service file
notify:
5 changes: 5 additions & 0 deletions ansible/roles/prometheus/files/alert_rules.yml
@@ -21,6 +21,11 @@ groups:
annotations:
summary: '{{ $labels.instance }} is not `up`'

- alert: systemd # yes, just "systemd", it's unclear what's going wrong :-)
expr: node_systemd_system_running != 1 # that's basically output of `systemctl is-system-running`
annotations:
summary: '{{ $labels.instance }} is not OK, check `systemctl list-units | grep failed`'

- alert: IOWaitHigh
expr: sum without (cpu) (irate(node_cpu{mode="iowait"}[1m])) > 0.9
for: 5m # matters to avoid spikes