Monitoring an application can be complex and produce a wide variety of data. In order to standardize the handling of threshold values on the command line, to reduce the number of command line parameters and their interdependencies and to enable independent and thus extended designs of the Grafana panels, each topic should be dealt with in a separate check (following the Linux mantra: "one tool, one task").
Avoid an extensive check that covers a wide variety of aspects:
myapp --action threading --warning 1500 --critical 2000
myapp --action memory-usage --warning 80 --critical 90
myapp --action deployment-status
(warning and critical command line options not supported)
Better write three separate checks:
myapp-threading --warning 1500 --critical 2000
myapp-memory-usage --warning 80 --critical 90
myapp-deployment-status
All plugins are written in Python and will be licensed under the [UNLICENSE](https://unlicense.org/), which is a license with no conditions whatsoever that dedicates works to the public domain.
All plugins are coded in Python. Use at least Python 3.6 and at most Python 3.8 for development.
Simply clone the libraries and monitoring plugins and start working:
git clone [email protected]:Linuxfabrik/lib.git
git clone [email protected]:Linuxfabrik/monitoring-plugins.git
Checklist:
- The plugin itself.
- A nice 16x16 transparent PNG icon, for example based on https://simpleicons.org or font-awesome (not in Git, will be put for download on https://download.linuxfabrik.ch).
- README file explaining "How?" and "Why?"
- Optional: `unit-test/run` - the unittest file (see Unit Tests)
- Optional: `requirements.txt`
- If providing performance data: Grafana dashboard (see GRAFANA) and `.ini` file for the Icinga Web 2 Grafana Module
- Icinga Director Basket Config for the check plugin
- Icinga Service Set in `all-the-rest.json`
- Optional: sudoers file (see sudoers File)
- Optional: A screenshot of the plugin's output from within Icinga, resized to 423x106, using background-color `#f5f9fa`, hosted on download.linuxfabrik.ch, and listed alphabetically in the project's README.
- CHANGELOG
The plugin should be "self-configuring" and/or use best-practice defaults, so that it runs without parameters wherever possible.
Develop with a minimal Linux in mind.
Develop with Icinga2 in mind.
Avoid complicated or fancy (and therefore unreadable) Python statements.
Comments and output should be in English only.
If possible avoid libraries that have to be installed.
Validate user input.
There is no need to specify the full path when executing system (shell/bash) commands.
It is ok to use temp files if needed.
Much better: use a local SQLite database if you want to use a temp file.
Keep in mind: Plugins have a limited runtime - typically 10 seconds max. Therefore it is ideal if the plugin executes fast and uses minimal resources (CPU time, memory etc.).
Timeout gracefully on errors (for example `df` on a failed network drive) and return WARN.
Return UNKNOWN on missing dependencies or wrong parameters.
Mainly return WARN. Only return CRIT if the operators want to or have to wake up at night. CRIT means "react immediately".
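The timeout and return-state rules above can be sketched as follows. This is a minimal illustration using only the standard library; the function name `run_with_timeout` and the `STATE_*` constants are assumptions made for this example and are not taken from the Linuxfabrik libraries.

```python
import subprocess

# Nagios-compatible return codes (names are illustrative)
STATE_OK, STATE_WARN, STATE_CRIT, STATE_UNKNOWN = 0, 1, 2, 3


def run_with_timeout(cmd, timeout=8):
    """Run a command and map failures to plugin states.

    Returns a (state, message) tuple.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # works on Python 3.6+
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # e.g. `df` hanging on a failed network drive: give up and warn
        return STATE_WARN, 'Timed out after {}s.'.format(timeout)
    except FileNotFoundError:
        # the required tool is not installed: a missing dependency
        return STATE_UNKNOWN, '{} not found.'.format(cmd[0])
    return STATE_OK, proc.stdout
```

Keeping the timeout below the typical 10 second limit leaves the plugin enough time to print its own message before the monitoring system kills it.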
EAFP: Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements.
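A small sketch of the EAFP style described above. The dictionary layout and the function name `get_version` are hypothetical and only serve as an illustration.

```python
def get_version(data):
    """Return data['app']['version'], or None if it is not available."""
    try:
        # EAFP: just access the keys and assume they exist ...
        return data['app']['version']
    except (KeyError, TypeError):
        # ... and handle the case where the assumption was wrong
        return None
```

The "look before you leap" alternative would need an explicit check for every level of the structure before accessing it.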
Use RFC 5737, 3849, 7042 and 2606 in examples / documentation:
- IPv4 Addresses: `192.0.2.0/24`, `198.51.100.0/24`, `203.0.113.0/24`
- IPv6 Addresses: `2001:DB8::/32`
- MAC Addresses: `00-00-5E-00-53-00` through `00-00-5E-00-53-FF` (unicast), `01-00-5E-90-10-00` through `01-00-5E-90-10-FF` (multicast)
- Domains: `*.example`, `example.com`
Short:
- Use `txt.to_text()` and `txt.to_bytes()`.
The theory:
- Data coming into your plugins must be bytes, encoded with `UTF-8`.
- Decode incoming bytes as soon as possible (best within the libraries), producing unicode.
- Use unicode throughout your plugin.
- When outputting data, use library functions, they should do output conversions for you. Library functions like `base.oao` or `url.fetch_json` will take care of the conversion to and from bytes.
See https://nedbatchelder.com/text/unipain.html for details.
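The "decode early, use unicode inside, encode late" idea can be sketched like this. The two helpers below are simplified stand-ins written for this example; they are not the actual implementations of `txt.to_text()` and `txt.to_bytes()`, whose signatures may differ.

```python
def to_text(raw, encoding='utf-8'):
    """Decode bytes to str as early as possible (illustrative helper)."""
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors='replace')
    return str(raw)


def to_bytes(text, encoding='utf-8'):
    """Encode str to bytes only at the output boundary (illustrative helper)."""
    if isinstance(text, bytes):
        return text
    return str(text).encode(encoding)


# bytes come in (e.g. from a socket or a subprocess) ...
incoming = b'Z\xc3\xbcrich'
# ... are decoded once ...
city = to_text(incoming)
# ... handled as unicode inside the plugin ...
message = 'Location: ' + city
# ... and encoded again only when leaving the plugin
outgoing = to_bytes(message)
```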
The plugin name should match the following regex: `^[a-zA-Z0-9\-\_]*$`. This allows the plugin name to be used as the Grafana dashboard uid (according to here).
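A quick way to check a candidate name against this rule, as a sketch. The function name `is_valid_plugin_name` is made up for this example; the character class is written as `[a-zA-Z0-9_-]`, which matches the same characters as the regex above.

```python
import re

# same character set as ^[a-zA-Z0-9\-\_]*$
PLUGIN_NAME_RE = re.compile(r'^[a-zA-Z0-9_-]*$')


def is_valid_plugin_name(name):
    """Return True if the name only uses letters, digits, '-' and '_'."""
    return bool(PLUGIN_NAME_RE.match(name))
```

For example, `is_valid_plugin_name('cpu-usage')` is accepted, while a name containing a space or a slash is rejected.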
There are a few Nagios-compatible reserved options that should not be used for other purposes:
-a, --authentication authentication password
-C, --community SNMP community
-c, --critical critical threshold
-h, --help help
-H, --hostname hostname
-l, --logname login name
-p, --password password
-p, --port network port
-t, --timeout timeout
-u, --url URL
-u, --username username
-V, --version version
-v, --verbose verbose
-w, --warning warning threshold
For all other options, use long parameters only. Separate words using a `-`. We recommend using some of those:
--activestate
--alarm-duration
--always-ok
--argument
--authtype
--cache-expire
--command
--community
--config
--count
--critical
--critical-count
--critical-cpu
--critical-maxchildren
--critical-mem
--critical-pattern
--critical-regex
--critical-slowreq
--database
--datasource
--date
--device
--donor
--filename
--filter
--full
--hide-ok
--hostname
--icinga-callback
--icinga-password
--icinga-service-name
--icinga-url
--icinga-username
--idsite
--ignore
--ignore-pattern
--ignore-regex
--input
--insecure
--instance
--interface
--interval
--ipv6
--key
--latest
--lengthy
--loadstate
--message
--message-key
--metric
--mib
--mibdir
--mode
--module
--mount
--no-kthreads
--no-proxy
--no-summary
--node
--only-dirs
--only-files
--password
--path
--pattern
--perfdata
--perfdata-key
--period
--port
--portname
--prefix
--privlevel
--response
--service
--severity
--snmp-version
--starttype
--state
--state-key
--status
--substate
--suppress-lines
--task
--team
--test
--timeout
--timerange
--token
--trigger
--type
--unit
--unitfilestate
--url
--username
--version
--virtualenv
--warning
--warning-count
--warning-cpu
--warning-maxchildren
--warning-mem
--warning-pattern
--warning-regex
--warning-slowreq
Parameter types are usually:
- `type=float`
- `type=int`
- `type=lib.args.csv`
- `type=lib.args.float_or_none`
- `type=lib.args.int_or_none`
- `type=str` (the default)
- `choices=['udp', 'udp6', 'tcp', 'tcp6']`
- `action='store_true'`, `action='store_false'` for switches
Hints:
- For complex parameter tuples, use the `csv` type. `--input='Name, Value, Warn, Crit'` results in `['Name', 'Value', 'Warn', 'Crit']`
- For repeating parameters, use the `append` action. A `default` variable has to be a list then. `--input=a --input=b` results in `['a', 'b']`
- If you combine `csv` type and `append` action, you get a two-dimensional list: `--repeating-csv='1, 2, 3' --repeating-csv='a, b, c'` results in `[['1', '2', '3'], ['a', 'b', 'c']]`
- If you want to provide default values together with `append`, in `parser.add_argument()`, leave the `default` as `None`. If after `main:parse_args()` the value is still `None`, put the desired default list (or any other object) there. The primary purpose of the parser is to parse the command line - to figure out what the user wants to tell you. There's nothing wrong with tweaking (and checking) the `args` Namespace after parsing. (According to https://bugs.python.org/issue16399)
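The hints above can be combined into one small sketch. The `csv` function below is a simplified stand-in for `lib.args.csv` (an assumption, the real function may behave differently), and the default list is an arbitrary example.

```python
import argparse


def csv(arg):
    """Split a comma separated string into a list of stripped strings
    (simplified stand-in for lib.args.csv)."""
    return [item.strip() for item in arg.split(',')]


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--repeating-csv',
        action='append',   # repeated parameters are collected in a list
        type=csv,          # each occurrence itself becomes a list
        default=None,      # keep None here, see below
        dest='REPEATING_CSV',
    )
    args = parser.parse_args(argv)

    # Apply the default after parsing. With a list as `default`,
    # argparse would append user values to it instead of replacing it.
    if args.REPEATING_CSV is None:
        args.REPEATING_CSV = [['Name', 'Value', 'Warn', 'Crit']]
    return args
```

Calling `parse_args(["--repeating-csv=1, 2, 3", "--repeating-csv=a, b, c"])` yields a two-dimensional list, while `parse_args([])` falls back to the default.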
Lessons learned: When it comes to parameters, stay backwards compatible. If you have to rename or drop parameters, keep the old ones, but silently ignore them. This helps admins deploy the monitoring plugins to thousands of servers, while the monitoring server is updated later for various reasons. To be as tolerant as possible, replace the parameter's help text with `help=argparse.SUPPRESS`:
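def parse_args():
    """Parse command line arguments using argparse.
    """
    parser = argparse.ArgumentParser(description=DESCRIPTION)

    # Deprecated: still accepted so that existing service definitions keep
    # working, but hidden from the help output and otherwise ignored.
    parser.add_argument(
        '--my-old-and-deprecated-parameter',
        help=argparse.SUPPRESS,
        dest='MY_OLD_VAR',
    )

    return parser.parse_args()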
- Commit messages must start with "plugin-name: " and clearly and precisely state what has changed. Example: `about-me: Should be able to run even if psutil is or cannot be installed`.
- If there is an issue, the commit message must consist of the issue title followed by "(fix #issueno)", for example: `about-me: Add OpenVPN (fix #341)`.
- For the first commit, use the message `Add <plugin-name>`.
If a threshold has to be handled as a range parameter, this is how to interpret it. This is pretty much the same as stated in the Nagios Development Guidelines.
- simple value: a range from 0 up to and including the value
- empty value after `:`: positive infinity
- `~`: negative infinity
- `@`: if range starts with "@", then alert if inside this range (including endpoints)
Examples:
-w, -c | OK if result is | WARN/CRIT if |
---|---|---|
10 | in (0..10) | not in (0..10) |
-10 | in (-10..0) | not in (-10..0) |
10: | in (10..inf) | not in (10..inf) |
: | in (0..inf) | not in (0..inf) |
~:10 | in (-inf..10) | not in (-inf..10) |
10:20 | in (10..20) | not in (10..20) |
@10:20 | not in (10..20) | in (10..20) |
@~:20 | not in (-inf..20) | in (-inf..20) |
@ | not in (0..inf) | in (0..inf) |
So, a definition like `--warning 2:100 --critical 1:150` should return the states:
val   0   1   2  ..  100  101  ..  150  151
-w   WA  WA  OK  ..   OK   WA  ..   WA   WA
-c   CR  OK  OK  ..   OK   OK  ..   OK   CR
=>   CR  WA  OK  ..   OK   WA  ..   WA   CR
Another example: `--warning 190: --critical 200:`
val  189  190  191  ..  199  200  201
-w    WA   OK   OK  ..   OK   OK   OK
-c    CR   CR   CR  ..   CR   OK   OK
=>    CR   CR   CR  ..   CR   OK   OK
Another example: `--warning ~:0 --critical 10`
val  -2  -1   0   1  ..   9  10  11
-w   OK  OK  OK  WA  ..  WA  WA  WA
-c   CR  CR  OK  OK  ..  OK  OK  CR
=>   CR  CR  OK  WA  ..  WA  WA  CR
Have a look at `procs` on how to implement this.
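To make the range rules concrete, here is a minimal, self-contained sketch of a range parser. It is written for this guide as an illustration and is not the code used in `procs` or in the Linuxfabrik libraries; the function names `parse_range`, `is_alert` and `get_state` are assumptions. A simple negative value is treated as the range from the value up to 0, following the table above.

```python
import math


def parse_range(spec):
    """Parse a Nagios-style range into (start, end, invert)."""
    invert = spec.startswith('@')
    if invert:
        spec = spec[1:]
    if ':' in spec:
        start_s, end_s = spec.split(':', 1)
        if start_s == '~':
            start = -math.inf          # "~" means negative infinity
        elif start_s == '':
            start = 0.0                # omitted start means 0
        else:
            start = float(start_s)
        # empty value after ":" means positive infinity
        end = math.inf if end_s == '' else float(end_s)
    else:
        # simple value: range between 0 and the value
        value = float(spec) if spec else math.inf
        start, end = min(0.0, value), max(0.0, value)
    return start, end, invert


def is_alert(value, spec):
    """Return True if the value violates the range (endpoints included)."""
    start, end, invert = parse_range(spec)
    inside = start <= value <= end
    return inside if invert else not inside


def get_state(value, warning=None, critical=None):
    """Return 'CR', 'WA' or 'OK'; critical wins over warning."""
    if critical is not None and is_alert(value, critical):
        return 'CR'
    if warning is not None and is_alert(value, warning):
        return 'WA'
    return 'OK'
```

With `get_state(101, '2:100', '1:150')` the value is outside the warning range but inside the critical range, so the result is `'WA'`.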
Use `cache` if you need a simple key-value store, for example as used in `nextcloud-version`. Otherwise, use `db_sqlite` as used in `cpu-usage`.
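To illustrate the idea of a simple key-value store with expiry on top of SQLite, here is a sketch that uses only the standard library. It does not show the API of the `cache` or `db_sqlite` libraries; the class `KeyValueStore`, its methods and the table layout are invented for this example.

```python
import sqlite3
import time


class KeyValueStore:
    """Tiny key-value store with optional expiry (illustrative only)."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS cache '
            '(key TEXT PRIMARY KEY, value TEXT, expires REAL)'
        )

    def set(self, key, value, ttl=0):
        # expires == 0 means "never expires"
        expires = time.time() + ttl if ttl else 0
        self.conn.execute(
            'REPLACE INTO cache (key, value, expires) VALUES (?, ?, ?)',
            (key, value, expires),
        )
        self.conn.commit()

    def get(self, key):
        row = self.conn.execute(
            'SELECT value, expires FROM cache WHERE key = ?', (key,)
        ).fetchone()
        if row is None:
            return None
        value, expires = row
        if expires and expires < time.time():
            # entry is outdated: remove it and report a miss
            self.conn.execute('DELETE FROM cache WHERE key = ?', (key,))
            self.conn.commit()
            return None
        return value
```

A plugin could, for example, store the result of an expensive web request with a `ttl` of a few hours and only fetch it again after the entry has expired.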
- Catch exceptions using `try`/`except`, especially in functions.
- In functions, if you have to catch exceptions, on such an exception always return `(False, errormessage)`. Otherwise return `(True, result)` if the function succeeds in any way. For example, returning `(True, False)` means that the function has not raised an exception and its result is simply `False`.
- A function calling a function with such an extended error handling has to return a `(retc, result)` tuple itself.
- In `main()` you can use `lib.base.coe()` to simplify error handling.
- Have a look at `nextcloud-version` for details.
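The `(success, result)` convention might look like this in practice. All function names below are hypothetical; in particular, `continue_or_exit()` is only a simplified stand-in that shows the general idea behind a helper like `lib.base.coe()`, whose real signature and behaviour may differ.

```python
def to_number(text):
    """Return (True, number) on success, (False, errormessage) on failure."""
    try:
        return (True, int(text))
    except ValueError as e:
        return (False, 'Not a number: {}'.format(e))


def doubled(text):
    """A caller of such a function returns the same kind of tuple."""
    success, result = to_number(text)
    if not success:
        return (False, result)   # pass the error message upwards
    return (True, result * 2)


def continue_or_exit(result, state=3):
    """Return the payload on success, otherwise print the error message
    and exit with the given state (3 = UNKNOWN)."""
    if result[0]:
        return result[1]
    print(result[1])
    raise SystemExit(state)
```

In `main()`, a call like `value = continue_or_exit(doubled('21'))` then either delivers the plain result or terminates the plugin with a readable message.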
Print a short concise message in the first line within the first 80 chars if possible.
Use multi-line output for details (`msg_body`), with the most important output in the first line (`msg_header`).
Don't print "OK".
Print "[WARNING]" or "[CRITICAL]" for clarification next to a specific item using `lib.base.state2str()`.
If possible give a help text to solve the problem.
Multiple items checked, and ...
- ... everything ok? Print "Everything is ok." or the most important output in the first line, and optionally the items and their data attached in multiple lines.
- ... there are warnings or errors? Print "There are warnings." or "There are errors." or the most important output in the first line, and optionally the items and their data attached in multiple lines.
Based on parameters etc. nothing is checked at the end? Print "Nothing checked."
Wrong username or password? Print "Failed to authenticate."
Use short "Units of Measurements" without white spaces, including these terms:
- Bits: use `human.bits2human()`
- Bytes: use `human.bytes2human()`
- I/O and Throughput: `human.bytes2human() + '/s'` (Byte per Second)
- Network: "Rx/s", "Tx/s", use `human.bps2human()`
- Numbers: use `human.number2human()`
- Percentage: 93.2%
- Read/Write: "R/s", "W/s", "IO/s"
- Seconds, Minutes etc.: use `human.seconds2human()`
- Temperatures: 7.3C, 45F.
Use ISO format for date or datetime ("yyyy-mm-dd", "yyyy-mm-dd hh:mm:ss")
Print human readable datetimes and time periods ("Up 3d 4h", "2019-12-31 23:59:59", "1.5s")
"UOM" means "Unit of Measurement".
Sample:
'label'=value[UOM];[warn];[crit];[min];[max];
`label` doesn't need to be machine friendly, so `Pages scanned=100;;;;;` is as valuable as `pages-scanned=100;;;;;`.
Suffixes:
no unit specified - assume a number (int or float) of things (eg, users, processes, load averages)
s - seconds (also us, ms etc.)
% - percentage
B - bytes (also KB, MB, TB etc.). Bytes preferred, they are exact.
c - a continuous counter (such as bytes transmitted on an interface [so instead of 'B'])
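A small sketch of how a perfdata string in the sample format above could be assembled. The function `format_perfdata` is hypothetical and not part of the Linuxfabrik libraries.

```python
def format_perfdata(label, value, uom='', warn=None, crit=None,
                    minimum=None, maximum=None):
    """Build 'label'=value[UOM];[warn];[crit];[min];[max];"""
    def fmt(item):
        # optional fields stay empty if they are not set
        return '' if item is None else str(item)

    return "'{}'={}{};{};{};{};{};".format(
        label, value, uom,
        fmt(warn), fmt(crit), fmt(minimum), fmt(maximum),
    )


print(format_perfdata('pages-scanned', 100))
print(format_perfdata('usage', 93.2, '%', 80, 90, 0, 100))
```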
Wherever possible, prefer percentages over absolute values to assist users in comparing different systems with different absolute sizes.
Be aware of already-aggregated values returned by systems and applications. Apache for example returns a value "137.5 kB/request". Sounds good, but this is not a value at the current time of measurement. Instead, it is the average of all requests during the lifetime of the Apache worker process. If you use this in some sort of Grafana panel, you just get a boring line which converges towards a constant value very fast. Not useful at all.
A monitoring plugin always has to calculate such values on its own. If this is not possible because of missing data, discard them.
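One way to get a value for the current measurement interval is to store the raw cumulative counters of the previous run and work with the differences. The sketch below assumes that the application exposes total bytes and total requests as counters; the dictionary keys and the function name are made up for this example.

```python
def bytes_per_request(previous, current):
    """Bytes per request within the interval between two measurements.

    `previous` and `current` are dicts holding the cumulative counters
    'bytes' and 'requests' at the time of each measurement.
    """
    delta_requests = current['requests'] - previous['requests']
    delta_bytes = current['bytes'] - previous['bytes']
    if delta_requests <= 0 or delta_bytes < 0:
        # no requests in this interval, or the counters were reset
        # (e.g. after a restart): no meaningful value, discard it
        return None
    return delta_bytes / delta_requests


previous = {'bytes': 1000, 'requests': 10}
current = {'bytes': 4000, 'requests': 20}
# lifetime average would be 4000 / 20 = 200, but within the
# interval it is (4000 - 1000) / (20 - 10) = 300
print(bytes_per_request(previous, current))
```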
We use PEP 8 -- Style Guide for Python Code (where it makes sense).
We document our Libraries using numpydoc docstrings, so that calling `pydoc lib/base.py` works, for example.
To further improve code quality, we use PyLint like so:
- Libs: `pylint mylib.py`
- Monitoring Plugins: `pylint --disable='invalid-name, missing-function-docstring, missing-module-docstring' plugin-name`
Have a look at PyLint's message codes.
To help sort the `import`-statements we use `isort`:
# to sort all imports
isort --recursive .
# sort in a single plugin
isort plugin-name
Implementing tests:
- Use the `unittest` framework (https://docs.python.org/3/library/unittest.html). Within your `unit-test/run` file, call the plugin as a bash command, capture stdout, stderr and its return code (retc), and run your assertions against stdout, stderr and retc.
- To test a plugin that needs to run some tools that aren't on your machine or that can't provide special output, provide stdout/stderr files in `unit-test/stdout`, `unit-test/stderr` and/or `unit-test/retc` and a `--test` parameter to feed `stdout/stdout-file,stderr/stderr-file,expected-retc` into your plugin. If you get the `--test` parameter, skip the execution of your bash/psutil/whatever function.
For example, have a look at the `fs-ro` plugin on how to do this.
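The general shape of such a test can be sketched as follows. To keep the example self-contained, a one-line Python command stands in for the real plugin call; `FAKE_PLUGIN`, `run_plugin` and the test class are assumptions for this illustration and do not reproduce the `fs-ro` test.

```python
import subprocess
import sys
import unittest

# Stand-in for a real plugin call such as ['../my-plugin', '--test=...']
FAKE_PLUGIN = [sys.executable, '-c', "print('Everything is ok.')"]


def run_plugin(cmd):
    """Run the plugin and return (stdout, stderr, retc)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.stdout, proc.stderr, proc.returncode


class TestPlugin(unittest.TestCase):

    def test_if_check_runs_ok(self):
        stdout, stderr, retc = run_plugin(FAKE_PLUGIN)
        self.assertIn('Everything is ok.', stdout)
        self.assertEqual(stderr, '')
        self.assertEqual(retc, 0)  # 0 = OK
```

Saved as a file, the test case can be executed with `python3 -m unittest <file>`.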
Running a complete unit test:
# cd into the plugin directory, then:
cd unit-test
# run the Python based test:
./run
If the plugin requires `sudo`-permissions to run, please add the plugin to the `sudoers`-files for all supported operating systems in `assets/sudoers/`. The OS name should match the ansible variables `ansible_facts['distribution'] + ansible_facts['distribution_major_version']` (eg `CentOS7`). Use symbolic links to prevent duplicate files.
Attention!
The newline at the end is required!
Each plugin should provide its required Director config in form of a Director basket. The basket usually contains at least one Command, one Service Template and some associated Datafields. The rest of the Icinga Director configuration (Host Templates, Service Sets, Notification Templates, Tag Lists, etc) can be placed in the `assets/icingaweb2-module-director/all-the-rest.json` file.
The Icinga Director Basket for one or all plugins can be created using the `check2basket` tool.
After writing a new check called `new-check`, generate a basket file using:
./tools/check2basket --plugin-file check-plugins/new-check/new-check
The basket will be saved as `check-plugins/new-check/icingaweb2-module-director/new-check.json`. Inspect the basket, paying special attention to:
- Command: `timeout`
- ServiceTemplate: `check_interval`
- ServiceTemplate: `criticality`
- ServiceTemplate: `enable_perfdata`
- ServiceTemplate: `max_check_attempts`
- ServiceTemplate: `retry_interval`
Never directly edit a basket JSON file. If adjustments must be made to the basket, create a YML/YAML config file for `check2basket`.
For example, to set the timeout to 30s, to enable notifications and some other options, the config in `check-plugins/new-check/icingaweb2-module-director/new-check.yml` should look as follows:
---
variants:
- linux
- windows
overwrites:
  '["Command"]["cmd-check-new-check"]["command"]': '/usr/bin/sudo /usr/lib64/nagios/plugins/new-check'
  '["Command"]["cmd-check-new-check"]["timeout"]': 30
  '["ServiceTemplate"]["tpl-service-new-check"]["check_command"]': 'cmd-check-new-check-sudo'
  '["ServiceTemplate"]["tpl-service-new-check"]["check_interval"]': 3600
  '["ServiceTemplate"]["tpl-service-new-check"]["enable_perfdata"]': true
  '["ServiceTemplate"]["tpl-service-new-check"]["max_check_attempts"]': 5
  '["ServiceTemplate"]["tpl-service-new-check"]["retry_interval"]': 30
  '["ServiceTemplate"]["tpl-service-new-check"]["use_agent"]': false
  '["ServiceTemplate"]["tpl-service-new-check"]["vars"]["criticality"]': 'C'
Then, re-run `check2basket` to apply the overwrites:
./tools/check2basket --plugin-file check-plugins/new-check/new-check
If a parameter was added, changed or deleted in the plugin, simply re-run `check2basket` to update the basket file.
The `check2basket` tool also offers to generate so-called `variants` of the checks (different flavours of the check command call to run on different operating systems):
- `linux`: This is the default, and will be used if no other variant is defined. It generates a `cmd-check-...`, `tpl-service-...` and the associated datafields.
- `windows`: Generates a `cmd-check-...-windows`, `cmd-check-...-windows-python`, `tpl-service-...-windows` and the associated datafields.
- `sudo`: Generates a `cmd-check-...-sudo` importing the `cmd-check-...`, but with `/usr/bin/sudo` prepended to the command, and a `tpl-service...-sudo` importing the `tpl-service...`, but with the `cmd-check-...-sudo` as the check command.
- `no-agent`: Generates a `tpl-service...-no-agent` importing the `tpl-service...`, but with command endpoint set to the Icinga2 master.
Specify them in the `check-plugins/new-check/icingaweb2-module-director/new-check.yml` configuration as follows:
---
variants:
- linux
- sudo
- windows
- no-agent
To run `check2basket` against all checks, for example due to a change in the `check2basket` script itself, use:
./tools/check2basket --auto
If you want to create a Service Set, edit `assets/icingaweb2-module-director/all-the-rest.json` and append the definition using JSON. Provide new unique UUIDs. Do a syntax check using `cat assets/icingaweb2-module-director/all-the-rest.json | jq` afterwards.
If you want to move a service from one Service Set to another, you have to create a new UUID for the new service (this isn't even possible in the Icinga Director GUI).
The title of the dashboard should be capitalized, the name has to match the folder/plugin name (spaces will be replaced with `-`, `/` will be ignored. eg `Network I/O` will become `network-io`). Each Grafana panel should be meaningful, especially when comparing it to other related panels (eg memory usage and CPU usage).
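The title-to-name rule can be written down in a few lines. This is only a sketch: the function name `dashboard_name` is made up, and the conversion to lower case is an assumption derived from the `Network I/O` example.

```python
def dashboard_name(title):
    """Derive the dashboard/plugin name from a dashboard title."""
    name = title.lower()           # assumption, based on the example
    name = name.replace(' ', '-')  # spaces are replaced with "-"
    name = name.replace('/', '')   # "/" is ignored
    return name


print(dashboard_name('Network I/O'))
```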
Incomplete list of special features in some check-plugins.
README explains Python regular expression negative lookaheads to exclude matches:
Lists "Top X" values (search for `--top` parameter):
Alerts only after a certain amount of calls (search for `--count` parameter):
Cuts (truncates) its SQLite database table:
Pure/raw network communication using byte-structs and sockets:
Checks for a minimum required 3rd party library version:
"Learns" thresholds on its own (implementing some kind of "threshold warm-up"):
Ports of applications:
- disk-smart: port of GSmartControl to Python.
- All mysql-* plugins: Port of MySQLTuner to Python.
Makes use of `FREE` and `USED` wording in parameters:
`--perfdata-regex` parameter lets you filter for a subset of performance data:
Is aware of its acknowledgement status in Icinga, and will suppress further warnings if it has been ACKed:
Calculates mean and median perfdata over a set of individual items:
Supports human-readable Nagios ranges for bytes:
Sanitizes complex data before querying MySQL/MariaDB:
Reads a file line-by-line, but backwards:
Makes heavy use of patterns versus compiled regexes, matching any() of them:
Using application's config file for authentication:
- All mysql-* plugins
Optionally uses an asset:
- php-status: relies on `monitoring.php` that can provide more PHP insight in the context of the web server
Provides useful feedback from Redis' Memory Doctor:
Work without the `jolokia.war` plugin and use the native API:
- All wildfly-* checks
Supports human-readable Nagios ranges for durations:
Differentiates between Windows and Linux (search for `lib.base.LINUX` or `lib.base.WINDOWS`):