Releases: m-lab/script-exporter-support
Fixes a bug in the apply-tc-rules.service
The systemd unit apply-tc-rules.service is run once a day by the systemd timer apply-tc-rules.timer, but the ExecStart directive of the service was using the -ti flags for docker. Since this is run daily by a mechanism that cannot start a pty, the timer was failing. This release includes a commit which simply removes the -ti flags, which should resolve this issue.
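The release itself only removes the flags. As an illustration of why they matter, the following Python sketch builds a docker exec argument list and adds -ti only when a terminal is attached to stdin. The function name, container name, and command are hypothetical and do not come from this repository.

```python
import sys


def docker_exec_args(container, command, stdin=sys.stdin):
    """Build a `docker exec` argument list.

    -ti is added only when stdin is a terminal. Under a systemd timer
    there is no terminal, so the flags are left out.
    """
    args = ["docker", "exec"]
    if stdin is not None and stdin.isatty():
        args.append("-ti")
    args.append(container)
    args.extend(command)
    return args


if __name__ == "__main__":
    # Hypothetical container name and command, for illustration only.
    print(docker_exec_args("script_exporter", ["/bin/true"]))
```

Run from an interactive shell, the printed list includes -ti; run from a timer or with stdin redirected, it does not.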
Explicitly starts docker.service on boot
Fixes this bug: m-lab/ops-tracker#348
... which is summed up in this comment:
Updates tc traffic shaping rules on a daily basis
The script_exporter Docker container must have tc traffic shaping rules in place to throttle the ndt-e2e test. If they are not in place for a given IP address, then the test will refuse to run. Previously, the rules were getting out of sync with reality (i.e., sites.py), and updating them required either manual intervention or a redeployment of the entire VM. This release adds a new systemd timer on the host VM that should run a docker-exec command on a daily basis (at midnight) to update the tc rules.
Spreads out NDT e2e tests over cache expiry interval
This release fixes an issue whereby the NDT e2e tests were being run at the same time rather than spread out. This was causing large spikes in resource usage on the VM, and in some extreme cases was causing the OOM killer to kill the Docker container running script_exporter.
The ndt_e2e.sh script caches the results of the end to end test in a local file for each node. It was previously setting the expiry of the cache file to exactly 10 minutes. However, since Prometheus probes it every minute, all cache files were expiring between 0 and 60 seconds apart, meaning floods of tests. This release brings in a fix that randomizes the expiry of each cache file between 0 and 10 minutes, which should roughly spread the testing out over a period of 10 minutes, which should also spread the load caused by nodejs.
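The randomized expiry can be sketched as follows. This is a Python illustration of the idea, not the code in ndt_e2e.sh; the function names and the use of timestamps rather than cache files are assumptions made for the example.

```python
import random

CACHE_MAX_AGE = 600  # seconds (10 minutes), per the description above


def pick_expiry(now, rng=random):
    """Return an expiry timestamp between 0 and 10 minutes after `now`."""
    return now + rng.uniform(0, CACHE_MAX_AGE)


def is_expired(expiry, now):
    """A cached result is stale once `now` reaches its expiry timestamp."""
    return now >= expiry


if __name__ == "__main__":
    rng = random.Random()
    start = 0.0
    # Hypothetical node names; each gets its own randomly chosen expiry.
    for node in ["node-a", "node-b", "node-c"]:
        print(node, round(pick_expiry(start, rng) - start), "seconds until expiry")
```

Because each node draws its own expiry, the tests no longer all come due within the same minute.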
Always [re]start Docker containers
This release contains two updates:
- Bumps the GCE machine type for the mlab-oti project to n1-standard-8, since n1-highmem-4 was apparently not enough CPU.
- Adds --restart=always to the docker-run commands so that containerd will always start the containers if they stop for any reason (e.g., the GCE instance was restarted).
Adds node_exporter to GCE instance
The main purpose of this release is to run a Prometheus node_exporter instance in the GCE VM so that we can easily monitor resource usage of the VM.
The release also sets the GCE machine type for the mlab-oti project to n1-highmem-4 to account for how resource hungry nodejs is.
Adds ndt_e2e result caching + ndt_queue now returns actual RC
This PR includes several improvements:
- The ndt_e2e script now caches the result of the previous test. If the test result is a pass (return code 0) then the script will return the cached value for 10 minutes. If the return code is not 0, then the script will run the end to end test on every probe (once a minute). This allows us to probe the ndt_e2e script every minute without overloading any servers by testing every minute, and has the added benefit that tests will run more frequently when NDT is down so that a recovery will be detected much sooner.
- Some changes were made to the script_exporter code such that it now returns a new metric named script_exit_code. This is useful especially for the ndt_queue script to help us differentiate actual queueing from just a failed test (e.g., a DNS error, a transient network condition, etc.).
- The script_exporter fork was moved from nkinkade's personal GitHub account into the m-lab account.
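The caching behaviour described in the first item can be sketched roughly as below. This is a Python illustration only: the real ndt_e2e script is a shell script that caches to a local file per node, while this sketch uses an in-memory dict, and the names probe, cache, and run_test are hypothetical.

```python
import time

CACHE_TTL = 600  # seconds; a passing result is reused for 10 minutes


def probe(cache, target, run_test, now=None):
    """Return the exit code for `target`, reusing a recent passing result.

    A cached return code of 0 is served until it is CACHE_TTL seconds old.
    A non-zero return code is never served from cache, so the test is
    re-run on every probe until it passes again.
    """
    now = time.time() if now is None else now
    entry = cache.get(target)
    if entry is not None:
        rc, stamp = entry
        if rc == 0 and now - stamp < CACHE_TTL:
            return rc
    rc = run_test(target)
    cache[target] = (rc, now)
    return rc
```

With a probe every 60 seconds, a healthy target is tested about once per 10 minutes, while a failing target is tested on every probe.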
First production release
This is the first production release of this service. As a start, it makes available two probes:
- ndt_e2e: Runs a rate-limited NDT test against the specified target.
- ndt_queue: Checks whether the specified target (and NDT server) is queueing.