Releases: m-lab/script-exporter-support

Fixes a bug in the apply-tc-rules.service

30 Apr 17:40
b3c3467

The systemd unit apply-tc-rules.service is run once a day by the systemd timer apply-tc-rules.timer, but the service's ExecStart directive was passing the -ti flags to docker. Because the timer runs the service through a mechanism that cannot start a pty, the daily run was failing. This release includes a commit that simply removes the -ti flags, which should resolve the issue.

Explicitly starts docker.service on boot

04 Apr 17:30
ce61d31

Fixes this bug: m-lab/ops-tracker#348

... which is summed up in this comment:

coreos/bugs#369 (comment)

Updates tc traffic shaping rules on a daily basis

03 Apr 17:22
17968e5

The script_exporter Docker container must have tc traffic shaping rules in place to throttle the ndt-e2e test. If they are not in place for a given IP address, then the test will refuse to run. Previously, the rules were getting out of sync with reality (i.e., sites.py), and updating them would either be manual or require a redeployment of the entire VM. This release adds a new systemd timer on the host VM that should run a docker-exec command on a daily basis (at midnight) to update the tc rules.

Spreads out NDT e2e tests over cache expiry interval

26 Feb 17:34
52138ca

This release fixes an issue whereby the NDT e2e tests were all being run at the same time rather than spread out. This was causing large spikes in resource usage on the VM, and in some extreme cases was causing the OOM killer to kill the Docker container running script_exporter.

The ndt_e2e.sh script caches the result of the end-to-end test in a local file for each node. It was previously setting the expiry of each cache file to exactly 10 minutes. However, since Prometheus probes the script every minute, all cache files were expiring within 0 to 60 seconds of each other, which produced floods of tests. This release brings in a fix that randomizes the expiry of each cache file between 0 and 10 minutes, which should spread the testing roughly evenly over a 10-minute period and, with it, the load caused by nodejs.
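
The effect of randomizing the expiry can be illustrated with a small simulation. This is a Python sketch, not the actual ndt_e2e.sh logic; the function names, the node count of 500, and the assumption that all cache files are written within one 60-second probe cycle are hypothetical and chosen only for illustration.

```python
import random

CACHE_MAX_AGE = 600  # seconds; the 10-minute window described above


def fixed_expiry(now):
    """Old behaviour: every cache file expires exactly 10 minutes from now."""
    return now + CACHE_MAX_AGE


def randomized_expiry(now, rng=random):
    """New behaviour: each cache file expires at a random point in the next 10 minutes."""
    return now + rng.uniform(0, CACHE_MAX_AGE)


def busiest_minute(expiries):
    """Return the largest number of expiries that fall into one 60-second bucket."""
    buckets = {}
    for t in expiries:
        minute = int(t // 60)
        buckets[minute] = buckets.get(minute, 0) + 1
    return max(buckets.values())


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = 500  # hypothetical node count
    # Assume every cache file is written within the same 60-second probe cycle.
    written = [rng.uniform(0, 60) for _ in range(nodes)]
    print("fixed:", busiest_minute([fixed_expiry(t) for t in written]))
    print("randomized:", busiest_minute([randomized_expiry(t, rng) for t in written]))
```

With a fixed expiry, the expiries stay as tightly clustered as the write times were, so the re-tests arrive as a burst. With a randomized expiry, they are spread across the whole window.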

Always [re]start Docker containers

20 Feb 21:37
e19a30d

This release contains two updates:

  • bumps the GCE machine type for the mlab-oti project to n1-standard-8, since n1-highmem-4 apparently did not provide enough CPU.

  • adds the --restart=always flag to the docker-run commands so that containerd will always restart the containers if they stop for any reason (e.g., the GCE instance was restarted).

Adds node_exporter to GCE instance

12 Feb 22:02
bd77add

The main purpose of this release is to run a Prometheus node_exporter instance in the GCE VM so that we can easily monitor resource usage of the VM.

The release also sets the GCE machine type for the mlab-oti project to n1-highmem-4 to account for how resource-hungry nodejs is.

Adds ndt_e2e result caching + ndt_queue now returns actual RC

06 Feb 00:12
26b7678

This PR includes several improvements:

  • The ndt_e2e script now caches the result of the previous test. If the test result is a pass (return code 0) then the script will return the cached value for 10 minutes. If the return code is not 0, then the script will run the end to end test on every probe (once a minute). This allows us to probe the ndt_e2e script every minute without overloading any servers by testing every minute, and has the added benefit that tests will run more frequently when NDT is down so that a recovery will be detected much sooner.
  • Some changes were made to the script_exporter code such that it now returns a new metric named script_exit_code. This is useful especially for the ndt_queue script to help us differentiate actual queueing from just a failed test (e.g., a DNS error, a transient network condition, etc.).
  • The script_exporter fork was moved from nkinkade's personal GitHub account into the m-lab account.
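
The caching rule in the first item can be sketched as follows. This is an illustrative Python version, not the ndt_e2e script itself; the in-memory dictionary, the `probe` function, and the `run_test` callable are hypothetical stand-ins for the script's cache file and the real end-to-end test.

```python
import time

CACHE_TTL = 600  # seconds; a passing result is reused for 10 minutes


def probe(cache, target, run_test, now=None):
    """Return an exit code for target, reusing a cached pass while it is fresh.

    cache maps target -> (exit_code, timestamp). run_test is a callable that
    runs the real end-to-end test against target and returns its exit code.
    """
    now = time.time() if now is None else now
    cached = cache.get(target)
    if cached is not None:
        exit_code, stamp = cached
        # Only a pass (exit code 0) is served from the cache, and only while fresh.
        if exit_code == 0 and now - stamp < CACHE_TTL:
            return exit_code
    # No entry, a stale entry, or a previous failure: run the test again.
    exit_code = run_test(target)
    cache[target] = (exit_code, now)
    return exit_code
```

Because a failure is never served from the cache, a failing target is re-tested on every probe, which is what allows a recovery to be noticed within about a minute.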

First production release

29 Jan 22:37
2c55250

This is the first production release of this service. As a start, it makes two probes available:

  • ndt_e2e: Runs a rate-limited NDT test against the specified target.
  • ndt_queue: Checks whether the specified target (and NDT server) is queueing.