Adds a new CLUO operator and agent workloads for platform nodes (#226)
* Renames update-operator to update-operator-master since we want to have CLUO for platform nodes too.

* Renames update-operator.jsonnet to update-operator-master.jsonnet since we need another operator for the platform too.

* Adds a CLUO agent daemonset for platform nodes.

* Renames update-agent.jsonnet to include a 'master' modifier to distinguish it from the agent file for platform nodes.

* Updates system.jsonnet to find renamed update-operator workloads and the new update-operator workloads for platform nodes.

* Adds a CLUO deployment for platform nodes.

* Schedules the master cluster CLUO operator on the Prometheus server cloud node. Previously it was on a master node, like a snake eating its tail.

* Since the master cluster CLUO operator is on the Prometheus server, schedule the platform cluster CLUO operator there too, for consistency.

* Adds -before-reboot-annotations to the CLUO operator deployments for both master and platform nodes.

* Updates the -before-reboot-annotations flag values.

* Adds an annotation to the master node so that CLUO for the master cluster can identify master nodes.

* Adds the 'patch' verb to the things that the update operator role can do to nodes.

* Renames the role file for CLUO to be simpler and shorter.

* Adds a new template function CluoAnnotation(), and runs that function from each update-agent DaemonSet.

* Fixes a number of syntax and format errors.

* Fixes typo 'alpine:lastest' to use 'latest'.

* Installs curl and adds line continuation chars at the end of each line of the CluoAnnotation initContainer.

* Moves apk-update and apk-add to first line.

* Uses the reboot-coordinator service account explicitly, rather than defining it but then using the namespace's default SA.

* Attempts to use an initContainer to write a node annotation script to a volume which the update-agent container will mount and run before running the agent.

* Adds a missing comma to update-agent-master.jsonnet.

* Fixes the syntax and spelling of the volumeMounts sections of the initContainers.

* Moves update-operator node annotation to a configmap.

* Removes old reference to CluoAnnotation in templates.jsonnet.

* Adds new configs for the update-operator.

* Adds the update-operator ConfigMap to the reboot-coordinator namespace.

* Changes the format of 'command' for update-agents to something that will hopefully work.

* Runs annotate-node.sh with sh, since it isn't executable.

* Adds single quotes around patch JSON.

* Uses double quotes instead of single quotes for JSON patch to avoid conflict with enclosing single quotes.

* Wraps curl header fields in single quotes.

* Removes single quotes from curl arguments.

* Adds single quotes around JSON patch.

* Removes single quote wrapping from JSON patch.

* Changes update operator annotation from mlab-type-<type> to mlab-type/<type> for better readability.

* Removes the annotation prefix because it didn't really comply with the prefix specification, which is supposed to be a domain name-style identifier.

* Goes back to using all dashes for the update operator node annotations, because I just didn't like the underscore.

* Removes the 'master' update operator and agent, since we will now just use update-operator for rolling reboots of platform nodes only.

* Renames the 'platform' update operator and agent to just update-operator and update-agent, since we will now only run a single operator and agent for platform nodes only.

* Removes the -master and -platform suffixes from the update operator and agent DaemonSet and Deployment, and changes the -before-reboot-annotations value to 'mlab-reboot-ok'.

* Sets the update-operator to run on a master node again, now that we aren't using it to reboot master nodes.

* Removes the special update-operator annotation from the master nodes, now that master nodes won't be rebooted by update-operator.

* Adds a new reboot-node.service, along with a Timer to execute it once a day, and writes the file to be executed.

* Uses the short weekday name instead of the long one.

* Configures a 'reboot day' for each of the master nodes. Each one will reboot on a different day of the week.

* Updates system.jsonnet to account for the fact that only a single update-operator and agent now exist.

* Do not enable or start the reboot-node.service, as this will cause a reboot loop. The only thing that should run this service is its associated Timer.

* Renames allocate_new_cloud_node.sh to something more intuitive for the platform cluster.

* Calls newly renamed add_platform_cluster_cloud_node.sh.

* Removes the facility for annotating a node for the update-operator, since that annotation will have to be applied by an operator or some operator script.

* Expands a comment about tolerations.

* Adds an additional safety check in the reboot-node.service script such that a reboot will not occur if the etcd cluster has fewer than 3 members.

* Restructures reboot-node script with slightly easier to follow logic.
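
The back-and-forth over quoting the JSON patch in the entries above comes down to keeping the JSON's double quotes literal while still substituting shell values. A minimal sketch of one way to do that, assuming a merge-patch body that sets a single node annotation; the function name `annotation_patch`, the annotation value, and the curl invocation in the comment are illustrative and are not taken from this commit:

```shell
#!/bin/bash
# Builds a JSON merge-patch body that sets one annotation on a node.
# The single-quoted printf format keeps the JSON double quotes literal,
# and the %s placeholders take the key and value as separate arguments.
function annotation_patch {
  local key=$1
  local value=$2
  printf '{"metadata":{"annotations":{"%s":"%s"}}}' "${key}" "${value}"
}

# Hypothetical use (APISERVER and NODE_NAME are placeholders, not from the commit):
#   curl --request PATCH \
#     --header 'Content-Type: application/merge-patch+json' \
#     --data "$(annotation_patch mlab-type-platform true)" \
#     "${APISERVER}/api/v1/nodes/${NODE_NAME}"
annotation_patch "mlab-type-platform" "true"
echo
```

Because the format string is single-quoted and the values arrive as arguments, no layer of quoting has to be escaped inside another, which is the conflict several of the entries above were working around.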
nkinkade authored Aug 5, 2019
1 parent 0672156 commit 0702299
Showing 10 changed files with 69 additions and 13 deletions.
11 changes: 6 additions & 5 deletions k8s/daemonsets/core/update-agent.jsonnet
@@ -65,13 +65,14 @@
   },
 ],
 nodeSelector: {
-  'node-role.kubernetes.io/master': '',
+  'mlab/type': 'platform',
 },
+serviceAccountName: 'reboot-coordinator',
+// This is a pod that should be scheduled under every possible
+// circumstance, so tolerate everything.
 tolerations: [
   {
-    effect: 'NoSchedule',
-    key: 'node-role.kubernetes.io/master',
-    operator: 'Exists',
+    operator: 'Exists'
   },
 ],
 volumes: [
@@ -104,7 +105,7 @@
 },
 updateStrategy: {
   rollingUpdate: {
-    maxUnavailable: 1,
+    maxUnavailable: 2,
   },
   type: 'RollingUpdate',
 },
6 changes: 2 additions & 4 deletions k8s/deployments/update-operator.jsonnet
@@ -26,8 +26,7 @@
 containers: [
   {
     args: [
-      '-reboot-window-start=Tue 15:00',
-      '-reboot-window-length=2h',
+      '-before-reboot-annotations=mlab-reboot-ok',
     ],
     command: [
       '/bin/update-operator',
@@ -49,10 +48,9 @@
 nodeSelector: {
   'node-role.kubernetes.io/master': '',
 },
+serviceAccountName: 'reboot-coordinator',
 tolerations: [
   {
-    effect: 'NoSchedule',
-    key: 'node-role.kubernetes.io/master',
     operator: 'Exists',
   },
 ],
@@ -25,6 +25,7 @@
 verbs: [
   'get',
   'list',
+  'patch',
   'watch',
   'update',
 ],
@@ -96,7 +97,7 @@
 subjects: [
   {
     kind: 'ServiceAccount',
-    name: 'default',
+    name: 'reboot-coordinator',
     namespace: 'reboot-coordinator',
   },
 ],
File renamed without changes.
4 changes: 3 additions & 1 deletion manage-cluster/bootstrap_platform_cluster.sh
@@ -522,6 +522,8 @@ gcloud compute firewall-rules create ${GCE_BASE_NAME}-internal \
 #
 ETCD_CLUSTER_STATE="new"

+idx=0
 for zone in $GCE_ZONES; do
-  create_master $zone
+  create_master $zone ${REBOOT_DAYS[$idx]}
+  idx=$(( idx + 1 ))
 done
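
The loop above pairs each zone with a reboot day by position. A minimal sketch of that pairing, assuming the sample values `b c d` and `Tue Wed Thu` from k8s_deploy.conf; `create_master` is stubbed here to print its arguments rather than provision anything, and `create_all_masters` is a hypothetical wrapper added only so the loop can be called and checked:

```shell
#!/bin/bash
# Illustrative values only; the real ones come from k8s_deploy.conf.
GCE_ZONES="b c d"
REBOOT_DAYS=(Tue Wed Thu)

# Stub standing in for the real create_master, which provisions a GCE instance.
function create_master {
  local zone=$1
  local reboot_day=$2
  echo "zone=${zone} reboot_day=${reboot_day}"
}

# Walks the zones in order and hands each one the reboot day at the same index.
function create_all_masters {
  local idx=0
  local zone
  for zone in $GCE_ZONES; do
    create_master "$zone" "${REBOOT_DAYS[$idx]}"
    idx=$(( idx + 1 ))
  done
}

create_all_masters
```

If GCE_ZONES ever held more entries than REBOOT_DAYS, the extra masters would receive an empty reboot day, since an unset array index expands to an empty string in bash.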
2 changes: 1 addition & 1 deletion manage-cluster/bootstrap_prometheus.sh
@@ -129,7 +129,7 @@ fi
 #######################################################

 # Create the new node
-./allocate_new_cloud_node.sh -p "${PROJECT}" \
+./add_platform_cluster_cloud_node.sh -p "${PROJECT}" \
   -m "${MACHINE_TYPE}" \
   -n "${PROM_BASE_NAME}" \
   -a "${PROM_BASE_NAME}" \
6 changes: 6 additions & 0 deletions manage-cluster/bootstraplib.sh
@@ -3,6 +3,7 @@

 function create_master {
   local zone=$1
+  local reboot_day=$2

   gce_zone="${GCE_REGION}-${zone}"
   gce_name="master-${GCE_BASE_NAME}-${gce_zone}"
@@ -156,6 +157,11 @@ function create_master {
 # Binaries will get installed in /opt/bin, put it in root's PATH
 echo "export PATH=$PATH:/opt/bin" >> /root/.bashrc
+# Write out the reboot day to a file in /etc. The reboot-node.service
+# systemd unit will read the contents of this file to determine when to
+# reboot the node.
+echo -n "${reboot_day}" > /etc/reboot-node-day
 # Install CNI plugins.
 mkdir -p /opt/cni/bin
 curl -L "https://github.com/containernetworking/plugins/releases/download/${K8S_CNI_VERSION}/cni-plugins-amd64-${K8S_CNI_VERSION}.tgz" | tar -C /opt/cni/bin -xz
42 changes: 42 additions & 0 deletions manage-cluster/cloud-config_master.yml
@@ -75,6 +75,28 @@ coreos:
         [Install]
         WantedBy=multi-user.target
+    - name: reboot-node.service
+      content: |
+        [Unit]
+        Description=reboot-node.service
+        [Service]
+        Type=oneshot
+        ExecStart=/opt/bin/reboot-node
+    - name: reboot-node.timer
+      enable: "true"
+      command: "start"
+      content: |
+        [Unit]
+        Description=Run reboot-node.service daily
+        [Timer]
+        OnCalendar=Mon..Fri 15:00:00
+        [Install]
+        WantedBy=multi-user.target
 write_files:
   - path: /etc/ssh/sshd_config
     permissions: 0600
@@ -103,3 +125,23 @@ write_files:
     content: |
       fs.inotify.max_user_watches=131072
+  # The smallest of scripts to reboot the machine.
+  - path: /opt/bin/reboot-node
+    permissions: 0744
+    owner: root:root
+    content: |
+      #!/bin/bash
+      REBOOT_DAY=$(cat /etc/reboot-node-day)
+      TODAY=$(date +%a)
+      ETCD_MEMBERS=$(/usr/bin/etcdctl member list | wc -l)
+      if [[ "${REBOOT_DAY}" != "${TODAY}" ]]; then
+        echo "Reboot day ${REBOOT_DAY} doesn't equal today: ${TODAY}. Not rebooting."
+        exit 0
+      fi
+      if [[ "${ETCD_MEMBERS}" -lt "3" ]]; then
+        echo "There are less than 3 etcd cluster members. Not rebooting."
+        exit 1
+      fi
+      echo "Reboot day ${REBOOT_DAY} equals today: ${TODAY}. Rebooting node."
+      /usr/sbin/reboot
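
The two guards in the reboot-node script can be exercised without touching a real node. A minimal sketch, assuming the same two conditions as the script above; the function name `reboot_decision` and its result strings are hypothetical, and the function only prints a decision and never calls reboot:

```shell
#!/bin/bash
# Prints the decision the two guards would reach for the given inputs.
# Arguments: configured reboot day, today's short weekday name, etcd member count.
function reboot_decision {
  local reboot_day=$1
  local today=$2
  local etcd_members=$3

  # Guard 1: only act on the configured day of the week.
  if [[ "${reboot_day}" != "${today}" ]]; then
    echo "skip-wrong-day"
    return 0
  fi
  # Guard 2: refuse to reboot when the etcd cluster has fewer than 3 members.
  if [[ "${etcd_members}" -lt 3 ]]; then
    echo "skip-etcd-degraded"
    return 0
  fi
  echo "reboot"
}

reboot_decision "Tue" "$(date +%a)" 3
```

As in the script, the second guard only blocks the reboot when the count is below 3, so a count above 3 would still allow it. The day comparison relies on `date +%a` producing the same short English names stored in /etc/reboot-node-day, which depends on the node's locale.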
6 changes: 6 additions & 0 deletions manage-cluster/k8s_deploy.conf
@@ -61,6 +61,12 @@ GCE_ZONES_mlab_oti="b c d"
 GCS_BUCKET_EPOXY_mlab_oti="epoxy-mlab-oti"
 GCS_BUCKET_K8S_mlab_oti="k8s-support-mlab-oti"

+# The days on which the master nodes will be rebooted automatically. The days
+# map to three GCE_ZONES defined for each project. That is, the first day in
+# the below array will apply to the first GCE_ZONE defined for the project, and
+# so on.
+REBOOT_DAYS=(Tue Wed Thu)

 # Whether the script should exit after deleting all existing GCP resources
 # associated with creating this k8s cluster. This could be useful, for example,
 # if you want to change various object names, but don't want to have to
2 changes: 1 addition & 1 deletion system.jsonnet
@@ -30,7 +30,7 @@
 // Networks (which are in array form already).
 import 'k8s/networks/networks.jsonnet',
 // Roles (which are in array form already).
-import 'k8s/roles/container-linux-update-coordinator.jsonnet',
+import 'k8s/roles/update-operator.jsonnet',
 import 'k8s/roles/flannel.jsonnet',
 import 'k8s/roles/kube-rbac-proxy.jsonnet',
 import 'k8s/roles/kube-state-metrics.jsonnet',