Workaround lengthy startup of many instances
When a pool is empty of instances, the launch-stagger mechanism can
introduce a substantial delay to achieving a full-pool of active
workers.  This will negatively impact service availability and worker
utilization - likely resulting in CI tasks queuing.

Add a simple workaround for this condition with the addition of a
`--force` option.  When used, it will force instance creation on
all available dedicated hosts.  Similarly it will also force instance
setup, though with an extended shutdown delay timer.

Update documentation regarding this operational mode and its purpose.

Signed-off-by: Chris Evich <[email protected]>
cevich committed Nov 20, 2023
1 parent 71622bf commit 92a52d8
Showing 5 changed files with 61 additions and 21 deletions.
18 changes: 16 additions & 2 deletions mac_pw_pool/LaunchInstances.sh
@@ -119,6 +119,10 @@ else
else
dbg "launchtimes=[${launchtimes[*]}]"
for launch_time in "${launchtimes[@]}"; do
+if [[ "$launch_time" == "" ]] || [[ "$launch_time" == "null" ]]; then
+warn "Ignoring empty/null instance launch time."
+continue
+fi
# Assume launch_time is never malformed
launched_hour=$(date -u -d "$launch_time" "$dcmpfmt")
latest_launched_hour=$(date -u -d "$latest_launched" "$dcmpfmt")
@@ -138,12 +142,22 @@ msg "Operating on $n_dh_total dedicated hosts at $(date -u -Iseconds)"
msg " ${_n_dh_sp}Last instance launch on $latest_launched"
echo -e "# $(basename ${BASH_SOURCE[0]}) run $(date -u -Iseconds)\n#" > "$TEMPDIR/$(basename $DHSTATE)"

+# When initializing a new pool of workers, it would take many hours
+# to wait for the staggered creation mechanism on each host. This
+# would negatively impact worker utilization. Provide a workaround.
+force=0
+# shellcheck disable=SC2199
+if [[ "$@" =~ --force ]]; then
+warn "Forcing instance creation: Ignoring staggered creation limits."
+force=1
+fi
+
for name_hostid in "${NAME2HOSTID[@]}"; do
n_dh=$(($n_dh+1))
_I=" "
msg " " # make output easier to read

-read -r name hostid<<<"$name_hostid"
+read -r name hostid junk<<<"$name_hostid"
msg "Working on Dedicated Host #$n_dh/$n_dh_total '$name' for HostID '$hostid'."

hostoutput="$TEMPDIR/${name}_host.output" # JSON or error message from aws describe-hosts
@@ -204,7 +218,7 @@ for name_hostid in "${NAME2HOSTID[@]}"; do
now_hour=$(date -u "$dcmpfmt")
dbg "launch_threshold_hour=$launch_threshold_hour"
dbg " now_hour=$now_hour"
-if [[ $now_hour -lt $launch_threshold_hour ]]; then
+if [[ "$force" -eq 0 ]] && [[ $now_hour -lt $launch_threshold_hour ]]; then
msg "Cannot launch new instance until $launch_threshold"
echo "# $name HOST THROTTLE: Inst. creation delayed until $launch_threshold" > "$inststate"
continue
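The throttle bypass in `LaunchInstances.sh` comes down to two decisions: whether `--force` was passed, and whether the host is still inside its stagger window. Below is a minimal sketch of that logic. The function names `has_force_flag` and `may_launch` are hypothetical and do not appear in the script, and the hour stamps are assumed to be plain integers in a form such as `YYYYmmddHH`. One deliberate difference: the commit's `[[ "$@" =~ --force ]]` matches `--force` anywhere in the joined argument string, whereas this sketch swaps in an exact per-argument comparison.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of LaunchInstances.sh.

# Return 0 when any argument is exactly "--force".
has_force_flag() {
    local arg
    for arg in "$@"; do
        if [[ "$arg" == "--force" ]]; then
            return 0
        fi
    done
    return 1
}

# Return 0 when a new instance may be launched on a host.
#   $1 = force (0 or 1)
#   $2 = current hour stamp (assumed integer, e.g. 2023112010)
#   $3 = earliest permitted launch hour stamp (same assumed format)
may_launch() {
    local force="$1" now_hour="$2" threshold_hour="$3"
    # Mirrors the diff: throttle only when not forced AND still too early.
    if [[ "$force" -eq 0 ]] && [[ "$now_hour" -lt "$threshold_hour" ]]; then
        return 1
    fi
    return 0
}
```

With these definitions, `may_launch 0 2023112010 2023112012` fails (throttled), while `may_launch 1 2023112010 2023112012` succeeds because the force flag skips the stagger check.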
9 changes: 9 additions & 0 deletions mac_pw_pool/README.md
@@ -56,6 +56,15 @@ From a PR perspective, there is zero control over which instance you
get. It could easily be one somebody's previous task barfed all over
and ruined.

+## Initialization
+
+When no dedicated hosts have instances running, complete creation and
+setup will take many hours. This may be bypassed by *manually* running
+`LaunchInstances.sh --force`. The operator should then wait 20 minutes
+before *manually* running `SetupInstances.sh --force`. This delay
+is necessary to account for the time a Mac instance takes to boot and
+become ssh-able.
+
## Security

To thwart attempts to hijack or use instances for nefarious purposes,
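The manual procedure added to the README (force launch, wait about 20 minutes, force setup) could be wrapped as shown below. This is a rough sketch, not something the commit provides. The `init_pool` function, the `DRY_RUN` variable, and the `./` script paths are assumptions for illustration. The real scripts may also expect a particular working directory or environment.

```bash
#!/usr/bin/env bash
# Sketch only: init_pool and DRY_RUN are hypothetical, not part of the repo.

# Force-create instances, wait for them to boot, then force setup.
#   $1 = minutes to wait between the two steps (README suggests 20)
init_pool() {
    local wait_minutes="${1:-20}"
    local run=()
    # With DRY_RUN set, print each command instead of executing it.
    if [[ -n "${DRY_RUN:-}" ]]; then
        run=(echo)
    fi
    "${run[@]}" ./LaunchInstances.sh --force || return 1
    # Mac instances need time to boot and become reachable over ssh.
    "${run[@]}" sleep "$((wait_minutes * 60))" || return 1
    "${run[@]}" ./SetupInstances.sh --force || return 1
}
```

Running `DRY_RUN=1 init_pool 20` prints the three commands without touching any hosts, which is a cheap way to check the sequence before a real run.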
24 changes: 20 additions & 4 deletions mac_pw_pool/SetupInstances.sh
@@ -68,9 +68,20 @@ fi
# N/B: Assumes $DHSTATE represents reality
msg "Operating on $n_inst_total instances from $(head -1 $DHSTATE)"
echo -e "# $(basename ${BASH_SOURCE[0]}) run $(date -u -Iseconds)\n#" > "$TEMPDIR/$(basename $PWSTATE)"
-# Indent for messages inside loop
+
+# Assuming the `--force` option was used to initialize a new pool of
+# workers, then instances need to be configured with a self-termination
+# shutdown delay. This ensures future replacement-instance creation
+# is staggered, so as to maximize overall worker utilization.
+term_addtl=0
+# shellcheck disable=SC2199
+if [[ "$@" =~ --force ]]; then
+warn "Forcing instance creation: Ignoring staggered creation limits."
+term_addtl=1 # Multiples of 2-hours to add to self-termination delay
+fi
+
for _dhentry in "${_dhstate[@]}"; do
-read -r name instance_id launch_time<<<"$_dhentry"
+read -r name instance_id launch_time junk<<<"$_dhentry"
_I=" "
msg " "
n_inst=$(($n_inst+1))
@@ -158,14 +169,19 @@ for _dhentry in "${_dhstate[@]}"; do
continue
fi

+dbg "Additional term-delay hours: $term_addtl"
+
# Run setup script in background b/c it takes ~5-10m to complete.
$SSH ec2-user@$pub_dns \
env POOLTOKEN=$POOLTOKEN \
-bash -c '/var/tmp/setup.sh &> setup.log & disown %-1'
+bash -c "/var/tmp/setup.sh $(($term_addtl * 2)) &> setup.log & disown %-1"

-msg "Setup script started"
+msg "Setup script started w/ $(($term_addtl * 2))hour(s) additional shutdown delay"
set_pw_status setup started

+# When starting multiple instances, force self-termination staggering.
+term_addtl=$(($term_addtl+1))
+
# Let it run in the background
continue
fi
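The staggering in `SetupInstances.sh` can be followed with a small worked example. The counter `term_addtl` starts at 1 under `--force` (0 otherwise), is multiplied by 2 to give extra hours, and is incremented each time a setup is started. The sketch below reproduces only that arithmetic. The function name `stagger_delays` is hypothetical, and it assumes every instance has its setup started in the same run. The base lifetime of 22 hours comes from `PWLIFE=$((22+${1:-0}))` in `setup.sh`.

```bash
#!/usr/bin/env bash
# Sketch only: stagger_delays is illustrative, not part of SetupInstances.sh.

# Print the additional shutdown-delay hours handed to setup.sh for each
# instance, one per line.
#   $1 = 1 when --force was given, otherwise 0
#   $2 = number of instances whose setup is started in this run (assumed)
stagger_delays() {
    local forced="$1" count="$2"
    local term_addtl=0 i
    if [[ "$forced" -eq 1 ]]; then
        term_addtl=1   # forced pools start with one 2-hour step
    fi
    for ((i = 0; i < count; i++)); do
        echo "$((term_addtl * 2))"
        term_addtl=$((term_addtl + 1))
    done
}

# Total lifetime in hours for a given additional delay, per setup.sh.
lifetime_hours() {
    echo "$((22 + ${1:-0}))"
}
```

Under these assumptions, three instances set up with `--force` receive 2, 4, and 6 additional hours, giving lifetimes of 24, 26, and 28 hours, so their self-terminations (and therefore their replacements) end up two hours apart.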
19 changes: 7 additions & 12 deletions mac_pw_pool/service_pool.sh
@@ -3,6 +3,8 @@
# Launch Cirrus-CI PW Pool listener & manager process.
# Intended to be called once from setup.sh on M1 Macs.
# Expects configuration filepath to be passed as the first argument.
+# Expects the number of hours until shutdown (and self-termination)
+# as the second argument.

set -eo pipefail

@@ -17,6 +19,9 @@ msg "Listener started at $(date -u -Iseconds)"
[[ -r "$1" ]] || \
die "Can't read configuration file '$1'"

+[[ -n "$2" ]] || \
+die "Expecting shutdown delay hours as second argument"
+
# For whatever reason, when this script is run through ssh, the default
# environment isn't loaded automatically.
. /etc/profile
@@ -27,24 +32,14 @@ PWUSER=$PWINST-worker
die "Unexpectedly empty instance name, is metadata tag access enabled?"

PWCFG="$1"
-
-# CI effectively allows unmitigated access to run or host any
-# process or content on this instance as $PWUSER. Limit the
-# potential blast-radius of any nefarious use by restricting
-# the lifetime of the instance. If this ends up disturbing
-# a running task, Cirrus will automatically retry on another
-# available pool instance. Shutdown instance after this many
-# hours servicing the pool. Note: It's randomized slightly
-# to prevent instances going down at similar times.
-_rndadj=$((RANDOM%8-4)) # +/- 4 hours
-PWLIFE=$((24+$_rndadj))
+PWLIFE="$2"

# Configuring a launchd agent to run the worker process is a major
# PITA and seems to require rebooting the instance. Work around
# this with a really hacky loop masquerading as a system service.
# Run it in the background to allow this setup script to exit.
# N/B: CI tasks have access to kill the pool listener process!
-expires=$(($(date -u "+%Y%m%d%H") + $PWLIFE))
+expires=$(date -u "+%Y%m%d%H" -d "+$PWLIFE hours")
while [[ -r $PWCFG ]]; do
# Don't start new pool listener if it or a CI agent process exist
if ! pgrep -u $PWUSER -f -q "cirrus worker run" && ! pgrep -u $PWUSER -q "cirrus-ci-agent"; then
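The change to `expires` in `service_pool.sh` is more than cosmetic. The old expression added hours directly to a `YYYYmmddHH` integer, and once the sum passes hour 23 the result no longer corresponds to a real hour. The sketch below contrasts the two approaches. The function names are hypothetical, and for a reproducible example it does the addition in epoch seconds instead of using the commit's `-d "+$PWLIFE hours"` form. Both rely on GNU `date`, which `setup.sh` puts on the path via the coreutils `gnubin` directory.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not part of service_pool.sh.
# Requires GNU date (for the -d option).

# Old approach: integer addition on a YYYYmmddHH stamp.
#   $1 = hour stamp, $2 = hours to add
naive_expiry() {
    echo "$(($1 + $2))"
}

# Calendar-aware approach: add hours in epoch seconds, then format.
#   $1 = start time in seconds since the epoch, $2 = hours to add
expiry_stamp() {
    date -u -d "@$(($1 + $2 * 3600))" "+%Y%m%d%H"
}

# 2023-11-20 23:00 UTC is 1700521200 seconds since the epoch.
naive_expiry 2023112023 22    # prints 2023112045 (no such hour exists)
expiry_stamp 1700521200 22    # prints 2023112121 (21:00 UTC the next day)
```

Handing the addition to `date` lets the calendar carry over hours into days, months, and years, which the integer form cannot do.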
12 changes: 9 additions & 3 deletions mac_pw_pool/setup.sh
@@ -6,6 +6,9 @@
# The instance must have both "metadata" and "Allow tags in
# metadata" options enabled. The instance must set the
# "terminate" option for "shutdown behavior".
+#
+# Script accepts a single argument: The number of hours to
+# delay self-termination (including 0).

set -eo pipefail

@@ -28,6 +31,8 @@ die() { echo "ERROR: ${1:-No error message provided}"; exit 1; }
[[ "$USER" == "ec2-user" ]] || \
die "Expecting to execute as 'ec2-user'."

+
+
msg "Configuring paths"
grep -q homebrew /etc/paths || \
echo -e "/opt/homebrew/bin\n/opt/homebrew/opt/coreutils/libexec/gnubin\n$(cat /etc/paths)" \
@@ -88,7 +93,8 @@ fi
#
# * Increase value to improve instance CI-utilization.
# * Reduce value to lower instability & security risk.
-PWLIFE=22
+# * Additional hours argument is optional.
+PWLIFE=$((22+${1:-0}))

if ! id "$PWUSER" &> /dev/null; then
sudo sysadminctl -addUser $PWUSER
@@ -119,8 +125,8 @@ sudo chown ${USER}:staff $PWLOG
sudo chmod g+rw $PWLOG

if ! pgrep -q -f service_pool.sh; then
-msg "Starting listener supervisor process"
-/var/tmp/service_pool.sh "$PWCFG" &
+msg "Starting listener supervisor process w/ ${PWLIFE}hour lifetime"
+/var/tmp/service_pool.sh "$PWCFG" "$PWLIFE" &
disown %-1
else
msg "Warning: Listener supervisor already running"