
On Call Runbooks

Gabriel Zurita edited this page Nov 12, 2024 · 11 revisions

This guide outlines the discrete steps required for on-call duties, including deploying new versions of our software.


Common Issues and Troubleshooting

SecRel Failures

Common Causes:

  • Vulnerabilities detected by Aqua or Snyk
  • Dependency issues

Resolution Steps:

  1. Identify the failing gate check.
  2. Consult the SecRel Getting Started on VRO guide.
  3. Update dependencies or apply suppressions as necessary.

Deployment Failures

Symptoms:

  • Unhealthy applications in ArgoCD

Actions:

  1. Diagnose and fix issues if possible.
  2. Delay deployment if necessary.
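
As a rough aid for the diagnosis step, the sketch below flags applications that are not both Healthy and Synced. It assumes the application list has been exported as JSON (for example with `argocd app list -o json`) and only reads the `metadata.name`, `status.health.status`, and `status.sync.status` fields. The application names in the sample are hypothetical placeholders, not real VRO apps.

```python
import json


def unhealthy_apps(apps):
    """Return (name, health, sync) for every app that is not Healthy and Synced."""
    problems = []
    for app in apps:
        name = app.get("metadata", {}).get("name", "<unnamed>")
        status = app.get("status", {})
        # Missing fields are reported as "Unknown" so they are never silently passed.
        health = status.get("health", {}).get("status", "Unknown")
        sync = status.get("sync", {}).get("status", "Unknown")
        if health != "Healthy" or sync != "Synced":
            problems.append((name, health, sync))
    return problems


# Hypothetical sample shaped like an exported ArgoCD application list.
SAMPLE_APPS = json.loads("""
[
  {"metadata": {"name": "example-app-a"},
   "status": {"health": {"status": "Healthy"}, "sync": {"status": "Synced"}}},
  {"metadata": {"name": "example-app-b"},
   "status": {"health": {"status": "Degraded"}, "sync": {"status": "Synced"}}},
  {"metadata": {"name": "example-app-c"},
   "status": {"health": {"status": "Healthy"}, "sync": {"status": "OutOfSync"}}}
]
""")

if __name__ == "__main__":
    for name, health, sync in unhealthy_apps(SAMPLE_APPS):
        print(f"{name}: health={health} sync={sync}")
```

An empty result only means ArgoCD reports every app as healthy and synced; partner teams still need to validate their own applications.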

Deployment Process

Step 1: Prepare for Deployment

  1. On the first Tuesday of each new sprint, check #benefits-vro-on-call for a message from the Partner Team Production Deployment Slack Workflow (the workflow is currently triggered manually, but will be automated).
  2. Verify there is a recent eligible build:
    • In the abd-vro-internal repository on GitHub, under the Actions tab, check the latest successful (Internal) SecRel workflow run.
    • If SecRel has not passed for several days, delay deployment and investigate the cause. Click "Not Ready" to halt the workflow if issues can’t be quickly fixed; otherwise, proceed.
    • Keep this tab open and refer to the image tag in the GHCR Summary section in future steps.
  3. If a Sign Images build is ready (example here), click "Ready" to proceed.
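
The "recent eligible build" check above can be sketched as a small helper. This is only an illustration: it assumes the SecRel workflow runs have already been fetched as a list of dictionaries carrying the `conclusion`, `created_at`, and `run_number` fields that the GitHub Actions REST API returns for workflow runs. The 3-day threshold is an assumed stand-in for "several days", and the run numbers are made up.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    """Parse a GitHub-style UTC timestamp such as 2024-11-12T15:04:05Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def latest_successful_run(runs):
    """Return the most recent run whose conclusion is 'success', or None."""
    successful = [r for r in runs if r.get("conclusion") == "success"]
    if not successful:
        return None
    return max(successful, key=lambda r: parse_ts(r["created_at"]))


def is_stale(run, now, max_age_days=3):
    """True when there is no successful run or it is older than max_age_days.

    max_age_days=3 is an assumption, not a value taken from the runbook.
    """
    if run is None:
        return True
    return now - parse_ts(run["created_at"]) > timedelta(days=max_age_days)


# Hypothetical sample data; real runs come from the abd-vro-internal Actions tab.
SAMPLE_RUNS = [
    {"run_number": 101, "conclusion": "failure", "created_at": "2024-11-11T09:00:00Z"},
    {"run_number": 100, "conclusion": "success", "created_at": "2024-11-10T09:00:00Z"},
    {"run_number": 99, "conclusion": "success", "created_at": "2024-11-05T09:00:00Z"},
]

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    run = latest_successful_run(SAMPLE_RUNS)
    print("Not Ready" if is_stale(run, now) else f"Ready (run {run['run_number']})")
```

The image tag itself still has to be read from the GHCR Summary section of the selected run; this helper only answers whether a sufficiently recent successful run exists.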

Step 2: Coordinate with Partner Teams

  1. An automated message will ask partner teams in #benefits-vro-support to opt in or opt out (the workflow will also send a message to the #benefits-vro-on-call channel).
  2. If a team opts out, exclude their applications from the deployment.

Step 3: Deploy to Lower Environments

  1. Begin deployment to lower environments by EOD Tuesday or Wednesday morning:

    • In the #benefits-vro-on-call channel workflow, click Deploy: lower env in each partner team’s Slack thread.
    • A GitHub ticket for the deployment will be created; add the VRO-team and deployments labels, link it to the current sprint and Partner team request epic, and assign both on-call engineers.
  2. Build the release:

    • Create a branch in va-abd-rrd-argocd-applications-vault with the format releases/sprint-*, for example releases/sprint-5.
    • For each app under the deploy directory that is included in this deployment (excluding apps of partner teams that opted out), update the imageTag field in the dev.yaml, qa.yaml, and sandbox.yaml configuration files to the image tag of the latest successful SecRel run. See example PR here.
    • Push changes and get secondary approval.
  3. Deploy changes to lower environments:

    • Merge the PR, initiate sync, and monitor each environment:

      • Monitor On-Call Alerts: Check the on-call alerts channels for any issues.
      • Verify Sync in ArgoCD for Dev Environment:
        • In ArgoCD (namespace: va-abd-rrd-dev), confirm that, for every deployed application, the last sync timestamp of each pod matches the deployment time.
        • If an application does not sync automatically, manually initiate the "Sync" action.
      • Sync QA and Sandbox Environments Manually:
        • QA and Sandbox require manual sync. Repeat the following steps for QA (va-abd-rrd-qa) and Sandbox (va-abd-rrd-sandbox):
          1. Select the appropriate namespace (e.g., va-abd-rrd-qa).
          2. Click Sync Apps.
          3. Choose ALL.
          4. Click Sync.
    • Troubleshoot as Needed: Diagnose and resolve any issues that arise during deployment.

  4. Validate with Partner Teams:

    • Notify partner teams via Slack to validate their applications in each lower environment, up to and including sandbox.
    • Partner teams must validate their applications’ health.
    • If any application is unhealthy, coordinate with the partner team to determine whether to patch or defer.
    • If a partner team opts out, have them click Opt-Out in #benefits-vro-support and revert the image tag change in the repository.
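
The "Build the release" edits in Step 3 can be scripted. The following is a minimal sketch under stated assumptions, not the team's actual tooling: it assumes a layout of `deploy/<app>/<env>.yaml` (the runbook only says the apps live under the deploy directory) and that each file contains a single-line `imageTag: <value>` entry. It rewrites that line with a regular expression rather than a YAML parser, so any trailing comment on the line is dropped. The function names and the `opted_out` parameter are hypothetical.

```python
import re
from pathlib import Path

# Environment files named in the runbook for lower environments.
LOWER_ENV_FILES = ("dev.yaml", "qa.yaml", "sandbox.yaml")

# Matches a whole line such as "  imageTag: abc123", capturing indentation and key.
IMAGE_TAG_LINE = re.compile(r"^([ \t]*imageTag:[ \t]*).*$", re.MULTILINE)


def release_branch(sprint):
    """Branch name in the releases/sprint-* format, e.g. releases/sprint-5."""
    return f"releases/sprint-{sprint}"


def set_image_tag(yaml_text, new_tag):
    """Return yaml_text with every imageTag value replaced by new_tag."""
    return IMAGE_TAG_LINE.sub(lambda m: m.group(1) + new_tag, yaml_text)


def update_lower_envs(deploy_dir, new_tag, opted_out=()):
    """Update imageTag in each app's lower-env files; return the changed paths.

    Assumes deploy_dir/<app>/<env>.yaml. Directories named in opted_out
    (apps of partner teams that opted out) are skipped.
    """
    changed = []
    for app_dir in sorted(Path(deploy_dir).iterdir()):
        if not app_dir.is_dir() or app_dir.name in opted_out:
            continue
        for env_file in LOWER_ENV_FILES:
            path = app_dir / env_file
            if not path.is_file():
                continue
            original = path.read_text()
            updated = set_image_tag(original, new_tag)
            if updated != original:
                path.write_text(updated)
                changed.append(path)
    return changed
```

Review the resulting diff before pushing; the PR still needs secondary approval, and an opt-out after this point means reverting that team's image tag change by hand.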

Step 4: Production Deployment (Thursday Morning)

  1. Start production deployment:

    • Click Deploy: production in the Slack thread of each partner team that opted in.
    • In va-abd-rrd-argocd-applications-vault, open a PR that updates the imageTag fields in prod-test.yaml and prod.yaml, and get secondary approval.
  2. Deploy to production:

    • Merge the PR, manually sync the changes, and verify the health of the va-abd-rrd-prod-test environment in ArgoCD.
      • If a platform app is unhealthy, attempt to diagnose the issue before deciding to roll back; for the rollback itself, follow the same steps as above.
    • In ArgoCD, manually sync and monitor each va-abd-rrd-prod instance.
  3. Complete/Validate production deployment:

    • Click Validate in the #benefits-vro-on-call Slack thread to confirm app health with partner teams.
    • If rollback is needed for any app, follow the rollback steps.
    • Once partner teams have validated that their apps are working, they sign off on their deployment using the workflow, and an automatic confirmation message is sent to the thread in #benefits-vro-on-call.

Step 5: Close Out Deployment

  1. After all validations, close the GitHub deployment tickets.

Dependabot

  1. #TODO
  2. #TODO