This repository has been archived by the owner on Feb 7, 2025. It is now read-only.

Azure Alerts for Errors - Azure is down #1397

Closed
22 tasks done
scleary1cs opened this issue Oct 8, 2024 · 8 comments

Comments

@scleary1cs
Contributor

scleary1cs commented Oct 8, 2024

Story

As a developer, I need to know if Azure is down, so that we can begin an incident.

Acceptance Criteria

  • Alert is working in Slack

Tasks

  • Create alert in TI alert.tf

Definition of Done

  • Documentation tasks completed
    • Documentation and diagrams created or updated
      • ADRs (/adr folder)
      • Main README.md
      • Other READMEs in the repo
      • If applicable, update the ReportStream Setup section in README.md
    • Threat model updated
    • API documentation updated
  • Code quality tasks completed
    • Code refactored for clarity and no design/technical debt
    • Adhere to separation of concerns; code is not tightly coupled, especially to 3rd party dependencies
  • Testing tasks completed
    • Load tests passed
    • Additional e2e tests created
    • Additional RS e2e assertions created in the rs-e2e project for any new transformations. Includes improvements to the assertion code required to make the new assertions
  • Build & Deploy tasks completed
    • Build process updated
    • API(s) are versioned
    • Feature toggles created and/or deleted. Document the feature toggle
    • Source code is merged to the main branch

Notes

@pluckyswan
Contributor

Created an alert under Service Health > Health alerts in internal; it is currently disabled.

@jherrflexion
Contributor

Curious if this would be too noisy if it is looking at every Azure service? Will attempt to test this and convert to Terraform today.

@halprin
Member

halprin commented Oct 18, 2024

Curious if this would be too noisy if it is looking at every Azure service? Will attempt to test this and convert to Terraform today.

Yeah, we definitely don't want to be looking at services we don't use. And yes, we want all these alert stories done via Terraform.
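For illustration only (this is not from the thread or the eventual PR): a minimal Terraform sketch of a Service Health alert limited to specific services, assuming the `azurerm` provider's `azurerm_monitor_activity_log_alert` resource. The resource name, variables, regions, and service names below are hypothetical placeholders, not the project's actual values.

```hcl
# Hypothetical sketch: all names, variables, regions, and services are placeholders.
variable "resource_group_name" {
  type = string
}

variable "subscription_id" {
  type = string
}

# Assumed to be an existing action group that forwards notifications to Slack.
variable "slack_action_group_id" {
  type = string
}

resource "azurerm_monitor_activity_log_alert" "azure_service_health" {
  name                = "azure-service-health-alert"
  resource_group_name = var.resource_group_name
  scopes              = ["/subscriptions/${var.subscription_id}"]
  description         = "Notify when Azure reports an incident for services we use"

  # Newer azurerm provider versions may also require a `location` argument here;
  # check the provider documentation for the version in use.

  criteria {
    category = "ServiceHealth"

    service_health {
      # Only service incidents, not planned maintenance or advisories.
      events = ["Incident"]
      # Restrict to the regions and services actually in use to reduce noise.
      locations = ["East US"]
      services  = ["App Service", "Key Vault"]
    }
  }

  action {
    action_group_id = var.slack_action_group_id
  }
}
```

Narrowing `services` and `locations` is one way to address the noise concern raised above; leaving them unset would match every service and region.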

@jherrflexion
Contributor

Created the azure-outage-alert branch. Enabled the clickops alert in internal for testing.

@jherrflexion
Contributor

Currently blocked by a Terraform "CreateOrUpdate" error on the test PR. Rerunning the job a few times and drafting a new PR both produced the same error.

@pluckyswan
Contributor

PR is out.

@jherrflexion
Contributor

Addressing PR comments.

@halprin
Member

halprin commented Oct 22, 2024

Had to revert this work because, sadly, it was failing deploys in staging. See the revert PR for the reasoning. Because of this, I moved this story back into In Progress.

@halprin halprin self-assigned this Oct 24, 2024

4 participants