Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setup Azure Alert for Azure is down #1464

Merged
merged 16 commits into from
Oct 22, 2024

Conversation

jherrflexion
Copy link
Contributor

@jherrflexion jherrflexion commented Oct 21, 2024

Setup Azure Alert for Azure is down

Added terraform for creating Azure Alert for when Azure is down.

Issue

#1397

Checklist

Note: You may remove items that are not applicable

Copy link

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Configuration Logic
The logic to enable or disable the alert based on the environment (non-production vs production) should be reviewed to ensure it aligns with operational requirements.

Hardcoded Values
The alert configuration uses hardcoded values for alert levels and event types which might need to be configurable based on the environment or specific requirements.

operations/template/alert.tf Show resolved Hide resolved
operations/template/alert.tf Show resolved Hide resolved
operations/template/alert.tf Show resolved Hide resolved
Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Score
Possible issue
Adjust the scopes attribute to monitor the correct Azure resources

Ensure that the scopes attribute in the azurerm_monitor_activity_log_alert resource
is correctly set to monitor the intended Azure resources. Currently, it is set to
monitor only the Azure Container Registry (azurerm_container_registry.registry.id).
If the intention is to monitor more resources or different types, this should be
adjusted accordingly.

operations/template/alert.tf [36]

-scopes              = [azurerm_container_registry.registry.id]
+scopes              = [azurerm_resource_group.group.id, azurerm_container_registry.registry.id]
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies a potential issue with the scope of resources being monitored, which is crucial for the alert's effectiveness. Adjusting the scope could significantly improve monitoring coverage.

7
Possible bug
Ensure dynamic referencing in action_group_id is accurate for alert routing

Ensure that the action_group_id attribute dynamically references the correct action
group by verifying the indexing and conditions used. This is crucial to ensure that
alerts are sent to the correct recipients.

operations/template/alert.tf [49]

-action_group_id = azurerm_monitor_action_group.notify_slack_email[count.index].id
+action_group_id = azurerm_monitor_action_group.notify_slack_email[local.non_pr_environment ? 0 : 1].id
Suggestion importance[1-10]: 7

Why: Ensuring accurate dynamic referencing in action_group_id is crucial for alert routing to the correct recipients. This suggestion addresses a potential bug that could affect the alert system's reliability.

7
Enhancement
Expand the severity levels to enhance monitoring effectiveness

Consider adding more severity levels in the levels attribute within the criteria
block of the azurerm_monitor_activity_log_alert resource. Currently, it only
triggers on "Error". Including other levels like "Warning" might help in proactive
monitoring and issue resolution.

operations/template/alert.tf [40]

-levels   = ["Error"]
+levels   = ["Error", "Warning"]
Suggestion importance[1-10]: 6

Why: Expanding the severity levels can indeed enhance the monitoring by catching issues before they escalate. This is a valuable enhancement for proactive monitoring.

6
Best practice
Review the ignore_changes settings to ensure critical updates are not overlooked

Verify the ignore_changes lifecycle configuration to ensure it aligns with the
operational requirements. Ignoring changes to many tags might lead to overlooking
important updates that could affect resource management and compliance.

operations/template/alert.tf [56-68]

 ignore_changes = [
   tags["business_steward"],
-  ...
   tags["zone"]
 ]
Suggestion importance[1-10]: 5

Why: The suggestion to review ignore_changes is valid as it could prevent overlooking important updates. However, the suggestion lacks specific improvements and mainly asks for a review, which is less actionable.

5

@pluckyswan pluckyswan changed the title Azure outage alert pain and suffering Setup Azure Alert for Azure is down Oct 21, 2024
@pluckyswan pluckyswan marked this pull request as ready for review October 21, 2024 19:39
Comment on lines 23 to 26
variable "service_health_locations" {
type = string
default = "global"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? Do we actually customize this to anything? I feel we could just hardcode "global" in the azurerm_monitor_activity_log_alert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed variable for now as its only used in this file. If more instances pop up, will refactor later.

resource "azurerm_monitor_activity_log_alert" "azure_service_health_alert" {
count = local.non_pr_environment ? 1 : 0
name = "cdcti-${var.environment}-azure-status-alert"
location = "global"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be data.azurerm_resource_group.group.location? For every other resource that asks for a resource_group_name and location, we've put data.azurerm_resource_group.group.location for the location.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

Copy link
Member

@halprin halprin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The System is Down

@jherrflexion jherrflexion merged commit ebd3e21 into main Oct 22, 2024
17 checks passed
@jherrflexion jherrflexion deleted the azure-outage-alert-pain-and-suffering branch October 22, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants