WIP Azure Outage Alert #1455

Closed
wants to merge 5 commits into from

jherrflexion
Contributor

Add a PR title

Describe what changed in this PR at a high level.

Issue

Add a link to the issue here. Consider using closing keywords if this PR isn't for a story (stories will be closed through different means).

Checklist

  • I have added tests to cover my changes
  • I have added logging where useful (with appropriate log level)
  • I have added JavaDocs where required
  • I have updated the documentation accordingly

Note: You may remove items that are not applicable

Co-Authored-By: Samuel Aquino <[email protected]>

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Configuration Logic
The condition for creating the 'azurerm_monitor_activity_log_alert' resource is based on 'local.non_pr_environment'. This logic should be reviewed to ensure it aligns with the intended environments for deployment.
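A minimal sketch of the kind of conditional creation this comment refers to, assuming the resource is gated with `count` (the PR's actual mechanism is not shown here). The resource label `azure_service_alert` is hypothetical, and `local.non_pr_environment` is assumed to be a boolean defined elsewhere in the template.

```hcl
resource "azurerm_monitor_activity_log_alert" "azure_service_alert" {
  # Create the alert only when local.non_pr_environment is true;
  # PR environments get zero instances of this resource.
  count = local.non_pr_environment ? 1 : 0

  # Remaining arguments (name, resource_group_name, scopes, criteria)
  # are omitted from this sketch.
}
```

With `count` in place, any reference to the alert elsewhere has to use an index, for example `azurerm_monitor_activity_log_alert.azure_service_alert[0].id`, and must tolerate the resource being absent in PR environments.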

Hardcoded Values
The alert configuration contains hardcoded values for locations and services which might not be suitable for all deployment scenarios. Consider making these values configurable.

  category = "ServiceHealth"
  levels   = ["Error"]
  service_health {
    locations = ["East US", "Global"]

Consider parameterizing the 'locations' and 'services' fields in the service_health criteria to enhance flexibility and maintainability of the alert configuration. [important]
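One way the parameterization could look, as a rough sketch rather than the PR's implementation. The variable names, their defaults, and the resource label are hypothetical; the defaults simply mirror the hardcoded values quoted above.

```hcl
# Hypothetical variables; names and defaults are illustrative only.
variable "service_health_locations" {
  description = "Azure regions whose Service Health events should raise the alert."
  type        = list(string)
  default     = ["East US", "Global"]
}

variable "service_health_services" {
  description = "Azure services to watch; null leaves the argument unset."
  type        = list(string)
  default     = null
}

resource "azurerm_monitor_activity_log_alert" "azure_service_alert" {
  # name, resource_group_name, scopes, and other arguments are omitted
  # from this sketch.

  criteria {
    category = "ServiceHealth"
    levels   = ["Error"]

    service_health {
      locations = var.service_health_locations
      services  = var.service_health_services
    }
  }
}
```

Each environment can then override the lists in its own variable file without touching the alert definition.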


lifecycle {
  ignore_changes = [
    tags["business_steward"],

Review the necessity of ignoring so many tags in the lifecycle configuration. This could potentially lead to overlooking important changes in these tags. [important]

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Score
Possible issue
Adjust the scope to target the correct Azure resources for monitoring

Ensure that the scopes field in the azurerm_monitor_activity_log_alert resource is
correctly set to target the intended Azure resources. Currently, it is set to use
the ID of azurerm_container_registry.registry, which might not be relevant for
monitoring Azure service health.

operations/template/alert.tf [36]

-scopes              = [azurerm_container_registry.registry.id]
+scopes              = [data.azurerm_resource_group.group.id]
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies a potential misconfiguration in the 'scopes' field, which could lead to monitoring the wrong resources. Adjusting this to the correct resource group ID as suggested could significantly improve the relevance of the monitoring setup.

Score: 7
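On the scope suggestion above: Azure Service Health alerts are generally defined at subscription scope, so a subscription ID may be a better fit than either a container registry or a resource group. The sketch below shows that alternative; it is an assumption to verify against the provider documentation, not something the PR or the suggestion proposes. The data source label `current` and the resource label are hypothetical.

```hcl
# Look up the subscription that the provider is currently configured for.
data "azurerm_subscription" "current" {}

resource "azurerm_monitor_activity_log_alert" "azure_service_alert" {
  # Scope the alert to the whole subscription instead of a single resource.
  scopes = [data.azurerm_subscription.current.id]

  # Remaining arguments (name, resource_group_name, criteria) are omitted
  # from this sketch.
}
```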
Best practice
Reevaluate lifecycle ignore changes to maintain compliance and security oversight

Review the ignore_changes lifecycle configuration to ensure it aligns with the
operational requirements. Ignoring changes to critical tags like
"security_compliance" and "pii_data" might lead to oversight in compliance tracking.

operations/template/alert.tf [57-69]

 ignore_changes = [
   tags["business_steward"],
   ...
-  tags["pii_data"],
-  tags["security_compliance"],
+  tags["support_group"],
+  tags["system"],
   ...
 ]
Suggestion importance[1-10]: 6

Why: The suggestion to review the 'ignore_changes' configuration is valid as ignoring critical tags like "security_compliance" and "pii_data" could lead to compliance issues. Adjusting this configuration could enhance the security and compliance monitoring of the infrastructure.

Score: 6
Enhancement
Expand the severity levels monitored to enhance alert coverage

Consider adding more severity levels in the levels field of the criteria block to
ensure comprehensive monitoring. Currently, only "Error" level is monitored.

operations/template/alert.tf [40]

-levels   = ["Error"]
+levels   = ["Error", "Critical", "Warning"]
Suggestion importance[1-10]: 5

Why: Adding more severity levels to the monitoring criteria can help in capturing a broader range of issues, thus enhancing the alert system's effectiveness. However, the necessity of this change depends on the specific monitoring needs and might not be critical.

Score: 5
Include additional event types in the alert criteria to improve monitoring effectiveness

Verify and potentially expand the events list under service_health to include other
relevant event types like "Maintenance" alongside "Incident" to ensure all pertinent
service health issues are captured.

operations/template/alert.tf [43]

-events    = ["Incident"]
+events    = ["Incident", "Maintenance"]
Suggestion importance[1-10]: 5

Why: Including more event types such as "Maintenance" alongside "Incident" could provide a more comprehensive monitoring of service health. This suggestion is beneficial for capturing a wider range of service health issues.

Score: 5

Co-Authored-By: Samuel Aquino <[email protected]>