
Setup for 5xx Error Alerts #204

Merged: 3 commits merged into main on Oct 24, 2024
Conversation

@pluckyswan (Contributor) commented Oct 24, 2024

Description

Added Terraform for 5xx alerts for the SFTP ingestion service. Tested in the internal environment by triggering a 500 error to confirm the Slack activation and deactivation alerts.
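For context, a minimal sketch of what such a metric alert might look like in `operations/template/alert.tf`, pieced together from the snippets quoted in the review below; the alert name, scoped resource, aggregation, and threshold are assumptions, not taken from the PR:

```hcl
resource "azurerm_monitor_metric_alert" "azure_5XX_alert" {
  count               = local.non_pr_environment ? 1 : 0
  name                = "sftp-ingestion-5xx-alert"             # assumed name
  resource_group_name = data.azurerm_resource_group.group.name # assumed reference
  scopes              = [azurerm_linux_web_app.sftp.id]        # assumed app resource
  description         = "Alert when the SFTP ingestion service returns HTTP 5xx errors"

  criteria {
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http5xx"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 0 # assumed threshold
  }

  action {
    action_group_id = data.azurerm_monitor_action_group.notify_slack_email[count.index].id
  }
}
```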

Issue

CDCgov/trusted-intermediary#1394

Checklist

Note: You may remove items that are not applicable

Co-authored-by: Sylvie <[email protected]>
Co-authored-by: James Herr <[email protected]>
@pluckyswan pluckyswan changed the title Setup for 5xx Alerts Setup for 5xx Error Alerts Oct 24, 2024

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Configuration Consistency
The lifecycle configuration to ignore changes to tags might not be consistent with the organization's policy on managing infrastructure as code. This could lead to discrepancies in the environment configuration that are not tracked through version control.

tags["support_group"],
tags["system"],
tags["technical_steward"],
tags["zone"]


Consider adding a severity label to the alert definition to help in categorizing and prioritizing alerts. This can be done by adding a new tag for severity in the tags section. [important]


PR Code Suggestions ✨

Explore these optional code suggestions:

Enhancement
Add a recovery action to notify when the system status returns to normal

Consider adding a recovery action or notification when the number of 5XX errors
falls below the threshold to inform stakeholders when the system status returns to
normal.

operations/template/alert.tf [64-66]

 action {
   action_group_id = data.azurerm_monitor_action_group.notify_slack_email[count.index].id
+  # Consider adding a recovery action here
 }
Suggestion importance[1-10]: 6

Why: Adding a recovery action is a valuable enhancement for better monitoring and notification management. This suggestion could significantly improve the alert's functionality by informing stakeholders of status normalization.
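The "deactivation alert" mentioned in the PR description is typically handled by the `auto_mitigate` argument of `azurerm_monitor_metric_alert`, which defaults to `true` and lets Azure mark the alert resolved when the metric recovers; whether the action group then posts a resolved notification to Slack depends on the action group's configuration. A hedged sketch:

```hcl
resource "azurerm_monitor_metric_alert" "azure_5XX_alert" {
  # ... criteria and action blocks as in the PR ...

  # When true (the provider default), Azure resolves the alert once the metric
  # falls back below the threshold, producing the deactivation signal.
  auto_mitigate = true
}
```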

Possible issue
Adjust the condition to correctly enable the alert in the intended environments

Ensure that the count condition for creating the "azure_5XX_alert" resource matches the intended environments. As written, `local.non_pr_environment ? 1 : 0` creates the alert only in non-PR environments; if the alert should instead exist in PR environments, the condition should be reversed.

operations/template/alert.tf [48]

-count = local.non_pr_environment ? 1 : 0
+count = local.non_pr_environment ? 0 : 1
Suggestion importance[1-10]: 5

Why: The suggestion to review the 'count' condition is valid as it impacts the deployment of the alert in different environments. However, without specific knowledge of the intended deployment strategy, it's difficult to assess if the suggested change is correct, hence a moderate score.
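The PR does not show how `local.non_pr_environment` is defined; one plausible (assumed, not confirmed) shape, which would enable the alert in every deployed environment except ephemeral PR ones:

```hcl
locals {
  # Assumption for illustration: PR environments are named like "pr-123".
  # startswith() requires Terraform >= 1.3.
  non_pr_environment = !startswith(var.environment, "pr")
}
```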

Confirm the accuracy of the metric namespace and name for the alert

Verify the metric namespace and name to ensure they correctly target the desired
Azure resource and metric for monitoring HTTP 5XX errors.

operations/template/alert.tf [57-58]

 metric_namespace = "Microsoft.Web/sites"
-metric_name      = "Http5xx"
+metric_name      = "Http5xx"  # Confirm this is the correct metric name for monitoring
Suggestion importance[1-10]: 4

Why: The suggestion to verify the metric namespace and name is a good practice to ensure correct monitoring setup. However, it does not directly improve or correct the code, rather it prompts a verification step.

Best practice
Ensure critical tags are not ignored in the lifecycle configuration

Review the list of ignored changes in the lifecycle block to ensure that no critical
tags that might affect resource management or compliance are being overlooked.

operations/template/alert.tf [70-82]

 ignore_changes = [
   tags["business_steward"],
   ...
   tags["zone"]
+  # Ensure no critical tags are ignored
 ]
Suggestion importance[1-10]: 4

Why: Reviewing ignored tags is a best practice to ensure important configurations are not inadvertently skipped. However, this suggestion is more about caution and does not directly address a specific issue in the PR.


@halprin (Member) left a comment:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks hot to trot!

@pluckyswan pluckyswan merged commit ab270c2 into main Oct 24, 2024
15 checks passed
@pluckyswan pluckyswan deleted the 5xx-alertz branch October 24, 2024 20:00