Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update 4XX Alert to Ignore RS Receiver Status Check #1519

Merged
merged 2 commits into from
Oct 30, 2024
Merged

Conversation

halprin
Copy link
Member

@halprin halprin commented Oct 30, 2024

Description

Our 4XX alert was constantly triggering due to RS (and other things) calling non-existent endpoints. Changed the alert to be log based and filtering out the pre-live slot, the RS receiver status check, and Microsoft pinging non-existent endpoints.

Issue

#1517

Checklist

  • Tested ignore works correctly in Internal.
  • Tested it activates correctly still in Internal.

Copy link

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

1517 - Partially compliant

Fully compliant requirements:

  • Investigate what is triggering 4xx alerts in staging.
  • Test fix in TI Internal.

Not compliant requirements:

  • Merge PR with final fix.
⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Query Logic
Ensure the query logic effectively filters out the intended 4xx errors without excluding other potential critical alerts.

action_group_id = azurerm_monitor_action_group.notify_slack_email[count.index].id
data_source_id = azurerm_linux_web_app.api.id
description = "Alert when 4xx errors cross threshold"
enabled = true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a specific log for when the alert is triggered. This would help in debugging and understanding the context of the alert when reviewing logs. [important]

Copy link
Contributor

@GilmoreA6 GilmoreA6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! testing in internal followed expectations,

Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Score
Possible issue
Refine the log filtering conditions to ensure necessary logs are not excluded

Ensure that the query does not inadvertently exclude necessary logs by refining the
conditions that filter out specific methods and URIs. Consider using more specific
conditions or adding additional URIs that might be relevant to ignore.

operations/template/alert.tf [142-148]

+query = <<-QUERY
+    AppServiceHTTPLogs |
+    where CsHost !contains "pre-live" |
+    where UserAgent != "AlwaysOn" |
+    where not(CsMethod == "GET" and CsUriStem startswith "/v1/etor/") |  // ignore RS receiver status check that uses GET
+    where not(CsMethod == "GET" and CsUriStem startswith "/admin/") |  // ignore Microsoft hitting non-existent URLs
+    where ScStatus >= 400 and ScStatus < 500
+  QUERY
 
-
Suggestion importance[1-10]: 5

Why: The suggestion to refine log filtering conditions is relevant as it ensures important logs are not excluded, which is crucial for accurate monitoring. However, the improved code provided is identical to the existing code, indicating no specific changes were suggested.

5
Performance
Optimize the alert's frequency and time window for effective monitoring

Consider adjusting the frequency and time_window parameters to optimize the alert's
responsiveness and accuracy, ensuring it effectively captures relevant data without
causing unnecessary noise.

operations/template/alert.tf [152-153]

+frequency = 5
+time_window = 60
 
-
Suggestion importance[1-10]: 4

Why: Optimizing the frequency and time window can enhance the alert system's effectiveness. However, the suggestion lacks specific improvements or alternatives, making it less impactful.

4
Maintainability
Ensure all necessary tags are included in the ignore_changes lifecycle configuration

Review the ignore_changes lifecycle configuration to ensure it includes all
necessary tags that should not trigger a redeployment or update of the resource,
potentially adding more tags as needed.

operations/template/alert.tf [162-167]

+ignore_changes = [
+    tags["business_steward"],
+    tags["center"],
+    tags["environment"],
+    tags["escid"],
 
-
Suggestion importance[1-10]: 4

Why: Ensuring all necessary tags are included in the ignore_changes configuration is important for maintainability. However, the suggestion does not provide specific tags to add, limiting its practical applicability.

4
Best practice
Check and adjust the severity level of the alert to match organizational standards

Verify the severity level set for the alert to ensure it aligns with organizational
standards for 4xx error monitoring, potentially adjusting it if necessary.

operations/template/alert.tf [151]

+severity = 3
 
-
Suggestion importance[1-10]: 3

Why: This suggestion is a good practice to ensure the severity level aligns with organizational standards. However, it's more of a verification step rather than a direct code improvement, hence the moderate score.

3

Copy link

@halprin halprin merged commit 21fedf5 into main Oct 30, 2024
17 checks passed
@halprin halprin deleted the 4xx-alert-ignore-rs branch October 30, 2024 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants