Add 4xx Alerts #203

Merged: 4 commits merged into main on Oct 24, 2024
Conversation

@saquino0827 (Contributor) commented on Oct 23, 2024

Description

  • Add Azure Alert setup for 4xx alerting
    • Terraform updates for the Action Group, the Alert, and variables for each environment (a sketch of this wiring is shown below)
    • Updated GitHub workflows for each environment to add the alert email secret
  • Created a new branch to exclude the alert Slack email secret, since we are reusing the TI applications action group

We manually tested this with a shorter lookback period in the internal environment.
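
For reviewers unfamiliar with the Azure monitor resources involved, here is a rough sketch of how the action group and per-environment alert email described above typically wire together in Terraform. The resource labels and variable name are illustrative assumptions, not necessarily the names used in this PR.

# Illustrative sketch only; resource labels and the variable name are assumptions.
variable "alert_email" {
  type        = string
  description = "Address notified on 4xx alerts; supplied per environment via a GitHub Actions secret"
}

resource "azurerm_monitor_action_group" "sftp_alerts" {
  name                = "sftp-alerts"
  resource_group_name = data.azurerm_resource_group.group.name
  short_name          = "sftpalerts"

  email_receiver {
    name          = "on-call"
    email_address = var.alert_email
  }
}

The 4xx metric alert excerpted later in this thread would then point at this group through an action { action_group_id = azurerm_monitor_action_group.sftp_alerts.id } block.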

Issue

1396


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Error Handling
The error handling in the new HTTP handler might not be sufficient. It logs the error but does not handle the client response appropriately if an error occurs during the write operation.

Conditional Deployment
The conditional deployment logic using 'count' might lead to resources not being properly deployed or updated in certain environments, which could affect monitoring and alerting reliability.
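
For context on that concern, this is roughly what count-based conditional creation looks like on the alert resource. The flag variable, resource label, and threshold/operator values are assumptions pieced together from the excerpts quoted in this thread, not a verbatim copy of the PR.

# Sketch only: a hypothetical per-environment flag gating the alert.
variable "deploy_4xx_alert" {
  type    = bool
  default = true
}

resource "azurerm_monitor_metric_alert" "http_4xx" {
  # When this evaluates to 0, Terraform creates nothing for the environment,
  # so a wrong flag value silently disables alerting rather than failing loudly.
  count               = var.deploy_4xx_alert ? 1 : 0
  name                = "sftp-http-4xx"
  resource_group_name = data.azurerm_resource_group.group.name
  scopes              = [azurerm_linux_web_app.sftp.id]
  description         = "Action will be triggered when Http Status Code 4XX is greater than or equal to 3"
  frequency           = "PT1M"

  criteria {
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http4xx"
    aggregation      = "Total"
    operator         = "GreaterThanOrEqual"
    threshold        = 3
  }

  action {
    action_group_id = azurerm_monitor_action_group.sftp_alerts.id
  }
}

Because the resource is created with count, any other reference to it has to use the [0] index, which is one way this pattern can quietly break in environments where the flag is off.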

Resolved review threads: src/cmd/main.go (outdated), operations/template/alert.tf
resource_group_name = data.azurerm_resource_group.group.name
scopes = [azurerm_linux_web_app.sftp.id]
description = "Action will be triggered when Http Status Code 4XX is greater than or equal to 3"
frequency = "PT1M" // Checks every 1 minute


To enhance the alert's responsiveness and reduce potential downtime, consider adjusting the 'frequency' and 'window_size' parameters based on the expected traffic and error rates. This could help in faster detection and resolution of issues. [important]
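
To make the two knobs concrete, here is a hypothetical tuning of the same alert fragment quoted above. The window value is a placeholder, and Azure expects the window to be no shorter than the evaluation frequency.

frequency   = "PT1M" // how often Azure evaluates the alert rule (every minute)
window_size = "PT5M" // how far back each evaluation looks; a shorter window reacts faster to spikes, a longer one smooths out noise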


PR Code Suggestions ✨

Explore these optional code suggestions:

Possible bug
Ensure that all bytes are correctly written to the response body

Consider handling the case where the WriteString method might not write all bytes to
the response body. This can be done by checking the number of bytes written and
comparing it to the length of the string being written.

src/cmd/main.go [84]

-_, err := io.WriteString(response, "400 Peters are Great")
+n, err := io.WriteString(response, "400 Peters are Great")
+if n != len("400 Peters are Great") {
+    slog.Error("Incomplete write of response body", slog.Int("bytesWritten", n))
+}
Suggestion importance[1-10]: 7

Why: This suggestion is relevant as it ensures data integrity by verifying that all bytes are written to the response body, which is crucial for reliable server responses.

Enhancement
Set the content type of the response for proper data interpretation by the client

Add a specific content type header to the response to ensure that the client
correctly interprets the response data format.

src/cmd/main.go [83]

+response.Header().Set("Content-Type", "text/plain; charset=utf-8")
 response.WriteHeader(400)
Suggestion importance[1-10]: 6

Why: Setting the content type header is a good practice to ensure that the client interprets the response data format correctly. This suggestion enhances the clarity and correctness of the HTTP response.

Best practice
Narrow down the scope of the ignore_changes lifecycle policy to essential tags

Ensure that the ignore_changes lifecycle policy is correctly scoped to prevent
unintended consequences. Overly broad use might lead to important changes being
ignored, which could affect the alert's effectiveness.

operations/template/alert.tf [30-42]

 ignore_changes = [
   tags["business_steward"],
-  ...
-  tags["zone"]
+  tags["security_compliance"],
+  tags["technical_steward"]
 ]
Suggestion importance[1-10]: 4

Why: This suggestion is valid as it aims to refine the ignore_changes policy to focus on essential tags, potentially improving the alert's responsiveness and accuracy. However, the impact is moderate as it depends on the specific operational requirements and policies.


Co-authored-by: Sylvie <[email protected]>
Co-authored-by: pluckyswan <[email protected]>
Co-authored-by: halprin <[email protected]>
criteria {
metric_namespace = "Microsoft.Web/sites"
metric_name = "Http4xx"
aggregation = "Total"
@saquino0827 (Contributor, PR author) commented:

I had to change the aggregation type to Total (a.k.a. Sum) instead of Count. Count is the number of measurements Azure takes within the granularity period, whereas Total is the sum of the metric's values within that period. For 4xx HTTP calls, Total is therefore the number of HTTP calls that returned a 4xx within the specified granularity, while Count is only the number of times Azure took a measurement within that granularity.

For example: with a granularity of 1 minute, if I trigger five 4xx HTTP calls within that minute, the Total would be five, but the Count would only be 1 because Azure measured it only once.

A contributor replied:

PR to update this in TI CDCgov/trusted-intermediary#1486


@somesylvie merged commit adb2f45 into main on Oct 24, 2024
15 checks passed
@somesylvie deleted the 4xx-alerts-fresh branch on October 24, 2024 at 19:08