Notifications: Investigate daily-ish deadlock occurrences #700

Open
hggutvik opened this issue Jan 24, 2025 · 2 comments
Labels
kind/analysis kind/bug Something isn't working

Comments

@hggutvik
Contributor

hggutvik commented Jan 24, 2025

Description of the bug

According to the logs, exceptions caused by deadlocks started being logged in October 2024, at approximately 08:00 AM almost every day.
The stack trace shows that it originates from SmsNotificationRepository.UpdateSendStatus.

The pattern first shows up in the logs in October, with some variation between environments:
Oct 10th (Prod)
Oct 27th (AT24)
Oct 28th (AT23)
Oct 29th (AT22, TT02)
YT01 doesn't show any such logged exceptions.

The pattern isn't completely regular. It can happen twice on the same day, or several days can pass without a deadlock occurring.

In Production, the recorded deadlocks between Oct 10th and 26th occurred at 07:00 AM, while they started occurring at 08:00 AM from Oct 31st.

These findings could indicate that the deadlocks originate from an automated daily job.

In AT24, the number of deadlocks appears to have increased by one order of magnitude during week 5 (graph constructed by running the query under 'Additional Information' below).

Image

Why this increase in only AT24?
A disaster recovery from backup was performed in AT24 on Jan 28th. Coincidence, or are these things related?

Steps To Reproduce

Steps to observe in logs:
Open Logs in Application Insights, and run the following query with time range set to e.g. last 7 days:

exceptions
| where cloud_RoleName startswith "platform-notifications"
| where outerMessage startswith "40P01: deadlock detected"

Additional Information

Query that counts occurrences per date:

exceptions
| where cloud_RoleName startswith "platform-notifications"
| where timestamp between (datetime("2025-01-19")..datetime("2025-02-04"))
| where outerMessage startswith "40P01: deadlock detected"
| extend date_output = floor(timestamp, 1d)
| project date_output
| summarize count() by date_output
@hggutvik hggutvik added kind/bug Something isn't working kind/analysis labels Jan 24, 2025
@Ahmed-Ghanam Ahmed-Ghanam self-assigned this Feb 5, 2025
@NathalieFroissart
Member

Find out the consequences of the deadlocks

@Ahmed-Ghanam
Contributor

Ahmed-Ghanam commented Feb 11, 2025

Findings

This issue arises because two transactions update rows in the same table but acquire their locks in a different order, causing a circular dependency (deadlock).
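
The lock-ordering problem can be illustrated with a small simulation. This is a minimal sketch, not the service's code: `simulate_locking`, the transaction names, and the row ids are all hypothetical, and the transactions are stepped in a fixed lockstep interleaving chosen to expose the cycle. In a real database the outcome depends on timing.

```python
from typing import Dict, List


def simulate_locking(lock_orders: Dict[str, List[int]]) -> List[str]:
    """Return the transactions left waiting on each other, or [] if all commit.

    Each transaction locks its rows one at a time, in the order given, and
    releases every lock once it has acquired them all (i.e. on commit).
    """
    owner: Dict[int, str] = {}  # row id -> transaction holding its lock
    position = {tx: 0 for tx in lock_orders}  # index of the next row each tx wants
    active = {tx for tx, rows in lock_orders.items() if rows}

    while active:
        progressed = False
        for tx in sorted(active):
            row = lock_orders[tx][position[tx]]
            if owner.get(row, tx) != tx:
                continue  # row is locked by another transaction: tx must wait
            owner[row] = tx
            position[tx] += 1
            progressed = True
            if position[tx] == len(lock_orders[tx]):
                # Commit: release every lock held by this transaction.
                for held in [r for r, o in owner.items() if o == tx]:
                    del owner[held]
                active.discard(tx)
        if not progressed:
            return sorted(active)  # nobody can move: circular wait
    return []


# Hypothetical transactions touching the same two rows.
opposite_order = simulate_locking({"tx_a": [1, 2], "tx_b": [2, 1]})
same_order = simulate_locking({"tx_a": [1, 2], "tx_b": [1, 2]})
print("opposite order, stuck transactions:", opposite_order)
print("same order, stuck transactions:", same_order)
```

In this model, both transactions get stuck when they lock the rows in opposite orders, while locking in a consistent order lets one transaction wait briefly and then proceed.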

Description

The Altinn Notifications service comprises several coordinated components, including:

  1. The Notifications API for registering notification orders and updating their status.
  2. The Notifications SMS API for handling notification orders where the user intends to send an SMS to one or more recipients.
  3. The Notifications Email API for handling notification orders where the user intends to send an email to one or more recipients.
  4. A cron job to trigger the endpoints responsible for dispatching new email and SMS orders.

The scheduled cron job triggers the endpoint responsible for dispatching new SMS orders. Upon invocation, the Notifications API retrieves the new SMS orders from the database, updates their status from “New” to “Sending” and enqueues each SMS order individually onto a Kafka topic.

The SMS-specific component, the Notifications SMS API, consumes these orders from the Kafka topic and transmits each order via the Link Mobility gateway. During transmission, Link Mobility returns either a unique reference identifier for a successfully queued SMS or null if the delivery attempt fails. The Notifications SMS API then replaces the null with an empty string and forwards the delivery report by posting the response to a Kafka topic.

Finally, the Notifications API retrieves these responses from the Kafka topic and updates the status of the corresponding SMS orders accordingly. In the current implementation, an OR condition determines which SMS order to update: the order is identified either by its unique SMS order identifier or by the reference returned by the Link Mobility gateway, which can be a valid reference or an empty string. When delivery fails (i.e., an empty string is returned), the update query inadvertently matches more than 12,000 rows that have an empty gateway reference, leading to unintended status updates.
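
The effect of the OR condition can be reproduced on a toy table. This is a sketch under stated assumptions: it uses an in-memory SQLite database instead of PostgreSQL, and the table name `sms_notifications`, its columns, the status values, and both helper functions are hypothetical rather than taken from the repository. `update_guarded` shows one possible way to ignore an empty reference; it is not necessarily the fix that was adopted.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table: five rows with an empty reference, one with a real one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sms_notifications ("
        "id INTEGER PRIMARY KEY, gateway_reference TEXT NOT NULL, result TEXT NOT NULL)"
    )
    rows = [(i, "", "Sending") for i in range(1, 6)] + [(6, "ref-6", "Sending")]
    conn.executemany("INSERT INTO sms_notifications VALUES (?, ?, ?)", rows)
    return conn


def update_with_or(conn: sqlite3.Connection, notification_id: int,
                   gateway_reference: str, result: str) -> int:
    """Update by id OR gateway reference; return the number of rows touched."""
    cur = conn.execute(
        "UPDATE sms_notifications SET result = ? "
        "WHERE id = ? OR gateway_reference = ?",
        (result, notification_id, gateway_reference),
    )
    return cur.rowcount


def update_guarded(conn: sqlite3.Connection, notification_id: int,
                   gateway_reference: str, result: str) -> int:
    """Same update, but match on the reference only when it is non-empty."""
    cur = conn.execute(
        "UPDATE sms_notifications SET result = ? "
        "WHERE id = ? OR (? <> '' AND gateway_reference = ?)",
        (result, notification_id, gateway_reference, gateway_reference),
    )
    return cur.rowcount


# A failed delivery for order 1 arrives with an empty gateway reference.
print("rows touched with plain OR:", update_with_or(make_demo_db(), 1, "", "Failed"))
print("rows touched with guard:", update_guarded(make_demo_db(), 1, "", "Failed"))
```

With the plain OR, the single failed delivery touches every row whose reference is empty. With thousands of such rows, each status update locks a large set of rows, which makes overlapping updates far more likely to collide.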
