This issue arises because two transactions update rows in the same table in different orders, creating a circular lock dependency (a deadlock).
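A minimal sketch of such a circular wait, using a hypothetical table name and hypothetical ids/status values (the actual Notifications schema may differ):

```sql
-- Two concurrent transactions update the same two rows in opposite order
-- (hypothetical table/column names, run in two separate sessions).

-- Transaction A
BEGIN;
UPDATE smsnotifications SET result = 'Sending'  WHERE id = 1;  -- A locks row 1

-- Transaction B (concurrently)
BEGIN;
UPDATE smsnotifications SET result = 'Accepted' WHERE id = 2;  -- B locks row 2

-- Transaction A
UPDATE smsnotifications SET result = 'Sending'  WHERE id = 2;  -- A waits for B's lock on row 2

-- Transaction B
UPDATE smsnotifications SET result = 'Accepted' WHERE id = 1;  -- B waits for A's lock on row 1:
                                                               -- circular wait; the database aborts
                                                               -- one transaction with a deadlock error
```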
Description
The Altinn Notifications service comprises several coordinated components, including:
The Notifications API for registering notification orders and updating their status.
The Notifications SMS API for handling notification orders where the user intends to send an SMS to one or more recipients.
The Notifications Email API for handling notification orders where the user intends to send an email to one or more recipients.
A cron job to trigger the endpoints responsible for dispatching new email and SMS orders.
The scheduled cron job triggers the endpoint responsible for dispatching new SMS orders. Upon invocation, the Notifications API retrieves the new SMS orders from the database, updates their status from “New” to “Sending” and enqueues each SMS order individually onto a Kafka topic.
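As a rough illustration of what that dispatch step could look like at the database level (the table, column, and status names below are assumptions, not the actual schema), the job claims all new orders in a single transaction:

```sql
-- Hypothetical sketch of the dispatch step: mark every new SMS order as
-- being sent, in one transaction, and return the rows to be pushed to Kafka.
BEGIN;
UPDATE smsnotifications
SET result = 'Sending'
WHERE result = 'New'
RETURNING id, recipient, message;  -- row locks are taken in whatever order the scan visits the rows
COMMIT;
```

Because a batch statement like this locks many rows within one transaction, the order in which it acquires those locks is not guaranteed to match the order used by other updates against the same table, which is the precondition for the circular wait summarized at the top of this issue.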
The SMS-specific component, the Notifications SMS API, consumes these orders from the Kafka topic and transmits each order via the Link Mobility gateway. During transmission, Link Mobility returns either a unique reference identifier for a successfully queued SMS or null if the delivery attempt fails. The Notifications SMS API then replaces the null with an empty string and forwards the delivery report by posting it to a Kafka topic.
Finally, the Notifications API retrieves these delivery reports from the Kafka topic and updates the status of the corresponding SMS orders. In the current implementation, an OR condition determines which SMS order to update: the order is matched either by its unique SMS order identifier or by the gateway reference returned by Link Mobility, which can be a valid reference or an empty string. When a delivery fails (i.e., the gateway reference is an empty string), the update query inadvertently matches more than 12,000 rows that have an empty gateway reference, leading to unintended status updates.
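A hedged sketch of the kind of update query described above, again with hypothetical table and column names:

```sql
-- Hypothetical sketch of the status update applied for each delivery report.
-- @gatewayreference is '' for failed deliveries, because the null returned by
-- Link Mobility is replaced with an empty string upstream.
UPDATE smsnotifications
SET result = @result, resulttime = now()
WHERE alternateid = @notificationid
   OR gatewayreference = @gatewayreference;
```

With @gatewayreference = '', the OR branch matches every row whose gateway reference is an empty string (more than 12,000 rows, per the description above), so a single delivery report can lock a large share of the table. When such an update runs concurrently with the dispatch job's batch update, the two transactions can acquire the same row locks in different orders and deadlock.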
Description of the bug
From what the logs show, we started logging exceptions caused by deadlocks at approximately 08:00 AM almost every day in October 2024.
The stack trace shows that the exception originates from SmsNotificationRepository.UpdateSendStatus.
The pattern starts showing up in the logs during October, with the first occurrence varying between environments:
Oct 10th (Prod)
Oct 27th (AT24)
Oct 28th (AT23)
Oct 29th (AT22, TT02)
YT01 doesn't show any such logged exceptions.
The pattern isn't completely regular. It can happen twice on the same day, or several days can pass without a deadlock occurring.
In Production, the recorded deadlocks between Oct 10th and 26th occurred at 07:00 AM, while they started occurring at 08:00 AM from Oct 31st.
These findings could indicate that the deadlocks originate from an automated daily job.
It looks like the issue increased by one order of magnitude during week 5 in AT24 (graph constructed by running the query in 'Additional Information' below).
Why did this increase occur only in AT24?
A disaster recovery from backup was performed in AT24 on Jan 28th. Coincidence, or are these things related?
Steps To Reproduce
Steps to observe in logs:
Open Logs in Application Insights and run the following query, with the time range set to e.g. the last 7 days:
Additional Information
Query that counts occurrences per date: