This issue arises because two transactions update rows in the same table in different orders, creating a circular lock dependency (a deadlock).
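A minimal sketch of such a circular wait, using a hypothetical table name and hypothetical ids/status values (the actual Notifications schema may differ):

```sql
-- Two concurrent transactions update the same two rows in opposite order
-- (hypothetical table/column names, run in two separate sessions).

-- Transaction A
BEGIN;
UPDATE smsnotifications SET result = 'Sending'  WHERE id = 1;  -- A locks row 1

-- Transaction B (concurrently)
BEGIN;
UPDATE smsnotifications SET result = 'Accepted' WHERE id = 2;  -- B locks row 2

-- Transaction A
UPDATE smsnotifications SET result = 'Sending'  WHERE id = 2;  -- A waits for B's lock on row 2

-- Transaction B
UPDATE smsnotifications SET result = 'Accepted' WHERE id = 1;  -- B waits for A's lock on row 1:
                                                               -- circular wait; the database aborts
                                                               -- one transaction with a deadlock error
```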
Description
The Altinn Notifications service comprises several coordinated components, including:
The Notifications API for registering notification orders and updating their status.
The Notifications SMS API for handling notification orders where the user intends to send an SMS to one or more recipients.
The Notifications Email API for handling notification orders where the user intends to send an email to one or more recipients.
A cron job to trigger the endpoints responsible for dispatching new email and SMS orders.
The scheduled cron job triggers the endpoint responsible for dispatching new SMS orders. Upon invocation, the Notifications API retrieves the new SMS orders from the database, updates their status from “New” to “Sending” and enqueues each SMS order individually onto a Kafka topic.
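As a rough illustration of what that dispatch step could look like at the database level (the table, column, and status names below are assumptions, not the actual schema), the job claims all new orders in a single transaction:

```sql
-- Hypothetical sketch of the dispatch step: mark every new SMS order as
-- being sent, in one transaction, and return the rows to be pushed to Kafka.
BEGIN;
UPDATE smsnotifications
SET result = 'Sending'
WHERE result = 'New'
RETURNING id, recipient, message;  -- row locks are taken in whatever order the scan visits the rows
COMMIT;
```

Because a batch statement like this locks many rows within one transaction, the order in which it acquires those locks is not guaranteed to match the order used by other updates against the same table, which is the precondition for the circular wait summarized at the top of this issue.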
The SMS-specific component, the Notifications SMS API, consumes these orders from the Kafka topic and transmits each order via the Link Mobility gateway. During transmission, Link Mobility returns either a unique reference identifier for a successfully queued SMS or null if the delivery attempt fails. The Notifications SMS API then replaces the null with an empty string and forwards the delivery report by posting it to a Kafka topic.
Finally, the Notifications API retrieves these delivery reports from the Kafka topic and updates the status of the corresponding SMS orders. In the current implementation, an OR condition determines which SMS order to update: the order is matched either by its unique SMS order identifier or by the gateway reference returned by Link Mobility, which can be a valid reference or an empty string. When a delivery fails (i.e., the gateway reference is an empty string), the update query inadvertently matches more than 12,000 rows that have an empty gateway reference, leading to unintended status updates.
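A hedged sketch of the kind of update query described above, again with hypothetical table and column names:

```sql
-- Hypothetical sketch of the status update applied for each delivery report.
-- @gatewayreference is '' for failed deliveries, because the null returned by
-- Link Mobility is replaced with an empty string upstream.
UPDATE smsnotifications
SET result = @result, resulttime = now()
WHERE alternateid = @notificationid
   OR gatewayreference = @gatewayreference;
```

With @gatewayreference = '', the OR branch matches every row whose gateway reference is an empty string (more than 12,000 rows, per the description above), so a single delivery report can lock a large share of the table. When such an update runs concurrently with the dispatch job's batch update, the two transactions can acquire the same row locks in different orders and deadlock.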
Description of the bug
From what the logs show, we started logging exceptions caused by deadlocks at approximately 08:00 AM almost every day in October 2024.
The stack trace shows that the exception originates from SmsNotificationRepository.UpdateSendStatus.
The pattern starts showing up in the logs during October, with the first occurrence varying between environments:
Oct 10th (Prod)
Oct 27th (AT24)
Oct 28th (AT23)
Oct 29th (AT22, TT02)
YT01 doesn't show any such logged exceptions.
The pattern isn't completely regular. It can happen twice on the same day, or several days can pass without a deadlock occurring.
In Production, the recorded deadlocks between Oct 10th and 26th occurred at 07:00 AM, while they started occurring at 08:00 AM from Oct 31st.
These findings could indicate that the deadlocks originate from an automated daily job.
It looks like the issue increased by one order of magnitude during week 5 in AT24 (graph constructed by running the query in 'Additional Information' below).
Why did this increase occur only in AT24?
A disaster recovery from backup was performed in AT24 on Jan 28th. Coincidence, or are these things related?
Steps To Reproduce
Steps to observe in logs:
Open Logs in Application Insights and run the following query, with the time range set to e.g. the last 7 days:
Additional Information
Query that counts occurrences per date: