The remove_at column has not worked out as well as planned. We cannot distinguish activations that are stuck in sqlite because workers are unable to process them (worker death) from activations that are stuck in sqlite because no workers are available (absent workers).
This scenario has come up in sandbox testing: we can fill a broker's db with activations and shut it down. When the broker starts again later, all of its tasks are already past remove_at, so it is unable to make progress.
The solution discussed on Feb 6 for this was to replace remove_at with processing_attempts. Each time we reset an activation from processing -> pending, we also increment the processing_attempts counter.
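The reset-and-increment step could look roughly like the sketch below. It uses Python's sqlite3 module for illustration only; the `activations` table, its `id` and `status` columns, and the function name are assumptions and are not taken from the broker's actual schema or code.

```python
import sqlite3


def reset_processing_to_pending(conn: sqlite3.Connection, activation_id: str) -> int:
    """Move one activation from processing back to pending and count the attempt.

    Returns the number of rows changed (0 if the activation was not processing).
    Table and column names other than processing_attempts are hypothetical.
    """
    cur = conn.execute(
        """
        UPDATE activations
        SET status = 'pending',
            processing_attempts = processing_attempts + 1
        WHERE id = ? AND status = 'processing'
        """,
        (activation_id,),
    )
    conn.commit()
    return cur.rowcount


# Small in-memory demonstration with an assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activations ("
    "id TEXT PRIMARY KEY, "
    "status TEXT NOT NULL, "
    "processing_attempts INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO activations (id, status) VALUES ('a1', 'processing')")
reset_processing_to_pending(conn, "a1")
print(conn.execute("SELECT status, processing_attempts FROM activations").fetchone())
# ('pending', 1)
```

Doing the status change and the increment in a single UPDATE keeps the counter in step with the transition, so an activation cannot return to pending without the attempt being recorded.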
In upkeep we can scan for pending tasks whose processing_attempts is higher than the max allowed attempts, and discard/deadletter those activations. This will allow us to move from timestamp-based purging to attempt-based purging, which also simplifies the absent worker scenario.
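As a rough illustration of the upkeep scan, the sketch below marks pending activations as failed once their attempt count exceeds the limit. The `activations` table layout, the function name, and the limit of 3 are assumptions made for the example; only `processing_attempts`, `max_processing_attempts`, and the pending/failed states come from this issue.

```python
import sqlite3

# Hypothetical value; in the proposal this comes from max_processing_attempts in config.
MAX_PROCESSING_ATTEMPTS = 3


def fail_exhausted_activations(conn: sqlite3.Connection, max_attempts: int) -> int:
    """Mark pending activations with more than max_attempts attempts as failed.

    Returns the number of activations that were moved to failed.
    """
    cur = conn.execute(
        "UPDATE activations SET status = 'failed' "
        "WHERE status = 'pending' AND processing_attempts > ?",
        (max_attempts,),
    )
    conn.commit()
    return cur.rowcount


# Small in-memory demonstration with an assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activations ("
    "id TEXT PRIMARY KEY, "
    "status TEXT NOT NULL, "
    "processing_attempts INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO activations VALUES (?, ?, ?)",
    [
        ("a1", "pending", 0),      # never attempted: untouched
        ("a2", "pending", 3),      # at the limit, not over it: untouched
        ("a3", "pending", 4),      # over the limit: moved to failed
        ("a4", "processing", 5),   # not pending: left for the reset path
    ],
)
failed = fail_exhausted_activations(conn, MAX_PROCESSING_ATTEMPTS)
print(failed)  # 1
```

Because the check depends only on the counter, a broker that was offline for a long time does not purge tasks on restart: no attempts were made while it was down, so no counters moved.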
Changes to make
- Add processing_attempts to sqlite.
- Remove remove_at from sqlite.
- Each time an activation is moved out of processing into pending, increment the attempt counter.
- Add max_processing_attempts to configuration.
- Remove remove_deadline from config.
- During upkeep, any activations with processing_attempts in excess of the configured value should be moved to failed so that they can be discarded/deadlettered.
- Simplify the remove_completed logic so that it no longer requires an incomplete task to follow a complete one. All completed tasks can be removed from sqlite.
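The simplified remove_completed from the last item might reduce to a single unconditional delete, as in the sketch below. The `activations` table, the `status` column, and the `'complete'` status value are assumptions for illustration, not the broker's actual schema.

```python
import sqlite3


def remove_completed(conn: sqlite3.Connection) -> int:
    """Delete every completed activation, regardless of what follows it.

    Returns the number of rows removed. The 'complete' status value is assumed.
    """
    cur = conn.execute("DELETE FROM activations WHERE status = 'complete'")
    conn.commit()
    return cur.rowcount


# Small in-memory demonstration with an assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activations ("
    "id TEXT PRIMARY KEY, "
    "status TEXT NOT NULL, "
    "processing_attempts INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO activations (id, status) VALUES (?, ?)",
    [("a1", "complete"), ("a2", "pending"), ("a3", "complete")],
)
removed = remove_completed(conn)
print(removed)  # 2
```

Here the trailing completed activation `a3` is removed even though no incomplete task follows it, which is the behaviour change the item describes.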