[BUG] Racing condition in executeAsyncActions #17103
Labels
bug
Something isn't working
Search:Query Capabilities
v2.19.0
Issues and PRs related to version 2.19.0
Describe the bug
Issue was found by @dbwiddis while reviewing #16818 , raised this bug issue to track and proper fix this:
A race condition exists in the QueryRewriteContext class in the method executeAsyncActions(). When multiple threads concurrently interact with the asyncActions list (via the registerAsyncAction() and executeAsyncActions() methods), it can result in actions being lost, or failure notifications not being delivered to the listener. This issue can occur under certain conditions where actions are added to the list while the method executeAsyncActions() is in progress.
Related component
Search:Query Capabilities
To Reproduce
Root Cause
The issue arises from the following race condition:
Thread 1 is executing executeAsyncActions(), but other threads (Thread 2 and Thread 3) are concurrently adding new actions to asyncActions.
The list is copied to biConsumers before the list is cleared, leading to discrepancies between the actions being iterated over and the actual actions that should have been processed.
The countDown value is based on the state of asyncActions when executeAsyncActions() starts, but the list may have changed by the time the actions are processed, leading to missing actions and incomplete failure reporting.
Expected behavior
1. Quick Fix: Synchronization Around Copying and Clearing
Problem: The race condition occurs because the
asyncActions
list can be modified whileexecuteAsyncActions()
is processing it.Solution:
Wrap the
copy
andclear
operations ofasyncActions
in a synchronized block. This will ensure that no other thread can modify the list while it's being copied or cleared.Also wrap the
registerAsyncAction
method (or just the part where actions are added to the list) with a synchronized block to prevent concurrent modifications.Further Improvement:
countDown
to after the list is copied, right after the synchronized block. This ensures you have full control over the list at that point and avoids any race when counting actions.2. Alternate Fix: Synchronize
registerAsyncAction
and Refactor into a New MethodasyncActions
inregisterAsyncAction
while it is being copied and cleared causes the race condition.Add the
synchronized
keyword toregisterAsyncAction
to prevent concurrent modifications to the list while it's being processed.Create a new synchronized helper method,
copyAndClearAsyncActions()
, which handles copying the list, clearing it, and returning the copied list. This method will encapsulate the synchronization logic.Move the countdown calculation after calling
copyAndClearAsyncActions()
, so you are working with the list in a consistent state.3. Simpler Alternative: Use
AtomicReference
to Manage the List (proposed by @msfroh )Replace
asyncActions
with anAtomicReference<List<BiConsumer<Client, ActionListener<?>>>>
to safely manage concurrent access to the list without explicit synchronization.Use the
AtomicReference
to atomically set and get the list. This eliminates the need to copy the list manually and avoids the race condition.Benefits:
AtomicReference
ensures that updates to the list are thread-safe without explicit locks.Comparison of Solutions
Recommendation
AtomicReference
seems like the best option. It requires minimal changes, eliminates the need for manual synchronization, and provides better thread safety. This approach is also cleaner and more efficient, as it avoids the overhead associated with usingsynchronized
blocks.Additional Details
Plugins
core
Additional context
discussion was raised from https://github.com/opensearch-project/OpenSearch/pull/16818/files#r1919283967
The text was updated successfully, but these errors were encountered: