Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Racing condition in executeAsyncActions #17103

Open
mingshl opened this issue Jan 23, 2025 · 1 comment
Open

[BUG] Racing condition in executeAsyncActions #17103

mingshl opened this issue Jan 23, 2025 · 1 comment
Assignees
Labels
bug Something isn't working Search:Query Capabilities v2.19.0 Issues and PRs related to version 2.19.0

Comments

@mingshl
Copy link
Contributor

mingshl commented Jan 23, 2025

Describe the bug

Issue was found by @dbwiddis while reviewing #16818 , raised this bug issue to track and proper fix this:

A race condition exists in the QueryRewriteContext class in the method executeAsyncActions(). When multiple threads concurrently interact with the asyncActions list (via the registerAsyncAction() and executeAsyncActions() methods), it can result in actions being lost, or failure notifications not being delivered to the listener. This issue can occur under certain conditions where actions are added to the list while the method executeAsyncActions() is in progress.

Related component

Search:Query Capabilities

To Reproduce

  • Thread 1 calls executeAsyncActions() and retrieves the number of registered async actions (e.g., 3 actions) on line 113, and stores this number in the countDown object.
  • Thread 2 (in a different part of the system) calls registerAsyncAction() and adds a 4th action to the asyncActions list.
  • Thread 1 proceeds and copies the current asyncActions list (which now contains 4 actions) into biConsumers. The count of actions in biConsumers is now 4, but the count stored in countDown is still 3, reflecting the actions at the time executeAsyncActions() started.
  • Thread 3 (in another part of the system) calls registerAsyncAction() again, adding a 5th action to the asyncActions list.
  • Thread 1 clears the asyncActions list, so the 5th action added in step 4 is lost.
  • Thread 1 starts iterating over biConsumers (which contains only 4 actions, not 5) and processes the actions.
  • As countDown reaches 0 after processing the 3rd action, the listener's onResponse() is triggered, but the 4th action is still executed.
  • If the 4th action fails, the failure is ignored because the listener is already completed (onResponse() has been triggered).
  • Expected Behavior
  • Action 5 should not be lost, even if executeAsyncActions() is in progress when new actions are added to the list.
  • If any action fails (including the 4th action), the failure should be properly reported via the onFailure() method of the listener.
  • Actual Behavior
  • Action 5 is lost because it was added to asyncActions after executeAsyncActions() had already copied the list to biConsumers.
  • If the 4th action fails, the failure is not reported to the user because the listener has already been notified with onResponse() when countDown reaches 0.
Thread 1 (executeAsyncActions)                   Thread 2 (registerAsyncAction)    Thread 3 (registerAsyncAction)
  | Call executeAsyncActions()                     |                                |
  | Retrieve countDown = 3                         |                                |
  | Copy asyncActions (3 items) -> biConsumers     |                                |
  | Thread 1 stores countDown, iterates biConsumers|                                |
  | ---------------------------------------------> | Add Action 4 to asyncActions |
  | Process Action 1                               |                                |
  | Process Action 2                               |                                |
  | Process Action 3                               |                                |
  | Action 3 -> countDown reaches 0, listener      |                                |
  | onResponse()                                   |                                |
  | Continue executing Action 4                    | Add Action 5 to asyncActions |
  | Clear asyncActions (loses Action 5)            |                                |
  | ---------------------------------------------> |                                |
  | Fail Action 4 -> No failure notification       |                                |

Root Cause

The issue arises from the following race condition:

Thread 1 is executing executeAsyncActions(), but other threads (Thread 2 and Thread 3) are concurrently adding new actions to asyncActions.
The list is copied to biConsumers before the list is cleared, leading to discrepancies between the actions being iterated over and the actual actions that should have been processed.
The countDown value is based on the state of asyncActions when executeAsyncActions() starts, but the list may have changed by the time the actions are processed, leading to missing actions and incomplete failure reporting.

Expected behavior

1. Quick Fix: Synchronization Around Copying and Clearing

  • Problem: The race condition occurs because the asyncActions list can be modified while executeAsyncActions() is processing it.

  • Solution:

    • Wrap the copy and clear operations of asyncActions in a synchronized block. This will ensure that no other thread can modify the list while it's being copied or cleared.

    • Also wrap the registerAsyncAction method (or just the part where actions are added to the list) with a synchronized block to prevent concurrent modifications.

      java
      Copy
      synchronized (asyncActions) { // Copy and clear the list in executeAsyncActions List<BiConsumer<Client, ActionListener<?>>> biConsumers = new ArrayList<>(asyncActions); asyncActions.clear(); }

      synchronized (asyncActions) {
      // Add new actions in registerAsyncAction
      asyncActions.add(asyncAction);
      }

  • Further Improvement:

    • Move the calculation of countDown to after the list is copied, right after the synchronized block. This ensures you have full control over the list at that point and avoids any race when counting actions.

2. Alternate Fix: Synchronize registerAsyncAction and Refactor into a New Method

  • Problem: Modifying asyncActions in registerAsyncAction while it is being copied and cleared causes the race condition.
  • Solution:
    • Add the synchronized keyword to registerAsyncAction to prevent concurrent modifications to the list while it's being processed.

    • Create a new synchronized helper method, copyAndClearAsyncActions(), which handles copying the list, clearing it, and returning the copied list. This method will encapsulate the synchronization logic.

      java
      Copy
      private synchronized List<BiConsumer<Client, ActionListener<?>>> copyAndClearAsyncActions() {
      List<BiConsumer<Client, ActionListener<?>>> copy = new ArrayList<>(asyncActions);
      asyncActions.clear();
      return copy;
      }

      // In executeAsyncActions
      List<BiConsumer<Client, ActionListener<?>>> biConsumers = copyAndClearAsyncActions();

    • Move the countdown calculation after calling copyAndClearAsyncActions(), so you are working with the list in a consistent state.

3. Simpler Alternative: Use AtomicReference to Manage the List (proposed by @msfroh )

  • Problem: The current approach with a plain list allows race conditions when modifying the list.
  • Solution:
    • Replace asyncActions with an AtomicReference<List<BiConsumer<Client, ActionListener<?>>>> to safely manage concurrent access to the list without explicit synchronization.

    • Use the AtomicReference to atomically set and get the list. This eliminates the need to copy the list manually and avoids the race condition.

      java
      Copy
      private final AtomicReference<List<BiConsumer<Client, ActionListener<?>>>> asyncActionsRef = new AtomicReference<>(new ArrayList<>());

      // In registerAsyncAction
      asyncActionsRef.get().add(asyncAction);

      // In executeAsyncActions
      List<BiConsumer<Client, ActionListener<?>>> asyncActions = asyncActionsRef.getAndSet(new ArrayList<>());

    • Benefits:

      • You no longer need to manually synchronize or copy the list.
      • AtomicReference ensures that updates to the list are thread-safe without explicit locks.
      • Eliminates the risk of race conditions and makes the code simpler and more efficient.

Comparison of Solutions

Fix Pros Cons
Quick Fix (Synchronization) - Simple to implement.- No major refactoring needed. - May introduce performance overhead due to synchronization.- Synchronization may become complex with multiple threads accessing shared resources.
Alternate Fix (Method Refactor) - Clearer separation of concerns with a new method.- Keeps synchronization in one place. - Adds additional methods and complexity.- Slightly more verbose than the quick fix.
Simpler Alternative (AtomicReference) - Thread-safe without needing explicit synchronization.- Eliminates race conditions with minimal code changes. - Requires replacing the list with an AtomicReference.- Might need some adaptation in other parts of the code.

Recommendation

  • The simpler alternative using AtomicReference seems like the best option. It requires minimal changes, eliminates the need for manual synchronization, and provides better thread safety. This approach is also cleaner and more efficient, as it avoids the overhead associated with using synchronized blocks.

Additional Details

Plugins
core

Additional context
discussion was raised from https://github.com/opensearch-project/OpenSearch/pull/16818/files#r1919283967

@dbwiddis
Copy link
Member

I had a feeling AtomicReference would help here but don't have enough experience with it (yet) to know. Thanks @msfroh for the suggestion!

@dbwiddis dbwiddis added the v2.19.0 Issues and PRs related to version 2.19.0 label Jan 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Search:Query Capabilities v2.19.0 Issues and PRs related to version 2.19.0
Projects
Status: 🆕 New
Development

No branches or pull requests

2 participants