Fix stateless worker race condition causing activation directory leak #9190

Merged

Conversation

@EdeMeijer (Contributor) commented Oct 18, 2024


Explanation of issue:

`StatelessWorkerGrainContext` listens to destruction events of its internal worker activations using `OnDestroyActivation`. It intends to unregister itself from the catalog once its last worker has been destroyed (collected).

It did this by enqueueing a work item that removes the worker context from the `_workers` list, and then, outside the work item, immediately checking whether `_workers` was empty.
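
A simplified sketch of that flow, with hypothetical names modeled on the description above (not the exact Orleans source):

```csharp
// Hypothetical sketch of the buggy flow; names are illustrative,
// not the exact Orleans implementation.
class StatelessWorkerContextSketch
{
    private readonly List<IGrainContext> _workers = new();

    public void OnDestroyActivation(IGrainContext worker)
    {
        // The removal runs later, on the background work loop.
        EnqueueWorkItem(() => _workers.Remove(worker));

        // Racy: if the work item has not run yet, _workers still contains
        // the destroyed worker, so the empty check fails and this context
        // is never unregistered from the catalog, leaking an entry in the
        // activation directory.
        if (_workers.Count == 0)
        {
            UnregisterFromCatalog();
        }
    }

    private void EnqueueWorkItem(Action action) { /* signal the work loop */ }
    private void UnregisterFromCatalog() { /* remove from the catalog */ }
}
```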

Work items are processed in a background loop, triggered by a work signal. The code relied on the work signal's `RunContinuationsAsynchronously` property being set to `false`, apparently assuming that this would guarantee the work item is processed synchronously, on the same thread that enqueues and signals it.

However, `RunContinuationsAsynchronously` does _not_ guarantee that the continuation runs synchronously when set to `false`; it only guarantees that it runs asynchronously when set to `true`. I couldn't reproduce this behaviour in a unit test, but it evidently happens in production.
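
A minimal standalone illustration of that asymmetry (my own sketch, not code from this PR):

```csharp
using System;
using System.Threading.Tasks;

class TcsDemo
{
    static async Task Main()
    {
        // Default options: when SetResult is called, the awaiting
        // continuation MAY run inline on the signalling thread, but
        // nothing guarantees it. Code that assumes "SetResult returned,
        // therefore the continuation has already run" is racy.
        var signal = new TaskCompletionSource<bool>();

        var consumer = Task.Run(async () =>
        {
            await signal.Task;
            Console.WriteLine($"Continuation ran on thread {Environment.CurrentManagedThreadId}");
        });

        await Task.Delay(100); // let the consumer start awaiting
        Console.WriteLine($"Signalling from thread {Environment.CurrentManagedThreadId}");
        signal.SetResult(true); // the continuation may or may not run inline here
        await consumer;

        // Passing TaskCreationOptions.RunContinuationsAsynchronously gives
        // the opposite (and only available) guarantee: the continuation is
        // forced OFF the signalling thread.
    }
}
```

In practice the inline path happens often enough that the old code appeared to work, which is exactly what makes this kind of race hard to reproduce in a unit test.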

Since `StatelessWorkerGrainContext` was the only occurrence of `RunContinuationsAsynchronously = false` in the codebase, and it is set to `true` everywhere else, I changed it to `true` here as well. This reliably triggers the race condition (it makes the newly added test fail).

Then, as the proper fix, I moved the check that unregisters the stateless worker context from the catalog when the last worker is removed into the work item itself, so that it always runs synchronously after the worker list update.
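
Using the same hypothetical names as the sketch above, the fixed flow looks like this:

```csharp
// Hypothetical sketch of the fixed flow; names are illustrative,
// not the exact Orleans implementation.
public void OnDestroyActivation(IGrainContext worker)
{
    // The removal and the empty check now run together inside the work
    // item, on the background loop, so the check always observes the
    // updated list, regardless of how the signal's continuation is
    // scheduled.
    EnqueueWorkItem(() =>
    {
        _workers.Remove(worker);
        if (_workers.Count == 0)
        {
            UnregisterFromCatalog();
        }
    });
}
```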

Background:

We noticed that our "total activations" metric did not match the "per grain type activations" metrics and was ever-growing in production. This seemed to correlate with long GC pauses after a few days of runtime, causing silo restarts.

A production memory dump revealed that, indeed, millions of activations were tracked in the `ActivationDirectory` that shouldn't be there, and they were all stateless worker contexts.
@ReubenBond (Member) commented:
Great find, @EdeMeijer; this is very clearly a bug, and your fix is correct 🤦 I reverted the `RunContinuationsAsynchronously` change since it's not necessary for the fix.

ReubenBond merged commit 2f0c339 into dotnet:main on Oct 18, 2024
22 checks passed
EdeMeijer deleted the fix-stateless-worker-catalog-removal branch on October 19, 2024 at 07:57