Fix stateless worker race condition causing activation directory leak #9190

Merged

Conversation

@EdeMeijer (Contributor) commented Oct 18, 2024


Explanation of issue:

`StatelessWorkerGrainContext` listens to destruction events of its internal worker activations using `OnDestroyActivation`. It intends to unregister itself from the catalog once its last worker has been destroyed (collected).

It did this by enqueueing a work item that removes the worker context from the `_workers` list, and then, outside the work item, immediately checking whether `_workers` was empty.
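
A simplified sketch of that flow, with hypothetical names modeled on the description above (not the exact Orleans source):

```csharp
// Hypothetical sketch of the buggy flow; names are illustrative,
// not the exact Orleans implementation.
class StatelessWorkerContextSketch
{
    private readonly List<IGrainContext> _workers = new();

    public void OnDestroyActivation(IGrainContext worker)
    {
        // The removal runs later, on the background work loop.
        EnqueueWorkItem(() => _workers.Remove(worker));

        // Racy: if the work item has not run yet, _workers still contains
        // the destroyed worker, so the empty check fails and this context
        // is never unregistered from the catalog, leaking an entry in the
        // activation directory.
        if (_workers.Count == 0)
        {
            UnregisterFromCatalog();
        }
    }

    private void EnqueueWorkItem(Action action) { /* signal the work loop */ }
    private void UnregisterFromCatalog() { /* remove from the catalog */ }
}
```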

Work items are processed in a background loop, triggered by a work signal. The code relied on the work signal's `RunContinuationsAsynchronously` property being set to `false`, apparently assuming that this would guarantee the work item is processed synchronously, on the same thread that enqueues and signals it.

However, `RunContinuationsAsynchronously` does _not_ guarantee that the continuation runs synchronously when set to `false`; it only guarantees that it runs asynchronously when set to `true`. I couldn't reproduce this behaviour in a unit test, but it evidently happens in production.
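
A minimal standalone illustration of that asymmetry (my own sketch, not code from this PR):

```csharp
using System;
using System.Threading.Tasks;

class TcsDemo
{
    static async Task Main()
    {
        // Default options: when SetResult is called, the awaiting
        // continuation MAY run inline on the signalling thread, but
        // nothing guarantees it. Code that assumes "SetResult returned,
        // therefore the continuation has already run" is racy.
        var signal = new TaskCompletionSource<bool>();

        var consumer = Task.Run(async () =>
        {
            await signal.Task;
            Console.WriteLine($"Continuation ran on thread {Environment.CurrentManagedThreadId}");
        });

        await Task.Delay(100); // let the consumer start awaiting
        Console.WriteLine($"Signalling from thread {Environment.CurrentManagedThreadId}");
        signal.SetResult(true); // the continuation may or may not run inline here
        await consumer;

        // Passing TaskCreationOptions.RunContinuationsAsynchronously gives
        // the opposite (and only available) guarantee: the continuation is
        // forced OFF the signalling thread.
    }
}
```

In practice the inline path happens often enough that the old code appeared to work, which is exactly what makes this kind of race hard to reproduce in a unit test.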

Since `StatelessWorkerGrainContext` was the only occurrence of `RunContinuationsAsynchronously = false` in the codebase, and it is set to `true` everywhere else, I changed it to `true` here as well. This reliably triggers the race condition (it makes the newly added test fail).

Then, as the proper fix, I moved the check that unregisters the stateless worker context from the catalog when the last worker is removed into the work item itself, so that it always runs synchronously after the worker list update.
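
Using the same hypothetical names as the sketch above, the fixed flow looks like this:

```csharp
// Hypothetical sketch of the fixed flow; names are illustrative,
// not the exact Orleans implementation.
public void OnDestroyActivation(IGrainContext worker)
{
    // The removal and the empty check now run together inside the work
    // item, on the background loop, so the check always observes the
    // updated list, regardless of how the signal's continuation is
    // scheduled.
    EnqueueWorkItem(() =>
    {
        _workers.Remove(worker);
        if (_workers.Count == 0)
        {
            UnregisterFromCatalog();
        }
    });
}
```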

Background:

We noticed that our "total activations" metric did not match the "per grain type activations" metrics and was ever-growing in production. This seemed to correlate with long GC pauses after a few days of runtime, causing silo restarts.

A production memory dump revealed that, indeed, millions of activations were tracked in the `ActivationDirectory` that shouldn't be there, and they were all stateless worker contexts.
@ReubenBond (Member) commented:
Great find, @EdeMeijer; this is very clearly a bug, and your fix is correct 🤦 I reverted the `RunContinuationsAsynchronously` change since it's not necessary for the fix.

ReubenBond merged commit 2f0c339 into dotnet:main on Oct 18, 2024
22 checks passed
EdeMeijer deleted the fix-stateless-worker-catalog-removal branch on October 19, 2024 at 07:57