Fix stateless worker race condition causing activation directory leak #9190
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Explanation of issue
StatelessWorkerGrainContext
listens to destruction events of its internal worker activations usingOnDestroyActivation
. It intends to unregister itself from the catalog once its last worker has been destroyed (collected).It did this by enqueueing a work item that removes the worker context from the
_workers
list. It then checked if_workers
was empty.Work items are processed in a background loop, triggered by a work signal. The code relied on the work signal's
RunContinuationsAsynchronously
property being set tofalse
, seemingly assuming that that would guarantee the work item to be processed on the same thread enqueueing and signalling the work item.However,
RunContinuationsAsynchronously
does not guarantee the continuation to run synchronously when set tofalse
, only that it runs asynchronously when set totrue
. I couldn't reproduce this behaviour in a unit test, but apparently it happens in production.Since
StatelessWorkerGrainContext
was the only occurence ofRunContinuationsAsynchronously = false
in the codebase and everywhere else it's set totrue
, I changed it totrue
here as well. This triggers the race condition (makes the newly added test fail).Then as a proper fix, I moved the check that unregisters the stateless worker context from the catalog when the last worker is removed to the work item itself to make sure it always runs synchronously after the worker list update.
Background
We noticed that our "total activations" metrics were not matching "per grain type activations" metrics and were ever-growing in production. This seemed correlate with long GC pauses after a few days of runtime, causing silo restarts.
A production memory dump revealed that, indeed, millions of activations were tracked in the
ActivationDirectory
that shouldn't be there, and they were all stateless worker contexts.Microsoft Reviewers: Open in CodeFlow