Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix stateless worker race condition causing activation directory leak
Explanation of issue: `StatelessWorkerGrainContext` listens to destruction events of its internal worker activations using `OnDestroyActivation`. It intends to unregister itself from the catalog once its last worker has been destroyed (collected). It did this by enqueueing a work item that removes the worker context from the `_workers` list. It then checked if `_workers` was empty. Work items are processed in a background loop, triggered by a work signal. The code relied on the work signal's `RunContinuationsAsynchronously` property being set to `false`, seemingly assuming that that would guarantee the work item to be processed on the same thread enqueueing and signalling the work item. However, `RunContinuationsAsynchronously` does _not_ guarantee the continuation to run synchronously when set to `false`, only that it runs asynchronously when set to `true`. I couldn't reproduce this behaviour in a unit test, but apparently it happens in production. Since `StatelessWorkerGrainContext` was the only occurence of `RunContinuationsAsynchronously = false` in the codebase and everywhere else it's set to `true`, I changed it to `true` here as well. This triggers the race condition (makes the newly added test fail). Then as a proper fix, I moved the check that unregisters the stateless worker context from the catalog when the last worker is removed to the work item itself to make sure it always runs synchronously after the worker list update. Background: We noticed that our "total activations" metrics were not matching "per grain type activations" metrics and were ever-growing in production. This seemed correlate with long GC pauses after a few days of runtime, causing silo restarts. A production memory dump revealed that, indeed, millions of activations were tracked in the `ActivationDirectory` that shouldn't be there, and they were all stateless worker contexts.
- Loading branch information