
Strong consistency, distributed, in-memory grain directory #9103

Merged

Conversation

ReubenBond (Member) commented Aug 6, 2024

Fixes #9074
Fixes #2656
Fixes #4217

An improved in-memory distributed grain directory

This PR introduces a new in-memory grain directory for Orleans. We plan to make this the default implementation shortly (possibly in 9.0), but this PR adds it as an experimental, opt-in feature.

The new grain directory has the following attributes, contrasting with the current default implementation:

  • Strong consistency: the new directory maintains strong consistency for all silos which are active members of the cluster. The existing directory is eventually consistent, which means that there can be duplicate grain activations for a short period during cluster membership changes, while the cluster is coalescing.
  • Reliable: with the existing directory, when a silo is removed from the cluster (eg, due to a crash or an upgrade), any grains registered to that silo must be terminated. This causes a significant amount of disruption during upgrades, to the point where some teams prefer to perform blue/green upgrades instead of rolling upgrades. The new directory does not suffer from this issue: when cluster membership changes, silos hand off the registrations they owned to the new owners. If a silo terminates before hand-off can be completed, recovery is performed: each silo in the cluster is asked for its currently active grains, and these are re-registered before continuing.
  • Balanced: the existing directory uses a consistent hash ring to assign ownership for key ranges to silos. The new directory also uses a consistent hash ring, but instead of assigning each silo one section of the ring, it assigns each silo a configurable number of sections on the ring, defaulting to 30. This has a couple of benefits. First, it probabilistically improves balance across hosts to within a few percent of each other. Second, it spreads the load of range hand-off across more silos: giving each silo a single section on the hash ring would mean that every hand-off occurs between a single pair of hosts.

Design

The directory is distributed across all silos in the cluster by partitioning the key space into ranges using consistent hashing, where each silo owns some pre-configured number of ring ranges, defaulting to 30 ranges per silo. A hash function is used to associate GrainIds with a point on the ring. The silo which owns that point on the ring is responsible for serving any requests for that grain.
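
As a rough illustration of how ownership is resolved, here is a minimal consistent-hash-ring sketch. The type names and the hash function below are illustrative stand-ins, not the actual Orleans implementation:

using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of ring-based ownership; not the real Orleans types or hash.
sealed class RingSketch
{
    // Sorted map from ring point to the silo that owns it.
    private readonly SortedDictionary<uint, string> _points = new();

    public RingSketch(IEnumerable<string> silos, int rangesPerSilo = 30)
    {
        foreach (var silo in silos)
            for (var i = 0; i < rangesPerSilo; i++)
                _points[Hash($"{silo}/{i}")] = silo;
    }

    // A grain is served by the silo owning the first ring point at or after
    // the grain's hash, wrapping around past the end of the ring.
    public string GetOwner(string grainId)
    {
        var h = Hash(grainId);
        foreach (var (point, silo) in _points)
            if (point >= h) return silo;
        return _points.First().Value; // wrap around
    }

    // Stand-in hash; the real directory uses a stable hash of the GrainId.
    private static uint Hash(string value) => (uint)StringComparer.Ordinal.GetHashCode(value);
}

For example, new RingSketch(new[] { "A", "B", "C" }, rangesPerSilo: 4).GetOwner("user/terry") returns whichever silo owns the ring point covering that grain's hash, matching the user/terry example shown below.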

Ring ranges are assigned to replicas deterministically based on the current cluster membership. We refer to consistent snapshots of membership as views, and each view has a unique number which is used to identify and order views. A new view is created whenever a silo in the cluster changes state (eg, between Joining, Active, ShuttingDown, and Dead), and all non-faulty silos learn of view changes quickly through a combination of gossip and polling of the membership service.
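
Conceptually, a view pairs a membership version with the range assignment derived from it. The shape below is only a sketch of that idea, not the PR's actual types:

using System.Collections.Generic;

// Illustrative only: a "view" is a membership version plus the
// range-to-silo assignment deterministically derived from that membership.
readonly record struct RingRange(uint Start, uint End);        // [Start, End) on the ring

sealed record DirectoryViewSketch(
    long MembershipVersion,                                    // unique; totally orders views
    IReadOnlyDictionary<RingRange, string> Owners)             // range -> owning silo
{
    public bool IsNewerThan(DirectoryViewSketch other) => MembershipVersion > other.MembershipVersion;
}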

The image below shows a representation of a ring with 3 silos (A, B, & C), each owning 4 ranges on the ring. The grain user/terry hashes to 0x4000_0000, which is owned by Silo B.

[Image: hash ring with silos A, B, and C, each owning 4 ranges; user/terry falls in a range owned by Silo B]

The directory operates in two modes: normal operation, where a fixed set of processes communicate in the absence of failure, and view change, which punctuates periods of normal operation to transfer state and control from the processes in the preceding view to the processes in the current view.
[Image: timeline alternating between periods of normal operation and view changes]

Once state and control have been transferred from the previous view, normal operation resumes. This coordinated state and control transfer is performed on a range-by-range basis, and some ranges may resume normal operation before others. For partitions which do not change owners during a view change, normal operation never stops. For partitions which do change ownership during a view change, service is suspended when the previous owner learns of the view change and resumed when the current owner successfully completes the transfer of state and control.

When a view change begins, any ring ranges which were previously owned and are now no longer owned are sealed, preventing further modifications by the previous owner, and a snapshot is created and retained in-memory. The snapshot is transferred to the new range owners and is deleted once the new owners acknowledge that it has been retrieved or if they are evicted from the cluster.
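
A seal-and-snapshot step on the previous owner might look roughly like the sketch below. The types are illustrative only; the real directory operates on ring ranges of grain registrations:

using System;
using System.Collections.Generic;

// Sketch of sealing a no-longer-owned range and snapshotting it for transfer.
sealed class RangePartitionSketch
{
    private readonly Dictionary<string, string> _registrations = new(); // grainId -> silo address
    private bool _sealed;

    public void Register(string grainId, string silo)
    {
        if (_sealed) throw new InvalidOperationException("Range is sealed for transfer.");
        _registrations[grainId] = silo;
    }

    // Called by the previous owner when it learns the range has a new owner.
    // The returned snapshot is retained until the new owner acknowledges receipt
    // (or is evicted from the cluster).
    public IReadOnlyDictionary<string, string> SealAndSnapshot()
    {
        _sealed = true; // no further modifications by the previous owner
        return new Dictionary<string, string>(_registrations);
    }
}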

The image below depicts a view change, where Silo B is shutting down and Silo C must transfer state & control from Silo B for the ranges which it owned. Silo B will not complete shutting down until Silo C acknowledges that it has completed the transfer.
[Image: view change in progress, with Silo B shutting down and Silo C taking over the ranges Silo B owned]
After the transfer has completed, the ring looks like so:
[Image: the ring after the transfer, with Silo B no longer owning any ranges]

Failures are handled by performing recovery. Partition owners perform recovery by requesting the set of registered grain activations belonging to that partition from all active silos in the cluster. Requests are not served until recovery completes. As an optimization, the new partition owner can serve lookup requests for grain registrations which it has already recovered.
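
Conceptually, recovering a range amounts to asking every active silo which of its grains fall into that range and re-registering them. A hedged sketch, assuming a hypothetical per-silo query interface:

using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical interface: each silo can report the grains it currently hosts
// whose hashes fall within a given ring range.
interface ISiloActivationQuery
{
    Task<IReadOnlyList<(string GrainId, string HostSilo)>> GetActiveGrainsInRangeAsync(uint rangeStart, uint rangeEnd);
}

static class RecoverySketch
{
    public static async Task<Dictionary<string, string>> RecoverRangeAsync(
        uint rangeStart, uint rangeEnd, IEnumerable<ISiloActivationQuery> activeSilos)
    {
        var recovered = new Dictionary<string, string>(); // grainId -> silo address
        foreach (var silo in activeSilos)
        {
            // Each active silo reports its grains in the range; they are re-registered
            // with the new owner before normal service resumes for this range.
            foreach (var (grainId, hostSilo) in await silo.GetActiveGrainsInRangeAsync(rangeStart, rangeEnd))
                recovered[grainId] = hostSilo;
        }
        return recovered;
    }
}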

To prevent requests from being served during a view change, range locks are acquired when the view change begins, respected by each request, and released when the view change completes.

All requests and responses include the current membership version. If a client or replica receives a message with a higher membership version, they refresh membership to at least that version before continuing. After refreshing their view, clients retry misdirected requests (requests targeting the wrong replica for a given point on the ring), sending them to the owner for the current view. Replicas reject misdirected requests.
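
The per-request version check can be pictured roughly as below. The request/response shapes and the callbacks are hypothetical; only the ordering logic matters here:

using System;
using System.Threading.Tasks;

// Hypothetical request/response shapes carrying the sender's membership version.
sealed record DirectoryRequest(string GrainId, long MembershipVersion);
sealed record DirectoryResponse(bool Misdirected, long MembershipVersion, string? Result = null);

sealed class ReplicaRequestHandlerSketch
{
    private long _localVersion;
    private readonly Func<string, long, bool> _ownsInCurrentView;      // (grainId, version) -> owned here?
    private readonly Func<long, Task<long>> _refreshMembershipAsync;   // refresh to at least this version
    private readonly Func<DirectoryRequest, Task<string>> _serveAsync;

    public ReplicaRequestHandlerSketch(long version, Func<string, long, bool> owns,
        Func<long, Task<long>> refresh, Func<DirectoryRequest, Task<string>> serve)
    {
        _localVersion = version;
        _ownsInCurrentView = owns;
        _refreshMembershipAsync = refresh;
        _serveAsync = serve;
    }

    public async Task<DirectoryResponse> HandleAsync(DirectoryRequest request)
    {
        // If the caller has seen a newer view than we have, catch up first.
        if (request.MembershipVersion > _localVersion)
            _localVersion = await _refreshMembershipAsync(request.MembershipVersion);

        // Replicas reject misdirected requests; after refreshing, clients retry
        // against the owner for the current view.
        if (!_ownsInCurrentView(request.GrainId, _localVersion))
            return new DirectoryResponse(Misdirected: true, _localVersion);

        return new DirectoryResponse(Misdirected: false, _localVersion, await _serveAsync(request));
    }
}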

Range partitioning is more complicated than having a fixed number of fixed-size partitions, but it has nice properties: elastic scaling (adding more hosts increases system capacity), good balancing, minimal data movement during scaling events, and a partition mapping that is derived directly from cluster membership with no other state (such as partition assignments) required.

Enabling DistributedGrainDirectory

For now, the new directory is opt-in, so you will need to use the following code to enable it on your silos:

#pragma warning disable ORLEANSEXP002 // Type is for evaluation purposes only and is subject to change or removal in future updates. Suppress this diagnostic to proceed.
siloBuilder.AddDistributedGrainDirectory();
#pragma warning restore ORLEANSEXP002 // Type is for evaluation purposes only and is subject to change or removal in future updates. Suppress this diagnostic to proceed.

If performing a rolling upgrade, you will experience unavailability of the directory until all hosts are running the new directory.
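
For context, here is roughly where the snippet above fits in a standard .NET generic-host setup. The hosting calls shown are the usual Orleans hosting APIs; localhost clustering and the exact namespaces needed for AddDistributedGrainDirectory are assumptions of this sketch, not requirements of the new directory:

using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

using var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(siloBuilder =>
    {
        siloBuilder.UseLocalhostClustering(); // development-only clustering
#pragma warning disable ORLEANSEXP002 // Type is for evaluation purposes only and is subject to change or removal in future updates.
        siloBuilder.AddDistributedGrainDirectory();
#pragma warning restore ORLEANSEXP002
    })
    .Build();

await host.RunAsync();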

Does this allow strong single-activation guarantees?

See #2428 for context.

This directory guarantees that a grain will never be concurrently active on more than one silo which is a member of the cluster, but it is possible for a grain to still be active on a silo which was evicted from the cluster yet is still operating (eg, it is very slow or has experienced a network partition and has not refreshed membership yet). We could provide stronger guarantees using leases, as described in #2428. This could be implemented host-wide by making silos call Environment.FailFast(...) on themselves after some period of time if they have not managed to successfully refresh membership (or contacted some pre-ordained set of other silos to renew their lease on life). Leases do not guarantee single activation unless you assume bounded clock drift and a bounded time between lease checks.
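
The PR does not implement such a lease. Purely as an illustration of the FailFast idea described above, a watchdog could look something like this (names, the polling interval, and the staleness threshold are assumptions):

using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: terminate the process if membership has not been
// successfully refreshed within the allowed window, so an evicted-but-alive
// silo cannot keep serving indefinitely.
static class MembershipWatchdogSketch
{
    public static async Task RunAsync(
        Func<DateTime> lastSuccessfulRefreshUtc, TimeSpan maxStaleness, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            await Task.Delay(TimeSpan.FromSeconds(5), ct);
            if (DateTime.UtcNow - lastSuccessfulRefreshUtc() > maxStaleness)
            {
                // We may have been evicted without knowing it; fail fast rather than
                // risk hosting a stale duplicate activation.
                Environment.FailFast("Membership has not been refreshed within the allowed window.");
            }
        }
    }
}

As the last sentence above notes, this only strengthens the guarantee under assumptions about clock drift and the time between checks.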

TODO

  • Rename ReplicatedGrainDirectory, since there is currently only one active replica per range. Eg, DistributedGrainDirectory, InternalGrainDirectory, ClusterGrainDirectory, RangePartitionedGrainDirectory
  • Consider batching calls
  • Add hosting extensions, marked as [Experimental] to opt-in to this implementation
  • More tests
  • Split out silo lifecycle changes into separate PR
  • Split out TaskExtensions changes into separate PR

ReubenBond force-pushed the feature/directory-coordination/final branch 2 times, most recently from 28b9edc to d8e5412 on August 6, 2024 at 17:25
ReubenBond force-pushed the feature/directory-coordination/final branch 7 times, most recently from 018ff69 to 52e0ab7 on August 28, 2024 at 21:35
ReubenBond changed the title from "[WIP] Strong consistency, distributed, in-memory grain directory" to "Strong consistency, distributed, in-memory grain directory" on Aug 28, 2024
AlgorithmsAreCool commented Sep 27, 2024

The new directory also uses a consistent hash ring, but instead of assigning each silo one section of the ring, it assigns each silo a configurable number of sections on the ring, defaulting to 30.

Does this number have any impact on load balancing between silos?

Requests are not served until recovery completes.

I assume this means only requests to views that are affected by the membership change?

To prevent requests from being served during a view change, range locks are acquired during view change, respected by each request, and released when view change completes.

Are these locks or leases? If it is a lock, what happens if a lock is taken and then the owner dies?

Lastly, in a normal silo shutdown, what is the ballpark amount of time to perform this handoff? Seconds? Milliseconds?

ReubenBond (Member, Author) commented Sep 27, 2024

Good questions, @AlgorithmsAreCool!

The new directory also uses a consistent hash ring, but instead of assigning each silo one section of the ring, it assigns each silo a configurable number of sections on the ring, defaulting to 30.

Does this number have any impact on load balancing between silos?

It does not impact the distribution of grain activations across silos (there are other features for that), but it does impact the balancing of grain directory registrations across silos. The balancing is much better with this new directory implementation since each silo claims multiple points on the ring instead of just one point on the ring. As a result, silos are typically balanced within a single digit percent of each other.
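
The balance claim is easy to check with a quick back-of-the-envelope simulation (entirely illustrative; not code from this PR). It measures the largest fraction of the ring owned by any one silo for 1 versus 30 points per silo:

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

Console.WriteLine($"10 silos, 1 point each:   largest share = {MaxShare(10, 1):P1}");
Console.WriteLine($"10 silos, 30 points each: largest share = {MaxShare(10, 30):P1}");

static double MaxShare(int siloCount, int pointsPerSilo)
{
    // Place pseudo-random points per silo on a 2^32 ring, sorted by position.
    var points = Enumerable.Range(0, siloCount)
        .SelectMany(s => Enumerable.Range(0, pointsPerSilo).Select(i => (Point: Hash($"silo-{s}/{i}"), Silo: s)))
        .OrderBy(p => p.Point)
        .ToArray();

    var owned = new double[siloCount];
    for (var i = 0; i < points.Length; i++)
    {
        // The gap between consecutive points belongs to the silo owning the next point.
        var next = points[(i + 1) % points.Length];
        var width = next.Point - points[i].Point; // unsigned wrap handles the final gap
        owned[next.Silo] += width / (double)uint.MaxValue;
    }
    return owned.Max();
}

static uint Hash(string s) => BitConverter.ToUInt32(SHA256.HashData(Encoding.UTF8.GetBytes(s)), 0);

With a single point per silo, one silo typically ends up owning a disproportionately large slice; with many points per silo the shares concentrate near the ideal 1/N.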

Requests are not served until recovery completes.

I assume this means only requests to views that are affected by the membership change?

Yes, that's right: the 'locks' are scoped to (Range, MembershipVersion), so they are at fairly small granularity.

To prevent requests from being served during a view change, range locks are acquired during view change, respected by each request, and released when view change completes.

Are these locks or leases? If it is a lock, what happens if a lock is taken and then the owner dies?

The Virtual Synchrony literature calls them 'wedges', which makes me think of wheel chocks on a plane or car wheel. They are local-only, not leases or distributed locks. Each one locks a range against access by views with a version equal to or later than the MembershipVersion in the lock entry.
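
A local-only wedge of that kind might be sketched as follows (purely illustrative): keyed on (range, MembershipVersion) and blocking requests at or after that version until released.

using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative local "wedge": not a distributed lock or lease.
sealed class RangeWedgeSketch
{
    // (range start, membership version) -> completed when the view change for that range finishes.
    private readonly ConcurrentDictionary<(uint Range, long Version), TaskCompletionSource> _wedges = new();

    public void Wedge(uint range, long version) =>
        _wedges.TryAdd((range, version), new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously));

    public void Release(uint range, long version)
    {
        if (_wedges.TryRemove((range, version), out var tcs)) tcs.SetResult();
    }

    // A request at 'requestVersion' must wait on any wedge for its range whose
    // version is equal to or earlier than the request's version.
    public Task WaitIfWedgedAsync(uint range, long requestVersion)
    {
        foreach (var (key, tcs) in _wedges)
            if (key.Range == range && key.Version <= requestVersion)
                return tcs.Task;
        return Task.CompletedTask;
    }
}

Because the wedge is process-local, it simply disappears if its owner dies; that failure is handled by the recovery path described in the design section rather than by any lock timeout.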

Lastly, in a normal silo shutdown, what is the ballpark amount of time to perform this handoff? Seconds? Milliseconds?

Typically, I expect to see 10s to 100s of milliseconds, but seconds should be expected in some cases. I've added some metrics to record recovery & transfer times. In this case, we have a 'chaotic' cluster with 500K grains, scaling out to 10 silos and then back down to 1, adding or removing a silo every 10s and ungracefully shutting down every second silo removed.
[Images: metrics showing directory range recovery and transfer durations during the chaotic scale-out/scale-in test]

ReubenBond (Member, Author) left a review suggestion on the line below:

this.Shared.ServiceProvider.GetRequiredService<InsideRuntimeClient>().BreakOutstandingMessagesToSilo(this.RemoteSiloAddress);

Suggested change: remove this line.

ReubenBond force-pushed the feature/directory-coordination/final branch 7 times, most recently from 7d551dd to a7ba9c4 on October 15, 2024 at 19:58
ReubenBond force-pushed the feature/directory-coordination/final branch 2 times, most recently from 822882c to e990caa on October 15, 2024 at 22:49
ReubenBond force-pushed the feature/directory-coordination/final branch from e990caa to a80aa98 on October 15, 2024 at 22:55
ReubenBond merged commit b5f9e66 into dotnet:main on October 16, 2024
22 checks passed
ReubenBond deleted the feature/directory-coordination/final branch on October 16, 2024 at 16:26