
Strong consistency, distributed, in-memory grain directory #9103

Merged

Conversation

ReubenBond (Member) commented Aug 6, 2024

Fixes #9074
Fixes #2656
Fixes #4217

An improved in-memory distributed grain directory

This PR introduces a new in-memory grain directory for Orleans. We plan to make this the default implementation shortly (possibly in 9.0), but this PR adds it as an experimental, opt-in feature.

The new grain directory has the following attributes, contrasting with the current default implementation:

  • Strong consistency: the new directory maintains strong consistency for all silos which are active members of the cluster. The existing directory is eventually consistent, which means that there can be duplicate grain activations for a short period during cluster membership changes, while the cluster is coalescing.
  • Reliable: with the existing directory, when a silo is removed from the cluster (eg, due to a crash or an upgrade), any grains registered to that silo must be terminated. This causes a significant amount of disruption during upgrades, to the point where some teams prefer to perform blue/green upgrades instead of rolling upgrades. The new directory does not suffer from this issue: when cluster membership changes, silos hand off the registrations they owned to the new owners. If a silo terminates before hand-off can be completed, recovery is performed: each silo in the cluster is asked for its currently active grains, and these are re-registered before continuing.
  • Balanced: the existing directory uses a consistent hash ring to assign ownership for key ranges to silos. The new directory also uses a consistent hash ring, but instead of assigning each silo one section of the ring, it assigns each silo a configurable number of sections on the ring, defaulting to 30. This has a couple of benefits. First, it probabilistically improves balance across hosts to within a few percent of each other. Second, it spreads the load of range hand-off across more silos: giving each silo a single section on the hash ring would mean that every hand-off occurs between a single pair of hosts.

Design

The directory is distributed across all silos in the cluster by partitioning the key space into ranges using consistent hashing, where each silo owns some pre-configured number of ring ranges, defaulting to 30 ranges per silo. A hash function is used to associate GrainIds with a point on the ring. The silo which owns that point on the ring is responsible for serving any requests for that grain.
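
As a rough illustration of how ownership is resolved, here is a minimal consistent-hash-ring sketch. The type names and the hash function below are illustrative stand-ins, not the actual Orleans implementation:

using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of ring-based ownership; not the real Orleans types or hash.
sealed class RingSketch
{
    // Sorted map from ring point to the silo that owns it.
    private readonly SortedDictionary<uint, string> _points = new();

    public RingSketch(IEnumerable<string> silos, int rangesPerSilo = 30)
    {
        foreach (var silo in silos)
            for (var i = 0; i < rangesPerSilo; i++)
                _points[Hash($"{silo}/{i}")] = silo;
    }

    // A grain is served by the silo owning the first ring point at or after
    // the grain's hash, wrapping around past the end of the ring.
    public string GetOwner(string grainId)
    {
        var h = Hash(grainId);
        foreach (var (point, silo) in _points)
            if (point >= h) return silo;
        return _points.First().Value; // wrap around
    }

    // Stand-in hash; the real directory uses a stable hash of the GrainId.
    private static uint Hash(string value) => (uint)StringComparer.Ordinal.GetHashCode(value);
}

For example, new RingSketch(new[] { "A", "B", "C" }, rangesPerSilo: 4).GetOwner("user/terry") returns whichever silo owns the ring point covering that grain's hash, matching the user/terry example shown below.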

Ring ranges are assigned to replicas deterministically based on the current cluster membership. We refer to consistent snapshots of membership as views, and each view has a unique number which is used to identify and order views. A new view is created whenever a silo in the cluster changes state (eg, between Joining, Active, ShuttingDown, and Dead), and all non-faulty silos learn of view changes quickly through a combination of gossip and polling of the membership service.
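
Conceptually, a view pairs a membership version with the range assignment derived from it. The shape below is only a sketch of that idea, not the PR's actual types:

using System.Collections.Generic;

// Illustrative only: a "view" is a membership version plus the
// range-to-silo assignment deterministically derived from that membership.
readonly record struct RingRange(uint Start, uint End);        // [Start, End) on the ring

sealed record DirectoryViewSketch(
    long MembershipVersion,                                    // unique; totally orders views
    IReadOnlyDictionary<RingRange, string> Owners)             // range -> owning silo
{
    public bool IsNewerThan(DirectoryViewSketch other) => MembershipVersion > other.MembershipVersion;
}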

The image below shows a representation of a ring with 3 silos (A, B, & C), each owning 4 ranges on the ring. The grain user/terry hashes to 0x4000_0000, which is owned by Silo B.

[Image: hash ring with silos A, B, and C, each owning 4 ranges; user/terry falls in a range owned by Silo B]

The directory operates in two modes: normal operation, where a fixed set of processes communicate in the absence of failure, and view change, which punctuates periods of normal operation to transfer state and control from the processes in the preceding view to the processes in the current view.
[Image: timeline alternating between periods of normal operation and view changes]

Once state and control have been transferred from the previous view, normal operation resumes. This coordinated state and control transfer is performed on a range-by-range basis, and some ranges may resume normal operation before others. For partitions which do not change owners during a view change, normal operation never stops. For partitions which do change ownership during a view change, service is suspended when the previous owner learns of the view change and resumed when the current owner successfully completes the transfer of state and control.

When a view change begins, any ring ranges which were previously owned and are now no longer owned are sealed, preventing further modifications by the previous owner, and a snapshot is created and retained in-memory. The snapshot is transferred to the new range owners and is deleted once the new owners acknowledge that it has been retrieved or if they are evicted from the cluster.
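
A seal-and-snapshot step on the previous owner might look roughly like the sketch below. The types are illustrative only; the real directory operates on ring ranges of grain registrations:

using System;
using System.Collections.Generic;

// Sketch of sealing a no-longer-owned range and snapshotting it for transfer.
sealed class RangePartitionSketch
{
    private readonly Dictionary<string, string> _registrations = new(); // grainId -> silo address
    private bool _sealed;

    public void Register(string grainId, string silo)
    {
        if (_sealed) throw new InvalidOperationException("Range is sealed for transfer.");
        _registrations[grainId] = silo;
    }

    // Called by the previous owner when it learns the range has a new owner.
    // The returned snapshot is retained until the new owner acknowledges receipt
    // (or is evicted from the cluster).
    public IReadOnlyDictionary<string, string> SealAndSnapshot()
    {
        _sealed = true; // no further modifications by the previous owner
        return new Dictionary<string, string>(_registrations);
    }
}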

The image below depicts a view change, where Silo B is shutting down and Silo C must transfer state & control from Silo B for the ranges which it owned. Silo B will not complete shutting down until Silo C acknowledges that it has completed the transfer.
[Image: view change in progress, with Silo B shutting down and Silo C taking over the ranges Silo B owned]
After the transfer has completed, the ring looks like so:
[Image: the ring after the transfer, with Silo B no longer owning any ranges]

Failures are handled by performing recovery. Partition owners perform recovery by requesting the set of registered grain activations belonging to that partition from all active silos in the cluster. Requests are not served until recovery completes. As an optimization, the new partition owner can serve lookup requests for grain registrations which it has already recovered.
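
Conceptually, recovering a range amounts to asking every active silo which of its grains fall into that range and re-registering them. A hedged sketch, assuming a hypothetical per-silo query interface:

using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical interface: each silo can report the grains it currently hosts
// whose hashes fall within a given ring range.
interface ISiloActivationQuery
{
    Task<IReadOnlyList<(string GrainId, string HostSilo)>> GetActiveGrainsInRangeAsync(uint rangeStart, uint rangeEnd);
}

static class RecoverySketch
{
    public static async Task<Dictionary<string, string>> RecoverRangeAsync(
        uint rangeStart, uint rangeEnd, IEnumerable<ISiloActivationQuery> activeSilos)
    {
        var recovered = new Dictionary<string, string>(); // grainId -> silo address
        foreach (var silo in activeSilos)
        {
            // Each active silo reports its grains in the range; they are re-registered
            // with the new owner before normal service resumes for this range.
            foreach (var (grainId, hostSilo) in await silo.GetActiveGrainsInRangeAsync(rangeStart, rangeEnd))
                recovered[grainId] = hostSilo;
        }
        return recovered;
    }
}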

To prevent requests from being served during a view change, range locks are acquired when the view change begins, respected by each request, and released when the view change completes.

All requests and responses include the current membership version. If a client or replica receives a message with a higher membership version, they refresh membership to at least that version before continuing. After refreshing their view, clients retry misdirected requests (requests targeting the wrong replica for a given point on the ring), sending them to the owner for the current view. Replicas reject misdirected requests.
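
The per-request version check can be pictured roughly as below. The request/response shapes and the callbacks are hypothetical; only the ordering logic matters here:

using System;
using System.Threading.Tasks;

// Hypothetical request/response shapes carrying the sender's membership version.
sealed record DirectoryRequest(string GrainId, long MembershipVersion);
sealed record DirectoryResponse(bool Misdirected, long MembershipVersion, string? Result = null);

sealed class ReplicaRequestHandlerSketch
{
    private long _localVersion;
    private readonly Func<string, long, bool> _ownsInCurrentView;      // (grainId, version) -> owned here?
    private readonly Func<long, Task<long>> _refreshMembershipAsync;   // refresh to at least this version
    private readonly Func<DirectoryRequest, Task<string>> _serveAsync;

    public ReplicaRequestHandlerSketch(long version, Func<string, long, bool> owns,
        Func<long, Task<long>> refresh, Func<DirectoryRequest, Task<string>> serve)
    {
        _localVersion = version;
        _ownsInCurrentView = owns;
        _refreshMembershipAsync = refresh;
        _serveAsync = serve;
    }

    public async Task<DirectoryResponse> HandleAsync(DirectoryRequest request)
    {
        // If the caller has seen a newer view than we have, catch up first.
        if (request.MembershipVersion > _localVersion)
            _localVersion = await _refreshMembershipAsync(request.MembershipVersion);

        // Replicas reject misdirected requests; after refreshing, clients retry
        // against the owner for the current view.
        if (!_ownsInCurrentView(request.GrainId, _localVersion))
            return new DirectoryResponse(Misdirected: true, _localVersion);

        return new DirectoryResponse(Misdirected: false, _localVersion, await _serveAsync(request));
    }
}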

Range partitioning is more complicated than having a fixed number of fixed-size partitions, but it has nice properties: elastic scaling (adding more hosts increases system capacity), good balancing, minimal data movement during scaling events, and a partition mapping that is derived directly from cluster membership with no other state (such as partition assignments) required.

Enabling DistributedGrainDirectory

For now, the new directory is opt-in, so you will need to use the following code to enable it on your silos:

#pragma warning disable ORLEANSEXP002 // Type is for evaluation purposes only and is subject to change or removal in future updates. Suppress this diagnostic to proceed.
siloBuilder.AddDistributedGrainDirectory();
#pragma warning restore ORLEANSEXP002 // Type is for evaluation purposes only and is subject to change or removal in future updates. Suppress this diagnostic to proceed.

If performing a rolling upgrade, you will experience unavailability of the directory until all hosts are running the new directory.
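
For context, here is roughly where the snippet above fits in a standard .NET generic-host setup. The hosting calls shown are the usual Orleans hosting APIs; localhost clustering and the exact namespaces needed for AddDistributedGrainDirectory are assumptions of this sketch, not requirements of the new directory:

using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

using var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(siloBuilder =>
    {
        siloBuilder.UseLocalhostClustering(); // development-only clustering
#pragma warning disable ORLEANSEXP002 // Type is for evaluation purposes only and is subject to change or removal in future updates.
        siloBuilder.AddDistributedGrainDirectory();
#pragma warning restore ORLEANSEXP002
    })
    .Build();

await host.RunAsync();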

Does this allow strong single-activation guarantees?

See #2428 for context.

This directory guarantees that a grain will never be concurrently active on more than one silo which is a member of the cluster, but it is possible for a grain to still be active on a silo which was evicted from the cluster yet is still operating (eg, it is very slow or has experienced a network partition and has not refreshed membership yet). We could provide stronger guarantees using leases, as described in #2428. This could be implemented host-wide by making silos call Environment.FailFast(...) on themselves after some period of time if they have not managed to successfully refresh membership (or contacted some pre-ordained set of other silos to renew their lease on life). Leases do not guarantee single activation unless you assume bounded clock drift and a bounded time between lease checks.
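
The PR does not implement such a lease. Purely as an illustration of the FailFast idea described above, a watchdog could look something like this (names, the polling interval, and the staleness threshold are assumptions):

using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: terminate the process if membership has not been
// successfully refreshed within the allowed window, so an evicted-but-alive
// silo cannot keep serving indefinitely.
static class MembershipWatchdogSketch
{
    public static async Task RunAsync(
        Func<DateTime> lastSuccessfulRefreshUtc, TimeSpan maxStaleness, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            await Task.Delay(TimeSpan.FromSeconds(5), ct);
            if (DateTime.UtcNow - lastSuccessfulRefreshUtc() > maxStaleness)
            {
                // We may have been evicted without knowing it; fail fast rather than
                // risk hosting a stale duplicate activation.
                Environment.FailFast("Membership has not been refreshed within the allowed window.");
            }
        }
    }
}

As the last sentence above notes, this only strengthens the guarantee under assumptions about clock drift and the time between checks.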

TODO

  • Rename ReplicatedGrainDirectory, since there is currently only one active replica per range. Eg, DistributedGrainDirectory, InternalGrainDirectory, ClusterGrainDirectory, RangePartitionedGrainDirectory
  • Consider batching calls
  • Add hosting extensions, marked as [Experimental] to opt-in to this implementation
  • More tests
  • Split out silo lifecycle changes into separate PR
  • Split out TaskExtensions changes into separate PR

ReubenBond force-pushed the feature/directory-coordination/final branch 2 times, most recently from 28b9edc to d8e5412 on August 6, 2024 at 17:25
ReubenBond force-pushed the feature/directory-coordination/final branch 7 times, most recently from 018ff69 to 52e0ab7 on August 28, 2024 at 21:35
ReubenBond changed the title from "[WIP] Strong consistency, distributed, in-memory grain directory" to "Strong consistency, distributed, in-memory grain directory" on Aug 28, 2024
AlgorithmsAreCool commented Sep 27, 2024

The new directory also uses a consistent hash ring, but instead of assigning each silo one section of the ring, it assigns each silo a configurable number of sections on the ring, defaulting to 30.

Does this number have any impact on load balancing between silos?

Requests are not served until recovery completes.

I assume this means only requests to views that are affected by the membership change?

To prevent requests from being served during a view change, range locks are acquired during view change, respected by each request, and released when view change completes.

Are these locks or leases? If it is a lock, what happens if a lock is taken and then the owner dies?

Lastly, in a normal silo shutdown, what is the ballpark amount of time to perform this handoff? Seconds? Milliseconds?

ReubenBond (Member, Author) commented Sep 27, 2024

Good questions, @AlgorithmsAreCool!

The new directory also uses a consistent hash ring, but instead of assigning each silo one section of the ring, it assigns each silo a configurable number of sections on the ring, defaulting to 30.

Does this number have any impact on load balancing between silos?

It does not impact the distribution of grain activations across silos (there are other features for that), but it does impact the balancing of grain directory registrations across silos. The balancing is much better with this new directory implementation since each silo claims multiple points on the ring instead of just one point on the ring. As a result, silos are typically balanced within a single digit percent of each other.
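
The balance claim is easy to check with a quick back-of-the-envelope simulation (entirely illustrative; not code from this PR). It measures the largest fraction of the ring owned by any one silo for 1 versus 30 points per silo:

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

Console.WriteLine($"10 silos, 1 point each:   largest share = {MaxShare(10, 1):P1}");
Console.WriteLine($"10 silos, 30 points each: largest share = {MaxShare(10, 30):P1}");

static double MaxShare(int siloCount, int pointsPerSilo)
{
    // Place pseudo-random points per silo on a 2^32 ring, sorted by position.
    var points = Enumerable.Range(0, siloCount)
        .SelectMany(s => Enumerable.Range(0, pointsPerSilo).Select(i => (Point: Hash($"silo-{s}/{i}"), Silo: s)))
        .OrderBy(p => p.Point)
        .ToArray();

    var owned = new double[siloCount];
    for (var i = 0; i < points.Length; i++)
    {
        // The gap between consecutive points belongs to the silo owning the next point.
        var next = points[(i + 1) % points.Length];
        var width = next.Point - points[i].Point; // unsigned wrap handles the final gap
        owned[next.Silo] += width / (double)uint.MaxValue;
    }
    return owned.Max();
}

static uint Hash(string s) => BitConverter.ToUInt32(SHA256.HashData(Encoding.UTF8.GetBytes(s)), 0);

With a single point per silo, one silo typically ends up owning a disproportionately large slice; with many points per silo the shares concentrate near the ideal 1/N.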

Requests are not served until recovery completes.

I assume this means only requests to views that are affected by the membership change?

Yes, that's right: the 'locks' are scoped to (Range, MembershipVersion), so they are at fairly small granularity.

To prevent requests from being served during a view change, range locks are acquired during view change, respected by each request, and released when view change completes.

Are these locks or leases? If it is a lock, what happens if a lock is taken and then the owner dies?

The Virtual Synchrony literature calls them 'wedges', which makes me think of wheel chocks on a plane or car wheel. They are local-only, not leases or distributed locks. Each one locks a range against access by views with a version equal to or later than the MembershipVersion in the lock entry.
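
A local-only wedge of that kind might be sketched as follows (purely illustrative): keyed on (range, MembershipVersion) and blocking requests at or after that version until released.

using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative local "wedge": not a distributed lock or lease.
sealed class RangeWedgeSketch
{
    // (range start, membership version) -> completed when the view change for that range finishes.
    private readonly ConcurrentDictionary<(uint Range, long Version), TaskCompletionSource> _wedges = new();

    public void Wedge(uint range, long version) =>
        _wedges.TryAdd((range, version), new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously));

    public void Release(uint range, long version)
    {
        if (_wedges.TryRemove((range, version), out var tcs)) tcs.SetResult();
    }

    // A request at 'requestVersion' must wait on any wedge for its range whose
    // version is equal to or earlier than the request's version.
    public Task WaitIfWedgedAsync(uint range, long requestVersion)
    {
        foreach (var (key, tcs) in _wedges)
            if (key.Range == range && key.Version <= requestVersion)
                return tcs.Task;
        return Task.CompletedTask;
    }
}

Because the wedge is process-local, it simply disappears if its owner dies; that failure is handled by the recovery path described in the design section rather than by any lock timeout.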

Lastly, in a normal silo shutdown, what is the ballpark amount of time to perform this handoff? Seconds? Milliseconds?

Typically, I expect to see 10s to 100s of milliseconds, but seconds should be expected in some cases. I've added some metrics to record recovery & transfer times. In this case, we have a 'chaotic' cluster with 500K grains, scaling out to 10 silos and then back down to 1, adding or removing a silo every 10s and ungracefully shutting down every second silo removed.
[Images: metrics showing directory range recovery and transfer durations during the chaotic scale-out/scale-in test]

ReubenBond (Member, Author) left a review suggestion on the line below:

this.Shared.ServiceProvider.GetRequiredService<InsideRuntimeClient>().BreakOutstandingMessagesToSilo(this.RemoteSiloAddress);

Suggested change: remove this line.

ReubenBond force-pushed the feature/directory-coordination/final branch 7 times, most recently from 7d551dd to a7ba9c4 on October 15, 2024 at 19:58
ReubenBond force-pushed the feature/directory-coordination/final branch 2 times, most recently from 822882c to e990caa on October 15, 2024 at 22:49
ReubenBond force-pushed the feature/directory-coordination/final branch from e990caa to a80aa98 on October 15, 2024 at 22:55
ReubenBond merged commit b5f9e66 into dotnet:main on October 16, 2024
22 checks passed
ReubenBond deleted the feature/directory-coordination/final branch on October 16, 2024 at 16:26