Strong consistency, distributed, in-memory grain directory #9103
Conversation
Does this number have any impact on load balancing between silos?
I assume this means only requests to views that are affected by the membership change?
Are these locks or leases? If it is a lock, what happens if a lock is taken and then the owner dies? Lastly, in a normal silo shutdown, what is the ballpark amount of time to perform this handoff? Seconds? Milliseconds?
Good questions, @AlgorithmsAreCool!
It does not impact the distribution of grain activations across silos (there are other features for that), but it does impact the balancing of grain directory registrations across silos. The balancing is much better with this new directory implementation since each silo claims multiple points on the ring instead of just one point on the ring. As a result, silos are typically balanced within a single digit percent of each other.
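The effect of claiming multiple ring points per silo can be sanity-checked with a short simulation. This is an illustrative sketch, not Orleans code; the silo names and hash choice are assumptions, and the exact percentages depend on them:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

class RingBalance
{
    // Returns the largest fraction of the keyspace owned by any one silo.
    static double MaxShare(int siloCount, int pointsPerSilo)
    {
        var points = new SortedDictionary<uint, int>();
        for (int s = 0; s < siloCount; s++)
            for (int i = 0; i < pointsPerSilo; i++)
            {
                byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes($"silo-{s}/point-{i}"));
                points[BitConverter.ToUInt32(digest, 0)] = s;
            }

        var owned = new double[siloCount];
        uint previous = points.Keys.Last(); // the first arc wraps around from the last point
        foreach (var (point, silo) in points)
        {
            owned[silo] += point - previous; // uint subtraction wraps, giving the arc length
            previous = point;
        }
        return owned.Max() / Math.Pow(2, 32);
    }

    static void Main()
    {
        // With one point per silo, the largest share is typically far above the
        // ideal 10%; with 30 points per silo it lands much closer to 10%.
        Console.WriteLine($"1 point/silo:   max share {MaxShare(10, 1):P1}");
        Console.WriteLine($"30 points/silo: max share {MaxShare(10, 30):P1}");
    }
}
```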
Yes, that's right: the 'locks' are scoped to individual ranges.
The Virtual Synchrony literature calls them 'wedges', which makes me think of wheel chocks on a plane or car wheel. They are local-only, not leases or distributed locks. Each one locks a range against access by views with a version equal to or later than the wedge's own version.
Typically, I expect to see tens to hundreds of milliseconds, but seconds should be expected in some cases. I've added some metrics to record recovery & transfer times. In this case, we have a 'chaotic' cluster with 500K grains, scaling out to 10 silos and then back down to 1, adding or removing a silo every 10s and ungracefully shutting down every second silo that is removed.
this.Shared.ServiceProvider.GetRequiredService<InsideRuntimeClient>().BreakOutstandingMessagesToSilo(this.RemoteSiloAddress);
test/TesterInternal/GrainDirectory/DistributedGrainDirectoryResilienceTests.cs
Fixes #9074
Fixes #2656
Fixes #4217
An improved in-memory distributed grain directory
This PR introduces a new in-memory grain directory for Orleans. We plan to make this the default implementation shortly (possibly in 9.0), but this PR adds it as an experimental, opt-in feature.
The new grain directory has the following attributes, contrasting with the current default implementation:
Design
The directory is distributed across all silos in the cluster by partitioning the key space into ranges using consistent hashing, where each silo owns some pre-configured number of ring ranges, defaulting to 30 ranges per silo. A hash function is used to associate `GrainId`s with a point on the ring. The silo which owns that point on the ring is responsible for serving any requests for that grain.

Ring ranges are assigned to replicas deterministically based on the current cluster membership. We refer to consistent snapshots of membership as views, and each view has a unique number which is used to identify and order views. A new view is created whenever a silo in the cluster changes state (eg, between `Joining`, `Active`, `ShuttingDown`, and `Dead`), and all non-faulty silos learn of view changes quickly through a combination of gossip and polling of the membership service.

The image below shows a representation of a ring with 3 silos (A, B, & C), each owning 4 ranges on the ring. The grain `user/terry` hashes to `0x4000_000`, which is owned by Silo B.

The directory operates in two modes: normal operation, where a fixed set of processes communicate in the absence of failure, and view change, which punctuates periods of normal operation to transfer state and control from the processes in the preceding view to the processes in the current view.
Once state and control have been transferred from the previous view, normal operation resumes. This coordinated state and control transfer is performed on a range-by-range basis, and some ranges may resume normal operation before others. For partitions which do not change owners during a view change, normal operation never stops. For partitions which do change ownership during a view change, service is suspended when the previous owner learns of the view change and resumes once the current owner has successfully transferred state and control.
When a view change begins, any ring ranges which were previously owned and are now no longer owned are sealed, preventing further modifications by the previous owner, and a snapshot is created and retained in memory. The snapshot is transferred to the new range owners and is deleted once the new owners acknowledge that it has been retrieved, or if they are evicted from the cluster.
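The seal-and-snapshot step can be sketched as follows; the type and member names here are illustrative assumptions, not the PR's actual API:

```csharp
using System;
using System.Collections.Generic;

// Sketch: a directory partition seals itself when it loses ownership of a
// range at a view change, retaining an immutable snapshot for the new owner.
class DirectoryPartition
{
    private readonly Dictionary<string, string> _registrations = new(); // GrainId -> SiloAddress
    private Dictionary<string, string>? _snapshot;
    private bool _sealed;

    public void Register(string grainId, string siloAddress)
    {
        if (_sealed)
            throw new InvalidOperationException("Range is sealed: ownership moved in a later view.");
        _registrations[grainId] = siloAddress;
    }

    // Called on the previous owner when a view change strips it of this range.
    public IReadOnlyDictionary<string, string> Seal()
    {
        _sealed = true;
        _snapshot ??= new Dictionary<string, string>(_registrations);
        return _snapshot; // transferred to the new owner
    }

    // Called once the new owner acknowledges retrieval, or it is evicted.
    public void ReleaseSnapshot() => _snapshot = null;
}
```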
The image below depicts a view change, where Silo B is shutting down and Silo C must transfer state & control from Silo B for the ranges which it owned. Silo B will not complete shutting down until Silo C acknowledges that it has completed the transfer.
After the transfer has completed, the ring looks like so:
Failures are handled by performing recovery. Partition owners perform recovery by requesting the set of registered grain activations belonging to that partition from all active silos in the cluster. Requests are not served until recovery completes. As an optimization, the new partition owner can serve lookup requests for grain registrations which it has already recovered.
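A sketch of that recovery protocol, under the same caveat: every type here, including `ISiloDirectoryClient` and its method, is a hypothetical stand-in for the PR's internal interfaces:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

record RingRange(uint Start, uint End); // hypothetical

interface ISiloDirectoryClient // hypothetical
{
    // Asks a silo which grain activations it hosts whose ids hash into the range.
    Task<IReadOnlyDictionary<string, string>> GetRegisteredActivationsAsync(RingRange range);
}

static class Recovery
{
    // The new owner rebuilds a lost partition by querying every active silo.
    // Lookups for grains recovered so far can be served before this completes.
    public static async Task<Dictionary<string, string>> RecoverRangeAsync(
        IReadOnlyList<ISiloDirectoryClient> activeSilos, RingRange range)
    {
        var registrations = new Dictionary<string, string>(); // GrainId -> SiloAddress
        foreach (var silo in activeSilos)
            foreach (var (grainId, address) in await silo.GetRegisteredActivationsAsync(range))
                registrations[grainId] = address;
        return registrations;
    }
}
```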
To prevent requests from being served during a view change, range locks are acquired during view change, respected by each request, and released when view change completes.
All requests and responses include the current membership version. If a client or replica receives a message with a higher membership version, they refresh membership to at least that version before continuing. After refreshing their view, clients retry misdirected requests (requests targeting the wrong replica for a given point on the ring), sending them to the owner for the current view. Replicas reject misdirected requests.
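The version-checking rule can be sketched roughly like this, again with simplified, hypothetical types rather than the PR's code:

```csharp
using System;

// Sketch: requests carry the sender's membership version; stale parties
// refresh before proceeding, and replicas reject misdirected requests.
record DirectoryRequest(string GrainId, long MembershipVersion);

class DirectoryReplica
{
    public long MembershipVersion { get; private set; }

    // Hypothetical predicate: does this replica own the grain's ring point
    // in its current view?
    public Func<string, bool> Owns { get; init; } = _ => true;

    public (bool Served, bool Misdirected) Handle(DirectoryRequest request)
    {
        // A newer version on the request proves our view is stale: refresh first.
        if (request.MembershipVersion > MembershipVersion)
            MembershipVersion = request.MembershipVersion; // stand-in for a real refresh

        if (!Owns(request.GrainId))
            return (Served: false, Misdirected: true); // caller refreshes and retries

        // ... serve the registration or lookup ...
        return (Served: true, Misdirected: false);
    }
}
```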
Range partitioning is more complicated than having a fixed number of fixed-size partitions, but it has nice properties such as elastic scaling (adding more hosts increases system capacity), good balancing, minimal data movement during scaling events, and a partition mapping derived directly from cluster membership, with no other state (such as partition assignments) required.
Enabling `DistributedGrainDirectory`

For now, the new directory is opt-in, so you will need to use the following code to enable it on your silos:
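The snippet itself did not survive extraction into this copy. The following is a sketch of what enabling it likely looks like; treat the `AddDistributedGrainDirectory` extension-method name and the hosting setup as assumptions to verify against the PR source:

```csharp
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);
builder.UseOrleans(silo =>
{
    // Opt in to the experimental distributed grain directory
    // (assumed extension-method name; check the PR for the exact API).
    silo.AddDistributedGrainDirectory();
});
builder.Build().Run();
```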
If performing a rolling upgrade, you will experience unavailability of the directory until all hosts have the new directory enabled.
Does this allow strong single-activation guarantees?
See #2428 for context.
This directory guarantees that a grain will never be concurrently active on more than one silo which is a member of the cluster, but it is possible for a grain to still be active on a silo which was evicted from the cluster and is still operating (eg, it is very slow or has experienced a network partition and has not refreshed membership yet). We could provide some stronger guarantees using leases, as described in #2428. This could be implemented host-wide by making silos call `Environment.FailFast(...)` on themselves after some period of time if they have not managed to successfully refresh membership (or contacted some pre-ordained set of other silos to renew their lease on life). Leases do not guarantee single activation unless you assume bounded clock drift and bounded time between lease checks.

TODO
- Choose a final name; `ReplicatedGrainDirectory` is inaccurate, since there is currently only one active replica per range. Eg, `DistributedGrainDirectory`, `InternalGrainDirectory`, `ClusterGrainDirectory`, `RangePartitionedGrainDirectory`.
- Require `[Experimental]` to opt in to this implementation.