-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2.0 view change bugfix #3
base: develop
Are you sure you want to change the base?
Conversation
commit a view change using quorums not including the new view
This commit completely overhauls how riak_ensemble handles global cluster state as well as how individual ensembles handle membership and view changes. These changes are designed to address several corner cases that were identified during testing. While there were numerous specific bugs encountered, the fundamental issue was always global or ensemble state diverging and/or not propagating as intended. This commit addresses these issues by simplifying state management. **** Global state changes **** The first major change deals with global state. Previously, global state was represented as distinct keys stored in the root ensemble which were updated using a variety of key-specific modification functions. The root leader was responsible for periodically broadcasting out this state to the ensemble manager on all cluster nodes. However, there are cases in which the state needs to propagate in order for the root ensemble to be able to elect a leader. Having the root leader responsible for state propagation therefore does not work. Rather than storing root state as ad hoc keys, the cluster state is now managed as a single entity in riak_ensemble_state.erl. Furthermore, all elements of this state are independently versioned and implement merging logic that converges to the most recent value. This change removes the requirement that state propagation needs to be handled by a single actor. Instead, the riak_ensemble_manager on each node now periodically gossips with other nodes in the cluster. Thus, a change made to the state on one node will eventually propagate to all other nodes. The root leader is still used as a serialization point for modifying certain elements of the cluster state. The root ensemble maintains its own version of the cluster state object which is consistently modified as needed. The root leader then periodically gossips this state to the local ensemble manager, which eventually propagates to all other nodes. The new root ensemble logic is implemented in the new riak_ensemble_root module rather than as part of riak_ensemble_manager. **** Ensemble manager changes **** The riak_ensemble_manager has been simplified. In the past, the manager maintained a lot of its own state and had to constantly reconcile that with the global state. This was error prone, and bugs were found where the manager diverged from the global state. Now, the new propagated cluster state serves as the primary truth. The manager simply derives its state from the cluster state, operating mostly as a pure entity. Likewise, peer pid management has also been simplified. The manager still manages the lookup table for remote peer pids, but no longer handles local pids. Instead, the riak_ensemble_peer_sup now handles tracking local peers, removing the need to ensure the manager and supervisor are in sync. Lastly, the manager no longer provides a bootstrap view to local peers. Instead, peers periodically check their view against the view published in the manger ETS table, which is itself derived from the propagated cluster state. This removes yet another place where duplicated state could become out of sync. **** Clustering changes *** This commit changes how clusters are initially created. Previously, singleton nodes would start up with their own independent root ensembles and cluster state. Nodes were then federated into a cluster by joining them to each other, and the joining node would replace its state with the state of the target node. Unfortunately, there is a race between when the node joins the cluster and when it notices its old ensembles are outdated and shuts them down. It was possible for a stale root leader from the singleton cluster to modify cluster state in a way that prevents it from merging with the replacement state. Even if this specific issue was fixed, it is fundamentally unsound to join two completely disjoint ensemble histories together. This commit takes a different approach. Singleton nodes now come online without any running ensembles. The root ensemble is created by calling riak_ensemble_manager:enable(). The new joining protocol only allows not-enabled nodes to join enabled nodes. A cluster is therefore created by enabling one node, and then joining the rest of the nodes to that one node. The joining node adopts the state of the target node, and has no previously activate state to worry about replacing. **** Peer membership/view changes **** Dynamic ensemble membership is handled via joint consensus. Previously, when changing peer membership, a request was sent to the ensemble leader who would directly commit the joint view to the ensemble. However, peers are created dynamically. Thus, the ensemble would have a joint view that includes peers that have not yet been started. This new view information would need to be propagated across the cluster before the relevant ensemble managers would notice and start the necessary peers. There are corner cases in which view propagation is dependent on the ensemble leader. Unfortunately, since the leader immediately committed the joint view, the leader needs a quorum from the new peers to maintain its leadership. So, in the right scenario, the ensemble would stall and fail to make progress. This commit separates the notion of peer creation from view change. Rather than directly committing the view change, the ensemble leader now commits the pending view change. This ensure the value is durable and consistent, but does not affect the ability for the ensemble to maintain quorum. This pending state is eventually propagated from the leader to the local ensemble manager via periodic gossip, and then from the local manager to the rest of the cluster via global gossip. Relevant managers will then create the necessary new peers as needed based on this pending view. The leader periodically checks if there is a pending view change. If so, the leader will then commit the pending view as a joint view. The leader will then later transition to the final view as it already did before this commit, and then eventually the leader will notice that it has fully committed the pending view and will delete the stale pending view. Thus, the entire view change protocol is now a safe multi-step process.
If this is still valid, it likely needs a rebase. |
Just a note. Joe and I are still thinking about what to do with this. There have been related changes to the protocol since this pr. The pending state in particular makes this somewhat unnecessary, but we don't want to rule this patch out prematurely. |
I'll make a go/no-go decision on this issue today. |
This hasn't turned out to be a major issue in practice and we're running short on time for 2.0. Let's punt this for now and revisit when we get around to also addressing #16 (eg. do both at once). Likely a 2.1 change, but could potentially be a 2.0.1 change. Moving to 2.0.1 milestone for now. |
commit a view change using quorums not including the new view
This change is