Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2.0 view change bugfix #3

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from
Open

Conversation

andrewjstone
Copy link
Contributor

@andrewjstone andrewjstone commented Jan 30, 2014

commit a view change using quorums not including the new view


This change is Reviewable

commit a view change using quorums not including the new view
@jtuple jtuple added this to the 2.0-beta milestone Mar 24, 2014
sumerman referenced this pull request Mar 28, 2014
This commit completely overhauls how riak_ensemble handles global
cluster state as well as how individual ensembles handle membership
and view changes.

These changes are designed to address several corner cases that were
identified during testing. While there were numerous specific bugs
encountered, the fundamental issue was always global or ensemble state
diverging and/or not propagating as intended.

This commit addresses these issues by simplifying state management.

**** Global state changes ****

The first major change deals with global state.

Previously, global state was represented as distinct keys stored in
the root ensemble which were updated using a variety of key-specific
modification functions. The root leader was responsible for
periodically broadcasting out this state to the ensemble manager on
all cluster nodes.

However, there are cases in which the state needs to propagate in order
for the root ensemble to be able to elect a leader. Having the root
leader responsible for state propagation therefore does not work.

Rather than storing root state as ad hoc keys, the cluster state is now
managed as a single entity in riak_ensemble_state.erl. Furthermore, all
elements of this state are independently versioned and implement merging
logic that converges to the most recent value. This change removes the
requirement that state propagation needs to be handled by a single
actor. Instead, the riak_ensemble_manager on each node now periodically
gossips with other nodes in the cluster. Thus, a change made to the
state on one node will eventually propagate to all other nodes.

The root leader is still used as a serialization point for modifying
certain elements of the cluster state. The root ensemble maintains its
own version of the cluster state object which is consistently modified
as needed. The root leader then periodically gossips this state to the
local ensemble manager, which eventually propagates to all other nodes.

The new root ensemble logic is implemented in the new riak_ensemble_root
module rather than as part of riak_ensemble_manager.

**** Ensemble manager changes ****

The riak_ensemble_manager has been simplified. In the past, the manager
maintained a lot of its own state and had to constantly reconcile that
with the global state. This was error prone, and bugs were found where
the manager diverged from the global state.

Now, the new propagated cluster state serves as the primary truth. The
manager simply derives its state from the cluster state, operating
mostly as a pure entity.

Likewise, peer pid management has also been simplified. The manager
still manages the lookup table for remote peer pids, but no longer
handles local pids. Instead, the riak_ensemble_peer_sup now handles
tracking local peers, removing the need to ensure the manager and
supervisor are in sync.

Lastly, the manager no longer provides a bootstrap view to local
peers. Instead, peers periodically check their view against the
view published in the manger ETS table, which is itself derived
from the propagated cluster state. This removes yet another place
where duplicated state could become out of sync.

**** Clustering changes ***

This commit changes how clusters are initially created. Previously,
singleton nodes would start up with their own independent root ensembles
and cluster state. Nodes were then federated into a cluster by joining
them to each other, and the joining node would replace its state with
the state of the target node.

Unfortunately, there is a race between when the node joins the cluster
and when it notices its old ensembles are outdated and shuts them down.
It was possible for a stale root leader from the singleton cluster to
modify cluster state in a way that prevents it from merging with the
replacement state.

Even if this specific issue was fixed, it is fundamentally unsound to
join two completely disjoint ensemble histories together.

This commit takes a different approach.

Singleton nodes now come online without any running ensembles. The root
ensemble is created by calling riak_ensemble_manager:enable().  The new
joining protocol only allows not-enabled nodes to join enabled nodes. A
cluster is therefore created by enabling one node, and then joining the
rest of the nodes to that one node. The joining node adopts the state of
the target node, and has no previously activate state to worry about
replacing.

**** Peer membership/view changes ****

Dynamic ensemble membership is handled via joint consensus. Previously,
when changing peer membership, a request was sent to the ensemble leader
who would directly commit the joint view to the ensemble.

However, peers are created dynamically. Thus, the ensemble would have a
joint view that includes peers that have not yet been started. This new
view information would need to be propagated across the cluster before
the relevant ensemble managers would notice and start the necessary
peers. There are corner cases in which view propagation is dependent on
the ensemble leader. Unfortunately, since the leader immediately
committed the joint view, the leader needs a quorum from the new peers
to maintain its leadership. So, in the right scenario, the ensemble
would stall and fail to make progress.

This commit separates the notion of peer creation from view change.

Rather than directly committing the view change, the ensemble leader
now commits the pending view change. This ensure the value is durable
and consistent, but does not affect the ability for the ensemble to
maintain quorum.

This pending state is eventually propagated from the leader to the local
ensemble manager via periodic gossip, and then from the local manager to
the rest of the cluster via global gossip. Relevant managers will then
create the necessary new peers as needed based on this pending view.

The leader periodically checks if there is a pending view change. If so,
the leader will then commit the pending view as a joint view. The leader
will then later transition to the final view as it already did before
this commit, and then eventually the leader will notice that it has
fully committed the pending view and will delete the stale pending view.
Thus, the entire view change protocol is now a safe multi-step process.
@andrewjstone andrewjstone modified the milestones: 2.0-RC, 2.0-beta Apr 10, 2014
@lordnull
Copy link

lordnull commented Jun 4, 2014

If this is still valid, it likely needs a rebase.

@andrewjstone
Copy link
Contributor Author

Just a note. Joe and I are still thinking about what to do with this. There have been related changes to the protocol since this pr. The pending state in particular makes this somewhat unnecessary, but we don't want to rule this patch out prematurely.

@jtuple
Copy link
Contributor

jtuple commented Jun 23, 2014

I'll make a go/no-go decision on this issue today.

@jtuple
Copy link
Contributor

jtuple commented Jul 9, 2014

This hasn't turned out to be a major issue in practice and we're running short on time for 2.0. Let's punt this for now and revisit when we get around to also addressing #16 (eg. do both at once). Likely a 2.1 change, but could potentially be a 2.0.1 change. Moving to 2.0.1 milestone for now.

@jtuple jtuple modified the milestones: 2.0.1, 2.0-RC Jul 9, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants