-
Notifications
You must be signed in to change notification settings - Fork 315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Agadgil/chef 15608 probe ping #9420
base: main
Are you sure you want to change the base?
Conversation
A `ProbePing` message will be sent to another supervisor if we receive a `Confirmed` message from another supervisor. This message will be used to make sure we do not blindly accept a supervisor to be `Confirmed` based on what some other node has to say. This is particularly important if a (set of) node(s) enter into partition and come out of partition, then they would think other supervisors are `Confirmed` and spread that 'rumor`, resulting in otherwise `Alive` supervisors to be construed as `Confirmed` by some supervisors. This message will check a supposedly `Confirmed` supervisor with a `ProbePing` message and if there is a response to that `ProbePing` the supervisor is not considered `Confirmed`. The `ProbePing` message is sent in the `outbound` loop. The way this is done is we maintain a set of members to be `ProbePing`ed in a `probe_list` inside the `Server` struct. Signed-off-by: Abhijit Gadgil <[email protected]>
The `insert_member_mlw_rhw` API is supposed to insert the member in the member list (note this is different from `insert_member_from_rumor`). But due to the way we use `Incarnation`, the `Alive` health insertion is not successful. We are using this new API for marking sender as `Alive`. This can be better handled by `insert_member` but there are way too many test cases around that logic, which is not very wise to change at the moment. Signed-off-by: Abhijit Gadgil <[email protected]>
If we receive an election with a `new_term`, we start a new election *only if* we were the leader of previous term. And if we were not *leader* of the previous term, we join the election rumor only when we receive one from the leader. When the leader dies (2 or more) members will re-start election with *new term* and hence the above *new term* logic won't come into play and the normal *leader Election* algorithm will come into play and will work. Signed-off-by: Abhijit Gadgil <[email protected]>
👷 Deploy Preview for chef-habitat processing.
|
ce62e93
to
097ec08
Compare
`five_members_elect_a_new_leader_when_they_are_quorum_partitioned` test case is ignored for now. This test case right now fails because we *explicitly* restart election on a *non-leader* node. The modified algorithm _may_ need to account from such external restarts. Fixed some `rustfmt` failures Signed-off-by: Abhijit Gadgil <[email protected]>
097ec08
to
c4078fd
Compare
@agadgil-progress @mwrock would this need docs updates or additional docs for customer/users? |
Signed-off-by: Abhijit Gadgil <[email protected]>
The implementation alone does not require any additional documentation. However if we change the logic to add |
If a node reboots while it's election is in `Running` state and during the interval, the election is `Finished`. The rebooted node was not able to join the finished election. If the election is `Finished` for the same term. If we receive a rumor that says election is not `Finished` while our state of the election is `Finished` and we steal their vote and make a `hot` rumor so that the other member gets a chance to `finish` their election. Signed-off-by: Abhijit Gadgil <[email protected]>
We need to *restart* election on all the partitioned followers (so that each one of them increases term by 1) and then they can lead to election `Finished` state. This is what will happen in real world where the followers will determine the leader is `Confirmed` and independently update their *term* and start a new election. Signed-off-by: Abhijit Gadgil <[email protected]>
0223633
to
0b12b7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely see this as a step in the right direction. I think the whole prob ping thing has the potential fix bad "confirmed" rumors from being spread. The one hole that I see unless I am missing something is with suspect
rumors. What happens when we ger a "bad" suspect
rumor, we set that member to suspect
but we don't get around to pinging that member before the expiration? The more members in the ring, the more likely that could happen I think. Again, I think this is moving in the right direction an we just need to think that scenario through.
I don't really see the probe ping concept helping the problem of electing different leaders though. I really think we need to be able to pump up the suitability of the previous leader to fix that. See my specific comment on that.
I left some comments on the trace
statements and then stopped doing that because I figured you probably are not that concerned with "perfect" tracing messages just yet and we can iron that out later.
Yes that's right!! |
Added a `suitability` parameter to `start_election_rsw_mlr_rhw_msr`. This will be used by the `leader` of the current term to make itself as a leader in the next term as well. Plus other review comments Signed-off-by: Abhijit Gadgil <[email protected]>
Signed-off-by: Abhijit Gadgil <[email protected]>
If we are a leader and if we restart the elections for whatever reasons (believing we lost quorum), we should start with `Max` suitability Signed-off-by: Abhijit Gadgil <[email protected]>
Signed-off-by: Abhijit Gadgil <[email protected]>
Signed-off-by: Abhijit Gadgil <[email protected]>
39e58b2
to
c1a7fcb
Compare
Signed-off-by: Abhijit Gadgil <[email protected]>
CHEF-15608