Should Ra return an error while multiple nodes try to leave the cluster? #211
Comments
Sounds good. Ra intentionally limits membership changes to one at a time because reasoning about multi-member changes is complex. Some other Raft implementations have adopted the same limitation. This is open source software, so please submit a PR.
I don't think this issue should be labelled as a question; maybe I didn't describe the problem well enough. When multiple ra nodes try to leave the cluster at about the same time, the result of calling
The I would like to submit a PR to fix this bug. But the time required to understand the internals of ra, then fix the bug and possibly write a test case, means that this will happen sometime far, far in the future.
Some of these come from natural race conditions where a leader is removed and a request is issued to the old leader pid. I suspect retrying is the only reasonable way forward here.
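A minimal sketch of what such a retry could look like on the caller's side. The module `leave_retry` and the function `retry/3` are hypothetical names, not part of Ra, and treating `noproc` and `cluster_change_not_permitted` as transient is an assumption drawn only from the errors mentioned in this thread.

```erlang
-module(leave_retry).
-export([retry/3]).

%% Hypothetical helper, not part of Ra.
%% Calls Fun() up to Attempts times, sleeping DelayMs milliseconds between
%% attempts, for as long as Fun() returns an error assumed to be transient.
%% The last result is returned once the attempts are used up.
retry(Fun, Attempts, DelayMs) when is_function(Fun, 0), Attempts > 0 ->
    case Fun() of
        {error, Reason} when Attempts > 1,
                             (Reason =:= noproc orelse
                              Reason =:= cluster_change_not_permitted) ->
            %% Assumption: the leader is gone or another membership change
            %% is in flight, so wait briefly and try again.
            timer:sleep(DelayMs),
            retry(Fun, Attempts - 1, DelayMs);
        Result ->
            %% Success, a non-transient error, or the final attempt.
            Result
    end.
```

The caller would pass a zero-arity fun that wraps `ra:leave_and_delete_server` with whatever arguments their Ra version expects. Taking a fun keeps the helper independent of the exact Ra call being retried.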
That's what I'm doing in my test suite for all the error cases.
noproc is probably returned when the leader has gone away and the call is redirected to the now-gone leader. If this is the case (I haven't validated it), then I think noproc is appropriate.
When multiple nodes try to leave the cluster at once (e.g. when trying to stop all nodes in a unit test), it can happen that the master leaves and another node also tries to leave at almost the same time. When the old master process has already terminated, but no new master has been elected yet, a call to ra:leave_and_delete_server on the local node can return {error, noproc}. IMHO it should be {error, cluster_change_not_permitted} as long as there is at least one node (e.g. the local node) left in the cluster.