When leader fails local get, we should step down #30
Comments
I just now remembered why we don't step down currently: because of the backend pinging logic. For puts it's necessary to step down so the user can resolve partial writes, but for gets it's not strictly required. The question for gets is whether stepping down helps responsiveness or not.

Here's the general issue. A get timing out likely means the backend is slow to respond -- perhaps this is a K/V backend and the vnode is slow due to Bitcask merging or some such. Since we always try local gets first, a slow leader will always cause requests to time out, even if other peers aren't slow. So, stepping down seems to make sense. If we time out, let's step down and perhaps a different leader will be elected (no guarantee, since election is random; but in practice, this will hold).

However, we already have the backend ping mechanism to handle this. The backend ping logic pings the backend, and causes the leader to step down if the backend does not respond in time. This means that a momentarily unresponsive backend that recovers won't trigger a step down, but a very sad backend will.

Isn't this a better option than always stepping down on the first timed-out operation? Stepping down leaves the ensemble briefly unavailable, and it also leads to a new epoch. A new epoch is expensive because it forces every operation to rewrite the key on first reference. Avoiding the leader change unless the backend is really sad seems preferable.

Thoughts?
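To make the trade-off concrete, here is a minimal toy sketch in Python. It is not riak_ensemble code: the `Leader` class, the `max_missed_pings` threshold, and the idea of counting consecutive missed pings are all assumptions made for illustration. It compares stepping down on the first failure with stepping down only after the backend stays unresponsive.

```python
from dataclasses import dataclass


@dataclass
class Leader:
    """Toy leader that tracks consecutive unanswered backend pings."""
    max_missed_pings: int  # hypothetical threshold, not a real riak_ensemble setting
    missed: int = 0
    epoch: int = 0

    def on_ping(self, responded: bool) -> bool:
        """Record one ping result; return True if the leader steps down."""
        if responded:
            self.missed = 0  # a recovered backend clears the failure count
            return False
        self.missed += 1
        if self.missed >= self.max_missed_pings:
            self.missed = 0
            self.epoch += 1  # each step down starts a new (expensive) epoch
            return True
        return False


def count_step_downs(results, max_missed_pings):
    """Replay a sequence of ping results and count how often the leader steps down."""
    leader = Leader(max_missed_pings=max_missed_pings)
    return sum(leader.on_ping(r) for r in results)


# A backend that hiccups once, recovers, then stays unresponsive.
results = [True, False, True, True, False, False, False]
eager = count_step_downs(results, max_missed_pings=1)     # step down on first failure
tolerant = count_step_downs(results, max_missed_pings=3)  # step down only when "very sad"
print(eager, tolerant)
```

Under these assumptions the eager policy changes leader on every failed check, including the momentary hiccup, while the tolerant policy changes leader only once the backend has stayed unresponsive.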
Ahh, I forgot about backend ping. That is the better solution. Let's just close this issue and take note to remove the commented-out lines.
Oh cool. I was going to get started with this, but that sounds good.
As mentioned here and here -- when the leader fails a local get it should step down.
Uncommenting the sending of `request_failed` messages should be enough, but double-check things and test. Is there any reason why we commented out those lines in the first place?

/cc basho/riak#536