data loss calculation #1
Thank you @pkonotopov for giving it a try.
It has to be handled by the underlying HA system (Patroni). As you said, in your case there will be data loss, and Patroni should wait until it applies all the transactions on the recovery side before doing the promotion. I hope this is possible in Patroni.
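As a rough illustration of that check (not something taken from Patroni's own code): on the standby, PostgreSQL reports how much WAL has been received versus replayed, and a promotion loses nothing only once the two positions match. A minimal SQL sketch:

```sql
-- Run on the standby that is about to be promoted. Promotion loses no
-- transactions only when everything received from the old primary has
-- also been replayed, i.e. the two LSNs are equal.
SELECT pg_last_wal_receive_lsn() AS received_lsn,
       pg_last_wal_replay_lsn()  AS replayed_lsn,
       pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() AS fully_replayed;
```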
BTW, thank you for a great tool! It will be very helpful for me in my work. Regarding Patroni settings: there is no setting to throttle incoming traffic to the new primary after failover (maybe I didn't look hard enough). That is, access to the new primary is opened before the node has fully become primary: the HTTP endpoint starts responding 200 before the promote ends, so traffic switches over almost instantly when the roles change. I found a workaround for this HA setup in the HAProxy settings: when a failover happens, we do not open the new primary to all incoming traffic immediately, but gradually (see …). But I would still suggest making the …
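For readers who want to reproduce the gradual opening, here is a minimal sketch of what such an HAProxy backend could look like; the poster's actual configuration is not shown in the thread, and the host names, ports, and timings below are made up. HAProxy's `slowstart` option gradually raises a server's weight and connection limit after its health check starts passing, and the health check here assumes Patroni's REST API, which answers 200 only on the current leader:

```
# Hypothetical sketch; addresses, ports, and timings are illustrative.
backend pg_primary
    option httpchk GET /primary          # Patroni REST API: 200 only on the leader
    http-check expect status 200
    # slowstart ramps a server's weight and dynamic connection limit up
    # over 30s after it turns healthy, so a freshly promoted primary is
    # not hit with the full load at once.
    default-server check port 8008 inter 2s slowstart 30s minconn 10 maxconn 100
    server pg1 10.0.0.1:5432
    server pg2 10.0.0.2:5432
    server pg3 10.0.0.3:5432
```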
Glad it is helpful to you. Understood the idea of opening traffic to the new primary gradually. Thanks for the workaround you found in the HAProxy configuration.
Added this feature. Thank you again for validating this tool.
Hi everyone!
I tested the utility on a Patroni cluster and found that the data loss calculation is not quite correct, in my opinion. With the `synchronous_commit: on` and `remote_write` settings we get a message that there is data loss; if we insert a 10 s delay before the data loss calculation, the result is correct. This is because when a failover happens, we immediately read the data from the new primary, without taking into account that the promote has not finished and not all WAL has been replayed yet.

As a result, we get a message about data loss, while with a delay there is no such message. I don't think this is a bug. But we need to think about how to point the application at the new primary: without a delay we get an inconsistent read, while with some RTO+delta delay we are fine.
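A fixed sleep is fragile; an alternative sketch is to poll the new primary until the promote has actually completed, since `pg_is_in_recovery()` only flips to false once recovery, including replay of pending WAL, has finished. A minimal example of the query a client could retry before running the data loss calculation:

```sql
-- Retry this on the failover target until it returns false: once the
-- node has left recovery, promotion is complete and all WAL the node
-- had received has been replayed, so reads taken after that point
-- reflect the fully promoted primary.
SELECT pg_is_in_recovery();
```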
From the utility's perspective, only the `remote_apply` option guarantees us data consistency on the synchronous standby, which is not exactly accurate :)
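For completeness, a sketch of how that stricter guarantee could be configured through Patroni; the keys below are standard Patroni/PostgreSQL settings, but whether `remote_apply` fits a given cluster's latency budget is a separate question, since it makes every commit wait for replay on the standby:

```yaml
# Sketch of the relevant part of a Patroni (bootstrap/DCS) configuration.
bootstrap:
  dcs:
    synchronous_mode: true          # Patroni manages synchronous_standby_names
    postgresql:
      parameters:
        # COMMIT returns only after the synchronous standby has *replayed*
        # the WAL, not merely received (remote_write) or flushed (on) it.
        synchronous_commit: remote_apply
```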