data loss calculation #1

Open
pkonotopov opened this issue Feb 28, 2023 · 4 comments

Comments

@pkonotopov

pkonotopov commented Feb 28, 2023

Hi everyone!

I tested the utility on a Patroni cluster and found that, in my opinion, the data loss calculation is not quite correct. With synchronous_commit set to on or remote_write, we get a message saying that there is data loss. If we insert a 10s delay before the data loss calculation, the calculation is correct.

This happens because, right after failover, we immediately read the data from the new primary without taking into account that the promotion has not finished and not all WAL has been replayed yet.
As a result, we get a message about data loss, while with a delay there is no such message. I don't think this is a bug. But we need to think about how the application should reach the new primary: without a delay we get an inconsistent read, while with a delay of RTO plus some delta we are fine.

From the utility's perspective, only the remote_apply option guarantees data consistency on the synchronous standby, which is not exactly accurate :)
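
As an alternative to a fixed delay, a client could poll the new primary until it reports that recovery has ended, and only then validate the data. The sketch below is a minimal illustration, not part of ptor: `wait_until_promoted` and its `in_recovery` argument are hypothetical names, and `in_recovery` is assumed to return the result of PostgreSQL's `SELECT pg_is_in_recovery()` on the node being checked.

```python
import time
from typing import Callable


def wait_until_promoted(
    in_recovery: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the node has left recovery or the timeout expires.

    `in_recovery` is a hypothetical callable; it is assumed to return the
    value of `SELECT pg_is_in_recovery()` for the node being checked.
    Returns True if the node left recovery in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if not in_recovery():
            return True  # promotion finished, safe to validate
        if clock() >= deadline:
            return False  # still in recovery after the timeout
        sleep(interval)
```

With a PostgreSQL driver, `in_recovery` would run the query against the new primary and return the first column of the single result row. The `sleep` and `clock` parameters exist only so the loop can be exercised without a real database or real waiting.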

@dineshtessell
Collaborator

Thank you, @pkonotopov, for giving it a try.

ptor doesn't know how to wait until recovery completes (because it is a kind of app simulator).

It has to be the underlying HA system (Patroni) or, as you said, the remote_apply parameter that takes care of applying the WAL on the replica side. This tool just waits for a new connection after the promotion and then calculates the RPO, RTO, and SLA.

In the case you describe, there will be data loss, and Patroni should wait until all transactions are applied on the recovering side before doing the promotion. I hope this is possible in Patroni.

@pkonotopov
Author

pkonotopov commented Mar 1, 2023

BTW, thank you for a great tool! It will be very helpful for me in my work.

Regarding Patroni settings: there are no settings to control incoming traffic to the new primary after failover. Maybe I didn't look hard enough. That is, access to the new primary is opened before the node has fully become the primary: the HTTP endpoint starts responding with 200 before the promotion ends, and traffic switching when the roles change is almost instantaneous.

I found a workaround for this HA solution using an HAProxy setting: when failover happens, we do not open the new primary to all incoming traffic immediately, but gradually (see the slowstart parameter: https://cbonte.github.io/haproxy-dconv/1.8/configuration.html#5.2-slowstart).

But I would still suggest adding a --validation-delay option so that you can make sure the data is OK, with an explanation in the documentation of why this happens and why the delay is needed when calculating the result.
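
To illustrate the idea, here is a minimal sketch of how a validation delay could fit into a data loss check. It is not ptor's implementation: `count_lost_rows`, `acked_ids`, and `fetch_ids` are hypothetical names. The assumption is that the tool remembers the IDs of rows whose commits were acknowledged before failover and then compares them with what the new primary returns.

```python
import time
from typing import Callable, Iterable, Set


def count_lost_rows(
    acked_ids: Iterable[int],
    fetch_ids: Callable[[], Iterable[int]],
    validation_delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return how many acknowledged rows are missing on the new primary.

    `acked_ids` are IDs whose commits were acknowledged before failover.
    `fetch_ids` is a hypothetical callable that reads the IDs currently
    visible on the new primary. `validation_delay` is the number of
    seconds to wait before reading, giving the promotion time to finish.
    """
    if validation_delay > 0:
        sleep(validation_delay)
    present: Set[int] = set(fetch_ids())
    return len(set(acked_ids) - present)
```

With `validation_delay=0` the read happens immediately and can report rows as lost that would become visible a moment later. With a delay that covers the end of the promotion, the same check reports only rows that are really missing.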

@dineshkumar02
Owner

dineshkumar02 commented Mar 1, 2023

Glad it is helpful to you.

I understand the idea of --validation-delay and will try to submit a new patch as soon as I can.

Thanks for the workaround you found in HAProxy.

@dineshtessell
Collaborator

@pkonotopov,

I added the --validation-delay feature in the latest release.

Thank you again for validating this tool.
