data loss calculation #1

pkonotopov · 2023-02-28T12:54:32Z

Hi everyone!

I tested the utility on a Patroni cluster and found that the data loss calculation is not quite correct, in my opinion. With synchronous_commit set to on or to remote_write, we get a message saying that we have data loss. If we insert a 10s delay before the data loss calculation, the calculation is correct.

This happens because, right after the failover, we immediately read data from the new primary without taking into account that the promote has not finished and not all WAL has been replayed yet.
As a result, we get a message about data loss, while with a delay there is no such message. I don't think this is a bug. But we need to think about how the application should be switched to the new primary: without a delay we get an inconsistent read, while with a delay of RTO plus some delta we are fine.
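One way to avoid reading too early is to poll the node until it reports that it has left recovery. The sketch below is only an illustration under assumptions: `wait_until_promoted` and its parameters are hypothetical names, not part of the utility or of Patroni, and the `is_in_recovery` callable is assumed to wrap a query such as `SELECT pg_is_in_recovery()` issued through whatever driver you use.

```python
import time
from typing import Callable


def wait_until_promoted(
    is_in_recovery: Callable[[], bool],
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the node reports it has left recovery, or the timeout expires.

    `is_in_recovery` is a hypothetical callable; in a real setup it would
    run `SELECT pg_is_in_recovery()` against the new primary.
    Returns True if the node left recovery in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if not is_in_recovery():
            return True  # promote finished, reads are now from a primary
        if time.monotonic() >= deadline:
            return False  # still in recovery when the timeout expired
        sleep(poll_interval_s)
```

The check is injected as a callable so the waiting logic can be exercised without a running cluster.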

From the utility's perspective, only the remote_apply option guarantees data consistency on the synchronous standby, which is not exactly accurate :)

dineshtessell · 2023-02-28T13:27:05Z

Thank you @pkonotopov for giving it a try.

ptor doesn't know how to wait until recovery completes (because it's a kind of app simulator).

It has to be the underlying HA system (Patroni) or, as you said, the remote_apply parameter that takes care of applying the WAL on the replica side. This tool just waits for a new connection after the promote and then calculates the RPO, RTO and SLA.

In the case you describe, there will be data loss, and Patroni should wait until it has applied all the transactions on the recovering side before doing the promotion. I hope this is possible in Patroni.

pkonotopov · 2023-03-01T08:31:28Z

BTW, thank you for a great tool! It will be very helpful for me in my work.

Regarding Patroni settings: there are no settings to control incoming traffic to the new primary after failover (or maybe I didn't look hard enough). That is, access to the new primary is opened before the node has fully become primary: the HTTP endpoint starts responding with 200 before the promote ends, and traffic switches almost instantly when the roles change.

I found a workaround for this HA solution using an HAProxy setting: when a failover happens, we do not open the new primary to all incoming traffic immediately, but gradually (see the slowstart parameter - https://cbonte.github.io/haproxy-dconv/1.8/configuration.html#5.2-slowstart).

But I would still suggest adding a --validation-delay option so that you can make sure the data is OK, along with an explanation in the documentation of why this happens and why the delay is needed when calculating the result.
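To make the suggestion concrete, here is a rough sketch of what a validation delay could look like. It is not how the tool implements it: `missing_after_delay`, `acknowledged_ids` and `fetch_ids_from_primary` are hypothetical names, and the data loss check is assumed to be a simple comparison between the rows the application saw acknowledged and the rows visible on the new primary.

```python
import time
from typing import Callable, Iterable, Set


def missing_after_delay(
    acknowledged_ids: Iterable[int],
    fetch_ids_from_primary: Callable[[], Iterable[int]],
    validation_delay_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[int]:
    """Return acknowledged row ids that are not visible on the new primary.

    Waits `validation_delay_s` seconds first, so the promote can finish
    and the remaining WAL can be replayed before the rows are read back.
    """
    if validation_delay_s > 0:
        sleep(validation_delay_s)
    present = set(fetch_ids_from_primary())
    # Anything acknowledged to the client but absent now counts as lost.
    return set(acknowledged_ids) - present
```

With a delay of zero this reproduces the immediate read described above; with a delay of RTO plus some delta, rows that were still being replayed have time to become visible.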

dineshkumar02 · 2023-03-01T09:54:58Z

Glad it is helpful to you.

I understand the idea of --validation-delay and will try to submit a new patch as soon as I can.

Thanks for the HAProxy workaround you found.

dineshtessell · 2023-03-11T11:42:11Z

@pkonotopov,

Added the --validation-delay feature in the latest release.

Thank you again for validating this tool.
