Repair after replication conflict/split brain #78
Totktonada started this conversation in Ideas
Background
There are situations in which a node in a replicaset ends up with a conflicting history of operations. I know of the following cases:
Asynchronous transactions can conflict; that's normal. The leader election quorum can be decreased for the sake of service availability, at least to continue writing to some spaces. That is a valid situation too.
The solutions we offer now:
before_replace trigger (see footnote 1) and, of course, write tests for this logic.
The first way has its own complexities and is not always possible. The second way also has disadvantages:
Proposal
I propose to implement a tool for recovering from such situations. The flow is very similar to resolving conflicts in git. The tool will look at the two histories of operations, detect the conflict point and dump the difference into a human-readable file. After this, a human can resolve all the conflicts: choose one variant of the data or the other, or merge them somehow. The resulting file can be applied on the healthy part of the replicaset. The history of operations of the broken node can be rewound to the conflict point.
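To make the "detect the conflict point and dump the difference" step more concrete, here is a minimal sketch in Python. It assumes a heavily simplified model that is not Tarantool's actual log format: each history is an in-memory list of hypothetical `Op` records, and the conflict point is taken to be the length of the common prefix of the two lists. The names `Op`, `find_conflict_point`, `format_op` and `dump_difference` are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Op:
    """One logged operation (hypothetical, simplified record)."""
    origin: int       # id of the node that produced the operation
    seq: int          # sequence number within that origin
    kind: str         # "insert", "replace" or "delete"
    key: Any
    value: Any = None


def find_conflict_point(healthy: List[Op], broken: List[Op]) -> int:
    """Return how many leading operations the two histories share."""
    shared = 0
    for left, right in zip(healthy, broken):
        if left != right:
            break
        shared += 1
    return shared


def format_op(op: Op) -> str:
    """Render one operation as a single editable text line."""
    return f"{op.origin}:{op.seq} {op.kind} key={op.key!r} value={op.value!r}"


def dump_difference(healthy: List[Op], broken: List[Op]) -> str:
    """Render the diverging tails in a git-like, human-editable form."""
    point = find_conflict_point(healthy, broken)
    lines = [f"# conflict point: after {point} shared operations"]
    lines.append("<<<<<<< healthy")
    lines.extend(format_op(op) for op in healthy[point:])
    lines.append("=======")
    lines.extend(format_op(op) for op in broken[point:])
    lines.append(">>>>>>> broken")
    return "\n".join(lines)


if __name__ == "__main__":
    shared = [Op(1, 1, "insert", 1, "a"), Op(1, 2, "insert", 2, "b")]
    healthy = shared + [Op(1, 3, "replace", 2, "c")]
    broken = shared + [Op(2, 1, "replace", 2, "d")]
    print(dump_difference(healthy, broken))
```

The git-style markers are only one possible layout for the file; a real tool would also have to decide how to pair up operations that touch the same key on both sides.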
I guess it is also possible to interact only with the broken node: rewind it to the conflict point, apply all new changes from the healthy part of the replicaset, and apply the conflict resolution changes afterwards. After this, it should be possible to connect this node to the rest of the replicaset.
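The broken-node-only variant can be sketched the same way. This is an illustration under assumptions rather than a design: histories are plain Python lists of hypothetical `(kind, key, value)` tuples, the conflict point is already known, and the human-written resolution is itself a list of operations. `replay` and `repair_history` are invented names.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical operation record: (kind, key, value).
Op = Tuple[str, Any, Any]


def replay(history: List[Op]) -> Dict[Any, Any]:
    """Build the key/value state produced by applying a history in order."""
    state: Dict[Any, Any] = {}
    for kind, key, value in history:
        if kind == "delete":
            state.pop(key, None)
        else:  # "insert" and "replace" both store the value
            state[key] = value
    return state


def repair_history(
    broken: List[Op],
    healthy: List[Op],
    conflict_point: int,
    resolution: List[Op],
) -> List[Op]:
    """Rewind the broken history, catch up from the healthy one, then resolve."""
    rewound = broken[:conflict_point]        # drop the conflicting tail
    catch_up = healthy[conflict_point:]      # new changes from the healthy part
    return rewound + catch_up + resolution   # resolution is applied last


if __name__ == "__main__":
    shared = [("insert", 1, "a"), ("insert", 2, "b")]
    healthy = shared + [("replace", 2, "c")]
    broken = shared + [("replace", 2, "d"), ("insert", 3, "x")]
    # A human decided to merge the two values and keep the extra insert.
    resolution = [("replace", 2, "cd"), ("insert", 3, "x")]
    repaired = repair_history(broken, healthy, 2, resolution)
    print(replay(repaired))
```

In this model the repaired node's history extends the healthy one, so the resolution operations are the only ones the rest of the replicaset has not seen yet.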
Materials
[1] https://www.tarantool.io/en/doc/latest/book/replication/repl_problem_solving/
[2] https://stackoverflow.com/questions/56734280/conflict-resolution-in-tarantool-how-to-fix-replication-in-master-master-mode-i
[3] tarantool/tarantool@af7d703
Footnotes
1. Using the on_commit trigger inside the on_replace trigger on the _space system space, which should be set in the on_schema_init trigger.