replicator: fixed infinite loop on sync replication #223

Open. Wants to merge 1 commit into base: release-2.3.

Conversation

@catap catap commented Sep 25, 2024

Before this fix, the replicator adds the same lookup into callbacks over and over until it runs out of memory and crashes.

This regression was introduced in 447e086 with the initial implementation of the replicator.


cmouse commented Sep 25, 2024

Thank you for the merge request, we'll take it into consideration.


catap commented Sep 25, 2024

@cmouse thanks for the very quick response!

I understand that the replicator was quite unstable.

Frankly speaking, my thread where I have been debugging this issue for about 10 months is a good explanation of why you removed it from 2.4.

But I have a cluster and the will to make it stable... may I ask you to consider putting the replicator back in 2.4? Or at least restoring some API that would allow me to build it as a dedicated plugin.

As you can see, I'm quite motivated to make it stable enough to work :)


cmouse commented Sep 25, 2024

Yeah. The only problem is that there is no replicator in 2.4. But let's see if we can have this in 2.3.


catap commented Sep 25, 2024

@cmouse because you removed it in 4c04e4c :)

Maybe recover it once it finally works? For some setups, this feature can be a deal breaker.


catap commented Sep 25, 2024

@cmouse this code contains a condition:

		if (lookups[i].wait_for_next_push) {
			/* another sync request came while user was being
			   replicated */
			i_assert(user->priority == REPLICATION_PRIORITY_SYNC);
			lookups[i].wait_for_next_push = FALSE;
		}

If I understand the replicator code base correctly, when a timeout happens it may switch the priority to something else.

So this assertion will trigger an abort. The right code is probably something like:

		if (lookups[i].wait_for_next_push) {
			/* another sync request came while user was being
			   replicated */
			if (user->priority != REPLICATION_PRIORITY_SYNC)
				continue;
			lookups[i].wait_for_next_push = FALSE;
		}


catap commented Sep 25, 2024

BTW, I deployed a version with the changes from #223 (comment) and I will let you know how it behaves.


catap commented Sep 25, 2024

Well, it survives a network issue without any crash:

Sep 25 13:42:20 mx1 dovecot: lmtp([email protected])<74439><nFEoIhL382bHIgEAeeMcAQ>: Warning: replication([email protected]): Sync failure: Timeout in 10 secs


catap commented Sep 25, 2024

@cmouse after some additional thought, I have simplified these changes to fixing a typo: a missing count--. If we delete an element, we need to iterate over one element less of the array.
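
To make that concrete, here is a minimal standalone sketch of removing elements from an array while iterating over it. This is not the Dovecot code: the items array, the even-number test and the memmove-based delete are all invented for illustration; only the count-- pattern is the point.

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		int items[] = {1, 2, 3, 4, 5};
		unsigned int count = 5;
		unsigned int i = 0;

		while (i < count) {
			if (items[i] % 2 == 0) {
				/* "delete" items[i] by shifting the tail one slot left */
				memmove(&items[i], &items[i + 1],
					(count - i - 1) * sizeof(items[0]));
				/* the crucial decrement: one element fewer to visit;
				   forgetting it makes the loop read stale slots */
				count--;
			} else {
				i++;
			}
		}

		for (i = 0; i < count; i++)
			printf("%d ", items[i]);	/* prints: 1 3 5 */
		printf("\n");
		return 0;
	}

Without the decrement, i eventually walks past the logical end of the data and keeps processing whatever is left in the stale slots, which is consistent with the "same lookup added over and over" symptom described above.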


catap commented Sep 27, 2024

@cmouse I continue to track an issue where the order of UIDs inside virtual folders gets messed up.

Probably the cause is that virtual folders should be excluded from synchronisation, because otherwise synchronisation fails with the hardcoded error: Can't create virtual mailboxes (see: virtual_mailbox_create).

So it seems that I don't understand how this should work. With -x virtual. synchronisation works, but it messes up UIDs inside the virtual folders, and without it... it simply fails.

Have I missed something?


catap commented Sep 29, 2024

After reading https://dovecot.org/mailman3/archives/list/[email protected]/message/5LYQGL2ZOMDDUFUOJU2TUBLLQUWEPZSR/ and the messages near it... I understand that the replicator won't be recovered.

@cmouse anyway, may I kindly ask you to merge this small typo fix? With the hope that it will be included in the next release of the 2.3 branch.
