
Improve retry mechanism for transactions on WriteConflict errors #27

Open · wants to merge 2 commits into base: master
Conversation

@nvinuesa (Member) commented Aug 29, 2023

This patch aims to fix a Juju bug (https://bugs.launchpad.net/juju/+bug/2031631) by increasing the number of transaction retries when the error returned by mongo is a WriteConflict error.

This implementation copies the mechanism used in the official Go MongoDB driver, which applies a 120-second timeout to all transactions and retries any transaction that fails with a transient error until the timeout expires.

Note: the Jira ticket for this is https://warthogs.atlassian.net/browse/JUJU-4504, and the original Juju PR is juju/juju#16159.
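
For illustration only (not code from this patch), here is a minimal, self-contained Go sketch of the retry-until-timeout idea described above. The names transactionTimeout, runWithRetry, and isWriteConflict are invented for the example, and the string-based WriteConflict check is a stand-in for inspecting the real mgo error (WriteConflict is server error code 112).

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// transactionTimeout mirrors the 120-second limit the Go MongoDB driver
// applies to a whole transaction.
const transactionTimeout = 120 * time.Second

// isWriteConflict is a deliberately simplified check; a real implementation
// would inspect the mgo error (server error code 112) rather than match on
// the message text.
func isWriteConflict(err error) bool {
	return err != nil && strings.Contains(err.Error(), "WriteConflict")
}

// runWithRetry re-runs attempt while mongo reports a WriteConflict, giving up
// once the overall timeout has elapsed.
func runWithRetry(attempt func() error) error {
	deadline := time.Now().Add(transactionTimeout)
	for {
		err := attempt()
		if err == nil || !isWriteConflict(err) {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out retrying after WriteConflict: %w", err)
		}
		// Transient conflict: back off briefly before retrying.
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	attempts := 0
	err := runWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("WriteConflict")
		}
		return nil
	})
	fmt.Printf("succeeded after %d attempts, err=%v\n", attempts, err)
}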

@jameinel (Member) left a comment

I wouldn't strictly block this, but I don't quite see how it addresses:
https://bugs.launchpad.net/juju/+bug/2031631

At least, I don't see where that connection is being made (maybe we just need more context from the bug investigation).
And this could cause us to retry a bad TXN for up to 40 minutes, which is pretty ridiculous.

Given that we have an outer retry, wouldn't that already handle this case?

sstxn/sstxn.go Outdated
@@ -106,6 +111,9 @@ func NewRunner(db *mgo.Database, logger Logger) *Runner {
// Any number of transactions may be run concurrently, with one
// runner or many.
func (r *Runner) Run(ops []txn.Op, id bson.ObjectId, info interface{}) (err error) {
timeout := time.NewTimer(TRANSACTION_TIMEOUT)
Member commented:

My main concern for this is that Juju is also going to be doing a retry (after rebuilding the TXN) on top of this layer. And IIRC our default ends up being 20 retries (Harry and I dug into it).
I don't want to end up with 20 * 120s of retries.

Member Author (@nvinuesa) replied:

Since we are returning a new type of error, Juju should not be retrying these (failed) transactions; if it is, then we should probably change that logic in the upper layers, right?
Also, since we are dealing with an error that is seen at the driver layer and not propagated (WriteConflict), the retries should happen at this level IMO.
But I'm happy to go back to what I had done in juju/juju#16159 and retry only at the Juju level, and only in EnterScope.

Member replied:

I think a context should be passed in here, and then the runner in the juju/txn package should start the timer there.
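
A rough sketch of that suggestion, assuming Run gained a context parameter and the juju/txn layer owned the overall deadline; runWithContext, the 100ms backoff, and the string-based WriteConflict check are invented for the example:

package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// runWithContext retries the transaction while mongo reports a WriteConflict,
// but stops as soon as the caller's context is cancelled or its deadline
// passes. The caller (standing in for the juju/txn runner) owns the timer.
func runWithContext(ctx context.Context, attempt func() error) error {
	for {
		err := attempt()
		if err == nil || !strings.Contains(err.Error(), "WriteConflict") {
			return err
		}
		select {
		case <-ctx.Done():
			// Out of time: surface both the conflict and the context error.
			return fmt.Errorf("giving up on WriteConflict retries: %v (%w)", err, ctx.Err())
		case <-time.After(100 * time.Millisecond):
			// Brief backoff, then retry.
		}
	}
}

func main() {
	// The upper layer decides the overall budget (e.g. 120 seconds), so
	// stacked retries above this layer cannot multiply the total wait.
	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
	defer cancel()

	attempts := 0
	err := runWithContext(ctx, func() error {
		attempts++
		if attempts < 3 {
			return errors.New("WriteConflict")
		}
		return nil
	})
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}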
