-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve retry mechanism for transactions when WriteConflict error #27
base: master
Are you sure you want to change the base?
Conversation
This patch aims to fix a juju bug (https://bugs.launchpad.net/juju/+bug/2031631) by increasing the number of transaction retries in case the returned mongo error is a WriteConflict error. This implementation copies the mechanism done in the official golang mongodb driver, which has a 120 seconds timeout for all transactions and retries transactions that fail with a transient error until timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't strictly block this, but I don't quite see how it addresses:
https://bugs.launchpad.net/juju/+bug/2031631
At least, I don't see where that connection is being made (maybe we just need more context around bug investigation).
And this potentially causes us to retry a bad TXN up to 40 minutes, which is pretty ridiculous.
Given that we have an outer retry, wouldn't that already have handle this case?
sstxn/sstxn.go
Outdated
@@ -106,6 +111,9 @@ func NewRunner(db *mgo.Database, logger Logger) *Runner { | |||
// Any number of transactions may be run concurrently, with one | |||
// runner or many. | |||
func (r *Runner) Run(ops []txn.Op, id bson.ObjectId, info interface{}) (err error) { | |||
timeout := time.NewTimer(TRANSACTION_TIMEOUT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern for this is that Juju is also going to be doing a retry (after rebuilding the TXN) on top of this layer. And IIRC our default ends up being 20 retries (Harry and I dug into it).
I don't want to end up with 20 * 120s of retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are returning a new type of error, juju should not be retrying this (failed) transactions, if it is then we should probably change that logic on the upper layers right?
Also, since we are dealing with an error that the driver layer is seeing and that we don't propagate (WriteConflict), the retries should be at this level IMO.
But I'm happy to go back to what I had done in juju/juju#16159 and retry only at juju level and only at EnterScope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a context should be passed into here, then in the juju/txn package, the runner should start the timer there.
This patch aims to fix a juju bug (https://bugs.launchpad.net/juju/+bug/2031631) by increasing the number of transaction retries in case the returned mongo error is a WriteConflict error.
This implementation copies the mechanism done in the official golang mongodb driver, which has a 120 seconds timeout for all transactions and retries transactions that fail with a transient error until timeout.
Note: This is the jira juju ticket https://warthogs.atlassian.net/browse/JUJU-4504, and this is the original juju PR juju/juju#16159.