
Improve retry mechanism for transactions on WriteConflict errors #27

Open · wants to merge 2 commits into base: master
Conversation

@nvinuesa (Member) commented Aug 29, 2023

This patch aims to fix a Juju bug (https://bugs.launchpad.net/juju/+bug/2031631) by increasing the number of transaction retries when the error returned by mongo is a WriteConflict error.

This implementation copies the mechanism used in the official Go MongoDB driver, which applies a 120-second timeout to all transactions and retries any transaction that fails with a transient error until the timeout expires.

Note: the Jira ticket for this is https://warthogs.atlassian.net/browse/JUJU-4504, and the original Juju PR is juju/juju#16159.
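
For illustration only (not code from this patch), here is a minimal, self-contained Go sketch of the retry-until-timeout idea described above. The names transactionTimeout, runWithRetry, and isWriteConflict are invented for the example, and the string-based WriteConflict check is a stand-in for inspecting the real mgo error (WriteConflict is server error code 112).

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// transactionTimeout mirrors the 120-second limit the Go MongoDB driver
// applies to a whole transaction.
const transactionTimeout = 120 * time.Second

// isWriteConflict is a deliberately simplified check; a real implementation
// would inspect the mgo error (server error code 112) rather than match on
// the message text.
func isWriteConflict(err error) bool {
	return err != nil && strings.Contains(err.Error(), "WriteConflict")
}

// runWithRetry re-runs attempt while mongo reports a WriteConflict, giving up
// once the overall timeout has elapsed.
func runWithRetry(attempt func() error) error {
	deadline := time.Now().Add(transactionTimeout)
	for {
		err := attempt()
		if err == nil || !isWriteConflict(err) {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out retrying after WriteConflict: %w", err)
		}
		// Transient conflict: back off briefly before retrying.
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	attempts := 0
	err := runWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("WriteConflict")
		}
		return nil
	})
	fmt.Printf("succeeded after %d attempts, err=%v\n", attempts, err)
}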

@jameinel (Member) left a comment

I wouldn't strictly block this, but I don't quite see how it addresses:
https://bugs.launchpad.net/juju/+bug/2031631

At least, I don't see where that connection is being made (maybe we just need more context from the bug investigation).
And this could cause us to retry a bad TXN for up to 40 minutes, which is pretty ridiculous.

Given that we have an outer retry, wouldn't that already handle this case?

sstxn/sstxn.go Outdated
@@ -106,6 +111,9 @@ func NewRunner(db *mgo.Database, logger Logger) *Runner {
// Any number of transactions may be run concurrently, with one
// runner or many.
func (r *Runner) Run(ops []txn.Op, id bson.ObjectId, info interface{}) (err error) {
timeout := time.NewTimer(TRANSACTION_TIMEOUT)
Member commented:

My main concern for this is that Juju is also going to be doing a retry (after rebuilding the TXN) on top of this layer. And IIRC our default ends up being 20 retries (Harry and I dug into it).
I don't want to end up with 20 * 120s of retries.

Member Author (@nvinuesa) replied:

Since we are returning a new type of error, Juju should not be retrying these (failed) transactions; if it is, then we should probably change that logic in the upper layers, right?
Also, since we are dealing with an error that is seen at the driver layer and not propagated (WriteConflict), the retries should happen at this level IMO.
But I'm happy to go back to what I had done in juju/juju#16159 and retry only at the Juju level, and only in EnterScope.

Member replied:

I think a context should be passed in here, and then the runner in the juju/txn package should start the timer there.
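
A rough sketch of that suggestion, assuming Run gained a context parameter and the juju/txn layer owned the overall deadline; runWithContext, the 100ms backoff, and the string-based WriteConflict check are invented for the example:

package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// runWithContext retries the transaction while mongo reports a WriteConflict,
// but stops as soon as the caller's context is cancelled or its deadline
// passes. The caller (standing in for the juju/txn runner) owns the timer.
func runWithContext(ctx context.Context, attempt func() error) error {
	for {
		err := attempt()
		if err == nil || !strings.Contains(err.Error(), "WriteConflict") {
			return err
		}
		select {
		case <-ctx.Done():
			// Out of time: surface both the conflict and the context error.
			return fmt.Errorf("giving up on WriteConflict retries: %v (%w)", err, ctx.Err())
		case <-time.After(100 * time.Millisecond):
			// Brief backoff, then retry.
		}
	}
}

func main() {
	// The upper layer decides the overall budget (e.g. 120 seconds), so
	// stacked retries above this layer cannot multiply the total wait.
	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
	defer cancel()

	attempts := 0
	err := runWithContext(ctx, func() error {
		attempts++
		if attempts < 3 {
			return errors.New("WriteConflict")
		}
		return nil
	})
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}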
