Backoffs on FailuresCache should be increased #2804

Closed
richvdh opened this issue Nov 1, 2023 · 3 comments

Comments

richvdh commented Nov 1, 2023

#1315 introduced a FailuresCache, which keeps a record of unreachable servers (later extended to non-functional devices) so that we don't keep making /keys/query and /keys/claim requests to the same server/device.

However, I assert that the backoffs should be much longer. At present, the backoff starts at 15 seconds, increasing to a maximum of 15 minutes after 7 failures. 15 seconds is neither here nor there; frankly it's quite unusual that anyone will try to send two messages within 15 seconds anyway. Basically, even when the backoff is working correctly (which it isn't, currently, due to #281 and #2793), the actual lived experience of sending events in rooms with broken servers is very painful.

IMHO we could start the backoff at 15 minutes, and cap it at 24 hours or more.

It would also be nice to add some element of randomness, so that we don't have all the broken servers reaching the end of their backoff at exactly the same time.
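For illustration, here is a minimal Rust sketch of that shape of backoff. This is not the actual FailuresCache code, and the exact numbers (15-minute base, 24-hour cap, roughly ±10% jitter) are assumptions for the example:

```rust
// A minimal sketch of the proposed curve -- not the actual FailuresCache code.
// Assumptions: start at 15 minutes, double per failure, cap at 24 hours, and
// spread retries out with roughly +/-10% jitter.
use std::time::{Duration, SystemTime, UNIX_EPOCH};

const BASE: Duration = Duration::from_secs(15 * 60); // 15 minutes
const CAP: Duration = Duration::from_secs(24 * 60 * 60); // 24 hours

fn backoff(failure_count: u32) -> Duration {
    // Exponential growth, saturating so large failure counts can't overflow.
    let capped = BASE
        .saturating_mul(1u32 << failure_count.min(10))
        .min(CAP);

    // Cheap std-only jitter source for the sketch; a real implementation
    // would use a proper RNG.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    let range = capped.as_secs() / 10; // +/-10%
    let offset = (nanos % (2 * range + 1)) as i64 - range as i64;

    if offset >= 0 {
        capped + Duration::from_secs(offset as u64)
    } else {
        capped - Duration::from_secs(offset.unsigned_abs())
    }
}

fn main() {
    for failures in 0..8 {
        println!("{failures} failures -> retry in ~{:?}", backoff(failures));
    }
}
```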

dkasak commented Nov 20, 2023

I think we're conflating two different causes of failure here, for which very different backoff values make sense.

In the event of OTK exhaustion, I agree -- it's unlikely that someone will re-upload additional OTKs during the 15 seconds of the first backoff, and a longer wait seems reasonable.

However, in the event of a transient network failure, 15 seconds makes much more sense (to recover as quickly as possible, and thereby lose as few messages as possible), and 15 minutes sounds much too long to be sending UTDs just because of a network hiccup.

frankly it's quite unusual that anyone will try to send two messages within 15 seconds anyway.

I'd argue it's quite common, depending on messaging style, given that some people tend to write many short messages rather than a few longer ones. And if a fix for #2864 is implemented, 15s vs 15m is going to mean the difference between recovering quickly and continuing to participate in the conversation, vs remaining frustrated and only being able to read it later.

So instead of making the backoff value drastically longer, is there a potential middle ground? Maybe one of:

a. Using a different backoff strategy depending on the failure mode (a rough sketch of this follows below).
b. Some intermediate backoff sequence which recovers quickly enough while also not being too aggressive in the case of broken servers? e.g. if we doubled the starting value from 15s to 30s, would the experience still be painful with broken servers, even after fixing #281 and #2793?
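
To make option (a) concrete, a per-failure-mode policy might look something like the sketch below. The types and thresholds here are purely illustrative, not the real matrix-rust-sdk API:

```rust
// Rough sketch of option (a); the types and thresholds are illustrative,
// not the actual matrix-rust-sdk API.
use std::time::Duration;

enum FailureMode {
    /// Timeouts, connection resets, 5xx responses: likely transient.
    TransientNetwork,
    /// The server answered, but the device had no one-time keys to claim.
    OtkExhaustion,
}

struct BackoffPolicy {
    base: Duration,
    cap: Duration,
}

fn policy_for(mode: &FailureMode) -> BackoffPolicy {
    match mode {
        // Recover quickly from a network blip: 15s doubling, capped at 10 minutes.
        FailureMode::TransientNetwork => BackoffPolicy {
            base: Duration::from_secs(15),
            cap: Duration::from_secs(10 * 60),
        },
        // Nobody is likely to upload fresh OTKs within seconds; back off harder.
        FailureMode::OtkExhaustion => BackoffPolicy {
            base: Duration::from_secs(15 * 60),
            cap: Duration::from_secs(24 * 60 * 60),
        },
    }
}

fn backoff(mode: &FailureMode, failures: u32) -> Duration {
    let policy = policy_for(mode);
    policy.base.saturating_mul(1u32 << failures.min(12)).min(policy.cap)
}

fn main() {
    for failures in 0..4 {
        println!(
            "after {failures} failures: network {:?}, OTK exhaustion {:?}",
            backoff(&FailureMode::TransientNetwork, failures),
            backoff(&FailureMode::OtkExhaustion, failures),
        );
    }
}
```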

kegsay commented Nov 20, 2023

100% agree on transient network failures - this is precisely what https://github.com/matrix-org/complement-crypto/blob/main/tests/federation_connectivity_test.go#L18 is testing. Long retries are painful in this scenario.

15s exponential backoff feels like the only reasonable compromise here (15>30>60>120s). It'll catch short-lived failures and not hammer too badly.

Ideally the remote server could poke the sending server to make it clear whether the failure was a network error, which would help us out a bit more, but in the absence of any protocol help, exponential backoff seems like the only sensible move to me.
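
As a quick sanity check on that compromise curve (plain doubling from 15s; no cap or jitter shown), the per-failure waits would be:

```rust
// Plain doubling from 15s, as suggested above. After 7 failures this
// reaches 1920s (32 minutes).
use std::time::Duration;

fn main() {
    for n in 0u32..8 {
        let wait = Duration::from_secs(15) * 2u32.pow(n);
        println!("failure #{n}: wait {wait:?}"); // 15s, 30s, 60s, 120s, ...
    }
}
```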

richvdh commented Nov 21, 2023

However, in the event of a transient network failure, 15 seconds makes much more sense (to recover as quickly as possible, and thereby lose as few messages as possible), and 15 minutes sounds much too long to be sending UTDs just because of a network hiccup.

It's certainly true that 15 minutes is too long to recover from a network blip. On the other hand, it's unresponsive servers that cause the most pain in message sends (we end up waiting tens of seconds for them to respond), and so arguably that is the thing we need to back off hardest from.

I'm going to park this and close this issue, because the performance for now seems "good enough", and a better solution to unresponsive servers is going to be something more complicated than just bumping the backoff. (Retrying in the background, maybe?)

richvdh closed this as completed Nov 21, 2023