Backoffs on FailuresCache should be increased (#2804)
Comments
I think we're conflating two different causes of failure here, for which very different backoff values make sense. In the event of OTK exhaustion, I agree -- it's unlikely someone will reupload additional OTKs during the 15 seconds of the first backoff, and a longer wait seems reasonable. However, in the event of a transient network failure, 15 seconds makes much more sense (to recover as quickly as possible and thereby lose as few messages as possible), and 15 minutes sounds much too long to be sending UTDs just because of a network hiccup.
I'd argue it's quite common, depending on messaging style, given that some people tend to write a greater number of shorter messages rather than a smaller number of longer ones. And if a fix for #2864 is implemented, 15s vs 15m is going to mean the difference between being able to quickly recover and continue participating in the conversation vs remaining frustrated and only being able to read it later. So instead of making the backoff value drastically longer, is there a potential middle ground? Maybe one of: a. Using a different backoff strategy depending on the failure mode (see the sketch below).
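For illustration, a minimal Rust sketch of option (a). Nothing here reflects the actual FailuresCache API; the FailureMode enum and backoff_bounds function are purely hypothetical:

```rust
use std::time::Duration;

/// Hypothetical classification of why a /keys/query or /keys/claim
/// request failed; the real FailuresCache does not currently record this.
enum FailureMode {
    /// Connection refused, timeout, DNS error, etc.
    TransientNetwork,
    /// The target device had no one-time keys left to claim.
    OtkExhaustion,
}

/// Choose a (starting backoff, maximum backoff) pair per failure mode:
/// recover quickly from network blips, but wait much longer when the
/// problem is OTK exhaustion, which is unlikely to fix itself quickly.
fn backoff_bounds(mode: &FailureMode) -> (Duration, Duration) {
    match mode {
        FailureMode::TransientNetwork => {
            (Duration::from_secs(15), Duration::from_secs(15 * 60))
        }
        FailureMode::OtkExhaustion => {
            (Duration::from_secs(15 * 60), Duration::from_secs(24 * 60 * 60))
        }
    }
}

fn main() {
    let (start, cap) = backoff_bounds(&FailureMode::TransientNetwork);
    println!("network failure: start={start:?}, cap={cap:?}");
}
```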
100% agree with transient network failures - this is precisely what https://github.com/matrix-org/complement-crypto/blob/main/tests/federation_connectivity_test.go#L18 is testing. Long retries are painful in this scenario. 15s exponential backoff feels like the only reasonable compromise here (15>30>60>120s). It'll catch short-lived failures and not hammer too badly. Ideally the remote server could poke the sending server to make it clear whether it was a network error, which could help us out a bit more, but in the absence of any protocol help, exp backoff seems like the only sensible move to me.
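As a concrete sketch of that suggestion (illustrative only, not the current matrix-sdk-crypto code), an exponential backoff that starts at 15s, doubles on each consecutive failure, and caps at 120s:

```rust
use std::time::Duration;

/// Exponential backoff for transient federation failures: 15s, 30s, 60s,
/// then capped at 120s, so short-lived outages recover quickly without
/// hammering the remote server.
fn retry_delay(consecutive_failures: u32) -> Duration {
    let base = Duration::from_secs(15);
    let cap = Duration::from_secs(120);
    // Clamp the exponent so a long run of failures cannot overflow.
    let factor = 1u32 << consecutive_failures.min(4);
    std::cmp::min(base * factor, cap)
}

fn main() {
    for n in 0..5 {
        println!("failure {n}: retry in {:?}", retry_delay(n));
    }
}
```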
It's certainly true that 15 minutes is too long to recover from a network blip. On the other hand, it's unresponsive servers that cause the most pain in message sends (we end up waiting tens of seconds for them to respond), and so arguably that is the thing we need to back off hardest from. I'm going to park this for now, and close this issue, because it seems like the performance for now is "good enough", and a better solution to unresponsive servers is going to be something more complicated than just bumping the backoff. (Retrying in the background, maybe?)
#1315 introduced a FailuresCache which keeps a record of unreachable servers (later extended to nonfunctional devices) so that we don't keep making /keys/query and /keys/claim requests on the same server/device.

However, I assert that the backoffs should be much higher. At present, the backoff starts at 15 seconds, increasing to a maximum of 15 minutes after 7 failures. 15 seconds is neither here nor there; frankly, it's quite unusual that anyone will try to send two messages within 15 seconds anyway. Basically, even when the backoff is working correctly (which it isn't, currently, due to #281 and #2793), the actual lived experience of sending events in rooms with broken servers is very painful.
IMHO we could start the backoff at 15 minutes, and cap it at 24 hours or more.
It would also be nice to add some element of randomness, so that we don't have all the broken servers reaching the end of their backoff at exactly the same time.
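A rough sketch of what that proposal could look like (purely illustrative; it assumes nothing about the actual FailuresCache internals, and pulls in the rand crate for the jitter):

```rust
use rand::Rng;
use std::time::Duration;

/// Proposed backoff: start at 15 minutes, double per consecutive failure,
/// cap at 24 hours, and apply +/-10% random jitter so that all the broken
/// servers we are tracking don't come off their backoff at the same moment.
fn proposed_backoff(consecutive_failures: u32) -> Duration {
    let base_secs: u64 = 15 * 60; // 15 minutes
    let cap_secs: u64 = 24 * 60 * 60; // 24 hours
    // Clamp the exponent: 15 minutes * 2^7 already exceeds 24 hours.
    let delay_secs = (base_secs << consecutive_failures.min(7)).min(cap_secs);

    // Jitter: scale uniformly within [90%, 110%] of the nominal delay.
    let jitter = rand::thread_rng().gen_range(0.9..=1.1);
    Duration::from_secs_f64(delay_secs as f64 * jitter)
}

fn main() {
    for n in 0..8 {
        println!("failure {n}: back off for {:?}", proposed_backoff(n));
    }
}
```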