Backoffs on FailuresCache should be increased #2804

Closed
richvdh opened this issue Nov 1, 2023 · 3 comments

Comments

richvdh commented Nov 1, 2023

#1315 introduced a FailuresCache, which keeps a record of unreachable servers (later extended to non-functional devices) so that we don't keep making /keys/query and /keys/claim requests to the same server/device.

However, I assert that the backoffs should be much longer. At present, the backoff starts at 15 seconds, increasing to a maximum of 15 minutes after 7 failures. 15 seconds is neither here nor there; frankly it's quite unusual that anyone will try to send two messages within 15 seconds anyway. Basically, even when the backoff is working correctly (which it isn't, currently, due to #281 and #2793), the actual lived experience of sending events in rooms with broken servers is very painful.

IMHO we could start the backoff at 15 minutes, and cap it at 24 hours or more.

It would also be nice to add some element of randomness, so that we don't have all the broken servers reaching the end of their backoff at exactly the same time.
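For illustration, here is a minimal Rust sketch of that shape of backoff. This is not the actual FailuresCache code, and the exact numbers (15-minute base, 24-hour cap, roughly ±10% jitter) are assumptions for the example:

```rust
// A minimal sketch of the proposed curve -- not the actual FailuresCache code.
// Assumptions: start at 15 minutes, double per failure, cap at 24 hours, and
// spread retries out with roughly +/-10% jitter.
use std::time::{Duration, SystemTime, UNIX_EPOCH};

const BASE: Duration = Duration::from_secs(15 * 60); // 15 minutes
const CAP: Duration = Duration::from_secs(24 * 60 * 60); // 24 hours

fn backoff(failure_count: u32) -> Duration {
    // Exponential growth, saturating so large failure counts can't overflow.
    let capped = BASE
        .saturating_mul(1u32 << failure_count.min(10))
        .min(CAP);

    // Cheap std-only jitter source for the sketch; a real implementation
    // would use a proper RNG.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    let range = capped.as_secs() / 10; // +/-10%
    let offset = (nanos % (2 * range + 1)) as i64 - range as i64;

    if offset >= 0 {
        capped + Duration::from_secs(offset as u64)
    } else {
        capped - Duration::from_secs(offset.unsigned_abs())
    }
}

fn main() {
    for failures in 0..8 {
        println!("{failures} failures -> retry in ~{:?}", backoff(failures));
    }
}
```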

dkasak commented Nov 20, 2023

I think we're conflating two different causes of failure here, for which very different backoff values make sense.

In the event of OTK exhaustion, I agree -- it's unlikely that someone will re-upload additional OTKs during the 15 seconds of the first backoff, and a longer wait seems reasonable.

However, in the event of a transient network failure, 15 seconds makes much more sense (to recover as quickly as possible, and thereby lose as few messages as possible), and 15 minutes sounds much too long to be sending UTDs just because of a network hiccup.

frankly it's quite unusual that anyone will try to send two messages within 15 seconds anyway.

I'd argue it's quite common, depending on messaging style, given that some people tend to write many short messages rather than a few longer ones. And if a fix for #2864 is implemented, 15s vs 15m is going to mean the difference between recovering quickly and continuing to participate in the conversation, vs remaining frustrated and only being able to read it later.

So instead of making the backoff value drastically longer, is there a potential middle ground? Maybe one of:

a. Using a different backoff strategy depending on the failure mode (a rough sketch of this follows below).
b. Some intermediate backoff sequence which recovers quickly enough while also not being too aggressive in the case of broken servers? e.g. if we doubled the starting value from 15s to 30s, would the experience still be painful with broken servers, even after fixing #281 and #2793?
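
To make option (a) concrete, a per-failure-mode policy might look something like the sketch below. The types and thresholds here are purely illustrative, not the real matrix-rust-sdk API:

```rust
// Rough sketch of option (a); the types and thresholds are illustrative,
// not the actual matrix-rust-sdk API.
use std::time::Duration;

enum FailureMode {
    /// Timeouts, connection resets, 5xx responses: likely transient.
    TransientNetwork,
    /// The server answered, but the device had no one-time keys to claim.
    OtkExhaustion,
}

struct BackoffPolicy {
    base: Duration,
    cap: Duration,
}

fn policy_for(mode: &FailureMode) -> BackoffPolicy {
    match mode {
        // Recover quickly from a network blip: 15s doubling, capped at 10 minutes.
        FailureMode::TransientNetwork => BackoffPolicy {
            base: Duration::from_secs(15),
            cap: Duration::from_secs(10 * 60),
        },
        // Nobody is likely to upload fresh OTKs within seconds; back off harder.
        FailureMode::OtkExhaustion => BackoffPolicy {
            base: Duration::from_secs(15 * 60),
            cap: Duration::from_secs(24 * 60 * 60),
        },
    }
}

fn backoff(mode: &FailureMode, failures: u32) -> Duration {
    let policy = policy_for(mode);
    policy.base.saturating_mul(1u32 << failures.min(12)).min(policy.cap)
}

fn main() {
    for failures in 0..4 {
        println!(
            "after {failures} failures: network {:?}, OTK exhaustion {:?}",
            backoff(&FailureMode::TransientNetwork, failures),
            backoff(&FailureMode::OtkExhaustion, failures),
        );
    }
}
```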

kegsay commented Nov 20, 2023

100% agree on transient network failures - this is precisely what https://github.com/matrix-org/complement-crypto/blob/main/tests/federation_connectivity_test.go#L18 is testing. Long retries are painful in this scenario.

15s exponential backoff feels like the only reasonable compromise here (15>30>60>120s). It'll catch short-lived failures and not hammer too badly.

Ideally the remote server could poke the sending server to make it clear whether the failure was a network error, which would help us out a bit more, but in the absence of any protocol help, exponential backoff seems like the only sensible move to me.
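
As a quick sanity check on that compromise curve (plain doubling from 15s; no cap or jitter shown), the per-failure waits would be:

```rust
// Plain doubling from 15s, as suggested above. After 7 failures this
// reaches 1920s (32 minutes).
use std::time::Duration;

fn main() {
    for n in 0u32..8 {
        let wait = Duration::from_secs(15) * 2u32.pow(n);
        println!("failure #{n}: wait {wait:?}"); // 15s, 30s, 60s, 120s, ...
    }
}
```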

richvdh commented Nov 21, 2023

However, in the event of a transient network failure, 15 seconds makes much more sense (to recover as quickly as possible, and thereby lose as few messages as possible), and 15 minutes sounds much too long to be sending UTDs just because of a network hiccup.

It's certainly true that 15 minutes is too long to recover from a network blip. On the other hand, it's unresponsive servers that cause the most pain in message sends (we end up waiting tens of seconds for them to respond), and so arguably that is the thing we need to back off hardest from.

I'm going to park this and close this issue, because the performance for now seems "good enough", and a better solution to unresponsive servers is going to be something more complicated than just bumping the backoff. (Retrying in the background, maybe?)

richvdh closed this as completed Nov 21, 2023