-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Epic: One-Time Keys can desync between client and server causing UTDs #2406
Comments
So overall no, it's not really just about dealing with server rollbacks. |
How would the server know which keys were in-flight though? The server knows that a given OTK has been claimed, but, particularly in the case of federation, it doesn't know any better than the receiving client whether that means there is a message on its way using that OTK, or if the client that claimed the OTK exploded in a ball of flames just after the claim, so the OTK will never actually be used. We could maybe have the server do some introspection of the to-device-message queue, but... eeeehrgh. Incidentally, the main problem with #2356 is that the server can hand out OTKs in any order it likes, so the client has no idea whatsoever which keys it should consider useless. Fixing that is easy, and doesn't require any fancy synchronisation system.
Well, in both cases I'd say that we wouldn't (and shouldn't) have the recovery mechanism if the only reason for it was to cover up concurrency bugs. The downsides (papering over implementation bugs) would outweigh the advantages in that situation. The recovery mechanism is inherently racy so you're still going to have failures; you just make the underlying bugs harder to track down and fix. So sure, we can implement a "resync" mechanism, but IMHO "it will help cover up concurrency bugs" is not a compelling reason to do so. One of the reasons I'm trying to get a handle on what we're actually trying to fix here is that I think it will determine the design of a fix:
I'm not convinced this hash mechanism will work. Given there can be a significant delay between an OTK being claimed and it being used in a message, it's not unusual for clients to have more keys stored than are still available on the server. If we could agree that the thing we're trying to fix here is client- or server- rollbacks, then I think possibly what we need is a sequence number: the server can say "the most recent OTK I knew about was number 563". The client can then tell if the server has been rolled back and forgotten some (and the client should re-upload them), or if the client has been rolled back and forgotten some (and needs to tell the server to forget the lost keys asap). I think there's a lot of fiddly detail to sort out, but the principle is something like that. |
Regarding server-rollbacks, @uhoreg had a rather simple and elegant solution: just drop the OTKs table on rollbacks. This will cause 0 OTK counts to be sent down |
Thinking more on this, we could also do a similar solution for clients - if the We are almost certainly going to need this (or something similar) because we have a growing list of devices who have desynced their OTKs. Even if we fix the cause of the desync, it doesn't help that there is this ticking time bomb unless we delete the OTKs. |
I think you've proposed this, or something very like it, before (matrix-org/matrix-rust-sdk#1415 (comment)). I remain hard against it, at least until we have reasonable confidence that we've fixed the underlying bugs.
Which ticking time bomb are you referring to? #2356? I don't think that deleting OTKs on a "key already exists" upload error would do anything to help with that. |
I've made matrix-org/matrix-spec-proposals#4162 which I think addresses this. |
I have opened #2406 to track the addition of a resync mechanism. We're going to use this as a tracker issue for OTK desync in general. |
Removing from crypto backlog because this is a grab-bag, so not directly actionable. Instead, individual tasks like #2406 will be in the backlog. |
To establish a secure channel between a pair of devices, all Matrix clients use a double-ratchet algorithm called Olm. As part of this establishment, a one-time key (OTK) must be "claimed" (via
/keys/claim
) by the sending device on the receiving device's server. OTKs are public/private keypairs where the private key is always kept on the device in a database. The public key gets uploaded to the user's homeserver, so even if the device is offline anyone can establish a secure channel to that device. OTKs are used once and then deleted from both the client and server databases. All E2EE devices will upload some amount of OTKs to the server, and try to keep that amount above a certain value as people constantly claim (and hence delete) OTKs.This issue describes many scenarios where the private keys stored on the device do not match the public keys stored on the server: in other words the set of OTKs are no longer synchronised between client and server. If this happens, there will be at least 1 unable to decrypt message. If clients do not automatically recover from wedged Olm sessions then this will be potentially more than 1 message.
Client Issues
Clients may forget which OTKs were uploaded to the server, either through idempotency failures, usage of multiple clients in multiple processes which do not talk to each other or other unknown reasons. If this happens, things desync as the client does not have the private key to the OTKs stored on the server.
Clients should only delete the private key when the OTK was successfully used. There have been bugs where the client has deleted the private key incorrectly on decryption failures.
Server Issues
The server database could become corrupted, through a hardware failure or via a bad homeserver update. Server admins will usually rollback to a "known good" version of the database when this happens. However, if anyone has claimed a OTK during this period between the bad homeserver update and the rollback, this will be forgotten by the server, causing a desync. The server will then potentially provide the same OTK to a different user in response to
/keys/claim
. This is #2155Protocol Issues
Databases are not infinite. This is especially true on mobile devices. As a result, clients will often keep around the most recent N private OTKs, where N is very large (Vodozemac sets this to
const MAX_ONE_TIME_KEYS: usize = 100 * PUBLIC_MAX_ONE_TIME_KEYS;
which is currently 100*50 = 5,000). If this number is reached, the client will delete OTKs. Unfortunately, there is no API to tell the server to also delete these OTKs, meaning things will desync. This is #2356Multiple users could hit
/keys/claim
at the same time the target user is uploading keys. This creates race conditions which can cause the same key to be used multiple times.What do we do about this?
Some of these issues are just gnarly bugs. Some of these issues are failures in the protocol, but some are fundamental problems that cannot be resolved easily. The thing all these bugs share in common though is a desync between client and server. We can do something to detect this and take corrective action.
For example, the server could send a hash of all the public keys (or some other O(1) checksum which does not scale with the number of OTKs on the server) down
/sync
, which would allow clients to check that the keys stored server-side are in fact the same keys they have stored. If there is a mismatch, there needs to be some API for the client to control which keys are kept and which keys are discarded. If such an API existed, this would address the vast majority of these issues in one go. There are tradeoffs however:The text was updated successfully, but these errors were encountered: