-
Notifications
You must be signed in to change notification settings - Fork 246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lost OTK, leading to "OneTime key already exists" error and later UTDs #1415
Comments
Root cause on iOS: Steps to reproduce:
|
Lastly we also need a recovery method, given that OTK upload failure will wedge the machine and never mark OTKs as sent. Possible solution
|
It would, but it doesn't fix the underlying problem here, which is that a client has uploaded OTKs and forgotten about them, which is pretty much a guaranteed recipe for UISIs. |
Agreed about needing to fix the root cause of forgetting about keys. But once the root cause is fixed, using the public key as ID would be a way of at least moving forward. I think the other alternatives are to tell the user that the session is FUBAR and they need to re-log-in, or to have the client claim all of its OTKs and then re-upload new ones. |
If the root cause is fixed, then it doesn't matter what we do?
That would be preferable, IMHO. I'm just concerned that Element Android and Element iOS are unlikely to be the last applications that make the mistake of running two copies of the rust SDK at once - or indeed that a similar bug might arise in any of a million other ways. And when it does, the fact we've implemented a workaround for it is going to make it that much harder to track down. IMHO we've been too ready to implement workarounds for crypto bugs in the past (in particular: spamming out key requests left right and centre), and doing so has made it harder to actually fix those bugs. |
@richvdh following the discussion on element-hq/element-web#26231 (comment) , hm, this seems to be a hard problem, even completely ignoring the implementation aspect. (Keeping crypto state between a server and a split-brain client consistent – yay.) Conventional wisdom would be that anything that wants to update the next key id, whichever shape it takes, to read-modify-write, atomically. That seems to be a hard to achieve constraint in separate processes, which, as far as I understand it, element-web considers it to easier to aim for a single crypto worker instead of hoping to make matrix-rust-sdk-crypto MT-safe. That's of course a bit sad for me as user who'd really like to have multiple tabs of Element open. As pointed out to me in the issue above, synchronization through (browser) local storage is going to be prohibitive, performance-wise. Now I'm left a bit dumbfounded: if we can't have a central lock on creation of OTKs, would the root cause ever be fixable? |
@marcusmueller as others have already mentioned, the long-term solution in element-web is to move crypto operations into a web worker. In general, it is up to the platform code (web, android and iOS) to ensure that only one copy of the rust-sdk is used on a datastore at a time. I'm not actually sure what work remains to be done here. AIUI the issue has been solved for now in all platforms by ensuring we only have one rust-sdk process active at a time. @BillCarsonFr @uhoreg wdyt? |
@richvdh thanks for the reply! I'd personally say "solved" is a bit of a mischaracterization: it's a rust-sdk shortcoming that's been worked around, to the effect that end users experience significantly regressed usability. Anyways, if the sensible way, and one that is actually attainable in a finite window of time, is going for a webworker architecture: agree, then there's nothing to be done here. |
The issue as reported no longer occurs. If anyone feels that there is more work to be done, please open a new issue. |
https://github.com/matrix-org/element-android-rageshakes/issues/59348 Reported today on android 1.6.8-dev |
New occurence reported on EIX element-hq/element-x-ios#2238 |
I got this today and dug out some more information. This bug is more pernicious that originally thought because it blocks all keys being uploaded rather than just the affected key. In my case, I saw in EAX logs (v0.4.0):
This happens constantly now, so I cannot upload more OTKs, meaning I am slowly running out of OTKs which will cause UTDs eventually. It also means I can't upload my fallback key. |
Well, eventually someone will claim |
There is another rageshake showing this behaviour: https://github.com/element-hq/element-x-android-rageshakes/issues/1134 But in this one we can see that there was two OlmMachine created, with the same deviceId but different keys:
later on we can see the error:
The second machine probably trying to upload keys with same ids (ids are just simple increments starting 0) Update 2024/01/10: Now tracked separately at #2998 |
It would be nice if |
Hard disagree, for the reason already explained at #1415 (comment). TL;DR: if we're not correctly persisting DB transactions, it's better we know about it than paper over it and wait for some even more subtle error to happen. |
This comment was marked as resolved.
This comment was marked as resolved.
So The error message makes it clear that the key ID was reused with a different key:
Vodozemac seems to store In order for a key ID to be reused then afaict the following has to happen:
I would be surprised if this flow can happen, because I'd expect the pickled struct to be persisted prior to the upload request. If so, this couldn't get out of sync because upon reload, the next key ID would be correct along with the unpublished public keys. One thing that stood out to me looking at the code is https://github.com/matrix-org/vodozemac/blob/main/src/olm/account/one_time_keys.rs#L56C1-L58C6 -
I don't think that can cause the dupe key ID issues in this bug though. |
I don't think this has been fixed. |
Need to check recent rageshakes to see how often it is still occuring |
We are also experiencing this in several reports that decrypting messages fails. We have reports for iOS (multi instance) as well as MacOS (single instance) of this problem: where we see an endless loop of trying to submit the same key that is rejected by the server eventually leading to the user not being able to decrypt messages. I can provide further logs for info upon request (but not in public). (Slightly modified for privacy) we are seeing this repeatedly (like hundredth of times)
and then it switches to a different key at some (random) point, but still has the same error:
And sometimes we are seeing (but only once seen so far):
|
As I was asked to provide any information I have about this: while we had seen "unable to decrypt" and also this sending message occasionally before, it seems to have a huge uptick recently with several of our users reporting this issue (I am not sure whether it is an iOS only issue, but it seems to happen there predominantly, so it can be NSE related). So I looked at the sdk upgrades we did and from the timeline I think this problem was introduced some time since 1d6c01f. On a quick review (especially of the olm machine) of the changes since I couldn't really see anything directly myself, but maybe this is a useful information for someone else with more in-depth knowledge about that piece of code... |
I still see this in recent rageshakes. It's hard to know if this is due to an older version, as every OTK upload request will fail until the duplicate key ID has been /claim'd by someone. There is definitely a multiprocess bug which can cause this, see this test which can fail sometimes due to insufficient cross process protection when uploading OTKs. It's currently skipped because it doesn't always reproduce the bug, hence it is flakey. However, I've also seen this on Element X Android, so that cannot be the only scenario where dupe OTKs can be uploaded. |
In EXA, I see this:
followed hours later by:
confirming that OTK already exists errors are causing failed Olm sessions and hence UTDs. This particular bug report was sent on Element X/0.4.12, 2024051500; Sdk e709cca). At the time the keys were uploaded:
The OTKs were uploaded as the app started, possibly hinting at the cause? |
One possibly cause might be that EXA was as well creating multiple client instances: element-hq/element-x-android#3050. |
And another case where EX creates multiple |
This might be why I've struggled to reproduce this in complement-crypto if they are app-level issues.. |
Can we close this then, if it's due to applications mis-using the api (now fixed?) rather than a bug in matrix-rust-sdk? |
I would say we're not quite sure it is the root cause, and what would increase confidence a lot is resolving this issue. But we can close this one in the meanwhile, and reopen when/if it happens later (with the same doubt that the linked issue could be the panacea fix for it). |
Collating stats on failed
Meanwhile, for failed decryptions due to
Failed |
Could be one of many failure modes: the apps until recently would create several |
Another case of this, at least on Android will be due to the following:
The The constructor for the matrix-rust-sdk/crates/matrix-sdk-ui/src/notification_client.rs Lines 109 to 122 in 5827bb7
Which in turn calls
Which creates a new
Later on we set the matrix-rust-sdk/crates/matrix-sdk-base/src/client.rs Lines 266 to 267 in 5827bb7
So now the two This is likely the biggest culprit for this issue and once we fix it we might even close it. |
matrix-org/complement-crypto#133 is a failing regression test for the scenario @poljar is describing. This fails with a dupe OTK upload error, as per this issue's title. |
From time to time we are getting this error on various clients.
Element R iOS shows it as a modal:
Most other clients (including Element Web R and Element X) just log the error without showing it to the user.
This happens (among other reasons) when a client creates a one-time keypair, uploads the public part to the server, and then forgets that it has done so (eg due to unsafe concurrent access to the crypto store). The client is then unable to upload further OTKs until the forgotten one is claimed by another device; and once it is claimed by another device, the resultant messages will be undecryptable.
The text was updated successfully, but these errors were encountered: