Lost OTK, leading to "OneTime key already exists" error and later UTDs #1415

Closed
Tracked by #245
BillCarsonFr opened this issue Jan 27, 2023 · 35 comments · Fixed by #3000 or #3992

Comments

@BillCarsonFr
Member

BillCarsonFr commented Jan 27, 2023

From time to time we are getting this error on various clients.

Element R iOS shows it as a modal:

[screenshot of the error modal]

Most other clients (including Element Web R and Element X) just log the error without showing it to the user.

This happens (among other reasons) when a client creates a one-time keypair, uploads the public part to the server, and then forgets that it has done so (eg due to unsafe concurrent access to the crypto store). The client is then unable to upload further OTKs until the forgotten one is claimed by another device; and once it is claimed by another device, the resultant messages will be undecryptable.
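
To make the failure mode concrete, here is a toy sketch (invented types, not the real matrix-rust-sdk or homeserver code) of the state divergence: once the server holds a key ID with different key material than the client would re-upload, every subsequent /keys/upload is rejected.

// Toy model of the failure: the server keeps key id 0, the client forgets it
// uploaded it, and every later upload of a "new" key 0 is rejected.
use std::collections::BTreeMap;

#[derive(Default)]
struct Account {
    next_key_id: u64,
    private_keys: BTreeMap<u64, &'static str>,
}

impl Account {
    fn generate_otk(&mut self, private: &'static str) -> u64 {
        let id = self.next_key_id;
        self.private_keys.insert(id, private);
        self.next_key_id += 1;
        id
    }
}

#[derive(Default)]
struct Server {
    uploaded: BTreeMap<u64, &'static str>,
}

impl Server {
    fn keys_upload(&mut self, id: u64, public: &'static str) -> Result<(), String> {
        match self.uploaded.get(&id) {
            Some(old) if *old != public => Err(format!(
                "One time key signed_curve25519:{id} already exists"
            )),
            _ => {
                self.uploaded.insert(id, public);
                Ok(())
            }
        }
    }
}

fn main() {
    let mut server = Server::default();
    let mut account = Account::default();

    // Generate and upload key 0, then lose the updated account state
    // (e.g. a stale cached copy overwrites the crypto store).
    let id = account.generate_otk("private-A");
    server.keys_upload(id, "public-A").unwrap();
    account = Account::default(); // next_key_id rolls back to 0

    // The next key re-uses id 0 with different key material: the upload is
    // rejected, and stays rejected until someone /claim's the server's copy.
    let id = account.generate_otk("private-B");
    assert!(server.keys_upload(id, "public-B").is_err());
}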

@Anderas
Contributor

Anderas commented Jan 31, 2023

Root cause on iOS: receive_sync_changes is not multiprocess safe, and one process may be writing new values of the account to the store while another retains a stale cache that may eventually overwrite the store with outdated values. When this happens, next_key_id for one-time keys drops back to some lower value, which then triggers duplicate OTK IDs.

Steps to reproduce:

  1. create two OlmMachines, foreground and background
  2. simulate a few sync loops on the foreground by
    • calling receive_sync_changes with key_counts lower than 50
    • calling mark_request_as_sent for the keys upload, setting the count back to 50
    • => as a result the foreground keeps creating new OTKs and incrementing next_key_id
  3. call a single receive_sync_changes on the background
    => result: the background has just overwritten the account in the store with a next_key_id that is much lower than the foreground's
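
For illustration, a minimal model of that race, using invented types rather than the real OlmMachine / receive_sync_changes API (the actual reproduction needs two OlmMachines sharing one crypto store):

// Toy model of the foreground/background race: both processes cache the
// account in memory, and the background's stale cache overwrites the store.
use std::{cell::RefCell, rc::Rc};

#[derive(Clone, Copy)]
struct Account {
    next_key_id: u64,
}

type Store = Rc<RefCell<Account>>;

struct Machine {
    cached: Account, // in-memory copy, loaded once at startup
    store: Store,
}

impl Machine {
    fn new(store: Store) -> Self {
        let cached = *store.borrow();
        Self { cached, store }
    }

    // Roughly "generate OTKs and mark the upload as sent": bumps next_key_id
    // and persists the whole cached account back to the store.
    fn generate_and_upload(&mut self, n: u64) {
        self.cached.next_key_id += n;
        *self.store.borrow_mut() = self.cached;
    }

    // Roughly "receive_sync_changes": also writes the (possibly stale)
    // cached account back to the store.
    fn receive_sync_changes(&mut self) {
        *self.store.borrow_mut() = self.cached;
    }
}

fn main() {
    let store: Store = Rc::new(RefCell::new(Account { next_key_id: 0 }));
    let mut foreground = Machine::new(store.clone());
    let mut background = Machine::new(store.clone()); // caches next_key_id = 0

    // The foreground keeps topping the server back up to 50 OTKs.
    for _ in 0..3 {
        foreground.generate_and_upload(50);
    }
    assert_eq!(store.borrow().next_key_id, 150);

    // A single sync on the background process writes its stale cache back.
    background.receive_sync_changes();
    assert_eq!(store.borrow().next_key_id, 0);

    // Any machine that later reloads the account from the store starts again
    // at id 0 and reuses ids the server already holds, which is exactly the
    // "One time key already exists" rejection on the next /keys/upload.
    let restarted = Machine::new(store.clone());
    assert_eq!(restarted.cached.next_key_id, 0);
}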

@Anderas
Contributor

Anderas commented Jan 31, 2023

Lastly, we also need a recovery method, given that an OTK upload failure will wedge the machine and never mark the OTKs as sent. Possible solution:

  1. Pretend that the upload succeeded, setting a safe default value for the uploaded one-time key count on the Account
  2. Claim all of your own keys on the server, since you don't know which one-time key has lost its private part
  3. Re-upload a new fresh set of one-time keys

@richvdh
Member

richvdh commented Aug 22, 2023

@uhoreg wrote

Should we use the public key as the key ID rather than a counter-based key ID? That would ensure that key IDs don't collide.

It would, but it doesn't fix the underlying problem here, which is that a client has uploaded OTKs and forgotten about them, which is pretty much a guaranteed recipe for UISIs.

@uhoreg
Member

uhoreg commented Aug 22, 2023

Agreed about needing to fix the root cause of forgetting about keys. But once the root cause is fixed, using the public key as ID would be a way of at least moving forward. I think the other alternatives are to tell the user that the session is FUBAR and they need to re-log-in, or to have the client claim all of its OTKs and then re-upload new ones.

@richvdh
Member

richvdh commented Aug 22, 2023

If the root cause is fixed, then it doesn't matter what we do?

I think the other alternatives are to tell the user that the session is FUBAR and they need to re-log-in,

That would be preferable, IMHO. I'm just concerned that Element Android and Element iOS are unlikely to be the last applications that make the mistake of running two copies of the rust SDK at once - or indeed that a similar bug might arise in any of a million other ways. And when it does, the fact we've implemented a workaround for it is going to make it that much harder to track down.

IMHO we've been too ready to implement workarounds for crypto bugs in the past (in particular: spamming out key requests left right and centre), and doing so has made it harder to actually fix those bugs.

@marcusmueller

@richvdh following the discussion on element-hq/element-web#26231 (comment): hm, this seems to be a hard problem, even completely ignoring the implementation aspect. (Keeping crypto state between a server and a split-brain client consistent – yay.)

Conventional wisdom would be that anything that wants to update the next key ID, whatever shape it takes, has to read-modify-write atomically. That seems to be a hard constraint to achieve across separate processes, which, as far as I understand it, is why element-web considers it easier to aim for a single crypto worker than to hope to make matrix-rust-sdk-crypto MT-safe. That's of course a bit sad for me as a user who'd really like to have multiple tabs of Element open.

As pointed out to me in the issue above, synchronization through (browser) local storage is going to be prohibitive, performance-wise. Now I'm left a bit dumbfounded: if we can't have a central lock on creation of OTKs, would the root cause ever be fixable?
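
For comparison, outside the browser the textbook answer is to serialise the read-modify-write on a lock the OS arbitrates. A minimal sketch, assuming the fs2 crate for advisory file locking; this is not what matrix-rust-sdk does, and nothing equivalent exists across browser tabs:

// Cross-process atomic read-modify-write of a counter via an exclusive
// file lock (assumes the `fs2` crate). Purely illustrative.
use std::{
    fs::OpenOptions,
    io::{Read, Seek, SeekFrom, Write},
};

use fs2::FileExt;

fn bump_next_key_id(path: &str) -> std::io::Result<u64> {
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;

    // Exclusive advisory lock: other processes block here until we unlock.
    file.lock_exclusive()?;

    let mut buf = String::new();
    file.read_to_string(&mut buf)?;
    let current: u64 = buf.trim().parse().unwrap_or(0);

    // Modify and write back while still holding the lock.
    file.set_len(0)?;
    file.seek(SeekFrom::Start(0))?;
    write!(file, "{}", current + 1)?;

    file.unlock()?;
    Ok(current)
}

fn main() -> std::io::Result<()> {
    let id = bump_next_key_id("/tmp/next_key_id")?;
    println!("allocated one-time key id {id}");
    Ok(())
}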

@richvdh
Member

richvdh commented Oct 2, 2023

@marcusmueller as others have already mentioned, the long-term solution in element-web is to move crypto operations into a web worker. In general, it is up to the platform code (web, android and iOS) to ensure that only one copy of the rust-sdk is used on a datastore at a time.

I'm not actually sure what work remains to be done here. AIUI the issue has been solved for now in all platforms by ensuring we only have one rust-sdk process active at a time. @BillCarsonFr @uhoreg wdyt?

@marcusmueller

@richvdh thanks for the reply! I'd personally say "solved" is a bit of a mischaracterization: it's a rust-sdk shortcoming that has been worked around, to the effect that end users experience significantly regressed usability. Anyway, if the sensible way forward, and one that is actually attainable in a finite window of time, is a web-worker architecture: agreed, then there's nothing to be done here.

@richvdh
Member

richvdh commented Nov 21, 2023

The issue as reported no longer occurs. If anyone feels that there is more work to be done, please open a new issue.

@richvdh richvdh closed this as completed Nov 21, 2023
@BillCarsonFr
Member Author

https://github.com/matrix-org/element-android-rageshakes/issues/59348

Reported today on android 1.6.8-dev

@BillCarsonFr BillCarsonFr reopened this Nov 22, 2023
@BillCarsonFr
Member Author

New occurrence reported on EIX element-hq/element-x-ios#2238

@kegsay
Member

kegsay commented Jan 8, 2024

I got this today and dug out some more information. This bug is more pernicious than originally thought because it blocks all keys from being uploaded rather than just the affected key.

In my case, I saw in EAX logs (v0.4.0):

  • No unused fallback keys were found on the server, generated a new fallback key. removed_fallback_key=None
  • A /keys/upload request being sent out.
  • HTTP 400 One time key signed_curve25519:AAAAAAAAAEQ already exists. Old key:...

This happens constantly now, so I cannot upload more OTKs, meaning I am slowly running out of OTKs which will cause UTDs eventually. It also means I can't upload my fallback key.

@richvdh
Member

richvdh commented Jan 8, 2024

I am slowly running out of OTKs which will cause UTDs eventually

Well, eventually someone will claim AAAAAAAAAEQ, which will mean you can start uploading OTKs again. The real problem is that they will claim the AAAAAAAAAEQ that is on the server, rather than the AAAAAAAAAEQ that your client knows about... causing UTDs.

@BillCarsonFr
Member Author

BillCarsonFr commented Jan 8, 2024

There is another rageshake showing this behaviour: https://github.com/element-hq/element-x-android-rageshakes/issues/1134

But in this one we can see that there were two OlmMachines created, with the same deviceId but different keys:

2023-11-30T18:18:11.343073Z DEBUG matrix_sdk_crypto::machine: Created a new Olm account | crates/matrix-sdk-crypto/src/machine.rs:288 | spans: login{method="identifier and password"} > with_store{user_id="@xuanjun:matrix.org" device_id="KSBNNMTVDE" ed25519_key=yzQQLODIhkZ5v9Tku9fcvodHddLF7+os8SkyFDL1E/8 curve25519_key=y1/IrbHdpFDhdcF5fHI0QVUubNmp5EcGU9WMNL4/pAU}
2023-11-30T18:18:15.945739Z DEBUG matrix_sdk_crypto::machine: Created a new Olm account | crates/matrix-sdk-crypto/src/machine.rs:288 | spans: restore_session > restore_session > with_store{user_id="@xuanjun:matrix.org" device_id="KSBNNMTVDE" ed25519_key=VvjbqDo1DCfzvv/H8nHiiJ2P00uH7DscBWEaw8NZA+8 curve25519_key=/ecKBMJ1HSdIXfaTFduL/Y8pnO+oIxRjHMH83QUXlE8}

later on we can see the error:

2023-11-30T18:20:33.322647Z DEBUG matrix_sdk::http_client: Error while sending request: Api(Server(ClientApi(Error { status_code: 400, authenticate: None, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAA0 already exists. Old key: {\"key\":\"NomqZsEfsQgkKXRYhTlqJtVXrPrtyUmwyAp1rkNz5j4\",\"signatures\":{\"@xuanjun:matrix.org\":{\"ed25519:KSBNNMTVDE\":\"xU2pC19yLgr3Cd1oz/gbadpqH3LjlqJShBROCZ5y3UFGLNReHbI1pYfTUppmwdmNMyvmkMk5585BnYvPNArODQ\"}}}; new key: {'key': 'M0oBPMrurD54JggEqCDH1uR4vH4UtoEzohnUWkwl5GA', 'signatures': {'@xuanjun:matrix.org': {'ed25519:KSBNNMTVDE': 'nUWT9sKlU8KuwsfoRBjTJhbVqvCMPRFfQrtdDm21eIhwuFnDDT9OlL9AaK6uNzUtRjUmgduddoM3cE+SKzeEDw'}}}" } })))

The second machine is probably trying to upload keys with the same IDs (IDs are just simple increments starting at 0).


Update 2024/01/10: Now tracked separately at #2998

@kegsay
Member

kegsay commented Jan 8, 2024

It would be nice if /keys/upload didn't reject all the keys when just one is duplicated fwiw.

@richvdh
Member

richvdh commented Jan 8, 2024

It would be nice if /keys/upload didn't reject all the keys when just one is duplicated fwiw.

Hard disagree, for the reason already explained at #1415 (comment). TL;DR: if we're not correctly persisting DB transactions, it's better we know about it than paper over it and wait for some even more subtle error to happen.

@kegsay
Member

kegsay commented Jan 8, 2024

So AAAAAAAAAEQ equates to key 68 looking at https://github.com/matrix-org/vodozemac/blob/main/src/types/mod.rs#L39
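
For reference, the key ID there appears to be just the unpadded base64 of the counter as a big-endian u64 (per the vodozemac source linked above), so it can be decoded directly. A quick check, assuming the base64 crate's 0.21+ Engine API:

// Decode a one-time key ID back to its counter value.
use base64::{engine::general_purpose::STANDARD_NO_PAD, Engine as _};

fn key_id_to_counter(id: &str) -> u64 {
    let bytes = STANDARD_NO_PAD.decode(id).expect("valid base64");
    u64::from_be_bytes(bytes.try_into().expect("8 bytes"))
}

fn main() {
    assert_eq!(key_id_to_counter("AAAAAAAAAEQ"), 68); // the key in my logs
    assert_eq!(key_id_to_counter("AAAAAAAAAA0"), 13); // the key in the earlier EXA rageshake
    println!("ok");
}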

The error message makes it clear that the key ID was reused with a different key:

message: "One time key signed_curve25519:AAAAAAAAAEQ already exists. 
Old key: {
    \"key\":\"UOPTwvyF3/Kcsh3ToGBCCa/PXIO/42U3VN0fvpfceGI\",
    \"signatures\":{
        \"@kegan:matrix.org\":{
            \"ed25519:GJDDXTSNAD\":\"LG0n4WLaNoEI6zX6pm5mV2OzjxFJEaJEBzQxRskvZG9BU7rFs8rShfYEhuEeYEpMtH2QZT/tMKZehrmib3DLBA\"
        }
    }
};
new key: {
    'key': 'OvKerkVcp3+p8qeaxjxM6N1L+ZyeqzvGOWTBXQeGUxs', 
    'signatures': {
        '@kegan:matrix.org': {
            'ed25519:GJDDXTSNAD': 'YHAneQMnV5vA3BZNMnetBGRP8t3wWH8CCKOdTSz8HCqZ1golF/xj4zXSQhEIeYjuu9s9Az8WCuIabD92pxaEAw'
        }
    }
}

Vodozemac seems to store a next_key_id: u64 along with unpublished_public_keys: BTreeMap<KeyId, Curve25519PublicKey> in memory. Both fields get pickled into the same struct https://github.com/matrix-org/vodozemac/blob/main/src/olm/account/one_time_keys.rs#L28

In order for a key ID to be reused, afaict the following has to happen:

  • Generate N OTKs. next_key_id = N and len(unpublished_public_keys) == N.
  • Upload the OTKs. This gets a 200 OK.
  • Before the next key ID is pickled, the app is terminated.
  • On restart, the pickled struct will not have next_key_id = N so will generate new keys and cause duplicates.

I would be surprised if this flow can happen, because I'd expect the pickled struct to be persisted prior to the upload request. If so, this couldn't get out of sync because upon reload, the next key ID would be correct along with the unpublished public keys.

One thing that stood out to me looking at the code is https://github.com/matrix-org/vodozemac/blob/main/src/olm/account/one_time_keys.rs#L56C1-L58C6 - mark_as_published does not take a list of key IDs, instead marking everything in unpublished_public_keys as published. Consider:

  • Create N public keys
  • Start uploading them
  • Generate M public keys before the upload completes
  • The upload completes and mark_as_published is called, causing all N+M keys to be marked as published even though only N were actually uploaded. The other M exist on the client but not on the server.

I don't think that can cause the dupe key ID issues in this bug though.
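
A toy illustration of that window (simplified types, not the real vodozemac structs): keys generated while an upload is in flight get swept up by the blanket mark_as_published even though the server never received them.

// Toy model of the mark_as_published window.
use std::collections::BTreeMap;

#[derive(Default)]
struct OneTimeKeys {
    next_key_id: u64,
    unpublished: BTreeMap<u64, &'static str>,
    published: BTreeMap<u64, &'static str>,
}

impl OneTimeKeys {
    fn generate(&mut self, n: u64) {
        for _ in 0..n {
            self.unpublished.insert(self.next_key_id, "public-key");
            self.next_key_id += 1;
        }
    }

    // Mirrors the blanket behaviour: everything currently unpublished is
    // marked published, regardless of what the upload request contained.
    fn mark_as_published(&mut self) {
        self.published.append(&mut self.unpublished);
    }
}

fn main() {
    let mut keys = OneTimeKeys::default();

    keys.generate(50);
    let upload_request: Vec<u64> = keys.unpublished.keys().copied().collect();

    // More keys are generated before the /keys/upload response arrives.
    keys.generate(10);

    // The response arrives and everything gets marked as published.
    keys.mark_as_published();

    // 10 keys are now "published" locally but were never in the request,
    // so the server can never hand them out.
    assert_eq!(keys.published.len() - upload_request.len(), 10);
}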

@richvdh richvdh changed the title from "ER error OneTime key already exists" to "OneTime key already exists error" Jan 10, 2024
@richvdh
Member

richvdh commented Jan 11, 2024

I don't think this has been fixed.

@richvdh richvdh reopened this Jan 11, 2024
@BillCarsonFr
Member Author

Need to check recent rageshakes to see how often it is still occurring.

@BillCarsonFr
Member Author

@gnunicorn
Contributor

gnunicorn commented Apr 8, 2024

We are also seeing this in several reports of messages failing to decrypt. We have reports of this problem on iOS (multi-instance) as well as macOS (single instance): we see an endless loop of trying to upload the same key, which is rejected by the server, eventually leading to the user not being able to decrypt messages. I can provide further logs upon request (but not in public).

(Slightly modified for privacy) we are seeing this repeatedly (hundreds of times):

[2024-04-08][10:04:51.289793][matrix_sdk::encryption][WARN] Error when sending out an outgoing E2EE request error=Http(Api(Server(ClientApi(Error { status_code: 400, authenticate: None, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAA8 already exists. Old key: {\"key\":\"cuglvp20g6Eg1LmVHXJoprTcgfd61onhOzTjPVWCiiQ\",\"signatures\":{\"@XXX\":{\"ed25519:DUPNNDODMN\":\"sM5EK3lkwuyzP/aEZGVich4b1v9LD6VQ41QZuFVBBbgtKTlxFK9b2/Yepu2WQnlxsaUVHaKe0ubQ50irtn3ACg\"}}}; new key: {'key': 'yDoIRQ3PK3VKbysZV14UvcG4TsVxeFxqbwOAy3foeEA', 'signatures': {'@XXX': {'ed25519:DUPNNDODMN': 'nJOR/n9bxG5VLlVKcyEdZ1cGTcDIV1QpjdKYnhmoqlrk0T4rxguIfHbauMbB+uHgltMSFHNMzt4YFAIp0fuaAA'}}}" } }))))

and then it switches to a different key at some (random) point, but still has the same error:


[2024-04-08][10:06:48.933601][matrix_sdk::encryption][WARN] Error when sending out an outgoing E2EE request error=Http(Api(Server(ClientApi(Error { status_code: 400, authenticate: None, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAA8 already exists. Old key: {\"key\":\"cuglvp20g6Eg1LmVHXJoprTcgfd61onhOzTjPVWCiiQ\",\"signatures\":{\"XXX\":{\"ed25519:DUPNNDODMN\":\"sM5EK3lkwuyzP/aEZGVich4b1v9LD6VQ41QZuFVBBbgtKTlxFK9b2/Yepu2WQnlxsaUVHaKe0ubQ50irtn3ACg\"}}}; new key: {'key': 'yDoIRQ3PK3VKbysZV14UvcG4TsVxeFxqbwOAy3foeEA', 'signatures': {'@XX': {'ed25519:DUPNNDODMN': 'nJOR/n9bxG5VLlVKcyEdZ1cGTcDIV1QpjdKYnhmoqlrk0T4rxguIfHbauMbB+uHgltMSFHNMzt4YFAIp0fuaAA'}}}" } }))))

And sometimes we are seeing this (only seen once so far):

[2024-04-07][12:37:39.935698][matrix_sdk_crypto::session_manager::sessions][WARN] Error creating Olm session user_id="@XXXX" device_id="BAGEPAEHCQ" error=InvalidSignature { one_time_key: SignedKey { key: "curve25519:8ZgL1Y8o2dLQRmMC0EXqOLR+ZUu8MPEzo8CKSDIFYV8", signatures: Signatures({"@XXXXX": {"ed25519:BAGEPAEHCQ": Ok(Ed25519("ed25519:z26+zO9wFUGouKXQ/rI75SzzqvG3dfkcBblypH/kUP7mYYXASr645AG7TgCmo7hJXgv34qpdVJkt0CNGV+yTBQ"))}}), fallback: Some(Some(true)), other: {} }, signing_key: Some("ed25519:E15iY8LBAymJj3O2AUVX/qWf2dBdM8leu39slyH7REA"), error: VerificationError(Signature(signature::Error { source: Some(Verification equation was not satisfied) })) }
[2024-04-07][12:37:39.955065][matrix_sdk_crypto::session_manager::sessions][WARN] Error creating Olm session user_id="@YYYY" device_id="UBAWJIQUWK" error=InvalidSignature { one_time_key: SignedKey { key: "curve25519:C8n9lk96+Yvk6mZoR08yaCp08OHj4NO71aW1qJ7i8gY", signatures: Signatures({"@YYYY": {"ed25519:UBAWJIQUWK": Ok(Ed25519("ed25519:vbcYScbWIL0HARMUZPRA3uUfD37GuPIBiwrhbMKeo8EdZl+2fi+JR8KJKf46JrOUW3Fnwbas5tjWZUlphKoJBw"))}}), fallback: None, other: {} }, signing_key: Some("ed25519:K39DjnYRNtx3AqmJWrQ2PEDpMM+YgNgirz+ESLj7PGU"), error: VerificationError(Signature(signature::Error { source: Some(Verification equation was not satisfied) })) }

@gnunicorn
Contributor

As I was asked to provide any information I have about this: while we had occasionally seen "unable to decrypt" and also this error when sending messages before, it seems to have had a huge uptick recently, with several of our users reporting this issue (I am not sure whether it is an iOS-only issue, but it seems to happen there predominantly, so it may be NSE related).

So I looked at the SDK upgrades we did, and from the timeline I think this problem was introduced some time since 1d6c01f. On a quick review of the changes since then (especially of the Olm machine) I couldn't really see anything directly myself, but maybe this is useful information for someone else with more in-depth knowledge of that piece of code...

@kegsay
Member

kegsay commented Jun 4, 2024

I still see this in recent rageshakes. It's hard to know if this is due to an older version, as every OTK upload request will fail until the duplicate key ID has been /claim'd by someone.

There is definitely a multiprocess bug which can cause this: see this test, which can sometimes fail due to insufficient cross-process protection when uploading OTKs. It's currently skipped because it doesn't always reproduce the bug, hence it is flaky.

However, I've also seen this on Element X Android, so that cannot be the only scenario where dupe OTKs can be uploaded.

@kegsay
Member

kegsay commented Jun 7, 2024

In EXA, I see this:

2024-05-14T23:32:01.479257Z DEBUG matrix_sdk::http_client: Error while sending request: Api(Server(ClientApi(Error { status_code: 400, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAT0 already exists. Old key: {\"key\":\"/2oabX2cZa7TUjh6ZDsQsOQI3CKW9o0rJ4WJQU0K024\...

followed hours later by:

2024-05-15T13:00:28.678130Z WARN matrix_sdk_crypto::olm::account: Failed to create a new Olm session from a pre-key message: InboundCreation(MissingOneTimeKey("curve25519:/2oabX2cZa7TUjh6ZDsQsOQI3CKW9o0rJ4WJQU0K024"))

confirming that "OTK already exists" errors are causing failed Olm sessions and hence UTDs. This particular bug report was sent on Element X (0.4.12, 2024051500; SDK e709cca). At the time the keys were uploaded:

2024-05-14T23:32:00.935934Z DEBUG elementx:  Application version: 0.4.12 (40004120) - release / gplay | 
2024-05-14T23:32:00.936164Z DEBUG elementx:  Git SHA: 4f3a66f2 | 
2024-05-14T23:32:00.936370Z DEBUG elementx:  SDK version: SDK VERSION (TODO) | 
2024-05-14T23:32:00.936646Z DEBUG elementx:  Local time: 05-15 01:32:00.928+0200 | 

The OTKs were uploaded as the app started, possibly hinting at the cause?

@poljar
Contributor

poljar commented Jun 19, 2024

The OTKs were uploaded as the app started, possibly hinting at the cause?

One possible cause might be that EXA was also creating multiple client instances: element-hq/element-x-android#3050.

@poljar
Contributor

poljar commented Jun 20, 2024

And another case where EX creates multiple Client objects in the same process: element-hq/element-x-ios#2944

@kegsay
Member

kegsay commented Jun 20, 2024

This might be why I've struggled to reproduce this in complement-crypto if they are app-level issues..

@richvdh
Member

richvdh commented Jun 26, 2024

Can we close this then, if it's due to applications mis-using the api (now fixed?) rather than a bug in matrix-rust-sdk?

@bnjbvr
Member

bnjbvr commented Jun 26, 2024

I would say we're not quite sure it is the root cause, and what would increase confidence a lot is resolving this issue. But we can close this one in the meantime, and reopen if/when it happens again (with the same doubt about whether the linked issue is the panacea fix for it).

@kegsay
Member

kegsay commented Jul 12, 2024

Collating stats on failed /keys/upload:

  • EXI (Today): key 12 with request_size="11.9 kiB" - indicating the initial set of 50 OTKs were re-uploaded, dupe client instance?
  • EXA (Dec 2023): key 51 with request_size="255 B" - indicating a re-upload of a consumed key.
  • EA (Nov 2023): key 52
  • EXA (Nov 2023): key 13 with request_size="11.9 kiB" - indicating the initial set of 50 OTKs were re-uploaded, dupe client instance?
  • EI (Jan 2024)
  • EXI (June 2024): key 610 with request_size="1.2 kiB"
  • EXI (June 2024): key 1109 with request_size="254 B"

Meanwhile, for failed decryptions due to Failed to create a new Olm session from a pre-key message: InboundCreation(MissingOneTimeKey:

  • EW-R (April 2024)
  • EXI (June 2024)
  • EW-R (July 2024)

Failed /keys/upload will result in InboundCreation(MissingOneTimeKey) eventually. What's notable is that EW-R does not appear to have problems with its /keys/upload loop, hinting that the problem is something to do with the mobile apps. The main difference I can think of is that the mobile apps all use the sqlite crypto store, while EW uses the indexeddb store. Could there be transactional failures here which could cause this?

@bnjbvr
Member

bnjbvr commented Jul 15, 2024

Could be one of many failure modes: the apps until recently would create several Client instances, which is super bad — I'd expect Web would create only a single OlmMachine; EXI is the only one to use the crypto cross-process lock; etc.

@richvdh richvdh changed the title from "OneTime key already exists error" to "Lost OTK, leading to "OneTime key already exists" error and later UTDs" Sep 13, 2024
@poljar
Contributor

poljar commented Sep 13, 2024

Another case of this, at least on Android, will be due to the following:

  1. Android does not use the cross-process lock, so the regeneration of the OlmMachine does not happen.
  2. They are using the NotificationClient.

The NotificationClient seems to have a bug where it unconditionally creates a copy of the Client object with an in-memory state store and, more importantly, a completely new OlmMachine. This OlmMachine will over time fork off from the original OlmMachine that the parent Client object has.

The constructor for the NotificationClient calls Client::notification_client():

/// Create a new notification client.
pub async fn new(
    parent_client: Client,
    process_setup: NotificationProcessSetup,
) -> Result<Self, Error> {
    let client = parent_client.notification_client().await?;

    Ok(NotificationClient {
        client,
        parent_client,
        notification_sync_mutex: AsyncMutex::new(()),
        encryption_sync_mutex: AsyncMutex::new(()),
        process_setup,
    })
}

Which in turn calls BaseClient::clone_with_in_memory_state_store():

self.inner.base_client.clone_with_in_memory_state_store(),

Which creates a new BaseClient with the OlmMachine set to None:

let mut result = Self::with_store_config(config);

Later on we set the OlmMachine to Some with the set_session_meta() call:

#[cfg(feature = "e2e-encryption")]
self.regenerate_olm(custom_account).await?;

So now two OlmMachine objects exist, creating duplicate one-time keys and forgetting or rolling back Session objects.

This is likely the biggest culprit for this issue, and once we fix it we might even be able to close it.

@kegsay
Member

kegsay commented Sep 13, 2024

matrix-org/complement-crypto#133 is a failing regression test for the scenario @poljar is describing. This fails with a dupe OTK upload error, as per this issue's title.
