Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully recover from wedged session (discard session for next distribution, or reshare as m.room.key) #1992

Open
pmaier1 opened this issue Aug 17, 2023 · 12 comments

Comments

@pmaier1
Copy link
Contributor

pmaier1 commented Aug 17, 2023

High Level:

Explore a way to (selectively!) re-request encryption keys to increase crypto reliability.
Given that olm session can get wedged, could it be possible find a way to recover from that: i.e be able to get back the failed to decrypt message.

IT IS NOT the re-implementation of key-requesting that was terrible (creating a lot of traffic and slowing down everything)

Scenario:

  • As a sender I am notified that a recipient wasn't able to decrypt a message I sent
  • Given that this user was supposed to get that key, we could reset the olm session then re-share the current outbound session as a m.room_key (not a forwarded key)

The recipient would send the to_device failed to decrypt ack only to the exact sender of the to device! (different from key requesting that would ask to all the current user devices)

@pmaier1 pmaier1 changed the title 🕛 Gracefully recover from wedged session (discard session for next distribution, or reshare as m.room.key) Gracefully recover from wedged session (discard session for next distribution, or reshare as m.room.key) Aug 28, 2023
@richvdh
Copy link
Member

richvdh commented Oct 19, 2023

@pmaier1 I'm not really sure what's to be done here. We already have recovery from wedged olm sessions. Can you explain what precise scenario we're trying to cover here?

@pmaier1
Copy link
Contributor Author

pmaier1 commented Oct 25, 2023

Well, to my understanding there still are cases where a user has to manually type /discardsession in order to recover. The idea here was to automate this, if possible.

@richvdh
Copy link
Member

richvdh commented Oct 25, 2023

Hrm. /discardsession replaces the megolm session, and megolm sessions can't really get "wedged" in the same way as olm sessions. /discardsession won't do anything to help with a wedged olm session.

Generally I'd say we should figure out what cases a /discardsession actually helps with, and propose real fixes for them on a case-by-case basis, rather than automating a /discardsession (which I suspect would be hard, in any case).

Any idea what those cases are?

@kegsay
Copy link

kegsay commented Jan 17, 2024

We only expect to see this when clients or servers need to rollback their database.

There are other cases where /discardsession will help fix things, but those other cases should be fixable, whereas server/client rollbacks aren't.

@pmaier1
Copy link
Contributor Author

pmaier1 commented Jan 19, 2024

The solution to this ticket is supposed to also solve #2155.

@BillCarsonFr
Copy link
Member

We think this would be usefull for robustness. But for now we focus efforts on finding root causes of wedging

@richvdh
Copy link
Member

richvdh commented Feb 28, 2024

It isn't made explicit anywhere above: I believe the intention of this issue is to improve the current olm session recovery: the current implementation does nothing to help with Olm messages that have already been sent. (Which is why /discardsession helps: it ensures that the next message the user sends will cause a new megolm keyshare, over a new Olm session.)

As a recipient, if we detect a wedged olm session (which causes us to make a new olm session), we need to tell the sender about the situation so that they know they still need to send us the key. Ideally, the sender needs to send us all megolm keys they already tried to send, but at the very least they need to mark all existing keys as "not yet shared with this device".

@richvdh
Copy link
Member

richvdh commented Apr 10, 2024

As a partial solution to this, we could not worry about past messages, but at least improve the situation for future messages.

Currently: the olm-session unwedging doesn't help with existing megolm sessions until the sending user does a /discardsession, or the megolm session is rotated for another reason (eg, someone leaving the room).

Instead we could:

  • When a recipient receives an undecryptable Olm message: instead of sending an Olm-encrypted m.dummy message (as now), they instead send a more explicit "failure" message.
  • When the sender receives a "failure" message, then: for all active outgoing megolm sessions, they clear the flag that says the session has been successfully shared with the recipient device.
  • This will then mean that, next time the sender sends a message, the megolm session will be *re-*shared using a new Olm session (though note that it will only be shared in its ratcheted state, so this won't help decrypt messages which have already been sent). In other words, we are automating the work currently done by a /discardsession in this scenario, in a rather more efficient way.

⚠️ we would need to be careful to still rotate the megolm session if the device leaves the room after sending the "failure" notification, in case it was lying about the failure.

@richvdh
Copy link
Member

richvdh commented Apr 11, 2024

As a partial solution to this, we could not worry about past messages, but at least improve the situation for future messages....

I have split this out to #2389.

@richvdh richvdh added the A-E2EE label Apr 11, 2024
@BillCarsonFr
Copy link
Member

Some more context for the record.
Issues like the following element-hq/synapse#17117 (following mx.org outage), could result also in dlivery failures. This is a bug that need to be fixed, but currently there is no way to recover from it (until next session rotation)

@kegsay
Copy link

kegsay commented Apr 25, 2024

I asked some questions about how we recover from wedged Olm sessions. The purpose of sending an m.dummy Olm message is:

to try to make the receiver use the new session and hence not cause the sender to get a UTD from them by continuing to use the wedged session

There are potentially problems around this mechanism though. Rust SDK seems to not follow the spec which says:

If a client has multiple sessions established with another device, it should use the session from which it last received and successfully decrypted a message.

It seems to use creation time instead. We need to align on which is correct. In addition, vdh points out that client session timestamps have been corrupted so we likely need to consider that too.

In all cases, recovering from a wedged OIm session involves a new /keys/claim which itself may cause UTDs which is particularly ironic, given we're doing this to fix UTDs.

@BillCarsonFr
Copy link
Member

Crosslink matrix-org/matrix-rust-sdk#3356

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants