Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that existing megolm sessions are re-shared after a wedged olm session #2389

Closed
richvdh opened this issue Apr 11, 2024 · 2 comments · Fixed by matrix-org/matrix-rust-sdk#3604
Assignees
Labels

Comments

@richvdh
Copy link
Member

richvdh commented Apr 11, 2024

TL;DR: implement a partial solution to #1992 which at least fixes the problem for future messages.


Suppose a recipient fails to decrypt an Olm message, due to either database rollback or other Olm implementation errors such as database corruption. Obviously, if the Olm message contained a Megolm key, the recipient is then unable to decrypt any messages encrypted with that Megolm session.

Currently, when this happens, the recipient attempts an "unwedging" operation, which involves creating a new Olm session (and sending a dummy message over it). This should help for any future Megolm sessions, but does nothing to help with any Megolm sessions whose keys have already been shared over the broken Olm session. Indeed, since an existing Megolm session can be used for some time, the sender may continue to send room messages that the recipient cannot decrypt for some time after a "successful" unwedging operation. (In this case, a manual /discardsession on the sender can provide a workaround.)

A full solution to this problem involves the sender re-sharing keys for all Megolm messages that have already been sent over the broken session. This is tracked at #1992. However, that full operation is somewhat thorny, as it means keeping good records of exactly which Megolm keys have been shared with which recipient devices, and at what ratchet state.

As a partial solution, we could not worry about past messages, but at least improve the situation for future messages.

So, we could:

  • When a recipient receives an undecryptable Olm message: instead of sending an Olm-encrypted m.dummy message (as now), they instead send a more explicit "failure" message.
  • When the sender receives a "failure" message, then: for all active outgoing megolm sessions, they clear the flag that says the session has been successfully shared with the recipient device.
  • This will then mean that, next time the sender sends a message, the megolm session will be *re-*shared using a new Olm session (though note that it will only be shared in its ratcheted state, so this won't help decrypt messages which have already been sent). In other words, we are automating the work currently done by a /discardsession in this scenario, in a rather more efficient way.

⚠️ we would need to be careful to still rotate the megolm session if the device leaves the room after sending the "failure" notification, in case it was lying about the failure.


I think this would be easier than a full solution to #1992, whilst still providing a useful base on which to build a full solution.

@richvdh
Copy link
Member Author

richvdh commented Jun 18, 2024

@uhoreg wrote:

I think that there are two main ways to fix this:

  1. when we get an m.dummy from someone, we search through all outbound group sessions, and mark them as "needs-resharing" for that group session if we've already shared with them. "needs-reshare" counts as not-shared when determining whether we should share the session with a device, but counts as shared when determining whether the session needs to be rotated. This should be fairly simple to do, but could be slow if there are many outbound group sessions. The main problem is that we don't seem to have any function to fetch all outbound group sessions, so we'd need a new function in all the CryptoStore implementations.

  2. for each device, we store some sort of counter. When we share a group session with a device, we store the current value of the counter with the group session. When we check if a group session is shared with a device, we compare the stored counter for that device (if available) with the device's current counter. If the device's current counter is bigger than the counter stored with the group session, it needs to be re-shared. When we get an m.dummy from someone, we increment that device's counter. This may be a bit more complicated, but it won't matter how many outbound sessions there are.

I think I agree, though it's not completely obvious to me that fetching all outbound group sessions is going to be prohibitively slow (It's worth noting that we have a maximum of one OGS per (encrypted) room, since whenever we create a new session in a room it replaces the old one. So maybe a few hundred?)

That said, even a few hundred could take a while to iterate, and it would be better to avoid it. The counter mechanism you suggest is elegant and has similarity to our solution to slow key backup reset (element-hq/element-web#26892).

I think it should be fairly easy to extend the SharedWith struct to include a counter, and likewise to add a counter to (the confusingly-named) ReadOnlyDevice.

@uhoreg
Copy link
Member

uhoreg commented Jun 25, 2024

The approach that I took in matrix-org/matrix-rust-sdk#3604 was to check, when we decrypt an Olm message, to see whether that Olm message starts a brand now Olm session that we haven't seen before. If this happens, then we consider it to be an unwedging attempt from the other party. This should be more robust that checking that the event is an m.dummy, and also allows us to keep treating m.dummy itself as a no-op.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants