Ensure that existing megolm sessions are re-shared after a wedged olm session #2389

richvdh · 2024-04-11T09:53:27Z

TL;DR: implement a partial solution to #1992 which at least fixes the problem for future messages.

Suppose a recipient fails to decrypt an Olm message, due to either database rollback or other Olm implementation errors such as database corruption. Obviously, if the Olm message contained a Megolm key, the recipient is then unable to decrypt any messages encrypted with that Megolm session.

Currently, when this happens, the recipient attempts an "unwedging" operation, which involves creating a new Olm session (and sending a dummy message over it). This should help for any future Megolm sessions, but does nothing to help with any Megolm sessions whose keys have already been shared over the broken Olm session. Indeed, since an existing Megolm session can be used for some time, the sender may continue to send room messages that the recipient cannot decrypt for some time after a "successful" unwedging operation. (In this case, a manual /discardsession on the sender can provide a workaround.)

A full solution to this problem involves the sender re-sharing keys for all Megolm messages that have already been sent over the broken session. This is tracked at #1992. However, that full operation is somewhat thorny, as it means keeping good records of exactly which Megolm keys have been shared with which recipient devices, and at what ratchet state.

As a partial solution, we could not worry about past messages, but at least improve the situation for future messages.

So, we could:

When a recipient receives an undecryptable Olm message: instead of sending an Olm-encrypted m.dummy message (as now), they instead send a more explicit "failure" message.
When the sender receives a "failure" message, then: for all active outgoing megolm sessions, they clear the flag that says the session has been successfully shared with the recipient device.
This will then mean that, next time the sender sends a message, the megolm session will be *re-*shared using a new Olm session (though note that it will only be shared in its ratcheted state, so this won't help decrypt messages which have already been sent). In other words, we are automating the work currently done by a /discardsession in this scenario, in a rather more efficient way.

⚠️ we would need to be careful to still rotate the megolm session if the device leaves the room after sending the "failure" notification, in case it was lying about the failure.

I think this would be easier than a full solution to #1992, whilst still providing a useful base on which to build a full solution.

The text was updated successfully, but these errors were encountered:

richvdh · 2024-06-18T11:53:28Z

@uhoreg wrote:

I think that there are two main ways to fix this:

when we get an m.dummy from someone, we search through all outbound group sessions, and mark them as "needs-resharing" for that group session if we've already shared with them. "needs-reshare" counts as not-shared when determining whether we should share the session with a device, but counts as shared when determining whether the session needs to be rotated. This should be fairly simple to do, but could be slow if there are many outbound group sessions. The main problem is that we don't seem to have any function to fetch all outbound group sessions, so we'd need a new function in all the CryptoStore implementations.

for each device, we store some sort of counter. When we share a group session with a device, we store the current value of the counter with the group session. When we check if a group session is shared with a device, we compare the stored counter for that device (if available) with the device's current counter. If the device's current counter is bigger than the counter stored with the group session, it needs to be re-shared. When we get an m.dummy from someone, we increment that device's counter. This may be a bit more complicated, but it won't matter how many outbound sessions there are.

I think I agree, though it's not completely obvious to me that fetching all outbound group sessions is going to be prohibitively slow (It's worth noting that we have a maximum of one OGS per (encrypted) room, since whenever we create a new session in a room it replaces the old one. So maybe a few hundred?)

That said, even a few hundred could take a while to iterate, and it would be better to avoid it. The counter mechanism you suggest is elegant and has similarity to our solution to slow key backup reset (element-hq/element-web#26892).

I think it should be fairly easy to extend the SharedWith struct to include a counter, and likewise to add a counter to (the confusingly-named) ReadOnlyDevice.

uhoreg · 2024-06-25T00:58:34Z

The approach that I took in matrix-org/matrix-rust-sdk#3604 was to check, when we decrypt an Olm message, to see whether that Olm message starts a brand now Olm session that we haven't seen before. If this happens, then we consider it to be an unwedging attempt from the other party. This should be more robust that checking that the event is an m.dummy, and also allows us to keep treating m.dummy itself as a no-op.

richvdh mentioned this issue Apr 11, 2024

Gracefully recover from wedged session (discard session for next distribution, or reshare as m.room.key) #1992

Open

richvdh added the A-E2EE label Apr 11, 2024

richvdh mentioned this issue May 22, 2024

Unwedging doesn't work if wedged session is new, and first one created for device matrix-org/matrix-rust-sdk#3427

Open

uhoreg mentioned this issue Jun 25, 2024

Reshare Megolm session after the other party unwedges matrix-org/matrix-rust-sdk#3604

Merged

uhoreg self-assigned this Jun 26, 2024

poljar closed this as completed in matrix-org/matrix-rust-sdk#3604 Jun 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure that existing megolm sessions are re-shared after a wedged olm session #2389

Ensure that existing megolm sessions are re-shared after a wedged olm session #2389

richvdh commented Apr 11, 2024 •

edited

Loading

richvdh commented Jun 18, 2024 •

edited

Loading

uhoreg commented Jun 25, 2024

Ensure that existing megolm sessions are re-shared after a wedged olm session #2389

Ensure that existing megolm sessions are re-shared after a wedged olm session #2389

Comments

richvdh commented Apr 11, 2024 • edited Loading

richvdh commented Jun 18, 2024 • edited Loading

uhoreg commented Jun 25, 2024

richvdh commented Apr 11, 2024 •

edited

Loading

richvdh commented Jun 18, 2024 •

edited

Loading