Skip to content
Kegsay edited this page Jul 9, 2020 · 4 revisions

A whole lot about holes in Federation

Following on from the state resolution v2 blog post we did, the Dendrite team is here to try to explain another mystical aspect of Federation: holes.

Background context

Events in a room form a Directed Acyclic Graph (DAG), where each event "points" to the previous event before it.

 A  ----- B  ----- C
 []      [A]      [B]  prev_events

Events can also branch out and form forks, which can then later be merged:

              .--- D --.
              |   [B]  |
              |        |
 A  ----- B  -+--- C --+--- E
 []      [A]      [B]      [C,D] prev_events

When a server receives an event, it probably has the prev_events and therefore it can perform all the calculations it needs to. However, if it doesn't have them it can request them via /get_missing_events:

                A ---- B ---- C ---- D ---- E
Server 1 has?   ✓      ✓                    ✓
Server 2 has?   ✓      ✓      ✓      ✓      ✓

Server 1 has A,B and then receives E. It needs to request C,D:

POST /_matrix/federation/v1/get_missing_events/%21SomeRoom%3Amatrix.org HTTP/1.1
Content-Type: application/json

{
  "limit": 10,
  "min_depth": 0,
  "earliest_events": [
    "$event_B:example.org"
  ],
  "latest_events": [
    "$event_E:example.org"
  ]
}

NB: Be aware of the misleading key names. /get_missing_events does not return events "between" the two events given, but instead walks the DAG, backwards from latest_events and ignoring earliest_events. This distinction is important for DAGs like this:

          .----- D ---- E ---- F   Server 2
A --- B --|
          `----- C                 Server 1

Server 1 has A,B,C and then receives F
Server 2 has A,B,D,E,F

Server 1 needs to request D,E

POST /_matrix/federation/v1/get_missing_events/%21SomeRoom%3Amatrix.org HTTP/1.1
Content-Type: application/json

{
  "limit": 10,
  "min_depth": 0,
  "earliest_events": [
    "$event_C:example.org"
  ],
  "latest_events": [
    "$event_F:example.org"
  ]
}

If /get_missing_events returned events "between" the two events, it would not return D,E as they are not between C,F in the DAG.

This API works well when the server is missing a few events, but sometimes the server is missing a lot of events. When this happens, the server may decide to give up trying to call /get_missing_events over and over and instead a hole is born.

What's a hole and why should I care?

A hole refers to the "hole" in the DAG that is formed when a server has missed many events from Federation and decides to give up backfilling via /get_missing_events e.g. the server has been offline for a long time.

These DAGs can have many different shapes: (Server previously had A,B,C and wants to now store X,Y,Z) - we'll refer to these events throughout.

 A         A            A
 |        / \           |
 B       B   D          B
 |       |   |          |
 C       C   E          C
 |           |         / \ 
...         ...      ... ...
 |           |         \ /
 X           X          X
 |           |          |
 Y           Y          Y
 |           |          |
 Z           Z          Z
Linear  Early fork    Fork

Servers need to care about holes so they can perform auth checks and verify the room state for all the messages it wants to persist. This means it needs to know the state of the room before X to work out if X is allowed, and so on for Y and Z. Clients need to care about holes so they can show the gap on their UI and backfill via /messages requests.

How servers should handle holes

Servers need to perform auth checks to work out if the event being processed is "allowed" according to the rules of the room. These checks are done on a snapshot of the room state at a particular point in time: just before the event being processed is added to the DAG.

A bit on state terminology

A frequent gotcha is the distinction between before/after/at when referring to the "state of the room" for a particular event. The state "at" an event is an ambigous term as it could refer to before or after the event has been applied. Ideally, no code or specification should refer to the state "at" an event. The terms "before" and "after" are also somewhat suggestive and unfortunately the intuition is wrong: the state after B is not always the state before C.

Consider the following graphs, one linear and one fork:

Before A ___            ________ Before B
After A ____A     A   B ________ After B - no state from A involved, B applied.
Before B ___|      \ /__________ Before C - state resolution over A,B applied, C not involved.
After B ____B       C___________ After C - state resolution over A,B applied, C applied.
Before C ___|
After C ____C

For the linear DAG, the state after B is the same as the state before C. However, for the fork DAG, the state after B has no knowledge of A and hence isn't included. The state before C is the result of the state resolution algorithm being applied over the two branches A,B and hence will result in different state.

TODO: When to use /backfill vs /get_missing_events

How to persist backfilled messages

  • We want to persist X,Y,Z
  • Request the state before X from the remote homeserver e.g via /state_ids
  • Authorise that X is allowed according to that state, then apply X to that state.
  • Authorise that Y is allowed according to the new state, then apply Y to that state.
  • Authorise that Z is allowed according to the new state, then apply Z to that state.

It's very important that if there is a hole (that is, you do not know X's prev_events) that you mix in what you last knew as the following diagrams show:

 A         A            A
 |        / \           |
 B       B   D          B
 |       |   |          |
 C--.    C   E          C----.
 |  |    |   |         / \   |
... |    |  ...      ... ... |
 |  |    |   |         \ /   |
 X  |    |   X          X    |
 |  |    |   |          |    |
 Y  |    |   Y          Y    |
 |  |    |   |          |    |
 Z--`    `---Z          Z----`

TODO: Explain why

Rejected events

TODO

  • It can still be referenced in the DAG by other servers
  • But also, just because the event is rejected doesn't mean its useless
  • (eg the prev events might still be useful for ordering purposes, and for figuring out state of any events pointing at it)