
Interop: make op-node play back old data to sync op-supervisor #12784

Closed · protolambda opened this issue Nov 1, 2024 · 8 comments

@protolambda (Contributor) commented Nov 1, 2024:

When the op-supervisor is out of sync, e.g. after a DB wipe, we need the op-node to reproduce the local-safe derivation relations for it.

I.e. the op-node needs to resync derivation to help the op-supervisor catch up.

Alternatively/additionally, we could make the op-supervisor start from a non-zero block. (We'll need to fix one edge case, where the parent block is missing for the first block entry, causing a panic.)

Then the op-supervisor could assume everything is cross-safe / finalized starting at that point, and error if there's a user query for data before the sync anchor point.
The op-supervisor will still need to sync historical receipt data, to be able to verify cross-chain message dependencies of new blocks.
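
To make the anchor-point idea concrete, here is a minimal sketch of the query guard; the Supervisor struct and CheckSafety method are illustrative placeholders, not the real op-supervisor API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBeforeAnchor signals that the requested block predates the supervisor's
// sync anchor, so no local-safe / cross-safe data exists for it.
var ErrBeforeAnchor = errors.New("requested block is before the sync anchor point")

// Supervisor holds only the anchor block number in this sketch; the real
// op-supervisor keeps far more state.
type Supervisor struct {
	anchorBlock uint64 // first block the supervisor has data for
}

// CheckSafety illustrates the proposed behaviour: blocks at or after the
// anchor can be answered (assumed cross-safe / finalized at the anchor),
// anything before it is an explicit error.
func (s *Supervisor) CheckSafety(blockNum uint64) (string, error) {
	if blockNum < s.anchorBlock {
		return "", fmt.Errorf("block %d: %w (anchor is %d)", blockNum, ErrBeforeAnchor, s.anchorBlock)
	}
	return "cross-safe", nil
}

func main() {
	s := &Supervisor{anchorBlock: 22854}
	if _, err := s.CheckSafety(100); err != nil {
		fmt.Println("query rejected:", err)
	}
	level, _ := s.CheckSafety(23869)
	fmt.Println("query answered:", level)
}
```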

protolambda added this to the Interop: Stable Devnet milestone on Nov 1, 2024
axelKingsley self-assigned this on Nov 4, 2024
@axelKingsley (Contributor) commented:

> Alternatively/additionally, we could make the op-supervisor start from a non-zero block. (We'll need to fix one edge case, where the parent block is missing for the first block entry, causing a panic.)

This is fixed via https://github.com/ethereum-optimism/optimism/pull/12818/files

@axelKingsley (Contributor) commented:

Step 1: Reproduction of Bug

  • launched the local devnet on an affected commit
  • observed standard operation
  • took the supervisor offline for a short while
  • started the supervisor back up

Upon startup, we see that the supervisor is unable to proceed:

t=2024-11-05T18:37:09+0000 lvl=warn msg="Served supervisor_updateLocalSafe" conn=172.18.0.9:38158 reqid=81307 duration=56.583µs err="derived block 0xad087d93b78de617f4b616116dcd4918dc6f0fe2aceaa8c42d24cce5c1f84cbf:23869 (parent: 0xcf91be7cb63362224a86484451fbaffe4dc3bd6f0d6559489462bfd273290506) is too new, expected to build on top of BlockSeal(hash:0x7c42c132e61786a56bcf348dc2b6f63757ac489e3b7e928d16c4be23a0d8bc5f, number:22854, time:1730829777): data out of order"
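
For context, here is a stripped-down sketch of the kind of ordering check that produces this error; the localSafeDB type and AddDerived method are simplified stand-ins for illustration, not the actual supervisor code:

```go
package main

import (
	"errors"
	"fmt"
)

var ErrOutOfOrder = errors.New("data out of order")

// BlockSeal mirrors the fields printed in the log line above (hash omitted).
type BlockSeal struct {
	Number uint64
	Time   uint64
}

// localSafeDB tracks only the last derived (local-safe) block in this sketch.
type localSafeDB struct {
	lastDerived BlockSeal
}

// AddDerived accepts a new derived block only if it directly extends the last
// recorded one. Because the supervisor missed updates while it was offline,
// the op-node's next update (block 23869) does not build on the supervisor's
// last entry (block 22854), so the update is rejected.
func (db *localSafeDB) AddDerived(derived BlockSeal) error {
	if derived.Number != db.lastDerived.Number+1 {
		return fmt.Errorf("derived block %d is too new, expected to build on top of %d: %w",
			derived.Number, db.lastDerived.Number, ErrOutOfOrder)
	}
	db.lastDerived = derived
	return nil
}

func main() {
	db := &localSafeDB{lastDerived: BlockSeal{Number: 22854, Time: 1730829777}}
	fmt.Println(db.AddDerived(BlockSeal{Number: 23869}))
}
```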

@axelKingsley (Contributor) commented:

(Making this note because I forgot the particulars of this issue and had to rediscover them.)

The issue here is not with logs, which sync and continue to sync without problems when the supervisor is restarted. So the chain_processor as a whole is not at fault, as far as I can tell.

Rather, this is about recording the derived blocks that the op-node sees. The op-node moves forward while the Supervisor is offline, and there is no replay of the updates it missed.

@axelKingsley (Contributor) commented Nov 12, 2024:

I expanded the logging around this error, since we were previously only emitting the lastDerivedFrom:

t=2024-11-12T22:05:40+0000 lvl=warn msg="Served supervisor_updateLocalSafe" conn=172.18.0.9:47268 reqid=4082 duration=77µs err="cannot add block (0xc51c18bd34e0e7c4ca3ee57fba622aeebb898bbc5ef4e86517f437406b877959:0 derived from 0x12f888ce4789f20f5be783c102e149de0efcb3bc7afa69cfb356a5e81ce1c45f:45), last block (BlockSeal(hash:0xc51c18bd34e0e7c4ca3ee57fba622aeebb898bbc5ef4e86517f437406b877959, number:0, time:1731446404) derived from BlockSeal(hash:0xe7c27b3ee022eee0456b8c0c4bfb8702f5d3c2e02bf1614f01a24e50dd25d0f1, number:43, time:1731449128)) is in the future: data out of order"

Pulling the data out:

New Block Attempt:

  • derived 0xc51c18bd34e0e7c4ca3ee57fba622aeebb898bbc5ef4e86517f437406b877959:0
  • derivedFrom 0x12f888ce4789f20f5be783c102e149de0efcb3bc7afa69cfb356a5e81ce1c45f:45

Existing Block:

  • derived 0xc51c18bd34e0e7c4ca3ee57fba622aeebb898bbc5ef4e86517f437406b877959:0
  • derivedFrom 0xe7c27b3ee022eee0456b8c0c4bfb8702f5d3c2e02bf1614f01a24e50dd25d0f1:43

Unfortunately, this makes sense: the L1 derivation information has advanced, and the same L2 block 0 can be derived from both L1 block 43 and L1 block 45.

The Supervisor should be able to reconstruct and repair this data when we detect this error. It could achieve this as follows (see the sketch after this list):

  • when triggered (for example by this error), start a "Derived Backfill"
  • query the op-node (or preferably op-geth) for the L1<>L2 relationships, assuming that a gap in L1 means all of those L1 blocks share the same derived L2 block
  • call the same Add Derived handling that is failing here for each block along that path

This should recover the supervisor by forcing a backfill of data for the L1 blocks which don't contain L2 updates.
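
A rough sketch of what that backfill loop could look like, assuming hypothetical nodeClient and supervisorDB interfaces (neither exists in this form today):

```go
package main

import "fmt"

// BlockRef is a minimal stand-in for an L1 or L2 block reference.
type BlockRef struct {
	Number uint64
	Hash   string
}

// nodeClient is an assumed interface for asking the op-node (or op-geth)
// which L2 block was derived from a given L1 block.
type nodeClient interface {
	DerivedAt(l1Number uint64) (l1 BlockRef, derived BlockRef, err error)
}

// supervisorDB is an assumed interface around the Add Derived handling that
// is currently failing.
type supervisorDB interface {
	LastDerivedFrom() BlockRef
	AddDerived(derivedFrom BlockRef, derived BlockRef) error
}

// backfillDerived replays the L1<>L2 relationship from the supervisor's last
// known derivedFrom block up to the target L1 block. L1 blocks that produced
// no new L2 blocks simply repeat the previous derived L2 block, which the
// node is assumed to report via DerivedAt.
func backfillDerived(db supervisorDB, node nodeClient, targetL1 uint64) error {
	for n := db.LastDerivedFrom().Number + 1; n <= targetL1; n++ {
		l1, derived, err := node.DerivedAt(n)
		if err != nil {
			return fmt.Errorf("fetching derived block for L1 %d: %w", n, err)
		}
		if err := db.AddDerived(l1, derived); err != nil {
			return fmt.Errorf("adding derived %d from L1 %d: %w", derived.Number, l1.Number, err)
		}
	}
	return nil
}

func main() {
	// Sketch only: wiring in real db and node implementations is out of scope here.
	_ = backfillDerived
	fmt.Println("backfillDerived replays one derived entry per missing L1 block")
}
```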

@axelKingsley (Contributor) commented:

> I.e. the op-node needs to resync derivation to help the op-supervisor catch up.

@protolambda this suggestion seems very destructive to the node, but since the Supervisor only has outbound connections to the Execution Engine, it seems the L1<>L2 relationship must be driven by the node. Is there a way we can re-drive this data without resetting the node's derivation?

@axelKingsley (Contributor) commented:

OK, to answer my own question from yesterday ("this seems destructive, is there no other way?"):

The "other way" would be a Safety Index of some kind. Here is a document describing the need for a Safety Index in the op-node. However, I think the Safety Index would be much better located on the Supervisor, if the Supervisor were made the source of L1 data for nodes. That's an idea I have in draft here.

So, in the short term, resetting the derivation pipeline back to the last safe head known to the Supervisor should work. Nodes would then play forward and report the correct data.
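
A minimal sketch of that reset flow, assuming hypothetical supervisorClient and derivationPipeline interfaces rather than the real op-node internals:

```go
package main

import "fmt"

// BlockRef is a minimal stand-in for an L2 block reference.
type BlockRef struct {
	Number uint64
	Hash   string
}

// supervisorClient is an assumed client for querying the supervisor's view.
type supervisorClient interface {
	LastLocalSafe(chainID uint64) (BlockRef, error)
}

// derivationPipeline is an assumed handle on the op-node's derivation driver.
type derivationPipeline interface {
	ResetTo(safeHead BlockRef) error
}

// resyncSupervisor rewinds derivation to the supervisor's last known safe
// head so that, as the node plays forward again, the replayed local-safe
// updates fill the supervisor's gap.
func resyncSupervisor(chainID uint64, sup supervisorClient, pipeline derivationPipeline) error {
	safeHead, err := sup.LastLocalSafe(chainID)
	if err != nil {
		return fmt.Errorf("querying supervisor for last local-safe head: %w", err)
	}
	if err := pipeline.ResetTo(safeHead); err != nil {
		return fmt.Errorf("resetting derivation to block %d: %w", safeHead.Number, err)
	}
	return nil
}

func main() {
	// Sketch only: concrete supervisor and pipeline implementations are omitted.
	_ = resyncSupervisor
	fmt.Println("resyncSupervisor rewinds derivation to the supervisor's safe head and replays")
}
```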

@axelKingsley (Contributor) commented:

@protolambda (Contributor, Author) commented:

The above PR was merged, so I'm closing this. But we'll continue to track how the op-node and op-supervisor interact, so that sync works well.
