
Interop: make op-node play back old data to sync op-supervisor #12784

Closed · protolambda opened this issue Nov 1, 2024 · 8 comments

@protolambda (Contributor) commented Nov 1, 2024:

When the op-supervisor is out of sync, e.g. after a DB wipe, we need the op-node to reproduce the local-safe derivation relations for it.

I.e. the op-node needs to resync derivation to help the op-supervisor catch up.

Alternatively/additionally, we could make the op-supervisor start from a non-zero block. (We'll need to fix one edge case, where the parent block is missing for the first block entry, causing a panic.)

Then the op-supervisor could assume everything is cross-safe / finalized starting at that point, and error if there's a user query for data before the sync anchor point.
The op-supervisor will still need to sync historical receipt data, to be able to verify cross-chain message dependencies of new blocks.
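
To make the anchor-point idea concrete, here is a minimal sketch of the query guard; the Supervisor struct and CheckSafety method are illustrative placeholders, not the real op-supervisor API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBeforeAnchor signals that the requested block predates the supervisor's
// sync anchor, so no local-safe / cross-safe data exists for it.
var ErrBeforeAnchor = errors.New("requested block is before the sync anchor point")

// Supervisor holds only the anchor block number in this sketch; the real
// op-supervisor keeps far more state.
type Supervisor struct {
	anchorBlock uint64 // first block the supervisor has data for
}

// CheckSafety illustrates the proposed behaviour: blocks at or after the
// anchor can be answered (assumed cross-safe / finalized at the anchor),
// anything before it is an explicit error.
func (s *Supervisor) CheckSafety(blockNum uint64) (string, error) {
	if blockNum < s.anchorBlock {
		return "", fmt.Errorf("block %d: %w (anchor is %d)", blockNum, ErrBeforeAnchor, s.anchorBlock)
	}
	return "cross-safe", nil
}

func main() {
	s := &Supervisor{anchorBlock: 22854}
	if _, err := s.CheckSafety(100); err != nil {
		fmt.Println("query rejected:", err)
	}
	level, _ := s.CheckSafety(23869)
	fmt.Println("query answered:", level)
}
```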

protolambda added this to the Interop: Stable Devnet milestone on Nov 1, 2024
axelKingsley self-assigned this on Nov 4, 2024
@axelKingsley (Contributor) commented:

> Alternatively/additionally, we could make the op-supervisor start from a non-zero block. (We'll need to fix one edge case, where the parent block is missing for the first block entry, causing a panic.)

This is fixed via https://github.com/ethereum-optimism/optimism/pull/12818/files

@axelKingsley (Contributor) commented:

Step 1: Reproduction of Bug

  • launched the local devnet on an affected commit
  • observed standard operation
  • took the supervisor offline for a short while
  • started the supervisor back up

Upon startup, we see that the supervisor is unable to proceed:

t=2024-11-05T18:37:09+0000 lvl=warn msg="Served supervisor_updateLocalSafe" conn=172.18.0.9:38158 reqid=81307 duration=56.583µs err="derived block 0xad087d93b78de617f4b616116dcd4918dc6f0fe2aceaa8c42d24cce5c1f84cbf:23869 (parent: 0xcf91be7cb63362224a86484451fbaffe4dc3bd6f0d6559489462bfd273290506) is too new, expected to build on top of BlockSeal(hash:0x7c42c132e61786a56bcf348dc2b6f63757ac489e3b7e928d16c4be23a0d8bc5f, number:22854, time:1730829777): data out of order"
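
For context, here is a stripped-down sketch of the kind of ordering check that produces this error; the localSafeDB type and AddDerived method are simplified stand-ins for illustration, not the actual supervisor code:

```go
package main

import (
	"errors"
	"fmt"
)

var ErrOutOfOrder = errors.New("data out of order")

// BlockSeal mirrors the fields printed in the log line above (hash omitted).
type BlockSeal struct {
	Number uint64
	Time   uint64
}

// localSafeDB tracks only the last derived (local-safe) block in this sketch.
type localSafeDB struct {
	lastDerived BlockSeal
}

// AddDerived accepts a new derived block only if it directly extends the last
// recorded one. Because the supervisor missed updates while it was offline,
// the op-node's next update (block 23869) does not build on the supervisor's
// last entry (block 22854), so the update is rejected.
func (db *localSafeDB) AddDerived(derived BlockSeal) error {
	if derived.Number != db.lastDerived.Number+1 {
		return fmt.Errorf("derived block %d is too new, expected to build on top of %d: %w",
			derived.Number, db.lastDerived.Number, ErrOutOfOrder)
	}
	db.lastDerived = derived
	return nil
}

func main() {
	db := &localSafeDB{lastDerived: BlockSeal{Number: 22854, Time: 1730829777}}
	fmt.Println(db.AddDerived(BlockSeal{Number: 23869}))
}
```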

@axelKingsley (Contributor) commented:

(Making this note because I forgot the particulars of this issue and had to rediscover them.)

The issue here is not with logs, which sync and continue to sync without problems when the supervisor is restarted. So the chain_processor as a whole is not at fault, as far as I can tell.

Rather, this is about recording the derived blocks that the op-node sees. The op-node moves forward while the Supervisor is offline, and there is no replay of the updates it missed.

@axelKingsley (Contributor) commented Nov 12, 2024:

I expanded the logging around this error, since we were previously only emitting the lastDerivedFrom:

t=2024-11-12T22:05:40+0000 lvl=warn msg="Served supervisor_updateLocalSafe" conn=172.18.0.9:47268 reqid=4082 duration=77µs err="cannot add block (0xc51c18bd34e0e7c4ca3ee57fba622aeebb898bbc5ef4e86517f437406b877959:0 derived from 0x12f888ce4789f20f5be783c102e149de0efcb3bc7afa69cfb356a5e81ce1c45f:45), last block (BlockSeal(hash:0xc51c18bd34e0e7c4ca3ee57fba622aeebb898bbc5ef4e86517f437406b877959, number:0, time:1731446404) derived from BlockSeal(hash:0xe7c27b3ee022eee0456b8c0c4bfb8702f5d3c2e02bf1614f01a24e50dd25d0f1, number:43, time:1731449128)) is in the future: data out of order"

Pulling the data out:

New Block Attempt:

  • derived 0xc51c18bd34e0e7c4ca3ee57fba622aeebb898bbc5ef4e86517f437406b877959:0
  • derivedFrom 0x12f888ce4789f20f5be783c102e149de0efcb3bc7afa69cfb356a5e81ce1c45f:45

Existing Block:

  • derived 0xc51c18bd34e0e7c4ca3ee57fba622aeebb898bbc5ef4e86517f437406b877959:0
  • derivedFrom 0xe7c27b3ee022eee0456b8c0c4bfb8702f5d3c2e02bf1614f01a24e50dd25d0f1:43

Unfortunately, this makes sense: the L1 derivation information has advanced, and the same L2 block 0 can be derived from both L1 block 43 and L1 block 45.

The Supervisor should be able to reconstruct and repair this data when we detect this error. It could achieve this as follows (see the sketch after this list):

  • when triggered (for example by this error), start a "Derived Backfill"
  • query the op-node (or preferably op-geth) for the L1<>L2 relationships, assuming that a gap in L1 means all of those L1 blocks share the same derived L2 block
  • call the same Add Derived handling that is failing here for each block along that path

This should recover the supervisor by forcing a backfill of data for the L1 blocks which don't contain L2 updates.
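
A rough sketch of what that backfill loop could look like, assuming hypothetical nodeClient and supervisorDB interfaces (neither exists in this form today):

```go
package main

import "fmt"

// BlockRef is a minimal stand-in for an L1 or L2 block reference.
type BlockRef struct {
	Number uint64
	Hash   string
}

// nodeClient is an assumed interface for asking the op-node (or op-geth)
// which L2 block was derived from a given L1 block.
type nodeClient interface {
	DerivedAt(l1Number uint64) (l1 BlockRef, derived BlockRef, err error)
}

// supervisorDB is an assumed interface around the Add Derived handling that
// is currently failing.
type supervisorDB interface {
	LastDerivedFrom() BlockRef
	AddDerived(derivedFrom BlockRef, derived BlockRef) error
}

// backfillDerived replays the L1<>L2 relationship from the supervisor's last
// known derivedFrom block up to the target L1 block. L1 blocks that produced
// no new L2 blocks simply repeat the previous derived L2 block, which the
// node is assumed to report via DerivedAt.
func backfillDerived(db supervisorDB, node nodeClient, targetL1 uint64) error {
	for n := db.LastDerivedFrom().Number + 1; n <= targetL1; n++ {
		l1, derived, err := node.DerivedAt(n)
		if err != nil {
			return fmt.Errorf("fetching derived block for L1 %d: %w", n, err)
		}
		if err := db.AddDerived(l1, derived); err != nil {
			return fmt.Errorf("adding derived %d from L1 %d: %w", derived.Number, l1.Number, err)
		}
	}
	return nil
}

func main() {
	// Sketch only: wiring in real db and node implementations is out of scope here.
	_ = backfillDerived
	fmt.Println("backfillDerived replays one derived entry per missing L1 block")
}
```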

@axelKingsley (Contributor) commented:

> I.e. the op-node needs to resync derivation to help the op-supervisor catch up.

@protolambda this suggestion seems very destructive to the node, but since the Supervisor only has outbound connections to the Execution Engine, it seems the L1<>L2 relationship must be driven by the node. Is there a way we can re-drive this data without resetting the node's derivation?

@axelKingsley (Contributor) commented:

OK, to answer my own question from yesterday ("this seems destructive, is there no other way?"):

The "other way" would be a Safety Index of some kind. Here is a document describing the need for a Safety Index in the op-node. However, I think the Safety Index would be much better located on the Supervisor, if the Supervisor were made the source of L1 data for nodes. That's an idea I have in draft here.

So, in the short term, resetting the derivation pipeline back to the last safe head known to the Supervisor should work. Nodes would then play forward and report the correct data.
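
A minimal sketch of that reset flow, assuming hypothetical supervisorClient and derivationPipeline interfaces rather than the real op-node internals:

```go
package main

import "fmt"

// BlockRef is a minimal stand-in for an L2 block reference.
type BlockRef struct {
	Number uint64
	Hash   string
}

// supervisorClient is an assumed client for querying the supervisor's view.
type supervisorClient interface {
	LastLocalSafe(chainID uint64) (BlockRef, error)
}

// derivationPipeline is an assumed handle on the op-node's derivation driver.
type derivationPipeline interface {
	ResetTo(safeHead BlockRef) error
}

// resyncSupervisor rewinds derivation to the supervisor's last known safe
// head so that, as the node plays forward again, the replayed local-safe
// updates fill the supervisor's gap.
func resyncSupervisor(chainID uint64, sup supervisorClient, pipeline derivationPipeline) error {
	safeHead, err := sup.LastLocalSafe(chainID)
	if err != nil {
		return fmt.Errorf("querying supervisor for last local-safe head: %w", err)
	}
	if err := pipeline.ResetTo(safeHead); err != nil {
		return fmt.Errorf("resetting derivation to block %d: %w", safeHead.Number, err)
	}
	return nil
}

func main() {
	// Sketch only: concrete supervisor and pipeline implementations are omitted.
	_ = resyncSupervisor
	fmt.Println("resyncSupervisor rewinds derivation to the supervisor's safe head and replays")
}
```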

@axelKingsley (Contributor) commented:

@protolambda (Contributor, Author) commented:

The above PR was merged, so I'm closing this. But we'll continue to track how the op-node and op-supervisor interact, so that sync works well.
